Chinese Segmentation – Part II

After writing up a post about Chinese segmentation, I found some more resources on the topic that looked interesting:

“A Review of Chinese Word Lists Accessible on the Internet”, which lists a number of useful resources.

The Datapark Search Engine, which supports Chinese, Japanese, Korean and Thai tokenization and also offers frequency dictionaries for Traditional Chinese, Mandarin, Korean, and Thai (check the bottom of the home page).

A paper titled “Word Segmentation Standard in Chinese, Japanese and Korean”.

You can also search Google for “Rocling standard segmentation corpus” to find papers that reference it.

