Chinese Segmentation – Part II
March 29, 2010
After writing up a post about Chinese segmentation, I found some more resources on the topic that looked interesting:
“A Review of Chinese Word Lists Accessible on the Internet”, which lists a number of useful resources.
The Datapark Search Engine, which supports Chinese, Japanese, Korean and Thai tokenization and provides frequency dictionaries for Traditional Chinese, Mandarin, Korean, and Thai (check the bottom of the home page).
A paper titled “Word Segmentation Standard in Chinese, Japanese and Korean”.
You can also search Google for “Rocling standard segmentation corpus” to find papers that reference it.