March 27, 2010 1 Comment
Recently I was asked by a colleague about chinese tokenization. I had done a little work on it in the past while at Feedster (we used the BasisTech libraries for tokenization, and I wrote part of the infrastructure around them,) but I had not really looked closely at the problem since, so I did a little bit of searching on the subject.
I knew that chinese tokenization is usually dictionary driven, the key is in having a good dictionary as well as some smart algorithms around it.
The First International Chinese Word Segmentation Bakeoff and the Second International Chinese Word Segmentation Bakeoff provide good sets of training and test data, I preferred the latter because it is in UTF-8 which makes life easier when it comes to coding (I had some issues with the former and CP950 coding.) Also provided is a simple tool to check precision and recall of the segmentation algorithm.
The simplest approach is to use a maximum match algorithm for which Perl code is provided on this post on the Xapian-discuss mailing list. This yields pretty good results. I think you can go a little further by looking at the frequencies of individual segments too, so _C1C2_ is more likely to be the correct segment than _C1C2C3_ because the former is much more common in text than the latter.
You can add additional rules which are explained in “MMSEG: A Word Identification System for Mandarin Chinese Text Based on Two Variants of the Maximum Matching Algorithm” and in “CScanner – A Chinese Lexical Scanner”, along with demonstration code.
LingPipe has an interesting approach where they look at chinese segmentation as a spelling correction problem. Be sure to check the references at the end, there is link to a good paper called “A compression-based algorithm for Chinese word segmentation” by Teahan, William J., Yingying Wen, Rodger McNab and Ian H. Witten.
Finally “Towards Constructing a Chinese Information Extraction System to Support Innovations in Library Services” mentions ICTCLAS which is available as a Google Code download.
There is more to it though, named entity recognition is not handled, and one needs to handle mixed in ascii too, all of which I dealt with in the original parser I wrote. For bonus points, things can be speeded up considerably if the segmentation can be handled in a non-destructive manner (ie without moving memory around.)