Chinese Segmentation – Part II

After writing up a post about Chinese Segmentation, I found some more resources on the subject which looked interesting:

“A Review of Chinese Word Lists Accessible on the Internet”, which lists a number of useful resources.

The Datapark Search Engine, which has support for Chinese, Japanese, Korean and Thai tokenization, as well as frequency dictionaries for Traditional Chinese, Mandarin, Korean, and Thai (check the bottom of the home page).

A paper titled “Word Segmentation Standard in Chinese, Japanese and Korean”.

You can also search for “Rocling standard segmentation corpus” on Google for papers referencing this.


Migrating from iBatis 2.x to 3.0

I recently migrated a project from iBatis 2.x to 3.0 and thought I would share some notes on the differences between the two:

  • You will need to download asm-3.2 from
  • The iBatis XML configuration file format has changed a little. The handling of different environments is much improved, allowing you to easily switch data sources in the XML file.
  • The mapper XML file format has changed a lot: the naming has been cleaned up, and the dynamic SQL generation is a lot cleaner. One gotcha is that you need to set the jdbcType of columns which allow NULL values (this was not the case before, but it is not a big deal).
  • SqlMapClient has been done away with and replaced by SqlSession, which has a much cleaner API and a much better life-cycle. Note that you can set a SqlSession to autocommit or you can call .commit(), and it is always a great idea to .close() a SqlSession.
  • One thing that did bite me is that a SqlSession caches results, so a second SELECT will return the same results even though the data was changed by another process (not sure if this applies to INSERT and UPDATE though). You should use .clearCache() if you want the results to come from the database as opposed to the cache.

A few other notes:

  • With iBatis 2.0 I would create a bunch of SqlMapClient objects and cache them in a hash so that I would not need to create new ones for each session. This helped performance a lot but was not good practice. You cannot do that with SqlSession, so what I did was to create and cache my SqlSessionFactory objects and create a new SqlSession from them when I needed one. The reason I needed to do this is that the project in question has about 45 databases. SqlSessionFactory objects can be shared across threads.
  • I use C3P0 for my data sources and a little work was needed to make it work properly under iBatis 3.0. What I did was to create a new combo pooled data source for each driver/url/user name combination and cache those in a hash, returning them as needed by iBatis.
  • iBatis 3.0 is much, much faster than 2.0 when it comes to SELECTs, which is a very welcome change. INSERTs/UPDATEs/DELETEs are a little slower, but not enough to have any impact.
  • Finally, the beans used by iBatis are “enhanced by CGLIB”, so ‘o.getClass()’ will return something like ‘O$$EnhancerByCGLIB$$2632338b’, which makes a clause like ‘getClass() != o.getClass()’ in the .equals() method quite useless. You need to use ‘!(o instanceof O)’ instead, otherwise objects which are the same will not be reported as such. Of course this assumes that you have an .equals() method in your beans.

You will also want to read the iBatis 3.0 documentation.

Chinese Segmentation

Recently I was asked by a colleague about Chinese tokenization. I had done a little work on it in the past while at Feedster (we used the BasisTech libraries for tokenization, and I wrote part of the infrastructure around them), but I had not really looked closely at the problem since, so I did a little bit of searching on the subject.

I knew that Chinese tokenization is usually dictionary driven. The key is in having a good dictionary as well as some smart algorithms around it.

The First International Chinese Word Segmentation Bakeoff and the Second International Chinese Word Segmentation Bakeoff provide good sets of training and test data. I preferred the latter because it is in UTF-8, which makes life easier when it comes to coding (I had some issues with the former and its CP950 encoding). Also provided is a simple tool to check the precision and recall of a segmentation algorithm.
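The bakeoff's own scoring tool is not reproduced here. As a rough sketch of what precision and recall mean for segmentation, one can compare the character spans of the predicted words against those of a reference segmentation. The function names below are hypothetical, and the scoring convention (a word counts as correct only if both of its boundaries match) is an assumption rather than a description of the bakeoff script.

```python
def to_spans(segments):
    """Convert a list of words into a set of (start, end) character spans."""
    spans = set()
    start = 0
    for seg in segments:
        spans.add((start, start + len(seg)))
        start += len(seg)
    return spans


def precision_recall(gold, predicted):
    """Word-level precision and recall of `predicted` against `gold`.

    Both arguments are lists of words covering the same text. A predicted
    word is correct only if its start and end offsets both match a gold word.
    """
    gold_spans = to_spans(gold)
    pred_spans = to_spans(predicted)
    correct = len(gold_spans & pred_spans)
    precision = correct / len(pred_spans) if pred_spans else 0.0
    recall = correct / len(gold_spans) if gold_spans else 0.0
    return precision, recall


gold = ["研究", "生命", "起源"]
predicted = ["研究生", "命", "起源"]
# Only "起源" has matching boundaries, so one of three words is correct.
print(precision_recall(gold, predicted))
```

Comparing spans rather than the words themselves avoids counting a word as correct when the same string appears at a different position in the text.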

The simplest approach is to use a maximum match algorithm, for which Perl code is provided in this post on the Xapian-discuss mailing list. This yields pretty good results. I think you can go a little further by looking at the frequencies of individual segments too, so _C1C2_ is more likely to be the correct segment than _C1C2C3_ when the former is much more common in text than the latter.
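To make the idea concrete, here is a minimal sketch of forward maximum matching in Python. It is not the Perl code from the mailing list post; the function name and the tiny dictionary are made up for illustration, and a real segmenter would load a large word list.

```python
def max_match(text, dictionary, max_word_len=None):
    """Greedy forward maximum matching.

    At each position, take the longest dictionary word that starts there;
    fall back to a single character when nothing in the dictionary matches.
    """
    if max_word_len is None:
        longest = max((len(w) for w in dictionary), default=1)
        max_word_len = max(1, longest)
    segments = []
    i = 0
    while i < len(text):
        # Try the longest candidate first, then shrink one character at a time.
        for length in range(min(max_word_len, len(text) - i), 0, -1):
            candidate = text[i:i + length]
            if length == 1 or candidate in dictionary:
                segments.append(candidate)
                i += length
                break
    return segments


# Toy dictionary, for illustration only.
dictionary = {"研究", "研究生", "生命", "起源"}
print(max_match("研究生命起源", dictionary))  # ['研究生', '命', '起源']
```

The example also shows the weakness of the greedy approach: it picks 研究生 and strands 命, whereas the intended reading is 研究 / 生命 / 起源. This is the kind of case where segment frequencies or additional rules can help.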

You can add additional rules which are explained in “MMSEG: A Word Identification System for Mandarin Chinese Text Based on Two Variants of the Maximum Matching Algorithm” and in “CScanner – A Chinese Lexical Scanner”, along with demonstration code.

LingPipe has an interesting approach where they look at Chinese segmentation as a spelling correction problem. Be sure to check the references at the end; there is a link to a good paper called “A compression-based algorithm for Chinese word segmentation” by Teahan, William J., Yingying Wen, Rodger McNab and Ian H. Witten.

Finally, “Towards Constructing a Chinese Information Extraction System to Support Innovations in Library Services” mentions ICTCLAS, which is available as a Google Code download.

There is more to it though: named entity recognition is not handled, and one needs to handle mixed-in ASCII too, all of which I dealt with in the original parser I wrote. For bonus points, things can be sped up considerably if the segmentation can be handled in a non-destructive manner (i.e. without moving memory around).

Goodbye Bluefin Tuna

I was deeply disappointed to see that the UN rejected a ban on the trade of bluefin tuna on March 18th (see “Bluefin tuna: Eaten Away” in the Economist and “U.N. Rejects Export Ban on Atlantic Bluefin Tuna” in the NY Times). The articles highlight the voting shenanigans that were used by the Libyan delegation to bring about this rejection.

Let’s face it, with idiots like that in charge of conservation, bluefin tuna is as good as extinct. From the Economist:

The outlook for the bluefin tuna is not good. Scientists already agree that the population is crashing, and that quotas allocated to fishermen remain too generous to give any reasonable degree of certainty of a recovery. The extent to which illegal fishing can be brought under control will also have a big impact on whether the population has a chance of recovering.

What I can’t understand is why the countries that rely most on fishing for food and trade are the least likely to implement sound fishery management practices. I have to wonder if these people have given any thought to how they will make a living once the fish are all gone.

Yahoo Papers on Caching

Two very good papers from Yahoo on caching:

A Refreshing Perspective of Search Engine Caching

Commercial Web search engines have to process user queries over huge Web indexes under tight latency constraints. In practice, to achieve low latency, large result caches are employed and a portion of the query traffic is served using previously computed results. Moreover, search engines need to update their indexes frequently to incorporate changes to the Web. After every index update, however, the content of cache entries may become stale, thus decreasing the freshness of served results. In this work, we first argue that the real problem in today’s caching for large-scale search engines is not eviction policies, but the ability to cope with changes to the index, i.e., cache freshness. We then introduce a novel algorithm that uses a time-to-live value to set cache entries to expire and selectively refreshes cached results by issuing refresh queries to back-end search clusters. The algorithm prioritizes the entries to refresh according to a heuristic that combines the frequency of access with the age of an entry in the cache. In addition, for setting the rate at which refresh queries are issued, we present a mechanism that takes into account idle cycles of back-end servers. Evaluation using a real workload shows that our algorithm can achieve hit rate improvements as well as reduction in average hit ages. An implementation of this algorithm is currently in production use at Yahoo!.
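As a very rough illustration of the idea in this abstract (and not the algorithm from the paper), the sketch below keeps a time-to-live per entry and ranks entries for refresh by a score that combines hit count and age. The class name, the score (hits multiplied by age) and the explicit `now` argument are all assumptions made for the example; the abstract does not say how the paper combines frequency and age.

```python
class RefreshingCache:
    """Toy result cache with TTL expiry and refresh prioritization."""

    def __init__(self, ttl):
        self.ttl = ttl
        self.entries = {}  # query -> [results, stored_at, hits]

    def put(self, query, results, now):
        """Store (or refresh) the results for a query at time `now`."""
        self.entries[query] = [results, now, 0]

    def get(self, query, now):
        """Return cached results, or None on a miss or an expired entry."""
        entry = self.entries.get(query)
        if entry is None:
            return None
        if now - entry[1] >= self.ttl:
            del self.entries[query]  # too old to serve
            return None
        entry[2] += 1
        return entry[0]

    def refresh_candidates(self, now, budget):
        """Pick up to `budget` queries to re-run, highest score first.

        Assumed heuristic: score = hits * age, so popular and stale
        entries are refreshed before rarely used or fresh ones.
        """
        ranked = sorted(
            self.entries.items(),
            key=lambda item: item[1][2] * (now - item[1][1]),
            reverse=True,
        )
        return [query for query, _ in ranked[:budget]]
```

In a real system the `budget` would presumably depend on how idle the back-end servers are, as the abstract describes; here it is simply a number passed in by the caller.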

Caching Search Engine Results over Incremental Indices (Poster)

A Web search engine must update its index periodically to incorporate changes to the Web, and we argue in this work that index updates fundamentally impact the design of search engine result caches. Index updates lead to the problem of cache invalidation: invalidating cached entries of queries whose results have changed. To enable efficient invalidation of cached results, we propose a framework for developing invalidation predictors and some concrete predictors. Evaluation using Wikipedia documents and a query log from Yahoo! shows that selective invalidation of cached search results can lower the number of query re-evaluations by as much as 30% compared to a baseline time-to-live scheme, while returning results of similar freshness.

Need to Focus the Meaning of “NoSQL”

Great post by Adam Ferrari titled Let’s not let “NoSQL” go the way of “Web 2.0” on the need to focus the definition of “NoSQL”, lest it go the way of the term “Web 2.0”, which has become pretty much meaningless because it came to mean everything:

As part of a team focused on enterprise-oriented information access problems, which are a different beast from wide area data stores, I don’t apply the “NoSQL” label to what we’re doing. At our core, we’re targeting different problem spaces. And I have a huge amount of respect for what the NoSQL movement is doing. For example, the work being done on consistency models like the Vogels paper I mentioned above is big league computer science that is making large contributions to the ways that technology can play bigger and more helpful roles in our lives. I’d just hate to see the “NoSQL” label go the way of “Web 2.0,” a moniker that rapidly came to mean everything and so nothing at all.

iTunes LP Format – Bitter “Cocktail”?

I could not help having fun with the headline; it is late in the day and some steam needed venting. It turns out that the iTunes LP Format has not been selling too well:

Six months later, however, iTunes LP doesn’t prompt much consumer recognition, and none of the industry sources with whom I spoke said they viewed it as being anywhere close to game-changing from a format perspective. Rather, it’s considered more of a curiosity. Like an enhanced CD or a DVD packaged with a physical album, iTunes LP’s bonus materials may interest super-fans, but they aren’t generating much buzz among mainstream consumers, and don’t appear to be stimulating LP sales at all. “It’s something most people will look at once,” is how one person put it.

I have to admit that I had a bit of a ‘what the…’ moment when I first read about the iTunes LP Format. I used to buy LPs (dating myself) and would check out the sleeves while I was listening to the music. After that I bought CDs and found that I paid very little attention to the sleeves. That they were smaller, contained less information and were just not that interesting anymore had something to do with it, I am sure; the same goes for the sleeves in cassettes. Now that the music lives on iTunes or my iPod, I don’t even care whether there is a ‘sleeve’, virtual or real.

Let’s face it, the way we listen to music has changed radically. I could not listen to LPs on the go, so there was time to browse a sleeve. Once music became mobile (CD, cassette, MP3 player), the goal was to make the device as light as possible, and sleeves were history.