Language recognition

Recently I needed a language recognition library to identify the language of specific chunks of text. I asked a network of colleagues here in the Boston area and they came up with the following:

  1. LingPipe
  2. TextCat
  3. Simile

There is also:

  1. Lingua::Identify
  2. The language identifier in Nutch

And all this led to:

  1. TCatNG Toolkit
  2. TextCat derivatives and current home in SpamAssassin

In the event, it seemed simple enough to write my own, using the text collection in TextCat as source material for the n-grams and their associated frequencies.
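For readers who want to see the general idea, here is a minimal sketch of rank-based n-gram identification, the approach TextCat is known for. It is not the code I wrote: the function names (`ngram_profile`, `out_of_place`, `identify`), the `top_k` cutoff of 300, and the two toy training sentences are all assumptions made for illustration. A real identifier would build its profiles from a proper text collection for each language.

```python
from collections import Counter

def ngram_profile(text, max_n=3, top_k=300):
    """Map each of the top_k most frequent character n-grams to its rank."""
    counts = Counter()
    for word in text.lower().split():
        padded = f"_{word}_"  # mark word boundaries
        for n in range(1, max_n + 1):
            for i in range(len(padded) - n + 1):
                counts[padded[i:i + n]] += 1
    ranked = [gram for gram, _ in counts.most_common(top_k)]
    return {gram: rank for rank, gram in enumerate(ranked)}

def out_of_place(doc_profile, lang_profile, max_penalty):
    """Sum of rank differences; n-grams missing from the language profile
    cost max_penalty each."""
    total = 0
    for gram, rank in doc_profile.items():
        if gram in lang_profile:
            total += abs(rank - lang_profile[gram])
        else:
            total += max_penalty
    return total

def identify(text, lang_profiles, top_k=300):
    """Return the language whose profile is closest to the text's profile."""
    doc = ngram_profile(text, top_k=top_k)
    return min(lang_profiles,
               key=lambda lang: out_of_place(doc, lang_profiles[lang], top_k))

# Toy training samples (hypothetical stand-ins for a real corpus).
samples = {
    "english": "the quick brown fox jumps over the lazy dog and the cat "
               "sat on the mat with the other animals",
    "french": "le renard brun saute par dessus le chien paresseux et le "
              "chat est sur le tapis avec les autres animaux",
}
profiles = {lang: ngram_profile(text) for lang, text in samples.items()}

print(identify("the dog and the cat", profiles))
print(identify("le chat et le chien", profiles))
```

Ranks are compared rather than raw frequencies, so the training text and the chunk being identified do not need to be of similar length.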

As for a corpus, I stumbled onto this project: “Corpus building for minority languages”, which led to a status page, which in turn led to the Declaration of Human Rights in 335 languages.
