Data is Good, More Data is Better, and Lots More Data is Best
December 7, 2008
By way of Geeking with Greg, which discusses a presentation that Peter Norvig gave.
His core point was that “code is a liability”. Relying on data over code as much as possible allows simpler code that is more flexible, adaptive, and robust.
In one of several examples, Peter put up a slide showing an excerpt from a rule-based spelling corrector. The snippet of code, just part of a much larger program, contained a set of case and if statements representing rules for English spelling correction that was nearly impossible to understand, let alone verify. He then put up a slide containing a few-line Python program for statistical spelling correction that, given a large data file of documents, learns the likelihood of seeing words and corrects misspellings to their most likely alternative. This version, he said, not only has the benefit of being simple, but also can easily be adapted to different languages.
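To make the contrast concrete, here is a minimal sketch in the spirit of that statistical corrector. The corpus string is an invented stand-in; in practice you would read word counts from a large text file, and a fuller version would also consider edit distance two and smoothed probabilities.

```python
import re
from collections import Counter

# Invented stand-in for a large corpus file; any big plain-text file works.
corpus = "the quick brown fox jumps over the lazy dog the fox"

def words(text):
    return re.findall(r"[a-z]+", text.lower())

# Word frequencies learned from data, standing in for hand-written rules.
WORDS = Counter(words(corpus))

def edits1(word):
    """All strings one edit (delete, transpose, replace, insert) away."""
    letters = "abcdefghijklmnopqrstuvwxyz"
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [L + R[1:] for L, R in splits if R]
    transposes = [L + R[1] + R[0] + R[2:] for L, R in splits if len(R) > 1]
    replaces = [L + c + R[1:] for L, R in splits if R for c in letters]
    inserts = [L + c + R for L, R in splits for c in letters]
    return set(deletes + transposes + replaces + inserts)

def correct(word):
    """Return the most frequent known word within one edit, else the word itself."""
    candidates = ({word} & WORDS.keys()) or (edits1(word) & WORDS.keys()) or {word}
    return max(candidates, key=WORDS.get)
```

Retraining it for another language is just a matter of swapping the corpus, which is exactly the flexibility Norvig was pointing at.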
For another example, Peter pulled from Jing et al., “Canonical Image Selection from the Web” (ACM), which uses a clever representation of the features of an image, a huge image database, and clustering of images with similar features to find the most representative image of, for example, the Mona Lisa on a search for [mona lisa].
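A crude way to get the flavor of "most representative image": give each image a feature vector and pick the one most similar, on average, to all the others. This is only a sketch of the idea, not the paper's actual algorithm, and the vectors below are invented toy values.

```python
import math

# Toy feature vectors standing in for real image features; values invented.
images = {
    "img_a": [0.9, 0.1, 0.0],
    "img_b": [0.8, 0.2, 0.1],
    "img_c": [0.1, 0.9, 0.8],
}

def cosine(u, v):
    """Cosine similarity between two feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def canonical(images):
    """Pick the image with the highest average similarity to the others."""
    def avg_sim(name):
        sims = [cosine(images[name], v) for k, v in images.items() if k != name]
        return sum(sims) / len(sims)
    return max(images, key=avg_sim)
```

With a large enough database, the image sitting in the densest cluster of near-duplicates wins, which is why sheer quantity of data does much of the work here.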
Peter went on to say that more data seems to help on many problems more than complicated algorithms do. More data can hit diminishing returns at some point, but that point seems to be fairly far out for many problems, so keeping the algorithm simple while processing as much data as possible often seems to work best. Google’s work in statistical machine translation works this way, he said, primarily using the correlations discovered between the words in different languages in a training set of 18B documents.
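The word-correlation idea can be sketched with a toy parallel corpus: count how often each source word co-occurs with each target word across sentence pairs, and let the strongest co-occurrence stand in for a translation. The sentence pairs here are invented, and real systems use far more sophisticated alignment models over billions of documents.

```python
from collections import Counter

# Invented toy parallel corpus of (English, French) sentence pairs.
pairs = [
    ("the house", "la maison"),
    ("the car", "la voiture"),
    ("a house", "une maison"),
]

# Count co-occurrences of every (source word, target word) pair.
cooc = Counter()
for en, fr in pairs:
    for e in en.split():
        for f in fr.split():
            cooc[(e, f)] += 1

def translate(word):
    """Guess a translation as the most frequently co-occurring target word."""
    candidates = {f: n for (e, f), n in cooc.items() if e == word}
    return max(candidates, key=candidates.get) if candidates else None
```

Even this crude counter pairs "house" with "maison" once it has seen the word in two different contexts, which is the basic reason piling on data keeps helping.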
I think Peter Norvig made the same points in another presentation I listened to a couple of years ago, so this is not new, but it is very interesting nonetheless because it suggests that training simple algorithms with LOTS of data is better than designing complex algorithms. I would also be curious to see how the simple algorithms plot against the quantity of data, basically the curve, and what quantities of data would be needed to approach 100%.
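One could sketch that curve empirically: fix a "true" distribution, learn from samples of increasing size, and measure how often the learned answer matches the truth. This is a made-up toy experiment, not data from the talk, but it shows the shape of the measurement.

```python
import random

random.seed(42)

# Invented "true" word distribution; the learner only sees samples of it.
true_counts = {"the": 60, "of": 25, "and": 15}
population = [w for w, n in true_counts.items() for _ in range(n)]
true_best = max(true_counts, key=true_counts.get)

def learned_best(sample):
    """The most frequent word in a sample: the learner's guess."""
    counts = {}
    for w in sample:
        counts[w] = counts.get(w, 0) + 1
    return max(counts, key=counts.get)

# Accuracy at each sample size, averaged over many trials.
trials = 200
results = {}
for n in (1, 5, 50):
    hits = sum(
        learned_best([random.choice(population) for _ in range(n)]) == true_best
        for _ in range(trials)
    )
    results[n] = hits / trials
    print(f"sample size {n}: accuracy {results[n]:.2f}")
```

The curve climbs quickly and then flattens, which matches the diminishing-returns point above; how far out the flattening happens is exactly the open question for real problems.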