Large-scale Incremental Processing Using Distributed Transactions and Notifications

This is definitely worth reading:

Large-scale Incremental Processing Using Distributed Transactions and Notifications: “Updating an index of the web as documents are crawled requires continuously transforming a large repository of existing documents as new documents arrive. This task is one example of a class of data processing tasks that transform a large repository of data via small, independent mutations. These tasks lie in a gap between the capabilities of existing infrastructure. Databases do not meet the storage or throughput requirements of these tasks: Google’s indexing system stores tens of petabytes of data and processes billions of updates per day on thousands of machines. MapReduce and other batch-processing systems cannot process small updates individually as they rely on creating large batches for efficiency.

We have built Percolator, a system for incrementally processing updates to a large data set, and deployed it to create the Google web search index. By replacing a batch-based indexing system with an indexing system based on incremental processing using Percolator, we process the same number of documents per day, while reducing the average age of documents in Google search results by 50%. Links: [abstract] [pdf] [search]

(Via Recent Google Publications (Atom).)

I built an incremental indexer while at Feedster, albeit on a much smaller scale; we had a 10-minute turnaround time for newly crawled content, which wasn’t too shabby I think.


Open Source Search Conference

Just came across this on Steve Arnold’s weblog: Lucid Imagination is sponsoring an open source search conference in Boston, MA on October 7-8, 2010 at the Hyatt Harborside:

The first-ever conference focused on addressing the business and development aspects of open source search will take place October 7-8, 2010 at the Hyatt Harborside in Boston.

Dubbed Lucene Revolution after its sponsor, Lucid Imagination, the commercial company dedicated to Apache Lucene technology, this inaugural event promises a full, forward-thinking agenda, creating opportunities for developers, technologists and business leaders to explore the benefits that open source enterprise search makes possible.

In addition to in-depth training provided by Lucid Imagination professionals, there will be two days of content rich talks and presentations by Lucene and Solr open source experts. Working on the program will be Stephen E. Arnold, author and consultant.

Those interested in learning more about the conference and in submitting a proposal for a talk can navigate to the conference web site. The deadline for submissions is June 23, 2010. Individuals are encouraged to submit proposals for papers and talks that focus on categories including enterprise case studies, cloud-based deployment of Lucene/Solr, large-scale search, and data integration.

The Lucene Revolution conference comes just after the success of the sold-out Apache Lucene EuroCon 2010 in Prague, also sponsored by Lucid Imagination and the single largest gathering of open source search developers to date.

Need to Focus the Meaning of “NoSQL”

Great post by Adam Ferrari titled Let’s not let “NoSQL” go the way of “Web 2.0” on the need to focus the definition of “NoSQL” lest it turn into the term “Web 2.0”, which has become pretty much meaningless by coming to mean everything:

As part of a team focused on enterprise-oriented information access problems, which are a different beast from wide area data stores, I don’t apply the “NoSQL” label to what we’re doing. At our core, we’re targeting different problem spaces. And I have a huge amount of respect for what the NoSQL movement is doing. For example, the work being done on consistency models like the Vogels paper I mentioned above is big league computer science that is making large contributions to the ways that technology can play bigger and more helpful roles in our lives. I’d just hate to see the “NoSQL” label go the way of “Web 2.0,” a moniker that rapidly came to mean everything and so nothing at all.

MapReduce Book Draft

This is well worth taking a look at if you are interested in MapReduce/Hadoop and IR:

An updated draft of the upcoming book, Data-Intensive Text Processing with MapReduce by Jimmy Lin and Chris Dyer, is available.

The book isn’t finished, but it already has interesting material. It emphasizes algorithms for processing text with MapReduce: co-occurrence analysis, inverted index construction, and the EM algorithm applied to estimating parameters in HMMs.

You can also see Jimmy’s cloud computing course (spring 2010) and the Ivory search engine.

Cloud Platform Choices

Good article over on Ars Technica about Cloud Platform Choices; well worth reading if for no other reason than to keep up with what is going on in that space:

Cloud computing is one of the most hyped technology concepts in recent memory, and, like many buzzwords, the term “cloud” is overloaded and overused. A while back Ars ran an article attempting to clear some of the confusion by reviewing the cloud’s hardware underpinnings and giving it a proper definition, and in this article I’ll flesh out that picture on the software side by offering a brief tour of the cloud platform options available to development teams today. I’ll also discuss these options’ key strengths and weaknesses, and I’ll conclude with some thoughts about the kinds of advances we can expect in the near term. In all, though, it’s important to keep in mind that what’s presented here is just a snapshot. The cloud is evolving very rapidly—critical features that seem to be missing today may be standard a year from now.

Why The Name NoSQL Is Meaningless (To Me)

The ‘NoSQL’ movement has gotten quite popular lately, and with good reason: it is breaking new ground in distributed, scalable storage.

But the name ‘NoSQL’ really bugs me, because SQL is just a query language, not a storage technology. This is well illustrated in “InnoDB is a NoSQL database”, which I will quote below:

As long as the whole world is chasing this meaningless “NoSQL” buzzword, we should recognize that InnoDB is usable as an embedded database without an SQL interface. Hence, it is as much of a NoSQL database as anything else labeled with that term. And I might add, it is fast, reliable, and extremely well-tested in the real world. How many NoSQL databases have protection against partial page writes, for example?

It so happens that you can slap an SQL front-end on it, if you want: MySQL.

Another thing: it is probably better, and much more constructive, to say what you are for rather than what you are against. Time to get a new name/acronym, I think.

Updated December 18th, 2009 – I am seeing that NoSQL is being renamed to mean Not Only SQL, which I think is much better.

exit() Rather Than free()

I have to admit that I had a bit of a reaction to this post; apologies for quoting more than half of it here, but here goes:

See, developers are perfectionists, and their perfectionism also includes the crazy idea that all memory has to be deallocated at server shutdown, as otherwise Valgrind and other tools will complain that someone leaked memory. Developers will write expensive code in shutdown routines that will traverse every memory structure and deallocate/free() it.

Now, guess what would happen if they wouldn’t write all this expensive memory deallocation code.

Still guessing?

OS would do it for them, much much much faster, without blocking the shutdown for minutes or using excessive amounts of CPU. \o/
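The pattern the quoted post advocates can be sketched in C. This is a hypothetical sketch (the names init_server, full_teardown and shutdown_server are made up): pay for the exhaustive free() walk only in a leak-checking build, and otherwise just exit and let the OS reclaim the address space.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical server state; a real server would hold a large graph
   of caches, buffers, and connection objects. */
char *cache = NULL;

void init_server(void) {
    cache = malloc(1 << 20);
}

/* Walk every structure and free() it: the expensive shutdown path
   the post complains about. */
void full_teardown(void) {
    free(cache);
    cache = NULL;
}

/* The post's suggestion: only run full_teardown() when a leak checker
   such as Valgrind needs a clean heap at exit; otherwise skip it,
   since the OS reclaims the whole address space on process exit far
   faster than per-allocation free() calls. */
void shutdown_server(int leak_check_build) {
    if (leak_check_build)
        full_teardown();
    /* in the fast path a real server would call exit(EXIT_SUCCESS) here */
}
```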

I am really uncomfortable with the approach of using exit() for memory cleanup, for the obvious reason that it is usually much, much cheaper to keep a process running than to shut it down and restart it on a regular basis. The other reason is that relying on exit() for memory cleanup is just poor hygiene.

Reminds me of the days of SunOS where common wisdom said that restarting a server once a week was a good idea to keep the memory leaks in check.


I have been reading Steve Arnold’s weblog on search. I have known Steve for over ten years now, and he likes to challenge the status quo and push people to see beyond it.

So it was very interesting to read his post about how Google is challenging Reed Elsevier and Thomson by indexing legal texts:

Google has added the full text of US federal cases and state cases. The coverage of the federal cases, district and appellate, is from 1924 to the present. US state cases cover 1950 to the present. Additional content will be added; for example, I have one source that suggested that the Commonwealth of Virginia Supreme Court will provide Google with CD ROMs of cases back to 1924. Google, according to this source, is talking with other sources of US legal information and may provide access to additional legal information as well.

His thesis is that the incumbents are like Vercingetorix stuck in Alesia (1) and that Google is like Caesar, who built two sets of walls around Alesia, one to keep the Gauls in and the other to keep any relieving force out.

I like the analogy, though it is not quite apt: Google is not exactly laying siege, and they don’t have to defend themselves. On the other hand, the incumbents probably feel very much like the Gauls stuck in Alesia.

I was catching up with a long-time friend earlier this week (a much smarter person than me). We were talking about lots of things, and one of those things was how particular species will move fluidly from one niche to another as they evolve. My feeling was that sometimes this happens in a fluid fashion without much struggle, but sometimes it can be quite violent, resulting in the decimation of one or the other species (2). I wonder if this is closer to what is happening here: Google is moving in on an established market, though not in an explicitly deliberate fashion, and causing discomfort to the incumbents.

Now that a good portion of the data these incumbents charge for is available for free (it always was available for free, but access was difficult), it will likely force them to change their business model if they are to stay relevant. Steve makes that point very explicitly at the end of his post, for example:

Finally, what will be vulnerable to Google disruption will be difficult to use, expensive, and incomplete services. Maybe Reed Elsevier, Thomson Reuters, and Wolters Kluwer should merge. That will give the present crop of senior managers time to cash out. I don’t see an easy, quick, inexpensive, or painless way to prevent the lessons of Alesia being writ large in tomorrow’s digital headlines.

1 – I learned all about the siege of Alesia when I was at school.

2 – For example, the introduction of Lionfish in the Caribbean is resulting in major population reduction in some indigenous species on the reefs.

InnoDB Compression

I had a few hours to spare a couple of days ago and decided to check InnoDB Plugin 1.0’s support for data compression.

In a project I work on from time to time, there is a table which contains three BLOB columns holding text data. To store the data efficiently I was using the COMPRESS() function in MySQL and doing a “CONVERT(UNCOMPRESS(text) AS utf8)” to uncompress the data and present it as utf8. No problems there, but with the recent move to the InnoDB Plugin 1.0 in MySQL 5.1 there was an opportunity to push that down the stack.

I ran a few benchmarks and it turned out that using 8K pages was the optimal trade-off between space and time. Using 16K pages did not compress the data very well, and using pages smaller than 8K increased the time needed to store the data. I should note that 8K is also the default.

There are some interesting wrinkles in all this: innodb_file_per_table needs to be enabled, and I think innodb_file_format needs to be set to ‘Barracuda’, though I am not sure about that.
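For reference, the setup described above can be sketched as follows. The table and column names are hypothetical, but the settings match what the InnoDB Plugin requires (innodb_file_format does indeed need to be Barracuda for compressed tables):

```sql
-- Settings required for compressed tables (can also go in my.cnf)
SET GLOBAL innodb_file_per_table = ON;
SET GLOBAL innodb_file_format = Barracuda;

-- Hypothetical table holding the three text blobs; with
-- ROW_FORMAT=COMPRESSED, InnoDB compresses pages itself, so the
-- COMPRESS()/UNCOMPRESS() calls can be dropped from the application.
CREATE TABLE documents (
    id      BIGINT UNSIGNED NOT NULL PRIMARY KEY,
    title   BLOB,
    summary BLOB,
    body    BLOB
) ENGINE=InnoDB
  ROW_FORMAT=COMPRESSED
  KEY_BLOCK_SIZE=8;  -- 8K pages, the space/time sweet spot found above
```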

Number Encoding II

To conclude my little foray into number encoding (see the presentation by Jeffrey Dean from Google titled “Challenges in Building Large-Scale Information Retrieval Systems” (video, slides)), here are a few conclusions:

  • In terms of raw performance the “Varint Encoding” is much faster than “Byte-Aligned Variable-length Encodings” and I was able to get better numbers than Google got, most likely because I am using a different machine. It would be interesting to know what kind of machine/OS they used for their timings so I could do a direct comparison. My lookup array structure is different (and more compact) than Google’s, assuming I understood Google’s lookup array structure in the presentation.
  • The “Byte-Aligned Variable-length Encoding” is faster if you are storing three numbers per posting, namely a document ID, a term position and a field ID. The “Group Varint Encoding” is faster if you are storing four numbers per posting, namely a document ID, a term position, a field ID and a weight.
  • As I described in the last comment in the original post, two bits are used in the header for each varint to indicate its size in bytes, so 0, 1, 2 or 3 indicates whether your varint is 1, 2, 3 or 4 bytes long respectively. However if you store deltas, a lot of the numbers you store will be 0, and using a whole byte to store 0 seems wasteful to me. So I changed this so that the two bits indicate the actual number of bytes in the varint, and 0 bytes means 0. This way I don’t actually allocate space unless there is a value other than 0 to store. This saves about 10% in my overall index size, and a lot more if you only take the term postings into account, because I store some amount of document metadata in my index. Of course this means that you can’t store a number greater than 16,777,215 (2^24 − 1), which won’t happen unless you are creating huge indices with more documents than that or have documents longer than that many terms.
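The zero-byte variant described in the last bullet can be sketched in C. This is a hypothetical reconstruction rather than the actual implementation: each 2-bit tag in the header byte holds the real byte count (0 to 3), so a value of 0 occupies no payload bytes and the largest storable value is 2^24 − 1.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Number of payload bytes for a value; caller guarantees v < 2^24. */
static size_t byte_len(uint32_t v) {
    if (v == 0) return 0;
    if (v < (1u << 8))  return 1;
    if (v < (1u << 16)) return 2;
    return 3;
}

/* Encode a group of four values: one tag byte (2 bits per value giving
   the byte count) followed by the little-endian payload bytes.
   Returns the number of bytes written. */
size_t group_varint_encode(const uint32_t vals[4], uint8_t *out) {
    uint8_t *p = out + 1;  /* reserve the tag byte */
    uint8_t tag = 0;
    for (int i = 0; i < 4; i++) {
        size_t n = byte_len(vals[i]);
        tag |= (uint8_t)(n << (2 * i));
        for (size_t b = 0; b < n; b++)
            *p++ = (uint8_t)(vals[i] >> (8 * b));
    }
    out[0] = tag;
    return (size_t)(p - out);
}

/* Decode a group of four values; a tag of 0 yields the value 0 without
   consuming any payload bytes. Returns the number of bytes consumed. */
size_t group_varint_decode(const uint8_t *in, uint32_t vals[4]) {
    const uint8_t *p = in + 1;
    for (int i = 0; i < 4; i++) {
        size_t n = (size_t)((in[0] >> (2 * i)) & 3u);
        uint32_t v = 0;
        for (size_t b = 0; b < n; b++)
            v |= (uint32_t)*p++ << (8 * b);
        vals[i] = v;
    }
    return (size_t)(p - in);
}
```

A group of delta-encoded postings like {0, 5, 300, 70000} takes 1 tag byte plus 0 + 1 + 2 + 3 payload bytes, so the zero delta costs nothing beyond its 2-bit tag.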

Basically it comes down to trade-offs, index compactness vs. decode speed, and it pays to look at speed both in test code (usually a contrived example) and on a real data set. I used the Wikipedia data for that, along with 200 relatively complex searches designed to read lots of postings lists.

