Scaling MySpace

David Carr wrote an interesting article about scaling MySpace.

The article is well worth reading and mirrors some of the experiences we had while building and scaling Feedster.


Further Thoughts on Search & Advertising

Robert Cringely follows up his previous post about Google building data centers in South Carolina. I commented on the latter here.

I have not fully digested the article yet, but this interesting tidbit jumped out at me:

How will these three companies compete with the Google proxy strategy? As far as I see, they can’t compete. They come up short in too many ways, but the biggest way is money, moolah, cash, loot. None of these companies can afford to do what Google is doing, building hundreds of huge data centers around the globe.

This dovetails into the “Winner-Take-All: Google and the Third Age of Computing” and a follow-up on Skrentablog.

It strikes me that these articles make the implicit assumption that this is a zero-sum game, that there is only a finite number of searches to be run and only a finite number of ads to be served.

Both of which are just plain wrong.

Searches are generated by users, and the number of users on the internet is growing yearly. More users mean more searches.

Advertising on the internet is still only a small fraction of the total amount that is spent on advertising (about 6% right now) and that is set to keep on increasing over time. This article on MarketingVOX has some numbers as well as forecasts.

All this tells me that there is a lot of room for growth, and competition from established players as well as startups.

Also we can’t forget that Google came from behind in a field with some very strongly established players, Altavista and Lycos spring to mind. And this will happen again, you can count on that. Google has a lead now, but they can’t afford to stay still.

Challenges on Distributed Web Retrieval

I came across a great paper called “Challenges on Distributed Web Retrieval” by way of Greg Linden.

I read the paper twice and I would strongly recommend it to anyone who has an interest in this area. The paper does a very good job of summarizing the challenges that come with creating a search engine, partitioning up the problem into three sections: crawling, indexing and searching. The paper also contains an extensive bibliography.

I did have a few comments about the paper though.

Concerning load peaks:

…or peaks in the query workload.

In my experience there are no peaks in query workload, the workload is pretty much even all day, there are peaks for specific searches though when some event happens.

There is a good discussion about whether to partition the index along documents or terms. My experience is that documents based partitioning is much easier to deal with when you are indexing, and when you are managing the data on your search machines. The paper makes a good case for better load balancing when term based partitioning is used, but that comes at the expense of added administration because you need to be snooping the search log in real time and adjust the replication of partitions on the fly. Messy and error prone. Document based partitioning is easy to manage and it is very easy to replicate partitions to allow for redundancy and load management. My experience is that there is enough randomness in the term distribution in partitions to keep the load balanced across partitions with a normal mix of queries. The paper also seems to assume that all queries impart the same load on each query processor which is just not the same. Andy MacFarlane, a colleague of mine at City University, has done some work in this area too.

There is also mention of blogs and news sites:

In practical distributed Web search engines, indexes are usually rebuilt from scratch after each update of the underlying document collection. It might not be the case for certain special document collections, such as news articles, and blogs, where updates are so frequent that there is usually some kind of online index maintenance strategy. This dynamic index structure constrains the capacity and the response time of the system since the update operation usually requires locking the index using a mutex thus possibly jeopardizing the whole system performance. A very interesting problem is the one of understanding whether it is possible to safely lock the index without experiencing too much loss of performance. This is even more problematic in the case of term partitioned distributed IR systems. Terms that require frequent updates might be spread across different servers, thus amplifying the lockout effect. 

This problem turns out to be quite easy to solve, which I did at Feedster. What really helps here is the fact that the data is not monolithic like it would be for a normal web search engine where you generally take a snapshot of the web, create an index and make it available for searching. Blog and news updates very often and yesterday’s news is old news and so you are generally indexing new data only. The downside is that you are indexing ad nauseam

Caching also gets a good mention in the paper. Caching is a very powerful tool for web search engines since you will see the same queries over and over again. Where things get a little more interesting is when you are dealing with data that is constantly changing, like blogs and news. If the index is fairly static, you can move your cache as close to the user as you can, putting it in the query coordinators, and sending a cache invalidation message to the query coordinators when the cache is invalidated. However if the index is fairly dynamic you need to to keep your cache in the query processors since they are the ones who first know when the index has changed, and you cant keep sending the query coordinators cache invalidation messages every time the index is updated.

Latency is mentioned as a problem when working across some sort of WAN, but it is also an issue when working on a LAN. In fact latency is the killer and you need to make sure that there is as little latency in the search pipeline as possible.

P2P also gets a passing mention, but I think there is a lot to learn there. The system I designed at Feedster (which is still in use today), borrowed a number of ideas from P2P networks. Unfortunately I can’t really talk about them here.

There were some things lacking from the paper though. Spam, splogs and sparches (spam searches designed to rip off and repurpose content, it is my own term) are all big issues. Storing and accessing all the data that is generated by the crawlers, namely the raw data that will be used for indexing or the metadata that will be used when assembling the SERP are both a major issues.

Lessons Learnt – Aim in Front of the Target

While working at Feedster, I was originally responsible for the crawler, the indexer and the search engine. Since I knew nothing about crawlers, I still don’t but I know more about them now than I did then, I handed that over to people who were much more competent than me in that area.

So I was left looking after the indexer and the search engine. Initially, back in 2003, our traffic was very low but it kept increasing. As time went on, I would have to re-architect the indexer and the search engine to be able to deal with the increasing amount of data we were crawling and with the increasing number of searches we were getting. The trick to aim for the growth we were anticipating in six to twelve months time and not the growth we were anticipating in two to three months time. That way, the indexer and the search engine were able to deal with traffic growth very easily and without causing issues.

The other lesson learnt here was to aim for simplicity. You want to make the data administration easy when you are dealing with large amounts of data, and increasingly large amounts of data. You want to make the system robust so when machines crash, or networks go down, or power goes out, whatever is developed has to deal gracefully with system degradation and has to recover on its own when whatever when down comes back up.

All Good Things…

All good things end, and last friday I left Feedster after spending four years there.

Feedster was founded in early 2003 during a particularly tenebrous Boston winter. Scott Johnson and myself had started to build feed search engines independently of each other. After about a month of doing this, we started an email conversation one friday evening and decided to meet for lunch to talk about what we were doing.

We talked on and off for a few weeks and it became clear to both of us that there was something there, that neither of us could got it alone and that we both had different strengths, so we decided to join forces and Feedster was truly born.

Now after four years of work, it is time for me to move on. Working on Feedster has been a lot of fun, I have been privileged to work with a lot of very smart people, and I have learnt a lot along the way.

Feedster has been through a lot of changes over the years, and the current team has been in place for just over a year. It is a very strong team and I will miss working with them a lot. I have a very deep respect for them and I have made many good friends along the way.

I will be remaining as an advisor to Feedster for the foreseeable future, so I will be in touch with them on a regular basis. And I know that Feedster will go onto greater and greater things.

For my part, I have no immediate plans, save taking some much needed vacations and batting around some ideas that have been running around my head.

Lessons Learnt – Go For Stability

In the continuing saga of lessons learnt at Feedster, one of the top ones for me would be to go for stability.

In the very beginning, way back in 2003, we had an internal debate over which linux distribution to use. We decided to use gentoo for a few reasons. The first reason was that the sys admin who was working with us was most familiar with that particular distribution. The second reason was that it afforded us a high degree of granularity in choosing which packages would be installed on specific machines. The third reason was that it was tweacked and so performed a little better than other distributions.

I have since done a u-turn (or 180) on this. All linux distributions share a common DNA if you will, so 95% of admin skill you have on one can be transfered to another. While you can pick and choose which packages you want to install on a specific machine if you have 5 or 10 machines, this quickly becomes pointless when you have 50 or a 100 machines. Most distributions performs well enough for most workload, and my experience is that tweacked distributions lose a some amount of stability.

So at this point, I feel that CentOS is a stable, no surprises distribution.

Mysql Scaling

By way of Matt Mullenweg, an interesting article on scaling Mysql , and some very interesting notes that Matt took while at Mysql Camp.

At Feedster, we use Mysql extensively, and it has not been without some amount of pain, which I will write about soon.