Challenges on Distributed Web Retrieval

I came across a great paper called “Challenges on Distributed Web Retrieval” by way of Greg Linden.

I read the paper twice and I would strongly recommend it to anyone who has an interest in this area. The paper does a very good job of summarizing the challenges that come with creating a search engine, partitioning up the problem into three sections: crawling, indexing and searching. The paper also contains an extensive bibliography.

I did have a few comments about the paper though.

Concerning load peaks:

…or peaks in the query workload.

In my experience there are no peaks in the overall query workload; it is pretty much even all day. There are, however, peaks for specific searches when some event happens.

There is a good discussion about whether to partition the index along documents or terms. My experience is that document-based partitioning is much easier to deal with when you are indexing and when you are managing the data on your search machines. The paper makes a good case for better load balancing when term-based partitioning is used, but that comes at the expense of added administration because you need to be snooping the search log in real time and adjusting the replication of partitions on the fly. Messy and error prone. Document-based partitioning is easy to manage, and it is very easy to replicate partitions to allow for redundancy and load management. My experience is that there is enough randomness in the term distribution within each partition to keep the load balanced across partitions with a normal mix of queries. The paper also seems to assume that all queries impart the same load on each query processor, which is just not the case. Andy MacFarlane, a colleague of mine at City University, has done some work in this area too.
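
To make the document-partitioned approach concrete, here is a minimal sketch, purely my own illustration rather than anything from the paper or from Feedster: each partition holds a complete index over its own subset of documents, and a coordinator fans the query out to every partition and merges their partial top-k lists. The Partition class and coordinate function are hypothetical names.

```python
import heapq
from concurrent.futures import ThreadPoolExecutor

# Hypothetical partition: holds the full inverted index for its own subset
# of documents, so it can answer any query on its own.
class Partition:
    def __init__(self, name, postings):
        self.name = name
        self.postings = postings  # {term: [(doc_id, score), ...]}

    def search(self, term, k):
        # Return this partition's top-k (score, doc_id) pairs for the term.
        hits = self.postings.get(term, [])
        return heapq.nlargest(k, ((score, doc_id) for doc_id, score in hits))

def coordinate(partitions, term, k=10):
    # The coordinator fans the query out to every partition in parallel
    # and merges the partial top-k lists into a global top-k.
    with ThreadPoolExecutor() as pool:
        partials = list(pool.map(lambda p: p.search(term, k), partitions))
    return heapq.nlargest(k, (hit for partial in partials for hit in partial))

if __name__ == "__main__":
    partitions = [
        Partition("p0", {"retrieval": [(1, 0.9), (2, 0.3)]}),
        Partition("p1", {"retrieval": [(7, 0.7)], "crawling": [(8, 0.5)]}),
    ]
    print(coordinate(partitions, "retrieval"))  # [(0.9, 1), (0.7, 7), (0.3, 2)]
```

The nice operational property is that adding a replica of a partition, or taking one out for maintenance, changes nothing about how queries are evaluated.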

There is also mention of blogs and news sites:

In practical distributed Web search engines, indexes are usually rebuilt from scratch after each update of the underlying document collection. It might not be the case for certain special document collections, such as news articles, and blogs, where updates are so frequent that there is usually some kind of online index maintenance strategy. This dynamic index structure constrains the capacity and the response time of the system since the update operation usually requires locking the index using a mutex thus possibly jeopardizing the whole system performance. A very interesting problem is the one of understanding whether it is possible to safely lock the index without experiencing too much loss of performance. This is even more problematic in the case of term partitioned distributed IR systems. Terms that require frequent updates might be spread across different servers, thus amplifying the lockout effect. 

This problem turns out to be quite easy to solve, and I did solve it at Feedster. What really helps here is that the data is not monolithic, as it would be for a normal web search engine where you generally take a snapshot of the web, create an index and make it available for searching. Blogs and news update very often, and yesterday’s news is old news, so you are generally indexing new data only. The downside is that you are indexing ad nauseam.
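
As a rough illustration of the “index only the new data” idea, and only a sketch of my own rather than Feedster’s actual design: new documents can be built into small, immutable segments off to the side and then published into the searchable set with a cheap atomic step, so no long-lived lock is ever taken on an index that is being searched. Segment and RollingIndex are made-up names.

```python
from collections import defaultdict
from datetime import datetime, timezone

# A small, immutable inverted index built from one batch of newly arrived
# documents (say, the last few minutes of blog posts).
class Segment:
    def __init__(self, docs):
        self.created = datetime.now(timezone.utc)
        self.index = defaultdict(list)
        for doc_id, text in docs:
            for term in set(text.lower().split()):
                self.index[term].append(doc_id)

    def search(self, term):
        return self.index.get(term, [])

class RollingIndex:
    """The searchable set: new segments are appended, old ones retired."""

    def __init__(self, max_segments=100):
        self.segments = []
        self.max_segments = max_segments

    def add_batch(self, docs):
        # Build the segment off to the side, then publish it with a cheap
        # append instead of locking and rewriting the whole index.
        self.segments.append(Segment(docs))
        if len(self.segments) > self.max_segments:
            self.segments.pop(0)  # yesterday's news is old news

    def search(self, term):
        return [doc for seg in self.segments for doc in seg.search(term)]

if __name__ == "__main__":
    idx = RollingIndex()
    idx.add_batch([("post-1", "challenges on distributed web retrieval")])
    idx.add_batch([("post-2", "distributed indexing of blogs and news")])
    print(idx.search("distributed"))  # ['post-1', 'post-2']
```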

Caching also gets a good mention in the paper. Caching is a very powerful tool for web search engines since you will see the same queries over and over again. Where things get a little more interesting is when you are dealing with data that is constantly changing, like blogs and news. If the index is fairly static, you can move your cache as close to the user as you can, putting it in the query coordinators and sending a cache invalidation message to the query coordinators whenever the index changes. However, if the index is fairly dynamic you need to keep your cache in the query processors, since they are the ones who first know when the index has changed, and you can’t keep sending the query coordinators cache invalidation messages every time the index is updated.
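
Here is a minimal sketch of my own, assuming a made-up execute_query function in place of the real search path, of how a cache kept inside the query processor can be tied to the index version so that updates invalidate entries implicitly:

```python
# Entries are tagged with the index version they were computed against, so
# an index update invalidates them implicitly, with no messages sent to the
# query coordinators. (execute_query is a stand-in for the real search path.)
class QueryProcessorCache:
    def __init__(self, execute_query):
        self.execute_query = execute_query
        self.index_version = 0
        self.cache = {}

    def on_index_update(self):
        # Called whenever this query processor applies new documents;
        # bumping the version makes every cached entry a miss from now on.
        self.index_version += 1

    def search(self, query):
        key = (query, self.index_version)
        if key not in self.cache:
            self.cache[key] = self.execute_query(query)
        return self.cache[key]

if __name__ == "__main__":
    qp = QueryProcessorCache(lambda q: [f"result for {q!r}"])
    print(qp.search("distributed retrieval"))  # computed
    print(qp.search("distributed retrieval"))  # served from the cache
    qp.on_index_update()
    print(qp.search("distributed retrieval"))  # recomputed against the new index
```

Stale entries from old versions still need to be evicted eventually, but the point is that the knowledge that the index just changed stays local to the query processor instead of being broadcast to the coordinators.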

Latency is mentioned as a problem when working across some sort of WAN, but it is also an issue when working on a LAN. In fact latency is the killer and you need to make sure that there is as little latency in the search pipeline as possible.

P2P also gets a passing mention, but I think there is a lot to learn there. The system I designed at Feedster (which is still in use today), borrowed a number of ideas from P2P networks. Unfortunately I can’t really talk about them here.

There were some things lacking from the paper though. Spam, splogs and sparches (spam searches designed to rip off and repurpose content; the term is my own) are all big issues. Storing and accessing all the data generated by the crawlers, namely the raw data that will be used for indexing and the metadata that will be used when assembling the SERP, are also major issues.


6 Responses to Challenges on Distributed Web Retrieval

  1. wolf says:

    Have a look at FAROO, a peer-to-peer web search engine. The users connect their computers, building a worldwide, distributed p2p web search engine. No centralized index or crawler is required anymore. Every web page visited is automatically included in the distributed index of the search engine.

  2. I would be really curious how well this works in practice; it looks like they index the web pages that people browse. This has a couple of obvious issues: first, this type of crawling is not going to be exhaustive and, second, there are privacy issues. The fact that the crawl is not exhaustive means that any search is not going to be exhaustive. There is also the issue of latency in searching. The P2P applications I have used have a lot of latency built in, that is the nature of the beast, and users typically have a patience span of 2-5 seconds when waiting for search results.

    I remember reading an article related to this on IBM DeveloperWorks (sorry, could not track it down) which described a similar system implemented at IBM. I am pretty sure that they had to implement super-nodes which acted as central indices to keep the latency down, which they needed to do even on an intranet.

  3. wolf says:

    > This has a couple of obvious issues, first this type of crawling is not
    > going to be exhaustive and, second, there are privacy issues.

    With an estimated one billion worldwide internet users, if you had one million faroo users, then each page which had been visited by at least 1000 users in a given time would also be indexed at least once in the same time by one faroo user (in the statistical mean).
    This kind of distributed crawling causes no additional traffic, and it eliminates the problem that your peer indexes web sites which are, e.g., illegal and that you could be held responsible for visiting/indexing them.
    Also, the crawling done by normal search engines is not exhaustive. Only web sites that are submitted, or those which are connected to the web by incoming links, may be discovered and indexed.

    There should be no other privacy issue than with, e.g., the Google toolbar or Alexa toolbar or your ISP, which know about all the URLs you visit.
    Rather the opposite, as there is no central instance receiving this information, only other distributed peers, which don’t know what they are receiving because the visited URL and the content are encrypted.

  4. I would like to offer some counter-points:

    Yes, there are lots of users on the web, and getting one million faroo users would be quite an achievement. I am not sure how you would get to those kinds of numbers, since this system relies on achieving a critical mass to be useful. I am not even sure what that critical mass would be: 250,000 users? 500,000 users? 1,000,000 users?

    This system also assumes that people visit sites in an exhaustive manner, which I don’t think they do. I think it would be very interesting to see some research on how browsing patterns affect coverage, and how many users would be needed to cover a site like, say, the New York Times.

    There is a privacy issue with toolbars; it is just that those who install them are either comfortable with giving up privacy in return for whatever functionality the toolbar provides, or are oblivious to the issue. Just like there are privacy issues with this system. At its most basic level Faroo is a search engine sitting on my computer with an index of the sites I visited. Encrypting the content is meaningless since it needs to be decrypted at some point for display to any user who searches the system. And since this is a P2P network, I can address any node in the network directly and explicitly; note that this kind of thing is already being done by companies who track P2P networks for copyrighted content.

    Your argument that this kind of crawling eliminates the problem of indexing illegal websites does not really hold. How can you control users’ behavior? And when you say that you “could be held responsible for visiting/indexing” an illegal site, doesn’t that suggest that a third party can identify the web sites someone is browsing?

    Which brings up an issue that I did not think of previously: spamming, which is very easy to do with this kind of system and would be impossible to get rid of in this setup. At least with a centralized crawler you can filter out spam. And I can tell you, having run a search engine before (Feedster), that spam will be a major issue.

  5. wolf says:

    Just to continue our interesting discourse ;-)

    > This system also assumes that people visit sites in an exhaustive manner …
    Not exhaustive in the sense that the same user needs to visit all pages of a site. But if pages are not visited by anybody at all, they are not indexed either. Every page which is visited by at least 1000 people is also indexed by faroo (in the mean, assuming 1 million faroo users). I agree that 1 million installations is a tough task, but we have done it once before.

    > at its most basic level Faroo is a search engine sitting on my computer with an index of the sites I visited
    No, it is not. While with unstructured p2p networks the content is stored at the originating peers, with structured p2p networks based on distributed hash tables the pages you visited are stored at other peers.
    Each peer has a nodeID. The document URL of a visited page is stored at the peers whose nodeIDs are closest to the hash of the keywords contained in that page.
    Because the index peer does not store the IP address of the visiting peer, it is impossible to tell at a later time which peer visited a certain page.

    > How can you control users’ behavior?
    The goal is not to prevent a user from visiting a certain page.
    The goal was to prevent a distributed crawler from fetching pages without the knowledge of the user, because his IP address ends up in the log file of the visited page.

    > Encrypting the content is meaningless since it needs to be decrypted … to any user who searches the system.
    Not exactly meaningless;-) The content is encrypted by the visiting peer, arrives already encrypted at the index peer, and is decrypted after the results arrive at the searching peer. The index peer knows the IP address, but not the content, and the searching peer knows the content, but not the IP address.
    Thus the user who searches the system knows what content was indexed, but not by whom. And the index peer knows who is indexing, but not what pages. In this respect it is a kind of anonymizer.

  6. wolf says:

    > … doesn’t that suggest that a third party can identify the web sites someone is browsing?
    Of course. At least the visited web page (and everybody who at some moment in time has access to its logfiles), your ISP, and authorities in the domain name system.

    > … spamming is very easy with this kind of system
    A distributed system can use the same anti-spamming measures as a centralized system. The users are visiting pages, but the system decides whether to index them or not.
    I believe that for the success of spamming it is not the indexing that is crucial, but the ranking. Faroo uses the “wisdom of crowds” for ranking. It means that only pages that are attractive to a certain percentage of users are ranked well.
    Thus spammed pages are ousted from the results by those attractive to the users.
