French Court Fines Google $660,000 Because Google Maps Is Free

I am so glad I escaped France and that I am no longer French. Only in France could this happen, protecting a business whose model has been disrupted by a competitor. The sad thing is that this only benefits the incumbent, not its customers who could cut costs by shifting to a cheaper (free!) product, or the consumers at large for the same reason. So whatever friction was removed from the system has now been artificially reintroduced. Sucks to be a French consumer.


French Court Fines Google $660,000 Because Google Maps Is Free:

Google faces a $660,000 fine after a French court ruling that the company is abusing its dominant position in mapping by making Google Maps free.

According to The Economic Times, the French commercial court “upheld an unfair competition complaint lodged by Bottin Cartographes against Google France and its parent company Google Inc. for providing free web mapping services to some businesses.”

Bottin Cartographes provides mapping services for a cost, and its website boasts several business clients such as Louis Vuitton, Airbus and several automobile manufacturers.


Online Mendelian Inheritance in Man (OMIM)

Been spending time working on the OMIM website. Basically a tiered system with an API (developed with Java, MySQL, MyBatis and Lucene/Solr), and a front end (developed with Django and jQuery).

Lots of moving parts to the site. Every night we download data from about 20 sources (about 3GB of data in total), parse it all and assemble the database and all the links to external resources. Basically a big ETL machine.

What is interesting to me is the breadth of quality in the data and the lack of standardization. Actually the only standard that exists is the comma delimiter. The other interesting thing is that some sites really strive to keep their data up to date while others are much more, shall we say, relaxed about it.
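To give a flavor of what that lack of standardization means in practice, here is a minimal sketch of normalizing comma-delimited files whose headers disagree. This is not the actual OMIM code: the header names, the alias table and the sample values are all made up for illustration.

```python
import csv
import io

# Hypothetical mapping from each source's header spelling to one common name.
HEADER_ALIASES = {
    "gene symbol": "gene_symbol",
    "symbol": "gene_symbol",
    "mim number": "mim_number",
    "mim_id": "mim_number",
}


def normalize_rows(text):
    """Parse comma-delimited text and yield dicts keyed by normalized header names."""
    reader = csv.reader(io.StringIO(text))
    # Lower-case and trim each header, then map known aliases to a common name.
    header = [
        HEADER_ALIASES.get(name.strip().lower(), name.strip().lower())
        for name in next(reader)
    ]
    for row in reader:
        if not any(cell.strip() for cell in row):
            continue  # skip blank lines
        yield {key: value.strip() for key, value in zip(header, row)}
```

Two sources that spell and order their columns differently come out with the same keys, which is most of the battle before loading anything into the database.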

OMIM also now has a Twitter account.

WhistlePig: A minimalist real-time search engine

By way of “Jeff’s Search Engine Caffè”, I came across WhistlePig, a small realtime search engine written in C and Ruby.

A few things caught my eye about this. One was the sort order, which is hardwired to be reverse chronological. This was the same default sort order we used at Feedster, and it allows for quite a few optimizations. The other is that it allows for the realtime addition of documents. I played around with this in a test search engine I built for fun after I left Feedster.
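Here is a toy sketch of why a fixed reverse chronological order helps. It is not how WhistlePig or Feedster actually did it, and the class and method names are my own. If document IDs are handed out in arrival order, every postings list stays sorted just by appending, new documents are searchable immediately, and a query for the newest N hits can stop after reading N entries from the tail.

```python
from collections import defaultdict
from itertools import islice


class ReverseChronoIndex:
    """Toy in-memory index; doc IDs are assigned in arrival order."""

    def __init__(self):
        self.postings = defaultdict(list)  # term -> doc IDs in ascending order
        self.docs = []                     # doc ID -> original text

    def add(self, text):
        """Index a document and return its ID; it is searchable right away."""
        doc_id = len(self.docs)
        self.docs.append(text)
        for term in set(text.lower().split()):
            # IDs only grow, so appending keeps each postings list sorted.
            self.postings[term].append(doc_id)
        return doc_id

    def search(self, term, limit=10):
        """Return up to `limit` matching doc IDs, newest first."""
        # Newest documents sit at the tail, so walk backwards and stop early.
        newest_first = reversed(self.postings.get(term.lower(), []))
        return list(islice(newest_first, limit))
```

No scoring and no sorting at query time: the index order is the result order, which is where the optimizations come from.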

I should have some downtime in the next few months so I will take a look at this search engine.

Large-scale Incremental Processing Using Distributed Transactions and Notifications

This is definitely worth reading:

Large-scale Incremental Processing Using Distributed Transactions and Notifications: “Updating an index of the web as documents are crawled requires continuously transforming a large repository of existing documents as new documents arrive. This task is one example of a class of data processing tasks that transform a large repository of data via small, independent mutations. These tasks lie in a gap between the capabilities of existing infrastructure. Databases do not meet the storage or throughput requirements of these tasks: Google’s indexing system stores tens of petabytes of data and processes billions of updates per day on thousands of machines. MapReduce and other batch-processing systems cannot process small updates individually as they rely on creating large batches for efficiency.

We have built Percolator, a system for incrementally processing updates to a large data set, and deployed it to create the Google web search index. By replacing a batch-based indexing system with an indexing system based on incremental processing using Percolator, we process the same number of documents per day, while reducing the average age of documents in Google search results by 50%.”

(Via Recent Google Publications (Atom).)

I built an incremental indexer while at Feedster, albeit on a much smaller scale. We had a 10-minute turnaround time for newly crawled stuff, which wasn’t too shabby I think.


Information Retrieval: Implementing and Evaluating Search Engines

A book well worth getting if you are in the information retrieval field:

Information Retrieval: Implementing and Evaluating Search Engines: “Information retrieval is the foundation for modern search engines. This textbook offers an introduction to the core topics underlying modern search technologies, including algorithms, data structures, indexing, retrieval, and evaluation. The emphasis is on implementation and experimentation; each chapter includes exercises and suggestions for student projects. Wumpus, a multi-user open-source information retrieval system developed by one of the authors and available online, provides model implementations and a basis for student work. The modular structure of the book allows instructors to use it in a variety of graduate-level courses, including courses taught from a database systems implementation perspective, traditional information retrieval courses with a focus on IR theory, and courses covering the basics of Web retrieval. Additionally, professionals in computer science, computer engineering, and software engineering will find Information Retrieval a valuable reference. After an introduction to the basics of information retrieval, the text covers three major topic areas — indexing, retrieval, and evaluation — in self-contained parts. The final part of the book draws on and extends the general material in the earlier parts, treating specific application areas, including parallel search engines, link analysis, crawling, and information retrieval over collections of XML documents. End-of-chapter references point to further reading; end-of-chapter exercises range from pencil and paper problems to substantial programming projects.”

Yahoo Japan Acts Googley

I was very surprised to see that the transfer of tech was going from Google to Yahoo Japan. I say this because I recently did some research, involving very limited testing, on how various search engines handle Japanese searches, and I found that Yahoo Japan did the best job of the six I tested (namely Yahoo US, Yahoo Japan, Google US, Google Japan, Bing US and Bing Japan). Granted, the testing was very limited and I am not a Japanese speaker, but I checked the various behaviors with a Japanese speaker, who told me that Yahoo Japan was the ‘best behaved’:

Under the terms of the new alliance, Yahoo will use Google’s search and advertising platform technology to power its site, matching Google’s superior tech with its own, highly popular content portals. In Japan, Google hasn’t quite enjoyed the success it has elsewhere around the world, trailing Yahoo in search dominance. This new deal makes it the cock of the walk; according to the New York Times, Google and Yahoo together comprise 90.5 percent of the Japanese search market. (If you’re wondering why Yahoo would cut against Microsoft like this, the answer is that Yahoo is actually a minority owner in its own Japanese property; the biggest shareholder is the cell phone company SoftBank.)


(Via Beyond Search.)

Open Source Search Conference

Just came across this on Steve Arnold’s weblog: Lucid Imagination is sponsoring an open source search conference in Boston, MA on October 7-8, 2010 at the Hyatt Harborside:

The first-ever conference focused on addressing the business and development aspects of open source search will take place October 7-8, 2010 at the Hyatt Harborside in Boston.

Dubbed Lucene Revolution due to the sponsor, Lucid Imagination, the commercial company dedicated to Apache Lucene technology. This inaugural event promises a full, forward-thinking agenda, creating opportunities for developers, technologists and business leaders to explore the benefits that open source enterprise search makes possible.

In addition to in-depth training provided by Lucid Imagination professionals, there will be two days of content rich talks and presentations by Lucene and Solr open source experts. Working on the program will be Stephen E. Arnold, author and consultant.

Those interested in learning more about the conference and submitting a proposal for a talk can navigate to The deadline for submissions is June 23, 2010. Individuals are encouraged to submit proposals for papers and talks that focus on categories including enterprise case studies, cloud-based deployment of Lucene/Solr, large-scale search, and data integration.

The Lucene Revolution conference comes just after success of sold-out Apache Lucene EuroCon 2010 in Prague, also sponsored by Lucid Imagination, the single largest gathering of open source search developers to date.