French Court Fines Google $660,000 Because Google Maps Is Free

I am so glad I escaped France and that I am no longer French. Only in France could this happen: protecting a business whose model has been disrupted by a competitor. The sad thing is that this only benefits the incumbent, not its customers, who could cut costs by shifting to a cheaper (free!) product, or consumers at large, for the same reason. So whatever friction was removed from the system has now been artificially reintroduced. Sucks to be a French consumer.

 

French Court Fines Google $660,000 Because Google Maps Is Free:

Google faces a $660,000 fine after a French court ruling that the company is abusing its dominant position in mapping by making Google Maps free.

According to The Economic Times, the French commercial court “upheld an unfair competition complaint lodged by Bottin Cartographes against Google France and its parent company Google Inc. for providing free web mapping services to some businesses.”

Bottin Cartographes provides mapping services for a cost, and its website boasts several business clients such as Louis Vuitton, Airbus and several automobile manufacturers.

Online Mendelian Inheritance in Man (OMIM)

Been spending time working on the OMIM website. Basically a tiered system with an API (developed with Java, MySQL, MyBatis and Lucene/Solr) and a front end (developed with Django and jQuery).

Lots of moving parts to the site: every night we download data from about 20 sources (about 3GB of data in total), parse it all and assemble the database and all the links to external resources. Basically a big ETL machine.
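To give a flavor of what one parse step in a nightly run like this can look like, here is a minimal sketch in Python. It is not the OMIM code (that side is Java); the column names `gene_id` and `symbol` and the sample data are hypothetical, and the only assumption about the source is that it is comma-delimited.

```python
import csv
import io


def parse_source(text, required=("gene_id", "symbol")):
    """Parse one comma-delimited source, skipping rows missing required fields.

    Returns (good_rows, skipped_count). Column names are hypothetical.
    """
    good, skipped = [], 0
    for row in csv.DictReader(io.StringIO(text)):
        # Normalize whitespace, since sources differ in how clean they are.
        cleaned = {k.strip(): (v or "").strip() for k, v in row.items() if k}
        if all(cleaned.get(col) for col in required):
            good.append(cleaned)
        else:
            skipped += 1
    return good, skipped


# Made-up sample standing in for one downloaded file.
sample = "gene_id,symbol\n1001,ABC1\n,MISSING_ID\n1002, XYZ2 \n"
rows, skipped = parse_source(sample)
print(len(rows), "rows kept;", skipped, "skipped")
```

Counting the skipped rows rather than silently dropping them makes it easier to notice when a source changes its layout overnight.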

What is interesting to me is the breadth of quality in the data and the lack of standardization. Actually the only standard that exists is the comma delimiter. The other interesting thing is that some sites really strive to keep their data up to date while others are much more, shall we say, relaxed about it.

OMIM also now has a Twitter account.

WhistlePig: A minimalist real-time search engine

By way of “Jeff’s Search Engine Caffè“, I came across WhistlePig, a small realtime search engine written in C and Ruby.

A few things caught my eye about this. One was the sort order, which is hardwired to be reverse chronological. This was the same default sort order we used at Feedster, and it allows for quite a few optimizations. The other is that it allows for the realtime addition of documents. I played around with this in a test search engine I built for fun after I left Feedster.
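A rough illustration of why a fixed reverse chronological order helps, as a toy Python sketch. This is not how WhistlePig or Feedster is implemented; the class and method names are made up for the example. The idea is simply that if document ids are handed out in arrival order, every postings list stays sorted by time for free, so a search can read from the end and stop as soon as it has enough hits.

```python
from collections import defaultdict
from itertools import islice


class ReverseChronoIndex:
    """Toy in-memory index whose results are always newest-first."""

    def __init__(self):
        self.docs = []                     # doc id == position == arrival order
        self.postings = defaultdict(list)  # term -> ascending list of doc ids

    def add(self, text):
        """Add a document in real time; it is searchable immediately."""
        doc_id = len(self.docs)
        self.docs.append(text)
        for term in set(text.lower().split()):
            # Ids only grow, so appending keeps every postings list sorted.
            self.postings[term].append(doc_id)
        return doc_id

    def search(self, term, limit=10):
        """Return up to `limit` matching doc ids, newest first."""
        matches = self.postings.get(term.lower(), [])
        # Walk backwards and stop early: no scoring or sorting pass needed.
        return list(islice(reversed(matches), limit))


index = ReverseChronoIndex()
index.add("percolator incremental indexing")
index.add("realtime search in ruby")
index.add("incremental search updates")
print(index.search("search", limit=1))
```

Adding a document is just a handful of appends, which is what makes realtime additions cheap in this kind of design.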

I should have some downtime in the next few months so I will take a look at this search engine.

Large-scale Incremental Processing Using Distributed Transactions and Notifications

This is definitely worth reading:

Large-scale Incremental Processing Using Distributed Transactions and Notifications: “Updating an index of the web as documents are crawled requires continuously transforming a large repository of existing documents as new documents arrive. This task is one example of a class of data processing tasks that transform a large repository of data via small, independent mutations. These tasks lie in a gap between the capabilities of existing infrastructure. Databases do not meet the storage or throughput requirements of these tasks: Google’s indexing system stores tens of petabytes of data and processes billions of updates per day on thousands of machines. MapReduce and other batch-processing systems cannot process small updates individually as they rely on creating large batches for efficiency.

We have built Percolator, a system for incrementally processing updates to a large data set, and deployed it to create the Google web search index. By replacing a batch-based indexing system with an indexing system based on incremental processing using Percolator, we process the same number of documents per day, while reducing the average age of documents in Google search results by 50%.”

(Via Recent Google Publications (Atom).)

I built an incremental indexer while at Feedster, albeit on a much smaller scale. We had a 10-minute turnaround time for newly crawled stuff, which wasn’t too shabby I think.

 

Information Retrieval: Implementing and Evaluating Search Engines

A book well worth getting if you are in the information retrieval field:

Information Retrieval: Implementing and Evaluating Search Engines: “Information retrieval is the foundation for modern search engines. This textbook offers an introduction to the core topics underlying modern search technologies, including algorithms, data structures, indexing, retrieval, and evaluation. The emphasis is on implementation and experimentation; each chapter includes exercises and suggestions for student projects. Wumpus, a multi-user open-source information retrieval system developed by one of the authors and available online, provides model implementations and a basis for student work. The modular structure of the book allows instructors to use it in a variety of graduate-level courses, including courses taught from a database systems implementation perspective, traditional information retrieval courses with a focus on IR theory, and courses covering the basics of Web retrieval. Additionally, professionals in computer science, computer engineering, and software engineering will find Information Retrieval a valuable reference. After an introduction to the basics of information retrieval, the text covers three major topic areas — indexing, retrieval, and evaluation — in self-contained parts. The final part of the book draws on and extends the general material in the earlier parts, treating specific application areas, including parallel search engines, link analysis, crawling, and information retrieval over collections of XML documents. End-of-chapter references point to further reading; end-of-chapter exercises range from pencil and paper problems to substantial programming projects.”

Yahoo Japan Acts Googley

I was very surprised to see that the transfer of tech was going from Google to Yahoo Japan. I say this because I recently did some research which involved very limited testing of how various search engines handled Japanese searches, and I found that Yahoo Japan did the best job of the six I tested (namely Yahoo US, Yahoo Japan, Google US, Google Japan, Bing US and Bing Japan). Admittedly, the testing was very limited and I am not a Japanese speaker, but I checked the various behaviors with a Japanese speaker, who told me that Yahoo Japan was the ‘best behaved’:

Under the terms of the new alliance, Yahoo will use Google’s search and advertising platform technology to power its site, matching Google’s superior tech with its own, highly popular content portals. In Japan, Google hasn’t quite enjoyed the success it has elsewhere around the world, trailing Yahoo in search dominance. This new deal makes it the cock of the walk; according to the New York Times, Google and Yahoo together comprise 90.5 percent of the Japanese search market. (If you’re wondering why Yahoo would cut against Microsoft like this, the answer is that Yahoo is actually a minority owner in its own Japanese property; the biggest shareholder is the cell phone company SoftBank.)

 

(Via Beyond Search.)

Open Source Search Conference

Just came across this on Steve Arnold’s weblog: Lucid Imagination is sponsoring an open source search conference in Boston, MA on October 7-8, 2010 at the Hyatt Harborside:

The first-ever conference focused on addressing the business and development aspects of open source search will take place October 7-8, 2010 at the Hyatt Harborside in Boston.

Dubbed Lucene Revolution due to the sponsor, Lucid Imagination, the commercial company dedicated to Apache Lucene technology. This inaugural event promises a full, forward-thinking agenda, creating opportunities for developers, technologists and business leaders to explore the benefits that open source enterprise search makes possible.

In addition to in-depth training provided by Lucid Imagination professionals, there will be two days of content rich talks and presentations by Lucene and Solr open source experts. Working on the program will be Stephen E. Arnold, author and consultant.

Those interested in learning more about the conference and submitting a proposal for a talk can navigate to http://lucenerevolution.com/. The deadline for submissions is June 23, 2010. Individuals are encouraged to submit proposals for papers and talks that focus on categories including enterprise case studies, cloud-based deployment of Lucene/Solr, large-scale search, and data integration.

The Lucene Revolution conference comes just after success of sold-out Apache Lucene EuroCon 2010 in Prague, also sponsored by Lucid Imagination, the single largest gathering of open source search developers to date.

FoxTrot Search for Mac OS X

If you were looking for an alternative to Spotlight on Mac OS X, you might want to check out FoxTrot Search:

CTM FoxTrot Professional Search is a powerful new find-by-content product for legal, media or mobile professionals and their networks, which offers precision tools for finding the proverbial “needle in a haystack” within PDF, HTML, word processing, e-mail and rich-media metadata.

Wolfram Alpha iPhone/iPad App

I was happy to see (via Beyond Search) that Wolfram Alpha dropped the cost of their iPhone/iPad app. Originally it was one cent short of $50; now it is a much more reasonable $1.99.

Assessing the Cost-Benefit of Substituting Open Source Solutions

Recently I was having a conversation with a VP of Engineering about search, amongst other topics, and they mentioned in passing that they were getting some pressure to substitute a well-known open source search engine for their own internally developed search engine. They told me that the case was not clear-cut, as they had invested a lot of resources into developing their own search engine to meet their needs and the needs of their customers.

This got me thinking about how you would go about doing a cost-benefit analysis to assess this, and I reduced it to two sides of a balance sheet.

On one side you need to assess the cost of the component you are looking to replace; presumably this translates into savings. This is not as simple as it looks, and the savings may well not be as significant as they look. There is always a large amount of support infrastructure around a basic search engine, such as document preparation, user accounts, integration into document management systems, etc. For companies who make their living taking databases from publishers and making them available on their search engine, this would probably be quite significant.

On the other side of the equation, you need to assess what needs to be changed in the open source solution you are planning to adopt to meet your needs. An open source solution would typically be quite generic, so there may well be specific features which are not catered for and which will need to be built. These features would need to be re-incorporated into future releases of the open source solution. You could also be constrained by the release cycle of the open source solution.
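As a back-of-the-envelope way to lay the two sides next to each other, here is a small Python sketch. The function name, the parameters and the figures are all placeholders made up for illustration, not numbers from the conversation, and the formula is only one plausible way to net the savings against the adoption costs described above.

```python
def net_benefit(component_savings, feature_build_cost,
                reintegration_cost_per_release, releases):
    """Savings from retiring the in-house component minus adoption costs.

    Adoption cost = one-off cost of building the missing features, plus the
    cost of re-incorporating them each time the open source project releases.
    """
    adoption_cost = feature_build_cost + reintegration_cost_per_release * releases
    return component_savings - adoption_cost


# Placeholder figures in arbitrary units, for illustration only.
print(net_benefit(component_savings=100, feature_build_cost=60,
                  reintegration_cost_per_release=10, releases=3))
```

The hard part is estimating the inputs, especially how much of the existing cost really sits in the search component rather than in the infrastructure around it.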

This is not an easy calculation.
