François Schiettecatte’s Blog

Map-Reduce with a Different Flavor

Posted in Python, Scaling, Software Development by François Schiettecatte on September 23, 2008

I'm not sure how I came across Disco, but it somehow landed in my bookmarks of things to check out. Normally I would not post about Map-Reduce, since there is already plenty of easy-to-find material out there, but this one was interesting:

Disco is an open-source implementation of the Map-Reduce framework for distributed computing. Like the original framework, Disco supports parallel computations over large data sets on an unreliable cluster of computers.

The Disco core is written in Erlang, a functional language that is designed for building robust fault-tolerant distributed applications. Users of Disco typically write jobs in Python, which makes it possible to express even complex algorithms or data processing tasks often only in tens of lines of code. This means that you can quickly write scripts to process massive amounts of data.

Disco was started at Nokia Research Center as a lightweight framework for rapid scripting of distributed data processing tasks. So far, Disco has been successfully used, for instance, in parsing and reformatting data, data clustering, probabilistic modelling, data mining, full-text indexing, and log analysis with hundreds of gigabytes of real-world data.

The two things that caught my eye were Erlang (which seems to be getting more and more traction these days, and may be the next language to learn) and the fact that Disco uses Python as the ‘driving’ language.
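
To give a flavor of why Python suits this kind of job, here is a minimal word-count sketch of the map-reduce pattern in plain, single-process Python. It does not use Disco's API at all; the function names (map_words, shuffle, reduce_counts, run_job) are made up for illustration, and a real framework would spread the map and reduce steps across a cluster rather than run them in one loop.

```python
from collections import defaultdict
from itertools import chain


def map_words(line):
    """Map step: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Group the emitted values by key, as a framework would between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_counts(word, counts):
    """Reduce step: collapse all the counts for one word into a total."""
    return word, sum(counts)


def run_job(lines):
    """Run map, shuffle and reduce in sequence (hypothetical local driver)."""
    mapped = chain.from_iterable(map_words(line) for line in lines)
    grouped = shuffle(mapped)
    return dict(reduce_counts(word, counts) for word, counts in grouped.items())


if __name__ == "__main__":
    sample = ["disco runs map reduce", "map reduce in python"]
    print(run_job(sample))
```

The appeal is that the map and reduce functions stay a few lines each, and they are the only parts a job author would normally have to write; the shuffle and the driver are what a framework would take over.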
