Interesting Bug in Django

I am working on a project that involves Django and ran into an interesting issue. Django creates a small database to keep track of various bits of data one of which is user session information in a table called django_session as follows:

CREATE TABLE django_session (
  session_key varchar(40) NOT NULL,
  session_data longtext NOT NULL,
  expire_date datetime NOT NULL,
  PRIMARY KEY (session_key)
);

This is all well and good, but there is an issue. InnoDB stores rows in primary key order, and the primary keys here are SHA1 hex digests. These keys are effectively random, so a new session row can be inserted anywhere in the table, causing data to move around with every insert. While this might work when the table is small, it does not work so well when you have 500,000+ rows in it (which is another issue that I will get to).

A better schema for the table is as follows:

CREATE TABLE django_session (
  id int NOT NULL AUTO_INCREMENT,
  session_key varchar(40) NOT NULL,
  session_data longtext NOT NULL,
  expire_date datetime NOT NULL,
  PRIMARY KEY (id),
  UNIQUE KEY session_key (session_key)
);


This ensures that rows are inserted consecutively, which should give better performance as the table grows.
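To make the difference concrete, here is a small Python sketch (not anything Django or InnoDB actually runs). It assumes a sorted list is a fair stand-in for a table kept in primary key order, and the helper name insert_positions is made up for the illustration. It counts how many inserts land at the end of the table for SHA1 keys versus sequential ids:

```python
import bisect
import hashlib


def insert_positions(keys):
    """Insert keys one at a time into a sorted list and return, for each
    key, how far from the end it landed (0 means it was appended)."""
    ordered = []
    distances = []
    for key in keys:
        pos = bisect.bisect_left(ordered, key)
        distances.append(len(ordered) - pos)
        ordered.insert(pos, key)
    return distances


n = 1000
# Stand-ins for session keys: SHA1 hex digests of arbitrary input.
random_keys = [hashlib.sha1(str(i).encode()).hexdigest() for i in range(n)]
# Stand-ins for AUTO_INCREMENT ids.
sequential_keys = list(range(1, n + 1))

random_appends = insert_positions(random_keys).count(0)
sequential_appends = insert_positions(sequential_keys).count(0)
print("appended at end (SHA1 keys):", random_appends, "of", n)
print("appended at end (sequential ids):", sequential_appends, "of", n)
```

Every sequential id is appended, while almost all of the SHA1 keys land somewhere in the middle, which is the case that forces existing data to move.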

Two things to note:

I am not sure whether Django specifies the ENGINE to use when creating these tables, but MySQL 5.5 defaults to InnoDB rather than MyISAM, and I don’t think this will be an issue with the latter.

The other thing is that Django does not seem to clear out sessions past their expiry date, so one needs to do that regularly with the following statement:

DELETE FROM django_session WHERE expire_date <= NOW()

One more thing: I think the same applies to database-backed caches.
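If you would rather run the cleanup from a script (from cron, say), something along these lines might do. It is only a sketch: the function name purge_expired_sessions is made up, and the demonstration uses an in-memory SQLite table as a stand-in for the real MySQL one so that it runs anywhere. Against MySQL through MySQL-python the placeholder would be "%s" rather than "?".

```python
import sqlite3
from datetime import datetime, timedelta

FMT = "%Y-%m-%d %H:%M:%S"


def purge_expired_sessions(conn, now=None):
    """Delete sessions whose expire_date has passed; return rows removed."""
    now = now or datetime.now()
    cur = conn.cursor()
    # sqlite3 uses "?" placeholders; MySQLdb would use "%s" instead.
    cur.execute("DELETE FROM django_session WHERE expire_date <= ?",
                (now.strftime(FMT),))
    conn.commit()
    return cur.rowcount


# Stand-in table in an in-memory SQLite database, for demonstration only.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE django_session ("
             "session_key varchar(40) PRIMARY KEY, "
             "session_data text NOT NULL, "
             "expire_date datetime NOT NULL)")
now = datetime(2012, 1, 15, 12, 0, 0)
rows = [
    ("a" * 40, "old", now - timedelta(days=1)),      # already expired
    ("b" * 40, "current", now + timedelta(days=1)),  # still valid
]
conn.executemany("INSERT INTO django_session VALUES (?, ?, ?)",
                 [(k, d, e.strftime(FMT)) for k, d, e in rows])

removed = purge_expired_sessions(conn, now)
remaining = conn.execute("SELECT COUNT(*) FROM django_session").fetchone()[0]
print("removed:", removed, "remaining:", remaining)
```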



Django and MySQL

It can be a little challenging to get Django to talk to MySQL, especially if you have a non-standard setup (like I seem to have, every time…).

To access MySQL from Django you need to install MySQL-python, and this is usually where the trouble begins. MySQL-python will run ‘mysql_config’ to determine what the default MySQL settings are. The one to pay attention to is ‘--socket’: this is the socket that MySQL-python will use to access MySQL if the DATABASE/HOST setting in the Django settings file is left empty or set to ‘localhost’. MySQL-python appears to disregard any config setting in /etc/my.cnf, so if you set the ‘socket’ setting in /etc/my.cnf to something else (such as ‘/var/lib/mysql/mysql.sock’) then Django will NOT be able to access the MySQL server.
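A quick way to see which socket ‘mysql_config’ reports is to ask it directly. The sketch below assumes ‘mysql_config’ is on the PATH; the function name reported_socket is made up for the example, and it returns None when the command is missing or fails.

```python
import subprocess


def reported_socket(command="mysql_config"):
    """Return the socket path printed by `mysql_config --socket`,
    or None if the command is missing or exits with an error."""
    try:
        result = subprocess.run([command, "--socket"],
                                capture_output=True, text=True, check=True)
    except (OSError, subprocess.CalledProcessError):
        return None
    return result.stdout.strip() or None


print("mysql_config reports:", reported_socket())
```

Comparing that path with the ‘socket’ line in /etc/my.cnf shows straight away whether the two disagree.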

There are two solutions to this: either set the ‘socket’ setting in /etc/my.cnf to match the setting reported by ‘mysql_config’, or set HOST in the Django settings file to the host name of the machine running MySQL.

One thing that I have noticed is that MySQL installations that are made through apt-get or yum have the socket default set to ‘/tmp/mysql.sock’ whereas MySQL installations that are made from the MySQL download have the socket default set to ‘/var/lib/mysql/mysql.sock’.

Another issue you may run into comes from installing MySQL in a non-standard directory, for example ‘/usr/local/mysql’. While MySQL-python will probably install correctly, it may not be able to pull in the MySQL libraries when running under the Apache server.

What I generally do is let MySQL-python install however it wants, and set the HOST to the hostname of the machine where MySQL is running.
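For reference, the HOST approach might look like the settings fragment below. Every name and value in it (database name, user, password, host) is a placeholder rather than anything from a real setup, and it assumes a Django version that uses the DATABASES dictionary.

```python
# settings.py (fragment). All names and values below are placeholders.
DATABASES = {
    "default": {
        "ENGINE": "django.db.backends.mysql",
        "NAME": "mydatabase",
        "USER": "myuser",
        "PASSWORD": "mypassword",
        # A real host name makes the client connect over TCP instead of
        # the Unix socket that '' or 'localhost' would select.
        "HOST": "db.example.com",
        "PORT": "3306",
    }
}
```

According to the Django documentation, a HOST value that starts with a forward slash is treated as the path to a Unix socket when using MySQL, which is another way around the mismatch if you want to keep using the socket.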

Online Mendelian Inheritance in Man (OMIM)

Been spending time working on the OMIM website. Basically a tiered system with an API (developed with Java, MySQL, MyBatis and Lucene/Solr) and a front end (developed with Django and jQuery).

Lots of moving parts to the site: every night we download data from about 20 sources (about 3GB of data in total), parse it all, and assemble the database and all the links to external resources. Basically a big ETL machine.

What is interesting to me is the range in quality of the data and the lack of standardization. Actually, the only standard that exists is the comma delimiter. The other interesting thing is that some sites really strive to keep their data up to date while others are much more, shall we say, relaxed about it.

OMIM also now has a Twitter account.

Python 3.0

Python 3.0 has just been released, and has not been afraid to shed backward compatibility.

I am some way away from using it, since I use Django, which currently runs on 2.3 or better. I use 2.5 and may move to 2.6 if Django supports it, though I have seen nothing to that effect yet, which means I will be using 2.5 for a while.

It is very brave to shed backward compatibility with a language, almost like setting the clock back to zero in terms of adoption.

Ars Technica posted a guide to online resources for learning Python.

Map-Reduce with a Different Flavor

Not sure how I came across Disco, but it somehow landed in my bookmarks of things to check out. Normally I would not post something about Map-Reduce, there is already lots of easy-to-find stuff out there about it, but this one was interesting:

Disco is an open-source implementation of the Map-Reduce framework for distributed computing. As the original framework, Disco supports parallel computations over large data sets on unreliable cluster of computers.

The Disco core is written in Erlang, a functional language that is designed for building robust fault-tolerant distributed applications. Users of Disco typically write jobs in Python, which makes it possible to express even complex algorithms or data processing tasks often only in tens of lines of code. This means that you can quickly write scripts to process massive amounts of data.

Disco was started at Nokia Research Center as a lightweight framework for rapid scripting of distributed data processing tasks. This far Disco has been successfully used, for instance, in parsing and reformatting data, data clustering, probabilistic modelling, data mining, full-text indexing, and log analysis with hundreds of gigabytes of real-world data.

The two things which caught my eye were Erlang (which seems to be getting more and more traction these days, maybe the next language to learn) and the fact that it uses Python as the ‘driving’ language.
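To give a feel for why Python suits this kind of job, here is the classic word count written as a map step and a reduce step. This is plain, single-process Python and deliberately not Disco's API; the function names map_words, reduce_counts and run_job are made up for the illustration, and a real framework would spread the map and reduce work across machines.

```python
from collections import defaultdict


def map_words(line):
    """Map step: emit a (word, 1) pair for every word in a line."""
    for word in line.lower().split():
        yield word, 1


def reduce_counts(pairs):
    """Reduce step: sum the counts emitted for each word."""
    totals = defaultdict(int)
    for word, count in pairs:
        totals[word] += count
    return dict(totals)


def run_job(lines):
    """Run the map step over every line, then reduce all the pairs."""
    pairs = (pair for line in lines for pair in map_words(line))
    return reduce_counts(pairs)


counts = run_job(["disco runs jobs", "jobs written in python", "disco"])
print(counts)
```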