Here is what I am currently working on:

 

New database to speed genetic discoveries »

A new online database combining symptoms, family history and genetic sequencing information is speeding the search for diseases caused by a single rogue gene. As described in an article in the May iss…

Interesting Bug in Django

I am working on a project that involves Django and ran into an interesting issue. Django creates a small database to keep track of various bits of data one of which is user session information in a table called django_session as follows:

CREATE TABLE django_session (
  session_key varchar(40) NOT NULL,
  session_data longtext NOT NULL,
  expire_date datetime NOT NULL,
  PRIMARY KEY (session_key)
) ENGINE=InnoDB DEFAULT CHARSET=utf8;

This is all well and good but there is an issue. InnoDB orders the rows in primary key order (primary keys are SHA1 hex digest). The problem is that these keys are effectively random so a new session row be be inserted anywhere in the table causing data to move around with every insert. While this might work when the table is small, it does not work so well when you have 500,000+ rows in it (which is another issue that I will get to).

A better schema for the table is as follows:

CREATE TABLE django_session (
  id int(11) NOT NULL AUTO_INCREMENT,
  session_key varchar(40) NOT NULL,
  session_data longtext NOT NULL,
  expire_date datetime NOT NULL,
  PRIMARY KEY (id),  
  UNIQUE KEY session_key (session_key)

) ENGINE=InnoDB DEFAULT CHARSET=utf8;

This will ensure that rows are inserted consecutively which will ensure better performance as the table grows.

Two things to note:

I am not sure whether Django specifies the ENGINE to use when creating these tables, but MySQL 5.5 uses InnoDB rather than MyISAM, and I don’t think this will be an issue with the latter.

The other thing is that Django does not seem to clear out sessions past their expiry date, so one needs to do that regularly with the following statement:

DELETE FROM django_session WHERE expire_date <= NOW()

One more thing, I think that is the case too with database backed caches too.

 

BBEdit 10.0

BBEdit 10.0 came out last week, I have been a long time user of BBEdit (pretty much since it came out). This release comes with a lot of changes, a lot, and has required me to change some of my work processes. There are a number of rough edges too so it might be an idea to wait for the .1 release. And you will want to make sure not to upgrade unless you have some downtime to work through all the changes.

 

Django and MySQL

It can be a little challenging to get Django to talk to MySQL especially if you have a non-standard setup (like I seem to have, every time…)

To access MySQL from Django you need to install MySQL-python, and this is usually where the trouble begins. MySQL-python will run ‘mysql_config’ to determine what the default MySQL settings are, the one to pay attention to is the ‘–socket’ one. This is the socket that MySQL-python will use to access MySQL if the DATABASE/HOST setting in the Django settings.py file is left empty or set to ‘localhost’. MySQL-python appears to disregard any config setting in /etc/my.cnf, so if you set the ‘socket’ setting in /etc/my.cnf to something else (such as ‘/var/lib/mysql/mysql.sock’) then Django will NOT be able to access the MySQL server.

There are two solutions to this. Either you set the ‘socket’ setting in the /etc/my.cnf file to match the setting reported by ‘mysql_config’. Or you set HOST in the Django settings.py file to the host name of the machine running MySQL.

One thing that I have noticed is that MySQL installations that are made through apt-get or yum have the socket default set to ‘/tmp/mysql.sock’ whereas MySQL installations that are made from the MySQL download have the socket default set to ‘/var/lib/mysql/mysql.sock’.

Other issues you may run into is installing MySQL in a non-standard directory, for example ‘/usr/local/mysql’, while MySQL-python will probable install correctly, it may not be able to pull in the MySQL libraries when running under the Apache server.

What I generally do is let MySQL-python install however it wants, and set the HOST to the hostname of the machine where MySQL is running.

Online Mendelian Inheritance in Man (OMIM)

Been spending time working on the OMIM website. Basically a tiered system with an API (developed with Java, MySQL, myBatis and Lucene/Solr), and a front end (developed with Django and JQuery).

Lots of moving parts to the site, every night we download data from about 20 sources (about 3GB of data in total), parse it all and assemble the database and all the links to external resources. Basically a big ETL machine.

What is interesting to me is the breath of quality in the data and the lack of standardization. Actually the only standard that exists is the comma delimiter. The other interesting thing is that some sites really strive to keep their data up to date while others are much more, shall we say, relaxed about it.

OMIM also now has a Twitter account.

Phone Counter-Measures

I was very interested to read “The gadgets police use to snarf cell phone data” over on ArsTechnica and wondered how long it would be before there were apps that blocked these gadgets, and/or apps which counter-attacked these gadgets with bad data or viruses.

 

Firewalling Data Wierdness

A project I am currently working on requires the download of about 750MB of compressed data every night from about 10 different sites. This data is used to build links to other resources so it would be a ‘bad thing’ if the data was messed up for some reason. The two patterns I have run into so far are that the data is no longer there (file is missing), or that the data is incorrect for some reason (file is truncated.)

So I put in a couple of checks in the script that handles the download. The first is that the data is downloaded to a temporary area before being moved to its final area. The other is that I check the size of the new file against the size of the current file. If the files differ more than a certain percentage in size, the new file is not used and this is flagged. Obviously the threshold will be domain specific and there may be a direction check as well (i.e. the file should never be smaller.)

This is pretty much all I can do, the files don’t have MD5 signatures, and there are no deltas either.

Follow

Get every new post delivered to your Inbox.