Here is what I am currently working on:


New database to speed genetic discoveries »

A new online database combining symptoms, family history and genetic sequencing information is speeding the search for diseases caused by a single rogue gene. As described in an article in the May iss…

Baylor-Hopkins Center for Mendelian Genomics

The other project I have been working on is a website for the Baylor-Hopkins Center for Mendelian Genomics. This website is designed to capture patient feature and DNA sample information for sequencing single-gene Mendelian phenotypes. The site is not really intended for public consumption though.

OMIM – Online Mendelian Inheritance in Man

For the curious, for the past 18 months I have been working on the OMIM website on-and-off. From the website:

OMIM is a comprehensive, authoritative compendium of human genes and genetic phenotypes that is freely available and updated daily. The full-text, referenced overviews in OMIM contain information on all known mendelian disorders and over 12,000 genes. OMIM focuses on the relationship between phenotype and genotype. It is updated daily, and the entries contain copious links to other genetics resources.

Worth a look if you are interested in genetics.


Online Mendelian Inheritance in Man (OMIM)

Been spending time working on the OMIM website. It is basically a tiered system with an API (developed with Java, MySQL, MyBatis and Lucene/Solr) and a front end (developed with Django and jQuery).

Lots of moving parts to the site: every night we download data from about 20 sources (about 3GB of data in total), parse it all, and assemble the database and all the links to external resources. Basically a big ETL machine.
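To give a flavor of the nightly build, here is a minimal sketch of the download-and-parse step in Python. The source names and URLs are made up for illustration (the real pipeline pulls from about 20 sites, each with its own quirks), but the shape is the same: fetch each dump into a staging directory, then stream the comma-delimited rows out for loading.

```python
import gzip
import os
import urllib.request

# Hypothetical source list -- stand-ins for the ~20 real download sites.
SOURCES = {
    "genes": "https://example.org/dumps/genes.txt.gz",
    "phenotypes": "https://example.org/dumps/phenotypes.txt.gz",
}

STAGING_DIR = "staging"

def fetch_all():
    """Download each nightly dump into the staging directory."""
    os.makedirs(STAGING_DIR, exist_ok=True)
    for name, url in SOURCES.items():
        dest = os.path.join(STAGING_DIR, name + ".txt.gz")
        urllib.request.urlretrieve(url, dest)

def parse(path):
    """Yield one comma-delimited dump as lists of fields.

    Comment lines starting with '#' are skipped; the number and
    meaning of the fields varies from source to source.
    """
    with gzip.open(path, "rt", encoding="utf-8") as fh:
        for line in fh:
            if line.startswith("#"):
                continue
            yield line.rstrip("\n").split(",")
```

In practice each source needs its own parser on top of this, since the only thing they share is the comma delimiter.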

What is interesting to me is the breadth of quality in the data and the lack of standardization. Actually, the only standard that exists is the comma delimiter. The other interesting thing is that some sites really strive to keep their data up to date while others are much more, shall we say, relaxed about it.

OMIM also now has a Twitter account.

Firewalling Data Weirdness

A project I am currently working on requires the download of about 750MB of compressed data every night from about 10 different sites. This data is used to build links to other resources, so it would be a ‘bad thing’ if the data were messed up for some reason. The two failure patterns I have run into so far are that the data is no longer there (the file is missing), or that the data is incorrect for some reason (the file is truncated).

So I put a couple of checks into the script that handles the download. The first is that the data is downloaded to a temporary area before being moved to its final location. The second is that I check the size of the new file against the size of the current file: if the two differ by more than a certain percentage, the new file is not used and the failure is flagged. Obviously the threshold will be domain specific, and there may be a direction check as well (i.e. some files should never get smaller).
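A minimal sketch of those checks in Python. The function name, the 20% threshold and the `shrink_ok` flag are illustrative choices, not the actual script, but they show the staging-then-promote pattern and the size comparison described above.

```python
import os
import shutil

def safe_replace(tmp_path, final_path, max_change=0.20, shrink_ok=True):
    """Promote a freshly downloaded file from the staging area to its
    final location, but only if its size looks plausible.

    Returns True if the new file was accepted, False if it was flagged.
    """
    new_size = os.path.getsize(tmp_path)
    if new_size == 0:
        return False  # empty download: reject outright

    if os.path.exists(final_path):
        old_size = os.path.getsize(final_path)
        if not shrink_ok and new_size < old_size:
            return False  # this source should only ever grow
        # Reject if the size changed by more than the threshold.
        if old_size > 0 and abs(new_size - old_size) / old_size > max_change:
            return False

    shutil.move(tmp_path, final_path)
    return True
```

On a rejection the previous night's file stays in place, so the links built from it remain valid until the source recovers.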

This is pretty much all I can do: the files don’t have MD5 checksums, and there are no deltas either.

Wrapping up a Consulting Project

I am just wrapping up a consulting project I have been working on for the past 4 months. I have been putting together the new website for a database called Online Mendelian Inheritance in Man (OMIM for short). This database is a catalog of human genes and genetic phenotypes (how genes manifest themselves in humans).

This has been a complex project. I spent a lot of time dealing with data issues, specifically data cleanliness and data interchange (this community badly needs standards for data interchange and linking).

That being said it was a lot of fun and a great learning experience.