A project I am currently working on requires downloading about 750MB of compressed data every night from about 10 different sites. This data is used to build links to other resources, so it would be a ‘bad thing’ if it were messed up for some reason. The two failure patterns I have run into so far are that the data is no longer there (the file is missing), or that the data is incorrect for some reason (the file is truncated).
So I put a couple of checks in the script that handles the download. The first is that the data is downloaded to a temporary area before being moved to its final location. The other is that I check the size of the new file against the size of the current file. If the files differ in size by more than a certain percentage, the new file is not used and this is flagged. Obviously the threshold will be domain-specific, and there may be a direction check as well (i.e. the file should never get smaller).
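A minimal sketch of that check in Python might look like the following. The paths, the 10% threshold, and the function names are my own illustrative choices, not the actual script:

```python
import os
import shutil

THRESHOLD = 0.10  # hypothetical: flag if sizes differ by more than 10%

def size_check_ok(new_path, current_path, threshold=THRESHOLD):
    """Return True if the freshly downloaded file's size is within
    `threshold` (as a fraction) of the current file's size."""
    if not os.path.exists(current_path):
        return True  # first-ever download, nothing to compare against
    new_size = os.path.getsize(new_path)
    cur_size = os.path.getsize(current_path)
    if cur_size == 0:
        return new_size == 0
    change = (new_size - cur_size) / cur_size
    # A direction check would additionally reject change < 0 here
    # for files that should never get smaller.
    return abs(change) <= threshold

def promote(tmp_path, final_path):
    """Move a file from the temporary download area to its final
    location only if it passes the size check; otherwise leave the
    old file in place so it can be flagged for review."""
    if size_check_ok(tmp_path, final_path):
        shutil.move(tmp_path, final_path)
        return True
    return False
```

Keeping the download in a temporary area until the check passes means a bad download never clobbers the last known-good file.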
This is pretty much all I can do; the files don’t have MD5 signatures, and there are no deltas either.