The installer I am working on is now code-complete. I just need to test the last remaining pieces and run a number of regression tests to make sure it works as it is supposed to. For that I am going to create a virtual machine in VMware Fusion, take a snapshot, let the installer loose on it and see what happens.
Overall I am pretty pleased with how it turned out; the code is more compact than the previous installer I wrote while at Feedster and does a lot more.
Chandler’s birth was particularly lengthy, I am going to take a look and see if the wait was worth it.
The High Scalability blog has a good summary of a presentation given by Farhan Mashraqi of Fotolog.
As I have written before, I am really, really, really ambivalent about using memcached to cache data coming out of MySQL.
Fotolog has 51 instances of memcached on 21 servers with 175G in use and 254G available.
I can’t help but wonder how MySQL would perform if given 21 extra servers with all that memory.
They also mention MySQL’s cache. My advice on that is don’t even bother with it; it is worse than useless.
One place where I was interested in their use of memcached is in caching filesystems accessed via NFS. Again I have to ask whether they are solving the wrong issue. It is a bit like saying “my car is slow therefore I will buy a faster car to tow it with, and my slow car will now go faster.” The real solution is to see why your car is slow in the first place, and then trade it in for a faster one if need be.
I am wrapping up work on the installer I started working on over the weekend.
The goal of the installer is to have a single script I can run on a pristine machine (or image), tell it what I want installed (crawler, api, indexer, etc…) and boom! five minutes later I have a fully installed, fully configured instance. Of course the best thing to do is to create an instance image which I can run in the cloud, but I need the installer to create the image. The exercise of building the installer is a very good way of getting a handle on which files, libraries and what-have-yous go with each sub-system. As a sidebar, we did not have that at Feedster for the longest time, which was a bad mistake.
By the end of the weekend I decided that the best way to approach this is to assume that I am installing on a pristine machine (say a CentOS installation) and assume nothing; well, assume that there is at least a network and a few tools like svn, gcc and make. So everything needs to be checked for and, if needed, checked out and installed, right down to things like java and ant. This caused me to look again at the structure of my svn repository and effectuate a reorganization which helped a lot from both a code organization and an install process point of view.
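The "check for everything" step could look something like the sketch below. This is not the installer itself, just a minimal illustration in Python; the tool list and the `find_missing` helper are assumptions made up for the example.

```python
import shutil

# Hypothetical list of tools the installer would look for before doing
# anything else; the real list depends on the sub-system being installed.
REQUIRED_TOOLS = ["svn", "gcc", "make", "java", "ant"]


def find_missing(tools, which=shutil.which):
    """Return the tools that cannot be found on the PATH, in the order given."""
    return [tool for tool in tools if which(tool) is None]


if __name__ == "__main__":
    missing = find_missing(REQUIRED_TOOLS)
    if missing:
        print("Need to install: " + ", ".join(missing))
    else:
        print("All prerequisites found.")
```

The `which` parameter is only there so the lookup can be swapped out when testing; by default it uses `shutil.which` from the standard library.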
I have been reading more and more about the Drizzle project these past few weeks and it looks like a very interesting project. The project itself started off from MySQL, lots of bits got ripped out and new bits are being added in. You can track progress on Planet MySQL and on the project home page.
Talking about the project home page, here is what it says:
A Lightweight SQL Database for Cloud and Web
The Drizzle project is building a database optimized for Cloud and Net applications. It is being designed for massive concurrency on modern multi-cpu/core architecture. The code is originally derived from MySQL.
While Jay Pipes does not think that this will ever make it out of the lab, I think there is a gap in the market for a lightweight, SQL-based, networked DBMS. At one end of the spectrum you have MySQL, a fairly complete and heavyweight RDBMS; further along you have SQLite and eventually BerkeleyDB (from Oracle, originally Sleepycat).
As an application grows and there is more and more data to manage, a switch has to be made from a monolithic database to a sharded database, which means that a lot of the work that was being done in the monolithic database server (referential integrity, joins, etc…) has to move to a middleware layer (this is documented ad nauseam so I am not going to expand on that.)
So if you are using MySQL in this scenario, you wind up not using 80% of the features that MySQL offers which just makes them overhead. Trouble is that at the other end of the spectrum (touched on above) there isn’t anything which does the 20% you need.
So I think (hope!) that this is where Drizzle is heading, because it really just makes sense.
Updated August 8th, 2008 - Drizzle is the subject of the current FLOSS Weekly podcast over on the TWIT network.
I have been spending the weekend on an installer script for a project I am currently working on. For me this is a very boring task, right up there with writing documentation. The incentive to get it over and done with is that I can get back to more interesting stuff.
The installer script is designed to be a standalone script which can be used to install complete sub-systems on a machine, so it could install or upgrade the crawler on a machine, or a search engine. The machine could be an actual machine, or a virtual machine, or eventually an image which could then be distributed across virtual machines. The devil is in the details, tracking down all the files needed by each sub-system, etc… The reward is a very simple, completely automated installer.
I have been spending so much time on this, my mind is beginning to ‘think’ only in installer terms, so I should “./Installer.pl --make-dinner” now.
At the end of a fairly predictable article where SOAP and REST supporters take cheap shots at each other, Tim Bray being one of them, Bray comes out with some pretty eye-rolling stuff:
During a keynote presentation at OSCON on Friday, Bray will talk about the “language inflection point,” in which various languages such as Perl, Python, and Ruby have been gathering momentum at the expense of the established Java and .Net platforms.
“Up until two years ago, if you were a serious programmer you wrote code in either Java or .Net,” Bray said. “[Now], there are all these options that people are looking at and it’s really an inflection point.”
I fail to see what “serious programmer” and specific languages have to do with each other. I would have thought that a “serious programmer” would pick the language best suited to the task at hand.
The Java platform is accommodating scripting languages such as Ruby and Python on the JVM, Bray noted. Sun has been enabling these to work on the Java Virtual Machine. “The Java language is not what the cool kids are choosing to use these days,” said Bray.
IMHO the “cool kids” who are really smart learn a variety of languages and keep learning new ones. They do this to increase the breadth of their knowledge and toolbox, so they don’t approach every programming problem with the same hammer.
Still, Java will stay around, he said. “The Java language isn’t going away. It’s the world’s most popular programming language,” Bray said.
I have not seen any specific figures as to how popular a specific language is. In fact, how would you measure that? Lines written? Programmers using it? Users using applications written in it?
“I think that like it or not, we’re stuck with a multilanguage future,” he stressed.
What’s not to like about a “multilanguage future”? We have a multilanguage present and we have had a multilanguage past; multilanguage has served us well and will continue to do so. As for being “stuck”, I am glad we were not “stuck” 30 years ago, otherwise we would all be writing stuff in COBOL, or worse, assembler.
Jeff Atwood published a very interesting post “Maybe Normalizing Isn’t Normal” where he delves into whether you should normalize or denormalize. Be sure to check the comments, or this summary on the High Scalability blog if the number of comments (and tone in some cases) gives you a headache.
The post is very interesting but I took issue with this:
Both solutions have their pros and cons. So let me put the question to you: which is better — a normalized database, or a denormalized database?
Trick question! The answer is that it doesn’t matter! Until you have millions and millions of rows of data, that is. Everything is fast for small n. Even a modest PC by today’s standards — let’s say a dual-core box with 4 gigabytes of memory — will give you near-identical performance in either case for anything but the very largest of databases. Assuming your team can write reasonably well-tuned queries, of course.
While it is true that for small data sets there is no difference in performance whether you normalize your schema or not, it will make a huge difference once your data set grows. Adding to the fun is that changing your schema becomes more and more difficult as the data set grows.
Then things settle down:
First, a reality check. It’s partially an act of hubris to imagine your app as the next Flickr, YouTube, or Twitter. As Ted Dziuba so aptly said, scalability is not your problem, getting people to give a shit is. So when it comes to database design, do measure performance, but try to err heavily on the side of sane, simple design. Pick whatever database schema you feel is easiest to understand and work with on a daily basis. It doesn’t have to be all or nothing as I’ve pictured above; you can partially denormalize where it makes sense to do so, and stay fully normalized in other areas where it doesn’t.
A sane, simple design is a “good thing”, but you also need to plan for the future: you want a sane, simple design which can evolve and scale.
Finally sanity is restored:
Pat Helland notes that people normalize because their professors told them to. I’m a bit more pragmatic; I think you should normalize when the data tells you to:
- Normalization makes sense to your team.
- Normalization provides better performance. (You’re automatically measuring all the queries that flow through your software, right?)
- Normalization prevents an onerous amount of duplication or avoids risk of synchronization problems that your problem domain or users are particularly sensitive to.
- Normalization allows you to write simpler queries and code.
In my experience (with Feedster amongst others), a heavily denormalized schema is easy to work with but simply does not scale well.
With my current project I took a different tack:
- Normalize where it makes sense and group logical chunks of data together, even if it means having 1 to 1 relationships. From a performance point of view this means that you get and update the chunks you need rather than accessing tables with 50+ fields where 90% of the fields are null (don’t laugh, I have seen it happen).
- Never ever ever join to get data; better to issue two simple queries rather than one join. With the caveat that this is born of experience with MySQL and large amounts of data (1/2 TB), where even with indices performance can be unpredictable.
- Sharding your data is pretty much the only way to scale, so design that in from the start.
- Build a data access layer which hides the schema from the application.
I am sure there is more, but this is a start.
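To make the points above a bit more concrete, here is a minimal sketch in Python of a data access layer that hides both the schema and the sharding from the application, and fetches data with two simple queries rather than a join. Everything in it is made up for the illustration: the `FeedStore` class, the `feeds` and `feed_stats` tables, and the use of in-memory SQLite databases as stand-ins for shard servers. It is not the schema of my project.

```python
import sqlite3
import zlib


class FeedStore:
    """Hypothetical data access layer: callers never see tables or shards."""

    def __init__(self, shard_count=2):
        # Each in-memory SQLite database stands in for one shard server.
        self.shards = [sqlite3.connect(":memory:") for _ in range(shard_count)]
        for db in self.shards:
            # Two tables in a 1 to 1 relationship, grouped by logical chunk.
            db.execute("CREATE TABLE feeds (id INTEGER PRIMARY KEY, url TEXT)")
            db.execute(
                "CREATE TABLE feed_stats "
                "(feed_id INTEGER PRIMARY KEY, item_count INTEGER)"
            )

    def _shard_for(self, feed_id):
        # Stable hash so a given id always maps to the same shard.
        index = zlib.crc32(str(feed_id).encode()) % len(self.shards)
        return self.shards[index]

    def add_feed(self, feed_id, url, item_count):
        db = self._shard_for(feed_id)
        db.execute("INSERT INTO feeds VALUES (?, ?)", (feed_id, url))
        db.execute("INSERT INTO feed_stats VALUES (?, ?)", (feed_id, item_count))
        db.commit()

    def get_feed(self, feed_id):
        db = self._shard_for(feed_id)
        # Two simple primary-key lookups instead of one join.
        feed = db.execute(
            "SELECT url FROM feeds WHERE id = ?", (feed_id,)
        ).fetchone()
        if feed is None:
            return None
        stats = db.execute(
            "SELECT item_count FROM feed_stats WHERE feed_id = ?", (feed_id,)
        ).fetchone()
        return {
            "id": feed_id,
            "url": feed[0],
            "item_count": stats[0] if stats else None,
        }
```

The application only ever calls `add_feed` and `get_feed`, so the tables can be reorganized or the number of shards changed without touching application code (moving existing rows when the shard count changes is a separate problem which this sketch ignores).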
Pascal was the first ‘real’ computer language I learnt (real in the sense that it was closely linked to computer science). Actually I learnt it using Borland TurboPascal on an 8086 PC with 256K of RAM, and it worked great. Borland had a sample TurboPascal based word processor which I was hacking to support the Cyrillic alphabet in addition to the Latin alphabet it supported. I had built in a key combination to switch from one to the other.
All this came back to me when I heard TurboPascal mentioned on the Security Now podcast, along with the Free Pascal website.
TurboPascal was the brainchild of Anders Hejlsberg who went on to do lots of other interesting work.