By way of Greg Linden, I just finished watching a tech talk given by Professor Gene Cooperman on “Disk-Based Parallel Computation, Rubik’s Cube, and Checkpointing”.
The part that interested me was Cooperman’s assertion that “disk is the new RAM”: if you have enough machines in your cluster, you reach a point where the aggregate bandwidth of the disks matches that of RAM.
The first obvious objection that jumps out is that this does not make any sense: while the bandwidth may be the same, the latency is quite different and will be a performance killer.
Cooperman recognizes that and makes the point that you need to organize your disk accesses to minimize latency, basically avoiding piecemeal reading and going for batch reading. Effectively, what you are doing is shuttling data to and from memory in very large batches. (As an aside, this is nothing new: the Connection Machine 5 had a similar disk system, and I am sure there are other such systems out there.)
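To make the batch-reading idea concrete, here is a minimal Python sketch of streaming a file through memory in large sequential chunks instead of many small reads. This is only an illustration, not code from Cooperman's talk: the function names `read_in_batches` and `count_newlines` and the 64 MB default batch size are my own assumptions.

```python
def read_in_batches(path, batch_bytes=64 * 1024 * 1024):
    """Yield the contents of a file as large sequential chunks.

    batch_bytes is a hypothetical tuning knob: the larger the chunk,
    the fewer separate read requests are issued for the same data.
    """
    with open(path, "rb") as f:
        while True:
            chunk = f.read(batch_bytes)
            if not chunk:
                # An empty result means end of file.
                break
            yield chunk


def count_newlines(path, batch_bytes=64 * 1024 * 1024):
    """Example consumer: count lines by scanning one batch at a time."""
    return sum(chunk.count(b"\n")
               for chunk in read_in_batches(path, batch_bytes))
```

Something like `count_newlines("access.log")` would then scan a log file front to back without ever seeking, which is the access pattern that lets a disk run close to its sequential bandwidth.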






It doesn’t seem practical or even possible to always organize your reads into non-random accesses, at least not in an RDBMS context.
Comment by noel — April 22, 2008 @ 1:07 am
I agree with you that this is not possible in an RDBMS context. This is really aimed at tackling problems where you need to process data in large chunks, like log files or text, for example.
Comment by François Schiettecatte — April 22, 2008 @ 7:05 am