Nati Shalom has a very interesting take on latency.
One section in particular struck me:
When discussing latency most people fall into one of two main camps: the “networking” camp and the “software architecture” camp. The former tends to think that the impact of software on latency is negligible, especially when it comes to Web applications.
Marc Abrams says “The bulk of this time is the round trip delay, and only a tiny portion is delay at the server. This implies that the bottleneck in accessing pages over the Internet is due to the Internet itself, and not the server speed.”
The “software architecture” camp tends to believe that network latency is a given and there is little we can do about it. The bulk of latency that we can control lies within the software/application architecture. Dan Pritchett’s Lessons for Managing Latency (located here) provides guidelines for an application architecture that addresses latency requirements using loosely-coupled components, asynchronous interfaces, horizontal scale from the start, active/active architecture and by avoiding ACID and pessimistic transactions.
I think that splitting into two camps is a little silly and misses the bigger picture. I view hardware and software as parts of a continuum, and you need to look at your system as a whole when tracking down latency issues.
I have an interesting anecdote here. About 13-14 years ago I was talking to a computer scientist at a large photocopier company who told me that they had created their own version of 100BaseT (this was before 100BaseT existed) to speed up communications between an image rasterizer and a printer, because they had identified that link as the bottleneck. Unfortunately they found that the communication speed had only marginally improved, because the bottleneck had moved from the hardware to the software that ran the communication.
Focus on application architecture and leave hardware and OS optimizations as a last resort. The performance provided by commodity hardware should be good enough for 80% of cases. In addition, the effort of optimizing hardware and Internet routers might involve a huge investment, and therefore, should be used sparingly.
I completely agree with this. Running on commodity hardware and generic versions of an operating system makes perfect sense because you know ahead of time the base you will be deploying to, and it makes administration easier. No sys-admin is going to want to customize machines when running a site with hundreds of machines. On the other hand, if you are running sites with thousands of machines (or more), then it would probably make sense to create your own OS distribution with whatever tools you want installed on each and every machine.
One thing I would add to the article:
Measure everything – it is really important to get as many measurements as possible of the time that operations take at every level of the stack. I have also found it useful to measure the time spent on external operations, for example how long a piece of middleware had to wait for an external operation (such as a REST call) to complete. Additionally, gathering these metrics will allow you to compare current timings against historical data, so you will be able to tell very quickly when things are not operating normally.
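As a minimal sketch of this kind of instrumentation (the label name and the `time.sleep` stand-in for a real REST call are my own illustrations, not anything from the original article), a small context manager can record how long each external operation takes, accumulating timings you can later compare against historical data:

```python
import time
from contextlib import contextmanager

@contextmanager
def timed(label, timings):
    """Record the wall-clock duration of the enclosed block under `label`."""
    start = time.perf_counter()
    try:
        yield
    finally:
        # Append the elapsed time so repeated calls build up a history.
        timings.setdefault(label, []).append(time.perf_counter() - start)

timings = {}

# Stand-in for waiting on an external operation such as a REST call.
with timed("external_rest_call", timings):
    time.sleep(0.05)

duration = timings["external_rest_call"][0]
print(f"external_rest_call took {duration:.3f}s")
```

Hanging the timings off a shared dictionary keyed by operation name makes it trivial to roll them up into the kind of daily report described below.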
Early on in Feedster’s life my co-founder and I had a major disagreement about why searches were slow (at the user’s end of things). The search engine was blamed, so I added logging of how long various operations took, generated daily reports on the search times, and was able to show that the slowness was not in the search engine.