AirPort Extreme Base Station

Even though I connect to the internet via an AirPort Extreme Base Station, I still have the firewall enabled on all my machines. On my Macs, I also block UDP traffic and enable stealth mode.

So I was interested in this Ars Technica review of Norton Confidential for Mac OS X. The product looks interesting, but what caught my attention was this:

Eventually I inquired with my apartment-mate about whether something in our apartment was portscanning periodically, and he said that he was unaware of anything that might be doing that. But as it turned out, the IP of the blocked portscan, according to NCO, matched up with the internal IP of our brand new Airport Extreme base station. Why does the AEBS need to be portscanning every so often?

It seems that the Base Station is scanning ports on the internal network on a regular basis. I checked the firewall logs on my Mac and came across the following entries:

Apr 30 07:19:13 Francois-MacPro ipfw: 35000 Deny UDP in via en0
Apr 30 07:31:13 Francois-MacPro ipfw: 35000 Deny UDP in via en0
Apr 30 07:43:13 Francois-MacPro ipfw: 35000 Deny UDP in via en0

These scans happen every day with no real pattern; sometimes it is just a single packet, and sometimes more than one in quick succession, as above.
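As a side note, the spacing between entries is easy to compute from the log timestamps. A small sketch, using the three lines quoted above (the year is an assumption, since ipfw logs omit it):

```python
import re
from datetime import datetime, timedelta

# The ipfw log lines quoted above.
LOG_LINES = [
    "Apr 30 07:19:13 Francois-MacPro ipfw: 35000 Deny UDP in via en0",
    "Apr 30 07:31:13 Francois-MacPro ipfw: 35000 Deny UDP in via en0",
    "Apr 30 07:43:13 Francois-MacPro ipfw: 35000 Deny UDP in via en0",
]

TIMESTAMP_RE = re.compile(r"^(\w{3}) (\d{1,2}) (\d{2}:\d{2}:\d{2})")

def intervals(lines, year=2007):
    """Return the gaps (as timedeltas) between consecutive log entries."""
    stamps = []
    for line in lines:
        m = TIMESTAMP_RE.match(line)
        if m:
            stamps.append(datetime.strptime(
                f"{year} {m.group(1)} {m.group(2)} {m.group(3)}",
                "%Y %b %d %H:%M:%S"))
    return [b - a for a, b in zip(stamps, stamps[1:])]

print(intervals(LOG_LINES))  # two 12-minute gaps
```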

Port 138 is reserved for the NetBIOS datagram service, as we can see from the /etc/services file on my Linux machine:

netbios-dgm 138/tcp # NETBIOS Datagram Service
netbios-dgm 138/udp
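The mapping can also be checked programmatically. A minimal parser over lines in that format, fed with the excerpt above:

```python
# Parse /etc/services-style lines to map (port, protocol) -> service name.
# The sample text is the excerpt quoted above.
SERVICES_EXCERPT = """\
netbios-dgm 138/tcp # NETBIOS Datagram Service
netbios-dgm 138/udp
"""

def parse_services(text):
    table = {}
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop trailing comments
        if not line:
            continue
        name, portproto = line.split()[:2]
        port, proto = portproto.split("/")
        table[(int(port), proto)] = name
    return table

print(parse_services(SERVICES_EXCERPT)[(138, "udp")])  # netbios-dgm
```

On a live system, Python's standard `socket.getservbyport(138, 'udp')` consults the same table.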

I suspect that this is related to the fact that the new Base Station can share disks attached to it with Windows machines, but I am still curious as to why these scans are happening when I have no disks attached to the Base Station and Windows File Sharing is not enabled.

Ubuntu On Parallels

I have been reading about people having issues running the latest version of Ubuntu under Parallels.

Indeed there is an issue. If you select any Linux variant as the OS type, Ubuntu will not boot.

What works for me is to select Solaris as the OS type and Other as the OS version, then switch the OS type to Linux once Ubuntu is installed, though that last step is not compulsory.

Buy Vs. Subscribe

Steve Jobs (Hi Steve!) has come out against a subscription model for music because “People want to own their music.”

I would agree with him: I prefer to buy and own my music, and to be able to play it on whatever device I want. This fits with how people have traditionally consumed music, i.e. going to the store and buying LPs or CDs.

But I am not sure that this model can be applied to video, whether we are talking about TV shows or movies. Most people watch TV shows and movies only once, so video lends itself much more to a rental model (like Blockbuster or Netflix, for example) than to an ownership model; ownership works better for music because people will listen to music over and over again.

I have subscribed to a season of a show on iTunes, as well as bought some good episodes from past seasons, but I am not inclined to buy much more than that. Nor am I inclined to buy movies, though I would be more than happy to rent them. I recently signed up for Netflix, and it serves my needs very well.

Grillin’ Your iPod

I am not sure how I feel about the iGrill. I would have thought that heat and food don’t mix well with iPods.

Scaling Without A Database

Before I could write up an article entitled “Scaling and Uptime”, I came across another article by Frank Sommers entitled “Scaling Without A Database”. It is a take on another article by Robert McIntosh entitled “Building a high volume app without a RDBMS or Domain objects.”

Both articles are well worth reading and address the interesting issue of scaling without a database. I quote:

McIntosh’s basic thesis is centered around three observations. The first one is that true scalability can best be achieved in a shared-nothing architecture. Not all applications can be designed in a completely shared-nothing fashion—for instance, most consumer-facing Web sites that need the sort of scaling McIntosh envisions require access to a centralized user database (unless a single sign-on solution is used). But a surprising number of sites could be partitioned into sections with little shared data between the various site areas.

This just comes back to what I was saying earlier about partitioning data to avoid bottlenecks. I think it would be close to impossible to build a site that did anything worthwhile without some sort of database system behind it, but that does not mean that scaling is held hostage by that database. Partitioning is the key, both vertical and horizontal.
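A minimal sketch of horizontal partitioning (all names hypothetical): route each key to a shard by hashing, so no single database sees all the traffic.

```python
import hashlib

SHARDS = ["db0", "db1", "db2", "db3"]  # hypothetical shard names

def shard_for(key: str) -> str:
    """Route a key (e.g. a user id) to a shard deterministically.

    md5 is used rather than Python's built-in hash() so the mapping
    is stable across processes and machines.
    """
    digest = hashlib.md5(key.encode("utf-8")).hexdigest()
    return SHARDS[int(digest, 16) % len(SHARDS)]

print(shard_for("user:42"))  # always lands on the same shard
```

One caveat with plain mod-N routing: changing the number of shards remaps nearly every key, which is why schemes such as consistent hashing exist.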

McIntosh’s final observation is that although modern Web frameworks speed up development already, a new level of rapid development can possibly be reached by managing data in plain files, such as XML:

Well, this seems obvious at first glance. If you just use text files to store data, accessing that data is going to be very fast, since you avoid the overhead an RDBMS would incur. The flip side is that your application suddenly has to deal with parsing those text files, organizing them, and finding them. All you have really done is shift where the work gets done; you have not escaped the fact that it needs to get done.
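To make the shifted work concrete, here is a toy flat-file store (JSON rather than XML, for brevity; all names hypothetical). Reads and writes are trivial, but notice that a "query" is now a full directory scan written by hand:

```python
import json
import os
import tempfile

class FileStore:
    """Toy flat-file store: one JSON file per record, keyed by id.

    The application now owns the work an RDBMS would have done:
    naming, locating, and scanning the files.
    """
    def __init__(self, root):
        self.root = root

    def path(self, key):
        return os.path.join(self.root, f"{key}.json")

    def put(self, key, record):
        with open(self.path(key), "w") as f:
            json.dump(record, f)

    def get(self, key):
        with open(self.path(key)) as f:
            return json.load(f)

    def find(self, predicate):
        # No query planner: every lookup by value scans everything.
        for name in os.listdir(self.root):
            with open(os.path.join(self.root, name)) as f:
                rec = json.load(f)
            if predicate(rec):
                yield rec

store = FileStore(tempfile.mkdtemp())
store.put("u1", {"id": "u1", "city": "Boston"})
store.put("u2", {"id": "u2", "city": "Paris"})
print([r["id"] for r in store.find(lambda r: r["city"] == "Paris")])  # ['u2']
```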

Another reason to entertain some of McIntosh’s notions is that quick access to large amounts of data occurs through indexes—be those indexes managed by a relational database or indexes created ex-database, such as with Lucene. An application relying on, say, XML-based files for data storage could generate the exact indexes it needs in order to provide query execution capabilities over the files. And, in general, ex-database indexes have proven more scalable than database-specific indexes: Not only can such indexes be maintained in a distributed fashion, they can also be custom-tailored to the exact retrieval patterns of an application.

I would concur with this and point out that it is not really that new. I worked on a system in 1993 which used Sybase to store data and extracted that data to be indexed by a full text search engine. When Feedster was first started (in 2003), the search engine used was the full text search engine built into the MySQL server. Once we reached 1 million posts, the whole RDBMS became really slow, at which point we split the full text searching out from the RDBMS, and the system performed well again. The main message here is that you have to build systems leveraging the strengths of each component and not be afraid to bring in more specialised components if they get the job done better.
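The ex-database index idea reduces to something simple: an inverted index mapping terms to documents, maintained outside the RDBMS. A toy version of what an engine like Lucene maintains:

```python
from collections import defaultdict

def build_index(docs):
    """Map each term to the set of document ids containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return index

def search(index, query):
    """AND-query: ids of documents containing every query term."""
    terms = query.lower().split()
    if not terms:
        return set()
    result = set(index.get(terms[0], set()))
    for term in terms[1:]:
        result &= index.get(term, set())
    return result

docs = {1: "scaling without a database", 2: "scaling the database tier"}
index = build_index(docs)
print(sorted(search(index, "scaling database")))  # [1, 2]
```

A real engine adds tokenization, ranking, and on-disk segment files, but the data structure itself is this simple, which is why it can be maintained and distributed independently of the database.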


Huh!!

I turned on Google’s search history when they introduced personalized searching, mostly to see how well it worked.

There are a few side benefits. You get to see your search trends: how many searches you run per month, per day (Thursday, Saturday, and Sunday are low days) and per hour (none between 1am and 4am), what your top queries were (‘test’?? where did that come from?), and your top sites and top clicks.

And you also get what they call “Interesting Items”, and this is where the “Huh!!” comes in. In the “Recent top queries related to your searches”, numbers 2 and 6 are “red sox” and “redsox” respectively.

I am at a loss to understand that. I have no interest in sports; none, zero, nada, zilch. I don’t even know the rules of baseball.

Scaling and Uptime

I just came across this updated post on scaling by Greg Linden.

First, he references an old article, “Don’t Scale”. That article makes the (flawed) argument that scaling does not matter. It also confuses uptime and scaling; the two are very different.

There is a hardware aspect and a software aspect to uptime. The hardware aspect relates to the redundancy in your hardware architecture, and the software aspect relates to the redundancy in your software architecture. Getting perfect uptime is really hard, especially when so many things out of your control can go wrong. Getting all those nines after the decimal point eventually means that you can only be down for minutes per year, which is clearly difficult to achieve.
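To put numbers on "all those nines", a quick back-of-envelope calculation of allowed downtime per year at various availability targets:

```python
def max_downtime_minutes_per_year(availability_percent):
    """Allowed downtime per year for a given availability target."""
    minutes_per_year = 365 * 24 * 60  # 525,600 minutes
    return minutes_per_year * (1 - availability_percent / 100)

# 99% allows ~3.6 days/year; five nines allows only ~5.3 minutes/year.
for nines in (99.0, 99.9, 99.99, 99.999):
    print(f"{nines}% -> {max_downtime_minutes_per_year(nines):.1f} min/year")
```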

Scaling has to do with how your system responds to increased usage; again, there is a hardware aspect and a software aspect to this. A system scales well from a hardware aspect if all you need to do is bring more boxes online when usage increases. Similarly, a system scales well from a software aspect if you don’t need to go through wrenching software changes as usage increases. Scaling hardware is relatively easy to achieve because the bandwidth and latency of hardware are well understood and easy to measure. Scaling software is more difficult, since it is hard to figure out ahead of time where bottlenecks will occur, and software behavior is harder to predict than hardware behavior. This article by Dan Pritchett is a good primer on scaling software.

Second, he makes two comments on the article:

Stepping back for a second, a toned down version of David’s argument is clearly correct. A company should focus on users first and infrastructure second. The architecture, the software, the hardware cluster, these are just tools. They serve a purpose, to help users, and have little value on their own.

But this extreme argument that scaling and performance don’t matter is clearly wrong. People don’t like to wait and they don’t like outages. Getting what people need quickly and reliably is an important part of the user experience. Scaling does matter.

I agree that you initially need to focus on getting the application working and building up your user base, but scaling and performance really do matter, and you need to think hard about those issues from the start. In my experience it is usually much easier to scale a system that was conceived with scaling in mind than one that was just thrown together without real thought. I’ve been there, trust me: nothing makes your heart sink faster than code that has not been properly tiered, a crappy data model, a schema that was just thrown together, or IP addresses hard-coded in the source.

If you don’t build with scaling in mind from the start, you will find that you will have to throw everything away much sooner than you thought.

Greg then follows up with an excerpt from an interview with Twitter developer Alex Payne, who says:

Twitter is the biggest Rails site on the net right now. Running on Rails has forced us to deal with scaling issues – issues that any growing site eventually contends with – far sooner than I think we would on another framework.

The common wisdom in the Rails community at this time is that scaling Rails is a matter of cost: just throw more CPUs at it. The problem is that more instances of Rails (running as part of a Mongrel cluster, in our case) means more requests to your database. At this point in time there’s no facility in Rails to talk to more than one database at a time.

The solutions to this are caching the hell out of everything and setting up multiple read-only slave databases, neither of which are quick fixes to implement. So it’s not just cost, it’s time, and time is that much more precious when people can[‘t] reach your site.

None of these scaling approaches are as fun and easy as developing for Rails. All the convenience methods and syntactical sugar that makes Rails such a pleasure for coders ends up being absolutely punishing, performance-wise. Once you hit a certain threshold of traffic, either you need to strip out all the costly neat stuff that Rails does for you (RJS, ActiveRecord, ActiveSupport, etc.) or move the slow parts of your application out of Rails, or both.

I am no expert in Ruby or Ruby on Rails, but a few things jumped out at me in this section, which moves away from the higher-level concepts and into implementation issues.

The first was the comment about caching. Cache is king, so the (corrupted) saying goes, and the more data you can cache, the less frequently you have to go back to the source for it. Data can be cached on the local file system, either in flat files or in a small database (Berkeley DB or SQLite being good options). A good location on Linux is the shared memory partition (/dev/shm), which functions much like a RAM disk. Data there is usually held in RAM but can be swapped out under memory pressure, so it is only good for small amounts of data, and it is wiped if the machine reboots.
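A sketch of that kind of local cache: a minimal key-value store on SQLite. The /dev/shm path is a Linux-specific assumption; the demo below uses SQLite's in-memory database instead.

```python
import sqlite3

class SqliteCache:
    """Minimal key-value cache backed by SQLite.

    Point `path` at a file under /dev/shm on Linux to keep the cache
    in RAM (accepting that it vanishes on reboot), or use ":memory:".
    """
    def __init__(self, path=":memory:"):
        self.db = sqlite3.connect(path)
        self.db.execute(
            "CREATE TABLE IF NOT EXISTS cache (k TEXT PRIMARY KEY, v TEXT)")

    def set(self, key, value):
        self.db.execute(
            "INSERT OR REPLACE INTO cache VALUES (?, ?)", (key, value))
        self.db.commit()

    def get(self, key):
        row = self.db.execute(
            "SELECT v FROM cache WHERE k = ?", (key,)).fetchone()
        return row[0] if row else None

cache = SqliteCache()          # ":memory:" for the demo
cache.set("user:1", "Francois")
print(cache.get("user:1"))     # Francois
```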

There are also memory-based caching systems, such as memcached, which allow caching either on the local machine or on a dedicated machine (or a batch of dedicated machines). I have used memcached before, and the best configuration is a batch of machines dedicated to caching, preferably with lots of RAM in each. The two caveats are that you must make sure the memcached process never, ever gets swapped out, otherwise your cache performance will crash, and that you must never rely on memcached to keep an object in cache indefinitely: objects get aged out of the cache when space is needed for younger objects.
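That second caveat is why memcached is normally used in a cache-aside pattern: every read must be prepared to fall back to the authoritative source. A sketch, using a stand-in client so the example is self-contained (real code would use a library such as pymemcache against the dedicated cache boxes):

```python
class FakeMemcache:
    """Stand-in for a memcached client with the same get/set shape."""
    def __init__(self):
        self._data = {}

    def get(self, key):
        return self._data.get(key)  # may be None; entries can be evicted

    def set(self, key, value):
        self._data[key] = value

def load_user(user_id, cache, db):
    """Cache-aside read: try the cache, fall back to the database.

    Never assume the object is still cached; memcached evicts old
    entries when it needs room for newer ones.
    """
    key = f"user:{user_id}"
    user = cache.get(key)
    if user is None:
        user = db[user_id]      # authoritative source
        cache.set(key, user)    # populate for the next reader
    return user

db = {1: "Francois"}
cache = FakeMemcache()
print(load_user(1, cache, db))  # miss: reads the db, fills the cache
print(load_user(1, cache, db))  # hit: served from the cache
```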

The second was about using multiple read-only slave databases. I used to be a fan of that until Peter Zaitsev pointed out to me that it does not work well in an environment with a lot of write activity, because the slaves have to support both the write load and the read load, while the master only has to support the write load (assuming you don’t want anyone reading from the master unless they have to). This gets nastier because the temptation is to make the “baddest” machine the master and to use less powerful machines as slaves. It gets even nastier if you use MySQL (very likely), because only a single replication thread runs on each slave, which limits its write throughput. Once your slaves start to fall behind, they will never catch up.
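The asymmetry is easy to see with back-of-envelope arithmetic (all numbers hypothetical, in queries per second of roughly equal cost):

```python
# Per-box load in a master/slave setup.
writes = 400          # every write replays on every slave
reads = 3000          # reads are spread across the slaves
slaves = 3

master_load = writes                      # master: writes only
slave_load = writes + reads / slaves      # each slave: full write stream + its share of reads

print(master_load, slave_load)  # 400 vs 1400.0

# It is worse on MySQL: the slave applies the write stream on a single
# replication thread, so its effective write capacity is lower than the
# master's (which applies writes concurrently). If that one thread
# cannot sustain `writes`, replication lag grows without bound.
```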

What I am a fan of now is dicing data so that it can be spread across multiple machines. You can also slice your data into partitions to make your individual databases and indices smaller, which helps performance. The consequence is that you will need to move some amount of integrity checking (such as triggers and constraints) out of your database into a higher layer. This is not all bad, as it means your database is doing less work. How you slice and dice depends, of course, on your application. This presentation on how eBay scales provides a very good illustration of how to do this.
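A minimal sketch of the slicing and dicing combined: a vertical split (one cluster per feature) plus a horizontal split (users spread over shards by id range). All names and ranges here are hypothetical.

```python
# Vertical split: each feature gets its own database cluster.
VERTICAL = {"profiles": "profile-cluster", "messages": "message-cluster"}

# Horizontal split: (low_id, high_id, shard_number) ranges.
RANGES = [(0, 999_999, 0), (1_000_000, 1_999_999, 1), (2_000_000, 9_999_999, 2)]

def route(feature, user_id):
    """Return (cluster, shard_number) for a query.

    Integrity checks that used to span partitions inside the database
    (cross-shard foreign keys, triggers) now have to live in this
    layer or above it.
    """
    cluster = VERTICAL[feature]
    for low, high, shard in RANGES:
        if low <= user_id <= high:
            return cluster, shard
    raise ValueError(f"no shard for user {user_id}")

print(route("messages", 1_500_000))  # ('message-cluster', 1)
```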