Building Scalable Web Sites

I am wrapping up reading “Building Scalable Web Sites: Building, Scaling, and Optimizing the Next Generation of Web Applications” by Cal Henderson, and I strongly recommend it for anyone who wants to build a scalable web site. The book goes over a lot of the gotchas that typically bite people when their web site is discovered and there is suddenly a lot of traffic.

Coverage of the data side of things is a little sparse; if you want to find out how to really scale MySQL, you should check out these videos:

Why iPlayer is flawed

I have been reading about iPlayer, mostly criticism. It seems to be a:

Windows XP-only, Windows Media Player-only, Internet Explorer-only, DRM-constrained iPlayer application.

according to Mashable.

And this is its greatest flaw. By taking this approach, the Beeb has pretty much guaranteed that it will be accessible to only a small portion of its audience, thereby limiting its success by default.

It is a shame because I really enjoy the BBC content I have managed to get here in the US, but the Beeb seems to be determined not to make its content available to those who want it.

I use a Mac and I don’t think of it as a minor platform, nor do I think of Linux that way. In fact, in a recent episode of dl.tv, Patrick Norton was surprised that 20% of the downloads were done from Macs. I was surprised too; I expected the number to be lower than that.

Another figure that surprised me was Steve Jobs saying that iTunes had been installed on 300 million computers. That is a huge number and if I were going to distribute audio and/or video content, I would certainly look for ways to leverage that reach.

New blog

I have just created a new blog called Boston Startups in which I will write about the Boston Startup scene as I run into it, which seems to be fairly often these days.

OpenSearch

I have written about OpenSearch before, and came across this article on it on xml.com.

The article doesn’t break any new ground; it is just a quick overview of the protocol.

iPhone specific sites

Google has created an iPhone-specific search page, which I came across via TUAW.

I tried the search page and it works pretty well for a demo.

What bothered me was this comment in the TUAW post:

But the problem with this goes right back to what Scott was talking about the other day– we aren’t supposed to be getting half the web on the iPhone, we’re supposed to be getting the real web. In this case, there’s not much to complain about– this really is Google, minus the extra content and the ads. However, the links actually go to regular browser windows (not iPhone formatted sites), and if you hit “More Results” at the bottom of the page, it takes you to a normal, full-screen Google page anyway. So what’s the point? Yes, this is just a demo, but why bother making an iPhone specific page in the first place? iPhone users should be able to browse to the Google homepage like everyone else.

While Apple may have meant Safari on the iPhone to allow users to browse the normal web, the reality is that the normal web is designed for displays a lot larger than the iPhone's, even though its display is exceptional.

Frankly, when I bring up a complex web page, such as the NY Times front page, on my iPhone, it is nice that I can see the whole page but the text looks like fly poop to me (and I have good vision) and I need to zoom to read anything.

I think it makes much more sense to have web pages specifically designed for mobile devices, so they can be optimized for smaller screens and lower bandwidth (EDGE, ahem).

Functionally, I think it is much nicer for the user to get a page they can read right off the bat, without zooming, and just scroll to read more, than to get a page which needs to be zoomed before anything can be read at all.

Web Innovators Group meeting

A quick post to say that registration for the 14th Web Innovators Group meeting is open. The meeting will be held on September 10th at the Royal Sonesta in Cambridge.

Privacy, are we having the right debate?

It seems like all the major search engines are falling over themselves announcing new privacy initiatives. All this is very laudable, and I think it is important to have clearly defined privacy policies, but I am wondering if we are actually having the right debate.

I think there are four key questions we need to look at:

  • The first is what data is being stored. Currently consumers generate a lot of data as they browse the web: search histories, pages viewed, email, documents, etc. A lot of that data can be aggregated too, providing a wealth of information. I think we understand that a search engine collects that data, but I am most interested in the intersection of Google and DoubleClick data.
  • The second is what that data is being used for. This flows naturally from the first question. Looking at search histories and pages viewed, a search engine will be able to detect trends and recommend pages we might not have otherwise found, eventually personalizing the search results. Better ad targeting is a no-brainer too. I am also very interested to know what cross-purposes the data is being put to, for example my search history being used to provide additional signals for ad targeting when I am reading my email online.
  • The third is what the data retention policy is. This is where all the action seems to be these days: how long the data is stored, how long cookies remain active, and when and how data is anonymised. Shortening cookie expirations is privacy theater, and it has also been shown that anonymised logs are far from anonymous. There may also be legal requirements to store data for certain lengths of time.
  • The fourth is under what circumstances data is disclosed to law enforcement agencies. This does not seem to have been all that well addressed. For example, when the FBI asked the major search engines for data, all but Google rolled over and gave up the data requested. What was interesting about this is that the FBI did not press its case with Google, which suggests it was on shaky legal ground in the first place, yet everyone except Google complied.

I think it is a given that data about our browsing habits will be stored and used. This is the principal manner in which service providers learn about us and gain the means to provide a better browsing experience (personalization is a big factor here).

What is important for us consumers to understand is how this data is used, aggregated, disseminated, retained and purged. At which point it will be easier to determine whether the loss of privacy is worth it.

And so far I have yet to see comprehensive information from any service provider about that.

Search, personalization & privacy

I just finished reading an interesting article about search and personalization written by Gord Hotchkiss.

It got me thinking about search and personalization, and specifically about privacy.

First on search and personalization, I think the article put it very succinctly:

Personalization, in its simplest form, is simply knowing more about you as an individual and using that knowledge to better connect you to content and functionality on the Web.

This tells us that the more you know about a person, the better you can personalize search results to match what they are looking for.

The article goes on to say:

We’re trying to paint personalization into a corner based on Google’s current implementation of it. And that’s absolutely the wrong thing to do. Personalization is not a currently implemented algorithm, or even some future version of the same algorithm. It’s is an area of development that will encompass many new technologies, some of which are under development right now in some corner of Google’s labs.

I think this makes two very important points.

The first is that current personalization implementations are pretty poor; I don’t think many people would disagree that they have been pretty disappointing to date.

The second is that personalization will get better over time, but two things will need to happen, one technological and the other social. On the technology side, new personalization implementations will have to pull in research from other areas; one obvious one is data mining, and there are plenty of others. On the social side, we as users will have to get much more comfortable sharing data about ourselves with whatever personalization tools are created. Currently we share very little data, namely short searches, pages viewed and search history. For any system to be truly personalized, we are going to have to share more data than that, a lot more.

And this is where things will get interesting: there will be the usual outcry about privacy, but consumers have shown themselves again and again to be willing to part with privacy in return for convenience.

So the onus is on these new personalization technologies to really deliver.

Update: of course I should have linked to Sepandar Kamvar, who is the technical lead of personalization at Google.

The Importance of being cached

By way of Greg Linden, I read this very interesting paper from Yahoo Research about caching called “The Impact of Caching on Search Engines”.

I liked the discussion on term versus search caching. My experience is that term caching does not really buy you much if all you are doing is caching a posting list, since that is what is stored in the index. Caching terms would make more sense if there is a field restriction on the term, but most terms don’t have field restrictions. Caching a search makes a lot more sense, and caching portions of searches also makes a lot of sense. In the search engine I developed for Feedster, I implemented both: the searches were cached, and the filters in searches were also cached. By filters I mean that we had a number of searches which were restricted to a reduced set of weblogs, and these restrictions were implemented using a filter expression which was separate from the actual user search. This is pretty standard stuff, and I found that caching the filter results improved performance.

I am not sure where I stand on dynamic versus static caching, though. I am not sure I make much of a distinction: I implemented a dynamic cache, i.e. I would cache the results if they were not already cached, but I did not set a limit on the cache, and I did not ‘warm’ the cache from search logs.

Chad Walters also has some interesting thoughts on this.

Counter-intuitive optimization

Sometimes I run into an optimization problem that is somewhat counter-intuitive. With experience I have gotten reasonably good at knowing what kind of code the gcc optimizer will like and what it won’t, but sometimes what seems intuitive turns out to be counter-intuitive.

The last such example I ran into was accessing bitmaps. For example you would expect the following macro to be pretty fast:

#define UTL_BITMAP_SET_BIT_IN_POINTER2(pucMacroBitmap, uiMacroBit) \
{ \
    ASSERT((pucMacroBitmap) != NULL); \
    unsigned char *pucMacroBytePtr = (pucMacroBitmap) + ((uiMacroBit) / 8); \
    switch ( (uiMacroBit) % 8 ) { \
        case 0: *pucMacroBytePtr |= 0x01; break; \
        case 1: *pucMacroBytePtr |= 0x02; break; \
        case 2: *pucMacroBytePtr |= 0x04; break; \
        case 3: *pucMacroBytePtr |= 0x08; break; \
        case 4: *pucMacroBytePtr |= 0x10; break; \
        case 5: *pucMacroBytePtr |= 0x20; break; \
        case 6: *pucMacroBytePtr |= 0x40; break; \
        case 7: *pucMacroBytePtr |= 0x80; break; \
    } \
}

The code is a little ugly but it should not be too difficult to follow. Basically what happens is that we work out which byte the bit is in and then set the bit using the case statement. Pretty simple, no heavy computes, just a comparison, a set and a break.

Well it turns out that this is all pretty expensive.

This code looks more expensive than the code above because it works out where to set the bit dynamically, but when you profile it, it is cheaper to run:

#define UTL_BITMAP_SET_BIT_IN_POINTER1(pucMacroBitmap, uiMacroBit) \
{ \
    ASSERT((pucMacroBitmap) != NULL); \
    unsigned char *pucMacroBytePtr = (pucMacroBitmap) + ((uiMacroBit) / 8); \
    *pucMacroBytePtr |= (1 << ((uiMacroBit) % 8)); \
}

All this to show that what looks cheaper to run is not necessarily cheaper in practice.
