Ad Retrieval

I have been catching up on Daniel Tunkelang’s weblog and came across a write up about the SIGIR 2009 presentation by Vanja Josifovski.

The crux of the presentation is that treating ads as documents in an IR system works well when you evaluate search over the ad corpus.

What is interesting is that I spent a weekend testing just that when I was at Feedster. I pulled the ads from our ad provider, did a little cleanup, and indexed them. I think there were about 80,000 ads, so this step was very quick, on the order of seconds.

I then ran sample searches against the index to see which ads were retrieved, and it seemed to work pretty well.

I also tested a scenario which took the text from posts in a weblog and ran that text as feedback over the ads index. By feedback I mean that there was no search per se, just a ‘bag of terms’ used to rank the ads. Again this worked pretty well. I pulled a popular feed (which shall remain nameless) and got ads that were very relevant to the individual posts in the weblog. Even using a large number of terms for feedback was not a problem because the ads index was so small; searches ran on the order of milliseconds.
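The ‘bag of terms’ approach can be sketched in a few lines of Python. This is a minimal, illustrative reconstruction, not the Feedster code: a toy inverted index over ad copy, with post text fed in as a weighted term bag and ads scored tf-idf style. The ad strings and tokenizer are made up for the example.

```python
import math
from collections import Counter

def tokenize(text):
    # crude tokenizer for the sketch: lowercase, whitespace split, alphanumerics only
    return [t for t in text.lower().split() if t.isalnum()]

class AdIndex:
    """Tiny in-memory inverted index over ad copy (illustrative only)."""

    def __init__(self, ads):
        self.ads = ads
        self.postings = {}  # term -> {ad_id: term frequency}
        for ad_id, text in enumerate(ads):
            for term, tf in Counter(tokenize(text)).items():
                self.postings.setdefault(term, {})[ad_id] = tf

    def rank(self, feedback_terms, k=3):
        # 'bag of terms' feedback: no boolean query, just score every ad
        # that shares a term with the post text, weighted by tf * idf
        n = len(self.ads)
        scores = Counter()
        for term, qtf in Counter(feedback_terms).items():
            postings = self.postings.get(term)
            if not postings:
                continue
            idf = math.log(n / len(postings))
            for ad_id, tf in postings.items():
                scores[ad_id] += qtf * tf * idf
        return [(self.ads[i], s) for i, s in scores.most_common(k)]

# hypothetical ad corpus, standing in for the ~80,000 real ads
ads = [
    "cheap flights to hawaii book now",
    "scuba diving gear sale",
    "learn to scuba dive certification courses",
    "hawaii hotel deals beachfront",
]
index = AdIndex(ads)

post_text = "my latest dive trip scuba diving off the coast of hawaii"
print(index.rank(tokenize(post_text)))
```

With an index this small, the whole scoring loop touches only a handful of postings per term, which is why even long feedback bags stay in millisecond territory.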

Unfortunately we never put this to use, but it did show me that it could be done.


Bing on iPhone

I downloaded a copy of the Bing app for the iPhone this morning and it looks pretty good, better than Google’s app which mostly just redirects to Safari.

Same Cute Turtle

I have posted a picture of this turtle before but decided to post another because he was pretty impressive.

As I mentioned in my original post, I spent about 5 minutes watching it eat on either side of the coral head. What was impressive is that while he (or she, hard to tell) knew that I was there, he just carried on eating, checking me out from time to time.

Turtles are pretty tame if you don’t threaten them, but they will swim away very fast if you annoy them (I know because once a dive master grabbed one to show it off and it swam like hell as soon as he let go, needless to say I did not dive with that dive master again).

A recent article in the NY Times, “Turtles Are Casualties of Warming in Costa Rica”, is well worth reading, the title says it all, and there is a slideshow to go along with it.

I see turtles on most of my dives but I never tire of it, and all the divers I have known never tire of seeing them either; they are truly special.

Why The Name NoSQL Is Meaningless (To Me)

The ‘NoSQL’ movement has gotten quite popular lately, and with good reason: it is breaking new ground in distributed, scalable storage.

But the name ‘NoSQL’ really bugs me, because SQL is just a query language, it is not a storage technology. This is well illustrated in “InnoDB is a NoSQL database”, which I will quote below:

As long as the whole world is chasing this meaningless “NoSQL” buzzword, we should recognize that InnoDB is usable as an embedded database without an SQL interface. Hence, it is as much of a NoSQL database as anything else labeled with that term. And I might add, it is fast, reliable, and extremely well-tested in the real world. How many NoSQL databases have protection against partial page writes, for example?

It so happens that you can slap an SQL front-end on it, if you want: MySQL.

Another thing: it is probably better, and much more constructive, to say what you are for rather than what you are against. Time to get a new name/acronym, I think.

Updated December 18th, 2009 – I am seeing that NoSQL is being renamed to mean Not Only SQL, which I think is much better.

exit() Rather Than free()

I have to admit I had a bit of a reaction to this post; apologies for quoting more than half of it, but here goes:

See, developers are perfectionists, and their perfectionism also includes the crazy idea that all memory has to be deallocated at server shutdown, as otherwise Valgrind and other tools will complain that someone leaked memory. Developers will write expensive code in shutdown routines that will traverse every memory structure and deallocate/free() it.

Now, guess what would happen if they wouldn’t write all this expensive memory deallocation code.

Still guessing?

OS would do it for them, much much much faster, without blocking the shutdown for minutes or using excessive amounts of CPU. \o/

I am really uncomfortable with the approach of relying on exit() rather than free() for memory cleanup, for the obvious reason that it is usually much, much cheaper to keep a process running than to shut it down and restart it on a regular basis, so the cost of a careful shutdown is rarely paid. The other reason is that relying on exit() for memory cleanup is just poor hygiene.

Reminds me of the days of SunOS where common wisdom said that restarting a server once a week was a good idea to keep the memory leaks in check.

Bandwidth Caps

I have been thinking about the issue of bandwidth caps recently, or rather thinking more about it since my provider Comcast put an unofficial 250GB/month cap in place (other providers have caps as well; in all fairness, Comcast's is one of the less draconian out there).

This thinking was spurred first by Benoit Felten’s post entitled “Is the ‘Bandwidth Hog’ a Myth?” (by way of ArsTechnica) where he challenges the ISPs:

“Here’s a challenge for them: in the next few days, I will specify on this blog a standard dataset that would enable me to do an in-depth data analysis into network usage by individual users. Any telco willing to actually understand what’s happening there and to answer the question on the existence of hogs once and for all can extract that data and send it over to me, I will analyse it for free, on my spare time. All I ask is that they let me publish the results of said research (even though their names need not be mentioned if they don’t wish it to be). Of course, if I find myself to be wrong and if indeed I manage to identify users that systematically degrade the experience for other users, I will say so publicly. If, as I suspect, there are no such users, I will also say so publicly. The data will back either of these assertions.”

And second by a discussion of this topic on TWIT 224 this week.

Ok, so I was wondering how I would go about defining a ‘Bandwidth Hog’. Setting a cap is a little arbitrary. For example, I get about 3MB/s download speed here. So at one end of the spectrum downloading 250GB at full speed would take me just shy of 24 hours during which the ‘other users’ would probably be seriously impacted. On the other end of the spectrum, I could throttle the download to about 95KB/s which would take me about 30 days during which I doubt very much any of the ‘other users’ would be impacted.
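The two endpoints above can be checked with quick arithmetic (decimal units assumed, i.e. 1 MB = 10^6 bytes):

```python
# Back-of-the-envelope numbers for a 250 GB monthly cap at a 3 MB/s line speed
CAP_BYTES = 250 * 10**9
FULL_SPEED = 3 * 10**6        # bytes per second

# one end of the spectrum: the whole cap at full speed
hours_at_full_speed = CAP_BYTES / FULL_SPEED / 3600   # ~23 hours

# the other end: the same 250 GB trickled out over 30 days
MONTH_SECONDS = 30 * 24 * 3600
trickle_rate_kb_s = CAP_BYTES / MONTH_SECONDS / 1000  # ~96 KB/s

print(f"full speed: {hours_at_full_speed:.1f} hours")
print(f"trickle:    {trickle_rate_kb_s:.0f} KB/s")
```

Same 250 GB either way; only the instantaneous load on the shared link differs, which is exactly why a monthly byte count is a blunt instrument.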

I admit that both ends of the spectrum are contrived but they illustrate that caps are complicated to implement.

Personally I have no problem with caps if (a) the service is tiered and (b) usage stats are easily accessible.

Traffic This

Very interesting piece on the current mudslinging between the old guard and the new whippersnappers over on Ars (“Tech tapeworms”: Bloggers denounce “parasite” label at FTC). Worth the read, I think.

One comment in particular caught my eye:

Much of the bellyaching by content producers is directed at Google, most notably at Google News. Senior Business Product Manager for News, Josh Cohen, noted that News now comes in 30 different language editions and sends “a billion clicks a month to publishers worldwide.”

The Wall Street Journal’s managing editor, Robert Thomson, sees this traffic a different way: as something that mainly benefits Google. Sure, Google may send search traffic to his sites, but the company gets far more value from the practice than the WSJ does; many readers don’t need to click through to find out more about articles, but are instead satisfied with headlines, a brief excerpt, and perhaps a picture.

The question that content producers should be asking (which is not asked in the piece) is what their traffic would look like if Google were not sending them traffic. I would bet that they could not make up the “billion clicks a month” if Google were to drop them. It would be an interesting experiment for them to conduct, though.

Updated later that same day – turns out that according to Hitwise the Wall Street Journal gets 25% of its traffic from Google and Google News (obtained via