Stop words and minimum term length

This post on stop words and minimum term length by Peter Zaitsev reminded me of some search engine do’s and don’ts that I posted back in August last year.

To summarize:

Stop lists are evil; don’t use them. Modern machines have enough capacity to index, store and search over very large quantities of text. Typically I have found that adding a stop word list makes only about a 5% difference in index size.

There should be no minimum term length, you want to be able to search for “Vitamin A”.

Case is important. The approach I take is to index all terms in lowercase, and also index mixed-case terms as they are. Search is always done in the case supplied by the user, so “New York Times” would only find documents which contain the capitalized terms, and “new york times” would find all documents which contain the terms, regardless of case.
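
To make that concrete, here is a minimal sketch of the approach in Python (my own illustration, not code from the original post): every term is indexed in lowercase, capitalized terms are also indexed verbatim, and the query is looked up exactly as the user typed it.

    from collections import defaultdict

    index = defaultdict(set)  # term -> set of document ids

    def index_document(doc_id, text):
        for term in text.split():
            index[term.lower()].add(doc_id)   # always index the lowercase form
            if term != term.lower():
                index[term].add(doc_id)       # also index the mixed-case form as-is

    def search(query):
        # Search in the case supplied by the user: "New York Times" only matches
        # capitalized occurrences, "new york times" matches regardless of case.
        results = None
        for term in query.split():
            docs = index.get(term, set())
            results = docs if results is None else results & docs
        return results or set()

The cost is a somewhat larger dictionary, since capitalized terms are stored twice, but both behaviours stay available at query time.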

Tokenization is important, check my original post on that.

Plural stemming is the way to go; anything more aggressive (like Porter or Lovins) will just increase the ‘noise’ in the search results.
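
To illustrate the difference (again my own rough sketch, not code from the original post): a plural-only stemmer just folds obvious plurals onto their singular form, whereas Porter or Lovins would also conflate terms like “operate”, “operation” and “operational”, and that extra conflation is where the noise in the results comes from.

    def plural_stem(term):
        # Deliberately naive plural-only stemming: just enough to fold simple
        # plurals onto their singular form; everything else is left alone.
        if term.endswith("ies") and len(term) > 4:
            return term[:-3] + "y"            # "berries" -> "berry"
        if term.endswith("es") and len(term) > 3:
            return term[:-2]                  # "boxes" -> "box"
        if term.endswith("s") and not term.endswith("ss"):
            return term[:-1]                  # "cats" -> "cat"
        return term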

There is more in the post and I should revise it sometime, maybe this weekend.

What does “utter crap” mean?

So I have probably blown my “General Audience” rating here, but this comment about the Mac OS X file system was amusing.

Linus Torvalds, being interviewed by The Sydney Morning Herald, came out with this quotable quote:

“I don’t think they’re equally flawed – I think Leopard is a much better system,” he said. “(But) OS X in some ways is actually worse than Windows to program for. Their file system is complete and utter crap, which is scary.”

So what does “utter crap” really mean? The article is short on specifics, and context for that matter.

Personally I find the file system to be pretty good given its heritage, and the requirements for backward compatibility.

Blu-ray vs. HD DVD, or is it?

The Economist has an interesting article about the current state of Blu-ray vs. HD DVD, and makes the case that this is not the battle to watch. They suggest that the real battle is going to be over how the content is delivered.

They suggest that thumb drives may be one option, and that faster networks are another:

One candidate is the thumb drive, the non-volatile memory stick you plug into a computer’s USB port. Their storage capacity has soared over the past few years from megabytes to gigabytes. Industry insiders expect that, within a few years, a 32-gigabyte USB drive capable of holding as much as a Blu-ray disc will cost about the same as the latter does today. And it will be more portable, more rugged, easier to play and recordable to boot.

But before Moore’s Law can work its inexorable magic, the telephone companies will start pushing their own alternative. Over the past few years, firms such as Verizon and AT&T have been laying fat optical pipes over the “last mile” from their local telephone stations to people’s homes. In what they call a “triple play”, they aim to bundle television and broadband internet access along with telephone services in order to slow the inroads being made in their own business by the cable-television providers.

That’s only half of it. Verizon’s FiOS (fibre-optic service) can deliver raw data at speeds up to 50 megabits per second. That’s twice as much as needed to deliver the video quality of a Blu-ray or HD DVD disc. AT&T’s U-verse isn’t far behind.

Both see high-definition video as the key to beating the cable providers, which can’t match the phone companies’ ability to provide massive bandwidth to individual households. The cable industry’s new DOCSIS 3.0 technology can transmit data at a whopping 160 megabits per second, but the bandwidth has to be shared by all the households on the same cable loop. As a result, few cable subscribers can get more than five or six megabits per second—nowhere near enough to pump high-definition video into the home.

My money is on faster networks. Why send physical media when you can send bits? Then again, sending a Blu-ray/HD DVD…

1080i or 720p on my AppleTV

For a while I had been wondering whether I should set my AppleTV to display 1080i or 720p. I tried both and they look pretty much the same to me, but then again I have strange eyesight: I am slightly long-sighted in one eye and slightly short-sighted in the other.

The last time I searched the internet for an answer I found nothing, but today I came across a couple of interesting threads on the Apple discussion forums: this one talks specifically about 1080i or 720p, and this one discusses a resolution question/observation.

Basically it comes down to this: 720p is better than 1080i because:

It’s both. Yes, the scaler on your Bravia is superior to the one on the Apple TV. But the bigger factor is that all of the video standards supported by the Apple TV (MPEG-4, h.264) are progressive-scan only. So, in 1080i mode, your Apple TV interlaces while your Bravia then deinterlaces (which typically softens the image, though I’m not aware of what algorithm Sony uses on the Bravia line). Anyway, as you’re aware, that’s not an optimal combination when the source files are progressive-scan.

Yet more on “The Database Column vs. MapReduce”

The Database Column follows up on their original article about DBMSs and MapReduce.

Overall I think they do a very good job of addressing people’s comments about the issues. But I am left with one nagging thought triggered mainly by this sentence in the article:

As such, we claim that most things that are possible in MapReduce are also possible in a SQL engine.

The nagging thought is this: are we trying too hard to solve very different problems with two very different technologies by claiming that one technology can do it all? The authors present a very good example which is hard to do with MapReduce but easy to do with an RDBMS. What they don’t present is the flip side.

I would suggest that parsing text and building a full-text search index from it is a very good example of something that works really well in MapReduce, but would absolutely “suck” in an RDBMS.
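
As a toy illustration (my own sketch, not from either article), the indexing job maps very naturally onto the usual word-count shape, with a tiny in-process driver standing in for the shuffle and sort a real MapReduce framework would do across many machines:

    from itertools import groupby

    def map_document(doc_id, text):
        # Map phase: tokenize a document and emit (term, doc_id) pairs.
        for term in text.lower().split():
            yield term, doc_id

    def reduce_term(term, doc_ids):
        # Reduce phase: all postings for one term arrive together, so the
        # posting list for the full-text index is a simple aggregation.
        yield term, sorted(set(doc_ids))

    def build_index(docs):
        # In-process stand-in for the framework's distributed shuffle/sort.
        pairs = sorted(pair for doc_id, text in docs
                       for pair in map_document(doc_id, text))
        return {term: postings
                for term, group in groupby(pairs, key=lambda p: p[0])
                for _, postings in reduce_term(term, [d for _, d in group])}

    # build_index([(0, "the quick brown fox"), (1, "the lazy dog")])
    # -> {'brown': [0], 'dog': [1], 'fox': [0], 'lazy': [1], 'quick': [0], 'the': [0, 1]}

Each map call only sees one document and each reduce call only sees one term, which is why the work spreads across machines so easily.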

Processing large amounts of log data would be another good example.

And I am sure there are plenty of others.

I would suggest that a good rule of thumb is to look closely at the data you are processing. If there are lots of joins, or even one, an RDBMS would probably be a good choice. If your data is flat, then MapReduce would probably be a good choice. I hedge here because every situation is different and has to be considered on its own merits and challenges.
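
To make the rule of thumb concrete, here is a toy contrast (entirely my own illustration, with made-up table names and a hypothetical log format):

    # Join-heavy data: trivial to express declaratively in an RDBMS
    # (hypothetical "orders" and "users" tables).
    JOIN_QUERY = """
        SELECT u.country, COUNT(*)
        FROM orders o JOIN users u ON o.user_id = u.id
        GROUP BY u.country
    """

    # Flat data, e.g. web server log lines: each record can be processed on
    # its own, which is exactly the shape MapReduce fans out across machines.
    def map_log_line(line):
        # Hypothetical log format: "<timestamp> <status> <url> <bytes>"
        timestamp, status, url, nbytes = line.split()
        yield (status, url), int(nbytes)

The join case is one declarative statement; the flat case is a per-record function with no cross-record dependencies.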

The bottom line is that someone who comes to me and says that “Language X” or “Tool Y” can do it all (and make me a cup of tea after that) is being a little jejune.