Search Engine Experiments
This post is about two experiments.
(1) The Search Engine Experiment is a web page for comparing relevance of various search engines. Try it; you might be surprised.
It’s good to see people using tools like these to question the conventional wisdom that Google is the hands-down leader on relevance. I use the big three (Google, Yahoo! and MSN) pretty much interchangeably and I find they’re almost always neck and neck in relevance, vying for first place on most queries I perform.
How does one quantify search engine relevance? Broadly, you’re asking two questions: “Did it answer my question with the correct results?” and “How long did it take me to find the correct results within the entire list of search results?” Both are measurable.
That said, relevance is subjective; you and I will probably differ in our opinions about the relevance of a given search result for a given query, and we can both be right. Relevance is in the eye of the beholder.
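To make those two questions concrete, here is a minimal sketch in Python. It assumes a judge has already marked each result in a ranked list as relevant or not; the function names and the sample judgments are made up for illustration. Reciprocal rank is a standard information retrieval measure that I'm swapping in as one way to score "how far down did I have to look", not something any particular engine is known to use.

```python
from typing import Optional, Sequence


def first_relevant_rank(judgments: Sequence[bool]) -> Optional[int]:
    """Return the 1-based position of the first relevant result, or None."""
    for position, is_relevant in enumerate(judgments, start=1):
        if is_relevant:
            return position
    return None


def reciprocal_rank(judgments: Sequence[bool]) -> float:
    """Score one query: 1.0 if the top result is relevant, 0.5 if the
    second is the first relevant one, and 0.0 if nothing relevant shows up."""
    rank = first_relevant_rank(judgments)
    return 0.0 if rank is None else 1.0 / rank


def mean_reciprocal_rank(all_judgments: Sequence[Sequence[bool]]) -> float:
    """Average the per-query scores over a sample of queries."""
    if not all_judgments:
        return 0.0
    return sum(reciprocal_rank(j) for j in all_judgments) / len(all_judgments)


# Hypothetical judgments for one engine on three sample queries.
sample = [
    [False, True, False],   # correct answer found at position 2
    [True],                 # correct answer at the top
    [False, False, False],  # question never answered
]
mrr = mean_reciprocal_rank(sample)
print(f"MRR = {mrr:.2f}")  # prints "MRR = 0.50"
```

Because two judges can disagree on the same result, a score like this only means something when averaged over many queries and many judges.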
You get a statistically useful measure of search engine effectiveness by running thousands of randomly sampled queries, in different languages and geographical regions. Ideally the testing is also done in near real time (so the results aren’t stale by the time you’re done) and carried out by a broad diversity of people. The big three all do this sort of testing today using in-house programs.
(2) Here’s the second experiment idea: get a neutral scientific body like SIGIR to do broad-scale ongoing relevance measurement. Use a system like Amazon’s Mechanical Turk (mentioned earlier here) to distribute the sampling process to qualified participants (“judges”) across the world. Publish the results openly on a weekly basis, and make the database of accumulated queries and test results available for scientific research.
Just for the sake of argument, if you priced each “judgement” at $0.05, you could do 1000 samples a day across 3 search engines for $150. That’s peanuts! Yes, I’m ignoring a few things like auditing results for quality control, and how to generate the sample queries in the first place, but I don’t think that would change the ballpark number a great deal. Even so, let’s double it to $300 a day. Walnuts! You could even charge the search engine companies a nominal fee for access to the results to offset costs.
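For anyone who wants to play with the numbers, the back-of-the-envelope arithmetic above fits in a few lines of Python. The function name and the overhead multiplier are my own illustrative choices; the multiplier of 2 simply stands in for the "let's double it" allowance for quality control and query generation. Prices are kept in whole cents to sidestep floating-point rounding.

```python
def daily_cost_cents(price_cents: int, queries_per_day: int,
                     engines: int, overhead_multiplier: int = 1) -> int:
    """Cost per day in cents: one judgment per query per engine,
    scaled by a rough multiplier for auditing and other overhead."""
    return price_cents * queries_per_day * engines * overhead_multiplier


base = daily_cost_cents(price_cents=5, queries_per_day=1000, engines=3)
padded = daily_cost_cents(price_cents=5, queries_per_day=1000, engines=3,
                          overhead_multiplier=2)

print(f"Base:   ${base / 100:.2f} per day")    # prints "Base:   $150.00 per day"
print(f"Padded: ${padded / 100:.2f} per day")  # prints "Padded: $300.00 per day"
```

Change the price per judgment or the sample size and the total scales linearly, which is why the ballpark stays small even with generous padding.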
Wouldn’t that be a fun little experiment?
I’m sure we’ll all sleep better knowing this is possible.
By the way, when you do testing like this, you understand better why Google is trying so hard to diversify: algorithmic search relevance is hard, but it is not rocket science, and it’s being commoditized quickly by competitors. Time to move to higher ground.