The case of the phantom 404s

Sometimes tracking down 404 errors (HTTP “Resource not found”) on a web site is easy to do, but I ran into a tricky case yesterday that took me the better part of the afternoon to resolve. I want to share the steps I walked through in the hopes it saves someone else some time.

If you run a web site you’ll want to periodically verify that your site is being crawled properly by search engines. Both Google Webmaster Tools and Bing Webmaster give you a handy “crawl errors” list which shows URLs the search engine fails to index, and the corresponding “origin” pages that link to the problem URLs.

This weekend I checked on Google Webmaster and was surprised to find a few hundred 404 errors in recent crawls of 5 Blocks Out. (There were no matching errors on Bing, and I’m still not sure why.)

Warning: from here on, this blog post gets ugly. Skip to “Lessons Learned” below if you just want the punchline.

All the URLs looked similar to this:

5blocksout.com/q5E3Q1Db7SpV45DINIfRr28pnGRt5G/ujfe3SPV36M=

In other words, each URL had a path consisting of a long string of alphanumeric characters, and ending with an equals sign.
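
A pattern like this is easy to filter for when scanning a crawl errors list or a web server log. Here is a minimal sketch; the function name `looksLikeTokenPath` and the 20-character minimum are my own assumptions, chosen only so that ordinary short paths are not flagged.

```javascript
// Sketch: flag URL paths that look like the phantom URLs described above,
// i.e. a long run of base64-style characters ending with "=".
// The minimum length of 20 is an arbitrary assumption.
function looksLikeTokenPath(path) {
  return /^\/[A-Za-z0-9+\/]{20,}={1,2}$/.test(path);
}

const paths = [
  "/q5E3Q1Db7SpV45DINIfRr28pnGRt5G/ujfe3SPV36M=",
  "/about",
  "/posts/123",
];
console.log(paths.filter(looksLikeTokenPath));
```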

We don’t support URLs like this on 5 Blocks Out, hence all the 404 errors. But where were they coming from?

The crawl errors report listed many different 5 Blocks Out web pages as the origins for these problem URLs. The next step was obvious, then: look at the source code for each origin page containing a problem URL, and then fix the problem. Well, it turned out that none of the origin pages actually contained any of these problem URLs. Googlebot had discovered these URLs at crawl time, but I couldn’t find any of them myself. Phantom URLs.

I cleared my browser cache and tried again, with the same results. Then I tried wget, to more closely mimic what Googlebot would see. Again, same results: I couldn’t find any of these supposed 404 URLs on the origin pages.

My hypotheses at this point were not pretty:

  • A bug in the application is intermittently generating bad URLs.
  • The pages contain poorly formed HTML, which is causing Googlebot (but not Bing) to parse them incorrectly.
  • A virus or some sort of malware is corrupting the web pages.
  • A bug in our web server software or Rails is corrupting the pages.

Not an appealing list to investigate, especially without a way to reproduce the problem. The “bad HTML” one was easy to eliminate by running the page through the W3C Validator, so that left only gnarly possibilities.

After a while of casting about for clues I eventually returned to looking at the URLs themselves, hoping to find patterns in their structure, the crawl dates, the origin pages… anything. And a few things struck me:

1. The 404 errors all began around Nov 26, which is the date I upgraded our web server to the latest versions of nginx and Phusion Passenger. This later turned out to be a false lead, but it seemed useful at the time.

2. The strings of alphanumeric characters in these phantom URLs looked similar to the authentication tokens that Rails generates as a countermeasure against cross-site request forgery. Rails inserts these tokens into every page that contains an HTML form.

A little voice reminded me, “Hey smartypants, we have some javascript that appends authentication tokens to AJAX requests made from our web pages.” Now, that code shouldn’t have affected search engines, because it was specific to POST requests, and everyone knows that well-behaved search engine crawlers only do GET requests. But still… new hypothesis: “Our javascript is mistakenly appending authentication tokens to Googlebot GET requests.”

How to prove it? Well, search engines have caches of crawled pages. I checked Google’s page cache and found a cached version of an origin page that supposedly contained a 404 URL. The HTML source showed… drum roll… no such URL. But the HTML did contain javascript with an authentication code that exactly matched the path portion of the 404 URL. Bingo!
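
That comparison can be scripted rather than eyeballed. The sketch below is not what I ran at the time; `findPathInSource` is a hypothetical helper, and it assumes the token appears in the page source exactly as it appears in the URL (no escaping or HTML encoding).

```javascript
// Sketch: check whether the path of a reported 404 URL appears verbatim
// in the HTML source of an origin page (for example, a cached copy).
function findPathInSource(notFoundUrl, html) {
  // Crawl reports may omit the scheme, so add one before parsing.
  const hasScheme = /^https?:\/\//i.test(notFoundUrl);
  const url = new URL(hasScheme ? notFoundUrl : "http://" + notFoundUrl);
  const path = url.pathname.slice(1); // drop the leading "/"
  const index = html.indexOf(path);
  // An empty path (the site root) never counts as a match.
  return { path, found: path.length > 0 && index !== -1, index };
}

// Made-up page source for illustration; only the token value comes from
// the example URL above.
const cachedHtml =
  '<script>window.auth_token = "q5E3Q1Db7SpV45DINIfRr28pnGRt5G/ujfe3SPV36M=";</script>';
const result = findPathInSource(
  "5blocksout.com/q5E3Q1Db7SpV45DINIfRr28pnGRt5G/ujfe3SPV36M=",
  cachedHtml
);
console.log(result.found);
```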

From there, the fix was easy. I did indeed have a bug in my javascript. The code was screening out AJAX requests with a type of “GET”, but it didn’t check for the lowercase “get”, and so it appended auth tokens in cases where it shouldn’t have.

Why did this start happening all of a sudden, you might ask? I don’t know. Perhaps something changed in the way Googlebot issues its HTTP requests. Maybe we’d always had this problem, and Googlebot just hadn’t yet crawled pages that exhibited it. Maybe web fairies temporarily stopped sprinkling magic pixie dust on our site. Who knows.

Lessons learned:

  • Check webmaster tools and web server logs once in a while for unexplained crawl errors.
  • Remember that search engines allow you to search for cached versions of web pages. This can help with diagnosing crawl problems. (In fact, crawl error reports really ought to link directly to the cached versions of crawled pages.)
  • String comparisons don’t work very well if you don’t make them case-insensitive.

For reference, here’s the diff for the problem code (in our application.js file, bound to the ajaxSend event):

// Insert authenticity token into all non-GET AJAX requests
function insertAuthToken(elm, xhr, s) {
-    if (s.type == "GET") {
-        return;
-    }
+    if (s.type == 'GET' || s.type == 'get' || typeof(window.auth_token) == "undefined") return;
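
The fix above still misses mixed-case values such as “Get”. A more general version normalizes the case before comparing. This is a sketch rather than the code we shipped: `shouldSkipAuthToken` is a hypothetical helper, and it assumes that a missing `type` means GET, which is jQuery’s default.

```javascript
// Sketch: decide whether the auth token should be left off a request.
// `settings` stands in for the AJAX settings object passed to the
// ajaxSend handler; `authToken` stands in for window.auth_token.
function shouldSkipAuthToken(settings, authToken) {
  // Assumption: no explicit type means GET.
  const method = String(settings.type || "GET").toUpperCase();
  return method === "GET" || typeof authToken === "undefined";
}

console.log(shouldSkipAuthToken({ type: "get" }, "some-token"));
console.log(shouldSkipAuthToken({ type: "POST" }, "some-token"));
```

With a helper like this, the handler would return early whenever it yields true, regardless of how the caller spelled the method.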

Thousands piled on at the last minute to push Wikileaks over 1M Facebook Fans

Sometime a little after 5pm EST on Dec 7, 2010, Wikileaks gained its 1 millionth Facebook fan. That’s about 10 hours after my prediction, as the growth rate slowed through the night (duh) and didn’t return to the previous day’s pace during most of the day.

Here’s the chart, showing Facebook likes (fans) in blue and Twitter followers in red. The gaps in the lines are gaps in data sampling. I sampled every 5 minutes.

Notice the funky little blip in the blue line right around the 1 million mark? Here’s another chart that zooms in on that timeframe within the red circle:

A significant acceleration starts around 1710 EST. During this period the average number of new likes per minute (LPM, it’s the new RPM) goes from 89 or so up to a peak of 516, and then decelerates back down to 83 LPM at 1800 EST, followed by a continuing decline from there onwards.

At first I thought the lift was tied solely to prime time East coast US and Canada news, i.e. people might have logged on to Facebook immediately after hearing news stories about Wikileaks. That probably happened, to some degree. But if that were the only factor, we should have expected a similar bump, or sustained high growth, through the 1800-1900 hour, which is also news-heavy. And that’s not the case, as you can plainly see.

My best guess is that a crowd of people were aware the Wikileaks Facebook page was almost at 1M fans, and they all piled on when it got close to push Wikileaks over the 1M mark. Probably some of them hoped to be the 1 millionth fan. (Is there a badge for that?)

In case you’re interested, below is the raw data for 1500 EST to 2000 EST. Note I’ve calculated average new likes per minute, as I’m sampling every 5 minutes.

Why bother with all this analysis? Ethics and aims of Wikileaks aside, I just think this is an interesting phenomenon at the intersection of math, psychology, and social networks. It’s not often you get to watch a meme go viral, in realtime, with accurate metrics available for tracking it. Lucky us.

I’ll keep my samples running for a while, in case anything else interesting happens.

Date/Time (EST)    Facebook Likes   Avg. new likes/minute
2010-12-07 15:00   982510           86
2010-12-07 15:05   982962           90
2010-12-07 15:10   983389           85
2010-12-07 15:15   983855           93
2010-12-07 15:20   984282           85
2010-12-07 15:25   984725           89
2010-12-07 15:30   985115           78
2010-12-07 15:35   985505           78
2010-12-07 15:40   985940           87
2010-12-07 15:45   986320           76
2010-12-07 15:50   986728           82
2010-12-07 15:55   987136           82
2010-12-07 16:00   987557           84
2010-12-07 16:05   987979           84
2010-12-07 16:10   988429           90
2010-12-07 16:15   988856           85
2010-12-07 16:20   989309           91
2010-12-07 16:25   989751           88
2010-12-07 16:30   990164           83
2010-12-07 16:35   990584           84
2010-12-07 16:40   990998           83
2010-12-07 16:45   991404           81
2010-12-07 16:50   991845           88
2010-12-07 16:55   992301           91
2010-12-07 17:00   992730           86
2010-12-07 17:05   993175           89
2010-12-07 17:10   994691           303
2010-12-07 17:15   997270           516
2010-12-07 17:20   999698           486
2010-12-07 17:25   1001745          409
2010-12-07 17:30   1003400          331
2010-12-07 17:35   1004748          270
2010-12-07 17:40   1006032          257
2010-12-07 17:45   1006595          113
2010-12-07 17:50   1007818          245
2010-12-07 17:55   1008331          103
2010-12-07 18:00   1008748          83
2010-12-07 18:05   1009130          76
2010-12-07 18:10   1009483          71
2010-12-07 18:15   1009836          71
2010-12-07 18:20   1010182          69
2010-12-07 18:25   1010523          68
2010-12-07 18:30   1010863          68
2010-12-07 18:35   1011204          68
2010-12-07 18:40   1011495          58
2010-12-07 18:45   1011804          62
2010-12-07 18:50   1012134          66
2010-12-07 18:55   1012420          57
2010-12-07 19:00   1012703          57
2010-12-07 19:05   1012968          53
2010-12-07 19:11   1013228          52
2010-12-07 19:16   1013496          54
2010-12-07 19:21   1013766          54
2010-12-07 19:26   1013991          45
2010-12-07 19:31   1014236          49
2010-12-07 19:36   1014457          44
2010-12-07 19:41   1014672          43
2010-12-07 19:46   1014882          42
2010-12-07 19:51   1015105          45
2010-12-07 19:56   1015321          43
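
The last column in the table above is just the difference between consecutive samples divided by the sampling interval. A minimal sketch of that calculation follows; the function name is mine, and it assumes a fixed 5-minute interval and rounding to the nearest whole number.

```javascript
// Sketch: average new likes per minute between consecutive samples,
// assuming every sample is `intervalMinutes` after the previous one.
function averageLikesPerMinute(samples, intervalMinutes = 5) {
  const rates = [];
  for (let i = 1; i < samples.length; i++) {
    rates.push(Math.round((samples[i] - samples[i - 1]) / intervalMinutes));
  }
  return rates;
}

// Like counts from 17:05 through 17:20 EST, taken from the table above.
const surge = averageLikesPerMinute([993175, 994691, 997270, 999698]);
console.log(surge); // [ 303, 516, 486 ]
```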

Wikileaks will exceed 1M Facebook fans in the next 16 hours

One of the interesting things about the Wikileaks brouhaha is the speed at which news about it, and support for it, is propagating. I’ve been watching Wikileaks’ Facebook page today (Dec 6, 2010), and if my numbers are right it has gained an average of 105 “Likes” per minute between 11 AM EST and 3:25 PM EST. That’s about 6300 new people, every hour, giving Wikileaks a thumbs-up by clicking the “Like” button on their fan page. A few minutes ago it topped 900k.

Extrapolating from that average, in about 16 hours, Wikileaks will have over 1 million Facebook supporters.
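
The arithmetic behind that estimate is a simple linear extrapolation. Here is a sketch, using 900,000 as a round starting count (the page had only just topped 900k, so the exact figure is an assumption) and assuming the rate stays constant, which it won’t.

```javascript
// Sketch: hours needed to reach a target count at a constant rate.
function hoursUntilTarget(current, target, perMinute) {
  return (target - current) / perMinute / 60;
}

// 100,000 more likes at 105 per minute is a little under 16 hours.
const hours = hoursUntilTarget(900000, 1000000, 105);
console.log(hours.toFixed(1)); // "15.9"
```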

Count of "Likes" for http://facebook.com/wikileaks

How ’bout Twitter? Wikileaks’ Twitter account gained 1907 followers between 13:32 EST and 15:32 EST, up from 406,018 to 407,925.

I wish I’d been looking at the data earlier on. It would be neat to see the points where the growth accelerated, and correlate that with big news events. The only historical data I have is that on Dec 4, just two days ago, Wikileaks tweeted that they had hit 600K Facebook supporters.

Amazon, PayPal, and EveryDNS (not EasyDNS!) have all dropped support for Wikileaks’ internet presence. I wonder whether and how long Twitter and Facebook will allow these accounts to stay up.
