Sometimes tracking down 404 errors (HTTP “Resource not found”) on a web site is easy to do, but I ran into a tricky case yesterday that took me the better part of the afternoon to resolve. I want to share the steps I walked through in the hopes it saves someone else some time.
If you run a web site you’ll want to periodically verify that your site is being crawled properly by search engines. Both Google Webmaster tools and Bing Webmaster give you a handy “crawl errors” list which shows URLs the search engine fails to index, and the corresponding “origin” pages that link to the problem URLs.
This weekend I checked on Google Webmaster and was surprised to find a few hundred 404 errors in recent crawls of 5 Blocks Out. (There were no matching errors on Bing, and I’m still not sure why.)
Warning: from here on, this blog post gets ugly. Skip to “Lessons Learned” below if you just want the punchline.
All the URLs looked similar to this:
5blocksout.com/q5E3Q1Db7SpV45DINIfRr28pnGRt5G/ujfe3SPV36M=
In other words, each URL had a path consisting of a long string of alphanumeric characters, and ending with an equals sign.
We don’t support URLs like this on 5 Blocks Out, hence all the 404 errors. But where were they coming from?
The crawl errors report listed many different 5 Blocks Out web pages as the origins for these problem URLs. The next step was obvious, then: look at the source code for each origin page containing a problem URL, and then fix the problem. Well, it turned out that none of the origin pages actually contained any of these problem URLs. Googlebot had discovered these URLs at crawl time, but I couldn’t find any them myself. Phantom URLs.
I cleared my browser cache and tried again, with the same results. Then I tried wget, to more closely mimic what Googlebot would see. Again, same results: I couldn’t find any of these supposed 404 URLs on the origin pages.
My hypotheses at this point were not pretty:
- A bug in the application is intermittently generating bad URLs.
- The pages contain poorly formed HTML, which is causing Googlebot (but not Bing) to parse them incorrectly.
- A virus or some sort of malware is corrupting the web pages.
- A bug in our web server software or Rails is corrupting the pages.
Not an appealing list to investigate, especially without a way to reproduce the problem. The “bad HTML” one was easy to eliminate by running the page through the W3C Validator, so that left only gnarly possibilities.
After a while of casting about for clues I eventually returned to looking at the URLs themselves, hoping to find patterns in their structure, the crawl dates, the origin pages… anything. And a few things struck me:
1. The 404 errors all began around Nov 26, which is the date I upgraded our web server to the latest versions of nginx and Phusion Passenger. This later turned out to be a false lead, but it seemed useful at the time.
2. The strings of alphanumeric characters in these phantom URLs looked similar to the authentication tokens that Rails generates as a countermeasure against cross-site request forgery. Rails inserts these tokens into every page that contains an HTML form.
A little voice reminded me, “Hey smartypants, we have some javascript that appends authentication tokens to AJAX requests made from our web pages.” Now, that code shouldn’t have affected search engines, because it was specific to POST requests, and everyone knows that well-behaved search engine crawlers only do GET request. But still… new hypothesis: “Our javascript is mistakenly appending authentication tokens to Googlebot GET requests.”
How to prove it? Well, search engines have caches of crawled pages. I checked Google’s page cache and found a cached version of an origin page that supposedly contained a 404 URL. The HTML source showed… drum roll… no such URL. But the HTML did contain javascript with an authentication code that exactly matched the path portion of the 404 URL. Bingo!
From there, the fix was easy. I did indeed have a bug in my javascript. The code was screening out AJAX requests with a type of “GET”, but it didn’t check for the lowercase “get”, and so it appended auth tokens in cases where it shouldn’t have.
Why did this start happening all of a sudden, you might ask? I don’t know. Perhaps something changed in the way Googlebot issues its HTTP requests. Maybe we’d always had this problem, and Googlebot just hadn’t yet crawled pages that exhibited it. Maybe web fairies temporarily stopped sprinkling magic pixie dust on our site. Who knows.
Lessons learned:
- Check webmaster tools and web server logs once in a while for unexplained crawl errors.
- Remember that search engines allow you to search for cached versions of web pages. This can help with diagnosing crawl problems. (In fact, crawl error reports really ought to link directly to the cached versions of crawled pages.)
- String comparisons don’t work very well if you don’t make them case-insensitive.
For reference, here’s the diff for the problem code (in our application.js file, bound to the ajaxSend event):
// Insert authenticity token into all non-GET AJAX requests
function insertAuthToken(elm, xhr, s) {
- if (s.type == "GET") {
- return;
- }
+ if (s.type == 'GET' || s.type == 'get' || typeof(window.auth_token) == "undefined") return;
Like this:
Be the first to like this post.
Recent Comments