Pages

Thursday 30 April 2009

Welcome to ASZ.COm.Au (Or, The Resolver Library is Broken)

A few days ago, I did a Google search and got the strangest error: a 404 (File Not Found) page that claimed to be from Apache with PHP and Frontpage extensions loaded. Google doesn't run Apache.


Since that was A Very Odd Thing Indeed I went to the Google home page. This time I got a new message: "Welcome to ASZ.COm.Au" (for some reason, I feel it's important to preserve the capitalisation).

My first guess was that my DNS cache had been poisoned. Or that I'd picked up a virus somehow. When I looked on Twitter I saw that I was not alone: a few people were reporting similar problems trying to access Facebook, YouTube, Google and even eBay. A discussion about the issue had started on Whirlpool. In most cases, the issue appeared to resolve itself eventually, or after a DNS cache flush.

In my case, the issue persisted for a little while and then stopped. I was able to access Google again.

It took some time for me to figure out what was going on (hey -- I'm over 27). The intermittent nature of the issue made debugging difficult. It wasn't until my partner mentioned that she'd seen the same screen when she'd tried to access the Bureau of Meteorology  that I was able to make progress. Visiting http://bom.gov.au/ (but not http://www.bom.gov.au/) consistently reproduced the problem.

I chased down a few theories: malware, cache-poisoning, an "optus issue".  But I finally worked out that this was the result of a documented feature of the common UNIX resolver library.

When you visit a web page your computer needs to "resolve" the host name (eg "google.com.au") to an Internet address. That's the job of the resolver, which in turn uses something called a domain name server. Your browser asks the resolver "what's the address of Google.com.au" and the resolver answers. Either the resolver already knows the answer because it's looked it up before and kept a copy (a cache), or it doesn't know, and so it asks the domain name server. The domain name server in turn may ask other domain name servers, until someone, somewhere, knows the answer. The answer is then sent back through the chain, ultimately to your resolver (called the "client").

So if the resolver doesn't know an address it asks the domain name server and waits for an answer. But it will not wait forever. In fact, it typically won't wait longer than several seconds. It's an impatient little thing and if it hasn't heard back quickly enough it assumes that maybe the hostname is wrong -- maybe the user just typed part of the hostname. Or maybe the domain name server does answer in time but the resolver cannot use the answer -- the domain name server may reply with "no one knows that hostname." Either way, the resolver will start to guess what the real (or "fully qualified") hostname might be.

There are two strategies a resolver can use when it starts searching for the fully qualified hostname.

The first is to use an explicit list of search paths. You usually provide this list yourself when configuring your network settings. If you haven't set such a list, then it will employ the second strategy. (That's going to turn out to be quite handy...)

The second thing your resolver can do is a "domain name search". It takes your domain name, prepends the hostname you're looking for and does a lookup on that. I'm with Optusnet, so my domain name is "optusnet.com.au". My reslover then might lookup "bom.gov.au.optusnet.com.au" if it doesn't get a useful answer for "bom.gov.au". If it still doesn't get a useful answer (and it this case, it won't) then it starts searching "up" the domain name -- it removes the first part of the domain name and repeats the search. So its second search is for "bom.gov.au.com.au". See what it did there? It deleted "optusnet" and tried again.  (Technical note: the behaviour is documented in the resolver man page -- see RES_DNSRCH)

That right there is the flaw and the root cause of the problem.

Someone owns the domain names "au.com.au" and "com.com.au". And they have name servers. And they've set them up to answer queries for a whole host of things, among them, bom.gov.au.com.au, google.com.au.com.au and facebook.com.com.au. And our resolvers are querying them and merrily sending our browsers there if the real name servers for those domains don't get their answers back in time.

Now we know enough to say what's going on:
  1. You try and visit Google, Facebook, Twitter (or the BoM). It's been a while so the address isn't in your resolver's cache. So it does a lookup by asking your domain name server -- this is usually provided by your ISP.
  2. For whatever reason, your ISP's name server is either too slow to respond or the response is lost altogether. So your resolver "times out" and starts to "search". Your domain name ends in ".com.au" and so eventually, your resolver looks up "google.com.au.com.au" (or whatever site you're trying to visit, with ".com.au" added to the end). The name servers at "au.com.au" (or "com.com.au" depending on what you're looking up) do respond and do so in time.
  3. Your resolver gives the bogus address to the browser and stores it in the DNS cache. The "TTL" (time to live) for those addresses is 4 hours, so you're going to be stuck with that address in your cache for at most 4 hours.
  4. Eventually, your cache times out. Or maybe you know how to flush it. Either way, a second attempt by the resolver to get the right IP address works and the problem appears to be resolved.
Multiple things are going wrong here:
  • The ISPs name server is either too slow to respond or perhaps "dropping" packets (DNS packets are typically using UDP which is not a guaranteed delivery mechanism like TCP). I've seen this with Optusnet before but in the past I just got a "site not found". Such responses aren't cached so if you hit "Refresh" in your browser you typically find the site just fine the second time.
  • The name servers for "com.com.au" and "au.com.au" have records that match other peoples sites. They shouldn't. Right now, it's just confusing and annoying but its potential for phishing is obvious. It's not necessarily malicious but it should be changed.
  • The algorithm used by the resolver in both UNIX and Windows has a security flaw: it should not search all the way back to ".com.au".
To me the ultimate problem is that last one: UDP packets, and hence DNS responses, are not guaranteed to be delivered. It should be expected that sometimes, things will get busy and responses will be lost. The resolver is in error by searching the domain name all the way back to ".com.au" -- IMHO that's a security hole similar to the old "cookie monster" bug.

There's a fix though, which at least works on OSX (Mac). If you set an explicit search path, then the resolver won't use the second strategy described above. It will search the search path(s) and then stop there. I've set my search path to "optusnet.com.au", the same as my domain name, and can no longer reproduce the problem.

There are other things that can help:
  • If your ISP's domain name server is not reliable use OpenDNS. There is some anecdotal evidence that the possibility of DNS replies being late or dropped is lower. Getting the "Welcome to ASZ.COm.Au" page for Google or Facebook depended on your computer not getting the DNS response in time (or at all) so having a reliable domain name server will stop the problem happening.
  • Add a "." to the end of your hostnames when typing into the browser (for example, "google.com.au."). The trailing "." prevents the domain name searching from kicking in.
  • If you're able, configure firewalls to drop packets from the name servers at "com.com.au" and "au.com.au".
Oh, if you're wondering why "bom.gov.au" could reproduce the problem every time: Their name server does respond quickly but not with an "A" record for that domain. Which is to say, it's not a response that the resolver can use to answer the question "what's bom.gov.au's IP address". So even though the resolver gets an answer it still embarks on its domain name search until it looks up "bom.gov.au.com.au" and gets an answer (an "A" record). To reproduce the proglem with Google etc, you need to wait for the DNS response packet to go missing in order to trigger the domain search by your resolver.

Other things can be explained:
  • It appeared to be an "Optus problem" at one stage because their domain name servers are occasionally overloaded and therefore slow. The Optus domain ends in "com.au" and so the domain name search would go all the way back to ".com.au". TPG seems to have similar issues.
  • I couldn't reproduce the problem at work because my domain name there is "work.com". The resolver is smart enough not to search as far back as ".com" -- it just missed the case where a country domain has subclassifications (such as ".com.au", ".co.nz" or ".co.uk"). That's the limitation to the "counting dots" method of deciding how far to walk back.
  • Switching to OpenDNS would appear to solve the problem because the resolver didn't need to start a domain name search if it got the right answer right away.
  • Flushing the DNS cache would appear to solve the problem because it's only occasionally that DNS replies get lost. You have a good chance on your second attempt of getting the right address.
Thanks to objects, pixelgroup and 3buffalogirls for responding to my Twitter enquiries with additional information, and the posters on the Whirlpool forums.