I wish the article had more info since I have been wondering how a software upgrade downed the entire zone. Weren't there any backup servers? Did they not test the upgrade beforehand? I know I'd lose my job if I upgraded our DNS servers all at once without testing.
-----Original Message-----
From: owner-nanog@merit.edu [mailto:owner-nanog@merit.edu] On Behalf Of Fergie
Sent: Wednesday, August 30, 2006 3:26 PM
To: gstammw@gmx.net
Cc: nanog@merit.edu
Subject: Re: Spain was offline
Netcraft:
[snip]
A botched software update at Spain's central domain registry knocked as many as 400,000 sites offline for several hours Tuesday, according to the Esnic registry. The error left Internet users unable to access domains using .es, the country code top-level domain for Spain.
[snip]
More: http://news.netcraft.com/archives/2006/08/30/thousands_of_spanish_web_sites_knocked_offline_by_software_error.html
- ferg
-- "Gunther Stammwitz" <gstammw@gmx.net> wrote:
Hi colleagues,
Spain (at least the .es part) was offline and nobody reported it...? What's going on? In the past you were faster...
Gunther
-- "Fergie", a.k.a. Paul Ferguson Engineering Architecture for the Internet fergdawg(at)netzero.net ferg's tech blog: http://fergdawg.blogspot.com/
On 31 Aug 2006, at 16:30, Joseph Jackson wrote:
I wish the article had more info since I have been wondering how a software upgrade downed the entire zone.
Oh, loads of ways.
Weren't there any backup servers?
Well, a quick poke suggests, assuming a reasonably traditional setup, that ns1.nic.es is the master, and there are various slaves, not necessarily directly under their control. ns1.nic.es appears to be running BIND 9.3.2, and there are other versions running on the other nameservers. So if it *was* a software update of BIND, it probably wasn't global. OTOH, I can believe that somebody broke a Perl script critical to it and it rolled out a valid, but empty, zonefile which the secondaries faithfully replicated. Not that I've watched cascading DNS failures at too many places with bits of crufty Perl, oh no... Actually, it amazes me that this sort of thing doesn't happen more often.
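(For those following along at home, those version strings come from a CHAOS-class TXT query for version.bind.) If I were running such a pipeline, I'd want a sanity check in front of the publish step. A zone that has lost nearly all its records can still be syntactically valid -- SOA and NS intact -- so a syntax check alone won't save you; you have to look at the content too. A minimal sketch in Perl, with invented paths and an invented 50% threshold, purely for illustration, nothing to do with whatever Esnic actually runs:

    #!/usr/bin/perl
    # Hypothetical pre-publish sanity check. Paths, filenames and the
    # 50% threshold are made up for illustration.
    use strict;
    use warnings;

    my $new_zone  = '/var/named/es.zone.new';   # freshly generated zone
    my $live_zone = '/var/named/es.zone';       # zone currently served

    # Rough record count: every non-blank, non-comment line.
    sub count_records {
        my ($file) = @_;
        open my $fh, '<', $file or die "cannot open $file: $!\n";
        my $n = 0;
        while (<$fh>) {
            $n++ unless /^\s*(?:;|$)/;
        }
        close $fh;
        return $n;
    }

    my $new  = count_records($new_zone);
    my $live = count_records($live_zone);

    # The empty-but-valid zone is the failure mode we're guarding
    # against: refuse to publish if the new zone has lost more than
    # half the records of the zone we're already serving.
    die "refusing to publish: $new records vs $live currently live\n"
        if $live > 0 && $new < $live / 2;

    rename $new_zone, $live_zone
        or die "rename failed: $!\n";
    print "published new zone ($new records)\n";

Crude, but it would have turned Tuesday's outage into a cron mail rather than a headline.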
Did they not test the upgrade beforehand? I know I'd lose my job if I upgraded our DNS servers all at once without testing.
It's Europe, it's harder to fire people. There's probably a bit of scapegoating and shooting of messengers going on, but it's quite likely that the root cause is a general process failure that's not attributable to a single individual.
On Thu, 31 Aug 2006 17:30:37 BST, Peter Corlett said:
OTOH, I can believe that somebody broke a Perl script critical to it and it rolled out a valid, but empty, zonefile which the secondaries faithfully replicated. Not that I've watched cascading DNS failures at too many places with bits of crufty Perl, oh no...
ISTR some database extract failing in a new and unusual way a few years ago, and about 1/3 of the entire .com domain evaporated for several hours....
On Thu, 31 Aug 2006 13:03:38 -0400, Valdis.Kletnieks@vt.edu wrote:
On Thu, 31 Aug 2006 17:30:37 BST, Peter Corlett said:
OTOH, I can believe that somebody broke a Perl script critical to it and it rolled out a valid, but empty, zonefile which the secondaries faithfully replicated. Not that I've watched cascading DNS failures at too many places with bits of crufty Perl, oh no...
ISTR some database extract failing in a new and unusual way a few years ago, and about 1/3 of the entire .com domain evaporated for several hours....
This is an old, old story -- such failures have been with us for a long time. Not all that many years ago, the entire (US) 800 number system was down for a similar reason -- the program that populated the production database from the back end master copies hiccupped, and things got *very* confused. For many more stories like this, see the archives of the RISKS Digest (http://www.risks.org).

--Steven M. Bellovin, http://www.cs.columbia.edu/~smb
participants (4)
- Joseph Jackson
- Peter Corlett
- Steven M. Bellovin
- Valdis.Kletnieks@vt.edu