how to write an incident report
For those who don't recognise the name (presumably many people), Citylink are a disruptive high-speed metro transport provider in Wellington, New Zealand. They run the most elaborate and scary layer-2 switched ethernet network I've ever heard of, and the other week they ran into some problems which caused a prolonged outage.
Here's their writeup:
http://news.clnz.net/2007/10/19#Loopback-Saturday-Discussion
Why don't I have any suppliers like this?
Joe
On 20/10/2007, at 6:45 AM, Joe Abley wrote:
> For those who don't recognise the name (presumably many people), Citylink are a disruptive high-speed metro transport provider in Wellington, New Zealand. They run the most elaborate and scary layer-2 switched ethernet network I've ever heard of, and the other week they ran into some problems which caused a prolonged outage.
> Here's their writeup:
> http://news.clnz.net/2007/10/19#Loopback-Saturday-Discussion
> Why don't I have any suppliers like this?
Wow! This has to be one of the best incident reports I have ever seen. It would be great if people took a page out of Citylink's book instead of issuing the one-paragraph "something died" type of report.
--
Steven Haigh
Email: netwiz@crc.id.au
Web: http://www.crc.id.au
Phone: (03) 9001 6090 - 0412 935 897
On Fri, 19 Oct 2007, Joe Abley wrote:
> For those who don't recognise the name (presumably many people), Citylink are a disruptive high-speed metro transport provider in Wellington, New Zealand. They run the most elaborate and scary layer-2 switched ethernet network I've ever heard of, and the other week they ran into some problems which caused a prolonged outage.
> Here's their writeup:
> http://news.clnz.net/2007/10/19#Loopback-Saturday-Discussion
> Why don't I have any suppliers like this?
Probably because your suppliers run everything thru their lawyers. This obviously was posted straight via the NOC/OPs group. They'll continue to post incident reports like this until they get hit by their first lawsuit, something along the lines of "spilling hot coffee on their lap".
"This take somewhat longer than it should have..."
"then we are pretty sure that the measures above (enforcing MAC count limits on every port, disabling keepalives on interswitch links, single homing all 2950's) will prevent the problem from reoccuring..."
There is enough info in that posting to bury them in frivolous lawsuits.
-Hank
> There is enough info in that posting to bury them in frivolous lawsuits.
I say good for them then! Society is litigious enough without our engineers worrying about lawsuits. The moment you start tempering your true analysis of a situation to kow-tow to spin doctors is the moment your engineering badge should be revoked. The world needs more honesty, not less.
Jason
PS This "Citylink" appears to be in New Zealand - perhaps they haven't been invaded by lawyers yet?
PPS And it was an excellent analysis!
On 20-Oct-2007, at 1304, Hank Nussbacher wrote:
> On Fri, 19 Oct 2007, Joe Abley wrote:
>> For those who don't recognise the name (presumably many people), Citylink are a disruptive high-speed metro transport provider in Wellington, New Zealand. They run the most elaborate and scary layer-2 switched ethernet network I've ever heard of, and the other week they ran into some problems which caused a prolonged outage.
>> Here's their writeup:
>> http://news.clnz.net/2007/10/19#Loopback-Saturday-Discussion
>> Why don't I have any suppliers like this?
> Probably because your suppliers run everything thru their lawyers.
I've had a few responses like this, but I don't buy it.
I've worked in many places, some in New Zealand and more elsewhere, where there was a general culture of fear about making public statements about operational incidents. I don't ever remember people sending proposed text to legal and having it pushed back with changes; what happened instead was that text wasn't written in the first place.
Maybe Simon's level of detail is such that no legal department would ever condone it. But there's such a tremendous distance between Simon's text and the usual "there are no known issues at this time" that I suspect people just aren't trying.
Joe
On Sat, 20 Oct 2007 15:24:24 -0400 Joe Abley <jabley@ca.afilias.info> wrote:
> Maybe Simon's level of detail is such that no legal department would ever condone it. But there's such a tremendous distance between Simon's text and the usual "there are no known issues at this time" that I suspect people just aren't trying.
Maybe there was so much detail that the lawyers didn't understand it. :-)
--
D'Arcy J.M. Cain <darcy@druid.net>         |  Democracy is three wolves
http://www.druid.net/darcy/                |  and a sheep voting on
+1 416 425 1212     (DoD#0082)    (eNTP)   |  what's for dinner.
On Sat, 20 Oct 2007, Joe Abley wrote:
> I've had a few responses like this, but I don't buy it. I've worked in many places, some in New Zealand and more elsewhere, where there was a general culture of fear about making public statements about operational incidents. I don't ever remember people sending proposed text to legal and having it pushed back with changes; what happened instead was that text wasn't written in the first place.
Of course it wasn't. The only time public statements beyond the simple "Network Status" update are made is when the outage is so huge that the news media report it. Then the idea is to spin the problem as a freak occurrence that no amount of money and planning (which the company of course spent years and millions doing) would have prevented.
Legal and PR are going to take one look at the report and then ask what the upside for the company is in releasing it. In most cases there will be none, so it won't happen. Techs know this, so they don't even bother.
In reality a large percentage of outages happen for "dumb" reasons, and publicising them just makes the company look bad (look at the previous fault on the page).
Look at this Citylink outage: I'm sure the sales guys for rival companies are right now working on their pitches for Citylink's customers' business based on what has been posted.
"Look at these guys, they took down half the city and still don't know it wasn't caused by hackers. Half the government was offline [1] all day because they couldn't even get into their building after hours. Their phones were off, their mail servers stopped working, they couldn't log in to their network themselves, and their websites were offline. They've been having these sort of outages on a smaller scale for years and just ignored them because they only affect one or two customers at a time."
[1] Roughly: Beehive = White House, RBNZ = Federal Reserve, Bowen St = Parliament.
> Maybe Simon's level of detail is such that no legal department would ever condone it. But there's such a tremendous distance between Simon's text and the usual "there are no known issues at this time" that I suspect people just aren't trying.
Well, I was pleasantly surprised at 365 Main's explanation of the problem a while back:
http://www.365main.com/press_releases/pr_8_1_07_365_main_report.html
but once again that was a major event that couldn't be hidden.
Citylink is a slightly unusual company in its level of openness (although getting less so), but I would guess that most people on this list would be fired if they posted something like Simon's text without running it by legal.
--
Simon Lyall | Very Busy | Web: http://www.darkmere.gen.nz/
"To stay awake all night adds a day to your life" - Stilgar | eMT.
I think you greatly underestimate how customers react to the truth.
Indeed - "The Cluetrain Manifesto" (http://www.cluetrain.com/book/index.html) is probably a good place to start for understanding exactly that.
MMC
On Sat, 20 Oct 2007, Hank Nussbacher wrote:
> There is enough info in that posting to bury them in frivolous lawsuits.
Only if they were stupid enough to move to the States.
-Bill
participants (9)
- Alex Rubenstein
- Bill Woodcock
- D'Arcy J.M. Cain
- Hank Nussbacher
- Jason Seemann
- Joe Abley
- Matthew Moyle-Croft
- Simon Lyall
- Steven Haigh