 
Just clearing up a small point about pilots (I'm a pilot) - the pilot-in-command has ultimate responsibility for his a/c and can ignore whatever ATC tells him to do if he considers it contrary to the safety of his flight (he may be asked to explain his actions later, though). Now, ignoring ATC or keeping it in the dark about one's intentions is usually not very clever - but controllers are not in the cockpit and may misunderstand the situation or simply be mistaken about something (so a pilot is encouraged to decline ATC instructions he considers to be in error - informing ATC about it, of course).

But one of the first things a pilot does in an emergency is pull out the appropriate emergency checklist. It is hard not to forget obvious things when things get hectic (one of the distressingly common causes of accidents is simply running out of fuel - either because the pilot didn't do his homework on the ground (checking the actual fuel level in the tanks, etc.) or because, when the engine suddenly got quiet, he forgot to switch to another, non-empty, tank).

The mantra about priorities in both normal and emergency situations is "Aviate-Navigate-Communicate", meaning that maintaining control of the a/c always comes first, no matter what. Knowing where you are and where you are going (and other pertinent situational awareness, such as the condition of the a/c and the current plan of action) comes second. Talking is the lowest priority.

Pre-planned emergency checklists may be a good idea for network operators. Try the obvious (when you're calm, that is) actions first; if they fail to help, try to limit the damage. Only then go file the ticket and talk to people who can investigate the situation in depth and develop a fix.
The way the aviation industry comes up with these checklists is, basically, experience - it pays to debrief after recovery from every problem not adequately fixed by existing procedures, find the common ones, and develop diagnostic procedures one can follow step-by-step in these situations. (The non-punitive error and incident reporting, which actually shields pilots from FAA enforcement actions in most cases, also helps to collect real-world information on where and how pilots get into trouble.)

The all-too-common multistep ticket escalation chains (which merely work as delay lines in a significant portion of cases) are something to be avoided. Even better is to give front-line personnel some drilling in diagnosis of and recovery from common problems - starting with following the checklist on a simulated outage in the lab, and then getting it down to what pilots call "the flow" - a habitual memorized procedure, which is performed first and then checked against the checklist.

Note that the use of checklists, drilling, and flows does not turn pilots into some kind of robots - they still have to make decisions, and recognize and deal with situations not covered by the standard procedures; what it does is speed up dealing with common tasks, reduce mistakes, and free up mental processing for thinking ahead.

The ISP industry has a long way to go until it reaches the same level of sophistication in handling problems as aviation has.

--vadim

On Fri, 25 Dec 2009, George Bonser wrote:
I think any network engineer who sees a major problem is going to have a "Houston, we have a problem" moment. And actually, he was telling the ATC what he was going to need to do; he wasn't getting permission so much as telling them what he was doing so traffic could be cleared out of his way. First he told them he was returning to the airport, then he inquired about Teterboro, the ATC called Teterboro to get a runway and inform them of an inbound emergency, then the Captain told the ATC they were going to be in the Hudson. And "I hit birds, have lost both engines, and am turning back" results in a whole different chain of events these days than "I have two guys banging on the cockpit door and am returning" or simply turning back toward the airport with no communication.

And any network engineer is going to say something if he sees CPU or bandwidth utilization hit the rail in either direction. Saying something like "we just got flooded with thousands of /24 and smaller wildly flapping routes from peer X and I am shutting off the BGP session until they get their stuff straight" is different than "we just got flooded with thousands of routes and it is blowing up the router and all the other routers talking to it. Can I do something about it?"
And that illustrates a key point. In that case the ATC was asking what the pilot needed and was prepared to clear traffic, get emergency equipment ready, whatever it took to give the person dealing with the problem what he needed to resolve it in the best way. The ATC wasn't asking him whether he was sure he had set the flaps at the right angle, or "did you try to restart the engine" sorts of things.
What I was getting at is that sometimes too much process can get in the way in an emergency and the time taken to implement such process can result in a failure cascading through the network making the problem much worse. I have much less of a problem with process surrounding planned events. The more the better as long as it makes sense. Migrations and additions and modifications *should* be well planned and checklisted and have backout points and procedures. That is just good operations when you have tight SLAs and tight maintenance windows with customers you want to keep.
Happy Holidays
George
From: Scott Howard
"mayday mayday mayday. Cactus fifteen thirty nine hit birds, we've lost thrust (in/on) both engines we're turning back towards LaGuardia" - Capt. Sullenberger
Not exactly "detailed", but he definitely initiated an "incident report" (the mayday), gave a "description of what was happening with his plane", the "status of [the relevant] subsystems", and his proposed plan of action - even in the order you've asked for!
His actions were then "subject to the consensus of those on the conference bridge" (ie, ATC), who could have denied his actions if they believed they would have made the situation worse (ie, if what he was proposing would have put him on a collision course with another plane). In this case, the conference bridge gave approval for his course of action ("ok uh, you need to return to LaGuardia? turn left heading of uh two two zero." - ATC)
5 seconds before they made the above call they were reaching for the QRH (Quick Reference Handbook), which contains checklists of the steps to take in such a situation - including what to do in the event of loss of both engines due to multiple birdstrikes. They had no need to confer with others as to what actions to take to try to recover from the problem, or what order to take them in, because that pre-work had already been carried out when the checklists were written.
Of course, at the end of the day, training, skill and experience played a very large part in what transpired - but so did the actions of the people on the "conference bridge" (You can't get much more of a "conference bridge" than open radio frequencies), and the checklists they have for almost every conceivable situation.
Scott.