-----Original Message----- From: Dobbins, Roland
On Dec 25, 2009, at 7:01 AM, Michael Dillon wrote:
It would be interesting to see what others have to say about this answer.
I think it's a pretty accurate summation of how these things work in a lot of big organizations, all over the world.
I think one must keep in mind that there are two kinds of checklists. There is the takeoff checklist, where you can always choose to go back to the ramp and fly another day if something doesn't check out. But the priority is different when you are already in the air and something goes wrong; you can't decide to land a different day. In that case you must rely on experience and knowledge to handle the situation as it presents itself. Sure, you can have some basic checks even in an emergency, but you can't know ahead of time how the problem will present itself. In cases like that you have a set of general parameters, but the person "at the controls" needs the leeway both to clearly identify the nature of the problem and to mitigate it if possible, and that might include calling in some extra eyes to spot things going on with applications or other devices that aren't specifically network gear.

So you can put a lot of process around changes in advance, but there isn't quite as much you can do to manage incidents that strike out of the clear blue. Too much process at that point can impede progress in clearing the issue. Capt. Sullenberger did not need to fill out an incident report, bring up a conference bridge, give a detailed description of what was happening with his plane, the status of all subsystems, and his proposed plan of action (subject to the consensus of those on the conference bridge), and get approval to deviate from his initial flight plan before taking the actions required to land the plane as best he could under the circumstances. That is a bit extreme for most networks, in the sense that lives are not often at stake, but some of the concepts are the same (and there may well be networks supporting occupations on this planet where lives actually are at stake if the network fails during some sort of activity).

One of the most efficient shops I worked in was one where the production internet operation was owned by the engineering department. Corporate operations owned the internal corporate IT, but engineering owned the internet production data centers and network operations. If engineering released a code revision that blew up the network, the VP of Engineering was responsible for the entire picture, not just the software piece. The same was true when a networking change blew up the application. Having responsibility for the entire "system" (software, hardware platforms, and networking) under one organization resulted in a much smoother operation, without backbiting, and with greater access to and sharing of resources among the application engineers, the systems administrators, and the network engineers.