Revisiting the Aviation Safety vs. Networking discussion
Those that remember the discussion may find this article interesting: http://abcnews.go.com/Health/wireStory?id=9394406 Owen
1. I grew up at the local airport watching my CFII pop train an endless stream of pilots.
2. The checklist for my last production gear swap had over 400 steps and 4 time/task gates (each with a rollback plan). As I did each sequence of steps, I called it out, and someone read their copy of the checklist and checked it off. An entire peanut gallery of rogues watched the whole thing on LiveMeeting, waiting to pounce on the first misstep or shortcut.
3. We migrated an entire nationwide phone system in 6 hours and nobody noticed anything.
4. We met afterward in an after-action review, a practice I picked up in the Army.
I'm more persistent than smart, and I tell ya, if you prep well enough, you can hand your checklist to a stoned intern and you'll have no worries at all. David
On Wed, Dec 23, 2009 at 12:48 PM, Owen DeLong <owen@delong.com> wrote:
Those that remember the discussion may find this article interesting:
http://abcnews.go.com/Health/wireStory?id=9394406
Owen
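David's description above (steps with rollback plans, time/task gates, and a second person checking each step off) maps onto a very small amount of structure. The sketch below is only an illustration of that shape: the step names, gate times, and the confirm callback are hypothetical, not taken from his actual 400-step plan.

```python
# Minimal sketch of a change checklist with call-out/check-off and a
# rollback plan per gate. Step names and gate times are invented; nothing
# here is from David's actual procedure.
from dataclasses import dataclass, field
from typing import Callable, List

@dataclass
class Step:
    text: str          # what the executor calls out
    rollback: str      # how to undo the work covered by this step

@dataclass
class Gate:
    deadline: str      # time/task gate, e.g. "02:30 UTC"
    steps: List[Step] = field(default_factory=list)

def run_checklist(gates: List[Gate], confirm: Callable[[str], bool]) -> bool:
    """Walk the plan; 'confirm' stands in for the second person checking
    each step off against their own copy (return False to abort)."""
    for gate in gates:
        for step in gate.steps:
            print(f"EXECUTE: {step.text}")
            if not confirm(step.text):
                print(f"ABORT before gate {gate.deadline}: {step.rollback}")
                return False
        print(f"GATE {gate.deadline} reached on plan")
    return True

# Two toy gates standing in for a 400-step plan.
plan = [
    Gate("02:00 UTC", [Step("drain traffic from gateway A", "restore original route preference")]),
    Gate("03:00 UTC", [Step("swap gateway A hardware", "reinstall original chassis")]),
]
run_checklist(plan, confirm=lambda step: True)
```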
I'm more persistent than smart, and I tell ya, if you prep well enough, you can hand your checklist to a stoned intern and you'll have no worries at all.
this works in a tech culture where folk follow mops obsessively. my experience is that most north american engineers are too smart to do that, and take shortcuts. randy
On Dec 24, 2009, at 9:51 AM, Randy Bush wrote:
I'm more persistent than smart, and I tell ya, if you prep well enough, you can hand your checklist to a stoned intern and you'll have no worries at all.
this works in a tech culture where folk follow mops obsessively. my experience is that most north american engineers are too smart to do that, and take shortcuts.
randy
Being a "North American Engineer", I resent that remark. =] I _do_ create action plans and _do_ quarterback each step and _do_ slap down any attempt to deviate. Eddy
On Dec 24, 2009, at 10:09 AM, Randy Bush wrote:
I _do_ create action plans and _do_ quarterback each step and _do_ slap down any attempt to deviate.
imagine a network engineering culture where the concept of 'attempt to deviate' just does not occur.
randy
=] The networking group is under control. It's the software engineers who start making edits to configs and code on the fly, improvisation at its finest. I guess my scope of interaction is greater than just networking. The hard part is that it's a peer situation: how do you elevate the members of another team who hold a lesser standard of operation? Also, they feel it's fine to act like cowboys and tackle problems on the fly, as long as the product is live before the window closes. Then there is the almighty "We can't back out, we already made too many changes" that makes me want to grab rope and attach it to the ceiling. Have a Merry Christmas, Eddy
Eddy Martinez wrote:
On Dec 24, 2009, at 10:09 AM, Randy Bush wrote:
I _do_ create action plans and _do_ quarterback each step and _do_ slap down any attempt to deviate. imagine a network engineering culture where the concept of 'attempt to deviate' just does not occur.
I find the thought of *any* culture in which attempts to deviate "just do not occur" a little unnerving. Jim Shankland http://blog.oliver-gassner.de/archives/225-Guenter-Eich,-Traeume.html
On Dec 24, 2009, at 1:09 PM, Randy Bush wrote:
I _do_ create action plans and _do_ quarterback each step and _do_ slap down any attempt to deviate.
imagine a network engineering culture where the concept of 'attempt to deviate' just does not occur.
Are you trying to suggest that this is something horrible, or that it's the future of network engineering? :) I'm actually serious in asking the question, despite the grin. -Dave
imagine a network engineering culture where the concept of 'attempt to deviate' just does not occur.
Are you trying to suggest that this is something horrible, or that it's the future of network engineering? :)
neither. it is one [type of] ops engineering culture, and a very successful one. it seems, from this gaijin's naive point of view, to be the common one in japan. when i try to 'sell' configuration automation, they are confused by how important it is to me. they have a hard time seeing the need because mops just work. my read is that this is because people do not have the arrogance to take shortcuts. when one is raised knowing that one's responsibility to the group is more important than how smart one may think that one is, mops work. randy
I _do_ create action plans and _do_ quarterback each step and _do_ slap down any attempt to deviate.
imagine a network engineering culture where the concept of 'attempt to deviate' just does not occur.
Are you trying to suggest that this is something horrible, or that it's the future of network engineering? :)
I'm actually serious in asking the question, despite the grin.
Possibly, he is trying to hint at a connection with Nazis, so somebody will mention it, invoking Godwin's Law, and bringing a fruitless religious thread to a close. There's a full range of methods, with "just do it" on one side, "deviation is grounds for dismissal" on the other, and plenty of shades of gray in between. I've seen both extremes result in excessive downtime. (How impromptu engineering can go wrong shouldn't take much imagination; the "no deviation" rule is especially hysterical when the backout plan doesn't work, but even without that, the "one thing didn't work exactly right, back it out and try again in two weeks" effect is destructive to both progress and morale.) Working with the dynamic and quality of the team is more important than any change management paradigm. -Dave
imagine a network engineering culture where the concept of 'attempt to deviate' just does not occur.
Are you trying to suggest that this is something horrible, or that it's the future of network engineering? :)
The model of network engineering that grew up during the 1990s is forever gone unless you work in a smaller organization where people have to wear many hats. In the big ISPs, now identical to the big telcos, operations and engineering design duties are separated. The operations folks do not deviate from the written plans that they work with. If the slightest thing happens that is not in the plan, they roll back the changes as specified in the plan. They don't fix anything unless it is officially broken, with trouble tickets filed and escalations up to senior management. That is about the only time that operations people can get away with taking shortcuts and creative solutions.

On the other hand, the engineering design folks should spend a good part of their day trying out things, thinking up new ideas, poking around equipment and software to see how far it can be pushed. Then, when they have learned something and are ready to implement it in the network, they write a detailed plan for operations. Then some other engineering folks test the heck out of that design to try and find fault with it. After all the faults are fixed, it goes to operations and the engineering design folks move on to something else, unless serious problems occur and operations needs a design engineer to approve some sensible action to be taken. The operations folk can't take the sensible action because that would deviate from their plans, but getting engineering design folks involved gives them an out for real emergencies.

So the term "network engineering" is ambiguous, because a lot of people use it to mean the 90's-style job where engineering design activity and operational activity were all jumbled together. In some companies, taking the engineering design track not only means that you lose enable on the routers, but you lose all TACACS access and have to get authorisation from a VP just to ask for a copy of the running config on a production router. Some people like ops because they see a lot of stuff go by and learn from it, get their CCIE and move into design engineering. Others like ops because they are scared of the responsibility for thinking up what to do next, and making a mistake.

As far as I can see, the only way to get a job that mixes ops and design is to be in 3rd or 4th level support, which is the top of the technical escalation chain, where a few excellent design engineers do have enable on the routers because they fix important problems in near realtime. I suspect that it would be advantageous to have a career in which you worked for a while in ops before moving into design engineering if you want to get into top-level support.

Take all this with a grain of salt. Every company does things a bit differently, and the terminology that is used is ambiguous. It would be interesting to see what others have to say about this answer. --Michael Dillon
On Dec 25, 2009, at 7:01 AM, Michael Dillon wrote:
It would be interesting to see what others have to say about this answer.
I think it's a pretty accurate summation of how these things work in a lot of big organizations, all over the world.

There's a detrimental side to it, in that in the engineering org, the near-complete siloing away from ops can lead to an ivory-tower/King Canute type of mentality; in the ops org, this phenomenon in turn can lead to increasing frustration and lowered morale, which in turn leads to apathy and poor customer service. All too often, one ends up with mutually-hostile engineering and ops teams who waste time and energy actively working to frustrate one another's ambitions, rather than combining their efforts to design, build, and operate the best network possible. Which in turn leads to many of the frustrations experienced every day by the end-customer.

-----------------------------------------------------------------------
Roland Dobbins <rdobbins@arbor.net> // <http://www.arbornetworks.com>

Injustice is relatively easy to bear; what stings is justice. -- H.L. Mencken
-----Original Message----- From: Dobbins, Roland
On Dec 25, 2009, at 7:01 AM, Michael Dillon wrote:
It would be interesting to see what others have to say about this answer.
I think it's a pretty accurate summation of how these things work in a lot of big organizations, all over the world.
I think that one must keep in mind that there are two kinds of check-lists. There is a takeoff list, where you can always choose to go back to the ramp and fly another day if something doesn't check out, but there is a different priority when someone is already in the air and something goes wrong. You can't decide to land a different day. In that case you must rely on experience and knowledge to handle the situation as it presents itself. Sure, you can have some basic checks for things even in an emergency, but you can't know how the problem is going to present itself ahead of time. In cases like that you have a set of general parameters, but the person "at the controls" needs to have leeway to both clearly identify the nature of the problem and mitigate the same if possible, and that might include calling in some extra eyes in order to identify things that might be going on with applications or other devices that aren't specifically network gear.

So you can put a lot of process around changes in advance, but there isn't quite as much to manage incidents that strike out of the clear blue. Too much process at that point could impede progress in clearing the issue. Capt. Sullenberger did not need to fill out an incident report, bring up a conference bridge, and give a detailed description of what was happening with his plane, the status of all subsystems, and his proposed plan of action (subject to consensus of those on the conference bridge) and get approval for deviation from his initial flight plan before he took the required actions to land the plane as best as he could under the circumstances. And while that is a bit extreme in the sense of most networks in that lives are not often at stake, some concepts are the same (and there might be networks supporting various occupations on this planet where lives might actually be at stake in the case of a network failure during some sort of activity).

One of the most efficient shops I worked in was one where the production internet operation was owned by the engineering department. Corporate operations owned the internal corporate IT, but engineering owned the internet production data centers and network operations. If engineering released a code revision that blew up the network, the VP of Engineering was responsible for the entire picture, not just the software piece. The same was true where a networking change blew up the application. Having the responsibility for the entire "system" (software, hardware platforms, and networking) under the same organization resulted in a lot smoother operation, without backbiting and with greater access to and sharing of resources between the application engineers, the systems administrators, and the network engineers.
On Dec 25, 2009, at 9:27 AM, George Bonser wrote:
Capt. Sullenberger did not need to fill out an incident report, bring up a conference bridge, and give a detailed description of what was happening with his plane, the status of all subsystems, and his proposed plan of action (subject to consensus of those on the conference bridge) and get approval for deviation from his initial flight plan before he took the required actions to land the plane as best as he could under the circumstances.
Conversely, the ever-increasing outright hostility and contempt evinced towards their customers by airlines worldwide - especially US-based airlines - over the last decade or so, all in the name of 'regulations', offers a useful counterexample. When it comes to larger organizations, this latter scenario is more the norm than what you describe, in my experience.

Critical problems are left unresolved for days/weeks/months; if one attempts to report an issue which is causing problems for many of an organization's customers worldwide, but one isn't oneself a direct customer of said organization, one is as often as not ignored and shunted aside. This isn't specific to the SP realm; it's simply a function of increased size, which leads to increased bureaucratization, which leads to dehumanization and the subordination of the organization's ostensible goals to internal politics, one-upsmanship, and blame-laying, no matter the industry in question. The folks with a can-do attitude who're willing to buck the system in order to do the right thing for the customer stand out in stark contrast to their peers, and in many cases end up paying a price in terms of career advancement because of their willingness to Do The Right Thing.

'Process' is all too often merely a ruse designed to avoid responsibility, shift blame/liability, justify hiring lower-cost/unqualified employees whilst shedding expensive/competent employees, and indulge in empire-building. We've seen this throughout corporate America with the 'permanent Y2K' of SoX and HIPAA, and the increasing involvement of government in terms of telecommunications-related rule-making which ends up directly affecting SPs.

I'm a big advocate of standards and change-control, and not an advocate of seat-of-the-pants, midnight engineering - except when the latter is necessary, as in the examples you give. Unfortunately, many folks who work in larger organizations are actively prohibited from indulging in fluid, situationally-appropriate problem resolution; and because of the aforementioned siloing of ops and engineering, their valuable first-hand experiences and the lessons learned thereby aren't taken into account during the design and rulemaking processes.

-----------------------------------------------------------------------
Roland Dobbins <rdobbins@arbor.net> // <http://www.arbornetworks.com>

Injustice is relatively easy to bear; what stings is justice. -- H.L. Mencken
On Thu, Dec 24, 2009 at 6:27 PM, George Bonser <gbonser@seven.com> wrote:
So you can put a lot of process around changes in advance but there isn't quite as much to manage incidents that strike out of the clear blue. Too much process at that point could impede progress in clearing the issue. Capt. Sullenberger did not need to fill out an incident report, bring up a conference bridge, and give a detailed description of what was happening with his plane, the status of all subsystems, and his proposed plan of action (subject to consensus of those on the conference bridge) and get approval for deviation from his initial flight plan before he took the required actions to land the plane as best as he could under the circumstances.
"*mayday mayday mayday. **Cactus fifteen thirty nine hit birds, we've lost thrust (in/on) both engines we're turning back towards LaGuardia*" - Capt. Sullenberger Not exactly "detailed", but he definitely initiated an "incident report" (the mayday), gave a "description of what was happening with his plane", the "status of [the relevant] subsystems", and his proposed plan of action - even in the order you've asked for! His actions were then "subject to the consensus of those on the conference bridge" (ie, ATC) who could have denied his actions if they believed they would have made the situation worse (ie, if what they were proposing would have had them on a collision course with another plane). In this case, the conference bridge gave approval for his course of action ("*ok uh, you need to return to LaGuardia? turn left heading of uh two two zero.*" - ATC) 5 seconds before they made the above call they were reaching for the QRH (Quick Reference Handbook), which contains checklists of the steps to take in such a situation - including what to do in the event of loss of both engines due to multiple birdstrikes. They had no need to confer with others as to what actions to take to try and recover from the problem, or what order to take them in, because that pre-work had already been carried out when the check-lists were written. Of course, at the end of the day, training, skill and experience played a very large part in what transpired - but so did the actions of the people on the "conference bridge" (You can't get much more of a "conference bridge" than open radio frequencies), and the checklists they have for almost every conceivable situation. Scott.
I think any network engineer who sees a major problem is going to have a "Houston, we have a problem" moment. And actually, he was telling the ATC what he was going to need to do; he wasn't getting permission so much as telling them what he was doing so traffic could be cleared out of his way. First he told them he was returning to the airport, then he inquired about Teterboro, the ATC called Teterboro to get a runway and inform them of an inbound emergency, and then the Captain told the ATC they were going to be in the Hudson. And "I hit birds, have lost both engines, and am turning back" results in a whole different chain of events these days than "I have two guys banging on the cockpit door and am returning" or simply turning back toward the airport with no communication.

And any network engineer is going to say something if he sees CPU or bandwidth utilization hit the rail in either direction. Saying something like "we just got flooded with thousands of /24 and smaller wildly flapping routes from peer X and I am shutting off the BGP session until they get their stuff straight" is different than "we just got flooded with thousands of routes and it is blowing up the router and all the other routers talking to it. Can I do something about it?"

And that illustrates a point that is key. In that case the ATC was asking what the pilot needed and was prepared to clear traffic, get emergency equipment prepared, whatever it took to get that person dealing with the problem whatever they needed to get it resolved in the best way forward. The ATC wasn't asking him if he was sure he set the flaps at the right angle and "did you try to restart the engine" sorts of things.

What I was getting at is that sometimes too much process can get in the way in an emergency, and the time taken to implement such process can result in a failure cascading through the network, making the problem much worse. I have much less of a problem with process surrounding planned events. The more the better, as long as it makes sense. Migrations and additions and modifications *should* be well planned and checklisted and have backout points and procedures. That is just good operations when you have tight SLAs and tight maintenance windows with customers you want to keep.

Happy Holidays

George
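George's "flooded with thousands of /24 and smaller wildly flapping routes" case is concrete enough to sketch the kind of guard an operator might watch before making that call. The snippet below is purely illustrative: it assumes a pre-collected list of (prefix_length, update_count) observations rather than any real BGP feed or router API, and every threshold in it is invented.

```python
# Illustrative only: decide whether a batch of observed routes from one peer
# looks like the flood George describes. Input is an assumed list of
# (prefix_length, updates_seen) tuples; thresholds are invented, not advice.
def route_flood_suspected(observations, min_len=24,
                          max_small_prefixes=1000, flap_threshold=5):
    small = [(plen, updates) for plen, updates in observations if plen >= min_len]
    flapping = [obs for obs in small if obs[1] >= flap_threshold]
    if len(small) > max_small_prefixes and flapping:
        return (f"{len(small)} prefixes of /{min_len} or longer from this peer, "
                f"{len(flapping)} of them flapping; time to say so and consider "
                f"shutting the session, as in George's example")
    return None

# Hypothetical feed: 3000 /25s, each announced and withdrawn repeatedly.
print(route_flood_suspected([(25, 8)] * 3000))
```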
Just clearing a small point about pilots (I'm a pilot) - the pilot-in-command has ultimate responsibility for his a/c and can ignore whatever ATC tells him to do if he considers that to be contrary to the safety of his flight (he may be asked to explain his actions later, though). Now, usually ignoring ATC or keeping it in the dark about one's intentions is not very clever - but controllers are not in the cockpit and may misunderstand the situation or be simply mistaken about something (so a pilot is encouraged to decline ATC instructions he considers to be in error - informing ATC about it, of course).

But one of the first things a pilot does in an emergency is pull out the appropriate emergency checklist. It is kind of hard to keep from forgetting to check obvious things when things get hectic (one of the distressingly common causes of accidents is simply running out of fuel - either because the pilot didn't do his homework on the ground (checking the actual fuel level in the tanks, etc.) or because when the engine got suddenly quiet he forgot to switch to another, non-empty, tank). The mantra about priorities in both normal and emergency situations is "Aviate-Navigate-Communicate", meaning that maintaining control of the a/c always comes first, no matter what. Knowing where you are and where you are going (and other pertinent situational awareness such as the condition of the a/c and the current plan of action) comes second. Talking is the lowest priority.

The pre-planned emergency checklists may be a good idea for network operators. Try the obvious (when you're calm, that is) actions first; if they fail to help, try to limit the damage. Only then go file the ticket and talk to people who can investigate the situation in depth and can develop a fix. The way the aviation industry comes up with these checklists is, basically, experience - it pays to debrief after recovery from every problem not adequately fixed by existing procedures, find the common ones, and develop a diagnostic procedure one could follow step-by-step for these situations. (The non-punitive error and incident reporting, which actually shields pilots from FAA enforcement actions in most cases, also helps to collect real-world information on where and how pilots get into trouble.)

The all-too-common multistep ticket escalation chains (which merely work as delay lines in a significant portion of cases) are something to be avoided. Even better is to provide some drilling in diagnosis of and recovery from common problems to the front-line personnel - starting from following the checklist on a simulated outage in the lab, and then getting it down to what pilots call "the flow" - a habitual memorized procedure, which is performed first and then checked against the checklist. Note that the use of checklists, drilling, and flows does not make pilots a kind of robots - they still have to make decisions, and recognize and deal with situations not covered in the standard procedures; what it does is speed up dealing with common tasks, reduce mistakes, and free up mental processing for thinking ahead.

The ISP industry has a long way to go until it reaches the same level of sophistication in handling problems as aviation has.

--vadim
On Fri, 25 Dec 2009, Vadim Antonov wrote:
The ISP industry has a long way to go until it reaches the same level of sophistication in handling problems as aviation has.
Well, to counter this one might talk about the medical business (doctors), which hasn't been able to embrace checklists at all (apart from in a few places); they still consider their profession to be a craft, just like most network engineers do. It's the classical "good/fast/cheap, please pick two". Aviation is slow and careful to bring in new tech, and the same goes for the health care side; they're both very conservative. We in the network business are still immature but quick and flexible, but as time goes on, our services become more and more important, and thus things settle in and slow down, but become more reliable. This is an evolution that'll take quite some time, but it's already changed a lot over the past 10 years.

There was quite a buzz regarding doctor checklists a few years back; I read several articles about it, but now I can't find the one I want. <http://www.healthbeatblog.org/2007/12/pilots-use-chec.html> talks a bit about the topic.

-- Mikael Abrahamsson email: swmike@swm.pp.se
On Fri, Dec 25, 2009 at 5:44 AM, Vadim Antonov <avg@kotovnik.com> wrote:
The ISP industry has a long way to go until it reaches the same level of sophistication in handling problems as aviation has.
It seems that there's a logical fallacy floating around somewhere (networks have parts and are complicated, airplanes and flight involve lots of parts and are also complicated, therefore aircraft are like networks). I assert that comparing 'packet switching' to an industry that has its roots in the late 1800's and had its first "hello world" moment in 1903 isn't terribly fruitful. Further, aircraft are the asymptotic limit of 'singly homed transit.' Because of this, I think one could argue that pilots and ATC must be held to a different professional standard due to the nature of public trust at risk. At the other end of our strawman spectrum, we have end users who must accept the risk that their provider will be unable to connect them to lolcats.com on occasion, perhaps as often as 0.01% per year, and most are happy to accept this. Four nines survivability on flights, clearly, won't work.

What I'm getting at is that after following this thread for a while, I'm not convinced any amount of process-borrowing is going to solve problems better, faster, or even avoid them in the first place. At best, our craft is 1/3rd as "old" (if that's somehow a measure of maturity) as flight, and nobody is being sued to settle 200+ accidental deaths because of our mistakes. -Tk
What I'm getting at is that after following this thread for a while, I'm not convinced any amount of process-borrowing is going to solve problems better, faster, or even avoid them in the first place. At best, our craft is 1/3rd as "old" (if that's somehow a measure of maturity) as flight and nobody is being sued to settle 200+ accidental deaths because of our mistakes.
-Tk
Not now, that is true, but when you look at things that are "on the drawing board", such as systems designed to manage automobile traffic flows, networks that are used to fly UAVs, or networks that keep track of "friendly" units in combat, where the technology might someday migrate to civilian law enforcement and/or emergency services (keeping track of where firefighters are in a building or at a wildfire, for example), I can see situations in the future where people's lives could be dependent on networks working properly, or at least endangered if a network fails.

But my original intent was to point out that there are two kinds of process for two different kinds of circumstances, and the sort of process surrounding routine changes might not be the best process for handling emergency changes. I have seen examples of places that want to handle emergency changes with the same sort of process they use for routine changes, and those places can be frustrating to work with when stuff is broken. My goal was to give managers of networks who might read this the idea that when the fan is in an unsavory condition, more can get done by shifting from a mode of questioning, analyzing and second-guessing everything the engineer is doing to a mode where the organization is responding to immediate needs, clearing obstacles out of the way, and documenting as best they can what is done and when, to make the debriefing afterwards easier.

AFTER the incident is the time to go over what was done, think about how it was dealt with, consider any changes in emergency process that might have shortened the duration, etc. In fact the "What could we have done differently that would have shortened the duration of the outage?" question is pretty important. The answer might be "nothing", and that is OK, too, but the question should be asked.
I can see situations in the future where people's lives could be dependent on networks working properly, or at least endangered if a network fails.
Actually it's not the future. My father's design bureau has been making hardware (including network stuff) since the '70s for running industrial processes of a kind where a software crash or a network malfunction was usually associated with casualties. Gas pipelines, power plants, electric grids, stuff like that. That's a completely different class of hardware, more of the kind you'd find in avionics - modules in triplicate, voting, pervasive error correction, etc. Software was also designed differently, with a lot more review processes, and with data structures designed for integrity checking (I still use this trick in my work, which saves me a lot of grief during debugging) and recovery from memory corruption and such. I'd be seriously loath to put any of the current crop of COTS network boxes into a life-critical network. --vadim
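Vadim doesn't spell out the trick, but one common form of "data structures designed for integrity checking" is to have each structure carry a magic tag and a checksum over its own fields and verify them before use, so corruption is caught at the point of access instead of much later. The sketch below is a generic illustration of that idea, not the industrial code he describes.

```python
# Generic illustration of a self-checking data structure: a magic tag plus a
# checksum over the fields, verified before use. Not Vadim's actual code.
import zlib

MAGIC = 0x5AFE

class CheckedRecord:
    def __init__(self, name: str, value: int):
        self.magic = MAGIC
        self.name = name
        self.value = value
        self.crc = self._crc()

    def _crc(self) -> int:
        return zlib.crc32(f"{self.name}:{self.value}".encode())

    def validate(self) -> None:
        if self.magic != MAGIC or self.crc != self._crc():
            raise ValueError("record failed integrity check (possible corruption)")

r = CheckedRecord("line-card-3", 42)
r.validate()              # passes
r.value = 99              # simulate corruption without updating the checksum
try:
    r.validate()
except ValueError as err:
    print(err)            # caught at the point of use, which is the point of the trick
```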
On 12/25/09 7:57 AM, Anton Kapela wrote:
What I'm getting at is that after following this thread for a while, I'm not convinced any amount of process-borrowing is going to solve problems better, faster, or even avoid them in the first place. At best, our craft is 1/3rd as "old" (if that's somehow a measure of maturity) as flight and nobody is being sued to settle 200+ accidental deaths because of our mistakes.
So, we're supposed to make the mistakes of aviation, nuclear power, the chemical industry (e.g. Bhopal), oil production & refining, etc., all over again? Checklists and MOPs are but one of the things we ignore from other industries. Some others:

o Increasing complexity and tight coupling lead to systemic failures. Simply grafting redundancy onto complex systems can make them less, not more, reliable. Yet this is the trend in networking. "Want bells and whistles, firewalls, load-balancers, rate-limiters in your network? You can have 'em without sacrificing reliability if you just buy two of 'em!"

o The gradual acceptance of components or procedures that have adequate reliability for a certain task (say, research) but are not reliable enough for another task (e.g. being a critical part of a 1,000 megawatt nuclear power plant), without understanding the implications. Do we know how our technology is being used and will be used? Will the people adopting IP for everything (the "smart grid," VoIP, life-supporting functions) fail to see these implications just as the people who shoved a fissile core into a pressure vessel did?

This last point directly contradicts the theme of your message. The notion that what we do is not (yet) a matter of life-or-death has bitten other industries in the past, and it provides a nice illustration of why we should *not* be ignoring their lessons. michael
On Dec 25, 2009, at 7:57 AM, Anton Kapela wrote:
On Fri, Dec 25, 2009 at 5:44 AM, Vadim Antonov <avg@kotovnik.com> wrote:
The ISP industry has a long way to go until it reaches the same level of sophistication in handling problems as aviation has.
It seems that there's a logical fallacy floating around somewhere (networks have parts and are complicated, airplanes and flight involve lots of parts and are also complicated, therefore aircraft are like networks). I assert that comparing 'packet switching' to an industry that has its roots in the late 1800's and had its first "hello world" moment in 1903 isn't terribly fruitful.
As someone with a fair amount of experience with both, I have to disagree with you. Yes, there are differences, and, yes you have to keep comparisons and the like in perspective, but, there are definitely areas where networking could learn from aviation, and, to some extent, vice versa.
Further, aircraft are the asymptotic limit of 'singly homed transit.' Because of this, I think one could argue that pilots and ATC must be held to a different professional standard due to the nature of public trust at risk. At the other end of our strawman spectrum, we have end users who must accept the risk that their provider will be unable to connect them to lolcats.com on occasion, perhaps as often as 0.01% per year, and most are happy to accept this. Four nines survivability on flights, clearly, won't work.
Correct... As I stated in my earliest posts on this subject, while there is value to be obtained in looking at how aviation has improved its safety/reliability record over the years, there is also value in recognizing the cost/benefit ratio of some of those improvements.

If you draw a graph with one curve arcing from bottom left towards upper right, steepening as it goes to the right, that curve can be thought of as the cost of achieving additional reliability. A second curve sloping from top left to bottom right, flattening out as it goes to the right, can be thought of as the gains achieved from those additional 9s of reliability. Finally, the point where those two curves intersect is defined by the cost of outages and/or downtime.

Interestingly, this same diagram will be familiar to most pilots, but the two arcs will be induced drag (drag from producing lift) and parasite drag (drag from friction with the air). The point where they meet is called "L/D Max" and is the airspeed at which the given aircraft will achieve its best glide ratio.
What I'm getting at is that after following this thread for a while, I'm not convinced any amount of process-borrowing is going to solve problems better, faster, or even avoid them in the first place. At best, our craft is 1/3rd as "old" (if that's somehow a measure of maturity) as flight and nobody is being sued to settle 200+ accidental deaths because of our mistakes.
There are lessons to be learned that are valuable. Both from things aviation has done well that we could emulate, and, from things aviation has done poorly that we should avoid. There are also additional lessons to be learned about the differences in cost/benefit analysis between the two disciplines. Owen
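Owen's two-curve picture and the drag analogy share the same shape: one component grows with the quantity on the x-axis, the other shrinks, and the interesting point is where they cross. The toy numbers below are invented purely to show that shape; read "v" as airspeed for the drag version, or as effort spent on reliability for the networking version.

```python
# Toy illustration of the two-curve argument. Parasite drag grows roughly
# with v^2, induced drag shrinks roughly with 1/v^2, and total drag is
# smallest where the two are equal (L/D max). Coefficients are invented.
a, b = 0.02, 5000.0               # hypothetical drag coefficients

def parasite(v): return a * v ** 2
def induced(v):  return b / v ** 2
def total(v):    return parasite(v) + induced(v)

v_cross = (b / a) ** 0.25         # where parasite(v) == induced(v)
v_best = min(range(10, 61), key=total)

print(f"curves cross at v = {v_cross:.1f}; numeric minimum of total drag near v = {v_best}")
```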
At 03:38 PM 12/28/2009, Owen DeLong wrote:
There are lessons to be learned that are valuable. Both from things aviation has done well that we could emulate, and, from things aviation has done poorly that we should avoid. There are also additional lessons to be learned about the differences in cost/benefit analysis between the two disciplines.
Agreed. "You have to learn from the mistakes of others because you won't live long enough to make them all yourself." -Admiral Rickover -Robert "Well done is better than well said." - Benjamin Franklin
On Dec 25, 2009, at 5:44 AM, Vadim Antonov wrote:
The pre-planned emergency checklists may be a good idea for network operators. Try the obvious (when you're calm, that is) actions first; if they fail to help, try to limit the damage. Only then go file the ticket and talk to people who can investigate the situation in depth and can develop a fix.
This is why the US Government runs various events to test its procedures and processes, and invites the private sector to participate.

Cyberstorm III planning is underway. If you want to participate, let me know; I'll connect you with the right groups. ISP participation would be incredibly valuable; it hasn't always been there, to the point where someone ends up playing the role of the whole Internet. There are other events (Topoff, NLE, etc.) that are run at state or regional levels.

As much as everyone here derides the US Gov't, the new cybersecurity wonk, etc., I would love to see more people engaged in these activities rather than implying some disconnected group is running things. They WANT industry to participate, but getting the actual neteng types involved is something they don't know how to reach out to properly. Let's bridge this gap.

I personally see a LOT of networking and computing failures from not employing a BCP in operations. These events are a chance to test your process and procedures, and quite possibly improve them. - Jared
On Fri, 25 Dec 2009, Jared Mauch wrote:
Cyberstorm III planning is underway. If you want to participate, let me know, I'll connect you with the right groups. ISP participation would be incredibly valuable, it's not always been there to the point where someone plays the role of the whole Internet.
And if you are not currently working at an ISP, DHS and lots of other parts of US Government are trying to hire folks with network and computer security knowledge. http://www.washingtonpost.com/wp-dyn/content/article/2009/12/22/AR2009122203... http://www.usajobs.gov/
As much as everyone here derides the US Gov't, the new cybersecurity wonk, etc.. I would love to see more people engaged in these activities than implying some disconnected group is running things. They WANT industry to participate, but getting the actual neteng types involved is something they don't know how to reach out to properly.
Industry, academics, the public: just about anyone willing to participate can do something. There are reasons for the rules, processes and procedures governments need to follow, but a good idea is a good idea. Unfortunately, it's sometimes harder to figure out what is a bad idea before it causes trouble, because it looked like a good idea at the time. Or to put it another way: if you don't engage, some decisions eventually still have to get made based on what the people who did engage told them. Suggestions on how to do something better help make better decisions. -- on vacation, just speaking for myself.
On Thu, 24 Dec 2009, Scott Howard wrote:
His actions were then "subject to the consensus of those on the conference bridge" (ie, ATC) who could have denied his actions if they believed they would have made the situation worse (ie, if what they were proposing would have had them on a collision course with another plane).
This has been mentioned by others in this thread, but not to the level of importance I think it represents. I, too, am a pilot. The pilot in command of an aircraft always has the final say on the safety of the flight, not the controller, and not the design engineers. If the pilot in command violates the rules and the result is negative (crash, loss of separation, etc.), you better believe there will be questions to be answered, and a possible loss of the pilot's license (or life!) may result. On the other hand, if the pilot's decision to violate the rules results in a positive outcome (see Sullenberger, or any other number of emergencies that happen every day that you never hear about), there will often never even be a single question about why the rules were violated.

This can be applied directly to network engineering work. If I assign an engineer to do a network change, yes, they better have a plan/checklist/etc. before they start, and they better follow it. When things go wrong, I expect that engineer to make the right decisions to minimize the damage. Sometimes that means following the rules to the letter. Sometimes that means breaking the rules. If the rules are broken, there darn better be a good reason for it, but frankly, a good engineer will always have a good explanation, just like the good pilot. Rigid procedures are no better than the lack of procedures. Process is very important, don't get me wrong, but so is the knowledge and experience to know when you should throw them out the door. Any organization that doesn't recognize that is doomed to inefficiency at best, and failure at worst.

-- Brandon Ross, Director of Network Engineering, Xiocom Wireless
AIM: BrandonNRoss / ICQ: 2269442 / Skype: brandonross / Yahoo: BrandonNRoss
On Dec 24, 2009, at 11:08 PM, Scott Howard wrote:
On Thu, Dec 24, 2009 at 6:27 PM, George Bonser <gbonser@seven.com> wrote:
So you can put a lot of process around changes in advance but there isn't quite as much to manage incidents that strike out of the clear blue. Too much process at that point could impede progress in clearing the issue. Capt. Sullenberger did not need to fill out an incident report, bring up a conference bridge, and give a detailed description of what was happening with his plane, the status of all subsystems, and his proposed plan of action (subject to consensus of those on the conference bridge) and get approval for deviation from his initial flight plan before he took the required actions to land the plane as best as he could under the circumstances.
"*mayday mayday mayday. **Cactus fifteen thirty nine hit birds, we've lost thrust (in/on) both engines we're turning back towards LaGuardia*" - Capt. Sullenberger
Not exactly "detailed", but he definitely initiated an "incident report" (the mayday), gave a "description of what was happening with his plane", the "status of [the relevant] subsystems", and his proposed plan of action - even in the order you've asked for!
Exactly.
His actions were then "subject to the consensus of those on the conference bridge" (ie, ATC) who could have denied his actions if they believed they would have made the situation worse (ie, if what they were proposing would have had them on a collision course with another plane). In this case, the conference bridge gave approval for his course of action ("ok uh, you need to return to LaGuardia? turn left heading of uh two two zero." - ATC)
Not exactly. If the others on the bridge don't consent, FAR 91.3 gives him full and absolute authority to tell them to screw themselves and do what he feels is best. FAR 91.3 reads:

Responsibility and authority of the pilot in command.
(a) The pilot in command of an aircraft is directly responsible for, and is the final authority as to, the operation of that aircraft.
(b) In an in-flight emergency requiring immediate action, the pilot in command may deviate from any rule of this part to the extent required to meet that emergency.
(c) Each pilot in command who deviates from a rule under paragraph (b) of this section shall, upon the request of the Administrator, send a written report of that deviation to the Administrator.

As near as I can tell, that regulation was last modified in 1963.
5 seconds before they made the above call they were reaching for the QRH (Quick Reference Handbook), which contains checklists of the steps to take in such a situation - including what to do in the event of loss of both engines due to multiple birdstrikes. They had no need to confer with others as to what actions to take to try and recover from the problem, or what order to take them in, because that pre-work had already been carried out when the check-lists were written.
Yep.
Of course, at the end of the day, training, skill and experience played a very large part in what transpired - but so did the actions of the people on the "conference bridge" (You can't get much more of a "conference bridge" than open radio frequencies), and the checklists they have for almost every conceivable situation.
And in case there are any misconceptions here on the list, I know that in the public eye, there is often a lot of distrust and/or perceived animosity between controllers and pilots. Frankly, this is a misconception for the most part. Sure, there are incidents where pilots and controllers don't get along, each blaming the other. However, by and large, both groups are consummate professionals doing their best to make sure flights end well. In my years as a pilot, I have had more than one occasion to be very thankful for ATC and the services they provide. Generally, they are a very helpful and hardworking group. I respect them greatly and appreciate the tough job they do. Owen (Commercial Pilot, ASEL, Instrument Airplane)
imagine a network engineering culture where the concept of 'attempt to deviate' just does not occur.
Are you trying to suggest that this is something horrible, or that it's the future of network engineering? :)
Shops where engineering and operations function separately can suffer from reduced efficiencies. A recent example comes to mind. Vendor X was onsite turning up some equipment, including a small VPN concentrator for remote access. It was a new model of VPN concentrator that the installers hadn't worked with before. They used "scripts", a set of CLI commands with field-replaceable variables for site-specific parameters, to configure the device. But connections to the VPN were failing. After trying different versions of the scripts (for similar models) they "broke down" and called their internal tech support department for help. Total turn-up time for the concentrator: 8+ hours. There wasn't that much wrong with the script that kept it from working, but the ops folks lacked the training to understand the problems and fix them. On the other hand, the engineering folks should probably have produced a more robust set of scripts.

While I have no experience with this myself, it would seem good practice that every project, including the actual turn-up, include representation from engineering. This automatically creates a liaison between the two groups and keeps the engineer abreast of "real world" issues.

Frank
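Frank's "scripts with field-replaceable variables for site-specific parameters" are essentially config templates, and one cheap way to make them more robust is to refuse to render when a required field is missing. The sketch below is only an illustration: the parameter names and CLI lines are made up, not the vendor's actual syntax.

```python
# Sketch of a CLI "script" with field-replaceable variables, with the small
# robustness improvement of validating the site parameters before rendering.
# Parameter names and CLI lines are invented, not any vendor's real syntax.
from string import Template

REQUIRED = ("site_name", "mgmt_ip", "mgmt_mask", "pool_start", "pool_end")

CONFIG_TEMPLATE = Template(
    "hostname $site_name\n"
    "interface mgmt0\n"
    " ip address $mgmt_ip $mgmt_mask\n"
    "vpn-pool remote-users $pool_start $pool_end\n"
)

def render_site_config(params: dict) -> str:
    missing = [key for key in REQUIRED if key not in params]
    if missing:
        raise ValueError(f"refusing to render; missing site parameters: {missing}")
    return CONFIG_TEMPLATE.substitute(params)

print(render_site_config({
    "site_name": "branch-17",
    "mgmt_ip": "192.0.2.10", "mgmt_mask": "255.255.255.0",
    "pool_start": "198.51.100.10", "pool_end": "198.51.100.250",
}))
```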
On Thu, Dec 24, 2009 at 01:09:26PM -0500, Randy Bush wrote:
I _do_ create action plans and _do_ quarterback each step and _do_ slap down any attempt to deviate.
imagine a network engineering culture where the concept of 'attempt to deviate' just does not occur.
Whimsical deviations don't belong in the maintenance execution; they belong in the brainstorming and design. Gather more points of view during the peer review of the specification of work. In my experience, good engineering makes for bad drama (and conversely, if it is a "dramatic save" then you have a bad engineer and likely a cowboy). Have a plan that executes in stages, tests at checkpoints where partial completion is possible, and a fallback for each step. It's a great way to train up junior people, document as you go, and expose flaws and lines of future investigation, and if things go south you escalate to those who can judge *reasonable* new directions.

To me, that kind of change management for non-automatable work is a descendant of reasonable group work. If you have project-oriented autonomous teams that stick to the guideposts of "your standards" and "minimal disruptions/maximal uptime", then good work will emerge. As for automation, it enables your expensive humans to do more smart things, so it should always be incorporated in processes and be something people move toward, IMO.

Cheers,

Joe

-- RSUC / GweepNet / Spunk / FnB / Usenix / SAGE
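Joe's "stages, tests at checkpoints, and a fallback for each step" maps directly onto a small execution loop. The sketch below is a generic illustration with invented stage names and no-op actions; it isn't anyone's production tooling.

```python
# Generic sketch of a staged plan: run each stage, test at its checkpoint,
# and run that stage's fallback (then stop and escalate) if the check fails.
# Stage names, checks, and actions are invented for illustration.
from typing import Callable, List, NamedTuple

class Stage(NamedTuple):
    name: str
    execute: Callable[[], None]
    checkpoint: Callable[[], bool]    # partial completion can be acceptable here
    fallback: Callable[[], None]

def run_plan(stages: List[Stage]) -> bool:
    completed = []
    for stage in stages:
        stage.execute()
        if not stage.checkpoint():
            print(f"checkpoint failed after '{stage.name}'; running its fallback")
            stage.fallback()
            print(f"escalating with completed stages: {completed}")
            return False
        completed.append(stage.name)
    print(f"all stages passed: {completed}")
    return True

run_plan([
    Stage("pre-stage configs", lambda: None, lambda: True, lambda: None),
    Stage("shift traffic to the new path", lambda: None, lambda: True, lambda: None),
])
```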
participants (21)
- Anton Kapela
- bross@pobox.com
- Dave Israel
- David Andersen
- David Hiers
- Dobbins, Roland
- Eddy Martinez
- Frank Bulk
- George Bonser
- Jared Mauch
- Jim Shankland
- Joe Provo
- Michael Dillon
- Michael Sinatra
- Mikael Abrahamsson
- Owen DeLong
- Randy Bush
- Robert Boyle
- Scott Howard
- Sean Donelan
- Vadim Antonov