Mitigating human error in the SP

Hello NANOG,
Long time listener, first time caller.
A recent organizational change at my company has put someone in charge who is determined to make things perfect. We are a service provider, not an enterprise company, and our business is doing provisioning work during the day. We recently experienced an outage when an engineer, troubleshooting a failed turn-up, changed the ethertype on the wrong port, losing both management and customer data on said device. This isn't a common occurrence, and the engineer in question has a pristine track record.
This outage, of a high profile customer, triggered upper management to react by calling a meeting just days after. Put bluntly, we've been told "Human errors are unacceptable, and they will be completely eliminated. One is too many."
I am asking the respectable NANOG engineers....
What measures have you taken to mitigate human mistakes?
Have they been successful?
Any other comments on the subject would be appreciated, we would like to come to our next meeting armed and dangerous.
Thanks! Chad

On Tue, Feb 2, 2010 at 7:51 AM, Chadwick Sorrell <mirotrem@gmail.com> wrote:
This outage, of a high profile customer, triggered upper management to react by calling a meeting just days after. Put bluntly, we've been told "Human errors are unacceptable, and they will be completely eliminated. One is too many."
Automated config deployment / provisioning. And sanity checking before deployment. -- Suresh Ramasubramanian (ops.lists@gmail.com)
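To make "sanity checking before deployment" concrete, here is a minimal sketch of the idea in Python; the specific rules, port names and ethertype values are invented placeholders rather than anything Suresh described:

# Minimal pre-deployment sanity checks on a generated config (illustrative only).
import re, sys

def sanity_check(config_text, expected_port):
    errors = []
    # 1. The change must touch only the port we were asked to provision.
    for port in re.findall(r"^interface (\S+)", config_text, re.MULTILINE):
        if port != expected_port:
            errors.append(f"config touches unexpected port {port}")
    # 2. Never push a config that alters the management interface (placeholder names).
    if re.search(r"^interface (Vlan1|mgmt0)\b", config_text, re.MULTILINE):
        errors.append("config touches the management interface")
    # 3. Ethertype values, if present, must be ones we actually use (assumed list).
    for et in re.findall(r"ethertype (0x[0-9a-fA-F]+)", config_text):
        if et.lower() not in ("0x8100", "0x88a8", "0x9100"):
            errors.append(f"unexpected ethertype {et}")
    return errors

if __name__ == "__main__":
    candidate = open(sys.argv[1]).read()          # config produced by the template system
    problems = sanity_check(candidate, expected_port=sys.argv[2])
    if problems:
        sys.exit("REFUSING TO DEPLOY:\n" + "\n".join(problems))
    print("sanity checks passed")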

On Feb 2, 2010, at 10:28 AM, Suresh Ramasubramanian wrote:
Automated config deployment / provisioning. And sanity checking before deployment.
A lab in which changes can be simulated and rehearsed ahead of time, new OS revisions tested, etc. A DCN. ----------------------------------------------------------------------- Roland Dobbins <rdobbins@arbor.net> // <http://www.arbornetworks.com> Injustice is relatively easy to bear; what stings is justice. -- H.L. Mencken

Vijay Gill had some real interesting insights into this in a presentation he gave back at NANOG 44: http://www.nanog.org/meetings/nanog44/presentations/Monday/Gill_programatic_... His Blog article on "Infrastructure is Software" further expounds upon the benefits of such an approach - http://vijaygill.wordpress.com/2009/07/22/infrastructure-is-software/ That stuff is light years ahead of anything anybody is doing today (well, apart from maybe Vijay himself ;) ... but IMO it's where we need to start heading. Stefan Fouant, CISSP, JNCIE-M/T www.shortestpathfirst.net GPG Key ID: 0xB5E3803D
-----Original Message----- From: Suresh Ramasubramanian [mailto:ops.lists@gmail.com] Sent: Monday, February 01, 2010 9:29 PM To: Chadwick Sorrell Cc: nanog@nanog.org Subject: Re: Mitigating human error in the SP
On Tue, Feb 2, 2010 at 7:51 AM, Chadwick Sorrell <mirotrem@gmail.com> wrote:
This outage, of a high profile customer, triggered upper management to react by calling a meeting just days after. Put bluntly, we've been told "Human errors are unacceptable, and they will be completely eliminated. One is too many."
Automated config deployment / provisioning. And sanity checking before deployment.
-- Suresh Ramasubramanian (ops.lists@gmail.com)

On Mon, Feb 01, 2010 at 09:46:07PM -0500, Stefan Fouant wrote:
Vijay Gill had some real interesting insights into this in a presentation he gave back at NANOG 44:
http://www.nanog.org/meetings/nanog44/presentations/Monday/Gill_programatic_...
His Blog article on "Infrastructure is Software" further expounds upon the benefits of such an approach - http://vijaygill.wordpress.com/2009/07/22/infrastructure-is-software/
That stuff is light years ahead of anything anybody is doing today (well, apart from maybe Vijay himself ;) ... but IMO it's where we need to start heading.
Vijay's stuff is fascinating. The vision is great. But in my experience, the vendors and implementations basically ruin the dream for anyone who doesn't have his pull.
I'm sure my software is nowhere close to being as sophisticated as his, but my plans are pretty much in line with his suggestions. Some problems I've run into that I don't see any kind of solution for:
1) Forwarding-impacting bugs: IOS bugs that are triggered by SNMP are easily the #1 cause of our accidental service impact. Most seem to be race conditions that require real-world config and forwarding load - not something a small shop can afford to build a lab to reproduce. If we stuck to manual deployment, we might have made a few mistakes but would it have been worse? Maybe - but honestly, it could be a wash.
2) Vendor support is highly suspicious of automation: anytime I open a ticket, even unrelated to an automated software process, the first thing the vendor support demands is to disable all automation. Juniper is by far the best about this, and they *still* don't actually believe their own automation tools work. Cisco TAC's answer has always been "don't ever use SNMP if it causes crashes!" Procurve doesn't even bother to respond to tickets related to automation bugs, even if they are remotely triggerable crashes in the default config.
3) Automation interfaces are largely unsupported: I imagine vendor software development having one or two guys that are the masterminds for SNMP/NETCONF/whatever - and that's it. When I have a question on how to find a particular tool, or find a bug in an automation function, I can often go months on a ticket with people that have no idea what I'm talking about. What documentation exists is typically incomplete or inconsistent across versions and product lines.
4) Related tools prevent reliable error reporting: as far as I can tell, Net-SNMP returns random values if a request fails; if there's a pattern, I've failed to discern it. expect is similar. ScreenOS's SSH implementation always returns that a file copy failed. Procurve only this year implemented ssh key-based auth in combination with remote authentication. The best-of-breed seems to be an oft-pathetic collection of tools.
5) Management support: developing automation software is hard - network devices aren't nearly as easy to deal with as they should be. When I spend weeks developing features that later cause IOS to spontaneously reload, people that don't understand the relation to operational impact start to advocate dismantling the automation, just like the vendors above.
I'm sure we'll continue to build automated policy and configuration tools. I'm just not convinced it's the panacea that everyone thinks. Unless you're one of the biggest, it puts your network at someone else's mercy - and that someone else doesn't care about your operational expenses.
Ross
-- Ross Vandegrift ross@kallisti.us
"If the fight gets hot, the songs get hotter. If the going gets tough, the songs get tougher." --Woody Guthrie
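As a hedged illustration of the defensive error handling point 4 forces on operators, a minimal sketch of an SNMP GET that only trusts returned values when every error field is explicitly clear; it assumes Python with the pysnmp library, which is a stand-in rather than the tooling Ross describes:

# Sketch: never trust an SNMP reply unless every error field is clear.
# Host, community string and OID are placeholders.
from pysnmp.hlapi import (SnmpEngine, CommunityData, UdpTransportTarget,
                          ContextData, ObjectType, ObjectIdentity, getCmd)

def safe_snmp_get(host, community, oid):
    errorIndication, errorStatus, errorIndex, varBinds = next(
        getCmd(SnmpEngine(),
               CommunityData(community),
               UdpTransportTarget((host, 161), timeout=2, retries=1),
               ContextData(),
               ObjectType(ObjectIdentity(oid))))
    if errorIndication:          # transport problems, timeouts, etc.
        raise RuntimeError(f"SNMP failure talking to {host}: {errorIndication}")
    if errorStatus:              # agent-side error reported in the PDU
        raise RuntimeError(f"SNMP error from {host}: {errorStatus.prettyPrint()}")
    return varBinds              # only now treat the returned values as real

if __name__ == "__main__":
    print(safe_snmp_get("192.0.2.1", "public", "1.3.6.1.2.1.1.3.0"))  # sysUpTime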

On Wed, Feb 3, 2010 at 11:14 AM, Ross Vandegrift <ross@kallisti.us> wrote:
On Mon, Feb 01, 2010 at 09:46:07PM -0500, Stefan Fouant wrote:
Vijay Gill had some real interesting insights into this in a presentation he gave back at NANOG 44:
http://www.nanog.org/meetings/nanog44/presentations/Monday/Gill_programatic_...
His Blog article on "Infrastructure is Software" further expounds upon the benefits of such an approach - http://vijaygill.wordpress.com/2009/07/22/infrastructure-is-software/
That stuff is light years ahead of anything anybody is doing today (well, apart from maybe Vijay himself ;) ... but IMO it's where we need to start heading.
Vijay's stuff is fascinating. The vision is great. But in my experience, the vendors and implementations basically ruin the dream for anyone who doesn't have his pull.
you know what helps? lots of operations folks asking for the same set of capabilities... Vendors build what will make them money. If you want a device to do X, getting lots of your friends in the operator community to agree and talk to the vendor with the same message helps the vendor understand and prioritize the request.
If you want more/better/faster/simpler configuration via 'script' (program) it makes sense to ask the vendor(s) for these capabilities... -chris

3) Automation interfaces are largely unsupported:
CLI is an automation interface. Combine that with a management server from which telnet sessions to the router can be managed, and you have probably the lowest risk automation interface possible. This may force you into building your own tools, but if you really want low risk, that's the price you pay.
I'm sure we'll continue to build automated policy and configuration tools. I'm just not convinced it's the panacea that everyone thinks. Unless you're one of the biggest, it puts your network at someone else's mercy - and that someone else doesn't care about your operational expenses.
That is not a risk of automation. That is a risk of buy versus build. More and more businesses of all sorts are beginning to take a new look at their software and automated systems with a view towards building and owning and maintaining the parts that really are business critical for their unique business. In this brave new world, only the non-essential stuff will be bought in as packages. --Michael Dillon
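A rough sketch of the approach Michael describes above - driving the vendor CLI from a management server rather than through a programmatic API - using pexpect; the prompt patterns, credentials and command shown are assumptions for illustration:

# Sketch: scripted CLI session from a management box. Prompt patterns,
# credentials and the command issued are placeholders, not vendor specifics.
import pexpect

def run_cli_command(host, username, password, command, prompt=r"[>#]\s*$"):
    session = pexpect.spawn(f"telnet {host}", timeout=15)
    session.expect("sername:")
    session.sendline(username)
    session.expect("assword:")
    session.sendline(password)
    session.expect(prompt)
    session.sendline(command)
    session.expect(prompt)
    output = session.before.decode(errors="replace")   # everything before the prompt
    session.sendline("exit")
    session.close()
    return output

if __name__ == "__main__":
    print(run_cli_command("192.0.2.10", "noc", "s3cret", "show interface status"))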

You can completely implement Vijay's most impressive stuff and simply move the problem to a different level of abstraction. No matter what you do, it still comes down to some geek banging on some plastic thingy. I'm as likely to screw up an "Extensible Entity-Attribute-Relationship" as I am an ACL. David On Wed, Feb 3, 2010 at 8:14 AM, Ross Vandegrift <ross@kallisti.us> wrote:
On Mon, Feb 01, 2010 at 09:46:07PM -0500, Stefan Fouant wrote:
Vijay Gill had some real interesting insights into this in a presentation he gave back at NANOG 44:
http://www.nanog.org/meetings/nanog44/presentations/Monday/Gill_programatic_...
His Blog article on "Infrastructure is Software" further expounds upon the benefits of such an approach - http://vijaygill.wordpress.com/2009/07/22/infrastructure-is-software/
That stuff is light years ahead of anything anybody is doing today (well, apart from maybe Vijay himself ;) ... but IMO it's where we need to start heading.
Vijay's stuff is fascinating. The vision is great. But in my experience, the vendors and implementations basically ruin the dream for anyone who doesn't have his pull.
I'm sure my software is nowhere close to being as sophisticated as his, but my plans are pretty much in line with his suggestions. Some problems I've run into that I don't see any kind of solution for:
1) Forwarding-impacting bugs: IOS bugs that are triggered by SNMP are easily the #1 cause of our accidental service impact. Most seem to be race conditions that require real-world config and forwarding load - not something a small shop can afford to build a lab to reproduce. If we stuck to manual deployment, we might have made a few mistakes but would it have been worse? Maybe - but honestly, it could be a wash.
2) Vendor support is highly suspicious of automation: anytime I open a ticket, even unrelated to an automated software process, the first thing the vendor support demands is to disable all automation. Juniper is by far the best about this, and they *still* don't actually believe their own automation tools work. Cisco TAC's answer has always been "don't ever use SNMP if it causes crashes!" Procurve doesn't even bother to respond to tickets related to automation bugs, even if they are remotely triggerable crashes in the default config.
3) Automation interfaces are largely unsupported: I imagine vendor software development having one or two guys that are the masterminds for SNMP/NETCONF/whatever - and that's it. When I have a question on how to find a particular tool, or find a bug in an automation function, I can often go months on a ticket with people that have no idea what I'm talking about. What documentation exists is typically incomplete or inconsistent across versions and product lines.
4) Related tools prevent reliable error reporting: as far as I can tell, Net-SNMP returns random values if a request fails; if there's a pattern, I've failed to discern it. expect is similar. ScreenOS's SSH implementation always returns that a file copy failed. Procurve only this year implemented ssh key-based auth in combination with remote authentication. The best-of-breed seems to be an oft-pathetic collection of tools.
5) Management support: developing automation software is hard - network devices aren't nearly as easy to deal with as they should be. When I spend weeks developing features that later causes IOS to spontaneously reload, people that don't understand the relation to operational impact start to advocate dismantling the automation just like the vendors above.
I'm sure we'll continue to build automated policy and configuration tools. I'm just not convinced it's the panacea that everyone thinks. Unless you're one of the biggest, it puts your network at someone else's mercy - and that someone else doesn't care about your operational expenses.
Ross
-- Ross Vandegrift ross@kallisti.us
"If the fight gets hot, the songs get hotter. If the going gets tough, the songs get tougher." --Woody Guthrie

Automated config deployment / provisioning. And sanity checking before deployment.
Easy to say, not so easy to do.
For instance, that incorrect port was identified by a number or name. Theoretically, if an automated tool pulls the number/name from a database and issues the command, then the error cannot happen. But how does the number/name get into the database? I've seen a situation where a human being enters that number, copying it from another application screen. We hope that it is done by copy/paste all the time, but who knows? And even copy/paste can make mistakes if the selection is done by mouse by someone who isn't paying enough attention. But wait! How did the other application come up with that number for copying? Actually, it was copy-pasted from yet a third application, and that application got it by copy-paste from a spreadsheet.
It is easy to create a tangled mess of OSS applications that are glued together by lots of manual human effort creating numerous opportunities for human error. So while I wholeheartedly support automation of network configuration, that is not a magic bullet. You also need to pay attention to the whole process, the whole chain of information flow.
And there are other things that may be even more effective, such as hiding your human errors. This is commonly called a "maintenance window" and it involves an absolute ban on making any network change, no matter how trivial, outside of a maintenance window. The human error can still occur, but because it is in a maintenance window, the customer either doesn't notice, or if it is planned maintenance, they don't complain because they are expecting a bit of disruption and have agreed to the planned maintenance window.
That only leaves break-fix work, which is where the most skilled and trusted engineers work on the live network outside of maintenance windows to fix stuff that is seriously broken. It sounds like the event in the original posting was something like that, but perhaps not, because this kind of break-fix work should only be done when there is already a customer-affecting issue. By the way, even break-fix changes can, and should be, tested in a lab environment before you push them onto the network.
--Michael Dillon
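As a hedged illustration of checking the whole chain rather than trusting one copy-pasted value, a minimal sketch that refuses to act on a port identifier unless two independent sources agree; every name in it is a hypothetical placeholder:

# Sketch: refuse to act on a port identifier unless two independent sources agree.
# The two lookup callables stand in for an OSS database query and a live query of
# the device (e.g. matching a circuit ID carried in the port description).

def confirm_port(order_id, lookup_in_database, lookup_on_device):
    db_port = lookup_in_database(order_id)
    dev_port = lookup_on_device(order_id)
    if db_port != dev_port:
        raise ValueError(
            f"order {order_id}: database says {db_port!r}, device says {dev_port!r}; "
            "stopping before anyone touches the wrong port")
    return db_port

if __name__ == "__main__":
    # Stand-in lookups for demonstration only.
    db = {"ORD-12345": "GigabitEthernet1/0/14"}
    device_descriptions = {"GigabitEthernet1/0/14": "ORD-12345"}
    port = confirm_port(
        "ORD-12345",
        lookup_in_database=lambda oid: db[oid],
        lookup_on_device=lambda oid: next(
            p for p, desc in device_descriptions.items() if desc == oid),
    )
    print(f"confirmed port: {port}")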

Never said it was, and never said foolproof either. Minimizing the chance of error is what I'm after - and ssh'ing in + hand typing configs isn't the way to go. Use a known good template to provision stuff - and automatically deploy it, and the chances of human error go down quite a lot. Getting it down to zero defect from there is another kettle of fish altogether - a much more expensive proposition, with dev / test, staging and production environments, documented change processes, maintenance windows etc. On Wed, Feb 3, 2010 at 7:00 AM, Michael Dillon <wavetossed@googlemail.com> wrote:
It is easy to create a tangled mess of OSS applications that are glued together by lots of manual human effort creating numerous opportunities for human error. So while I wholeheartedly support automation of network configuration, that is not a magic bullet. You also need to pay attention to the whole process, the whole chain of information flow.
-- Suresh Ramasubramanian (ops.lists@gmail.com)

On Feb 2, 2010, at 8:36 PM, Suresh Ramasubramanian wrote:
Never said it was, and never said foolproof either. Minimizing the chance of error is what I'm after - and ssh'ing in + hand typing configs isn't the way to go.
Use a known good template to provision stuff - and automatically deploy it, and the chances of human error go down quite a lot. Getting it down to zero defect from there is another kettle of fish altogether - a much more expensive proposition, with dev / test, staging and production environments, documented change processes, maintenance windows etc.
Yup. Or use a database and a template-driven compiler. See "Configuration management and security", IEEE Journal on Selected Areas in Communications, 27(3):268-274, April 2009, by myself and Randy Bush, http://www.cs.columbia.edu/~smb/papers/config-jsac.pdf (the system described is Randy's work, from many years ago). --Steve Bellovin, http://www.cs.columbia.edu/~smb
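To make the database-plus-template-compiler idea concrete, a toy sketch follows; the template, field names and values are invented placeholders, not the system described in the paper:

# Toy "config compiler": one row from a provisioning database plus a vetted
# template produce the config fragment. substitute() raises KeyError if the
# row is missing a field, so incomplete data never reaches a router.
from string import Template

PORT_TEMPLATE = Template("""\
interface $port
 description CUST:$customer CKT:$circuit_id
 switchport mode dot1q-tunnel
 switchport access vlan $vlan
 no shutdown
""")

def compile_port_config(row):
    return PORT_TEMPLATE.substitute(row)

if __name__ == "__main__":
    row = {   # in practice, a row fetched from the provisioning database
        "port": "GigabitEthernet1/0/14",
        "customer": "ACME",
        "circuit_id": "ORD-12345",
        "vlan": "214",
    }
    print(compile_port_config(row))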

On Tuesday 02 February 2010, Suresh Ramasubramanian wrote:
Never said it was, and never said foolproof either. Minimizing the chance of error is what I'm after - and ssh'ing in + hand typing configs isn't the way to go.
Use a known good template to provision stuff - and automatically deploy it, and the chances of human error go down quite a lot. Getting it down to zero defect from there is another kettle of fish altogether - a much more expensive proposition, with dev / test, staging and production environments, documented change processes, maintenance windows etc.
On Wed, Feb 3, 2010 at 7:00 AM, Michael Dillon <wavetossed@googlemail.com> wrote:
It is easy to create a tangled mess of OSS applications that are glued together by lots of manual human effort creating numerous opportunities for human error. So while I wholeheartedly support automation of network configuration, that is not a magic bullet. You also need to pay attention to the whole process, the whole chain of information flow.
-- Suresh Ramasubramanian (ops.lists@gmail.com)
Reminds me of the saying: nothing is foolproof given a sufficiently talented fool. I do agree that checklists, peer reviews, parallel turn-ups, and lab testing, when used and not jury-rigged, have helped me prepare for issues. Usually the times I skipped those things are the times I kick myself for not doing it. Another thing that helps is giving yourself enough time, doing what you can ahead of time, and being ready on time. Just my two bits.
-- ---------------------- Brian Raaen Network Engineer braaen@zcorum.com

On 2/1/2010 6:21 PM, Chadwick Sorrell wrote:
Any other comments on the subject would be appreciated, we would like to come to our next meeting armed and dangerous.
If upper management believes humans can be required to make no errors, ask whether they have achieved that ideal for themselves. If they say yes, start a recorder and ask them how. When they get done, ask them why they think the solution that worked for them will scale to a broader population. (Don't worry, you won't get to the point of needing the recorder.) Otherwise, as Suresh notes, the only way to eliminate human error completely is to eliminate the presence of humans in the activity. For those processes retaining human involvement, procedures and interfaces can be designed to minimize human error. Well-established design specialty. Human factors. Usability. Etc. Typically can be quite effective. Worth using. d/ -- Dave Crocker Brandenburg InternetWorking bbiw.net

I'll say "as vijay gill notes" after Stefan posted those two very interesting links. He's saying much the same that I did - in a great deal more detail. Fascinating.
http://www.nanog.org/meetings/nanog44/presentations/Monday/Gill_programatic_... His Blog article on "Infrastructure is Software" further expounds upon the benefits of such an approach - http://vijaygill.wordpress.com/2009/07/22/infrastructure-is-software/
On Tue, Feb 2, 2010 at 8:28 AM, Dave CROCKER <dhc2@dcrocker.net> wrote:
Otherwise, as Suresh notes, the only way to eliminate human error completely is to eliminate the presence of humans in the activity.
-- Suresh Ramasubramanian (ops.lists@gmail.com)

On Mon, 1 Feb 2010 21:21:52 -0500 Chadwick Sorrell <mirotrem@gmail.com> wrote:
Hello NANOG,
Long time listener, first time caller.
A recent organizational change at my company has put someone in charge who is determined to make things perfect. We are a service provider, not an enterprise company, and our business is doing provisioning work during the day. We recently experienced an outage when an engineer, troubleshooting a failed turn-up, changed the ethertype on the wrong port losing both management and customer data on said device. This isn't a common occurrence, and the engineer in question has a pristine track record.
Why didn't the customer have a backup link if their service was so important to them and indirectly your upper management? If your upper management are taking this problem that seriously, then your *sales people* didn't do their job properly - they should be ensuring that customers with high availability requirements have a backup link, or aren't led to believe that the single-point-of-failure service will be highly available.
This outage, of a high profile customer, triggered upper management to react by calling a meeting just days after. Put bluntly, we've been told "Human errors are unacceptable, and they will be completely eliminated. One is too many."
If upper management don't understand that human error is a risk factor that can't be completely eliminated, then I suggest "self-eliminating" and find yourself a job somewhere else. The only way you'll avoid human error having any impact on production services is to not change anything - which pretty much means not having a job anyway ...
I am asking the respectable NANOG engineers....
What measures have you taken to mitigate human mistakes?
Have they been successful?
Any other comments on the subject would be appreciated, we would like to come to our next meeting armed and dangerous.
Thanks! Chad

Humans make errors. For your upper management to think they can build a foundation of reliability on the theory that humans won't make errors is self deceiving.
But that isn't where the story ends. That's where it begins. Your infrastructure, processes and tools should all be designed with that in mind so as to reduce or eliminate the impact that human error will have on the reliability of the service you provide to your customers.
So, for the example you gave there are a few things that could be put in place. The first one, already mentioned by Chad, is that mission critical services should not be designed with single points of failure - that situation should be remediated.
Another question to be asked - since this was provisioning work being done, and it was apparently being done on production equipment, could the work have been done at a time of day (or night) when an error would not have been as much of a problem?
You don't say how long the outage lasted, but given the reaction by your upper management, I would infer that it lasted for a while. That raises the next question. Who besides the engineer making the mistake was aware of the fact that work on production equipment was occurring? The reason this is important is because having the NOC know that work is occurring would give them a leg up on locating where the problem is once they get the trouble notification.
Paul
On Feb 2, 2010, at 8:16 AM, Mark Smith wrote:
On Mon, 1 Feb 2010 21:21:52 -0500 Chadwick Sorrell <mirotrem@gmail.com> wrote:
Hello NANOG,
Long time listener, first time caller.
A recent organizational change at my company has put someone in charge who is determined to make things perfect. We are a service provider, not an enterprise company, and our business is doing provisioning work during the day. We recently experienced an outage when an engineer, troubleshooting a failed turn-up, changed the ethertype on the wrong port losing both management and customer data on said device. This isn't a common occurrence, and the engineer in question has a pristine track record.
Why didn't the customer have a backup link if their service was so important to them and indirectly your upper management? If your upper management are taking this problem that seriously, then your *sales people* didn't do their job properly - they should be ensuring that customers with high availability requirements have a backup link, or aren't led to believe that the single-point-of-failure service will be highly available.
This outage, of a high profile customer, triggered upper management to react by calling a meeting just days after. Put bluntly, we've been told "Human errors are unacceptable, and they will be completely eliminated. One is too many."
If upper management don't understand that human error is a risk factor that can't be completely eliminated, then I suggest "self-eliminating" and find yourself a job somewhere else. The only way you'll avoid human error having any impact on production services is to not change anything - which pretty much means not having a job anyway ...
I am asking the respectable NANOG engineers....
What measures have you taken to mitigate human mistakes?
Have they been successful?
Any other comments on the subject would be appreciated, we would like to come to our next meeting armed and dangerous.
Thanks! Chad

On Tue, Feb 2, 2010 at 9:09 AM, Paul Corrao <pcorrao@voxeo.com> wrote:
Humans make errors.
For your upper management to think they can build a foundation of reliability on the theory that humans won't make errors is self deceiving.
But that isn't where the story ends. That's where it begins. Your infrastructure, processes and tools should all be designed with that in mind so as to reduce or eliminate the impact that human error will have on the reliability of the service you provide to your customers.
So, for the example you gave there are a few things that could be put in place. The first one, already mentioned by Chad, is that mission critical services should not be designed with single points of failure - that situation should be remediated.
Agreed.
Another question to be asked - since this was provisioning work being done, and it was apparently being done on production equipment, could the work have been done at a time of day (or night) when an error would not have been as much of a problem?
As it stands now, businesses want to turn their services up when they are in the office. We do all new turn-ups during the day; anything requiring a roll or maintenance window is scheduled in the middle of the night.
You don't say how long the outage lasted, but given the reaction by your upper management, I would infer that it lasted for a while. That raises the next question. Who besides the engineer making the mistake was aware of the fact that work on production equipment was occurring? The reason this is important is because having the NOC know that work is occurring would give them a leg up on locating where the problem is once they get the trouble notification.
The actual error happened when someone was troubleshooting a turn-up, where in the past the customer in question has had their ethertype set wrong. It wasn't a provisioning problem as much as someone troubleshooting why it didn't come up with the customer. Ironically, the NOC was on the phone when it happened, and the switch was rebooted almost immediately and the outage lasted 5 minutes. Chad

The actual error happened when someone was troubleshooting a turn-up, where in the past the customer in question has had their ethertype set wrong. It wasn't a provisioning problem as much as someone troubleshooting why it didn't come up with the customer. Ironically, the NOC was on the phone when it happened, and the switch was rebooted almost immediately and the outage lasted 5 minutes.
This is why large operators have a "ready for service" protocol. The customer is never billed until it is officially RFS, and to make it RFS requires more than an operational network, it also requires the customer to agree in writing that they have a fully functional connection.
This is another way of hiding human error, because now the up-down-up is just part of the provisioning process. There is a record of the RFS date-time, so if the customer complains about an outage BEFORE that point, they can be politely reminded of when RFS happened and that charging does not start until AFTER that point.
--Michael Dillon
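A tiny sketch of the record-keeping behind such an RFS gate - the only point being that billing and outage classification both key off a recorded RFS timestamp; the class and field names are invented:

# Sketch: a circuit is only billable, and an event only counts as an "outage",
# after the recorded ready-for-service (RFS) timestamp. Names are invented.
from dataclasses import dataclass
from datetime import datetime
from typing import Optional

@dataclass
class Circuit:
    circuit_id: str
    rfs_at: Optional[datetime] = None   # set when the customer signs off in writing

    def billable(self, now: datetime) -> bool:
        return self.rfs_at is not None and now >= self.rfs_at

    def counts_as_outage(self, event_time: datetime) -> bool:
        # Up/down flaps during provisioning, before RFS, are not outages.
        return self.rfs_at is not None and event_time >= self.rfs_at

if __name__ == "__main__":
    ckt = Circuit("ORD-12345")
    ckt.rfs_at = datetime(2010, 2, 5, 9, 0)
    print(ckt.counts_as_outage(datetime(2010, 2, 3, 14, 0)))  # False: pre-RFS flap
    print(ckt.billable(datetime(2010, 2, 6, 0, 0)))           # True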

If your manager pretends that they can manage humans without a few well-worn human factor books on their shelf, quit. David On Tue, Feb 2, 2010 at 5:36 PM, Michael Dillon <wavetossed@googlemail.com> wrote:
The actual error happened when someone was troubleshooting a turn-up, where in the past the customer in question has had their ethertype set wrong. It wasn't a provisioning problem as much as someone troubleshooting why it didn't come up with the customer. Ironically, the NOC was on the phone when it happened, and the switch was rebooted almost immediately and the outage lasted 5 minutes.
This is why large operators have a "ready for service" protocol. The customer is never billed until it is officially RFS, and to make it RFS requires more than an operational network, it also requires the customer to agree in writing that they have a fully functional connection.
This is another way of hiding human error, because now the up-down-up is just part of the provisioning process. There is a record of the RFS date-time so if the customer complains about an outage BEFORE that point, they can be politely reminded that when RFS happened and that charging does not start until AFTER that point.
--Michael Dillon

On 02/02/2010 02:21, Chadwick Sorrell wrote:
This outage, of a high profile customer, triggered upper management to react by calling a meeting just days after. Put bluntly, we've been told "Human errors are unacceptable, and they will be completely eliminated. One is too many."
Leaving the PHB rhetoric aside for a few moments, this comes down to two things: 1. cost vs. return, and 2. realisation that service availability is a matter of risk management, not a product bolt-on that you can install in your operations department in a matter of days.
Pilot error can be substantially reduced by a variety of different things, most notably good quality training, good quality procedures and documentation, lab staging of all potentially service-affecting operations, automation of lots of tasks, good quality change management control, pre/post project analysis, and basic risk analysis of all regular procedures.
You'll note that all of these things cost time and money to develop, implement and maintain; also, depending on the operational service model which you currently use, some of them may dramatically affect operational productivity one way or another. This often leads to a significant increase in staffing / resourcing costs in order to maintain similar levels of operational service. It also tends to lead to inflexibility at various levels, which can have a knock-on effect in terms of customer expectation.
Other things which will help your situation from a customer interaction point of view are rigorous use of maintenance windows and good communications to ensure that they understand that there are risks associated with maintenance.
Your management is obviously pretty upset about this incident. If they want things to change, then they need to realise that reducing pilot error is not just a matter of getting someone to bark at the tech people until the problem goes away. They need to be fully aware at all levels that risk management of this sort is a major undertaking for a small company, and that it needs their full support and buy-in.
Nick

On Mon, Feb 01, 2010 at 09:21:52PM -0500, Chadwick Sorrell wrote:
Hello NANOG,
Long time listener, first time caller. [snip] What measures have you taken to mitigate human mistakes?
Have they been successful?
Any other comments on the subject would be appreciated, we would like to come to our next meeting armed and dangerous.
Define your processes well, and have management sign off so there's no blame game and people realize they are all on the same side. Use peer review.
Don't start automating until you have a working system, and then get the humans out of the repetitive bits. Don't build monolithic systems. Test your automation well. Be sure to have the symmetric *de-provisioning* to any provisioning, else you will be relying on humans to clean out the cruft instead of addressing the problem.
Extend accountability throughout the organization - replace commission-minded sales folks with relationship-minded account management.
Always have OoB. Require vendors to be *useful* under OoB conditions, at least to your more advanced employees.
Expect errors in the system and in execution; develop ways to check for them and be prepared to modify methods, procedures and tools without multiple years and inter-departmental bureaucracy. Change and errors happen, so capitalize on those events to improve your service and systems rather than emphasizing punishment.
Cheers, Joe
-- RSUC / GweepNet / Spunk / FnB / Usenix / SAGE
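To illustrate the symmetric de-provisioning point, a minimal sketch in which every provisioning step registers its inverse, so cleanup never depends on humans remembering the cruft; the step names and actions are invented:

# Sketch: every provisioning action is registered together with its inverse,
# so a de-provision (or a failed turn-up rollback) replays the inverses in
# reverse order. Step names and actions are placeholders.

class Provisioner:
    def __init__(self):
        self._undo = []

    def step(self, name, do, undo):
        print(f"provision: {name}")
        do()
        self._undo.append((name, undo))

    def deprovision(self):
        while self._undo:
            name, undo = self._undo.pop()    # reverse order
            print(f"de-provision: {name}")
            undo()

if __name__ == "__main__":
    p = Provisioner()
    p.step("allocate vlan 214", do=lambda: None, undo=lambda: None)
    p.step("configure customer port", do=lambda: None, undo=lambda: None)
    p.deprovision()   # removes the port config, then frees the VLAN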

We have solved 98% of this with standard configurations and templates. To deviate from this requires management approval/exception approval after an evaluation of the business risks.
Automation of config building is not too hard, and certainly things like peer-groups (cisco) and regular groups (juniper) make it easier. If you go for the holy grail, you want something that takes into account the following:
1) each phase in the provisioning/turn-up state
2) each phase in infrastructure troubleshooting (turn-up, temporary outage/temporary testing, production)
3) automated pushing of config via load override/commit replace to your config space.
Obviously testing, etc., is important. I've found that whenever a human is involved, mistakes happen. There is also the "Software is imperfect" mantra that should be repeated.
I find vendors at times have demanding customers who want perfection. Bugs happen, outages happen; the question is how you respond to these risks. If you have poor handling of bugs, outages, etc. in your process, or are decision-gridlocked, very bad things happen.
- Jared
On Feb 1, 2010, at 9:21 PM, Chadwick Sorrell wrote:
Hello NANOG,
Long time listener, first time caller.
A recent organizational change at my company has put someone in charge who is determined to make things perfect. We are a service provider, not an enterprise company, and our business is doing provisioning work during the day. We recently experienced an outage when an engineer, troubleshooting a failed turn-up, changed the ethertype on the wrong port losing both management and customer data on said device. This isn't a common occurrence, and the engineer in question has a pristine track record.
This outage, of a high profile customer, triggered upper management to react by calling a meeting just days after. Put bluntly, we've been told "Human errors are unacceptable, and they will be completely eliminated. One is too many."
I am asking the respectable NANOG engineers....
What measures have you taken to mitigate human mistakes?
Have they been successful?
Any other comments on the subject would be appreciated, we would like to come to our next meeting armed and dangerous.
Thanks! Chad
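As a concrete illustration of item 3 above - pushing a complete generated config with "load override / commit" semantics - here is a minimal sketch using the Junos PyEZ library; the library choice, host, credentials and file path are assumptions for illustration, not a description of Jared's tooling:

# Sketch: replace the device's candidate config wholesale with the generated
# one ("load override"), diff it, run commit-check, then commit.
from jnpr.junos import Device
from jnpr.junos.utils.config import Config

def push_full_config(host, user, password, config_path):
    with Device(host=host, user=user, passwd=password) as dev:
        cu = Config(dev)
        cu.lock()
        try:
            cu.load(path=config_path, format="text", overwrite=True)  # load override
            print(cu.diff() or "no changes")
            if cu.commit_check():
                cu.commit(comment="automated provisioning push")
            else:
                cu.rollback()
        finally:
            cu.unlock()

if __name__ == "__main__":
    push_full_config("192.0.2.20", "automation", "s3cret", "/var/gen/edge-sw-7.conf")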

On Feb 2, 2010, at 9:33 AM, Jared Mauch wrote:
We have solved 98% of this with standard configurations and templates.
To deviate from this requires management approval/exception approval after an evaluation of the business risks.
I would also point Chad to this book: http://bit.ly/cShEIo (Amazon Link to Visual Ops). It's very useful to have your management read it. You may or may not be able to or want to use a full ITIL process, but understanding how these policies and procedures can/should work, and using the ones that apply makes sense. Change control, tracking, and configuration management are going to be key to avoiding mistakes, and being able to rapidly repair when one is made. Unfortunately, most management that demands No Tolerance, Zero Error from operations won't read the book. Good luck.. I'd bet most of the people on this list have been there one time or another. Cheers, -j

On 2/2/2010 11:33 AM, Jared Mauch wrote:
We have solved 98% of this with standard configurations and templates.
To deviate from this requires management approval/exception approval after an evaluation of the business risks.
Automation of config building is not too hard, and certainly things like peer-groups (cisco) and regular groups (juniper) make it easier.
Those things and some of the others that have been mentioned will go a very long way to prevent the second occurrence. Only training, adequate (number and quality) staff, and a quality-above-all-else culture have a prayer of preventing the first occurrence. (For sure, lots of the second-occurrence-preventers may be part of that quality-first culture.)
-- "Government big enough to supply everything you need is big enough to take everything you have." Remember: The Ark was built by amateurs, the Titanic by professionals. Requiescas in pace o email Ex turpi causa non oritur actio Eppure si rinfresca ICBM Targeting Information: http://tinyurl.com/4sqczs http://tinyurl.com/7tp8ml

Chadwick Sorrell wrote:
This outage, of a high profile customer, triggered upper management to react by calling a meeting just days after. Put bluntly, we've been told "Human errors are unacceptable, and they will be completely eliminated. One is too many."
Good, Fast, Cheap - pick any two. No you can't have all three.
Here, Good is defined by your pointy-haired bosses as an impossible-to-achieve zero error rate.[1] Attempting to achieve this is either going to cost $$$, or your operations speed (how long it takes people to do things) is going to drop like a rock. Your first action should be to make sure upper management understands this so they can set the appropriate priorities on Good, Fast, and Cheap, and make the appropriate budget changes.
It's going to cost $$$ to hire enough people to have the staff necessary to double-check things in a timely manner, OR things are going to slow way down as the existing staff is burdened by necessary double-checking of everything and triple-checking of some things required to try to achieve a zero error rate. They will also need to spend $$$ on software (to automate as much as possible) and testing equipment. They will also never actually achieve a zero error rate, as this is an impossible task that no organization has ever achieved, no matter how much emphasis or money they pour into it (e.g. Windows vulnerabilities) or how important it is (see the Challenger, Columbia, and Mars Climate Orbiter incidents).
When you put a $$$ cost on trying to achieve a zero error rate, pointy-haired bosses are usually willing to accept a normal error rate. Of course, they want you to try to avoid errors, and there are a lot of simple steps you can take in that effort (basic checklists, automation, testing) which have been mentioned elsewhere in this thread that will cost some money, but not the $$$ that is required to try to achieve a zero error rate. Make sure they understand that the budget they allocate for these changes will be strongly correlated to how Good (zero error rate) and Fast (quick operational responses to turn-ups and problems) the outcome of this initiative will be.
jc
[1] http://www.godlessgeeks.com/LINKS/DilbertQuotes.htm
2. "What I need is a list of specific unknown problems we will encounter." (Lykes Lines Shipping)
6. "Doing it right is no excuse for not meeting the schedule." (R&D Supervisor, Minnesota Mining & Manufacturing/3M Corp.)

Thanks for all the comments! On Tue, Feb 2, 2010 at 1:01 PM, JC Dill <jcdill.lists@gmail.com> wrote:
Chadwick Sorrell wrote:
This outage, of a high profile customer, triggered upper management to react by calling a meeting just days after. Put bluntly, we've been told "Human errors are unacceptable, and they will be completely eliminated. One is too many."
Good, Fast, Cheap - pick any two. No you can't have all three.
Here, Good is defined by your pointy-haired bosses as an impossible-to-achieve zero error rate.[1] Attempting to achieve this is either going to cost $$$, or your operations speed (how long it takes people to do things) is going to drop like a rock. Your first action should be to make sure upper management understands this so they can set the appropriate priorities on Good, Fast, and Cheap, and make the appropriate budget changes.
It's going to cost $$$ to hire enough people to have the staff necessary to double-check things in a timely manner, OR things are going to slow way down as the existing staff is burdened by necessary double-checking of everything and triple-checking of some things required to try to achieve a zero error rate. They will also need to spend $$$ on software (to automate as much as possible) and testing equipment. They will also never actually achieve a zero error rate as this is an impossible task that no organization has ever achieved, no matter how much emphasis or money they pour into it (e.g. Windows vulnerabilities) or how important (see Challenger, Columbia, and the Mars Climate Orbiter incidents).
When you put a $$$ cost on trying to achieve a zero error rate, pointy-haired bosses are usually willing to accept a normal error rate. Of course, they want you to try to avoid errors, and there are a lot of simple steps you can take in that effort (basic checklists, automation, testing) which have been mentioned elsewhere in this thread that will cost some money but not the $$$ that is required to try to achieve a zero error rate. Make sure they understand that the budget they allocate for these changes will be strongly correlated to how Good (zero error rate) and Fast (quick operational responses to turn-ups and problems) the outcome of this initiative.
jc
[1] http://www.godlessgeeks.com/LINKS/DilbertQuotes.htm
2. "What I need is a list of specific unknown problems we will encounter." (Lykes Lines Shipping)
6. "Doing it right is no excuse for not meeting the schedule." (R&D Supervisor, Minnesota Mining & Manufacturing/3M Corp.)
participants (19)
- Brian Raaen
- Chadwick Sorrell
- Christopher Morrow
- Dave CROCKER
- David Hiers
- Dobbins, Roland
- James Downs
- Jared Mauch
- JC Dill
- Joe Provo
- Larry Sheldon
- Mark Smith
- Michael Dillon
- Nick Hilliard
- Paul Corrao
- Ross Vandegrift
- Stefan Fouant
- Steven Bellovin
- Suresh Ramasubramanian