
On 02/02/2010 02:21, Chadwick Sorrell wrote:
> This outage, of a high profile customer, triggered upper management to react by calling a meeting just days after. Put bluntly, we've been told "Human errors are unacceptable, and they will be completely eliminated. One is too many."

Leaving the PHB rhetoric aside for a few moments, this comes down to two things: 1. cost vs. return and 2. realisation that service availability is a matter of risk management, not a product bolt-on that you can install in your operations department in a matter of days.

Pilot error can be substantially reduced by a variety of different things, most notably good quality training, good quality procedures and documentation, lab staging of all potentially service-affecting operations, automation of lots of tasks, good quality change management control, pre/post project analysis, and basic risk analysis of all regular procedures.

You'll note that all of these things cost time and money to develop, implement and maintain. Also, depending on the operational service model which you currently use, some of them may dramatically affect operational productivity one way or another. This often leads to a significant increase in staffing / resourcing costs in order to maintain similar levels of operational service. It also tends to lead to inflexibility at various levels, which can have a knock-on effect in terms of customer expectation.

Other things which will help your situation from a customer interaction point of view are rigorous use of maintenance windows and good communications, to ensure that customers understand that there are risks associated with maintenance.

Your management is obviously pretty upset about this incident. If they want things to change, then they need to realise that reducing pilot error is not just a matter of getting someone to bark at the tech people until the problem goes away. They need to be fully aware at all levels that risk management of this sort is a major undertaking for a small company, and that it needs their full support and buy-in.

Nick