
In a message written on Mon, Mar 04, 2013 at 09:31:13AM +0200, Saku Ytti wrote:
> Probably the only thing you could have done to plan against this would
> have been a solid dual-vendor strategy: presume that sooner or later a
> software defect will take one vendor completely out. And maybe they did
> plan for it, but decided dual-vendor costs more than the rare outages.
From what I have heard so far, there is something else they could have done: hire higher-quality people. Any competent network admin would have stopped and questioned a 90,000+ byte packet and done more investigation. Competent programmers writing their internal tools would have flagged that data as out of range.

I can't tell you how many times I've sat in a post-mortem meeting about some issue and the answer from senior management is "why don't you just provide a script to our NOC guys, so the next time they can run it and make it all better." Of course it's easy to say that once the smart people have diagnosed the problem! You can buy these "scripts" for almost any profession. There are manuals on how to fix everything on a car, and treatment plans for almost every disease. Yet most people intuitively understand that you take your car to a mechanic and your body to a doctor for the proper diagnosis. The primary thing you're paying for is expertise in what to fix, not how to fix it. That takes experience and training.

But somehow it doesn't sink in with networking. I would not be at all surprised to hear that someone over at Cloudflare right now is saying "let's make a script to check the packet size," as if that will fix the problem. It won't. Next time the issue will be different, and the same undertrained person who missed the packet size this time will miss the next issue as well. They should all be sitting around asking, "how can we hire competent network admins for our NOC?" But that would cost real money.

-- 
Leo Bicknell - bicknell@ufp.org - CCIE 3440
PGP keys at http://www.ufp.org/~bicknell/
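[As an aside, the sanity check being described is trivial to write. A minimal sketch, assuming a hypothetical internal tool that ingests reported packet sizes; `validate_packet_size` and the tool itself are illustrative, not anything Cloudflare actually runs. The one hard fact it leans on is that the IP total-length field is 16 bits, so no IP packet can exceed 65,535 bytes:]

```python
# Hypothetical sanity check for an internal NOC tool that ingests
# reported packet sizes. The IP total-length field is 16 bits, so
# 65535 bytes is the hard maximum; a 90,000+ byte "packet" is garbage
# data and should be flagged for investigation, not silently passed on.

MAX_IP_PACKET = 65535  # 16-bit total-length field (IPv4 and IPv6 alike)

def validate_packet_size(size_bytes: int) -> bool:
    """Return True if the reported size is plausible, False if it
    is out of range and needs a human to investigate."""
    if size_bytes < 0 or size_bytes > MAX_IP_PACKET:
        print(f"suspicious packet size {size_bytes}: "
              f"exceeds the {MAX_IP_PACKET}-byte IP maximum, investigate")
        return False
    return True
```

[Of course, that is exactly the point of the paragraph above: the check itself is ten lines; knowing that it was the thing worth checking is what you pay experienced people for.]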