Owen,
We could learn a lot about this from aviation. Nowhere in human history have more research, care, training, and discipline been applied to accident prevention, mitigation, and analysis than in aviation. A few examples:
Others later in this thread duly noted the associated costs, which are clearly "worth it" for this particular application of these methods [snipped]. However, I assert this is warranted because of the specific public trust that commercial aviation must be given. Additionally, this form of professional or industry "standard" isn't unique in the world; you can find (albeit smaller) parallels in most states' PE certification tracks and the like. In the case of the big-I Internet, I assert we can't (yet) successfully argue that it deserves similar public trust. In short, I'm arguing that the big-I Internet deserves special-pleading status in these sorts of "instrument -> record -> improve" strawmen, and that we shouldn't apply similar concepts or regulation. (Robert B. then responded):
All, The real problem is the same human factors that cause most accidents in aviation. Look at the list below and replace the word Pilot with Network Engineer or Support Tech or Programmer or whatever... and think about all the problems where something didn't work out right. It's because someone circumvented the rules, processes, and cross-checks put in place to prevent the problem in the first place. Nothing can be made idiot-proof, because idiots are so creative.
I'd like to suggest we also swap "bug" for "software defect" or "hardware defect" - perhaps if operators started talking about problems like engineers, we'd get more global buy-in for a process-based solution. I certainly like the idea of improving the state of affairs where possible - especially in the operator->device direction (e.g., fat-fingering an ACL, prefix list, or community list). When people make mistakes, it seems very wise to accurately record the entrance criteria, the results of their actions, and ways to avoid a recurrence - and then share that with all operators (like at NANOG meetings!). The part I don't like is being ultimately responsible for, or having to "design around," a class of systemic problems which are entirely outside of an operator's sphere of control. What curve must we shift to get routers with hardware and software that are a) fast, b) reliable, and c) cheap -- in the hopes that the only problems left to solve indeed are human ones? -Tk
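
(Editor's note: the fat-finger class of operator error mentioned above is exactly where a cheap, automated cross-check pays off. A minimal sketch in Python, using only the standard library `ipaddress` module - this is a hypothetical validator for illustration, not any vendor's tool:)

```python
import ipaddress

def validate_prefix_list(entries):
    """Check candidate prefix-list entries before they reach a router.

    Returns (ok, errors): ok is True only if every entry parses as a
    valid IPv4/IPv6 network with no host bits set.
    """
    errors = []
    for line in entries:
        prefix = line.strip()
        if not prefix:
            continue  # skip blank lines
        try:
            # strict=True rejects host bits set, e.g. 192.0.2.1/24
            ipaddress.ip_network(prefix, strict=True)
        except ValueError as exc:
            errors.append(f"{prefix}: {exc}")
    return (not errors, errors)

# A classic fat-finger: /33 is out of range for IPv4
ok, errs = validate_prefix_list(["192.0.2.0/24", "10.0.0.0/33", "2001:db8::/32"])
```

(Running a check like this in a pre-commit hook or provisioning pipeline turns a potential routing incident into a rejected change, which is the kind of process-based cross-check the aviation analogy argues for.)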