
On Mon, Mar 4, 2013 at 10:40 AM, Saku Ytti <saku@ytti.fi> wrote:
On (2013-03-04 13:23 -0500), Jeff Wheeler wrote:
We have lots of stupid people in our industry because so few understand "The Way Things Work."
We have a tendency to view the mistakes we make as unavoidable human errors and the mistakes other people make as avoidable stupidity.
We should actively plan for mistakes and errors. If you actively plan for no 'stupid mistakes', you're gonna have a bad time.
From my point of view, outages are caused by: 1) operator error, 2) software defects, 3) hardware defects.
Most people design only against 3), often with a design that actually increases the likelihood of 2) and 1), reducing the overall MTBF of a design that, strictly in theory, should increase it.
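To put rough arithmetic on the point above, here is a minimal sketch. It assumes the three causes are independent and each occurs at a constant rate, so the rates simply add and MTBF is the reciprocal of the total. The rates, the names `simple` and `redundant`, and the helper `mtbf_years` are all invented for illustration; none of the numbers are measurements.

```python
# Outage-causing events per year, by cause. All numbers are made up
# purely to illustrate the argument; they are not measurements.
simple = {"operator": 0.5, "software": 0.5, "hardware": 1.0}

# Hypothetical "redundant" design: hardware outages drop sharply, but the
# extra complexity raises the operator-error and software-defect rates.
redundant = {"operator": 1.2, "software": 1.0, "hardware": 0.1}


def mtbf_years(rates):
    """MTBF in years, assuming independent causes with constant rates."""
    return 1.0 / sum(rates.values())


for name, rates in (("simple", simple), ("redundant", redundant)):
    print(f"{name}: {mtbf_years(rates):.2f} years between outages")
```

With these made-up rates the redundant design ends up with the lower MTBF even though its hardware term is ten times better, because the operator and software terms now dominate the sum.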
...And a lot of people who know the hierarchy solve 3 and then solve 2 with multiple parallel environments running different vendors' equipment, only to find that 1 increased due to the additional complexity.

On the other hand, I've seen people who had horrible explosions of 2 or 3 due to ignoring all but 1.

If you ACTUALLY need that many 9s, you need all of: redundancy, diversity of vendors, and suitably trained, exercised, process-supported net admins. That's a few multiples of 2 more expense than nearly anyone typically wants to pay for.

-- 
-george william herbert
george.herbert@gmail.com