"Michel Py" <michel@arneill-py.sacramento.ca.us> writes:
> The dead processor still has to be replaced, but this is scheduled maintenance, not outage. A little extra ammo when you have to hunt five or six nines.
MTTR on a single box is irrelevant when you are off playing Ponce de Leon, hunting the Fountain of Five or Six Nines. Even when your architecture doesn't depend on any one particular machine (or even whole big sets of machines) being available, you don't get to "five or six nines"... just ask Google, Akamai, or Microsoft - there are other things beyond your control that spoil the picnic first.

As has been observed time and time again, the tried and true way to make five or six nines of reliability in a system of more than trivial complexity is to take a lesson from the telcos (the progenitors of the "five nines" lie) and build a framework and evaluation methodology that excludes broad classes of unavailability-causing events or prorates them in such a way as to make them non-reportable. Add to that list incrementally, until the remaining time listed shows your target number of nines of reliability. Presto, five nines.

---Rob
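
For concreteness, the arithmetic behind "add to that list until you hit your target" can be sketched roughly as below. This is only an illustration: the outage categories and minute counts are invented, the function names are hypothetical, and a year is assumed to be 365 days. Five nines leaves a budget of about 5.26 minutes of downtime per year, so almost any real event blows it unless it is declared non-reportable.

```python
import math

MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600; assumes a non-leap year


def availability(downtime_minutes, period_minutes=MINUTES_PER_YEAR):
    """Fraction of the period during which the service was up."""
    return 1.0 - downtime_minutes / period_minutes


def nines(avail):
    """Availability expressed as a count of nines (0.999 is about 3)."""
    if avail >= 1.0:
        return math.inf
    return -math.log10(1.0 - avail)


def downtime_budget(n, period_minutes=MINUTES_PER_YEAR):
    """Minutes of downtime allowed per period at n nines."""
    return period_minutes * 10 ** (-n)


# Hypothetical one-year outage log: (cause, minutes). Invented numbers.
OUTAGES = [
    ("hardware", 12.0),
    ("upstream_network", 95.0),
    ("operator_error", 40.0),
    ("software_upgrade", 180.0),
    ("power", 4.0),
]


def reported_nines(outages, excluded=()):
    """Nines after dropping every outage whose cause is 'non-reportable'."""
    counted = sum(mins for cause, mins in outages if cause not in excluded)
    return nines(availability(counted))


def exclusions_needed(outages, target_nines):
    """Exclude the costliest categories one at a time until the target is met."""
    excluded = []
    for cause, _ in sorted(outages, key=lambda o: o[1], reverse=True):
        if reported_nines(outages, excluded) >= target_nines:
            break
        excluded.append(cause)
    return excluded


if __name__ == "__main__":
    print(f"five-nines budget: {downtime_budget(5):.3f} min/year")
    print(f"honest figure:     {reported_nines(OUTAGES):.2f} nines")
    dropped = exclusions_needed(OUTAGES, 5)
    print(f"excluded to reach five nines: {dropped}")
    print(f"reported figure:   {reported_nines(OUTAGES, dropped):.2f} nines")
```

With this made-up log, the honest figure is a little over three nines, and four of the five categories have to be excluded before the reported number crosses five.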