On Fri, 24 November 2000, Roeland Meyer wrote:
After all the discussion that we had on the Datacenter list, I am surprised at this. You'd think that they'd have redundant PS's with redundant UPS's.
There are interesting electrical faults which will kill redundant UPSes and redundant power supplies. However, I don't know whether Sprint uses redundant power supplies, or whether it had a failure affecting multiple power supplies on the STP. They use Compaq/Tandem NonStop hardware for their SCPs. After previous power problems at Sprint COs (e.g. Kansas City) which affected my service, I asked my Sprint salesperson what happened. He was never able to get anyone to call me back with an answer.
For the internet, I see an amazing number of systems with no redundancy whatsoever. Of course, the first hardware failure usually corrects the problem, at the cost of substantial down-time. But many second-tier ISPs and dot-coms are still operating on brand-new equipment that hasn't started hitting its MTBF specs yet, and they don't even have a clue about their MTTR ratings. In the next few years, I expect to see a lot more failures, as the equipment starts to age.
I'm not sure that is true. Brand new electronic equipment tends to have a period of infant mortality. If it survives, it tends to be reliable for a fairly long period of time. I had customers still using Proteon routers 10 years after Proteon discontinued the model. However, scaling requires most dot-coms to replace or upgrade their equipment every few months, so they are always dealing with infant mortality.
But back to my question. What is the real requirement? Amazon.COM had system problems on Friday, and their site was unusable for 30 minutes, definitely not 99.999%. But what did that really mean? The FAA loses its radar for several hours in various parts of the country. What did that really mean? Essentially every system I've looked at that was given as an example of "high-availability, high-reliability" doesn't hold up under close examination. Is 99.999% just F.U.D. created by consultants? Instead of pretending we can build systems which will never fail, should we work on a realistic understanding of what can be delivered?
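To put rough numbers on the 99.999% question, here is a minimal sketch of the arithmetic. It assumes availability is measured over one calendar year and that the 30-minute outage is the only downtime in that year; neither assumption comes from any vendor's or carrier's actual SLA, and the function names are just illustrative.

```python
# Rough availability arithmetic. The one-year measurement window is an
# assumption; a real SLA may use a month, a quarter, or exclude maintenance.
MINUTES_PER_YEAR = 365.25 * 24 * 60  # 525,960 minutes


def downtime_budget_minutes(availability, period_minutes=MINUTES_PER_YEAR):
    """Minutes of downtime allowed in the period at a given availability."""
    return (1.0 - availability) * period_minutes


def availability_after_outage(outage_minutes, period_minutes=MINUTES_PER_YEAR):
    """Availability over the period if this outage is the only downtime."""
    return 1.0 - outage_minutes / period_minutes


if __name__ == "__main__":
    for label, target in [("99.9%", 0.999), ("99.99%", 0.9999), ("99.999%", 0.99999)]:
        budget = downtime_budget_minutes(target)
        print(f"{label:>8}: {budget:8.2f} minutes of downtime per year")

    # A single 30-minute outage, assumed to be the only downtime that year.
    achieved = availability_after_outage(30)
    print(f"30-minute outage alone: {achieved * 100:.4f}% availability")
```

Under those assumptions, five nines leaves a little over five minutes of downtime per year, so one 30-minute outage uses up several years' worth of budget, while still fitting inside a 99.99% target.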