RE: availability and resiliency
A "9" refers to the percent uptime of a host or site: 99% is two nines and 99.9% is three nines. Getting a single host to meet more than three nines (99.9%, <8.76 hours of outage per year) can be a challenge, but the target is more easily met with multiple hosts in a site. Four nines (99.99% uptime, <0.88 hours annual downtime) is extremely difficult for a single host, less difficult for internal data centers, and (given lots of $$$) a bit easier for an internet site (using multiply redundant hosts). Five nines (99.999%, <5.26 minutes annual downtime) is almost impossible for a single affordable host to meet. This is where we enter the world of High-Availability (H-A) systems. These are usually critical systems with high transaction flow, found in large corps, telcos, and reliable internet sites.

At this time, only governments are willing to part with the required cash to build systems meeting six nines (99.9999%, <0.53 minutes annual downtime) or better (NASA, NORAD, US Space Command, etc). Usually, this is done using multiple site redundancy.

Hosts meeting three nines or better typically have redundant power supplies and integrated UPS, bootable RAID for the OS, redundant NICs, and SMP CPU configurations.

-----Original Message-----
From: Leo Nelson [mailto:lnelson@axient.com]
Sent: Friday, September 29, 2000 10:08 AM
To: 'nop@alt.net'; Andrew Bangs
Cc: nanog@merit.edu
Subject: RE: availability and resiliency

Pardon my ignorance, but what the heck does a "9" refer to? Is it a UPS, rack, floor space, circuit...

Thanks in advance
-leo

-----Original Message-----
From: Lionel Lauer [mailto:longword@newsguy.com]
Sent: Friday, September 29, 2000 10:54 AM
To: Andrew Bangs
Cc: nanog@merit.edu
Subject: Re: availability and resiliency

On Fri, 29 Sep 2000 17:39:04 +0100, Andrew Bangs <andrewb@demon.net> wrote:
On Thu, Sep 28, 2000 at 02:39:40PM -0600, Irwin Lazar wrote:
Hi all, Does anyone know if a template exists for what it takes to provide 5 9's of availability, 4 9's etc. for Internet data centers? Specifically I'm looking for something that would say "if you want 5 9's of availability, here's what you need to do", and so on.
For 5 9s you need:
1) Lots of money
2) Lots of clue
3) Lots of luck
4) Lots of balls
You can do 4 9s with any 3 of the above.
Too true. But you forgot to include 'halfway-clued management' - without that you haven't got a hope in hell of even getting three 9's. ;)

--
"Some people are alive only because it is illegal to kill them."
Perna condita delenda est
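The downtime figures quoted earlier in the thread for each number of nines follow from simple arithmetic: the allowed downtime is the unavailable fraction multiplied by the length of a year. A minimal sketch, assuming a 365-day year and using function names that are purely illustrative (they do not come from the thread):

```python
# Assumption: a non-leap year of 365 days.
MINUTES_PER_YEAR = 365 * 24 * 60


def availability_percent(nines: int) -> float:
    """Uptime percentage for a given number of nines, e.g. 3 -> 99.9."""
    return 100 - 100 * 10 ** -nines


def annual_downtime_minutes(nines: int) -> float:
    """Maximum minutes of downtime per year allowed at that many nines."""
    return MINUTES_PER_YEAR * 10 ** -nines


# Print the downtime budget for two through six nines.
for n in range(2, 7):
    minutes = annual_downtime_minutes(n)
    print(f"{n} nines ({availability_percent(n):.4f}%): "
          f"{minutes:9.2f} min/year = {minutes / 60:6.2f} h/year")
```

Run as-is, this reproduces the thread's numbers: about 8.76 hours per year at three nines, about 5.26 minutes at five nines, and about 0.53 minutes at six.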
.... At this time, only governments are willing to part with the required cash to build systems meeting six nines (99.9999%, <0.53 minutes annual downtime), or better (NASA, NORAD, US Space Command, etc). Usually, this is done using multiple site redundancy.
Hosts meeting three nines, or better, typically have redundant power supplies and integrated UPS, bootable RAID for the OS, redundant NICs, and SMP CPU configurations.
um...is an smp cpu configuration really going to help your uptime? or are there operating systems or hardware out there that can say to themselves "hmph! cpu 2 seems not to be working correctly...i'd better spin it down."

just for fun a few years back i decided to check if the sun e4000 we had had hot-swappable cpus (i figured it didn't, but why not try it?) and i pulled one of the boards. it didn't like it too much.

--
|-----< "CODE WARRIOR" >-----|
codewarrior@daemon.org * "ah! i see you have the internet
twofsonet@graffiti.com (Andrew Brown) that goes *ping*!"
andrew@crossbar.com * "information is power -- share the wealth."
On Fri, 29 Sep 2000 18:42:12 EDT, Andrew Brown said:
um...is an smp cpu configuration really going to help your uptime? or are there operating systems or hardware out there that can say to themselves "hmph! cpu 2 seems not to be working correctly...i'd better spin it down."
IBM mainframes have been doing this for decades. I believe that both OS/VS1 and VM/370 for the S370-158 supported this back in the 1973 timeframe.

About 10 years ago, our 3090-300 blew a TCM and lost one of the 3 CPUs. As I was sitting there diagnosing the problem at the console, I got a popup dialog box from the onboard support processor. Basically, it wanted to phone IBM Hardware Support and tell them to send a guy with a new TCM, but it had detected that the number was more than 7 digits and therefore probably a long distance phone call, was this OK?

Yes, it asked permission to rack up the phone bill before it called for repairs itself.

Current mainframe state of the art is described in the IBM Journal of Research and Development - Vol 43, Number 5/6 (Sep/Nov 99), which was devoted to the G5 and G6 chipsets used in current IBM S/390 big iron. The article "RAS strategy for IBM S/390 G5 and G6" (page 875) talks about the system's ability to not only detect a failing CPU, but on detection it will latch out the last known good state from the previous instruction, and retry the failing machine instruction on a hot-spare. That's after a reset-and-retry on the failing processor has proven it's a hard failure and not a soft one.

The mind boggles.... ;)

--
Valdis Kletnieks
Operating Systems Analyst
Virginia Tech
On Fri, Sep 29, 2000, Valdis.Kletnieks@vt.edu wrote:
On Fri, 29 Sep 2000 18:42:12 EDT, Andrew Brown said:
um...is an smp cpu configuration really going to help your uptime? or are there operating systems or hardware out there that can say to themselves "hmph! cpu 2 seems not to be working correctly...i'd better spin it down."
IBM mainframes have been doing this for decades. I believe that both OS/VS1 and VM/370 for the S370-158 supported this back in the 1973 timeframe.
About 10 years ago, our 3090-300 blew a TCM and lost one of the 3 CPUs. As I was sitting there diagnosing the problem at the console, I got a popup dialog box from the onboard support processor. Basically, it wanted to phone IBM Hardware Support and tell them to send a guy with a new TCM, but it had detected that the number was more than 7 digits and therefore probably a long distance phone call, was this OK?
Yes, it asked permission to rack up the phone bill before it called for repairs itself.
Current mainframe state of the art is described in the IBM Journal of Research and Development - Vol 43, Number 5/6 (Sep/Nov 99), which was devoted to the G5 and G6 chipsets used in current IBM S/390 big iron. The article "RAS strategy for IBM S/390 G5 and G6" (page 875) talks about the system's ability to not only detect a failing CPU, but on detection it will latch out the last known good state from the previous instruction, and retry the failing machine instruction on a hot-spare. That's after a reset-and-retry on the failing processor has proven it's a hard failure and not a soft one.
The mind boggles.... ;)
.. and the concept of this happening on Wintel hardware running anything is sheer ludicrousy. Whoever mentioned that SMP can help you get high uptime boxes is smoking heavy crack in most cases.

Note that the big-end Alpha and Sun gear is NUMA, not SMP. Different kettle of fish there, and if you need an explanation as to why it's more likely to happen with NUMA and not SMP, there are lots of hardware books out there. :-)

Adrian, who notes a lot of "5 9's" computing problems were solved in the 70s and yet don't appear in most equipment in the naughties.

--
Adrian Chadd <adrian@creative.net.au>
"If a butterfly flaps its wings in China, will a women get naked in Amsterdam?"
-- Ashley Penney on Chaos Theory
and retry the failing machine instruction on a hot-spare. That's after a reset-and-retry on the failing processor has proven it's a hard failure and not a soft one.
The mind boggles.... ;)
.. and the concept of this happening on Wintel hardware running anything is sheer ludicrousy. Whoever mentioned that SMP can help you get high uptime boxes is smoking heavy crack in most cases.
Note that the big-end Alpha and Sun gear is NUMA, not SMP. Different kettle of fish there, and if you need an explanation as to why it's more likely to happen with NUMA and not SMP, there are lots of hardware books out there. :-)
If you're looking at implementing "5 9's", check out Sun's FT1800: very nice box (read: looks nice ;), easy to admin, and so far has been rock solid for us (not that Solaris crashes much anymore anyway, but at least you no longer have to worry about hardware resilience with the FT). All you have to worry about then is disparate power and software stability.

--
Regards,
Jay Tribick
Senior Systems Engineer
Carrier1
Voice: +44 207 531 3874
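Two ideas recur in this thread: redundant hosts make a target easier to reach, and even a fault-tolerant box still depends on power and software. Both can be sketched with the standard availability formulas for parallel and series components. This is only a rough model, assuming failures are independent (rarely true in practice, since shared power or a shared software bug correlates them). The function names and every figure in the example are hypothetical, not measurements from any poster.

```python
from math import prod


def parallel(availabilities):
    """Redundant components: the service is up if at least one is up."""
    return 1 - prod(1 - a for a in availabilities)


def series(availabilities):
    """Dependent components: the service is up only if all are up."""
    return prod(availabilities)


# Hypothetical: two independent three-nines hosts behind one service.
two_hosts = parallel([0.999, 0.999])

# Hypothetical: a five-nines fault-tolerant host that still depends on
# power (assumed 99.99%) and software stability (assumed 99.9%).
ft_stack = series([0.99999, 0.9999, 0.999])

print(f"two redundant hosts:   {two_hosts:.6f}")
print(f"FT host + power + s/w: {ft_stack:.6f}")
```

Under these assumptions the weakest series component dominates: however good the hardware is, the stack cannot be more available than its power or its software.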
participants (5)
- Adrian Chadd
- Andrew Brown
- Jay Tribick
- Roeland M.J. Meyer
- Valdis.Kletnieks@vt.edu