pete@cobra.brass.com (Peter Polasek) writes:
This interface is extremely mission-critical, to the point that 99.9% uptime will not be acceptable. I have the following questions:
1) Bell Atlantic assures us that, because of the redundancy, we can expect 100% uptime from the OC-12. I would like feedback as to whether this is a realistic portrayal of the SONET environment.
How many significant digits do you consider acceptable? Even in an ideal APS environment, link failure detection and protection switching does take finite time. You might get 99.999% uptime, but probably not 99.9999999%.

Methinks that you've been subjected to Marketing. ;-)

Tony
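To put rough numbers on the finite-switch-time point, here is a minimal back-of-the-envelope sketch (in Python, assuming the commonly cited ~50 ms target for a SONET protection switch, a figure not given in this thread); it shows that a single protection switch already uses up more than a nine-nines annual downtime budget:

    # Sketch: compare one ~50 ms protection switch (an assumed figure,
    # not from this thread) against annual downtime budgets.
    SECONDS_PER_YEAR = 365 * 24 * 3600     # 31,536,000 s
    SWITCH_TIME_S = 0.050                  # assumed ~50 ms protection switch

    for label, pct in (("five nines", 99.999), ("nine nines", 99.9999999)):
        budget_s = SECONDS_PER_YEAR * (1 - pct / 100.0)
        print(f"{label}: budget {budget_s:.3f} s/year; "
              f"one 50 ms switch fits: {SWITCH_TIME_S <= budget_s}")

For five nines the budget is about 315 seconds a year, so protection switching is negligible; for nine nines it is about 31.5 ms, less than a single switch event.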
At 05:35 PM 10/15/98 -0700, Tony Li wrote:
pete@cobra.brass.com (Peter Polasek) writes:
This interface is extremely mission-critical, to the point that 99.9% uptime will not be acceptable. I have the following questions:
1) Bell Atlantic assures us that, because of the redundancy, we can expect 100% uptime from the OC-12. I would like feedback as to whether this is a realistic portrayal of the SONET environment.
How many significant digits do you consider acceptable? Even in an ideal APS environment, link failure detection and protection switching does take finite time. You might get 99.999% uptime, but probably not 99.9999999%.
Methinks that you've been subjected to Marketing. ;-)
For total system uptime:

  90.0%      (one nine or less)  Desktop systems.
  99.0%      (two nines)         Intermediate business systems
  99.9%      (three nines)       Most business data systems and workgroup servers
  99.99%     (four nines)        High-end business systems and your friendly neighborhood telco
  99.999%    (five nines)        Bank Data Centers and Telco Data Centers, some ISPs
  99.9999%   (six nines)         Only God and Norad live here.
  99.99999%  (seven nines)       Even God doesn't have pockets this deep.

There is a matching exponential cost increment with each step.

___________________________________________________
Roeland M.J. Meyer, ISOC (InterNIC RM993)
e-mail: rmeyer@mhsc.com
Internet phone: hawk.mhsc.com
Personal web pages: www.mhsc.com/~rmeyer
Company web-site: www.mhsc.com/
___________________________________________
I bet the human brain is a kludge. -- Marvin Minsky
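As a quick check on what each step of this scale allows, here is a minimal sketch (in Python, assuming a 365-day year) that converts an availability figure into its downtime budget per year:

    # Sketch: availability percentage -> allowed downtime per year.
    MINUTES_PER_YEAR = 365 * 24 * 60   # 525,600 minutes

    def allowed_downtime_minutes(availability_pct: float) -> float:
        """Downtime budget per year, in minutes, for a given availability %."""
        return MINUTES_PER_YEAR * (1 - availability_pct / 100.0)

    for pct in (90.0, 99.0, 99.9, 99.99, 99.999, 99.9999, 99.99999):
        print(f"{pct:>10}%  ->  {allowed_downtime_minutes(pct):12.2f} min/year")

Three nines works out to roughly 526 minutes (about 8.8 hours) of downtime a year, five nines to about 5.3 minutes, and seven nines to about 3 seconds.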
For total system uptime:

  90.0%      (one nine or less)  Desktop systems.
  99.0%      (two nines)         Intermediate business systems
  99.9%      (three nines)       Most business data systems and workgroup servers
  99.99%     (four nines)        High-end business systems and your friendly neighborhood telco
  99.999%    (five nines)        Bank Data Centers and Telco Data Centers, some ISPs
  99.9999%   (six nines)         Only God and Norad live here.
  99.99999%  (seven nines)       Even God doesn't have pockets this deep.
What's your source for this data? -a
At 10:34 PM 10/16/98 -0700, Alan Hannan wrote:
For total system uptime:

  90.0%      (one nine or less)  Desktop systems.
  99.0%      (two nines)         Intermediate business systems
  99.9%      (three nines)       Most business data systems and workgroup servers
  99.99%     (four nines)        High-end business systems and your friendly neighborhood telco
  99.999%    (five nines)        Bank Data Centers and Telco Data Centers, some ISPs
  99.9999%   (six nines)         Only God and Norad live here.
  99.99999%  (seven nines)       Even God doesn't have pockets this deep.
What's your source for this data?
You mean, besides 22 years in the system design and development trade? Well, you can start with the various companies I have worked for, then the manufacturing specs on the various systems.

My last analysis involved a two-headed server setup on the HP 9000 Series T520 with MC/ServiceGuard and shared RAID5. HP guarantees that at three nines. With the right add-ons I got it to four nines (complete second site in AZ, 1500 miles away). Very expensive. Five nines would have broken the budget; that was Wells Fargo.

Northrop Grumman MD-18 flight-line support.

The PacBell broadband system was quad-redundant data centers in Fairfield and San Diego. I was hired in as the Technical Architect for that system. Again, HP equipment. That system would have hit five nines, or better, in production. I think we were pushing past $16M on that system: thirty-six specially configured T520s plus RAID packs.

Various systems I worked on in Patrice Carrol's org in MCI COS (Garden of the Gods facility), including the Fraud Management System.

This stuff is more art than science; there are too many non-deterministic variables. Experience is the only thing that counts. It tells you which formulae to use and when they have a chance of working.

I should have my web-site up again this weekend; we're converting to FastTrack with LiveWire, in addition to Apache-SSL/mod_perl.

___________________________________________________
Roeland M.J. Meyer, ISOC (InterNIC RM993)
e-mail: rmeyer@mhsc.com
Internet phone: hawk.mhsc.com
Personal web pages: www.mhsc.com/~rmeyer
Company web-site: www.mhsc.com/
___________________________________________
I bet the human brain is a kludge. -- Marvin Minsky
There's an interesting white paper about a Bell Atlantic SONET deployment for military organizations at: http://www.bell-atl.atd.net/s-wpaper
How many significant digits do you consider acceptable? Even in an ideal APS environment, link failure detection and protection switching does take finite time. You might get 99.999% uptime, but probably not 99.9999999%.
The thing that always got me was that there never seems to be a mention of the sampling period for the stat.
Methinks that you've been subjected to Marketing. ;-)
Well ... I'll give you 99.9999999% on any system you like - with a sampling period of, say, every billion years. I think that allows me to stay down for the first 100 years, long enough to extend beyond the life of any stressed sysadmin :)

More seriously - SLAs that specify a sampling period then also give an indication of what is considered too long an outage. If you get just under the 0.1% downtime allowed per year all in one go, you may well be pretty pissed at being told the 8-hour outage was within the SLA.

Manar
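The sampling-period point can be made concrete with a minimal sketch (in Python; the per-month window is just an illustrative assumption): the same 8-hour outage fits inside a 99.9% SLA measured over a year, but blows the budget when the SLA is measured per month:

    # Sketch: the same outage judged against different SLA sampling windows.
    def budget_hours(availability_pct: float, window_hours: float) -> float:
        """Downtime allowed by an SLA over a given measurement window."""
        return window_hours * (1 - availability_pct / 100.0)

    outage = 8.0                              # one 8-hour outage
    yearly = budget_hours(99.9, 365 * 24)     # ~8.76 h allowed per year
    monthly = budget_hours(99.9, 30 * 24)     # ~0.72 h allowed per month

    print(f"yearly budget:  {yearly:.2f} h -> outage within SLA: {outage <= yearly}")
    print(f"monthly budget: {monthly:.2f} h -> outage within SLA: {outage <= monthly}")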
At 10:36 PM 10/16/98 +0100, Manar Hussain wrote:
How many significant digits do you consider acceptable? Even in an ideal APS environment, link failure detection and protection switching does take finite time. You might get 99.999% uptime, but probably not 99.9999999%.
The thing that always got me was that there never seems to be a mention of the sampling period for the stat.
Methinks that you've been subjected to Marketing. ;-)
Well ... I'll give you 99.9999999% on any system you like - with a sampling period of, say, every billion years. I think that allows me to stay down for the first 100 years, long enough to extend beyond the life of any stressed sysadmin :)

More seriously - SLAs that specify a sampling period then also give an indication of what is considered too long an outage. If you get just under the 0.1% downtime allowed per year all in one go, you may well be pretty pissed at being told the 8-hour outage was within the SLA.
The quasi-engineering guideline for many CLECs when calculating average downtime over a year's span is 52 minutes (meaning .0001% downtime over the year). Anything above and beyond this estimate would be suspect. Obviously, these engineering baselines vary from carrier to carrier.

Also, this 52-minute guideline relates to the SONET ring and the muxes, and not the tributaries (OC-3 or OC-12) or the optical/electrical hand-offs that might fail due to bad terminations, bad wiring, or misconfigured nodes. A common failure for OC-3c or OC-12c is the 2-fiber optical handoff to the customer, which has nothing to do with the SONET ring itself or the associated SONET gear.

Dave Cooper
Electric Lightwave, Inc.
Disclaimer: Comments above reflect my experience with numerous CLECs and not specifically ELI.
At 10:42 AM 10/19/98 -0500, Dave Cooper wrote:
At 10:36 PM 10/16/98 +0100, Manar Hussain wrote:
How many significant digits do you consider acceptable? Even in an ideal APS environment, link failure detection and protection switching does take finite time. You might get 99.999% uptime, but probably not 99.9999999%.
The thing that always got me was that there never seems to be a mention of the sampling period for the stat.
Methinks that you've been subjected to Marketing. ;-)
Well ... I'll give you 99.9999999% on any system you like - with a sampling period of, say, every billion years. I think that allows me to stay down for the first 100 years, long enough to extend beyond the life of any stressed sysadmin :)

More seriously - SLAs that specify a sampling period then also give an indication of what is considered too long an outage. If you get just under the 0.1% downtime allowed per year all in one go, you may well be pretty pissed at being told the 8-hour outage was within the SLA.
The quasi-engineering guideline for many CLECs when calculating average downtime over a year's span is 52 minutes (meaning .0001% downtime over the year). Anything above and beyond this estimate would be suspect.
Sorry, drop the % on the .0001 -> should be .01%. Coffee wasn't strong enough this morning. Thanks Barry.

-dave cooper
eli
Obviously, these Engineering baselines vary from carrier to carrier. Also, this 52 minute guideline relates to the SONET ring and the muxes and not the tributaries (OC-3 or OC-12) or the optical/electrical hand-offs that might fail due to bad terminations/bad wiring/or misconfigured nodes. A common failure for OC-3c or OC-12c is the 2-fiber optical handoff to the customer which has nothing to do with the SONET ring itself or the associated SONET gear.
Dave Cooper
Electric Lightwave, Inc.
Disclaimer: Comments above reflect my experience with numerous CLECs and not specifically ELI.
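As a sanity check on the corrected figure, here is a minimal sketch (in Python, assuming a 365-day year) confirming that 52 minutes of downtime a year is roughly .01%, i.e. about four nines:

    # Sketch: 52 minutes of downtime per 365-day year as a percentage.
    MINUTES_PER_YEAR = 365 * 24 * 60              # 525,600 minutes

    downtime_fraction = 52 / MINUTES_PER_YEAR     # ~0.0000989
    print(f"downtime:     {downtime_fraction * 100:.4f}%")        # ~0.0099%
    print(f"availability: {(1 - downtime_fraction) * 100:.4f}%")  # ~99.9901%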
participants (6)
- Alan Hannan
- Dave Cooper
- Howard C. Berkowitz
- Manar Hussain
- Roeland M.J. Meyer
- Tony Li