dual router vs. single "reliable" router
Okay, I'll bite... the mentioned file, http://www.nspllc.com/New%20Pages/Reliable%20IP%20Nodes.pdf seems to be fluff to me.

There are many assumptions and statements about reliability, but the methodology of how the numbers were reached is not present. If one assumes that one has a router which fails very rarely, this would dramatically affect network design. However, this is an assumption, not a conclusion. The assumption of the paper is that the Alcatel box has ultra-low failure rates, while the Juniper and Cisco boxen have relatively high failure rates.

Personally, before I let something like this influence my buying/design decisions, I'd want to see some serious raw data...

=====
David Barak
-fully RFC 1925 compliant-
On Thu, Apr 10, 2003 at 08:07:14AM -0700, David Barak wrote:
There are many assumptions and statements about reliability, but the methodology of how the numbers were reached is not present. If one assumes that one has a router which fails very rarely, this would dramatically affect network design. However, this is an assumption, not a conclusion. The assumption of the paper is that the Alcatel box has ultra-low failure rates, while the Juniper and Cisco boxen have relatively high failure rates. Personally, before I let something like this influence my buying/design decisions, I'd want to see some serious raw data...
2x the hardware means 2x the number of hardware failures. It also means 2x the number of software upgrades, and probably some multiplier greater than 2x for the increased complexity and opportunity for software to go wrong. Dual routers just increase the number of overall failures in exchange for hoping that only one goes down at any given time.

Throw in some assumptions (which may or may not be true; I'll agree that some of their numbers are a little "off") that every one of those failures involves some service impact, and you could easily make a case that one box which doesn't go down is better than two boxes which routinely go down.

On one side of the coin, Cisco has done a masterful job at convincing the networking industry that the correct answer to their routine failures is to purchase double of everything. On the other side... Show me the box that never goes down. :)

--
Richard A Steenbergen <ras@e-gerbil.net> http://www.e-gerbil.net/ras
GPG Key ID: 0xF8B12CBC (7535 7F59 8204 ED1F CC1C 53AF 4C41 5ECA F8B1 2CBC)
--- Richard A Steenbergen <ras@e-gerbil.net> wrote:
2x the hardware means 2x the number of hardware failures. It also means 2x the number of software upgrades, and probably some multiplier greater than 2x for the increased complexity and opportunity for software to go wrong. Dual routers just increase the number of overall failures in exchange for hoping that only one goes down at any given time.
The fallacy here is the assumption that the additional failures a dual-router scenario encounters are of the same qualitative type as the failures a single router encounters. This is clearly not true: when one of a pair fails, there is a period of convergence, and then the remaining router carries the load. If a single router fails, the load is not carried until the router can be restored.
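As a rough illustration of that distinction, the sketch below counts failure events and outage hours separately. Every number in it (failures per year, repair time, convergence time) is hypothetical, and it assumes the two routers fail independently and that failover always works, which is exactly the assumption the earlier post questions.

```python
HOURS_PER_YEAR = 8760.0


def single_router(failures_per_year, repair_hours):
    """Return (failure events, outage hours) per year for one router."""
    # Every failure is an outage that lasts until the box is repaired.
    return failures_per_year, failures_per_year * repair_hours


def router_pair(failures_per_year, repair_hours, convergence_seconds):
    """Return (failure events, outage hours) per year for a redundant pair.

    Assumes independent failures and a failover that always succeeds.
    """
    # Twice the hardware, so twice the failure events.
    events = 2 * failures_per_year
    # Each event costs only the convergence period.
    convergence_hours = events * convergence_seconds / 3600.0
    # Fraction of the year one router is down, and the expected time
    # during which both are down at once (independence assumed).
    unavailability = failures_per_year * repair_hours / HOURS_PER_YEAR
    both_down_hours = unavailability ** 2 * HOURS_PER_YEAR
    return events, convergence_hours + both_down_hours


# Hypothetical inputs: 4 failures/year per router, 4 h to repair, 30 s to converge.
single_events, single_outage = single_router(4, 4.0)
pair_events, pair_outage = router_pair(4, 4.0, 30.0)

print(f"single: {single_events} failures, {single_outage:.2f} outage hours/year")
print(f"pair:   {pair_events} failures, {pair_outage:.2f} outage hours/year")
```

Under these assumptions the pair sees twice as many failures but far fewer outage hours. If a meaningful share of failures were correlated (a shared software bug or an operator error hitting both boxes), the gap would narrow.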
On one side of the coin, Cisco has done a masterful job at convincing the networking industry that the correct answer to their routine failures is to purchase double of everything. On the other side... Show me the box that never goes down. :)
My point exactly: from a design perspective it's much simpler to have a single box, but I have not seen single boxen which don't fail. I'm actually a big fan of the "cold-spare" approach: you preserve your simplicity, and any outage only lasts as long as it takes to unplug and re-plug...
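One way to put numbers on the cold-spare idea is the standard steady-state formula, availability = MTBF / (MTBF + MTTR), where the spare only shortens the time to restore. This is a minimal sketch; the MTBF and both restore times are made-up values, not measurements from any vendor.

```python
def availability(mtbf_hours, mttr_hours):
    """Steady-state availability from mean time between failures and mean time to restore."""
    return mtbf_hours / (mtbf_hours + mttr_hours)


# Hypothetical values: the same box, restored either by repair or by swapping in a spare.
MTBF_HOURS = 2000.0
REPAIR_HOURS = 8.0   # assumed time to diagnose and fix in place
SWAP_HOURS = 0.5     # assumed time to unplug and re-plug onto the cold spare

without_spare = availability(MTBF_HOURS, REPAIR_HOURS)
with_spare = availability(MTBF_HOURS, SWAP_HOURS)

print(f"repair in place: {without_spare:.5f}")
print(f"cold spare swap: {with_spare:.5f}")
```

The failure rate is unchanged in both cases; only the outage length per failure differs.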
Thus spake "Richard A Steenbergen" <ras@e-gerbil.net>
Throw in some assumptions (which may or may not be true; I'll agree that some of their numbers are a little "off") that every one of those failures involves some service impact, and you could easily make a case that one box which doesn't go down is better than two boxes which routinely go down.
If a tree falls down in a forest, but service isn't affected, do we care if the tree falling made a noise?

If you have two devices which are each up 99% of the time, then at least one of the two is up 99.99% of the time, assuming their failures are independent. While designing with two of everything is indeed more complex, it's often simpler than designing a single product that's more reliable.

Having two of everything also simplifies maintenance, since you don't care (much) about an individual box being down. Public ATM networks are hell to maintain because every node must be up 24x7, and simple things (to routerheads) like a software upgrade are a 3+ month project because it must be done online without dropping a single cell.

S

Stephen Sprunk         "God does not play dice."  --Albert Einstein
CCIE #3723         "God is an inveterate gambler, and He throws the
K5SSS        dice at every possible opportunity." --Stephen Hawking
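The 99% to 99.99% figure follows from treating the two devices as independent: the pair is down only when both are down at once. A small sketch of that arithmetic (the function name is just for illustration):

```python
def parallel_availability(device_availability, count=2):
    """Probability that at least one of `count` independent devices is up."""
    # All devices must be down simultaneously for the service to be down.
    all_down = (1.0 - device_availability) ** count
    return 1.0 - all_down


single = 0.99
pair = parallel_availability(single, 2)

print(f"single device: {single:.2%}")
print(f"redundant pair: {pair:.2%}")
```

Correlated failures (a shared power feed, the same software bug, one operator touching both boxes) would make the real figure lower than this independent-failure estimate.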
Thus spake "David Barak" <thegameiam@yahoo.com>
There are many assumptions and statements about reliability, but the methodology of how the numbers were reached is not present. If one assumes that one has a router which fails very rarely, this would dramatically affect network design. However, this is an assumption, not a conclusion. The assumption of the paper is that the Alcatel box has ultra-low failure rates, while the Juniper and Cisco boxen have relatively high failure rates. Personally, before I let something like this influence my buying/design decisions, I'd want to see some serious raw data...
Nearly all the Cisco device failures I've seen were either software or human problems; actual hardware failure is _way_ down the list. Also, I've observed significantly worse reliability among devices specifically designed to be highly reliable compared to devices simply designed to work.

There are several networks out there using Cisco devices to achieve over six 9's availability, and the way they do that is by extensive procedure review and rigorous software testing. Writing more reliable software is certainly doable, but more-reliable humans aren't likely and more-reliable hardware is unnecessary. IMHO.

S
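For a sense of scale, "N nines" of availability can be converted into an allowed downtime per year. The sketch below assumes a 365-day year; six 9's works out to roughly half a minute of downtime annually.

```python
SECONDS_PER_YEAR = 365 * 24 * 3600  # assumes a 365-day year


def downtime_seconds_per_year(nines):
    """Allowed downtime per year for an availability of `nines` nines (e.g. 4 -> 99.99%)."""
    unavailability = 10.0 ** (-nines)
    return SECONDS_PER_YEAR * unavailability


for nines in range(2, 7):
    seconds = downtime_seconds_per_year(nines)
    print(f"{nines} nines: {seconds:12.1f} seconds/year ({seconds / 60.0:9.2f} minutes)")
```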
On Thu, Apr 10, 2003 at 11:59:25AM -0500, Stephen Sprunk wrote:
Nearly all the Cisco device failures I've seen were either software or human problems; actual hardware failure is _way_ down the list. Also, I've observed significantly worse reliability among devices specifically designed to be highly reliable compared to devices simply designed to work.
There are several networks out there using Cisco devices to achieve over six 9's availability, and the way they do that is by extensive procedure review and rigorous software testing. Writing more reliable software is certainly doable, but more-reliable humans aren't likely and more-reliable hardware is unnecessary. IMHO.
This is also my experience. The chance of a forklift or ceiling tile taking out your infrastructure is not even close to the number of times you have to tell the junior guys "nononononono, `debug all' is a _bad_ idea".

Until the software reduces or eliminates pilot error to a significant degree, and is proven to prevent forwarding issues (read: FIB bugs), there is just no big motivation to run a single box. A single box has its application in IP networks, but more so in the access layer (customer edge) or at the Internet edge (peering).

I'm just flinching thinking about using a single box in the core (per POP), when a single command of any type can take out the whole box (`no ip routing' immediately comes to mind). There's just more software on IP boxes compared to telco technology.

The only work I've seen in the IETF on the topic is this draft:

http://www.ietf.org/internet-drafts/draft-kilsdonk-router-upgrade-01.txt

But it leaves a lot to be desired, IMO.

dre
The report seemed to like the boxes from the companies that Alcatel bought. On this side of the pond they did not get a lot of traction.

From my experience with a "large" Cisco network, besides the normal problem of too many fingers on the CLI, there were about one to two hardware failures a week on the big "C" devices. Most of these were interface cards, but some were control cards on switches. As long as you did not overrun the 12xxx series with too many BGP updates per unit of time, they were stable as a box.

To calculate "99.99" percent uptime you:

- Define scheduled downtime as any maintenance that can be done after giving the worldwide network at least five minutes' notice of major network outages ...
- Define the device to be up as long as you can secure telnet into the box and get a command prompt
- Ignore BGP, routing or forwarding tables, or any other routing issues, because "up-time" means the box is up, not that it is doing any useful work ...

And finally, go back to a "real person" deterministic protocol like "ATM" with PNNI and UNI so you "know" when the network is down and do not have to guess ... My layer two network ran for two years with no hardware or routing issues ...

The other minor technical issue is this: say you can calculate uptime on a "router"; how do you calculate uptime on the network? ... The "network" is still up even when one router is down, so the individual uptime of a router in an IP network may only affect SLAs for customers directly attached to it.

John (It Suites Dennis's Needs) Lee

David Barak wrote:
Okay, I'll bite...
the mentioned file, http://www.nspllc.com/New%20Pages/Reliable%20IP%20Nodes.pdf seems to be fluff to me.
There are many assumptions and statements about reliability, but the methodology of how the numbers were reached is not present. If one assumes that one has a router which fails very rarely, this would dramatically affect network design. However, this is an assumption, not a conclusion. The assumption of the paper is that the Alcatel box has ultra-low failure rates, while the Juniper and Cisco boxen have relatively high failure rates. Personally, before I let something like this influence my buying/design decisions, I'd want to see some serious raw data...
participants (5)
- David Barak
- dre
- John L Lee
- Richard A Steenbergen
- Stephen Sprunk