Re: One-element vs two-element design
I personally favor the N+1 design model, as it allows maintenance to be performed on network elements without causing outages, which makes the customers happy. In many instances you can leverage the N+1 model to share the load between the devices, thereby increasing network capacity. As an additional benefit, in the event of an element failure your network degrades gracefully rather than failing hard and requiring an "all hands" operation to get it back online. This tends to reduce the operational costs of your network even though your implementation cost is higher, so over the lifetime of your network the overall cost is lower; i.e. service contracts can be NBD rather than 24x7x2.

The N+1 model also takes into account the simple fact that stuff breaks! I was reading the FIPS standards for machine room design one day and an entire page was devoted to "ALL EQUIPMENT WILL FAIL EVENTUALLY", a lesson which is often forgotten. This is why commercial airliners have multiple engines: even though the system is less reliable overall than a well designed single engine craft, the failure of a single component does not entail the catastrophic failure of the entire system. (There are exceptions to this, but the overall concept does work.)

In the end it comes down to a reliable vs. a resilient network. In a reliable network, components fail infrequently but have catastrophic failure modes. In a resilient network, component failure is taken as a given, but the overall system reliability is much higher than in a reliable network, since a component failure does not equal a functional failure.

Scott C. McGrath

On Fri, 16 Jan 2004 Brent_OKeeffe@asc.aon.com wrote:
One key consideration you should think about is the ability to perform maintenance on redundant devices in the N+1 model without impacting the availability of the network.
Brent
Timothy Brown <tim@tux.org> Sent by: owner-nanog@merit.edu 01/16/2004 10:14 PM
To: nanog@merit.edu cc: Subject: One-element vs two-element design
I fear this may be a mother of a debate.
In my (short?) career, I've been involved in several designs, some successful, some less so. I've recently been asked to contribute a design for one of the networks I work on. The design brings with it a number of challenges but also, unlike a greenfield network, a lot of history.
One of the major decisions I'm facing is a choice between a one-element or two-element design. When I refer to elements, what I really mean is N or N+1. For quite some time now, vendors have been improving hardware to the point where most components in a given device, with the exception of a line card, can be made redundant. This includes things like routing and switching processors, power supplies, buses, and even, in the case of vendor J and several others, the possibility of in-flight restarts of particular portions of the software, either as part of scheduled maintenance or to correct a problem.
I have traditionally been of the school that says it is best to have two devices of equal power on the same footing, and, in multiple-site configurations, four devices of equal power and equal footing. N+1 feels like the safe argument to make, so that is the philosophy I tend to adopt. N+2 or N+whatever doesn't seem to add a lot of additional security to the network's availability model. N+1 does add complexity, but I prefer to think of this in terms of, "Well, I can manage software or design complexity in my configurations, but I can't manage the loss of a single device which holds my network together." Now I must view this assertion in the context of better-designed hardware and cheap spares-on-hand.
Of course, like many other folks, I have tried to drink as deeply as I can from the well of knowledge. I've perused at length Cisco Press' High Availability Network Fundamentals, and I understand MTBF calculations and some of the design issues in building a highly available network. But from a cost perspective, it seems that a single, larger box may be able to offer me as much redundancy as two equally configured boxes handling the same traffic load. Of course, there's that little demon on my shoulder that tells me I could always lose a complete device to a power issue or a short, and then I'd be up a creek.
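Tim's MTBF question can be made concrete with the standard steady-state availability formula, A = MTBF / (MTBF + MTTR). The following is a hedged sketch in Python; the MTBF and MTTR figures are illustrative assumptions, not vendor data, and real chassis failure modes are messier than a single number.

```python
# Sketch: steady-state availability of one large redundant chassis
# vs. two smaller routers in parallel. All numbers are assumptions.

def availability(mtbf_hours: float, mttr_hours: float) -> float:
    """Steady-state availability = MTBF / (MTBF + MTTR)."""
    return mtbf_hours / (mtbf_hours + mttr_hours)

# Assumption: a single chassis with redundant internals and a high MTBF.
single_box = availability(mtbf_hours=200_000, mttr_hours=4)

# Assumption: two independent, cheaper boxes; either can carry the load,
# so the pair is down only when both are down at the same time.
one_box = availability(mtbf_hours=100_000, mttr_hours=4)
parallel_pair = 1 - (1 - one_box) ** 2

print(f"single chassis: {single_box:.8f}")
print(f"N+1 pair:       {parallel_pair:.8f}")
```

Under these assumptions the parallel pair wins by several nines, which is the usual argument for two boxes even when each is individually less reliable than the big one.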
We have a history of adopting the N+1 model on the specific network I'm talking about, and it has worked very well so far in the face of occasional software failures by a vendor we have occasionally ridiculed here on nanog-l. However, in considering a comprehensive redesign, another vendor offers significantly more software stability, so I'm re-evaluating the need for multiple devices.
My mind's more or less already made up, but I'd like to hear the design philosophies of other members of the operational community when adopting an N+1 approach. In particular, I'd love to hear about a catastrophic operational failure which either proves or disproves either of the potential options.
Tim
ObDisclaimer: Please contact me off-list if you're okay with your thoughts on this matter being published in a book targeted to the operations community.
[stuff snipped]
but the overall system reliability is much higher than a reliable network since a component failure does not equal a functional failure.
s/reliability/availability/

You meant reliability when comparing a one- vs. two-engine airplane, but a network (from a customer point of view) isn't defined by reliability; it's defined by availability. If you are using your backup (N+1) router(s) for extra capacity, then you don't fail back to full capacity, but you do have limited availability.

Availability/performance of the overall system (network) is what we all engineer for. Customers don't care about reliability as long as the first two items are not impugned. (For example, they don't care if you have to replace their physical dialup port every hour on the hour, provided that they can get on and off in between service intervals -- not a very reliable port, but a highly available network from the customer perspective.)

Maybe I am just picking on semantics.

Deepak
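Deepak's reliability-vs-availability distinction can be put in numbers: a component that fails constantly but recovers fast can deliver better availability than one that fails rarely but catastrophically. A hedged sketch in Python; the failure and repair times below are illustrative assumptions, not measurements.

```python
# Sketch: an "unreliable" component can still yield a highly available
# service if repair/failover is fast. Numbers are assumptions.

def availability(mtbf_seconds: float, mttr_seconds: float) -> float:
    """Steady-state availability = MTBF / (MTBF + MTTR)."""
    return mtbf_seconds / (mtbf_seconds + mttr_seconds)

# Assumption: a dialup port that fails every hour but fails over in ~1 second.
unreliable_port = availability(mtbf_seconds=3600.0, mttr_seconds=1.0)

# Assumption: a chassis that fails once a year but takes 8 hours to repair.
reliable_chassis = availability(mtbf_seconds=365 * 24 * 3600.0,
                                mttr_seconds=8 * 3600.0)

print(f"hourly-failing port:  {unreliable_port:.6f}")
print(f"once-a-year chassis:  {reliable_chassis:.6f}")
```

The frequently failing port comes out more available than the rarely failing chassis, which is exactly the customer's-eye view Deepak describes.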
Point taken; availability would have been a better term to use. From a customer's standpoint, limited availability of bits is still better than no bits flowing, and in an ideal world your published capacity would be N rather than N+1. Appreciate the thoughtful comments.

Regards - Scott

Scott C. McGrath

On Sat, 17 Jan 2004, Deepak Jain wrote:
[stuff snipped]
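Scott's point that published capacity should be N rather than N+1 has simple arithmetic behind it: if the +1 element's capacity has already been sold, the survivors cannot absorb a failure. A hedged sketch with illustrative utilization figures (the three-router topology and load levels are assumptions for the example).

```python
# Sketch: load on each surviving element after one of n_plus_1
# equal elements fails, assuming traffic rebalances evenly.

def surviving_utilization(n_plus_1: int, per_box_load: float) -> float:
    """Per-box utilization after one failure, given pre-failure load."""
    total_load = n_plus_1 * per_box_load
    return total_load / (n_plus_1 - 1)

# Assumption: 3 routers (N=2 plus one of headroom), each at 50% load.
# After a failure the survivors run at 75% -- degraded but functional.
print(surviving_utilization(3, 0.50))

# Same 3 routers quietly sold to 70% each: survivors would need 105%,
# i.e. congestion -- the "+1" headroom no longer exists.
print(surviving_utilization(3, 0.70))
```

This is why selling into the redundant capacity silently converts an N+1 design back into an N design.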
This is why commercial airliners have multiple engines: even though the system is less reliable overall than a well designed single engine craft, the failure of a single component does not entail the catastrophic failure of the entire system. (There are exceptions to this, but the overall concept does work.)
Last year, a Boeing in flight over the middle of the Pacific Ocean had its entire glass cockpit system go dark. After frantic conversation with the air traffic controllers, a decision was made to toggle the circuit breakers for the TRIPLE-REDUNDANT computer system onboard, which brought back the displays. Even with a 2+1 setup, things can still go wrong...
Eric Kuhnke wrote:
[stuff snipped]
Most, if not all, redundant systems have a single instance of the synchronization protocol. One significant vendor of packet forwarding gear was known for hanging the secondary RP almost every time the primary failed. The hang was usually associated with the synchronization chatter to the failed card failing :-)

Pete
participants (5)

- Brent_OKeeffe@asc.aon.com
- Deepak Jain
- Eric Kuhnke
- Petri Helenius
- Scott McGrath