RE: Network Reliability Engineering
While it is possible to get the FIT numbers for hardware and calculate network availability, our experience has been that modelling hardware reliability is not particularly useful, because hardware and fiber transmission systems are usually the least significant factor in overall network availability. Hardware failures are also easy to design around with redundant hardware, more boxes, or diverse fiber routes. (A worked example of the FIT arithmetic appears after the quoted message below.)

Network software issues and operational mistakes seem to affect network availability more than hardware does. An example would be a bug in a routing protocol that causes an erroneous update to propagate through the network; or, in the operational category, a typo that causes unintended results. In both cases the failure is not limited to one box: its effects often propagate throughout the entire network.

How do you objectively calculate network availability when the network is highly dependent on the correct functioning of a binary blob of proprietary code, but your only visibility inside the blob is a release note listing the symptoms experienced by others who have run the code in a similar, but probably not identical, network configuration? It seems unlikely that vendors will disclose more about their proprietary blob of binary, since they want to protect their I.P. assets. That leaves the network operator without much with which to assess code reliability.

Perhaps we need to change the business model around network code licensing to ensure vendors comprehend the impact of a bad release, and share the pain when they ship a buggy blob that has customer impact on the network. Rather than a one-time fee to license the code when you buy the box, a small recurring monthly license fee, with no payment in any month that a software bug crashes your network, would act as a continuous form of positive reinforcement for your box vendor to keep high-availability code in your network. The box vendor would then have a recurring software revenue stream that is only as stable and reliable as their software. (A back-of-the-envelope sketch of that incentive also follows the quoted message.)

-R

-----Original Message-----
From: Pete Kruckenberg
To: nanog@merit.edu
Sent: 5/18/2002 7:13 PM
Subject: Network Reliability Engineering

I'm looking for some good reference materials for doing "reliability engineering" calculations and projections. This is to justify increased redundancy, and I want to include quantifiable numbers based on MTBF data and other reliability factors: a scientific justification instead of the typical emotional appeal built on analyst/vendor FUD.

I'd appreciate references on how to do this in a network environment (what data to collect, how to collect it, how to analyze it, etc.). Also useful would be any data, or rules of thumb, on typical MTBFs for network events that I won't find on vendor product slicks (like the MTBF of IOS, or of human-caused service outages of various types).

If someone has put together something remotely like this that they'd care to share, that would be incredibly helpful.

Thanks.
Pete.
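[A minimal sketch, in Python, of the FIT-to-availability arithmetic both messages refer to. All FIT and MTTR figures are invented for illustration, and the topology (two routers plus a fiber span in series, with 1+1 diverse paths treated as independent) is an assumption, not anything stated in either message.]

    # Back-of-the-envelope hardware availability from FIT numbers.
    # FIT = failures per 10^9 device-hours; every figure below is
    # invented for illustration, not vendor data.

    def fit_to_mtbf_hours(fit):
        """Convert a FIT rate to mean time between failures, in hours."""
        return 1e9 / fit

    def availability(fit, mttr_hours):
        """Steady-state availability = MTBF / (MTBF + MTTR)."""
        mtbf = fit_to_mtbf_hours(fit)
        return mtbf / (mtbf + mttr_hours)

    # A hypothetical path: two routers in series at 10,000 FIT each
    # (4-hour repair) over a fiber span at 2,000 FIT (8-hour repair).
    router = availability(10_000, 4)
    fiber  = availability(2_000, 8)

    serial_path = router * router * fiber      # every element must work
    redundant   = 1 - (1 - serial_path) ** 2   # 1+1 diverse, independent paths

    print(f"single path : {serial_path:.8f}")
    print(f"1+1 diverse : {redundant:.8f}")

Even with these made-up numbers, adding a diverse second path pushes hardware availability well past five nines, which is the point above about hardware being the easy part to design around.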
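[And a rough sketch of the pay-per-good-month licensing idea, under the simplifying assumption that a network-crashing bug hits any given month with a fixed independent probability. The fee, box count, and probabilities are all hypothetical.]

    # The proposed model: the vendor collects the monthly license fee
    # only in months with no network-crashing software bug.

    MONTHLY_FEE = 1_000.0   # per-box monthly fee, invented
    BOXES       = 500       # boxes under license, invented

    def expected_annual_revenue(p_bad_month):
        """Expected 12-month software revenue: each clean month pays
        in full, each month with a crashing bug pays nothing."""
        return MONTHLY_FEE * BOXES * 12 * (1 - p_bad_month)

    for p in (0.00, 0.05, 0.20):
        print(f"bug-month probability {p:.2f}: "
              f"expected revenue ${expected_annual_revenue(p):,.0f}")

Under this assumption the vendor's expected revenue falls in direct proportion to how often they ship a crashing bug: the "revenue stream only as stable as their software" argument in numeric form.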