RE: Network Reliability Engineering
While it is possible to get the FIT numbers for hardware and calculate network availability, our experience has been that modelling hardware reliability is not particularly useful, because hardware and fiber transmission systems are usually the least significant factor in overall network availability. Hardware failures are also easy to design around with redundant hardware, more boxes, or diverse fiber routes. (A worked example of the FIT arithmetic appears after the quoted message below.)

Network software issues and operational mistakes seem to affect network availability more than hardware does. An example would be a bug in a routing protocol that causes an erroneous update to propagate through the network; or, in the operational category, a typo that causes unintended results. In both cases the failure is not limited to one box: its effects often propagate throughout the entire network.

How do you objectively calculate network availability when the network is highly dependent on the correct functioning of a binary blob of proprietary code, but your only visibility inside the blob is a release note listing the symptoms experienced by others who have run the code in a similar, but probably not identical, network configuration? It seems unlikely that vendors will disclose more about their proprietary blob of binary, since they want to protect their I.P. assets. That leaves the network operator without much with which to assess code reliability.

Perhaps we need to change the business model around network code licensing to ensure vendors comprehend the impact of a bad release, and share the pain when they ship a buggy blob that has customer impact on the network. Rather than a one-time fee to license the code when you buy the box, a small recurring monthly license fee, with no payment in any month that a software bug crashes your network, would act as a continuous form of positive reinforcement for your box vendor to keep high-availability code in your network. The box vendor would then have a recurring software revenue stream that is only as stable and reliable as their software. (A back-of-the-envelope sketch of that incentive also follows the quoted message.)

-R

-----Original Message-----
From: Pete Kruckenberg
To: nanog@merit.edu
Sent: 5/18/2002 7:13 PM
Subject: Network Reliability Engineering

I'm looking for some good reference materials for doing "reliability engineering" calculations and projections. This is to justify increased redundancy, and I want to include quantifiable numbers based on MTBF data and other reliability factors: a scientific justification instead of the typical emotional appeal built on analyst/vendor FUD.

I'd appreciate references on how to do this in a network environment (what data to collect, how to collect it, how to analyze it, etc.). Also useful would be any data, or rules of thumb, on typical MTBFs for network events that I won't find on vendor product slicks (like the MTBF of IOS, or of human-caused service outages of various types).

If someone has put together something remotely like this that they'd care to share, that would be incredibly helpful.

Thanks.
Pete.
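[A minimal sketch, in Python, of the FIT-to-availability arithmetic both messages refer to. All FIT and MTTR figures are invented for illustration, and the topology (two routers plus a fiber span in series, with 1+1 diverse paths treated as independent) is an assumption, not anything stated in either message.]

    # Back-of-the-envelope hardware availability from FIT numbers.
    # FIT = failures per 10^9 device-hours; every figure below is
    # invented for illustration, not vendor data.

    def fit_to_mtbf_hours(fit):
        """Convert a FIT rate to mean time between failures, in hours."""
        return 1e9 / fit

    def availability(fit, mttr_hours):
        """Steady-state availability = MTBF / (MTBF + MTTR)."""
        mtbf = fit_to_mtbf_hours(fit)
        return mtbf / (mtbf + mttr_hours)

    # A hypothetical path: two routers in series at 10,000 FIT each
    # (4-hour repair) over a fiber span at 2,000 FIT (8-hour repair).
    router = availability(10_000, 4)
    fiber  = availability(2_000, 8)

    serial_path = router * router * fiber      # every element must work
    redundant   = 1 - (1 - serial_path) ** 2   # 1+1 diverse, independent paths

    print(f"single path : {serial_path:.8f}")
    print(f"1+1 diverse : {redundant:.8f}")

Even with these made-up numbers, adding a diverse second path pushes hardware availability well past five nines, which is the point above about hardware being the easy part to design around.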
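[And a rough sketch of the pay-per-good-month licensing idea, under the simplifying assumption that a network-crashing bug hits any given month with a fixed independent probability. The fee, box count, and probabilities are all hypothetical.]

    # The proposed model: the vendor collects the monthly license fee
    # only in months with no network-crashing software bug.

    MONTHLY_FEE = 1_000.0   # per-box monthly fee, invented
    BOXES       = 500       # boxes under license, invented

    def expected_annual_revenue(p_bad_month):
        """Expected 12-month software revenue: each clean month pays
        in full, each month with a crashing bug pays nothing."""
        return MONTHLY_FEE * BOXES * 12 * (1 - p_bad_month)

    for p in (0.00, 0.05, 0.20):
        print(f"bug-month probability {p:.2f}: "
              f"expected revenue ${expected_annual_revenue(p):,.0f}")

Under this assumption the vendor's expected revenue falls in direct proportion to how often they ship a crashing bug: the "revenue stream only as stable as their software" argument in numeric form.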