Followup British Telecom outage reason
BT is telling ISPs the reason for the multi-hour outage was a software bug in the interface cards used in BT's core network. BT installed a new version of the software. When that didn't fix the problem, they fell back to a previous version of the software. BT didn't identify the vendor, but BT is identified as a "Cisco Powered Network(tm)." Non-BT folks believe the problem was with GSR interface cards. I can't independently confirm it.
BT is telling ISPs the reason for the multi-hour outage was a software bug in the interface cards used in BT's core network. BT installed a new version of the software. When that didn't fix the problem, they fell back to a previous version of the software.
BT didn't identify the vendor, but BT is identified as a "Cisco Powered Network(tm)." Non-BT folks believe the problem was with GSR interface cards. I can't independently confirm it.
I'd be surprised if it was the GSR, and in any case that doesn't absolve anyone. If it was a software issue, why wasn't the software properly tested? Why was such a critical upgrade rolled out across the entire network at the same time? It doesn't add up. Neil.
On Sat, 24 Nov 2001, Neil J. McRae wrote:
I'd be surprised if it was the GSR, and in any case that doesn't absolve anyone. If it was a software issue, why wasn't the software properly tested? Why was such a critical upgrade rolled out across the entire network at the same time? It doesn't add up.
It appears to be yet another CEF bug. If you want to use a GSR you are stuck using some version of IOS with a CEF bug. The question is which bug do you want. Each version of IOS has a slightly different set. Several US network providers have been bitten by CEF bugs too.

While trying to fix one set of bugs, BT upgraded their network. I'm not sure if they were upgrading at 9am, or had upgraded earlier and the bug finally came out under load at 9am. When the BT network melted down, Cisco suggested installing a different version of IOS, which had previously been tested. At noon, BT found the new version had an even worse bug, sending packets out the wrong interface. It wasn't until 2200 (13 hours later) that BT and Cisco found a version of IOS which stabilized the network. "Stabilized," not fixed. The running version of IOS still has a bug, but it isn't as severe.
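For readers who haven't run into CEF: Cisco Express Forwarding switches packets by consulting a precomputed forwarding table (the FIB) instead of process-switching each one, so a stale or corrupted FIB entry can silently push traffic out the wrong interface, which is exactly the symptom described above. Here is a minimal Python sketch of that style of longest-prefix lookup, with made-up prefixes and interface names; it illustrates the idea only and is not Cisco's implementation.

    import ipaddress

    # Toy FIB: prefix -> outgoing interface (prefixes and names invented for illustration)
    fib = {
        ipaddress.ip_network("10.0.0.0/8"): "POS1/0",
        ipaddress.ip_network("10.1.0.0/16"): "POS2/0",
    }

    def lookup(dst):
        """Longest-prefix match over the toy FIB."""
        addr = ipaddress.ip_address(dst)
        matches = [net for net in fib if addr in net]
        best = max(matches, key=lambda net: net.prefixlen)
        return fib[best]

    print(lookup("10.1.2.3"))  # "POS2/0" -- corrupt that entry and traffic is silently misdirected

If the /16 entry pointed at the wrong interface, every packet for 10.1.0.0/16 would be misdelivered while the routing protocols themselves still looked perfectly healthy.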
On Sat, Nov 24, 2001 at 02:16:38PM -0500, Sean Donelan wrote:
On Sat, 24 Nov 2001, Neil J. McRae wrote:
I'd be surprised if it was the GSR, and in any case that doesn't absolve anyone. If it was a software issue, why wasn't the software properly tested? Why was such a critical upgrade rolled out across the entire network at the same time? It doesn't add up.
Even after full lab testing as well as limited "real-life" deployment, you can still never see all the possible cases that might come back to haunt you later. Sometimes you do an across-the-board upgrade for security as well as for specific feature/bugset reasons, to move the set of bugs into the "we know what they are and how to deal with them" category. No vendor claims to have perfect software. Nor will you find anyone but an irresponsible vendor suggesting that any specific image is "perfect".
It appears to be yet another CEF bug. If you want to use a GSR you are stuck using some version of IOS with a CEF bug. The question is which bug do you want. Each version of IOS has a slightly different set. Several US network providers have been bitten by CEF bugs too.
True, but most of those are in the past. I'm not familiar with the specifics of the bugs that BT encountered, but something worth noting is the ability of a Cisco router to keep functioning in a "broken" state while you try to get a "fixed" image onto it. It would be nice if there were easier ways to do that in some cases, but you can't have a perfect environment, especially since when you do software upgrades you don't always have on-site hands standing by to help you swap flash cards or deal with whatever logistical issues you may encounter.
While trying to fix one set of bugs, BT upgraded their network. I'm not sure if they were upgrading at 9am, or had upgraded earlier and the bug finally came out under load at 9am. When the BT network melted down, Cisco suggested installing a different version of IOS, which had previously been tested. At noon, BT found the new version had an even worse bug, sending packets out the wrong interface. It wasn't until 2200 (13 hours later) that BT and Cisco found a version of IOS which stabilized the network. "Stabilized," not fixed. The running version of IOS still has a bug, but it isn't as severe.
I'm sure that by now, after such a public outage, BT and Cisco have had some conversations about what can be done to improve the testing Cisco does to better simulate their network. -- Jared Mauch | pgp key available via finger from jared@puck.nether.net clue++; | http://puck.nether.net/~jared/ My statements are only mine.
you are right. But in an era of focusing on the box, vendors are forgetting that solid software and knowledgeable support are just as important. Possibly slow down a bit on rolling all those new features and widgets into the software.... Make the software do what it should, reliably.. then put the new stuff in there. ie.. bug scrub a train, per chassis. Make it solid.. then put the toyz in. These days you don't see boxes hitting the 1 year mark that often.. It is usually interrupted somewhere in the 20 week range with something beautiful like:

SBN uptime is 2 weeks, 4 days, 6 hours, 12 minutes
System returned to ROM by processor memory parity error at PC 0x607356A0, address 0x0 at 21:02:01 UTC Tue Nov 6 2001

or

BMG uptime is 34 weeks, 3 hours, 44 minutes
System returned to ROM by error - a Software forced crash, PC 0x6047F3E8 at 18:28:17 est Sat Mar 31 2001

or

LVX uptime is 24 weeks, 1 day, 20 hours, 21 minutes
System returned to ROM by abort at PC 0x60527DD4 at 00:38:36 EST Fri Jun 8 2001

At least it's not 0xDEADBEEF.. yet.
On Sat, Nov 24, 2001 at 02:16:38PM -0500, Sean Donelan wrote:
On Sat, 24 Nov 2001, Neil J. McRae wrote:
<snip>
No vendor claims to have perfect software. Nor will you find anyone but an irresponsible vendor suggesting that any specific image is "perfect".
<snip>
I'm sure that by now, after such a public outage, BT and Cisco have had some conversations about what can be done to improve the testing Cisco does to better simulate their network.
-- Jared Mauch | pgp key available via finger from jared@puck.nether.net clue++; | http://puck.nether.net/~jared/ My statements are only mine.
you are right. But in an era of focusing on the box, vendors are forgetting that solid software and knowledgeable support are just as important.
Possibly slow down a bit on rolling all those new features and widgets into the software.... Make the software do what it should, reliably.. then put the new stuff in there.
ie.. bug scrub a train, per chassis. Make it solid.. then put the toyz in.
easier said than done when everybody wants every fancy new feature 110% solid and yesterday.
easier said than done when everybody wants every fancy new feature 110% solid and yesterday.
Not everybody. What I want more than the fancy new feature is: honest schedules and honest self-appraisals. A vendor who promises me what they think I want to hear or maybe even what I really do wish I could hear is not as valuable to me as a vendor who tells me the bold bald truth no matter how much it hurts my proposed rollout schedule or how much it might help one of their competitors who can deliver $FANCY_NEW_FEATURE earlier. -- Paul Vixie <vixie@eng.paix.net> President, PAIX.Net Inc. (NASD:MFNX)
easier said than done when everybody wants every fancy new feature 110% solid and yesterday.
Not everybody. What I want more than the fancy new feature is: honest schedules and honest self-appraisals. A vendor who promises me what they think I want to hear or maybe even what I really do wish I could hear is not as valuable to me as a vendor who tells me the bold bald truth no matter how much it hurts my proposed rollout schedule or how much it might help one of their competitors who can deliver $FANCY_NEW_FEATURE earlier.
It took a long long time for one router vendor in particular to pay any attention to a number of high-spending customers who said "stop implementing the {3,4}-letter acronym du-jour protocols, or at least stop trying to integrate them into s/w we want to run; just fix the bugs in the s/w train we all run". The lesson seemed to have been learnt for a little while, but then they spent the next year trying (unsuccessfully) to abandon that release train.

I can only assume what drives them is either (a) the desire to support slideware protocols early, and in code people actually try and use, (b) the knowledge that such slideware protocols, in aggregate across a large network, eat more router horsepower, and thus sales-$$$, for those gullible enough to implement them, or (c) some combination of the two.

I guess sometime someone will realize routers are both hardware and software, and shock horror both, if done well, can actually add value. [hint & example: compare the scheduler on, say, Linux/FreeBSD, Windows 95 (sic), and your favourite router OS (*); pay particular attention to suitability for running realtime, or near realtime, tasks, where such tasks may occasionally crash or overrun their expected timeslice; note how the best OS amongst the bunch for this ain't exactly great].

(*) results may vary according to personal choice here.

-- Alex Bligh Personal Capacity
I guess sometime someone will realize routers are both hardware and software, and shock horror both, if done well, can actually add value. [hint & example: compare the scheduler on, say, Linux/FreeBSD, Windows 95 (sic), and your favourite router OS (*); pay particular attention to suitability for running realtime, or near realtime, tasks, where such tasks may occasionally crash or overrun their expected timeslice; note how the best OS amongst the bunch for this ain't exactly great].
(*) results may vary according to personal choice here.
Don't use a non-realtime OS for something from which you expect realtime or near-realtime OS functionality. There are specific systems to address these kinds of needs, with rather complicated scheduling mechanisms to accommodate such requirements in a sensible manner. Is IOS a realtime operating system? No. Are any of the other listed OSes realtime operating systems? No. (The (*) doesn't count). Do I wish some of these clowns would use a sophisticated realtime OS? Yes. Will it solve world hunger? Decidedly not. Cheers, Chris
On Wed, 28 Nov 2001, Christian Kuhtz wrote:
I guess some time someone will realize routers are both hardware, and software, and shock horror both, if done well, can actually add value. [hint & example: compare the scheduler on, say, Linux/FreeBSD, Windows 95 (sic), and your favourite router OS (*); pay particular attention to suitability for running realtime, or near realtime tasks, where such tasks may occasionally crash or overrun their expected timeslice; note how the best OS amongst the bunch for this aint exactly great].
(*) results may vary according to personal choice here.
Don't use a non-realtime OS for something from which you expect realtime or near-realtime OS functionality. There are specific systems to address these kinds of needs, with rather complicated scheduling mechanisms to accommodate such requirements in a sensible manner.
Is IOS a realtime operating system? No. Are any of the other listed OSes realtime operating systems? No.
Actually there are multiple Linux-based RTOSes.
--On Wednesday, 28 November, 2001 11:39 PM -0500 Christian Kuhtz <christian@kuhtz.com> wrote:
Don't use a non-realtime OS for something from which you expect realtime or near-realtime OS functionality. There are specific systems to address these kinds of needs, with rather complicated scheduling mechanisms to accommodate such requirements in a sensible manner.
Is IOS a realtime operating system? No. Are any of the other listed OSes realtime operating systems? No. (The (*) doesn't count). Do I wish some of these clowns would use a sophisticated realtime OS? Yes. Will it solve world hunger? Decidedly not.
One of us has the wrong end of the stick. My point was that the requirement for some of the basic functionality of preemptive multitasking, and (near) real time performance, is rather more necessary in a router, where one tends to have a requirement to run multiple instances of multiple protocols in a time dependent manner, schedule queues etc. etc., than in a desktop or server operating system. However, this doesn't correspond to how one or two router OS's are actually designed. Alex Bligh Personal Capacity
I can only assume what drives them is either (a) the desire to support slideware protocols early, and in code people actually try and use, (b) the knowledge that such slideware protocols, in aggregate across a large network, eat more router horsepower, and thus sales-$$$, for those gullible enough to implement them, or (c) some combination of the two.
I have not seen anyone note that all major vendors are usually focused on revenue from new sales and that maintenance income is seen as a bonus. While new sales are driven by being able to compete with the next vendor, new features will always be prioritised over stability. Peter
I have not seen anyone note that all major vendors are usually focused on revenue from new sales and that maintenance income is seen as a bonus. While new sales are driven by being able to compete with the next vendor, new features will always be prioritised over stability.
Actually I think in recent history the opposite is true. Anyone remember the 11.2 and 11.3 [and non 11.1CC] days? Cisco had huge issues with IOS stability, and it was around about this time that Juniper and a few less successful others appeared. Cisco changed something and things got significantly better - I'd look at that as revenue loss prevention :-). Regards, Neil.
This used to be the cc train, then later the S train. However, the S train has never been as stable as cc, and it has become increasingly less stable over time, with too many new features rolling in. I'd be curious as to exactly which CEF bugs bit them. The introduction of greater MPLS functionality seems to have given CEF a nasty bit of destabilization. - Daniel Golding
-----Original Message----- From: owner-nanog@merit.edu [mailto:owner-nanog@merit.edu]On Behalf Of Christian Kuhtz Sent: Sunday, November 25, 2001 12:57 AM To: S; nanog@merit.edu Subject: RE: Followup British Telecom outage reason
you are right. But in an era of focusing on the box, vendors are forgetting that solid software and knowledgeable support are just as important.
Possibly slow down a bit on rolling all those new features and widgets into the software.... Make the software do what it should, reliably.. then put the new stuff in there.
ie.. bug scrub a train, per chassis. Make it solid.. then put the toyz in.
easier said than done when everybody wants every fancy new feature 110% solid and yesterday.
[I may digress a bit.. apologies]

I think there is a second order effect on revenue though. The more reliable/stable the gear is, the less expensive the support contract you may be inclined to get for it. Would you really get 4hr replacement if you can't think of the last time a particular series of box failed? If you really need 4hr replacement [this gets mentioned very often] it's cheaper to keep spares of your own, so then wouldn't next business day suffice?

Although we keep spares for our low end switch blades and switches, they really aren't worth keeping on Smartnet because 1) the OS virtually never changes, and 2) they don't fail outside of warranty. Or at least, not in my experience. I think, recognizing this, Cisco for example has given smaller switches lifetime warranties with lifetime OS upgrades. If routers worked [/could work] this way, no one with a good technical team in house would buy support contracts.

[Think of the pharmaceutical industry] It's much more profitable to "treat" an illness than to cure it. If, as an entity, expensive gear is expensive to maintain, there is more money to be made. Mainframes were the same way until PCs and workstations continuously ate at their market share and the costs of problem solving. Today, even though mainframes have gotten oodles cheaper to buy and run, if you really want a mainframe, you have to have a great reason.

Deepak Jain
AiNET

-----Original Message----- From: owner-nanog@merit.edu [mailto:owner-nanog@merit.edu]On Behalf Of Daniel Golding Sent: Monday, November 26, 2001 11:51 AM To: Christian Kuhtz; S; nanog@merit.edu Subject: RE: Followup British Telecom outage reason

This used to be the cc train, then later the S train. However, the S train has never been as stable as cc, and it has become increasingly less stable over time, with too many new features rolling in. I'd be curious as to exactly which CEF bugs bit them. The introduction of greater MPLS functionality seems to have given CEF a nasty bit of destabilization.

- Daniel Golding
-----Original Message----- From: owner-nanog@merit.edu [mailto:owner-nanog@merit.edu]On Behalf Of Christian Kuhtz Sent: Sunday, November 25, 2001 12:57 AM To: S; nanog@merit.edu Subject: RE: Followup British Telecom outage reason
you are right. But in an era of focusing on the box, vendors are forgetting that solid software and knowledgeable support are just as important.
Possibly slow down a bit on rolling all those new features and widgets into the software.... Make the software do what it should, reliably.. then put the new stuff in there.
ie.. bug scrub a train, per chassis. Make it solid.. then put the toyz in.
easier said than done when everybody wants every fancy new feature 110% solid and yesterday.
On Mon, Nov 26, 2001 at 11:51:08AM -0500, Daniel Golding wrote:
This used to be the cc train, then later the S train. However, the S train has never been as stable as cc, and it has become increasingly less stable over time, with too many new features rolling in.
I'd be curious as to exactly which CEF bugs bit them. The introduction of greater MPLS functionality seems to have given CEF a nasty bit of destabilization.
Tell me about it. 12000s disabling CEF on LineCards due to various bugs has probably been the biggest problem I've seen over the past year or more. Especially since our IGP (ISIS) doesn't use IP as transport, and thus keeps its adjacencies, and blackholes all traffic received on such a LineCard until a human acts :-(

But it seems that Cisco has an undocumented, and little-known, command that will disable ISIS on interfaces residing on LineCards with CEF disabled, and trigger a reroute around the affected LineCard:

router isis
 external overload signalling

Note: We've configured it recently, and haven't been in a situation where we've seen it in action yet, but Cisco claim that at least one significant carrier has been using it for months.

/Jesper -- Jesper Skriver, jesper(at)skriver(dot)dk - CCIE #5456 Work: Network manager @ AS3292 (Tele Danmark DataNetworks) Private: FreeBSD committer @ AS2109 (A much smaller network ;-) One Unix to rule them all, One Resolver to find them, One IP to bring them all and in the zone to bind them.
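A toy Python model may make the behaviour Jesper describes concrete; this is only a sketch of the logic, not how IOS implements the feature, and the slot names are invented. Without the knob, a LineCard that has lost CEF keeps its IS-IS adjacencies (as noted, IS-IS doesn't use IP as transport) and keeps attracting traffic it can no longer forward; with the knob, IS-IS drops on that card and neighbours reroute around it.

    from dataclasses import dataclass

    @dataclass
    class Linecard:
        name: str
        cef_enabled: bool
        isis_adjacency_up: bool = True

    def attracting_cards(cards, overload_signalling):
        """Names of cards whose IS-IS adjacencies stay up and so keep drawing traffic."""
        result = []
        for card in cards:
            if overload_signalling and not card.cef_enabled:
                # The feature tears down IS-IS on the broken card, so neighbours route around it.
                card.isis_adjacency_up = False
            if card.isis_adjacency_up:
                result.append(card.name)
        return result

    cards = [Linecard("slot1", cef_enabled=False), Linecard("slot2", cef_enabled=True)]
    print(attracting_cards(cards, overload_signalling=False))  # ['slot1', 'slot2'] -> slot1 blackholes
    cards = [Linecard("slot1", cef_enabled=False), Linecard("slot2", cef_enabled=True)]
    print(attracting_cards(cards, overload_signalling=True))   # ['slot2'] -> traffic reroutes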
Possibly slow down a bit on rolling all those new features and widgets into the software.... Make the software do what it should, reliably.. then put the new stuff in there.
Yeah but... these activities support existing customers with existing products, and they enable no revenue (the support fees are already in the bag). To dedicate that much engineering energy -- probably more than 50% of the corporate total if "doing it right" is the goal -- would put new customer revenue and new product revenue at risk. This icky tradeoff is why new (as in pre-IPO in some cases) vendors can still get a fair test in existing networks. Eng&Ops type people have told me more than once that they thought $NEW_ROUTER_VENDOR could be a good investment simply because nearly 100% of their engineering resources would be dedicated to making their small number of customers happy, and being a larger customer amongst a small set increased this advantage even more. The big challenge at an established router company is in management of the competing priorities more than in management of, or doing of, engineering. -- Paul Vixie <vixie@eng.paix.net> President, PAIX.Net Inc. (NASD:MFNX)
This icky tradeoff is why new (as in pre-IPO in some cases) vendors can still get a fair test in existing networks. Eng&Ops type people have told me more than once that they thought $NEW_ROUTER_VENDOR could be a good investment simply because nearly 100% of their engineering resources would be dedicated to making their small number of customers happy, and being a larger customer amongst a small set increased this advantage even more.
.. which is certainly true until small $NEW_ROUTER_VENDOR IPO'ed or otherwise grew into not-so-small $NEW_ROUTER_VENDOR, as we have all witnessed on numerous occasions. At which point, they're all the same again. You only gain an advantage for a limited amount of time. There are costs attributed to this as well which need to be realized.
The big challenge at an established router company is in management of the competing priorities more than in management of, or doing of, engineering.
And there also is a business reality. If you get almost everything right, most people are happy with that. Few people demand, need, and can afford to pay for perfection. In fact, one could argue that it is poor design to rely on anything to operate perfectly 100% of the time. Now, if lack of infrastructure reliability can harm human life you may feel differently, but that isn't the case for most of us at the present time.

We can sit here all day long and argue back and forth for strategies, from multiple-vendor networks to why single source offers advantages to why bla bla is cool. The bottom line is that there is no free lunch here. If you want perfection, you will pay for perfection, either in house, or to your vendor, or in lost revenue, or all of the above. And sometimes business cases cannot support perfection. The trade-off that has to be made here is how much "slack" you can get away with while still making your customers happy and at the same time supporting your business case. Anything else has no long term viability. From an engineering perspective this view certainly stinks, but when you take into account business realities, engineering's perspective may be an illusion. It's the old wisdom of "pick any two: cheap, fast, reliable".

Faults will happen. And nothing matters as much as how you prepare for when they do.

Cheers, Chris
On Mon, 26 Nov 2001 04:47:24 EST, Christian Kuhtz <christian@kuhtz.com> said:
.. which is certainly true until small $NEW_ROUTER_VENDOR IPO'ed or otherwise grew into not-so-small $NEW_ROUTER_VENDOR, as we have all witnessed on numerous occasions. At which point, they're all the same again. You only gain an advantage for a limited amount of time. There are costs attributed to this as well which need to be realized.
"The parting on the left.... is now the parting on the right... and the beards have all grown longer overnight" -- Pete Townsend, "Wont get fooled again" And yes, router design *is* politics, not engineering.
On Mon, 26 Nov 2001, Christian Kuhtz wrote:
Now, if lack of infrastructure reliability can harm human life you may feel differently, but that isn't the case for most of us at the present time.
I've designed software and networks used for public safety and emergencies. And yes, people have died on my watch. It is a somewhat different mindset, but not that different. A lot of "good engineering practice" applies to any engineering activity, including software engineering.

It's not even a matter of cost. A typical hospital spends less on their emergency power system than an Internet/telco hotel. The major difference is the hospital staff knows (more or less) what to do when the generators don't work. The big secret is most "life safety" systems fail regularly. Most of the time it doesn't matter because the "big one" doesn't coincide with the failure.
Faults will happen. And nothing matters as much as how you prepare for when they do.
Mean Time To Repair is a bigger contributor to Availability calculations than the Mean Time To Failure. It would be great if things never failed. But some people are making their systems so complicated chasing the Holy Grail of 100% uptime, they can't figure out what happened when it does fail. Murphy's revenge: The more reliable you make a system, the longer it will take you to figure out what's wrong when it breaks.
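Sean's point about repair time dominating the availability figure is easy to see with the standard steady-state formula, availability = MTBF / (MTBF + MTTR). A small Python sketch with made-up numbers (one failure a year on average; the 13-hour case echoes the outage length described earlier in the thread):

    def availability(mtbf_hours, mttr_hours):
        # Steady-state availability: uptime / (uptime + downtime)
        return mtbf_hours / (mtbf_hours + mttr_hours)

    mtbf = 8760.0  # assume one failure a year, purely illustrative
    for mttr in (0.5, 4.0, 13.0):
        print(f"MTTR {mttr:5.1f} h -> availability {availability(mtbf, mttr):.4%}")

With the failure rate held constant, the difference between roughly four nines and barely three nines in this toy example comes entirely from how quickly service is restored.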
Wandering off the subject of BT's misfortune ... Sean Donelan wrote:
On Mon, 26 Nov 2001, Christian Kuhtz wrote:
[...]
Faults will happen. And nothing matters as much as how you prepare for when they do.
Mean Time To Repair is a bigger contributor to Availability calculations than the Mean Time To Failure. It would be great if things never failed.
And Mean Time To Fault Detected (Accurately) is usually the biggest sub-contributor within Repair but that's kinda your point.
But some people are making their systems so complicated chasing the Holy Grail of 100% uptime, they can't figure out what happened when it does fail.
Similar people pursue the creation of a perpetuum mobile. A strange and somewhat congruent example I stumbled into recently is: http://www.sce.carleton.ca/netmanage/perpetum.shtml. Overall simplicity of the system, including failure detection mechanisms, and real redundancy are the most reliable tools for availability. Of course, popping just a few layers out, profit and politics are elements of most systems.
Murphy's revenge: The more reliable you make a system, the longer it will take you to figure out what's wrong when it breaks.
Hmm.
--On Monday, 26 November, 2001 6:28 AM -0500 Sean Donelan <sean@donelan.com> wrote:
It's not even a matter of cost. A typical hospital spends less on their emergency power system than an Internet/telco hotel. The major difference is the hospital staff knows (more or less) what to do when the generators don't work.
That's probably debatable; however, what is clear is that the standard deviation in Internet/telco hotel performance is far greater than that of hospitals, and it's hard to judge your (potential) vendor's capability of responding to an unplanned power situation in advance of one happening. Bizarrely, those who seem to get it wrong don't seem to learn from their mistakes. Alex Bligh Personal Capacity
My first thought in response to this is the vendor's support costs - wouldn't shipping more reliable images bring down those costs significantly? Or is it just that the extra revenue opportunities gained by adding $WHIZBANG_FEATURE_DU_JOUR outweigh those potential support savings? -C On Sun, Nov 25, 2001 at 07:30:07PM -0800, Paul Vixie wrote:
Yeah but... these activities support existing customers with existing products, and they enable no revenue (the support fees are already in the bag). To dedicate that much engineering energy -- probably more than 50% of the corporate total if "doing it right" is the goal -- would put new customer revenue and new product revenue at risk.
Paul Vixie <vixie@eng.paix.net> President, PAIX.Net Inc. (NASD:MFNX)
-- --------------------------- Christopher A. Woodfield rekoil@semihuman.com PGP Public Key: http://pgp.mit.edu:11371/pks/lookup?op=get&search=0xB887618B
--On 11/26/2001 09:22:01 AM -0500 Christopher A. Woodfield wrote:
My first thought in response to this is the vendor's support costs - wouldn't shipping more reliable images bring down those costs significantly? Or is it just that the extra revenue opportunities gained by adding $WHIZBANG_FEATURE_DU_JOUR outweigh those potential support savings?
-C
What's the upside to $ROUTER_VENDOR in reducing support cost? They already make money on the support but can't make too much, so a reduction in cost would probably imply a reduction in revenue. Also, given that network engineering rarely makes support cost a key issue in vendor selection and negotiation, reducing support costs looks like it has little payback to $ROUTER_VENDOR in terms of equipment sold. With that, $WHIZBANG_FEATURE_DU_JOUR sure looks like a good profit decision. To change this, stop buying gear from vendors that charge too much for support. just my jaded opinion, jerry
I'm referring to the _vendor's_ support costs - as in, you don't need as many people in the TAC if people don't keep running into IOS bugs; you don't need as large an RMA pool if the hardware is more reliable, etc. As the vendor would most likely decline to pass these savings along to the customer, I would see this as a profit opportunity for the vendor. -C On Mon, Nov 26, 2001 at 08:31:06AM -0800, jerry scharf wrote:
--On 11/26/2001 09:22:01 AM -0500 Christopher A. Woodfield wrote:
My first thought in response to this is the vendor's support costs - wouldn't shipping more reliable images bring down those costs significantly? Or is it just that the extra revenue opportunities gained by adding $WHIZBANG_FEATURE_DU_JOUR outweigh those potential support savings?
-C
What's the upside to $ROUTER_VENDOR in reducing support cost? They already make money on the support but can't make too much, so a reduction in cost would probably imply a reduction in revenue. Also, given that network engineering rarely makes support cost a key issue in vendor selection and negotiation, reducing support costs looks like it has little payback to $ROUTER_VENDOR in terms of equipment sold. With that, $WHIZBANG_FEATURE_DU_JOUR sure looks like a good profit decision.
To change this, stop buying gear from vendors that charge too much for support.
just my jaded opinion, jerry
-- --------------------------- Christopher A. Woodfield rekoil@semihuman.com PGP Public Key: http://pgp.mit.edu:11371/pks/lookup?op=get&search=0xB887618B
I'm referring to the _vendor's_ support costs - as in, you don't need as many people in the TAC if people don't keep running into IOS bugs; you don't need as large an RMA pool if the hardware is more reliable, etc.
What percentage of TAC personnel's time is spent dealing with calls that ultimately result in a BugID? NANOG isn't representative; mostly, TAC exists to take calls from idiots who bought a box that they don't know how to configure. Large network operators have a staff of people to handle that, so when they call TAC, the box is probably broken. I don't think that's the case with the majority of TAC cases, though. -- Brett
On Mon, Nov 26, 2001 at 10:20:55AM -0600, Brett Frankenberger wrote:
I'm referring to the _vendor's_ support costs - as in, you don't need as many people in the TAC if people don't keep running into IOS bugs; you don't need as large an RMA pool if the hardware is more reliable, etc.
What percentage of TAC personnel's time is spent dealing with calls that ultimately result in a BugID? NANOG isn't representative; mostly, TAC exists to take calls from idiots who bought a box that they don't know how to configure. Large network operators have a staff of people to handle that, so when they call TAC, the box is probably broken. I don't think that's the case with the majority of TAC cases, though.
For most Vendors, it depends which TAC and at what level. Having certain letters after your name (exact combination depends on the Vendor of course) means you get cases picked up in record time by top TAC engineers, and if you get Service Provider contracts you can get your "own" TAC engineers/teams to ring when things break. I would imagine most people on NANOG have access to at least one of these methods for their main vendors.

Of course, some vendors don't make "end-user" kit and don't have to worry too much about people ringing up who really don't know what they're doing. I've had the misfortune in the past of having to convince vendor engineers from more than one company that I *do* know what I'm doing and that we can dispense with the do-the-LEDs-come-on checklist - now I just go on the web and order what I like, as long as I return the broken kit fairly quickly. Saves me time getting replacement parts and saves the Vendor time as TAC never has to see dead-hardware cases. And people's time is money, so everyone saves money.

-- Ryan O'Connell - CCIE #8174 <ryan@complicity.co.uk> - http://www.complicity.co.uk I'm not losing my mind, no I'm not changing my lines, I'm just learning new things with the passage of time
--On Monday, 26 November, 2001 11:43 AM -0500 "Christopher A. Woodfield" <rekoil@semihuman.com> wrote:
I'm referring to the _vendor's_ support costs - as in, you don't need as many people in the TAC if people don't keep running into IOS bugs; you don't need as large an RMA pool if the hardware is more reliable, etc.
As the vendor would most likely decline to pass these savings along to the customer, I would see this as a profit opportunity for the vendor.
More than one well known vendor enjoys publishing to their victims^Wcustomers just how often they use their support channel, in order to provide justification (large usage) for the large support invoices they send in each quarter. To be fair, one of the above (at least) has now worked out it's a good idea to break this out by support request type.
Alex Bligh Personal Capacity
My first thought in response to this is the vendor's support costs - wouldn't shipping more reliable images bring down those costs significantly? Or is it just that the extra revenue opportunities gained by adding $WHIZBANG_FEATURE_DU_JOUR outweigh those potential support savings?
When presented with an either-or decision like "Doing <X> will make $M but doing <NOT-X> will save $S", then in most public companies $S would have to be more than $M x 2 before <X> will stop happening. In this case $S is not even a notable fraction of $M, so it's not even worth discussing. Somebody here said router design was a political process. I disagree. But networks are complex systems owing their existence (and their nature) to an ever-shifting matrix of politics, economics, and physics. And Heraclitus' maxim is very much apropos here. The target for a router designer is moving.
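The rule of thumb above reduces to a one-line comparison. A trivial Python sketch, with invented dollar figures, just to make the asymmetry explicit:

    def chooses_stability(feature_revenue, support_savings):
        # The claimed rule of thumb: savings must exceed twice the forgone new revenue.
        return support_savings > 2 * feature_revenue

    print(chooses_stability(feature_revenue=50e6, support_savings=60e6))   # False: ship the feature
    print(chooses_stability(feature_revenue=50e6, support_savings=120e6))  # True: stability wins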
They probably did. The vendor probably did also. Of course, they can't always simulate real network conditions. Nor can your own labs. Heck, even a small deployment on 2 or 3 routers (out of, say, 200) can't catch everything. It is a simple fact that some bugs don't show up until it's too late. And cascade failures occur more often than you might think (and not necessarily from software.)

Remember the AT&T frame outage? Procedural error. How about the Netcom outage of a few years ago? Someone misplaced a '.*' if I remember correctly. Human error of the simplest kind. I've had a data center go offline because someone slipped and turned off one side of a large breaker box. These things happen. The challenge is to eliminate the ones you CAN control. And, IMO, the industry is generally doing a good job of that. I chalk this whole thing up to bad karma for BT.

-Wayne

On Sat, Nov 24, 2001 at 11:05:20AM +0000, Neil J. McRae wrote:
BT is telling ISPs the reason for the multi-hour outage was a software bug in the interface cards used in BT's core network. BT installed a new version of the software. When that didn't fix the problem, they fell back to a previous version of the software.
BT didn't identify the vendor, but BT is identified as a "Cisco Powered Network(tm)." Non-BT folks believe the problem was with GSR interface cards. I can't independently confirm it.
I'd be surprised if it was the GSR, and in any case that doesn't absolve anyone. If it was a software issue, why wasn't the software properly tested? Why was such a critical upgrade rolled out across the entire network at the same time? It doesn't add up.
Neil.
--- Wayne Bouchard web@typo.org Network Engineer http://www.typo.org/~web/resume.html
participants (19)
- Alex Bligh
- Brett Frankenberger
- Christian Kuhtz
- Christopher A. Woodfield
- Daniel Golding
- Deepak Jain
- Ian Duncan
- Jared Mauch
- jerry scharf
- Jesper Skriver
- neil@DOMINO.ORG
- Patrick Greenwell
- Paul Vixie
- Peter Galbavy
- Ryan O'Connell
- S
- Sean Donelan
- Valdis.Kletnieks@vt.edu
- Wayne E. Bouchard