We lost several of our GigE links to AT&T for six hours on 11/19. Did anyone else see this and get a root cause from AT&T? All I can get is that they believe a change caused the issue.
On Nov 29, 2011, at 8:17 PM, <comptech@kc.rr.com> wrote:
We lost several of our GigE links to AT&T for 6 hours on 11/19, anyone else see this and get a root cause from AT&T? All I can get is that they believe a change caused the issue.
We lost several (but not all) of our Optiman circuits on 11/19 at about 10:20am. We were told the root issue was that all VLANs in one of their switches had been accidentally deleted. We were never able to get any additional detail (like "how"), but services were restored at about 16:45.
On Wed, Nov 30, 2011 at 8:21 AM, Brad Fleming <bdflemin@gmail.com> wrote:
We lost several (but not all) of our Optiman circuits on 11/19 at about 10:20am. We were told the root issue was that all VLANs in one of their switches had been accidentally deleted / removed. We were never able to get any additional detail (like "how") but services were restored about 16:45.
+1 to the above. We received the following RFO from their NOC: "All impacted VLANs were rebuilt to restore service. It is believed there were some configuration changes that caused the VLAN troubles. A case has been opened with Cisco to further investigate the root cause."

Stefan Mititelu
http://twitter.com/netfortius
http://www.linkedin.com/in/netfortius
Stefan wrote the following on 11/30/2011 8:53 AM:
+1 to the above - we received the following RFO from their NOC:
"All impacted VLANS were rebuilt to restore service. It is believed there were some configuration changes that caused the VLAN troubles. A case has been opened with Cisco to further investigate the root cause."
Sounds like a VTP mishap.
On Nov 30, 2011, at 9:51 AM, Blake Hudson wrote:
Sounds like a VTP mishap.
That was my first thought as well. It would just surprise me if a huge provider like AT&T were using VTP instead of a provisioning tool that automates the manual pruning process to avoid issues like this. In either case, I'm a customer and will likely never be told what went wrong. I'm OK with that so long as it doesn't happen again!
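[For context on the "VTP mishap" theory: in VTP server/client mode, any switch that joins the domain with a higher configuration revision number can overwrite the VLAN database domain-wide, which matches the "all VLANs accidentally deleted" symptom. The usual guard is transparent mode plus manual pruning. A hedged Cisco IOS-style sketch, with illustrative VLAN and interface numbers that are not from the actual outage:

```
! Transparent mode: the switch forwards VTP frames but ignores their
! contents, so a bad revision arriving from elsewhere cannot wipe the
! locally defined VLAN table.
vtp mode transparent
!
vlan 100
 name CUSTOMER-A
!
! Prune manually on trunks instead of relying on VTP pruning:
interface GigabitEthernet0/1
 switchport mode trunk
 switchport trunk allowed vlan 100,200
```

A provisioning system that pushes per-trunk allowed-VLAN lists achieves the same containment without VTP at all, which is what Brad is suggesting a large carrier would normally run. -ed.]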
In either case I'm a customer and will likely never be told what went wrong. I'm OK with that so long as it doesn't happen again!
Does being told what happened somehow prevent it from happening again?
Nope. But if this same issue crops up again, we'll have to "work the system" harder and demand calls with knowledgeable people; not an easy task for a customer my size (I'm not Starbucks with thousands of sites). A single outage can be understood; repeated issues mean I want to know what's going wrong. If the issue is something simply mitigated and the service provider hasn't taken steps, I need to start looking for a different service provider. Everything has a little downtime every now and again, and I can live with it on lower-speed circuits.
What is the utilitarian value in an RFO?
To determine whether it's an honest mistake or a more systemic issue that should push me toward another option.
What I have seen lately with telcos building and operating Metro Ethernet Forum (MEF) based Ethernet networks is that relatively inexperienced staff are in charge of configuring and operating them: operational staff unaware of layer 2 Ethernet nuances that, in an enterprise environment, network engineers must know, or else. I have seen numerous instances of telco MEF layer 2 outages of 20-30 seconds where my layer 3 routing keepalives time out. Subsequent telco root cause analysis determined that spanning tree convergence brought down multiple links in the telco MEF network. One telco technician, assigned to Ethernet switch configuration, told me that a 20-30 second network hang is not really a big deal.

-----Original Message-----
From: Brad Fleming [mailto:bdflemin@gmail.com]
Sent: Wednesday, November 30, 2011 9:58 AM
To: Joe Maimon
Cc: nanog@nanog.org
Subject: Re: ATT GigE issue on 11/19 in Kansas City
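[A 20-30 second hang is consistent with legacy 802.1D spanning tree: after a directly detected link failure, a blocked port spends 15 seconds in listening and 15 in learning before it forwards, so roughly 30 seconds of blackholing is the protocol behaving as designed. A hedged IOS-style sketch of the common mitigation, with an illustrative interface number:

```
! 802.1w (Rapid PVST+ on Cisco gear) replaces the timer-driven
! listening/learning wait with a handshake and typically reconverges
! in well under a second on point-to-point links.
spanning-tree mode rapid-pvst
!
! Host/edge-facing ports skip listening/learning entirely:
interface GigabitEthernet0/2
 spanning-tree portfast
```

Whether a given telco MEF platform exposes these exact knobs varies by vendor; the point is that sub-second L2 reconvergence has been standard for years, so 20-30 second hangs are a design choice, not physics. -ed.]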
On Thursday, December 01, 2011 02:56:37 AM Holmes,David A wrote:
What I have seen lately with telco's building and operating Metro Ethernet Forum (MEF) based Ethernet networks is that relatively inexperienced telco staff are in charge of configuring and operating the networks, where telco operational staff are unaware of layer 2 Ethernet network nuances, nuances that in an Enterprise environment network engineers must know, or else.
We use RANCID here, quite heavily, to help guide provisioning engineers so they are better prepared for the future and actually understand what it is they are configuring. Pre-provisioning training is all good and well, but hands-on experience always has the chance of "going the other way". While RANCID is after-the-fact, it's a great tool for refining what the folk on the ground know. It certainly has helped us a great deal over the years.

Mark.
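[For anyone unfamiliar with it: RANCID logs into each device on a schedule, archives the config under version control, and mails a diff whenever something changes, so an accidental VLAN wipe shows up minutes later as an attributable config diff rather than a mystery. A minimal sketch with hypothetical device names; file locations and the router.db field separator vary by install and RANCID version (classic releases use colons, 3.x uses semicolons):

```
# rancid.conf (excerpt): one collection group for the metro network
LIST_OF_GROUPS="metro-e"

# metro-e/router.db, classic format hostname:vendor:state
sw-kc-core1:cisco:up
sw-kc-edge7:cisco:up

# cron entry for the rancid user: poll hourly; diffs are mailed to
# the rancid-metro-e / rancid-admin-metro-e aliases
0 * * * * /usr/libexec/rancid/rancid-run
```

The diff emails are what make it a training tool as well as an audit trail: engineers see exactly what their change did to the running config. -ed.]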
No, it doesn't prevent it from happening again. But at least you can have them check for that same issue when it happens next time. I guess the RFO gives the customer the feeling that the vendor was able to isolate the issue and fix it, as opposed to "issue was resolved before isolation".

- Miraaj Soni

-----Original Message-----
From: Joe Maimon [mailto:jmaimon@ttec.com]
Sent: Wednesday, November 30, 2011 10:46 AM
To: Brad Fleming
Cc: nanog@nanog.org
Subject: Re: ATT GigE issue on 11/19 in Kansas City

Brad Fleming wrote:
In either case I'm a customer and will likely never be told what went wrong. I'm OK with that so long as it doesn't happen again!
Does being told what happened somehow prevent it from happening again? What is the utilitarian value in an RFO?

Joe
On 30 November 2011 17:45, Joe Maimon <jmaimon@ttec.com> wrote:
Does being told what happened somehow prevent it from happening again?
What is the utilitarian value in an RFO?
"The outage was caused by an engineer turning off the wrong router; it has been turned back on and service restored."
"The outage appears to have been caused by a bug in the router's firmware; we are working with the vendor on a fix."
"There was an outage; now service is back up again."

For a brief, isolated incident you probably don't care enough to change providers in any of these cases (if you care about outages that much, you just divert traffic to your other redundant connections). But say you've had two outages in a week with that given as the explanation: which one makes you more inclined to go shopping for another provider? Technically the first provider knows the causes of the outages and they have been fixed, while the second doesn't know for sure what the problem is and might or might not have fixed it; however, I suspect most people would not agree with that interpretation. The third provider, I don't think there's any way to interpret it to make them look good.
From a utilitarian point of view, the more detail customers get, the less angry they normally are, and I believe "less angry" is a generally accepted form of "happier" in the ISP world (at least some ISPs seem to think so). Therefore, for utilitarian reasons you should write nice long detailed reports, unless the cause is incompetence, in which case you should probably just shut up and let people assume incompetence instead of confirming it, as confirming it might make them less happy. Although one could also argue that by being honest about incompetence your customers will likely change providers sooner, causing an overall increase in their level of happiness. This utilitarian thing is complicated.
- Mike
On 11/30/11 11:35 AM, Mike Jones wrote:
"The outage was caused by an engineer turning off the wrong router, it has been turned back on and service restored"
"The outage appears to have been caused by a bug in the routers firmware, we are working with the vendor on a fix"
"There was an outage, now service is back up again"
When the RFO gets filtered through the marketing department, it gets interesting, and totally useless. This is what we got as an official RFO for an outsourced hosted VoIP service (carrier shall remain nameless) that was for all practical purposes down hard for two DAYS due to a botched planned software upgrade, verbatim and in its entirety:

"Coincident with this upgrade, we experienced an Operating System-level failure on the underlying application server platform which had the effect of defeating the redundancy paradigm designed into our service architecture."

--
Jay Hennigan - CCIE #7880 - Network Engineering - jay@impulse.net
Impulse Internet Service - http://www.impulse.net/
Your local telephone and internet company - 805 884-6323 - WB6RDV
participants (10)

- Blake Hudson
- Brad Fleming
- comptech@kc.rr.com
- Holmes, David A
- Jay Hennigan
- Joe Maimon
- Mark Tinka
- Mike Jones
- Soni, Miraaj
- Stefan