L3 East coast maint / fiber 05FEB2013 maintenance
I know a lot of you are out of the office right now, but does anybody have any info on what happened with L3 this morning? They went into a 5-hour maintenance window with an expected downtime of about 30 minutes while they upgraded something like *40* of their "core routers" (their words), but they also did this during some fiber work and completely cut off several of their East Coast peers for the entirety of the 5-hour window.
If anybody has any more info on this, or a NOC contact for them on the East Coast for future issues, you can hit me up off-list if you don't feel comfortable replying with that info here.
Thanks, and I hope you guys are enjoying Orlando.
-- *Josh Reynolds* esseph@gmail.com - (270) 302-3552
We saw the same thing out of their Tampa location; there was a brief drop around 2:00 AM EST and a more severe one around 4:05 AM, which lasted about 10 minutes for us. Unfortunately, whatever they did, they did it in a way that kept our BGP sessions up, so we couldn't react until bgpmon alerted me to some route withdrawals. By that time things were back to normal, and they remained stable.
-----Original Message-----
From: Josh Reynolds [mailto:esseph@gmail.com]
Sent: Tuesday, February 05, 2013 10:40 AM
To: nanog@nanog.org
Subject: L3 East coast maint / fiber 05FEB2013 maintenance
We also noticed an outage due to the L3 maintenance. We were not even notified about the maintenance itself. We also noticed black-holing in their network.
-Thanks, Viral
We're a Level3 customer in Orlando. Our BGP sessions stayed up, but the number of routes received from Level3 fell to only a few tens of thousands at about 4:10am, and gradually returned to normal numbers by about 4:35am.
----------------------------------------------------------------------
 Jon Lewis, MCP :)           |  I route
 Senior Network Engineer     |  therefore you are
 Atlantic Net                |
_________ http://www.lewis.org/~jlewis/pgp for PGP public key_________
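Since the sessions stayed up while the received route count collapsed, session state alone would not have raised an alarm. One way to catch this kind of event is to watch the received-prefix count against a baseline. The following is a minimal sketch of that idea only: the sample values, the baseline, and the 50% threshold are all hypothetical, and how the counts are collected (SNMP, CLI scraping, a route collector) is left out.

```python
def find_route_drops(samples, baseline, min_ratio=0.5):
    """Return (start, end) spans where the received-route count stayed
    below min_ratio * baseline. `end` is None if the drop is still ongoing
    at the last sample."""
    floor = baseline * min_ratio
    spans = []
    start = None
    for timestamp, count in samples:
        if count < floor and start is None:
            start = timestamp            # drop begins
        elif count >= floor and start is not None:
            spans.append((start, timestamp))  # drop ends
            start = None
    if start is not None:
        spans.append((start, None))
    return spans


# Hypothetical samples: (local time, routes received from the upstream).
SAMPLES = [
    ("04:05", 430000),
    ("04:10", 30000),
    ("04:20", 150000),
    ("04:30", 300000),
    ("04:35", 430000),
]

drops = find_route_drops(SAMPLES, baseline=430000)
print(drops)  # [('04:10', '04:30')]
```

The threshold is a judgment call: set too high, it fires on ordinary churn; set too low, it misses a partial table like the one described here.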
I got notification of their maintenance window, albeit with < 24 hours' notice. The notice came in at 11:00 GMT-5 yesterday; maintenance was scheduled for 00:00 GMT-5 this morning. That said, the notice said that the maintenance was in Phoenix, yet the notice I got was for my IPT circuit at 60 Hudson, which I found confusing.
Based on my logs, our BGP session with them went down at 03:06 GMT-5 and came back up at 03:15 GMT-5. It was down again from 03:37 GMT-5 until 04:20 GMT-5, and a third time from 06:41 GMT-5 until 06:45 GMT-5.
Traffic graphs tell a bit of a different story. Just before 05:00 GMT-5, our outbound traffic to Level 3 dropped substantially. About that time, I started getting reports about issues reaching Level 3 destinations. Traces seemed to indicate a black hole condition within Level 3's network in NYC, seemingly at, or just past, csw3.NewYork1.Level3.net.
Things seemed to correct themselves by about 06:45 GMT-5, though only partially, since Level 3 was sending only about 180k routes. About 20 minutes later, the table was back to ~431K and all's been fine since.
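For comparison with the "about 30 minutes" of expected downtime, the three BGP session outages listed above can be totalled. This is a small sketch that uses only the down/up times from that log; it assumes all of them fall on the same day and are accurate to the minute.

```python
from datetime import datetime, timedelta

# (down, up) times from the BGP session log above, all GMT-5.
FLAPS = [("03:06", "03:15"), ("03:37", "04:20"), ("06:41", "06:45")]


def outage_durations(flaps):
    """Return one timedelta per (down, up) pair of HH:MM strings."""
    fmt = "%H:%M"
    return [
        datetime.strptime(up, fmt) - datetime.strptime(down, fmt)
        for down, up in flaps
    ]


durations = outage_durations(FLAPS)
total = sum(durations, timedelta())

for (down, up), duration in zip(FLAPS, durations):
    print(f"{down} -> {up}: {int(duration.total_seconds() // 60)} min")
print(f"total: {int(total.total_seconds() // 60)} min")  # total: 56 min
```

This counts only the time the session itself was down. The black-hole period seen on the traffic graphs happened while the session was up, so it is not included.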
My hunch is that this is fallout and repairs from Juniper PR839412. Only fix is an upgrade. Not sure why they're not able to do a hitless upgrade though; that's unfortunate.
Specially-crafted TCP packets that can get past RE/loopback filters can crash the box.
--j
The workaround is proper filtering and other techniques on the RE/loopback to prevent the issue from happening.
Should an upgrade be performed? Yes, but it certainly doesn't have to happen right away or without notice to customers.
-- Jason
On Tue, Feb 5, 2013 at 9:33 AM, Jason Biel <jason@biel-tech.com> wrote:
Workaround is proper filtering and other techniques on the RE/Loopback to prevent the issue from happening.
Agreed. However, if it only takes one packet, what if an attacker sources the traffic from your management address space?
Guarding against this requires either a separate VRF/table for management traffic or transit traffic, RPF checking, or TTL security. If these weren't set up ahead of time, maybe it would be easier to upgrade than to lab, test, and deploy a new configuration.
This is all speculation about Level3 on my part; I don't know their network from an internal perspective.
--j
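To make two of the guards mentioned above concrete, here is a rough sketch of the logic behind TTL security (GTSM, RFC 5082) and strict RPF checking. It is an illustration of the checks themselves, not router configuration and not a description of Level3's setup. The table contents, interface names, and the hop-count convention (a threshold of 255 minus the allowed hops) are assumptions made for the example.

```python
import ipaddress

MAX_TTL = 255  # GTSM peers send with the maximum TTL


def passes_gtsm(received_ttl, max_hops=1):
    """Accept a packet only if its TTL shows it crossed at most `max_hops` hops.

    Assumed convention: threshold = 255 - max_hops. A remote attacker cannot
    deliver a packet with a TTL this high, because every router on the path
    decrements it.
    """
    return received_ttl >= MAX_TTL - max_hops


def passes_strict_rpf(source, ingress_interface, fib):
    """Strict RPF: the best route back to `source` must point out the
    interface the packet arrived on. `fib` maps prefix -> interface."""
    address = ipaddress.ip_address(source)
    best = None
    for prefix, interface in fib.items():
        network = ipaddress.ip_network(prefix)
        if address in network and (best is None or network.prefixlen > best[0].prefixlen):
            best = (network, interface)  # longest-prefix match so far
    return best is not None and best[1] == ingress_interface


# Hypothetical forwarding table: management space lives behind "mgmt0".
FIB = {"192.0.2.0/24": "mgmt0", "0.0.0.0/0": "transit0"}

print(passes_gtsm(255))                                  # True
print(passes_gtsm(64))                                   # False
print(passes_strict_rpf("192.0.2.10", "mgmt0", FIB))     # True
print(passes_strict_rpf("192.0.2.10", "transit0", FIB))  # False
```

The last call is the scenario raised above: a packet claiming a management source address but arriving on a transit interface fails the RPF check, even though an address-only filter would have let it through.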
Agree as well. Bad assumption on my part that Level3 would already be doing the items listed in the workaround.
-- Jason
On Tue, Feb 5, 2013 at 11:41 AM, Jonathan Lassoff <jof@thejof.com> wrote:
Agreed. However, if it only takes one packet, what if an attacker sources the traffic from your management address space?
Guarding against this requires either a separate VRF/table for management traffic or transit traffic, RPF checking, or TTL security. If these weren't set up ahead of time, maybe it would be easier to upgrade than to lab, test, and deploy a new configuration.
Routers that show up on exchange fabrics are a particular problem... For this issue...
For what it's worth, we have several dozen circuits with them, from 100 Mb/s office links to 10 Gb/s paths, and we have notifications for maintenance last night and tonight touching locations in Europe and on the US East and West coasts. I'm presuming that there is further internal work that is not directly impactful. I have evidence of various other providers, as well as ourselves, undertaking fixes for this issue.
participants (8)
- David Hubbard
- Jason Biel
- Jason Lixfeld
- joel jaeggli
- Jon Lewis
- Jonathan Lassoff
- Josh Reynolds
- Viral Vira