Level3 worldwide emergency upgrade?
Does anyone have details on tonight's apparent worldwide emergency router upgrade? All I managed to get out of the portal was 30 minutes, "Service Affecting" (no kidding?) and the NOC line gave me the recording about it and disconnected me. -R>
On 06 Feb 2013, at 11:58 AM, Ray Wong <rayw@rayw.net> wrote:
Does anyone have details on tonight's apparent worldwide emergency router upgrade? All I managed to get out of the portal was 30 minutes, "Service Affecting" (no kidding?) and the NOC line gave me the recording about it and disconnected me.
Nothing confirmed from my side, but the general guess I saw was that it was Juniper-related. -J
That is the general guess. On Wed, Feb 6, 2013 at 5:11 AM, Stephane Bortzmeyer <bortzmeyer@nic.fr> wrote:
On Wed, Feb 06, 2013 at 01:04:40PM +0200, JP Viljoen <froztbyte@froztbyte.net> wrote a message of 10 lines which said:
the general guess I saw was that it was Juniper-related.
Juniper Technical Bulletin PSN-2013-01-823, probably?
-- Jason
I just received this email from Level3:

----
Summary: Level 3 Communications will perform a mandatory network upgrade that will be service impacting and will impact devices in multiple locations. We are upgrading the code on portions of the global network to increase stability for the overall network. During this maintenance activity customers may be impacted for approximately 30 minutes.

Updates: This maintenance window has completed successfully.
----

On Feb 6, 2013, at 4:21 AM, Jason Biel <jason@biel-tech.com> wrote:
ugh! On Wed, Feb 6, 2013 at 6:04 AM, JP Viljoen <froztbyte@froztbyte.net> wrote:
On 06 Feb 2013, at 11:58 AM, Ray Wong <rayw@rayw.net> wrote:
Does anyone have details on tonight's apparent worldwide emergency router upgrade? All I managed to get out of the portal was 30 minutes, "Service Affecting" (no kidding?) and the NOC line gave me the recording about it and disconnected me.
Nothing confirmed from my side, but the general guess I saw was that it was Juniper-related.
-J
Also received same ... On Wed, Feb 6, 2013 at 10:58 AM, Ray Wong <rayw@rayw.net> wrote:
Does anyone have details on tonight's apparent worldwide emergency router upgrade? All I managed to get out of the portal was 30 minutes, "Service Affecting" (no kidding?) and the NOC line gave me the recording about it and disconnected me.
-R>
-- Warm Regards, Peter (CCIE 23782).
On Feb 6, 2013, at 6:38 AM, Peter Ehiwe <peterehiwe@gmail.com> wrote:
Also received same ...
On Wed, Feb 6, 2013 at 10:58 AM, Ray Wong <rayw@rayw.net> wrote:
Does anyone have details on tonight's apparent worldwide emergency router upgrade? All I managed to get out of the portal was 30 minutes, "Service Affecting" (no kidding?) and the NOC line gave me the recording about it and disconnected me.
So, I'm wondering what is so shocking about someone having to push out some sort of upgrade, either urgently or periodically, that is this impacting and causes these emails on the list. There seems to be some sort of psychological event happening in addition to the technological one.

In the past I've had to push out software fixes "urgently" for various reasons, either a software thing like the PSN or some weird hardware+software interaction that causes bad things to happen.

Would you rather your ISP not maintain their devices? Are the consequences "so bad" of a 30 minute outage that your business is severely impacted?

- Jared
Would you rather your ISP not maintain their devices? Are the consequences "so bad" of a 30 minute outage that your business is severely impacted?
- Jared
You had me up until that line. That should be expanded a little ...

First, I'd say, yes - many businesses would be severely impacted and may even have consequential issues if they had to sustain a 30 minute outage. Suppose for a moment they couldn't process money machine transactions for 30 minutes; or Netflix couldn't serve content for 30 minutes; or youporn was offline for 30 minutes.

The question should be more along the lines of, "why aren't you multihomed in a way that would make a 30 minute outage (which is inevitable) irrelevant to you?"
On Wed, Feb 06, 2013 at 07:57:06AM -0500, Alex Rubenstein scribbled:
# The question should be more along the lines of, "why aren't you multihomed in a way that would make a 30 minute outage (which is inevitable) irrelevant to you?"

The fun part of this emergency maintenance in the northeast USA was that even folks who are multihomed felt it: Level3 managed to do this in a way that kept BGP sessions up but killed the ability to actually pass traffic. I'm not sure what they did that caused this, or whether anyone but northeast folks were affected by it, but it sure was neat to be effectively blackholed in and out of one of your provided circuits for a while.

Also, in the northeast, they managed to make it quite a bit more than a 30min outage for many people; they even slid hours outside of their advertised emergency window. I do applaud them for what I can only assume was a *massive* undertaking: emergency upgrading that many routers in such a short period of time.

-- Jonathan Towne
On Wed, Feb 6, 2013 at 5:10 AM, Jonathan Towne <jtowne@slic.com> wrote:
On Wed, Feb 06, 2013 at 07:57:06AM -0500, Alex Rubenstein scribbled: # The question should be more along the lines of, "why aren't you multihomed in a way that would make a 30 minute outage (which is inevitable) irrelevant to you?
The fun part of this emergency maintenance in the northeast USA was that even folks who are multihomed felt it: Level3 managed to do this in a way that kept BGP sessions up but killed the ability to actually pass traffic. I'm not sure what they did that caused this, or whether anyone but northeast folks were affected by it, but it sure was neat to be effectively blackholed in and out of one of your provided circuits for a while.
I recommend you grab http://kestrel3.netflight.com/2013.02.05-NANOG57-day2-afternoon-session.txt and search for PR8361907.

Richard did a very good lightning talk about why Juniper boxes will bring up BGP but blackhole traffic for 30 minutes to over an hour, depending on the number of BGP sessions they are handling. His recommendation: if you don't like it, go tell Juniper to fix that bug.

Matt
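For anyone trying to monitor for the failure mode Matt describes, BGP session state alone clearly isn't enough; you also have to probe the data plane. A minimal sketch of that classification step, assuming probe results are already collected somehow (the function names and the loss threshold are illustrative, not from the talk):

```python
# Sketch: distinguish an ordinary provider outage from the nastier case
# where BGP stays established but traffic is silently dropped.
# Threshold and names are illustrative assumptions.

def silent_blackhole(bgp_established, probe_loss_pct, loss_threshold=90.0):
    """True when the control plane looks healthy but active probes
    through this provider show (near-)total packet loss."""
    return bgp_established and probe_loss_pct >= loss_threshold

def classify_upstream(bgp_established, probe_loss_pct):
    if not bgp_established:
        return "down"        # ordinary failure: routes withdraw on their own
    if silent_blackhole(bgp_established, probe_loss_pct):
        return "blackhole"   # worst case: routes stay up, traffic dies
    return "up"
```

The "blackhole" case is the one multihoming alone doesn't save you from, since best-path selection keeps trusting the still-established session.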
The Juniper PR in question is actually 836197. On Wed, Feb 6, 2013 at 10:22 AM, Matthew Petach <mpetach@netflight.com> wrote:
-- Brian C Landers http://www.packetslave.com/ CCIE #23115 (R&S + Security)
I think you might find it's this issue: PSN-2013-01-823, "Junos: Crafted TCP packet can lead to kernel crash".

-----Original Message-----
From: Matthew Petach [mailto:mpetach@netflight.com]
Sent: Thursday, 7 February 2013 7:23 a.m.
To: Jonathan Towne
Cc: nanog@nanog.org
Subject: Re: Level3 worldwide emergency upgrade?
Sorry, I should rephrase. The reason for the upgrade is PSN-2013-01-823 (PR 839412). The reason for the BGP blackhole is, as you point out, PR8361907.

-----Original Message-----
From: Simon Allard [mailto:Simon.Allard@team.orcon.net.nz]
Sent: Monday, 11 February 2013 2:48 p.m.
To: Matthew Petach; Jonathan Towne
Cc: nanog@nanog.org
Subject: RE: Level3 worldwide emergency upgrade?
On Feb 6, 2013, at 7:57 AM, Alex Rubenstein <alex@corp.nac.net> wrote:
Would you rather your ISP not maintain their devices? Are the consequences "so bad" of a 30 minute outage that your business is severely impacted?
- Jared
You had me up until that line.
That should be expanded a little ...
First, I'd say, yes - many businesses would be severely impacted and may even have consequential issues if they had to sustain a 30 minute outage. Suppose for a moment they couldn't process money machines transactions for 30 minutes; or Netflix couldn't serve content for 30 minutes; or youporn was offline for 30 minutes.
The question should be more along the lines of, "why aren't you multihomed in a way that would make a 30 minute outage (which is inevitable) irrelevant to you?
Yeah, perhaps not as elegantly worded as I would have hoped, but there are many reasons things "go down". Just one of those elements is the internet part; there's also transport, power, and other elements that combine to make this complex system called the internet. If you N+N or N+1 your power, perhaps something similar for your connectivity is important. Or you just plan to be down/broken periodically for 30 minutes and have a plan to cover that.

The building where our NOC is located sometimes gets evacuated. Having a plan for that is important. During one visit, there was a small fire in the building (or so we were told). Certainly an unexpected event that disrupted us for ~30 minutes.

The handling and response of these events certainly is important. I do want to understand why and how it's so bad, so if there are things we as an SP community can improve upon, we can do that. That's my real goal, not poking at people who are single homed and down.

- Jared
Yeah, perhaps not as elegantly worded as I would have hoped, but there are many reasons things "go down". Just one of those elements is the internet part, there's also transport, power, and other elements that combine to make this complex system called the internet. If you N+N or N+1 your power, perhaps something similar for your connectivity is important. Or you just plan to be down/broken periodically for 30 minutes and have a plan to cover that.
Agreed.
The building where our NOC is located sometimes gets evacuated. Having a plan for that is important. During one visit, there was a small fire in the building (or so we were told). Certainly an unexpected event that disrupted us for ~30 minutes.
And, if it is important to you, you will have N+N NOCs - i.e., more than one, in different buildings, cities, or countries, depending on your requirements.
The handling and response of these events certainly is important. I do want to understand why and how it's so bad so if there are things as a SP in the community we can improve upon we can do that.
I suspect, as I touched on previously, that the most noise will come from the people who are the least realistic and least prepared.

Personally, I live with the expectation that whatever it is (power, fiber, transport, ISP, highways, fuel delivery, etc.) will at some point be broken, degraded, or otherwise unavailable, and you have to plan accordingly. Personally (and I speak for NAC), I/we don't care, really, if any upstream IP provider breaks; we have made appropriate plans to work around that in an automated fashion.

Hope that answers your more general question.
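A hedged sketch of the kind of automated workaround Alex alludes to: feed per-upstream health (ideally from active data-plane probes, not just BGP state) into a simple priority-ordered selection. The upstream names are made up for illustration.

```python
# Sketch: choose which upstream should carry traffic, given health data
# from whatever active probing is in place. Names are illustrative.

def pick_upstream(priority, healthy):
    """Return the first healthy upstream in priority order, or None if
    every upstream has failed. `healthy` maps upstream name -> bool."""
    for upstream in priority:
        if healthy.get(upstream, False):
            return upstream
    return None
```

In practice the selection result would drive a routing-policy change (e.g. adjusting local preference or swapping a default route), which is deliberately out of scope for this sketch.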
On Wed, 2013-02-06 at 07:57 -0500, Alex Rubenstein wrote:
Would you rather your ISP not maintain their devices? Are the consequences "so bad" of a 30 minute outage that your business is severely impacted?
- Jared
You had me up until that line.
That should be expanded a little ...
First, I'd say, yes - many businesses would be severely impacted and may even have consequential issues if they had to sustain a 30 minute outage. Suppose for a moment they couldn't process money machines transactions for 30 minutes; or Netflix couldn't serve content for 30 minutes; or youporn was offline for 30 minutes.
The question should be more along the lines of, "why aren't you multihomed in a way that would make a 30 minute outage (which is inevitable) irrelevant to you?
Multihomed, or simply redundantly equipped to switch over faster?
On Wed, Feb 06, 2013 at 07:39:14AM -0500, Jared Mauch wrote:
So, I'm wondering what is shocking that someone may have to push out some sort of upgrade either urgently or periodically that is so impacting and causes these emails on the list.
My impression is mostly that people are left feeling uncomfortable by a massive upgrade of this sort with so little communication about why and so on. "Emergency work for five hours and 30 minutes disconnection" that turns out to take longer than 30 minutes of disconnection probably ought to come with some explanation (at least after the fact).

Regards,

A

-- Andrew Sullivan Dyn, Inc. asullivan@dyn.com v: +1 603 663 0448
On Wed, Feb 6, 2013 at 7:10 AM, Andrew Sullivan <asullivan@dyn.com> wrote:
On Wed, Feb 06, 2013 at 07:39:14AM -0500, Jared Mauch wrote:
So, I'm wondering what is shocking that someone may have to push out some sort of upgrade either urgently or periodically that is so impacting and causes these emails on the list.
My impression is mostly that people are left feeling uncomfortable by a massive upgrade of this sort with so little communication about why and so on. "Emergency work for five hours and 30 minutes disconnection" that turns out to take longer than 30 minutes of disconnection probably ought to come with some explanation (at least after the fact).
Especially in the wake of the one they already did recently. It's unsettling to receive little communication, and even multihomed, there's always the question of being pushed into overages with other providers.

Yes, short notice maintenance does happen. Better communication happens much less often. I was more looking for details, i.e. the sort of problem this is, as it probably also means all my *other* providers are going to be scrambling in the next few days/weeks/months, depending on what gear they're all using. I'm out of the global infrastructure game myself for a few years currently, but I still have to think ahead to the network I do maintain.

-R>
On 2/6/13 7:43 AM, Ray Wong wrote:
On Wed, Feb 6, 2013 at 7:10 AM, Andrew Sullivan <asullivan@dyn.com> wrote:
On Wed, Feb 06, 2013 at 07:39:14AM -0500, Jared Mauch wrote:
So, I'm wondering what is shocking that someone may have to push out some sort of upgrade either urgently or periodically that is so impacting and causes these emails on the list.
My impression is mostly that people are left feeling uncomfortable by a massive upgrade of this sort with so little communication about why and so on. "Emergency work for five hours and 30 minutes disconnection" that turns out to take longer than 30 minutes of disconnection probably ought to come with some explanation (at least after the fact).
Especially in the wake they already recently did one. It's unsettling to receive little communication, and even multihomed, there's always the question of being pushed into overages around other providers.
Yes, short notice maintenance does happen. Better communication happens much less often.

I received advance (24 hours) notification of maintenances over the last two days to circuits ranging in size from 100Mb/s to 10Gb/s in about a dozen locations. I assumed there would be further disruption as devices I'm not directly connected to were touched.
I was more looking for details, i.e. the sort of problem this is, as it probably also means all my *other* providers are going to be scrambling in the next few days/weeks/months, depending on what gear they're all using.

All your other providers using that vendor have been scrambling for about a week as well. Junos devices should be upgraded.

I'm out of the global infrastructure game myself for a few years currently, but I still have to think ahead to the network I do maintain.
-R>
Given the issue was announced a week ago, I'm surprised they didn't provide some sort of emergency notification prior to the upgrade. However, I certainly understand their immediate desire to deploy this update. I don't think it's as bad as the BGP one from not too long ago, in that exploit code is not yet publicly available to my knowledge, but it certainly won't take long.

On Wed, Feb 6, 2013 at 9:04 AM, joel jaeggli <joelja@bogus.com> wrote:
On 2/6/13 7:43 AM, Ray Wong wrote:
On Wed, Feb 6, 2013 at 7:10 AM, Andrew Sullivan <asullivan@dyn.com> wrote:
On Wed, Feb 06, 2013 at 07:39:14AM -0500, Jared Mauch wrote:
So, I'm wondering what is shocking that someone may have to push out some sort of upgrade either urgently or periodically that is so impacting and causes these emails on the list.
My impression is mostly that people are left feeling uncomfortable by a massive upgrade of this sort with so little communication about why and so on. "Emergency work for five hours and 30 minutes disconnection" that turns out to take longer than 30 minutes of disconnection probably ought to come with some explanation (at least after the fact).
Especially in the wake they already recently did one. It's unsettling to receive little communication, and even multihomed, there's always the question of being pushed into overages around other providers.
Yes, short notice maintenance does happen. Better communication happens much less often.
I received advance (24 hours) notification of maintenances over the last two days to circuits ranging in size from 100Mb/s to 10Gb/s in about a dozen locations. I assumed there would be further disruption as devices I'm not directly connected to were touched.
I was more looking for details, i.e. the sort of problem this is, as it probably also means all my *other* providers are going to be scrambling in the next few days/weeks/months, depending on what gear they're all using.
All your other providers using that vendor have been scrambling for about a week as well. Junos devices should be upgraded.
I'm out of the global infrastructure game myself
for a few years currently, but I still have to think ahead to the network I do maintain.
-R>
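Joel's "Junos devices should be upgraded" is easy to say and tedious to verify across a fleet. A rough audit-helper sketch, assuming a simplified `major.minorRrelease` version format; the fixed-release table below is a placeholder only, not the real PSN-2013-01-823 data, so check the actual bulletin before trusting any such check.

```python
import re

def parse_junos(version):
    """Parse e.g. '10.4R13' into (10, 4, 13). Simplified: ignores
    service-release suffixes like '.5' and special builds."""
    m = re.match(r"(\d+)\.(\d+)R(\d+)", version)
    if not m:
        raise ValueError("unrecognized Junos version: " + version)
    return tuple(int(x) for x in m.groups())

# Placeholder minimum fixed R-release per release train.
# NOT the real advisory data -- populate from the vendor bulletin.
FIXED = {(10, 4): 14, (11, 4): 7}

def needs_upgrade(version):
    """True if this version is below the fixed release for its train,
    or if the train isn't in our table (assume exposed until verified)."""
    major, minor, rel = parse_junos(version)
    fixed_rel = FIXED.get((major, minor))
    if fixed_rel is None:
        return True
    return rel < fixed_rel
```

Running this over the version strings your inventory system already collects gives a quick first-pass exposure list, which is about all a one-line "should be upgraded" can be mechanized into.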
OK, having had that first cup of coffee, I can say perhaps the main reason I was wondering is that I've gotten used to Level3 always being on top of things (and admittedly, rarely communicating). They've reached the top by often being a black box of reliability, so it's (perhaps unrealistically) surprising to see them caught by surprise. Anything that pushes them into scramble mode causes me to lose a little sleep anyway. The alternative to what they did seems likely for at least a few providers who'll NOT manage to fix things in time, so I may well be looking at longer outages from other providers, and need to issue guidance to others on what to do if/when other links go down for periods long enough that all the cost-bounding monitoring alarms start to scream even louder.

I was also grumpy at myself for having not noticed advance communication, which I still don't seem to have, though since I outsourced my email to bigG, I've noticed I'm more likely to miss things. Perhaps giving up maintaining that massive set of procmail rules has cost me a bit more edge.

Related, of course: just because you design/run your network to tolerate some issues doesn't mean you can also budget to be under support contract as well. :) Knowing more about the exploit/fix might mean trying to find a way to get free upgrades to some kit to prevent more localized attacks on other types of gear as well, though in this case it's all about Juniper PR839412, so vendor specific, it seems?

There are probably more reasons to wish for more info, too. There are still more of them (exploiters/attackers) than there are those of us trying to keep things running smoothly and transparently, so anything that smells of "OMG new exploit found!" also triggers my desire to share information. The network bad guys share information far more quickly and effectively than we do, it often seems.

-R>
Hi Ray,

This topic reminds me of yesterday's discussion in the conference around getting some BCOPs drafted. It would be useful to confirm my own view of the BCOP around communicating security issues.

My understanding of the best practice is to limit knowledge distribution of security related problems both before and after the patches are deployed. You limit knowledge before the patch is deployed to prevent yourself from being exploited, but you also limit knowledge afterwards in order to limit potential damage to others (customers, competitors... the Internet at large). You also do not want to announce that you will be deploying a security patch until you have a fix in hand and know when you will deploy it (typically, the next available maintenance window, unless the cat is out of the bag and the danger is real and imminent).

As a service provider, you should stay on top of security alerts from your vendors so that you can make your own decision about what action is required. I would not recommend relying on service provider maintenance bulletins or public operations mailing lists for obtaining this type of information. There is some information that can cause more harm than good if it is distributed in the wrong way, and information relating to security vulnerabilities definitely falls into that category.

Dave

-----Original Message-----
From: Ray Wong [mailto:rayw@rayw.net]
Sent: Wednesday, February 06, 2013 9:16 AM
To: nanog@nanog.org
Subject: Re: Level3 worldwide emergency upgrade?
David, I am on an evening shift and am just now reading this thread. I was almost tempted to write an explanation that would have had identical content with yours, based simply on Level3 doing something and keeping the information close.

Responsible vendors do not try to hide what is being done unless it is an Op Sec issue, and I have never seen Level3 act with less than responsibility, so it had to be Op Sec. When it is that, it is best if the remainder of us sit quietly on the sidelines.

Ralph Brandt

-----Original Message-----
From: Siegel, David [mailto:David.Siegel@Level3.com]
Sent: Wednesday, February 06, 2013 12:01 PM
To: 'Ray Wong'; nanog@nanog.org
Subject: RE: Level3 worldwide emergency upgrade?
On 2/6/13 4:41 PM, Brandt, Ralph wrote:
David. I am on an evening shift and am just now reading this thread.
I was almost tempted to write an explanation that would have had identical content with yours based simply on Level3 doing something and keeping the information close.
Responsible Vendors do not try to hide what is being done unless it is an Op Sec issue and I have never seen Level3 act with less than responsibility so it had to be Op Sec.
When it is that, it is best if the remainder of us sit quietly on the sidelines.

To be clear: the existence of the PR has been known publicly, and the software releases that address it have been available for a week now.
Everyone who has potential exposure should be addressing the issue, and soon.
Ralph Brandt
-----Original Message----- From: Siegel, David [mailto:David.Siegel@Level3.com] Sent: Wednesday, February 06, 2013 12:01 PM To: 'Ray Wong'; nanog@nanog.org Subject: RE: Level3 worldwide emergency upgrade?
Hi Ray,
This topic reminds me of yesterday's discussion in the conference around getting some BCOPs drafted. It would be useful to confirm my own view of the BCOP around communicating security issues. My understanding of the best practice is to limit knowledge distribution of security-related problems both before and after the patches are deployed. You limit knowledge before the patch is deployed to prevent yourself from being exploited, but you also limit knowledge afterwards in order to limit potential damage to others (customers, competitors...the Internet at large). You also do not want to announce that you will be deploying a security patch until you have a fix in hand and know when you will deploy it (typically, the next available maintenance window, unless the cat is out of the bag and the danger is real and imminent).
As a service provider, you should stay on top of security alerts from your vendors so that you can make your own decision about what action is required. I would not recommend relying on service provider maintenance bulletins or public operations mailing lists for obtaining this type of information. There is some information that can cause more harm than good if it is distributed in the wrong way and information relating to security vulnerabilities definitely falls into that category.
Dave
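Dave's advice above — track security alerts from your vendors directly rather than relying on provider maintenance bulletins — is easy to automate. A minimal sketch: the feed shape, element names, and severity values below are invented for illustration and do not match any particular vendor's actual advisory format.

```python
import xml.etree.ElementTree as ET

# Hypothetical advisory feed. Real vendor feeds (Juniper PSNs, Cisco PSIRT)
# use their own schemas, so treat this purely as a shape sketch.
SAMPLE_FEED = """<feed>
  <entry>
    <title>PSN-2013-01-823: routing daemon fix</title>
    <severity>critical</severity>
  </entry>
  <entry>
    <title>Documentation update</title>
    <severity>info</severity>
  </entry>
</feed>"""

def critical_advisories(feed_xml):
    """Return titles of feed entries marked with severity 'critical'."""
    root = ET.fromstring(feed_xml)
    return [
        entry.findtext("title")
        for entry in root.iter("entry")
        if entry.findtext("severity") == "critical"
    ]

print(critical_advisories(SAMPLE_FEED))
# → ['PSN-2013-01-823: routing daemon fix']
```

In practice you would fetch the vendor's real feed on a schedule and alert on anything new above your severity threshold, which keeps the decision about "do we need an emergency window" in your own hands.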
Hell, we used to not have to bother notifying customers of anything; we just fixed the problem. Reminds me of a story I've probably shared in the past.

1995, IETF in Dallas. The "big ISP" I worked for at the time got tripped up on a 24-day IS-IS timer bug (maybe all of them at the time did, I don't recall) where all adjacencies reset at once. That's like, entire network down. Working with our engineering team in the *terminal* lab, mind you, and Ravi Chandra (then at Cisco), we reloaded the entire network of routers with new code from Cisco once they'd fixed the bug. I seem to remember this being my first exposure to Tony Li's infamous line, "... Confidence Level: boots in the lab."

Good times.

-b
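Brett's "24-day" figure is the kind of period that usually points at counter wrap-around. The thread never says what the timer actually was, so purely as an illustration (an assumption, not from any bug report): a signed 32-bit counter of milliseconds overflows after just under 25 days.

```python
# Illustrative only: the actual IS-IS bug's internals aren't described in
# the thread. A signed 32-bit counter holds 2**31 distinct non-negative
# millisecond values before wrapping, i.e. roughly 24.9 days of uptime.
MS_PER_DAY = 1000 * 60 * 60 * 24

days_until_wrap = 2**31 / MS_PER_DAY
print(f"signed 32-bit ms counter wraps after {days_until_wrap:.2f} days")
# → signed 32-bit ms counter wraps after 24.86 days
```

If many routers were brought up in the same maintenance window, a wrap-driven fault would fire on all of them at nearly the same moment, which would match the "all adjacencies reset at once" symptom.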
ah - those were the days of glory... :)
Good times indeed...

Regards,
Jeff
I remember being glued to my workstation for 10 straight hours due to an OSPF bug that took down the whole of net99's network. I was pretty proud of our size at the time...about 30Mbps at peak. Times are different and so are expectations. :-)

Dave
No one had hit the IS-IS bug before the IETF-enforced maintenance freeze because no one in their right mind would be running three-week-old code back then. I don't think things have changed that much. ;)

-dorian
On Wed, 6 Feb 2013, Ray Wong wrote:
My impression is mostly that people are left feeling uncomfortable by a massive upgrade of this sort with so little communication about why and so on. "Emergency work for five hours and 30 minutes disconnection" that turns out to take longer than 30 minutes of disconnection probably ought to come with some explanation (at least after the fact).
I was more looking for details, i.e. the sort of problem this is, as it probably also means all my *other* providers are going to be scrambling in the next few days/weeks/months, depending on what gear they're all using. I'm out of the global infrastructure game myself for a few years currently, but I still have to think ahead to the network I do maintain.
If Level3 is pushing this upgrade because of a security vulnerability, like the recent Juniper PSN, any public notification will likely be tersely worded out of necessity. You might be able to get more details by contacting your account team, but it's highly unlikely that you'll see the level of detail you're looking for in a public communication. That's not a knock against Level3, and most other carriers will likely be equally tight-lipped on the details.

jms
On 2/6/13 8:34 AM, Justin M. Streiner wrote:
If Level3 is pushing this upgrade because of a security vulnerability, like the recent Juniper PSN, any public notification will likely be tersely worded out of necessity.
The one that motivated us to upgrade is PR839412; I assume that applies to most people with an interest in running current Junos. My imagination is pretty good, so that got my attention.
* Andrew Sullivan:
My impression is mostly that people are left feeling uncomfortable by a massive upgrade of this sort with so little communication about why and so on.
That's a side effect of Juniper's notification policy. Perhaps someone should take them at their word ("Security patches and advisories are freely available from our web site.") and post this stuff publicly, so that everybody feels rightly scared and complains less about these disruptions.
participants (25)

- Alex Rubenstein
- Alexander Maassen
- Andrew Sullivan
- bmanning@vacation.karoshi.com
- Brandt, Ralph
- Bret Palsson
- Brett Watson
- Brian Landers
- Dorian Kim
- Florian Weimer
- james jones
- Jared Mauch
- Jason Biel
- Jeff Tantsura
- joel jaeggli
- Jonathan Towne
- JP Viljoen
- Justin M. Streiner
- Matthew Petach
- PC
- Peter Ehiwe
- Ray Wong
- Siegel, David
- Simon Allard
- Stephane Bortzmeyer