TATA problems?

Todd Snyder

7 Nov 2011

3 p.m.

We seem to be having some problems with our tata links - first seen in EU about 45 minutes ago, now we're seeing problems in NA. I'm focused on DNS, so I'm seeing a lot of timeouts/servfails, but our networking folks are talking about links dropping.

Anyone else seeing oddness on the NA Internet right now?

http://downrightnow.com/ confirms - something is up.

cheers,

t.


Stephane Bortzmeyer

7 Nov

3:05 p.m.

On Mon, Nov 07, 2011 at 10:00:34AM -0500, Todd Snyder <todd@borked.ca> wrote a message of 12 lines which said:

...

We seem to be having some problems with our tata links

They probably use Juniper routers :-)

Tim Vollebregt

3:06 p.m.

Hi,

This issue seems to be much bigger, we lost about 20 Level3 and some TATA sessions. Also we lost about 15% of our total traffic.

On #IX there are rumours about Junos version 10.3R2.11 being core dumped and rebooted, which makes sense.

Currently traffic is restored.

Tim

Kelly Kane

3:55 p.m.

On Mon, Nov 7, 2011 at 07:06, Tim Vollebregt <tim@interworx.nl> wrote:

...

On #IX there are rumours about Junos version 10.3R2.11 being core dumped and rebooted, which makes sense.

Perhaps related to Juniper PSN-2011-08-327? Did the whole router reboot, or just the service module?

We saw one TATA session, and one Abovenet session flap.

Kelly

Dan

4:08 p.m.

We got a panic message about the PFE that core'd, and it looks like it restarted our FPCs.

JUNOS 10.2R2.11

-Dan

Pierre-Yves Maunier

4:43 p.m.

2011/11/7 Kelly Kane <kelly@hawknetworks.com>

...

On Mon, Nov 7, 2011 at 07:06, Tim Vollebregt <tim@interworx.nl> wrote:

...
On #IX there are rumours about Junos version 10.3R2.11 being core dumped and rebooted, which makes sense.

Perhaps related to Juniper PSN-2011-08-327? Did the whole router reboot, or just the service module?

We saw one TATA session, and one Abovenet session flap.

Kelly

On our side we did not have any reboot, just a core dump generated and all interfaces flapped.

--
Pierre-Yves Maunier

Tom Hill

3:08 p.m.

On Mon, 2011-11-07 at 10:00 -0500, Todd Snyder wrote:

...

We seem to be having some problems with our tata links - first seen in EU about 45 minutes ago, now we're seeing problems in NA. I'm focused on DNS, so I'm seeing a lot of timeouts/servfails, but our networking folks are talking about links dropping.

Anyone else seeing oddness on the NA Internet right now?

http://downrightnow.com/ confirms - something is up.

There are widespread issues across the Internet; certain versions of Juniper firmware have core dumped after seeing a particular BGP 'UPDATE' message.

(That's the running theory at least).

It's affected multiple service providers, globally, not just those connected to TATA.

Tom

Jared Mauch

3:31 p.m.

New subject: General Internet Instability

On Nov 7, 2011, at 10:08 AM, Tom Hill wrote:

...

On Mon, 2011-11-07 at 10:00 -0500, Todd Snyder wrote:

...
We seem to be having some problems with our tata links - first seen in EU about 45 minutes ago, now we're seeing problems in NA. I'm focused on DNS, so I'm seeing a lot of timeouts/servfails, but our networking folks are talking about links dropping.

Anyone else seeing oddness on the NA Internet right now?

http://downrightnow.com/ confirms - something is up.

There are widespread issues across the Internet; certain versions of Juniper firmware have core dumped after seeing a particular BGP 'UPDATE' message.

(That's the running theory at least).

It's affected multiple service providers, globally, not just those connected to TATA.

Pretty much any major BGP event will impact multiple providers.

A threshold you should use to view the general instability (which I find valuable, you may as well) is route views data.

If you look at the BGP UPDATES archive sizes, you can see when something happens, e.g.:

http://archive.routeviews.org/bgpdata/2011.11/UPDATES/

Take a look at the size of the updates.20111107.1400.bz2 file and the 1415 file. They are abnormally large compared to a normal period of time. This shows there were a lot of updates out there being processed, and it gives a reference for levels of instability.

If you are not feeding route views or similar community projects, please consider doing so. It helps paint the view for those doing analysis.

- Jared
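A minimal sketch of the file-size comparison Jared describes: take the sizes of the 15-minute UPDATES archives and flag the ones well above the typical size. The byte sizes below are made-up placeholders, not the real archive sizes, and the function name and the 1.25x factor are assumptions chosen only for illustration.

```python
from statistics import median


def flag_large_archives(sizes, factor=1.25):
    """Return the names of archives whose size exceeds factor * median size."""
    baseline = median(sizes.values())
    return sorted(name for name, size in sizes.items() if size > factor * baseline)


# Placeholder sizes in bytes (hypothetical, for illustration only).
sizes = {
    "updates.20111107.1330.bz2": 610_000,
    "updates.20111107.1345.bz2": 640_000,
    "updates.20111107.1400.bz2": 1_050_000,
    "updates.20111107.1415.bz2": 1_120_000,
    "updates.20111107.1430.bz2": 630_000,
}

for name in flag_large_archives(sizes):
    print(name)
```

As noted later in the thread, a large file only says that something was unstable; the archive still has to be decompressed and read to see why.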

Todd Snyder

4:09 p.m.

New subject: General Internet Instability

Can anyone point to any authoritative updates about this?

-Hammer-

4:14 p.m.

New subject: General Internet Instability

I'm struggling to do the same. All the various "Internet Health" sites show(ed) some upticks in negative performance but I don't have any specifics. We are a Gomez customer and Gomez is showing issues in St. Louis (SAVVIS) and Philly (L3) that specifically impact the availability of our applications, but it's not clear on the underlying reason.

I'm giving cautious updates to management because even though it's obvious something is going on I don't have anything official except random email threads. Looking for more insight before misinforming management.

-Hammer-

"I was a normal American nerd"
-Jack Herer

Richard Golodner

4:27 p.m.

New subject: General Internet Instability

On Mon, 2011-11-07 at 11:09 -0500, Todd Snyder wrote:

...

Can anyone point to any authoritative updates about this?

I think Jared's suggestion was about as close as you're going to get for right now. Look at the size of the files he mentioned as compared to the average size of the others. Hopefully someone will come forth with an authoritative answer later today.

Richard Golodner

Jared Mauch

4:37 p.m.

New subject: General Internet Instability

On Nov 7, 2011, at 11:27 AM, Richard Golodner wrote:

...

On Mon, 2011-11-07 at 11:09 -0500, Todd Snyder wrote:

...
Can anyone point to any authoritative updates about this?

I think Jared's suggestion was about as close as you're going to get for right now. Look at the size of the files he mentioned as compared to the average size of the others. Hopefully someone will come forth with an authoritative answer later today. Richard Golodner

One can do some analysis of the files to determine what prefixes and autonomous system neighbors were impacted.

I can do some of this as I have some other tools that quickly process this data if people are interested. Please send those replies/votes off list to me directly.

- Jared
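One way to approach the "which neighbors were impacted" question is to tally adjacent AS pairs across AS paths, weighted by how many updates carried each path. The sketch below assumes the updates have already been reduced to (count, AS_PATH) rows; the four sample rows are copied from the 14:00 figures posted later in this thread, and the function name is hypothetical. This is not Jared's tooling, only an illustration of the idea.

```python
from collections import Counter


def tally_as_pairs(rows):
    """Sum update counts per unordered pair of adjacent ASNs in each path."""
    pairs = Counter()
    for count, as_path in rows:
        asns = [int(token) for token in as_path.split()]
        for left, right in zip(asns, asns[1:]):
            if left == right:
                continue  # AS-path prepending, not a real adjacency
            pairs[tuple(sorted((left, right)))] += count
    return pairs


# Sample rows in "Count AS_PATH" form, taken from the thread's 14:00 data.
rows = [
    (3894, "3356 6453 4755"),
    (3504, "3549 3356"),
    (2930, "812 6453 4755"),
    (2793, "3549 6453 4755"),
]

for pair, updates in tally_as_pairs(rows).most_common():
    print(pair, updates)
```

Pairs that accumulate a disproportionate share of updates are candidates for the adjacencies that were flapping.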

Joel jaeggli

4:52 p.m.

New subject: General Internet Instability

On 11/7/11 08:37 , Jared Mauch wrote:

...

On Nov 7, 2011, at 11:27 AM, Richard Golodner wrote:

...
On Mon, 2011-11-07 at 11:09 -0500, Todd Snyder wrote:

...
Can anyone point to any authoritative updates about this?

I think Jared's suggestion was about as close as you're going to get for right now. Look at the size of the files he mentioned as compared to the average size of the others. Hopefully someone will come forth with an authoritative answer later today. Richard Golodner

One can do some analysis of the files to determine what prefixes and autonomous system neighbors were impacted.

I can do some of this as I have some other tools that quickly process this data if people are interested. Please send those replies/votes off list to me directly.

according to my peakflow the level-3 update spike was from ~1408 utc to ~1424 utc.

...

- Jared

Todd Snyder

4:40 p.m.

New subject: General Internet Instability

On Mon, Nov 7, 2011 at 11:27 AM, Richard Golodner < rgolodner@infratection.com> wrote:

...

On Mon, 2011-11-07 at 11:09 -0500, Todd Snyder wrote:

...
Can anyone point to any authoritative updates about this?

I think Jared's suggestion was about as close as you're going to get for right now. Look at the size of the files he mentioned as compared to the average size of the others. Hopefully someone will come forth with an authoritative answer later today. Richard Golodner

Management don't understand or care about BGP updates, they just want to know if the problem is ours, and if it's not, who to blame :)

thank goodness for NANOG - updates here have been helpful explaining things to management.

t.

Leigh Porter

4:43 p.m.

New subject: General Internet Instability

On 7 Nov 2011, at 16:41, "Todd Snyder" <todd@hatescomputers.org> wrote:

...

On Mon, Nov 7, 2011 at 11:27 AM, Richard Golodner < rgolodner@infratection.com> wrote:

...
On Mon, 2011-11-07 at 11:09 -0500, Todd Snyder wrote:

...
Can anyone point to any authoritative updates about this?

I think Jared's suggestion was about as close as you're going to get for right now. Look at the size of the files he mentioned as compared to the average size of the others. Hopefully someone will come forth with an authoritative answer later today. Richard Golodner

Management don't understand or care about BGP updates, they just want to know if the problem is ours, and if it's not, who to blame :)

thank goodness for NANOG - updates here have been helpful explaining things to management.

t.

Just blame Shub Internet..

Oh no, I've said it now!

--
Leigh

______________________________________________________________________
This email has been scanned by the MessageLabs Email Security System.
For more information please visit http://www.messagelabs.com/email
______________________________________________________________________

Jay Ashworth

6:14 p.m.

New subject: General Internet Instability

----- Original Message -----

...

From: "Leigh Porter" <leigh.porter@ukbroadband.com>

...

Just blame Shub Internet..

Oh no, I've said it now!

Nah; Brad took down everything but the webserver years ago. :-)

Cheers,
-- jra
--
Jay R. Ashworth          Baylink                     jra@baylink.com
Designer                 The Things I Think          RFC 2100
Ashworth & Associates    http://baylink.pitas.com    2000 Land Rover DII
St Petersburg FL USA     http://photo.imageinc.us    +1 727 647 1274

-Hammer-

4:41 p.m.

New subject: General Internet Instability

So the file size being 30% higher implies that the number of updates is larger and therefore there is instability? I see the logic, but if you scroll thru that page (the whole month of November) there are tons of >1M files. Trying to see what is different about today....

-Hammer-

"I was a normal American nerd"
-Jack Herer

Jared Mauch

4:45 p.m.

New subject: General Internet Instability

On Nov 7, 2011, at 11:41 AM, -Hammer- wrote:

...

So the file size was 30% higher implies that the number of updates is larger and therefore there is instability? I see the logic but if you scroll thru that page (the whole month of November) there are tons of >1M files. Trying to see what is different about today....

This is an easy benchmark to gauge overall stability. Large files mean something was unstable. Then you need to actually look at them to see *why*. Also, since the files are compressed, you lose some visibility into what is really in them.

- Jared

-Hammer-

4:50 p.m.

New subject: General Internet Instability

Thank you. This is somewhat of a learning opportunity for me. I hit all the generic Internet health sites and I understand that there IS an issue. Now I'm getting to learn how you guys attempt to understand WHY we had an issue.

But my point is the same. If this is the case, then the entire month of November reflects "instability" where I see transitions from 600k to 1M between updates. Yet we didn't experience the same negative customer experience for those. So how do you see the difference with today's events? Digging into files now.

-Hammer-

"I was a normal American nerd"
-Jack Herer

Erik Bais

8:40 p.m.

New subject: General Internet Instability

I just got pointed towards the following:

https://twitter.com/#!/JuniperNetworks/status/133637820081389568

And a (re)post on Pastebin:

http://pastebin.com/HBWiH92j

Juniper Networks replied to my post on Twitter:

https://twitter.com/#!/erikbais/status/133641575585677312

saying that they are working on an official public URL for all to read.

Regards,
Erik Bais
A2B Internet

Jared Mauch

5:01 p.m.

New subject: real data [Re: General Internet Instability]

Here's some real data for those interested. A quick view shows many TATA <-> Level3 and TATA <-> GBLX sets of instability.

Combined with the overall update levels seen over that 30 minutes, we saw ~1.566M updates at route views.

Compared with the 24h prior (2011.11.06 14:15 as reference) we see ~17-20x the updates in the same time period. Now take your control plane and make it work 17-20x as hard, and you can see why we had some general instability.

I don't have better tools for doing a more detailed analysis at this time, but this should get you started in talking to others about what happened. (There were no stand-out prefixes for this time-period, they all had about the same number of updates per prefix).

#Updates|Filename
--------+-----------------
   42041 20111106.1415.out
  727711 20111107.1400.out
  838894 20111107.1415.out

2011.11.07 14:15 datafile - Most updates/as-path (complete as seen @ rviews)

Count AS_PATH
-----+----------------------------
 4563 3549 3356
 3172 3549 3356 11492
 2912 8492 3356 6453 4755
 2729 812 6453 4755
 2695 3549 6453 14080 10620
 2504 3549 3356 11492 11492 11492
 2421 3549 3356 7029
 2362 3356 209 721 27064
 2244 3356 209 20115
 2130 3356 209 22561
 2044 7018 6453 14080 10620
 2000 3356 209
 1934 3549 3356 2907
 1909 3356 15412 18101
 1821 3549 6453 4755
 1807 3356 6453 4755
 1660 8492 3356 174
 1634 3549 3356 18566
 1619 3549 3356 4837 17623
 1617 7018 6453 4755
 1581 2497 6453 7545 7545 7545
 1496 8492 3356 209 721 27064
 1478 2497 6453 4755
 1437 8492 8928 3356 11492 11492 11492
 1427 3356 3549 3216 8402
 1395 3356 701 19262
 1362 8492 3356 209 20115
 1357 3549 7843 7843 7843 10796
 1336 3549 3356 6830 6830 6830
 1259 2497 3356
 1238 3356 9498
 1229 812 6453 14080 10620
 1229 3356 6453 14080 10620
 1221 3356 9121 42926
 1220 3549 3356 29314
 1203 3549 3356 680 680 680 680 680
 1194 2497 3356 5650 7011
 1165 2497 3356 9498
 1151 8001 4436 7843 10796
 1150 3549 3356 15412 18101 18101 18101 18101 17803
 1106 8492 3356 6762 7303
 1105 3549 3356 55410 45528
 1098 3549 3356 15412 18101 17803
 1073 3549 7843 7843 7843
 1072 3549 3356 7713 17974
 1069 3549 3356 15290
 1041 3356 3549
 1032 3549 3356 11992
 1030 3356 4323
 1021 3549 6453 7843 11351
 1020 8492 8928 3356 7843 11351
 1016 2497 3356 11492
 1014 3356 2907
 1011 7018 3356 2907

2011.11.07 14:00 datafile - Most updates/as-path (complete as seen @ rviews)

Count AS_PATH
-----+----------------------------
 3894 3356 6453 4755
 3504 3549 3356
 2930 812 6453 4755
 2793 3549 6453 4755
 2254 3549 3356 11492 11492 11492
 2235 812 6453 14080 10620
 1826 2497 6453 4755
 1795 3549 3356 7029
 1753 3356 3549
 1689 7018 6453 14080 10620
 1653 812 6453 4755 17488
 1649 3549 3356 18566
 1613 2497 3356
 1568 3549 3356 6830 6830 6830
 1496 7018 6453 4755
 1419 3549 6453 14080 10620
 1394 3356 701 19262
 1389 3356 9121 42926
 1382 3549 3356 21565
 1293 3549 6453 4755 17488
 1252 3549 3356 13407
 1188 2497 3356 5650 7011
 1156 7018 6453 4755 17488
 1154 3356 6453 14080 10620
 1127 3356 701 3216 8402
 1106 8492 9002 3549 6762 7303
 1104 3549 3356 4837 17623
 1049 39756 3356 7018
 1048 39756 3257 7018
 1035 2497 6453 7713 17974
 1008 3356 3320
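The "~1.566M" and "~17-20x" figures follow directly from the three #Updates counts above. A short check of that arithmetic, using only the numbers posted in this message (the variable names are arbitrary):

```python
# Update counts per 15-minute file, as posted in this message.
baseline = 42041  # 20111106.1415.out, the reference from 24h earlier
incident = {
    "20111107.1400.out": 727711,
    "20111107.1415.out": 838894,
}

# Total updates seen over the two incident files (the 30-minute window).
total = sum(incident.values())

# How many times busier each incident file was than the reference file.
ratios = {name: count / baseline for name, count in incident.items()}

print(f"total updates: {total}")
for name, ratio in ratios.items():
    print(f"{name}: {ratio:.1f}x the reference")
```

The two files sum to 1,566,605 updates, and each is roughly 17.3 and 20.0 times the reference file, which matches the range quoted in the message.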

-Hammer-

5:13 p.m.

New subject: real data [Re: General Internet Instability]

Jared,

This is good stuff and I'm understanding how you interpret the data. So this confirms what we are seeing. How do we take this towards a root cause? Mash it with the Juniper threads and see where it goes?

-Hammer-

"I was a normal American nerd"
-Jack Herer

Robert Mathews (OSIA)

9:29 p.m.

New subject: real data

On 11/7/2011 12:01 PM, Jared Mauch wrote:

...

..... It seems a quick view seems many TATA <-> Level3 and TATA <-> GBLX sets of instability.

Combined with the overall update levels seen over that 30 minutes, we saw ~1.566M updates at route views. Compared with the 24h prior (2011.11.06 14:15 as reference) we see ~17-20x the updates in the same time period. Now take your control plane and make it work 17-20x as hard, and you can see why we had some even general instability.

Jared:

Thanks for posting this, and for sharing data. Too bad it is not Huawei that is involved. It has precluded someone, somewhere, from framing this SNAFU as a coordinated global Cyber Event, instigated by the Chinese PLA. 8-;) Then, there is still time for that. [again, tongue firmly planted into cheek]

--
Dr. Robert Mathews, D.Phil.
Distinguished Senior Research Scholar - National Security Affairs & US Industrial Preparedness
University of Hawai'i (OSIA)
|-- Sent from "the mother-ship," high above the "blue planet."

Pierre-Yves Maunier

3:33 p.m.

2011/11/7 Tom Hill <tom@ninjabadger.net>

...

On Mon, 2011-11-07 at 10:00 -0500, Todd Snyder wrote:

...
We seem to be having some problems with our tata links - first seen in EU about 45 minutes ago, now we're seeing problems in NA. I'm focused on DNS, so I'm seeing a lot of timeouts/servfails, but our networking folks are talking about links dropping.

Anyone else seeing oddness on the NA Internet right now?

http://downrightnow.com/ confirms - something is up.

There are widespread issues across the Internet; certain versions of Juniper firmware have core dumped after seeing a particular BGP 'UPDATE' message.

(That's the running theory at least).

It's affected multiple service providers, globally, not just those connected to TATA.

Tom

On our side, all our 10.3R2.11 routers core dumped, which made all our interfaces flap. I've been told 10.4R1.9 is affected too.

--
Pierre-Yves Maunier

Leigh Porter

3:45 p.m.

My 10.4r1.9 boxes died also but I saw interfaces go down whilst bgpd seemed stable.

--
Leigh

John van Oppen

7:52 p.m.

We saw several customers go away this morning as well. Our network itself is Cisco so we did not see anything directly.

John van Oppen @ AS11404.

Leigh Porter

10:09 p.m.

Any thoughts on just how widespread this was? Did every Juniper that receives Internet BGP updates with the affected software break? Or did it die out quite quickly?

--
Leigh

-Hammer-

10:07 p.m.

This was posted on pastebin earlier today in case it helps.

Bulletin: PSN-2011-08-327
Title: MX Series MPC crash in Ktree::createFourWayNode after BGP UPDATE
Products Affected: This issue can affect any MX Series router with port concentrators based on the Trio chipset -- such as the MPC or embedded into the MX80 -- with active protocol-based route prefix additions/deletions occurring.
Platforms Affected: Security; JUNOS 11.x; MX-series; JUNOS 10.x; SIRT Security Advisory; SIRT Security Notice
Revision Number: 1
Issue Date: 2011-08-08

PSN Issue:

MPCs (Modular Port Concentrators) installed in an MX Series router may crash upon receipt of very specific and unlikely route prefix install/delete actions, such as a BGP routing update. The set of route prefix updates is non-deterministic and exceedingly unlikely to occur. Junos versions affected include 10.0, 10.1, 10.2, 10.3, 10.4 prior to 10.4R6, and 11.1 prior to 11.1R4. The trigger for the MPC crash was determined to be a valid BGP UPDATE received from a registered network service provider, although this one UPDATE was determined to not be solely responsible for the crashes. A complex sequence of preconditions is required to trigger this crash. Both IPv4 and IPv6 routing prefix updates can trigger this MPC crash.

There is no indication that this issue was triggered maliciously. Given the complexity of conditions required to trigger this issue, the probability of exploiting this defect is extremely low.

The assertions (crash) all occurred in the code used to store routing information, called Ktree, on the MPC. Due to the order and mix of adds and deletes to the tree, certain combinations of address adds and deletes can corrupt the data structures within the MPC, which in turn can cause this line card crash. The MPC recovers and returns to service quickly, and without operator intervention.

This issue only affects MX Series routers with port concentrators based on the Trio chipset, such as the MPC or embedded into the MX80. No other product or platform is vulnerable to this issue.

Solution:

The Ktree code has been updated and enhanced to ensure that combinations and permutations of routing updates will not corrupt the state of the line card. Extensive testing has been performed to validate an exceedingly large combination and permutation of route prefix additions and deletions.

All Junos OS software releases built on or after 2011-08-03 have fixed this specific issue. Releases containing the fix specifically include: 10.0S18, 10.4R6, 11.1R4, 11.2R1, and all subsequent releases (i.e. all releases built after 11.2R1).

This issue is being tracked as PR 610864. While this PR may not be viewable by customers, it can be used as a reference when discussing the issue with JTAC.

KB16765 - "In which releases are vulnerabilities fixed?" describes which release vulnerabilities are fixed as per our End of Engineering and End of Life support policies.

Workarounds:

No known workaround exists for this issue.

-Hammer-

"I was a normal American nerd"
-Jack Herer
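For anyone triaging a fleet against the version ranges quoted in the bulletin, a rough sketch of a string check follows. It is an assumption-laden illustration, not a Juniper tool: the function name is hypothetical, it only understands R-release strings such as 10.3R2.11 (service releases like 10.0S18 are rejected rather than guessed at), it treats release trains not named in the bulletin as unaffected, and it says nothing about whether the box is actually an MX with Trio-based line cards.

```python
import re

# Trains the bulletin lists as affected in every R release.
ALWAYS_AFFECTED = {(10, 0), (10, 1), (10, 2), (10, 3)}
# Trains affected only before the given R release (10.4R6, 11.1R4).
FIRST_FIXED_R = {(10, 4): 6, (11, 1): 4}

VERSION_RE = re.compile(r"(\d+)\.(\d+)R(\d+)(?:\.\d+)?", re.IGNORECASE)


def is_listed_as_affected(version):
    """Check a Junos R-release string against the ranges in PSN-2011-08-327."""
    match = VERSION_RE.fullmatch(version.strip())
    if match is None:
        raise ValueError(f"unsupported version string: {version!r}")
    major, minor, release = (int(part) for part in match.groups())
    train = (major, minor)
    if train in ALWAYS_AFFECTED:
        return True
    if train in FIRST_FIXED_R:
        return release < FIRST_FIXED_R[train]
    return False  # train not named in the bulletin


# Versions reported in this thread.
for version in ("10.3R2.11", "10.4r1.9", "10.2R2.11"):
    print(version, is_listed_as_affected(version))
```

All three versions mentioned by posters in this thread fall inside the listed ranges.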
