Hello, are there any issues with CloudFlare services now?

Dmitry Sherman
dmitry@interhost.net
Interhost Networks Ltd
Web: http://www.interhost.co.il
fb: https://www.facebook.com/InterhostIL
Office: (+972)-(0)74-7029881
Fax: (+972)-(0)53-7976157
Yes, traffic from Greek networks is routed through NYC (alter.net), and previously it had 60% packet loss. Now it’s still via NYC, but no packet loss. This happens in GR-IX Athens, not GR-IX Thessaloniki, but the problem definitely exists.

Antonis
From https://www.cloudflarestatus.com/: "Identified - We have identified a possible route leak impacting some Cloudflare IP ranges and are working with the network involved to resolve this." Jun 24, 11:36 UTC

Seeing issues in Australia too for some sites that are routing through Cloudflare.

Jaden Roberts
Senior Network Engineer
4 Amy Close, Wyong, NSW 2259
Need assistance? We are here 24/7: +61 2 8115 8888
On Mon, Jun 24, 2019 at 02:03:47PM +0300, Antonios Chariton wrote:
Yes, traffic from Greek networks is routed through NYC (alter.net), and previously it had 60% packet loss. Now it’s still via NYC, but no packet loss. This happens in GR-IX Athens, not GR-IX Thessaloniki, but the problem definitely exists.
It seems Verizon has stopped filtering a downstream customer, or the filtering broke. Time to implement peer-locking path filters for those using VZ as a paid peer.

   Network Next Hop Metric LocPrf Weight Path
* 2.18.64.0/24 137.39.3.55 0 701 396531 33154 174 6057 i
* 2.19.251.0/24 137.39.3.55 0 701 396531 33154 174 6057 i
* 2.22.24.0/23 137.39.3.55 0 701 396531 33154 174 6057 i
* 2.22.26.0/23 137.39.3.55 0 701 396531 33154 174 6057 i
* 2.22.28.0/24 137.39.3.55 0 701 396531 33154 174 6057 i
* 2.24.0.0/16 137.39.3.55 0 701 396531 33154 3356 12576 i
*              202.232.0.2 0 2497 701 396531 33154 3356 12576 i
* 2.24.0.0/13 202.232.0.2 0 2497 701 396531 33154 3356 12576 i
* 2.25.0.0/16 137.39.3.55 0 701 396531 33154 3356 12576 i
*              202.232.0.2 0 2497 701 396531 33154 3356 12576 i
* 2.26.0.0/16 137.39.3.55 0 701 396531 33154 3356 12576 i
*              202.232.0.2 0 2497 701 396531 33154 3356 12576 i
* 2.27.0.0/16 137.39.3.55 0 701 396531 33154 3356 12576 i
*              202.232.0.2 0 2497 701 396531 33154 3356 12576 i
* 2.28.0.0/16 137.39.3.55 0 701 396531 33154 3356 12576 i
*              202.232.0.2 0 2497 701 396531 33154 3356 12576 i
* 2.29.0.0/16 137.39.3.55 0 701 396531 33154 3356 12576 i
*              202.232.0.2 0 2497 701 396531 33154 3356 12576 i
* 2.30.0.0/16 137.39.3.55 0 701 396531 33154 3356 12576 i
*              202.232.0.2 0 2497 701 396531 33154 3356 12576 i
* 2.31.0.0/16 137.39.3.55 0 701 396531 33154 3356 12576 i
*              202.232.0.2 0 2497 701 396531 33154 3356 12576 i
* 2.56.16.0/22 137.39.3.55 0 701 396531 33154 1239 9009 i
* 2.56.150.0/24 137.39.3.55 0 701 396531 33154 1239 9009 i
* 2.57.48.0/22 137.39.3.55 0 701 396531 33154 174 50782 i
* 2.58.47.0/24 137.39.3.55 0 701 396531 33154 1239 9009 i
* 2.59.0.0/23 137.39.3.55 0 701 396531 33154 1239 9009 i
* 2.59.244.0/22 137.39.3.55 0 701 396531 33154 3356 29119 i
* 2.148.0.0/14 137.39.3.55 0 701 396531 33154 3356 2119 i
* 3.5.128.0/24 137.39.3.55 0 701 396531 33154 3356 16509 i
* 3.5.128.0/22 137.39.3.55 0 701 396531 33154 3356 16509 i
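For anyone wanting to act on the peer-lock suggestion, here is a minimal sketch of generating the filter lines. The ASN list, access-list number, and IOS-style syntax are illustrative only; a real peer-lock policy would also permit those ASNs on the sessions where they are legitimately expected, so adapt this to your own vendor and policy.

#!/usr/bin/env python3
"""Sketch: generate simple peer-lock style as-path filters.

Idea: on a BGP session with a network that should never be transiting the
large transit providers, deny any path containing those ASNs. The ASN list
and the IOS-style output are illustrative; adapt to your own vendor/policy.
"""

PROTECTED_ASNS = [701, 1239, 3356, 2914, 3257, 6453]  # example "never via this peer" networks
ACL_NUMBER = 100  # illustrative as-path access-list number

def generate_filter() -> str:
    lines = [f"ip as-path access-list {ACL_NUMBER} deny _{asn}_" for asn in PROTECTED_ASNS]
    lines.append(f"ip as-path access-list {ACL_NUMBER} permit .*")
    return "\n".join(lines)

if __name__ == "__main__":
    print(generate_filter())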
From John Graham-Cumming, CTO of Cloudflare, on Hacker News right now:
This appears to be a routing problem with Level3. All our systems are running normally but traffic isn't getting to us for a portion of our domains.

1128 UTC update: Looks like we're dealing with a route leak and we're talking directly with the leaker and Level3 at the moment.

1131 UTC update: Just to be clear this isn't affecting all our traffic or all our domains or all countries. A portion of traffic isn't hitting Cloudflare. Looks to be about an aggregate 10% drop in traffic to us.

1134 UTC update: We are now certain we are dealing with a route leak.
--
Robbie Trencheny (@robbie) 925-884-3728 robbie.io
We are seeing issues as well getting to HE. The traffic is going via Alter.
*1147 UTC update* Staring at internal graphs, it looks like global traffic is now at 97% of expected, so the impact is lessening.
A Verizon downstream BGP customer is leaking the full table, and some more specifics from us and many other providers.
*1204 UTC update* This leak is more widespread than just Cloudflare.

*1208 UTC update* Amazon Web Services is now reporting an external networking problem.
This is my final update; I’m going back to bed. Wake me up when the internet is working again. https://news.ycombinator.com/item?id=20262316

——

1230 UTC update: We are working with networks around the world and are observing network routes for Google and AWS being leaked as well.
On Mon, Jun 24, 2019 at 08:18:27AM -0400, Tom Paseka via NANOG wrote:
a Verizon downstream BGP customer is leaking the full table, and some more specifics from us and many other providers.
It appears that one of the implicated ASNs, AS 33154 "DQE Communications LLC", is listed as a customer on Noction's website: https://www.noction.com/clients/dqe

I suspect AS 33154's customer AS 396531 turned up a new circuit with Verizon, but didn't have routing policies to prevent sending routes from 33154 to 701 and vice versa, or their router didn't have support for RFC 8212.

I'd like to point everyone to an op-ed I wrote on the topic of "BGP optimizers": https://seclists.org/nanog/2017/Aug/318

So in summary, I believe the following happened:

- 33154 generated fake more-specifics, which are not visible in the DFZ
- 33154 announced those fake more-specifics to at least one customer (396531)
- this customer (396531) propagated them to another upstream provider (701)
- it appears that 701 did not apply sufficient prefix filtering or a maximum-prefix limit

While it is easy to point at the alleged BGP optimizer as the root cause, I do think we have now observed a cascading catastrophic failure, both in process and in technologies. Here are some recommendations that all of us can apply, and that may have helped dampen the negative effects:

- deploy RPKI-based BGP Origin Validation (with invalid == reject)
- apply maximum-prefix limits on all EBGP sessions
- ask your router vendor to comply with RFC 8212 ('default deny')
- turn off your 'BGP optimizers'

I suspect we, collectively, suffered significant financial damage in this incident.

Kind regards,

Job
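As a rough illustration of the origin-validation recommendation, a sketch along these lines can flag (origin, prefix) pairs that validate as RPKI-invalid. It assumes the RIPEstat "rpki-validation" data call and its response fields; in production you would query your own validator instead, and the example pairs below are purely illustrative.

#!/usr/bin/env python3
"""Rough sketch: check (origin ASN, prefix) pairs against published ROAs.

Assumes the RIPEstat "rpki-validation" data call; field names may differ,
so treat this as illustrative rather than a drop-in tool.
"""
import requests  # third-party; pip install requests

RIPESTAT = "https://stat.ripe.net/data/rpki-validation/data.json"

def rpki_status(origin_asn: int, prefix: str) -> str:
    """Return the validation status ('valid', 'invalid', 'unknown') for a pair."""
    resp = requests.get(
        RIPESTAT,
        params={"resource": f"AS{origin_asn}", "prefix": prefix},
        timeout=10,
    )
    resp.raise_for_status()
    return resp.json()["data"]["status"]  # assumed field name; check the docs

if __name__ == "__main__":
    # Illustrative pairs only: reject anything that comes back 'invalid'.
    for asn, pfx in [(13335, "104.16.0.0/12"), (396531, "104.16.80.0/21")]:
        print(asn, pfx, rpki_status(asn, pfx))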
I'd like to point everyone to an op-ed I wrote on the topic of "BGP optimizers": https://seclists.org/nanog/2017/Aug/318
Hehe, I haven't seen this text before. Can't agree more.

Get your tie back on Job, nobody listened again.

More seriously, I see no difference between prefix hijacking and the so-called bgp optimisation based on completely fake announcements on behalf of other people.

If ever your upstream or any other party who your company pays money to does this dirty thing, now is just the right moment to go explain to them that you consider this dangerous for your business and are looking for better partners among those who know how to run the internet without breaking it.
On 24/Jun/19 18:09, Pavel Lunin wrote:
Hehe, I haven't seen this text before. Can't agree more.
Get your tie back on Job, nobody listened again.
More seriously, I see no difference between prefix hijacking and the so-called bgp optimisation based on completely fake announcements on behalf of other people.
If ever your upstream or any other party who your company pays money to does this dirty thing, now is just the right moment to go explain to them that you consider this dangerous for your business and are looking for better partners among those who know how to run the internet without breaking it.
We struggled with a number of networks using these over eBGP sessions they had with networks that shared their routing data with BGPmon. It sent off all sorts of alarms, and troubleshooting it was hard when a network thinks you are de-aggregating massively, and yet you know you aren't. Each case took nearly 3 weeks to figure out. BGP optimizers are the bane of my existence. Mark.
FYI for the group -- we just published this: https://blog.cloudflare.com/how-verizon-and-a-bgp-optimizer-knocked-large-pa...

_________________
*Justin Paine*
Director of Trust & Safety
PGP: BBAA 6BCE 3305 7FD6 6452 7115 57B6 0114 DE0B 314D
101 Townsend St., San Francisco, CA 94107
Disclaimer: I am a Verizon employee via the Yahoo acquisition. I do not work on 701. My comments are my own opinions only.

Respectfully, I believe Cloudflare’s public comments today have been a real disservice. This blog post, and your CEO on Twitter today, took every opportunity to say “DAMN THOSE MORONS AT 701!”. They’re not.

You are 100% right that 701 should have had some sort of protection mechanism in place to prevent this. But do we know they didn’t? Do we know it was there and just setup wrong? Did another change at another time break what was there? I used 701 many jobs ago and they absolutely had filtering in place; it saved my bacon when I screwed up once and started readvertising a full table from a 2nd provider. They smacked my session down and I got a nice call about it.

You guys have repeatedly accused them of being dumb without even speaking to anyone yet, from the sounds of it. Shouldn’t we be working on facts?

Should they have been easier to reach once an issue was detected? Probably. They’re certainly not the first vendor to have a slow response time though. Seems like when an APAC carrier takes 18 hours to get back to us, we write it off as the cost of doing business.

It also would have been nice, in my opinion, to take a harder stance on the BGP optimizer that generated the bogus routes, and the steel company that failed BGP 101 and just gladly reannounced one upstream to another. 701 is culpable for their mistakes, but there doesn’t seem to be much appetite to shame the other contributors.

You’re right to use this as a lever to push for proper filtering, RPKI, best practices. I’m 100% behind that. We can all be a hell of a lot better at what we do. This stuff happens more than it should, but less than it could.

But this industry is one big ass glass house. What’s that thing about stones again?
On Mon, Jun 24, 2019 at 08:03:26PM -0400, Tom Beecher wrote:
You are 100% right that 701 should have had some sort of protection mechanism in place to prevent this. But do we know they didn’t? Do we know it was there and just setup wrong? Did another change at another time break what was there? I used 701 many jobs ago and they absolutely had filtering in place; it saved my bacon when I screwed up once and started readvertising a full table from a 2nd provider. They smacked my session down and I got a nice call about it.
In my past (and current) dealings with AS701, I do agree that they have generally been good about filtering customer sessions and running a tight ship. But, manual config changes being what they are, I suppose an honest mistake or oversight occurred at 701 today that made them contribute significantly to today's outage.
It also would have been nice, in my opinion, to take a harder stance on the BGP optimizer that generated the bogus routes, and the steel company that failed BGP 101 and just gladly reannounced one upstream to another. 701 is culpable for their mistakes, but there doesn’t seem to be much appetite to shame the other contributors.
I think the biggest question to be asked here -- why the hell is a BGP optimizer (Noction in this case) injecting fake more-specifics to steer traffic? And why did a regional provider providing IP transit (DQE) use such a dangerous accident-waiting-to-happen tool in their network, especially when they have other ASNs taking transit feeds from them, with all these fake man-in-the-middle routes being injected?

I get that BGP optimizers can have some use cases, but IMO, in most situations (especially if you are a network provider selling transit and taking peering yourself) a well-crafted routing policy and interconnection strategy eliminates the need for implementing flawed route-selection optimizers in your network. The notion of a BGP optimizer generating fake more-specifics is absurd, and it is definitely not a tool that is designed to "fail -> safe". Instead of failing safe, it has failed epically and catastrophically today.

I remember a long time ago, when Internap used to sell their FCP product, Internap SEs were advising customers to make appropriate adjustments to local-preference to prefer the FCP-generated routes to ensure optimal selection. That is a much more sane design choice than injecting man-in-the-middle attacks and relying on customers to prevent a disaster.

Any time I have a sit-down with any engineer who "outsources" responsibility for maintaining the robustness principle onto their customers, it makes me want to puke.

James
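To make the danger of injected more-specifics concrete: anything with a longer mask than the legitimate aggregate wins on longest-prefix match, regardless of who originates it. Below is a small sketch that flags such announcements; the aggregate list and sample announcements are illustrative, and in practice the inputs would come from IRR/ROA data and a live BGP feed.

#!/usr/bin/env python3
"""Sketch: flag announcements that are more-specifics of known aggregates."""
import ipaddress

# Aggregates legitimately originated by the networks you are watching (illustrative).
AGGREGATES = [ipaddress.ip_network(p) for p in ("104.16.0.0/12", "2.24.0.0/13")]

def is_injected_more_specific(prefix: str) -> bool:
    """True if prefix sits strictly inside one of the aggregates (longer mask)."""
    net = ipaddress.ip_network(prefix)
    return any(net != agg and net.subnet_of(agg) for agg in AGGREGATES)

if __name__ == "__main__":
    # Sample announcements: the /21 and /16 would beat the covering aggregates.
    for announced in ("104.16.80.0/21", "2.24.0.0/16", "8.8.8.0/24"):
        if is_injected_more_specific(announced):
            print(f"ALERT: {announced} is a more-specific of a monitored aggregate")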
Maybe I'm in the minority here, but I have higher standards for a T1 than for any of the other players involved. Clearly several entities failed to do what they should have done, but Verizon is not a small or inexperienced operation. Taking 8+ hours to respond to a critical operational problem is what stood out to me as unacceptable.

And really - does it matter if the protection *was* there but something broke it? I don't think it does. Ultimately, Verizon failed to implement correct protections on their network. And then failed to respond when it became a problem.
On Jun 24, 2019, at 8:50 PM, Ross Tajvar <ross@tajvar.io> wrote:
Maybe I'm in the minority here, but I have higher standards for a T1 than any of the other players involved. Clearly several entities failed to do what they should have done, but Verizon is not a small or inexperienced operation. Taking 8+ hours to respond to a critical operational problem is what stood out to me as unacceptable.
Are you talking about a press response or a technical one? The impacts I saw were for around 2h or so based on monitoring I’ve had up since 2007. Not great but far from the worst as Tom mentioned. I’ve seen people cease to announce IP space we reclaimed from them for months (or years) because of stale config. I’ve also seen routes come back from the dead because they were pinned to an interface that was down for 2 years but never fully cleaned up. (Then the telco looped the circuit, interface came up, route in table, announced globally — bad day all around).
And really - does it matter if the protection *was* there but something broke it? I don't think it does. Ultimately, Verizon failed to implement correct protections on their network. And then failed to respond when it became a problem.
I think it does matter. As I said in my other reply, people do things like drop ACLs to debug. Perhaps that’s unsafe, but it is something you do to debug. Not knowing what happened, I dunno. It is also 2019 so I hold networks to a higher standard than I did in 2009 or 1999. - Jared
On Mon, Jun 24, 2019 at 9:01 PM Jared Mauch <jared@puck.nether.net> wrote:

Are you talking about a press response or a technical one? The impacts I saw were for around 2h or so based on monitoring I’ve had up since 2007. Not great but far from the worst as Tom mentioned. I’ve seen people cease to announce IP space we reclaimed from them for months (or years) because of stale config. I’ve also seen routes come back from the dead because they were pinned to an interface that was down for 2 years but never fully cleaned up. (Then the telco looped the circuit, interface came up, route in table, announced globally — bad day all around.)

A technical one - see below from CF's blog post: "It is unfortunate that while we tried both e-mail and phone calls to reach out to Verizon, at the time of writing this article (over 8 hours after the incident), we have not heard back from them, nor are we aware of them taking action to resolve the issue."

I think it does matter. As I said in my other reply, people do things like drop ACLs to debug. Perhaps that’s unsafe, but it is something you do to debug. Not knowing what happened, I dunno. It is also 2019 so I hold networks to a higher standard than I did in 2009 or 1999.

Dropping an ACL is fine, but then you have to clean it up when you're done. Your customers don't care that you *almost* didn't have an outage because you *almost* did your job right. Yeah, there's a difference between not following policy and not having a policy, but neither one is acceptable behavior from a T1 imo. If it's that easy to cause an outage by not following policy, then I argue that the policy should be better, or *something* should be better - monitoring, automation, sanity checks, etc. There are lots of ways to solve that problem. And in 2019 I really think there's no excuse for a T1 not to be doing that kind of thing.
On Jun 24, 2019, at 9:39 PM, Ross Tajvar <ross@tajvar.io> wrote:
A technical one - see below from CF's blog post: "It is unfortunate that while we tried both e-mail and phone calls to reach out to Verizon, at the time of writing this article (over 8 hours after the incident), we have not heard back from them, nor are we aware of them taking action to resolve the issue.”
I don’t know if CF is a customer (or not) of VZ, but it’s likely easy enough to find with a looking glass somewhere, but they were perhaps a few of the 20k prefixes impacted (as reported by others). We have heard from them and not a lot of the other people, but most of them likely don’t do business with VZ directly. I’m not sure VZ is going to contact them all or has the capability to respond to them all (or respond to non-customers except via a press release).
Dropping an ACL is fine, but then you have to clean it up when you're done. Your customers don't care that you almost didn't have an outage because you almost did your job right. Yeah, there's a difference between not following policy and not having a policy, but neither one is acceptable behavior from a T1 imo. If it's that easy to cause an outage by not following policy, then I argue that the policy should be better, or something should be better - monitoring, automation, sanity checks, etc. There are lots of ways to solve that problem. And in 2019 I really think there's no excuse for a T1 not to be doing that kind of thing.
I don’t know about the outage (other than what I observed). I offered some suggestions for people to help prevent it from happening, so I’ll leave it there. We all make mistakes, I’ve been part of many and I’m sure that list isn’t yet complete. - Jared
On Mon, Jun 24, 2019 at 09:39:13PM -0400, Ross Tajvar wrote:
A technical one - see below from CF's blog post: "It is unfortunate that while we tried both e-mail and phone calls to reach out to Verizon, at the time of writing this article (over 8 hours after the incident), we have not heard back from them, nor are we aware of them taking action to resolve the issue."
Which is why an operation the size of Verizon should be able to manage the trivial task of monitoring its RFC 2142 role addresses 24x7, with a response time measured in minutes. And not just Verizon: every large operation should be doing the same. There is no excuse for failure to implement this rudimentary operational practice.

[ And let me add that a very good way to deal with mail sent to those addresses is to use procmail to pre-sort based on who it's from. Every time a message is received from a new source, a new procmail rule should be added to classify it appropriately. Over time, this makes it very easy to identify traffic from clueful people vs. traffic from idiots, and thus to readily discern what needs to be triaged first. ]

---rsk
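For those not running procmail, a rough Python equivalent of the same pre-sorting idea is sketched below. The maildir path, folder layout, and sender list are illustrative; the point is only the workflow of promoting mail from known-clueful senders into a priority queue that gets triaged first.

#!/usr/bin/env python3
"""Sketch: bucket role-address mail by sender so known-good reporters get
triaged first. Paths and the sender list are illustrative placeholders."""
import mailbox

KNOWN_GOOD = {"noc@example-peer.net", "peering@example-ix.net"}  # grow this over time

def triage(maildir_path: str = "/var/mail/noc") -> None:
    inbox = mailbox.Maildir(maildir_path, factory=None)
    priority = mailbox.Maildir(maildir_path + "/.priority", factory=None)

    # Collect matches first, then move them, to avoid mutating while iterating.
    to_move = [
        (key, msg) for key, msg in inbox.items()
        if any(addr in (msg.get("From") or "").lower() for addr in KNOWN_GOOD)
    ]
    for key, msg in to_move:
        priority.add(msg)   # goes to the front of the human triage queue
        inbox.remove(key)

if __name__ == "__main__":
    triage()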
On Jun 24, 2019, at 8:03 PM, Tom Beecher <beecher@beecher.cc> wrote:
Disclaimer : I am a Verizon employee via the Yahoo acquisition. I do not work on 701. My comments are my own opinions only.
Respectfully, I believe Cloudflare’s public comments today have been a real disservice. This blog post, and your CEO on Twitter today, took every opportunity to say “DAMN THOSE MORONS AT 701!”. They’re not.
I presume that seeing a CF blog post isn’t regular for you. :-). — please read on
But this industry is one big ass glass house. What’s that thing about stones again?
I’m careful to not talk about the people impacted. There were a lot of people impacted, roughly 3-4% of the IP space was impacted today, and I personally heard from more providers than can be counted on a single hand about their impact. Not everyone is going to write about their business impact in public. I’m not authorized to speak for my employer about any impacts that we may have had (for example) but if there was impact to 3-4% of IP space, statistically speaking there’s always a chance someone was impacted.

I do agree about the glass house thing. There’s a lot of blame to go around, and today I’ve been quoting “go read _normal accidents_” to people. It’s because sufficiently complex systems tend to have complex failures where numerous safety systems or controls were bypassed. Those of us with more than a few days of experience likely know what some of them are; we also don’t know if those safety systems were disabled as part of debugging by one or more parties. Who hasn’t dropped an ACL to debug why it isn’t working, or to see if that fixed the problem?

I don’t know what happened, but I sure know the symptoms and the sets of fixes that the industry should apply and enforce. I have been communicating some of them in public and many of them in private today, including offering help to other operators with how to implement some of the fixes. It’s a bad day when someone changes your /16 to two /17’s and sends them out, regardless of whether the packets flow through or not. These things aren’t new, nor do I expect things to be significantly better tomorrow either.

I know people at VZ and suspect once they woke up they did something about it. I also know how hard it is to contact someone you don’t have a business relationship with. A number of the larger providers have no way for a non-customer to phone, message or open a ticket online about problems they may have. Who knows, their ticket system may be in the cloud and was also impacted.

What I do know is that if 3-4% of homes/structures were flooded or temporarily unusable because of some form of disaster or evacuation, people would be proposing better engineering methods or inspection techniques for these structures.

If you are a small network and just point default, there is nothing for you to see here and nothing that you can do. If you speak BGP with your upstream, you can filter out some of the bad routes. You perhaps know that 1239, 3356 and others should only be seen directly from a network like 701 and can apply filters of this sort to prevent accepting those more-specifics. I don’t believe it’s just 174 that the routes went to, but they were one of the networks aside from 701 where I saw paths from today.

(Now the part where you, as a 3rd party to this event, can help!)

If you peer, build some pre-flight and post-flight scripts to check how many routes you are sending. Most router vendors support either on-box scripting, or you can do a show | display xml, JSON or some other structured output you can automate with. AS_PATH filters are simple, low cost and can help mitigate problems. Consider monitoring your routes with a BMP server (pmacct has a great one!). Set max-prefix (and monitor if you near thresholds!). Configure automatic restarts if you won’t be around to fix it. I hate to say “automate all the things”, but at least start with monitoring so you can know when things go bad. Slack and other things have great APIs and you can have alerts sent to your systems telling you of problems. Try hard to automate your debugging.
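As one possible shape for the pre-flight/post-flight idea above, a sketch like the following compares per-peer advertised-route counts against expected ceilings. The collector command, peer names, and thresholds are placeholders for whatever your platform actually exposes (structured show output, an API, or a BMP feed).

#!/usr/bin/env python3
"""Sketch: pre/post-flight check of advertised-route counts per peer.

get_advertised_counts() is a placeholder -- feed it from your router's
structured output or your BMP/monitoring system.
"""
import json
import subprocess
import sys

# Illustrative per-peer ceilings; a transit customer should never see your full table.
EXPECTED_MAX = {"peer-ix-lan": 500, "transit-vz": 200}

def get_advertised_counts() -> dict:
    """Placeholder collector: expected to print JSON like {"peer-ix-lan": 312, ...}."""
    raw = subprocess.run(
        ["collect-advertised-counts"],  # hypothetical helper; replace with your own
        capture_output=True, text=True, check=True,
    )
    return json.loads(raw.stdout)

def main() -> int:
    failures = []
    for peer, count in get_advertised_counts().items():
        limit = EXPECTED_MAX.get(peer)
        if limit is not None and count > limit:
            failures.append(f"{peer}: advertising {count} routes, expected <= {limit}")
    for line in failures:
        print("POST-FLIGHT FAIL:", line)
    return 1 if failures else 0

if __name__ == "__main__":
    sys.exit(main())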
Monitor for announcements of your space. The new RIS Live API lets you do this and it’s super easy to spin something up (a rough sketch follows after this message).

Hold your suppliers accountable as well. If you are a customer of a network that was impacted or accepted these routes, ask for a formal RFO and what the corrective actions are. Don’t let them off the hook, as it will happen again.

If you are using route optimization technology, make double certain it’s not possible to leak routes. Cisco IOS and Noction are two products that I either know or have been told don’t have default safe settings enabled. I learned early on in the 90s the perils of having “everything on, unprotected” by default. There were great bugs in software that allowed devices to be compromised at scale, which made for cleanup problems comparable to what we’ve seen in recent years with IoT and other technologies. Tell your vendors you want them to be secure by default, and vote with your personal and corporate wallet when you can. It won’t always work; some vendors will not be able or willing to clean up their acts, but unless we act together as an industry to clean up the glass inside our own homes, expect someone from the outside to come at some point who can force it, and it may not even make sense (ask anyone who deals with security audit checklists) but you will be required to do it.

Please take action within your power at your company. Stand up for what is right for everyone with this shared risk and threat. You may not enjoy who the messenger is (or the one who is the loudest) but set that aside for the industry.

</soapbox>

- Jared

PS. We often call ourselves network engineers or architects. If we are truly that, we are using those industry standards as building blocks to ensure a solid foundation. Make sure your foundation is stable. Learn from others’ mistakes to design and operate the best network feasible.
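For the "monitor for announcements of your space" suggestion, a bare-bones RIS Live watcher might look like the sketch below. The endpoint, subscribe message, and field names are as recalled from the RIS Live documentation, so verify them against the current docs; the watched prefix is illustrative.

#!/usr/bin/env python3
"""Sketch: watch RIS Live for announcements covering a prefix of yours."""
import asyncio
import json

import websockets  # third-party; pip install websockets

RIS_LIVE = "wss://ris-live.ripe.net/v1/ws/?client=leak-watch-sketch"
WATCHED_PREFIX = "104.16.0.0/12"  # illustrative: put your own aggregate here

async def watch() -> None:
    async with websockets.connect(RIS_LIVE) as ws:
        # Ask only for updates covering the watched prefix, incl. more-specifics.
        await ws.send(json.dumps({
            "type": "ris_subscribe",
            "data": {"prefix": WATCHED_PREFIX, "moreSpecific": True},
        }))
        async for raw in ws:
            msg = json.loads(raw)
            data = msg.get("data", {})
            path = data.get("path", [])
            for ann in data.get("announcements", []):
                for prefix in ann.get("prefixes", []):
                    # Page/alert here instead of printing in a real deployment.
                    print(f"seen {prefix} via AS path {path}")

if __name__ == "__main__":
    asyncio.run(watch())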
Respectfully, I believe Cloudflare’s public comments today have been a real disservice. This blog post, and your CEO on Twitter today, took every opportunity to say “DAMN THOSE MORONS AT 701!”. They’re not.
I presume that seeing a CF blog post isn’t regular for you. :-).
never seen such a thing :) amidst all this conjecturbation and blame casting, have any of the parties *directly* involved, i.e. 701 and their customer, issued any sort of post mortem from which we might learn? randy
On 25/06/2019 03:03, Tom Beecher wrote:
Disclaimer : I am a Verizon employee via the Yahoo acquisition. I do not work on 701. My comments are my own opinions only.
Respectfully, I believe Cloudflare’s public comments today have been a real disservice. This blog post, and your CEO on Twitter today, took every opportunity to say “DAMN THOSE MORONS AT 701!”. They’re not.
Perhaps suggest to VZ management to use their blog: https://www.verizondigitalmedia.com/blog/ to contradict what CF blogged about? -Hank
On Tue, Jun 25, 2019 at 12:49 AM Hank Nussbacher <hank@efes.iucc.ac.il> wrote:
Perhaps suggest to VZ management to use their blog: https://www.verizondigitalmedia.com/blog/
#coughwrongvz I think anyway - you probably mean: https://enterprise.verizon.com/ GoodLuck! I think it's 3 clicks to: "www22.verizon.com" which gets even moar fun! The NOC used to answer if you called: +1-800-900-0241 which is in their whois records...
to contradict what CF blogged about?
-Hank
On 25/06/2019 08:17, Christopher Morrow wrote:
Perhaps suggest to VZ management to use their blog: https://www.verizondigitalmedia.com/blog/ #coughwrongvz
I think anyway - you probably mean: https://enterprise.verizon.com/

This post is unrelated to Verizon Enterprise? https://www.verizondigitalmedia.com/blog/2019/06/exponential-global-growth-a...
-Hank
Verizon Business / Enterprise is the access network, aka 701/2/3.

Verizon Media Group is the CDN/media side: Digital Media Services (Edgecast), Yahoo, AOL; 15133 / 10310 / 1668. (The entity formerly named Oath, created when Yahoo was acquired.)
Disclaimer: I am a Verizon employee via the Yahoo acquisition. I do not work on 701. My comments are my own opinions only.

Disclaimer: As much as I dislike Cloudflare (I used to complain about them a lot on Twitter), this is something where I absolutely agree with them. Verizon failed to do the most basic of network security, and it will happen again, and again, and again...

This blog post, and your CEO on Twitter today, took every opportunity to say “DAMN THOSE MORONS AT 701!”.

Damn those morons at 701, period.

But do we know they didn’t?

They didn't, otherwise yesterday's LSE could have been prevented.

Do we know it was there and just setup wrong?

If it virtually exists but does not work, it does not exist.

Did another change at another time break what was there?

What's not there cannot be changed. Keep in mind, another well-known route leak happened back in 2017 when Google leaked routes towards Verizon and Verizon silently accepted and propagated all of them without filtering. Probably nothing has changed since then.

Shouldn’t we be working on facts?

They have stated the facts.

to take a harder stance on the BGP optimizer that generated the bogus routes

The BGP optimizer was only the trigger for this event; the actual misconfiguration happened between 396531 and 701. IDGAF if 396531 or one of their peers uses a BGP optimizer, 701 should have filtered those out, but they decided not to do that instead.

You’re right to use this as a lever to push for proper filtering, RPKI, best practices.

Yes, and 701 should follow those "best practices".

Point being, I have been doing network stuff for around 10 years and started doing BGP and internet routing related stuff only around three years ago, and even _I_ can follow best practices. And if I have the knowledge about those things and can follow best practices, Verizon SURELY has enough resources to do so as well!
[Removing the attribution, because many people have made statements like this over the last day - or year. Just selecting this one as a succinct and recent example to illustrate the point.]
This blog post, and your CEO on Twitter today, took every opportunity to say “DAMN THOSE MORONS AT 701!”. Damn those morons at 701, period.
I must be old. All I can think is Kids These Days, and maybe Get Off My BGP, er Lawn.

Any company running a large, highly complex infrastructure is going to make mistakes. Period. It is not like 701 is causing problems every week, or even every year. If you think this one incident proves they are ‘morons’, you are only showing you are neither experienced nor mature enough to make that judgement.

To be clear, they may well be morons. I no longer know many people architecting and operating 701’s backbone, so I cannot tell you first-hand how smart they are. Maybe they are stupid, but exceptionally lucky. However, the facts at hand do not support your blanket assertion, and making it does not speak well of you.

OTOH, I do have first-hand experience with previous CF blog posts, and to say they spin things in their favor is being generous. But then, it’s a blog post, i.e. Marketing. What else would you expect?

I know it is anathema to the ethos of the network engineers & architects to work together instead of hurling insults, but it would probably result in a better Internet. And isn’t that what we all (supposedly) want?

--
TTFN,
patrick
On Tue, 25 Jun 2019, 14:31 Patrick W. Gilmore, <patrick@ianai.net> wrote:
I must be old. All I can think is Kids These Days, and maybe Get Off My BGP, er Lawn.
Maybe they ought to [puts on shades] mind their MANRS. M (scuttling away)
<Cue The Who>

Now with that out of the way...

The mentality of everyone working together for a Better Internet (tm) is sort of a mantra of WISPA and WISPs in general. It is a mantra that has puzzled me and perplexed my own feelings as a network engineer. Do I want a better overall experience for my users and customers? Absolutely. Do I strive to make our network the best... pause... in the world? Definitely. Should I do the same to help a neighboring ISP, a competitor? This is where I scratch my head.

You would absolutely think that we would all want a better overall Internet. One that we can depend on in times of need. One that we can be proud of. But we are driven, unfortunately, by our C-level execs to shun the competition and do whatever we can to get a leg up on everyone else. While this is good for the bottom line, it is not exactly a healthy mentality to pit everyone against each other. It causes animosity between providers, and we end up blaming each other for something simple and then claiming they are stupid. A mistake that may be easy to make, a mistake that we have probably made ourselves a few times, perhaps a mistake we can learn to shrug off.

I believe there probably is a happy medium we can all meet, sort of our own ISP DMZ, where we can help one another in the simple mistakes or cut each other some slack in those difficult times. I like to think NANOG is that place.

--
Adam Kennedy, Network & Systems Engineer
adamkennedy@watchcomm.net
*Watch Communications*
(866) 586-1518

On Tue, Jun 25, 2019 at 8:50 AM Matthew Walster <matthew@walster.org> wrote:
On Tue, 25 Jun 2019, 14:31 Patrick W. Gilmore, <patrick@ianai.net> wrote:
I must be old. All I can think is Kids These Days, and maybe Get Off My BGP, er Lawn.
Maybe they ought to [puts on shades] mind their MANRS.
M (scuttling away)
On 25/Jun/19 14:59, Adam Kennedy via NANOG wrote:
I believe there probably is a happy medium we can all meet, sort of our own ISP DMZ, where we can help one another in the simple mistakes or cut each other some slack in those difficult times. I like to think NANOG is that place.
Isn't that the point of NOGs, and why we rack up so many air miles each year trying to meet each other and break bread (or something) while checking the Competition Hats at the door?

Mark.
(thanks, btw, again) On Tue, Jun 25, 2019 at 8:33 AM Patrick W. Gilmore <patrick@ianai.net> wrote:
It is not like 701 is causing problems every week, or even ever year. If you think this one incident proves they are ‘morons’, you are only showing you are neither experienced nor mature enough to make that judgement.
I would be shocked if 701 is no longer filtering customers by default. I know they weren't filtering 'peers'. It seems like the particular case yesterday was a missed customer prefix-list :( which is sad, but happens. The Japan incident seems to be the other type, I'd guess.

-chris
perhaps the good side of this saga is that it may be an inflection point randy
> perhaps the good side of this saga is that it may be an inflection point

I doubt it. The greyer my hair gets, the crankier I get.
i suspect i am a bit ahead of you there

but i used to think that the public would never become aware of privacy issues. snowden bumped that ball and tim cook spiked it. and it is getting more and more air time.

randy
On 6/25/19 2:25 AM, Katie Holly wrote:
Disclaimer: As much as I dislike Cloudflare (I used to complain about them a lot on Twitter), this is something I am absolutely agreeing with them. Verizon failed to do the most basic of network security, and it will happen again, and again, and again...
I used to be a quality control engineer in my career, so I have a question to ask from the perspective of a QC guy: what is the Best Practice for minimizing, if not totally preventing, this sort of problem? Is there a "cookbook" answer to this? (I only run edge networks now, and don't have BGP to worry about. If my current $dayjob goes away -- they all do -- I might have to get back into the BGP game, so this is not an idle query.) Somehow "just be careful and clueful" isn't the right answer.
Hi Stephen,
I used to be a quality control engineer in my career, so I have a question to ask from the perspective of a QC guy: what is the Best Practice for minimizing, if not totally preventing, this sort of problem? Is there a "cookbook" answer to this?
As suggested by Job in the thread above:

- deploy RPKI-based BGP Origin Validation (with invalid == reject)
- apply maximum-prefix limits on all EBGP sessions
- ask your router vendor to comply with RFC 8212 ('default deny')
- turn off your 'BGP optimizers' --> you actually don't need them at all; I survived without any optimizer

Also, read RFC 7454 and join MANRS :)

Regards,
Aftab Siddiqui
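To make the "invalid == reject" item concrete, here is a minimal Python sketch of RFC 6811-style origin validation. The ROA entries and the accept policy below are hypothetical examples, not anyone's production setup; a real deployment feeds validated ROA payloads to the router via RTR (RFC 8210) rather than a hard-coded list.

    # Minimal sketch (not production code) of RFC 6811-style origin validation,
    # illustrating the "invalid == reject" policy recommended above.
    import ipaddress

    # (prefix, max_length, origin_asn) -- hypothetical example ROAs
    ROAS = [
        (ipaddress.ip_network("192.0.2.0/24"), 24, 64500),
        (ipaddress.ip_network("198.51.100.0/22"), 23, 64501),
    ]

    def rov_state(prefix: str, origin_asn: int) -> str:
        """Return 'valid', 'invalid', or 'not-found' for an announcement."""
        net = ipaddress.ip_network(prefix)
        covered = False
        for roa_net, max_len, roa_asn in ROAS:
            if net.version == roa_net.version and net.subnet_of(roa_net):
                covered = True  # at least one ROA covers this prefix
                if origin_asn == roa_asn and net.prefixlen <= max_len:
                    return "valid"
        return "invalid" if covered else "not-found"

    def accept(prefix: str, origin_asn: int) -> bool:
        # Policy from the list above: reject invalid, accept valid and not-found.
        return rov_state(prefix, origin_asn) != "invalid"

    # A more-specific within a covered block, beyond the ROA's maxLength,
    # comes out invalid and is dropped -- the leaked-optimizer-route case:
    print(rov_state("198.51.100.0/24", 64501))  # -> invalid
    print(accept("192.0.2.0/24", 64500))        # -> True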
On Tue, Jun 25, 2019 at 7:06 AM Stephen Satchell <list@satchell.net> wrote:
On 6/25/19 2:25 AM, Katie Holly wrote:
Disclaimer: As much as I dislike Cloudflare (I used to complain about them a lot on Twitter), this is something I am absolutely agreeing with them. Verizon failed to do the most basic of network security, and it will happen again, and again, and again...
I used to be a quality control engineer in my career, so I have a question to ask from the perspective of a QC guy: what is the Best Practice for minimizing, if not totally preventing, this sort of problem? Is there a "cookbook" answer to this?
(I only run edge networks now, and don't have BGP to worry about. If my current $dayjob goes away -- they all do -- I might have to get back into the BGP game, so this is not an idle query.)
Somehow "just be careful and clueful" isn't the right answer.
1. Know what to expect — create policy to enforce routes and paths that you expect, knowing sometimes this may be very broad.
2. Enforce what you expect — drop routes and sessions that do not conform.
3. Use all the internal tools in series as layers of defense — as-path lists with regex, IP prefix lists, max-routes — they work in series and all must match. Shoving everything into a route-map is not best, because what happens when that policy breaks? Good to have layers.
4. Use IRR, RPKI, and alarming as external ecosystem tools.
5. Don't run Noction or IOS; unsafe defaults.
6. When on the phone with your peer, verbally check to make sure they double-check their policy. Don't assume.
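A rough sketch of the "layers of defense in series" idea from the list above, in plain Python rather than router configuration. The prefix list, AS-path regex, and limit values are made-up examples; the point is that every layer must pass independently.

    import re

    ALLOWED_PREFIXES = {"203.0.113.0/24", "198.51.100.0/23"}   # per-customer prefix list
    AS_PATH_RE = re.compile(r"^64500(_64501)*$")               # customer AS and its downstream only
    MAX_PREFIXES = 50                                          # session-level maximum-prefix limit

    def accept_route(prefix: str, as_path: list[int], session_route_count: int) -> bool:
        # Layer 1: the prefix must be one we expect from this customer.
        if prefix not in ALLOWED_PREFIXES:
            return False
        # Layer 2: the AS path must match the expected customer cone.
        if not AS_PATH_RE.match("_".join(str(a) for a in as_path)):
            return False
        # Layer 3: if the total count exceeds the limit, model "tear the session down".
        if session_route_count > MAX_PREFIXES:
            raise RuntimeError("maximum-prefix exceeded; shut the session")
        return True

    print(accept_route("203.0.113.0/24", [64500], 10))        # True
    print(accept_route("8.8.8.0/24", [64500, 15169], 10))     # False: not on the prefix list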
I finally thought about this after I got off my beer high :-).

Some of our customers complained about losing access to Cloudflare's resources during the Verizon debacle. Since we are doing ROV and dropping Invalids, this should not have happened, given most of Cloudflare's IPv4 and IPv6 routes are ROA'd.

However, since we are not using the ARIN TAL (for known reasons), this explains why this also broke for us.

Back to beer now :-)...

Mark.
So that means it's time for everyone to migrate their ARIN resources to a sane RIR that does allow normal access to and redistribution of its RPKI TAL? ;-)

The RPKI TAL problem + an industry-standard IRRDB instead of WHOIS-RWS were both major reasons for us to bring our ARIN IPv4 address space to RIPE. Unfortunately we had to renumber our handful of IPv6 customers because ARIN doesn't do IPv6 inter-RIR transfers, but hey, no pain no gain.

Therefore, Cloudflare folks - when are you transferring your resources away from ARIN? :D

Best regards,
Martijn

On 7/4/19 11:46 AM, Mark Tinka wrote:

I finally thought about this after I got off my beer high :-).

Some of our customers complained about losing access to Cloudflare's resources during the Verizon debacle. Since we are doing ROV and dropping Invalids, this should not have happened, given most of Cloudflare's IPv4 and IPv6 routes are ROA'd.

However, since we are not using the ARIN TAL (for known reasons), this explains why this also broke for us.

Back to beer now :-)...

Mark.
Martijn - i3D.net is not in the list Job posted yesterday of RPKI ROV deployment. Your message below hints that you may be using RPKI. Are you doing ROV? (You may be in the “hundreds of others” category.)

—Sandy

Begin forwarded message:

From: Job Snijders <job@ntt.net>
Subject: Re: CloudFlare issues?
Date: July 4, 2019 at 11:33:57 AM EDT
To: Francois Lecavalier <Francois.Lecavalier@mindgeek.com>
Cc: "nanog@nanog.org" <nanog@nanog.org>

I believe at this point in time it is safe to accept valid and unknown (combined with an IRR filter), and reject RPKI invalid BGP announcements at your EBGP borders. Large examples of other organisations who already are rejecting invalid announcements are AT&T, Nordunet, DE-CIX, YYCIX, XS4ALL, MSK-IX, INEX, France-IX, Seacomm, Workonline, KPN International, and hundreds of others.
On Jul 4, 2019, at 5:56 AM, i3D.net - Martijn Schmidt via NANOG <nanog@nanog.org> wrote:
So that means it's time for everyone to migrate their ARIN resources to a sane RIR that does allow normal access to and redistribution of its RPKI TAL? ;-)
The RPKI TAL problem + an industry-standard IRRDB instead of WHOIS-RWS were both major reasons for us to bring our ARIN IPv4 address space to RIPE. Unfortunately we had to renumber our handful of IPv6 customers because ARIN doesn't do IPv6 inter-RIR transfers, but hey, no pain no gain.
Therefore, Cloudflare folks - when are you transferring your resources away from ARIN? :D
Best regards, Martijn
On 7/4/19 11:46 AM, Mark Tinka wrote:
I finally thought about this after I got off my beer high :-).
Some of our customers complained about losing access to Cloudflare's resources during the Verizon debacle. Since we are doing ROV and dropping Invalids, this should not have happened, given most of Cloudflare's IPv4 and IPv6 routes are ROA'd.
However, since we are not using the ARIN TAL (for known reasons), this explains why this also broke for us.
Back to beer now :-)...
Mark.
Hey Sandy,

At this time i3D.net is not able to fully implement RPKI for technical reasons: there are still some Brocade routers in our network which don't support it. We are making very good progress migrating the entire network over to Juniper routers which do support RPKI, and we will certainly deploy ROV when that is done, but with upwards of 40 default-free backbone routers spread over six continents it's not a logistically trivial task.

That being said, a network doesn't need to use ROV to benefit from the routing security afforded by the RPKI protocol. Nearly all of the prefixes originated by AS49544 have been covered by RPKI ROAs for several years now. Those networks which have already deployed ROV are inoculated against route hijacks of i3D.net's IP space in scenarios where the bad paths would be marked as RPKI invalid.

Considering that i3D.net was founded in The Netherlands and that a significant amount of our enterprise customers have businesses which are focused on the Dutch market, the fact that two of the major eyeball networks in the country (that'd be KPN & XS4ALL) are using ROV is already a huge win for everyone involved.

And, let's not forget that the degree of protection afforded by this relatively passive participation in RPKI is directly proportional to the use of a non-ARIN TAL. Real-world example: Mark Tinka's remark concerning Seacom's connection to Cloudflare's IP space being affected by the hijack due to the ARIN TAL problem, despite both involved parties fully deploying RPKI by both signing ROAs and implementing ROV.

Best regards,
Martijn

On 7/5/19 8:46 PM, Sandra Murphy wrote:
Martijn - i3D.net is not in the list Job posted yesterday of RPKI ROV deployment. Your message below hints that you may be using RPKI. Are you doing ROV? (You may be in the “hundreds of others” category.)
—Sandy
Begin forwarded message:
From: Job Snijders <job@ntt.net> Subject: Re: CloudFlare issues? Date: July 4, 2019 at 11:33:57 AM EDT To: Francois Lecavalier <Francois.Lecavalier@mindgeek.com> Cc: "nanog@nanog.org" <nanog@nanog.org>
I believe at this point in time it is safe to accept valid and unknown (combined with an IRR filter), and reject RPKI invalid BGP announcements at your EBGP borders. Large examples of other organisations who already are rejecting invalid announcements are AT&T, Nordunet, DE-CIX, YYCIX, XS4ALL, MSK-IX, INEX, France-IX, Seacomm, Workonline, KPN International, and hundreds of others.
On Jul 4, 2019, at 5:56 AM, i3D.net - Martijn Schmidt via NANOG <nanog@nanog.org> wrote:
So that means it's time for everyone to migrate their ARIN resources to a sane RIR that does allow normal access to and redistribution of its RPKI TAL? ;-)
The RPKI TAL problem + an industry-standard IRRDB instead of WHOIS-RWS were both major reasons for us to bring our ARIN IPv4 address space to RIPE. Unfortunately we had to renumber our handful of IPv6 customers because ARIN doesn't do IPv6 inter-RIR transfers, but hey, no pain no gain.
Therefore, Cloudflare folks - when are you transferring your resources away from ARIN? :D
Best regards, Martijn
On 7/4/19 11:46 AM, Mark Tinka wrote:
I finally thought about this after I got off my beer high :-).
Some of our customers complained about losing access to Cloudflare's resources during the Verizon debacle. Since we are doing ROV and dropping Invalids, this should not have happened, given most of Cloudflare's IPv4 and IPv6 routes are ROA'd.
However, since we are not using the ARIN TAL (for known reasons), this explains why this also broke for us.
Back to beer now :-)...
Mark.
On Thu, Jul 04, 2019 at 11:46:05AM +0200, Mark Tinka wrote:
I finally thought about this after I got off my beer high :-).
Some of our customers complained about losing access to Cloudflare's resources during the Verizon debacle. Since we are doing ROV and dropping Invalids, this should not have happened, given most of Cloudflare's IPv4 and IPv6 routes are ROA'd.
These were more-specifics, though. So if you drop all the more-specifics as failing ROV, then you end up following the valid shorter prefix to the destination. Quite possibly that points at the upstream which sent you the more-specific which you rejected, at which point your packets end up going to the same place they would have gone if you had accepted the invalid more-specific.

Two potential issues here: First, if you don't have an upstream who is also rejecting the invalid routes, then anywhere you send the packets, they're going to follow the more-specific. Second, even if you do have an upstream that is rejecting the invalid routes, ROV won't cause you to prefer the less-specific from an upstream that is rejecting the invalid routes over a less-specific from an upstream that is accepting the invalid routes.

For example: if upstream A sends you:

10.0.0.0/16 valid

and upstream B sends you:

10.0.0.0/16 valid
10.0.0.0/17 invalid
10.0.128.0/17 invalid

you want to send the packet to A. But ROV won't cause that, and if upstream B is selected by your BGP decision criteria (path length, etc.), your packets will ultimately follow the more-specific.

(Of course, the problem can occur more than one network away. Even if you do send to upstream A, there's no guarantee that A's less-specifics aren't pointed at another network that does have the more-specifics. But at least you give them a fighting chance by sending them to A.)

-- Brett
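Brett's point can be shown with a toy longest-prefix-match lookup: after rejecting the invalid /17s, forwarding still follows your chosen /16, and whether the traffic escapes the leak depends on whether that next-hop network also rejected the more-specifics. The FIB contents and next-hop names below are hypothetical.

    import ipaddress

    def lookup(fib, dst):
        """Longest-prefix match over a {prefix: nexthop} table."""
        addr = ipaddress.ip_address(dst)
        best = None
        for prefix, nexthop in fib.items():
            net = ipaddress.ip_network(prefix)
            if addr in net and (best is None or net.prefixlen > best[0].prefixlen):
                best = (net, nexthop)
        return best[1] if best else None

    # Our FIB after ROV: only the valid /16 remains, pointed at upstream B.
    our_fib = {"10.0.0.0/16": "upstream-B"}

    # Upstream B did not do ROV, so it still carries the leaked more-specific.
    upstream_b_fib = {"10.0.0.0/16": "legit-path", "10.0.0.0/17": "leaker"}

    hop1 = lookup(our_fib, "10.0.1.1")          # -> upstream-B
    hop2 = lookup(upstream_b_fib, "10.0.1.1")   # -> leaker (the /17 wins there)
    print(hop1, hop2)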
On my test net I take ROA_INVALIDs and convert them to unreachables with a low preference (i.e. so that any upstreams taking only the shorter path will be selected, but so that such packets will never be routed).

Obviously this isn't a well-supported operation, but I'm curious what people think of such an approach? If you really want to treat ROA_INVALID as "this is probably a hijack", you don't really want to be sending the hijacker traffic.

Of course if upstreams are rejecting ROA_INVALID you can still have the same problem one network away, but it's an interesting result for testing, especially since it rejects a bunch of crap in China where CT has reassigned prefixes with covering ROAs to customers who re-announce on their own ASN (which appears to be common).

Matt

On 7/6/19 4:05 PM, Brett Frankenberger wrote:
On Thu, Jul 04, 2019 at 11:46:05AM +0200, Mark Tinka wrote:
I finally thought about this after I got off my beer high :-).
Some of our customers complained about losing access to Cloudflare's resources during the Verizon debacle. Since we are doing ROV and dropping Invalids, this should not have happened, given most of Cloudflare's IPv4 and IPv6 routes are ROA'd.
These were more-specifics, though. So if you drop all the more-specifics as failing ROV, then you end up following the valid shorter prefix to the destination. Quite possibly that points at the upstream which sent you the more-specific which you rejected, at which point your packets end up going to the same place they would have gone if you had accepted the invalid more-specific.
Two potential issues here: First, if you don't have an upstream who is also rejecting the invalid routes, then anywhere you send the packets, they're going to follow the more-specific. Second, even if you do have an upstream that is rejecting the invalid routes, ROV won't cause you to prefer the less-specific from an upstream that is rejecting the invalid routes over a less-specific from an upstream that is accepting the invalid routes.
For example: if upstream A sends you 10.0.0.0/16 valid, and upstream B sends you 10.0.0.0/16 valid, 10.0.0.0/17 invalid, and 10.0.128.0/17 invalid, you want to send the packet to A. But ROV won't cause that, and if upstream B is selected by your BGP decision criteria (path length, etc.), your packets will ultimately follow the more-specific.
(Of course, the problem can occur more than one network away. Even if you do send to upstream A, there's no guarantee that A's less-specifics aren't pointed at another network that does have the more-specifics. But at least you give them a fighting chance by sending them to A.)
-- Brett
Oops, I mean with a script which removes such routes if there is an encompassing route which a different upstream takes, as obviously the more-specific would otherwise still win.

Matt

On 7/6/19 5:44 PM, Matt Corallo wrote:
On my test net I take ROA_INVALIDs and convert them to unreachables with a low preference (i.e. so that any upstreams taking only the shorter path will be selected, but so that such packets will never be routed).
Obviously this isn't a well-supported operation, but I'm curious what people think of such an approach? If you really want to treat ROA_INVALID as "this is probably a hijack", you don't really want to be sending the hijacker traffic.
Of course if upstreams are rejecting ROA_INVALID you can still have the same problem one network away, but it's an interesting result for testing, especially since it rejects a bunch of crap in China where CT has reassigned prefixes with covering ROAs to customers who re-announce on their own ASN (which appears to be common).
Matt
On 7/6/19 4:05 PM, Brett Frankenberger wrote:
On Thu, Jul 04, 2019 at 11:46:05AM +0200, Mark Tinka wrote:
I finally thought about this after I got off my beer high :-).
Some of our customers complained about losing access to Cloudflare's resources during the Verizon debacle. Since we are doing ROV and dropping Invalids, this should not have happened, given most of Cloudflare's IPv4 and IPv6 routes are ROA'd.
These were more-specifics, though. So if you drop all the more-specifics as failing ROV, then you end up following the valid shorter prefix to the destination. Quite possibly that points at the upstream which sent you the more-specific which you rejected, at which point your packets end up going to the same place they would have gone if you had accepted the invalid more-specific.
Two potential issues here: First, if you don't have an upstream who is also rejecting the invalid routes, then anywhere you send the packets, they're going to follow the more-specific. Second, even if you do have an upstream that is rejecting the invalid routes, ROV won't cause you to prefer the less-specific from an upstream that is rejecting the invalid routes over a less-specific from an upstream that is accepting the invalid routes.
For example: if upstream A sends you 10.0.0.0/16 valid, and upstream B sends you 10.0.0.0/16 valid, 10.0.0.0/17 invalid, and 10.0.128.0/17 invalid, you want to send the packet to A. But ROV won't cause that, and if upstream B is selected by your BGP decision criteria (path length, etc.), your packets will ultimately follow the more-specific.
(Of course, the problem can occur more than one network away. Even if you do send to upstream A, there's no guarantee that A's less-specifics aren't pointed at another network that does have the more-specifics. But at least you give them a fighting chance by sending them to A.)
-- Brett
On 6/Jul/19 23:44, Matt Corallo wrote:
On my test net I take ROA_INVALIDs and convert them to unreachables with a low preference (ie so that any upstreams taking only the shorter path will be selected, but so that such packets will never be routed).
Obviously this isn't a well-supported operation, but I'm curious what people think of such an approach? If you really want to treat ROA_INVALID as "this is probably a hijack", you don't really want to be sending the hijacker traffic.
If a prefix's RPKI state is Invalid, drop it! Simple.

In most cases, it's a mistake due to a mis-configuration and/or a lack of deep understanding of RPKI. In fewer cases, it's an actual hijack. Either way, dropping the Invalid routes keeps the BGP clean and quickly encourages the originating network to get things fixed.

As you point out, RPKI state validation is locally significant, with protection extending to downstream customers only. So for this to really work, it needs critical mass. One, two, three, four or five networks implementing ROV and dropping Invalids does not a secure BGP make.

Mark.
On 6/Jul/19 22:05, Brett Frankenberger wrote:
These were more-specifics, though. So if you drop all the more-specifics as failing ROV, then you end up following the valid shorter prefix to the destination.
I can't quite recall which Cloudflare prefixes were impacted. If you have a sniff at https://bgp.he.net/AS13335#_prefixes and https://bgp.he.net/AS13335#_prefixes6 you will see that Cloudflare have a larger portion of their IPv6 prefixes ROA'd than the IPv4 ones. If you remember which Cloudflare prefixes were affected by the Verizon debacle, we can have a closer look.
Quite possibly that points at the upstream which sent you the more-specific which you rejected, at which point your packets end up same going to the same place they would have gone if you had accepted the invalid more-specific.
But that's my point... we did not have the chance to drop any of the affected Cloudflare prefixes because we do not use the ARIN TAL. That means that we are currently ignoring the RPKI value of Cloudflare's prefixes that are under ARIN. Also, AFAICT, none of our current upstreams are doing ROV. You can see that list here: https://bgp.he.net/AS37100#_graph4 Mark.
On 24/Jun/19 16:11, Job Snijders wrote:
- deploy RPKI based BGP Origin validation (with invalid == reject)
- apply maximum prefix limits on all EBGP sessions
- ask your router vendor to comply with RFC 8212 ('default deny')
- turn off your 'BGP optimizers'
I cannot over-emphasize the above, especially the BGP optimizers. Mark.
On 2019-06-24 20:16, Mark Tinka wrote:
On 24/Jun/19 16:11, Job Snijders wrote:
- deploy RPKI based BGP Origin validation (with invalid == reject)
- apply maximum prefix limits on all EBGP sessions
- ask your router vendor to comply with RFC 8212 ('default deny')
- turn off your 'BGP optimizers'
I cannot over-emphasize the above, especially the BGP optimizers.
Mark.
+1 https://honestnetworker.net/2019/06/24/leaking-your-optimized-routes-to-stub... -- hugge
This is what it looked like happened: there was a large-scale BGP leak incident causing about 20k prefixes for roughly 2,400 networks (ASNs) to be rerouted through AS396531 (a steel plant) and then on to its transit provider: Verizon (AS701).

Start time: 10:34:21 (UTC)
End time: 12:37 (UTC)

All AS paths had the following in common: 701 396531 33154

33154 (DQECOM) is an ISP providing transit to 396531. 396531 is, by the looks of it, a steel plant, dual-homed to 701 and 33154. 701 is Verizon, which by the looks of it accepted all BGP announcements from 396531.

What appears to have happened is that routes from 33154 were propagated to 396531, which then sent them to Verizon, and voila... there is the full leak at work. (DQECOM runs a BGP optimizer: https://www.noction.com/clients/dqe , thanks Job for pointing that out; more below.)

As a result, traffic for 20k or so prefixes was rerouted through Verizon and 396531 (the steel plant). We've seen numerous incidents like this in the past. Lessons learned:

1) if you do use a BGP optimizer, please FILTER!
2) Verizon... filter your customers, please!

Since the BGP optimizer introduces new more-specific routes, a lot of traffic for high-traffic destinations would have been rerouted through that path, which would have been congested, causing the outages. There were many Cloudflare prefixes affected, but also folks like Amazon, Akamai, Facebook, Apple, Linode etc.

Here's one example, for Amazon - CloudFront: 52.84.32.0/22. Normally announced as 52.84.32.0/21, but during the incident as a /22 (remember, more-specifics always win): https://stat.ripe.net/52.84.32.0%2F22#tabId=routing&routing_bgplay.ignoreReannouncements=false&routing_bgplay.resource=52.84.32.0/22&routing_bgplay.starttime=1561337999&routing_bgplay.endtime=1561377599&routing_bgplay.rrcs=0,1,2,5,6,7,10,11,13,14,15,16,18,20&routing_bgplay.instant=null&routing_bgplay.type=bgp

RPKI would have worked here (assuming you're strict with the max length)! A small sketch of that maxLength check follows after the quoted message below.

Cheers,
Andree

My secret spy satellite informs me that Dmitry Sherman wrote on 2019-06-24, 3:55 AM:
Hello are there any issues with CloudFlare services now?
Dmitry Sherman dmitry@interhost.net Interhost Networks Ltd Web: http://www.interhost.co.il fb: https://www.facebook.com/InterhostIL Office: (+972)-(0)74-7029881 Fax: (+972)-(0)53-7976157
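To illustrate Andree's maxLength point with the CloudFront example: assume, purely for illustration, a ROA for 52.84.32.0/21 with maxLength 21 and origin AS16509 (the actual ROA contents are not being quoted here). The leaked /22 then fails origin validation even though it sits inside the covered space and claims the same origin.

    import ipaddress

    # Hypothetical ROA, used only to illustrate strict maxLength.
    roa_prefix = ipaddress.ip_network("52.84.32.0/21")
    roa_maxlen = 21
    roa_origin = 16509  # origin ASN assumed for this example

    def state(announced: str, origin: int) -> str:
        net = ipaddress.ip_network(announced)
        if not net.subnet_of(roa_prefix):
            return "not-found (this ROA does not cover it)"
        if origin == roa_origin and net.prefixlen <= roa_maxlen:
            return "valid"
        return "invalid"

    print(state("52.84.32.0/21", 16509))  # valid   (the normal announcement)
    print(state("52.84.32.0/22", 16509))  # invalid (the leaked more-specific)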
Hi All,

here in Ukraine we got an impact as well! Two questions:

1. Why did Cloudflare not immediately announce all their address space by /24s? This can put the service up instantly for almost all places.

2. Why did almost all carriers not filter the leak on their side, instead waiting for "better weather on Mars" for several hours?

On 24.06.19 13:55, Dmitry Sherman wrote:
Hello are there any issues with CloudFlare services now?
Dmitry Sherman dmitry@interhost.net Interhost Networks Ltd Web: http://www.interhost.co.il fb: https://www.facebook.com/InterhostIL Office: (+972)-(0)74-7029881 Fax: (+972)-(0)53-7976157
Verizon is the one who should've noticed something was amiss and dropped their customer's BGP session. They also should have had filters and prefix count limits in place, which would have prevented this whole disaster. As to why any of that didn't happen, who actually knows.

Regards,
Filip

On 6/24/19 4:28 PM, Max Tulyev wrote:
Why did almost all carriers not filter the leak on their side, instead waiting for "better weather on Mars" for several hours?
-- Filip Hruska Linux System Administrator
On Mon, Jun 24, 2019 at 10:41 AM Filip Hruska <fhr@fhrnet.eu> wrote:
Verizon is the one who should've noticed something was amiss and dropped their customer's BGP session. They also should have had filters and prefix count limits in place, which would have prevented this whole disaster.
Oddly, VZ used to be quite good about filtering customer sessions :( There ARE cases where "customer says they may announce X" and that doesn't happen along the expected path :( For instance, they end up announcing a path through their other transit to a prefix in the permitted list on the VZ side :(

It doesn't seem plausible that that is what was happening here though; I don't expect the Duquesne folk to have customer paths to (for instance) savi moebel in Germany... There are some pretty fun AS paths in the set of ~25k prefixes leaked (that routeviews saw).
(Updating subject line to be accurate)
On Jun 24, 2019, at 10:28 AM, Max Tulyev <maxtul@netassist.ua> wrote:
Hi All,
here in Ukraine we got an impact as well!
Have two questions:
1. Why did Cloudflare not immediately announce all their address space by /24s? This can put the service up instantly for almost all places.
They may not want to pollute the global routing table with these entries. It has a cost for everyone. If we all did this, the table would be a mess.
2. Why did almost all carriers not filter the leak on their side, instead waiting for "better weather on Mars" for several hours?
There are several major issues here:

- Verizon accepted garbage from their customer
- Other networks accepted the garbage from Verizon (e.g. Cogent)
- known best practices from over a decade ago are not applied

I'm sure reporters will be reaching out to Verizon about this, and their response time should be noted. It was impacting many networks. You should filter your transits to prevent impact from these more-specifics.

- Jared

https://twitter.com/jaredmauch/status/1143163212822720513
https://twitter.com/JobSnijders/status/1143163271693963266
https://puck.nether.net/~jared/blog/?p=208
On 6/24/2019 10:44 AM, Jared Mauch wrote:
It was impacting many networks. You should filter your transits to prevent impact from these more-specifics.
- Jared
https://twitter.com/jaredmauch/status/1143163212822720513 https://twitter.com/JobSnijders/status/1143163271693963266 https://puck.nether.net/~jared/blog/?p=208
$MAJORNET filters between peers make sense but what can a transit customer do to prevent being affected by leaks like this one?
On Jun 24, 2019, at 11:00 AM, ML <ml@kenweb.org> wrote:
On 6/24/2019 10:44 AM, Jared Mauch wrote:
It was impacting many networks. You should filter your transits to prevent impact from these more-specifics.
- Jared
https://twitter.com/jaredmauch/status/1143163212822720513 https://twitter.com/JobSnijders/status/1143163271693963266 https://puck.nether.net/~jared/blog/?p=208
$MAJORNET filters between peers make sense but what can a transit customer do to prevent being affected by leaks like this one?
Block routes from 3356 (for example) that don't go 701_3356_, 701_2914_, 701_1239_, etc. (if 701 is your transit and you are multi-homed). Then you won't accept the more-specifics. If you point default, it may not be any help.

- Jared
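A rough sketch of the AS-path check Jared describes, in Python rather than router policy language: on routes learned from transit AS 701, only accept paths where a large network such as 3356 sits directly behind 701. The BIG_NETWORKS set and the helper name are illustrative assumptions, not a complete peer-lock implementation.

    BIG_NETWORKS = {3356, 2914, 1239}

    def peer_lock_ok(as_path: list[int], transit_asn: int = 701) -> bool:
        """Accept only if every 'big network' ASN sits right behind the transit AS (prepends allowed)."""
        if not as_path or as_path[0] != transit_asn:
            return False
        for i, asn in enumerate(as_path):
            if asn in BIG_NETWORKS and i != 1 and as_path[i - 1] != asn:
                return False
        return True

    print(peer_lock_ok([701, 3356, 64496]))          # True: 3356 directly behind 701
    print(peer_lock_ok([701, 396531, 33154, 3356]))  # False: 3356 reached via the leak path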
On 24.06.19 17:44, Jared Mauch wrote:
1. Why did Cloudflare not immediately announce all their address space by /24s? This can put the service up instantly for almost all places.

They may not want to pollute the global routing table with these entries. It has a cost for everyone. If we all did this, the table would be a mess.
Yes, it is. But it is a working, quick and temporary fix for the problem.
2. Why did almost all carriers not filter the leak on their side, instead waiting for "better weather on Mars" for several hours?

There are several major issues here:
- Verizon accepted garbage from their customer
- Other networks accepted the garbage from Verizon (e.g. Cogent)
- known best practices from over a decade ago are not applied
That's it.

We have several IXes connected, and all of them had a correct aggregated route to CF. And there was one upstream that distributed the leaked more-specifics.

I think 30 minutes maximum is enough to find the problem and filter out its source on their side. Almost nobody did it. Why?
> On Jun 24, 2019, at 11:12 AM, Max Tulyev <maxtul@netassist.ua> wrote:
>
> On 24.06.19 17:44, Jared Mauch wrote:
>>> 1. Why did Cloudflare not immediately announce all their address space by /24s? This can put the service up instantly for almost all places.
>> They may not want to pollute the global routing table with these entries. It has a cost for everyone. If we all did this, the table would be a mess.
>
> Yes, it is. But it is a working, quick and temporary fix for the problem.

Like many things (e.g. AT&T had similar issues with 12.0.0.0/8), now there's a bunch of /9s in the table that will likely never go away.

>>> 2. Why did almost all carriers not filter the leak on their side, instead waiting for "better weather on Mars" for several hours?
>> There are several major issues here:
>> - Verizon accepted garbage from their customer
>> - Other networks accepted the garbage from Verizon (e.g. Cogent)
>> - known best practices from over a decade ago are not applied
>
> That's it.
>
> We have several IXes connected, and all of them had a correct aggregated route to CF. And there was one upstream that distributed the leaked more-specifics.
>
> I think 30 minutes maximum is enough to find the problem and filter out its source on their side. Almost nobody did it. Why?

I have heard people say "we don't look for problems". This is often the case; there is a lack of monitoring/awareness. I had several systems detect the problem, plus things like BGPmon also saw it. My guess is the people that passed this on weren't monitoring either. It's often manual procedures vs automated scripts watching things.

Instrumentation of your network elements tends to be done by a small set of people who invest in it. You tend to need some scale for it to make sense, and it also requires people who understand the underlying data well enough to spot what is "odd". This is why I've had my monitoring system up for the past 12+ years. It's super simple (dumb) and catches a lot of issues. I implemented it again for the RIPE RIS Live service, but haven't cut it over to be the primary (realtime) monitoring method vs watching route-views. I think it's time to do that.

- Jared
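The "super simple (dumb)" monitoring Jared mentions can be sketched roughly like this: compare each observed (prefix, origin) pair against what you expect and alarm on anything else. The expected-prefix data, helper name, and sample updates below are hypothetical; in practice the updates could come from RIPE RIS Live, a route collector, or your own routers, with parsing adapted to that source.

    import ipaddress
    from typing import Optional

    # Prefixes we care about, and the origin ASN we expect for each (example data).
    EXPECTED = {
        ipaddress.ip_network("203.0.113.0/24"): 64500,
    }

    def check_update(prefix: str, origin_asn: int) -> Optional[str]:
        """Return an alarm string for suspicious updates, or None if it looks fine."""
        net = ipaddress.ip_network(prefix)
        for watched, expected_origin in EXPECTED.items():
            if net == watched and origin_asn != expected_origin:
                return f"origin change for {net}: saw AS{origin_asn}, expected AS{expected_origin}"
            if net != watched and net.version == watched.version and net.subnet_of(watched):
                return f"unexpected more-specific {net} inside {watched} from AS{origin_asn}"
        return None

    # Hypothetical updates as they might arrive from a feed:
    for pfx, asn in [("203.0.113.0/24", 64500), ("203.0.113.0/25", 64666)]:
        alarm = check_update(pfx, asn)
        if alarm:
            print("ALARM:", alarm)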
Cloudflare blog on the outage is out. https://blog.cloudflare.com/how-verizon-and-a-bgp-optimizer-knocked-large-pa... Martin On Mon, Jun 24, 2019 at 3:57 AM Dmitry Sherman <dmitry@interhost.net> wrote:
Hello are there any issues with CloudFlare services now?
Dmitry Sherman dmitry@interhost.net Interhost Networks Ltd Web: http://www.interhost.co.il fb: https://www.facebook.com/InterhostIL Office: (+972)-(0)74-7029881 Fax: (+972)-(0)53-7976157
participants (36)
- Adam Kennedy
- Aftab Siddiqui
- Andree Toonk
- Antonios Chariton
- Brett Frankenberger
- Ca By
- Christopher Morrow
- Dmitry Sherman
- Dovid Bender
- Filip Hruska
- Fredrik Korsbäck
- Hank Nussbacher
- i3D.net - Martijn Schmidt
- Jaden Roberts
- James Jun
- Jared Mauch
- Job Snijders
- Justin Paine
- Katie Holly
- Mark Tinka
- Martin J. Levy
- Matt Corallo
- Matthew Walster
- Max Tulyev
- ML
- Patrick W. Gilmore
- Pavel Lunin
- Randy Bush
- Rich Kulawiec
- Robbie Trencheny
- Ross Tajvar
- Sandra Murphy
- Sean Donelan
- Stephen Satchell
- Tom Beecher
- Tom Paseka