24 Jun
2019
11:15 a.m.
> On Jun 24, 2019, at 11:12 AM, Max Tulyev <maxtul@netassist.ua> wrote:
>
> 24.06.19 17:44, Jared Mauch wrote:
>>> 1. Why Cloudflare did not immediately announced all their address space by /24s? This can put the service up instantly for almost all places.
>> They may not want to pollute the global routing table with these entries. It has a cost for everyone. If we all did this, the table would be a mess.
>
> yes, it is. But it is a working, quick and temporary fix of the problem.

Like many things (e.g., ATT had similar issues with 12.0.0.0/8), now there’s a bunch of /9s in the table that will likely never go away.

>>> 2. Why almost all carriers did not filter the leak on their side, but waited for "a better weather on Mars" for several hours?
>> There’s several major issues here
>> - Verizon accepted garbage from their customer
>> - Other networks accepted the garbage from Verizon (eg: Cogent)
>> - known best practices from over a decade ago are not applied
>
> That's it.
>
> We have several IXes connected, all of them had a correct aggregated route to CF. And there was one upstream distributed leaked more specifics.
>
> I think 30min maximum is enough to find out a problem and filter out it's source on their side. Almost nobody did it. Why?

I have heard people say “we don’t look for problems”. This is often the case: there is a lack of monitoring/awareness. I had several systems detect the problem, plus things like bgpmon also saw it. My guess is that the people who passed this on weren’t monitoring either. It’s often manual procedures vs automated scripts watching things.

Instrumentation of your network elements tends to be something only a small set of people invest in. You tend to need some scale for it to make sense, and it also requires people who understand the underlying data well enough to know what is “odd”.

This is why I’ve had my monitoring system up for the past 12+ years. It’s super simple (dumb) and catches a lot of issues.
I implemented it again for the RIPE RIS Live service, but haven’t cut it over to be the primary (realtime) monitoring method vs watching route-views. I think it’s time to do that.

- Jared
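
For illustration only, here is a minimal sketch of the kind of "super simple (dumb)" check discussed above. It is not the monitoring system mentioned in this message. It assumes announcements have already been parsed into a prefix string and an AS path from some feed (route-views, RIS Live, or similar); the `Watch` type, the `check_announcement` function, the 10.0.0.0/20 prefix, and the documentation-range ASNs are all hypothetical placeholders.

```python
import ipaddress
from typing import List, NamedTuple, Optional


class Watch(NamedTuple):
    """A prefix we care about (hypothetical structure for this sketch)."""
    prefix: ipaddress.IPv4Network  # covering aggregate we expect to see
    origin_asn: int                # ASN expected to originate it
    max_len: int                   # longest prefix length we expect


# Placeholder watch list: private address space and a documentation ASN.
WATCHES = [Watch(ipaddress.ip_network("10.0.0.0/20"), 64496, 20)]


def check_announcement(prefix: str, as_path: List[int],
                       watches: List[Watch] = WATCHES) -> Optional[str]:
    """Return an alert string if the announcement looks odd, else None."""
    if not as_path:
        return None
    net = ipaddress.ip_network(prefix)
    origin = as_path[-1]  # the origin AS is the last hop in the path
    for w in watches:
        # subnet_of() raises TypeError on mixed IPv4/IPv6, so check first.
        if net.version != w.prefix.version or not net.subnet_of(w.prefix):
            continue
        if net.prefixlen > w.max_len:
            return f"more-specific {net} of {w.prefix}, AS path {as_path}"
        if origin != w.origin_asn:
            return (f"unexpected origin AS{origin} for {net} "
                    f"(expected AS{w.origin_asn})")
    return None


if __name__ == "__main__":
    # A leaked /21 carved out of the watched /20 triggers an alert even
    # though the origin AS is the expected one.
    print(check_announcement("10.0.8.0/21", [64500, 64501, 64496]))
```

The more-specific check is the part relevant to this thread: it fires on the prefix length alone, regardless of whether the origin AS in the path is still the legitimate one.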