Re: [outages] Major Level3 (CenturyLink) Issues

2 Sep 2020

Sure, but I don't care how busy your router is, it shouldn't take hours to withdraw routes.

----- 
Mike Hammett 
Intelligent Computing Solutions 
http://www.ics-il.com 

Midwest-IX 
http://www.midwest-ix.com 

----- Original Message -----

From: "Saku Ytti" <saku@ytti.fi> 
To: "Martijn Schmidt" <martijnschmidt@i3d.net> 
Cc: "Outages" <outages@outages.org>, "North American Network Operators' Group" <nanog@nanog.org> 
Sent: Wednesday, September 2, 2020 2:15:46 AM 
Subject: Re: [outages] Major Level3 (CenturyLink) Issues 

On Wed, 2 Sep 2020 at 10:00, Martijn Schmidt via NANOG <nanog@nanog.org> wrote:
> ...
> I suppose now would be a good time for everyone to re-open their CenturyLink ticket and ask why the RFO doesn't address the most important defect, i.e. the inability to withdraw announcements even by shutting down the session?

The more work the BGP process has, the longer it takes to complete that
work. You could check in your RFP/RFQ whether a provider will commit to
a specific convergence time, which would improve your position
contractually and might make you eligible for compensation or
termination of contract. Realistically, though, every operator can run
into a situation where you will see what most would agree are
pathologically long convergence times.

The more BGP sessions and RIB entries you have, the higher the
probability that these issues manifest. Perhaps protocol-level work can
be justified as well. BGP doesn't have a concept of initial
convergence. If you have a lot of peers, your initial convergence
contains a massive amount of useless work, because you keep changing
the best route while you keep receiving new best routes. The higher the
scale, the more useless work you do and the longer the period of
stability you need to eventually ~converge. Devices that operators
actually run may require hours during _normal operation_ to complete
initial convergence.

RFC 7313 might show us a way to reduce the amount of useless work. You
might want to add a signal that initial convergence is done, or a
signal that no installation or best-path computation happens until all
routes are loaded. This would massively improve convergence at scale,
as you wouldn't do that throwaway work, which ultimately inflates your
work queue and pushes your useful work far into the future.
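
To make the "throwaway work" point concrete, here is a toy sketch in
Python. It is not how any real BGP implementation works: the names
(build_tables, eager, deferred) are hypothetical, a single integer
"preference" stands in for the whole best-path decision, and the
"all routes are loaded" signal is simply assumed to exist. It only
counts how many times a best route gets installed when selection runs
on every received route, versus once after all peers have finished
sending.

```python
import random


def build_tables(n_peers, n_prefixes, seed=0):
    """Return [(peer_id, {prefix: preference})], one full table per peer.

    Preferences are random integers; this is an illustrative assumption,
    not a model of real routing policy.
    """
    rng = random.Random(seed)
    return [
        (peer, {prefix: rng.randrange(1000) for prefix in range(n_prefixes)})
        for peer in range(n_peers)
    ]


def eager(peer_tables):
    """Run best-path selection on every received route.

    Each time a route beats the current best for its prefix, it is
    "installed". Every install that is later replaced is throwaway work.
    """
    best = {}      # prefix -> (preference, peer_id)
    installs = 0
    for peer, table in peer_tables:
        for prefix, pref in table.items():
            candidate = (pref, peer)
            if prefix not in best or candidate > best[prefix]:
                best[prefix] = candidate
                installs += 1
    return best, installs


def deferred(peer_tables):
    """Collect all routes first, then select and install once per prefix.

    This assumes some signal tells us that every peer has finished
    sending its initial table.
    """
    candidates = {}  # prefix -> list of (preference, peer_id)
    for peer, table in peer_tables:
        for prefix, pref in table.items():
            candidates.setdefault(prefix, []).append((pref, peer))
    best = {prefix: max(routes) for prefix, routes in candidates.items()}
    return best, len(best)


if __name__ == "__main__":
    tables = build_tables(n_peers=20, n_prefixes=1000)
    eager_best, eager_installs = eager(tables)
    deferred_best, deferred_installs = deferred(tables)
    print("same final result:", eager_best == deferred_best)
    print("eager installs:   ", eager_installs)
    print("deferred installs:", deferred_installs)
```

Both functions end with the same best routes; the difference is only in
how many intermediate installs happen along the way. The toy ignores
the cost of holding all candidates in memory and the delay before any
route is usable, which a real design would have to weigh.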

The main thing I would ask as a customer is: how can we fix it faster
than 5h in the future? Did we lose access to the control plane? Could we
reasonably avoid losing it?
-- 
++ytti