Sure, but I don't care how busy your router is, it shouldn't take hours to withdraw routes.

-----
Mike Hammett
Intelligent Computing Solutions
http://www.ics-il.com

Midwest-IX
http://www.midwest-ix.com

----- Original Message -----
From: "Saku Ytti" <saku@ytti.fi>
To: "Martijn Schmidt" <martijnschmidt@i3d.net>
Cc: "Outages" <outages@outages.org>, "North American Network Operators' Group" <nanog@nanog.org>
Sent: Wednesday, September 2, 2020 2:15:46 AM
Subject: Re: [outages] Major Level3 (CenturyLink) Issues

On Wed, 2 Sep 2020 at 10:00, Martijn Schmidt via NANOG <nanog@nanog.org> wrote:
> I suppose now would be a good time for everyone to re-open their CenturyLink ticket and ask why the RFO doesn't address the most important defect, i.e. the inability to withdraw announcements even by shutting down the session?
The more work the BGP process has, the longer it takes to complete that work. You could try asking in your RFP/RFQ whether a provider will commit to a specific convergence time, which would improve your position contractually and might make you eligible for compensation or termination of the contract. Realistically, though, every operator can run into a situation where you see what most would agree are pathologically long convergence times. The more BGP sessions and RIB entries, the higher the probability that these issues manifest.

Perhaps protocol-level work can be justified as well. BGP doesn't have a concept of initial convergence. If you have a lot of peers, your initial convergence contains a massive amount of useless work, because you keep changing the best route while you keep receiving new best routes. The higher the scale, the more useless work you do and the longer the stability you require to eventually converge. Practical devices that operators run may require hours during _normal operation_ to complete initial convergence.

RFC7313 might show us a way to reduce the amount of useless work. You might want to add a signal that initial convergence is done, or a signal that no installation or best-path algorithm runs until all routes are loaded. This would massively improve convergence at scale, as you wouldn't do that throwaway work, which ultimately inflates your work queue and pushes your useful work far into the future.

The main thing I would ask as a customer: how can we fix it faster than 5h in the future? Did we lose access to the control plane? Could we reasonably avoid losing it?

--
++ytti
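As a rough illustration of the throwaway work described above, here is a toy sketch in Python. It is not how any real BGP implementation works: the function name `simulate`, the random per-route preference, and the "one install per best-path change" cost model are all assumptions made for the example. It compares running best-path selection on every received route against deferring selection until every peer has finished sending its table.

```python
import random


def simulate(num_peers=20, num_prefixes=1000, seed=1):
    """Toy model: count best-path installs during initial convergence.

    Hypothetical setup: every peer advertises every prefix, and each
    route gets a random preference (higher wins). Peers deliver their
    full tables one after another.
    """
    rng = random.Random(seed)
    updates = [
        (peer, prefix, rng.random())
        for peer in range(num_peers)
        for prefix in range(num_prefixes)
    ]

    # Eager: re-run best-path on every update and install each new winner.
    eager_best = {}
    eager_installs = 0
    for peer, prefix, pref in updates:
        current = eager_best.get(prefix)
        if current is None or pref > current[1]:
            eager_best[prefix] = (peer, pref)
            eager_installs += 1  # work that a later update may throw away

    # Deferred: load every route first, then select and install once per prefix.
    rib = {}
    for peer, prefix, pref in updates:
        rib.setdefault(prefix, []).append((peer, pref))
    deferred_best = {
        prefix: max(routes, key=lambda route: route[1])
        for prefix, routes in rib.items()
    }
    deferred_installs = len(deferred_best)

    # Both strategies must end with the same best routes.
    return eager_installs, deferred_installs, eager_best == deferred_best


if __name__ == "__main__":
    eager, deferred, same = simulate()
    print(f"eager installs:    {eager}")
    print(f"deferred installs: {deferred}")
    print(f"same final state:  {same}")
```

In this model both strategies end with identical best routes, but the eager one performs extra installs that are later overwritten, and that gap widens as the peer count grows. A real router also has to handle withdrawals, policy, and peers that never finish sending, so this only shows the direction of the effect, not its size.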