Re: [outages] Major Level3 (CenturyLink) Issues
the RFO is making the rounds http://seele.lamehost.it/~marco/blind/Network_Event_Formal_RFO_Multiple_Mark... it kinda explains the flowspec issue but completely ignores the stuck routes, which imho was the more damaging problem. randy
I suppose now would be a good time for everyone to re-open their CenturyLink ticket and ask why the RFO doesn't address the most important defect, i.e. the inability to withdraw announcements even by shutting down the session? Best regards, Martijn
On Wed, 2 Sep 2020 at 10:00, Martijn Schmidt via NANOG <nanog@nanog.org> wrote:
I suppose now would be a good time for everyone to re-open their CenturyLink ticket and ask why the RFO doesn't address the most important defect, i.e. the inability to withdraw announcements even by shutting down the session?
The more work the BGP process has, the longer it takes to complete that work. You could ask in your RFP/RFQ whether a provider will commit to a specific convergence time, which would improve your position contractually and might make you eligible for compensation or termination of the contract, but realistically every operator can run into a situation with what most would agree are pathologically long convergence times. The more BGP sessions and the more RIB entries, the higher the probability that these issues manifest.

Perhaps protocol-level work can be justified as well. BGP has no concept of initial convergence: if you have a lot of peers, your initial convergence contains a massive amount of useless work, because you keep changing the best route while you keep receiving new best routes. The higher the scale, the more useless work you do and the longer the period of stability you need to eventually ~converge. Practical devices that operators run may need hours during _normal operation_ to finish initial convergence.

RFC 7313 might show us a way to reduce the amount of useless work. You might want to add a signal that initial convergence is done, or a signal that no FIB installation or best-path run happens until all routes are loaded. This would massively improve convergence at scale, because you wouldn't do that throwaway work, which ultimately inflates your work queue and pushes your useful work far into the future.

The main thing I would ask as a customer is: how can we fix this faster than 5 hours in the future? Did we lose access to the control plane? Could we reasonably have avoided losing it?

-- ++ytti
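To make the deferral idea concrete, here is a minimal Python sketch of holding off best-path runs until every configured peer has signalled End-of-RIB. The class and method names are invented for illustration and are not taken from any real BGP implementation; a production speaker would also need a timer for peers that never send End-of-RIB.

    # Hypothetical sketch: defer best-path selection until all peers have sent
    # End-of-RIB, so the speaker skips the throwaway best-path runs that
    # otherwise happen throughout the initial table load.

    class DeferredConvergenceRib:
        def __init__(self, peers):
            self.pending_eor = set(peers)   # peers we still expect an EoR from
            self.adj_rib_in = {}            # prefix -> {peer: path}
            self.loc_rib = {}               # prefix -> chosen path

        def on_update(self, peer, prefix, path):
            self.adj_rib_in.setdefault(prefix, {})[peer] = path
            if not self.pending_eor:
                self._run_best_path(prefix)  # steady state: converge per update

        def on_end_of_rib(self, peer):
            self.pending_eor.discard(peer)
            if not self.pending_eor:
                # Initial load finished: one best-path pass over everything,
                # instead of one pass per received update.
                for prefix in self.adj_rib_in:
                    self._run_best_path(prefix)

        def _run_best_path(self, prefix):
            candidates = self.adj_rib_in.get(prefix, {})
            if candidates:
                # Placeholder tie-break; a real implementation compares
                # local-pref, AS-path length, MED, and so on.
                self.loc_rib[prefix] = min(candidates.items())[1]
            else:
                self.loc_rib.pop(prefix, None)

    rib = DeferredConvergenceRib(peers=["rr1", "rr2"])
    rib.on_update("rr1", "192.0.2.0/24", "path-via-rr1")
    rib.on_end_of_rib("rr1")
    rib.on_end_of_rib("rr2")   # only now does best-path selection run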
❦ 2 September 2020 10:15 +03, Saku Ytti:
RFC 7313 might show us a way to reduce the amount of useless work. You might want to add a signal that initial convergence is done, or a signal that no FIB installation or best-path run happens until all routes are loaded. This would massively improve convergence at scale, because you wouldn't do that throwaway work, which ultimately inflates your work queue and pushes your useful work far into the future.
It seems BIRD contains an implementation of RFC 7313. From the source code, it delays removal of stale routes until EoRR, but it doesn't seem to delay the work of updating the kernel. Juniper doesn't seem to implement it. Cisco seems to implement it, but only on refresh, not on the initial connection. Is there some survey around this RFC? -- Don't patch bad code - rewrite it. - The Elements of Programming Style (Kernighan & Plauger)
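For readers who haven't looked at RFC 7313: the enhanced route refresh brackets a refresh between Begin-of-Route-Refresh and End-of-Route-Refresh markers, and stale paths are only purged once the refresh completes. Below is a rough Python sketch of that bookkeeping; it is not BIRD's code, and the names are invented.

    # Rough sketch of RFC 7313 stale-path handling during a route refresh.
    # Not taken from BIRD or any other implementation.

    class PeerRib:
        def __init__(self):
            self.paths = {}     # prefix -> path
            self.stale = set()  # prefixes awaiting re-advertisement

        def on_borr(self):
            # Begin-of-Route-Refresh: everything currently known from this
            # peer becomes provisionally stale, but stays installed for now.
            self.stale = set(self.paths)

        def on_update(self, prefix, path):
            self.paths[prefix] = path
            self.stale.discard(prefix)   # re-advertised, no longer stale

        def on_withdraw(self, prefix):
            self.paths.pop(prefix, None)
            self.stale.discard(prefix)

        def on_eorr(self):
            # End-of-Route-Refresh: anything the peer did not re-advertise
            # during the refresh is removed in one pass.
            for prefix in self.stale:
                self.paths.pop(prefix, None)
            self.stale.clear()

The point Vincent raises is separate: even with this bookkeeping, an implementation may still push every intermediate change down to the kernel instead of batching the FIB work until EoRR.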
On Wed, 2 Sep 2020 at 12:50, Vincent Bernat <bernat@luffy.cx> wrote:
It seems BIRD contains an implementation of RFC 7313. From the source code, it delays removal of stale routes until EoRR, but it doesn't seem to delay the work of updating the kernel. Juniper doesn't seem to implement it. Cisco seems to implement it, but only on refresh, not on the initial connection. Is there some survey around this RFC?
Correct, it doesn't do anything for the initial load, but I took it as an example of how we might approach the problem of initial convergence cost in scaled environments. -- ++ytti
creative engineers can conjecturbate for days on how some turtle in the pond might write code that did not withdraw for a month, or other delightful reasons why CL might have had this really really bad behavior. the point is that the actual symptoms and cause really really should be in the RFO randy
Sure. But being good engineers, we love to exercise our brains by thinking about possibilities and probabilities. For example, we don't form disaster response plans by saying "well, we could think about what *could* happen for days, but we'll just wait for something to occur". -A On Wed, Sep 2, 2020 at 8:51 AM Randy Bush <randy@psg.com> wrote:
creative engineers can conjecturbate for days on how some turtle in the pond might write code that did not withdraw for a month, or other delightful reasons why CL might have had this really really bad behavior.
the point is that the actual symptoms and cause really really should be in the RFO
randy
we don't form disaster response plans by saying "well, we could think about what *could* happen for days, but we'll just wait for something to occur".
from an old talk of mine: if it was part of the “plan,” it’s an “event”; if it is not, then it’s a “disaster.”
Sure, but I don't care how busy your router is, it shouldn't take hours to withdraw routes. ----- Mike Hammett Intelligent Computing Solutions http://www.ics-il.com Midwest-IX http://www.midwest-ix.com
On Wed, 2 Sep 2020 at 14:40, Mike Hammett <nanog@ics-il.net> wrote:
Sure, but I don't care how busy your router is, it shouldn't take hours to withdraw routes.
Quite; the discussion is less about how we feel about it and more about why it happens and what could be done about it. -- ++ytti
I am not buying it. No normal implementation of BGP stays online, replying to heartbeats and accepting updates from eBGP peers, yet after 5 hours has failed to process withdrawals from customers. On Wed, 2 Sep 2020 at 14:11, Saku Ytti <saku@ytti.fi> wrote:
On Wed, 2 Sep 2020 at 14:40, Mike Hammett <nanog@ics-il.net> wrote:
Sure, but I don't care how busy your router is, it shouldn't take hours to withdraw routes.
Quite; the discussion is less about how we feel about it and more about why it happens and what could be done about it.
-- ++ytti
On Wed, 2 Sep 2020 at 16:16, Baldur Norddahl <baldur.norddahl@gmail.com> wrote:
I am not buying it. No normal implementation of BGP stays online, replying to heartbeats and accepting updates from eBGP peers, yet after 5 hours has failed to process withdrawals from customers.
I can imagine writing a BGP implementation like this:
a) own queue for keepalives, which I always serve first, fully
b) own queue for updates, which I serve second
c) own queue for withdraws, which I serve last
Why I might think this makes sense: perhaps I just received from RR2 the prefix I'm pulling from RR1. If I don't handle all my updates first, I'm causing an outage that should not happen, because I already received the update telling me I don't need to withdraw the route. Is this the right way to do it? Maybe not, but it's easy to imagine why it might seem like a good idea.
How well BGP works in common cases and how it works in pathologically scaled and busy cases are very different things. I know that even in stable state, commonly run vendors on commonly run hardware can take 2+ hours to finish converging iBGP on initial turn-up.
-- ++ytti
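Purely to illustrate the kind of prioritisation being imagined here, a minimal Python sketch of a strict-priority message scheduler follows. The names are invented, it treats a withdraw-only UPDATE as its own message class for simplicity, and it is not any vendor's code.

    import collections

    class BgpMessageScheduler:
        """Toy scheduler: keepalives first, then updates, withdraws last."""

        def __init__(self):
            self.keepalives = collections.deque()
            self.updates = collections.deque()
            self.withdraws = collections.deque()

        def enqueue(self, msg):
            if msg["type"] == "KEEPALIVE":
                self.keepalives.append(msg)
            elif msg["type"] == "UPDATE" and msg.get("withdrawn"):
                self.withdraws.append(msg)
            else:
                self.updates.append(msg)

        def next_message(self):
            # Strict priority: the session never times out because keepalives
            # always win, but a long update backlog starves withdraw
            # processing entirely.
            for queue in (self.keepalives, self.updates, self.withdraws):
                if queue:
                    return queue.popleft()
            return None

With strict priority and a deep enough update backlog, a withdraw can sit queued for a very long time even though the speaker keeps answering keepalives, which is roughly the symptom people reported.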
Yeah. This actually would be a fascinating study, to understand exactly what happened. The volume of BGP messages flying around because of the session churn must have been absolutely massive, especially in a complex internal infrastructure like 3356 has. I would say the scale of such an event has to be many orders of magnitude beyond what anyone ever designed for, so it doesn't shock me at all that unexpected behavior occurred. But that's why we're engineers; we want to understand such things.
I believe someone on this list reported that updates were also broken: they could not add prepending or modify communities. Anyway, I am not saying it cannot happen, because clearly something did happen. I just don't believe it is a simple case of overload. There has to be more to it.
Detailed explanation can be found below. https://blog.thousandeyes.com/centurylink-level-3-outage-analysis/
❦ 2 September 2020 16:35 +03, Saku Ytti:
I am not buying it. No normal implementation of BGP stays online, replying to heartbeats and accepting updates from eBGP peers, yet after 5 hours has failed to process withdrawals from customers.
I can imagine writing a BGP implementation like this:
a) own queue for keepalives, which I always serve first, fully
b) own queue for updates, which I serve second
c) own queue for withdraws, which I serve last
Or maybe graceful restart configured without a timeout on IPv4/IPv6? The flowspec rule severed the BGP session abruptly, stale routes are kept due to graceful restart (except flowspec rules), BGP sessions are re-established, but the flowspec rule is handled before reaching EoR and we loop from there. -- Make sure your code "does nothing" gracefully. - The Elements of Programming Style (Kernighan & Plauger)
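A toy simulation of the loop being hypothesised, under the stated assumptions (graceful restart with no stale-route timeout, and the offending flowspec rule processed before End-of-RIB). It is speculative, the names and prefix are invented, and it is not a claim about what CenturyLink's routers actually did.

    # Speculative sketch: graceful restart with no stale timer keeps routes
    # installed across session resets, while a flowspec rule keeps killing
    # the session before End-of-RIB, so withdraws never take effect.

    class Peer:
        def __init__(self):
            self.routes = {"198.51.100.0/24": "via peer"}  # pre-outage state
            self.stale = set()

        def session_down(self):
            # GR: keep forwarding on stale routes; with no stale-route
            # timeout, "stale" effectively means "forever".
            self.stale = set(self.routes)

        def session_up_and_replay(self, bad_flowspec_rule_present):
            # The flowspec rule is processed before the unicast End-of-RIB
            # and severs the session again, so the stale set is never purged.
            if bad_flowspec_rule_present:
                self.session_down()
                return "reset before EoR"
            self.stale.clear()
            return "converged"

    peer = Peer()
    peer.session_down()
    for attempt in range(3):
        print(attempt, peer.session_up_and_replay(bad_flowspec_rule_present=True),
              "stale routes still installed:", sorted(peer.stale))

Until the offending rule is removed out-of-band, nothing in this loop ever flushes the stale routes, which would be consistent with withdrawals not taking effect even after customers shut their sessions down.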
On Wed, Sep 2, 2020 at 3:04 PM Vincent Bernat <bernat@luffy.cx> wrote:
❦ 2 September 2020 16:35 +03, Saku Ytti:
I am not buying it. No normal implementation of BGP stays online, replying to heartbeats and accepting updates from eBGP peers, yet after 5 hours has failed to process withdrawals from customers.
I can imagine writing a BGP implementation like this:
a) own queue for keepalives, which I always serve first, fully
b) own queue for updates, which I serve second
c) own queue for withdraws, which I serve last
Or maybe graceful restart configured without a timeout on IPv4/IPv6? The flowspec rule severed the BGP session abruptly, stale routes are kept due to graceful restart (except flowspec rules), BGP sessions are re-established, but the flowspec rule is handled before reaching EoR and we loop from there.
... or all routes are fed into some magic route optimization box which is designed to keep things more stable and take advantage of Cisco's "step-10" to suck more traffic, or....

The root issue here is that the *public* RFO is incomplete / unclear. Something something flowspec something, blocked flowspec, no more something does indeed explain that something bad happened, but not what caused the lack of withdraws / cascading churn. As with many interesting outages, I suspect that we will never get the full story, and "Something bad happened, we fixed it and now it's all better and will never happen ever again, trust us..." seems to be the new normal for public postmortems...

W
-- I don't think the execution is relevant when it was obviously a bad idea in the first place. This is like putting rabid weasels in your pants, and later expressing regret at having chosen those particular rabid weasels and that pair of pants. ---maf
On Wed, 2 Sep 2020, Warren Kumari wrote:
The root issue here is that the *public* RFO is incomplete / unclear. Something something flowspec something, blocked flowspec, no more something does indeed explain that something bad happened, but not what caused the lack of withdraws / cascading churn. As with many interesting outages, I suspect that we will never get the full story, and "Something bad happened, we fixed it and now it's all better and will never happen ever again, trust us..." seems to be the new normal for public postmortems...
It's possible Level3's people don't fully understand what happened, or that the "bad flowspec rule" causing BGP sessions to repeatedly flap network-wide triggered software bugs on their routers. You've never seen rpd stuck at 100% CPU for hours, or an MX960 advertise history routes to external peers even after the internal session that had advertised the route to it has been cleared? To quote Zaphod Beeblebrox: "Listen, three eyes, don't you try to outweird me. I get stranger things than you free with my breakfast cereal." Kick a BGP implementation hard enough, and weird shit is likely to happen. ---------------------------------------------------------------------- Jon Lewis, MCP :) | I route StackPath, Sr. Neteng | therefore you are _________ http://www.lewis.org/~jlewis/pgp for PGP public key_________
Cisco had a bug a few years back that affected metro switches such that they would not withdraw routes upstream. We had an internal outage, and one of my carriers kept advertising our prefixes even though we had withdrawn the routes. We tried downing the neighbor and even shutting down the physical interface, to no avail. The carrier kept blackholing us until they shut it down on their metro switch.
On 2/Sep/20 15:12, Baldur Norddahl wrote:
I am not buying it. No normal implementation of BGP stays online, replying to heartbeats and accepting updates from eBGP peers, yet after 5 hours has failed to process withdrawals from customers.
A BGP RFC spec. is not the same thing as a vendor translating that spec. into code. If it were, we'd never need this list. Triple the effort when deployed and operated at scale. Mark.
And just to add a little bit of fuel to this fire, let me share that the base principle of the BGP spec, mandating that routes be withdrawn when the session goes down, could, in the glory of the IETF, soon be history :(

It started with the proposal to make BGP state "persistent": https://tools.ietf.org/html/draft-uttaro-idr-bgp-persistence-00

Now it has been smoothed and improved a bit, but the effect is still the same: keep the routes and do not withdraw them when the session goes down: https://tools.ietf.org/html/draft-ietf-idr-long-lived-gr-00

Sure, it is up to the operator's discretion to enable it or not. But if this proceeds to a formal RFC, we can no longer blame such behaviour as a violation of the BGP RFC. Hint: the maximum LLST (Long-Lived Stale Time) value, a 24-bit field counted in seconds, allows over 194 days of prefix retention, not just the 5 hours or so mentioned here with dislike :)

Best, R.

On Thu, Sep 3, 2020 at 6:20 PM Mark Tinka <mark.tinka@seacom.com> wrote:
On 2/Sep/20 15:12, Baldur Norddahl wrote:
I am not buying it. No normal implementation of BGP stays online, replying to heartbeats and accepting updates from eBGP peers, yet after 5 hours has failed to process withdrawals from customers.
A BGP RFC spec. is not the same thing as a vendor translating that spec. into code. If it were, we'd never need this list.
Triple the effort when deployed and operated at scale.
Mark.
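For anyone checking the 194-day figure in Robert's note, the arithmetic for a 24-bit stale time expressed in seconds is simply:

    # Maximum Long-Lived Stale Time: a 24-bit field, counted in seconds.
    max_llst_seconds = 2**24 - 1           # 16,777,215 seconds
    print(max_llst_seconds / 86400)        # ~194.2 days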
participants (14)
- Aaron C. de Bruyn
- Baldur Norddahl
- Dantzig, Brian
- Jon Lewis
- Luke Guillory
- Mark Tinka
- Martijn Schmidt
- Mike Hammett
- Randy Bush
- Robert Raszuk
- Saku Ytti
- Tom Beecher
- Vincent Bernat
- Warren Kumari