On Wed, Sep 2, 2020 at 3:04 PM Vincent Bernat <bernat@luffy.cx> wrote:
❦ 2 September 2020 16:35 +03, Saku Ytti:
I am not buying it. No normal implementation of BGP stays online, replying to heartbeats and accepting updates from eBGP peers, yet after 5 hours still fails to process withdrawals from customers.
I can imagine writing a BGP implementation like this:
a) own queue for keepalives, which I always serve first, fully
b) own queue for updates, which I serve second
c) own queue for withdraws, which I serve last
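To make the starvation concrete, here is a minimal sketch of that three-queue scheme in Python. Everything in it is hypothetical (the class name, the one-message-per-tick budget, the arrival pattern) and it is not taken from any real BGP implementation; it only shows that with strict priority and a steady stream of updates, a queued withdraw is never served while keepalives keep going out on time.

```python
from collections import deque


class StrictPriorityBgpQueue:
    """Hypothetical strict-priority scheduler: keepalive > update > withdraw."""

    ORDER = ("keepalive", "update", "withdraw")

    def __init__(self):
        self.queues = {kind: deque() for kind in self.ORDER}

    def enqueue(self, kind, msg):
        self.queues[kind].append(msg)

    def next_message(self):
        # Always drain the highest-priority non-empty queue first.
        for kind in self.ORDER:
            if self.queues[kind]:
                return kind, self.queues[kind].popleft()
        return None


def simulate(ticks):
    """Assumed load: one update per tick, a keepalive every third tick,
    and a budget of one processed message per tick."""
    sched = StrictPriorityBgpQueue()
    sched.enqueue("withdraw", "withdraw 192.0.2.0/24")  # customer withdraw
    processed = []
    for tick in range(ticks):
        sched.enqueue("update", f"update-{tick}")
        if tick % 3 == 0:
            sched.enqueue("keepalive", f"keepalive-{tick}")
        kind, _ = sched.next_message()
        processed.append(kind)
    return processed, len(sched.queues["withdraw"])
```

Under that assumed load the update queue is never empty, so the withdraw sits there for as long as the churn lasts, while the session looks perfectly healthy from the outside.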
Or maybe graceful restart was configured without a timeout on IPv4/IPv6? The flowspec rule severed the BGP session abruptly, stale routes are kept due to graceful restart (except flowspec rules), BGP sessions are reestablished, but the flowspec rule is handled before reaching EoR and we loop from there.
... or all routes are fed into some magic route optimization box which is designed to keep things more stable and take advantage of Cisco's "step-10" to suck more traffic, or....

The root issue here is that the *public* RFO is incomplete / unclear. "Something something flowspec something, blocked flowspec, no more something" does indeed explain that something bad happened, but not what caused the lack of withdraws / cascading churn.

As with many interesting outages, I suspect that we will never get the full story, and "Something bad happened, we fixed it and now it's all better and will never happen ever again, trust us..." seems to be the new normal for public postmortems...

W
-- Make sure your code "does nothing" gracefully. - The Elements of Programming Style (Kernighan & Plauger)
-- I don't think the execution is relevant when it was obviously a bad idea in the first place. This is like putting rabid weasels in your pants, and later expressing regret at having chosen those particular rabid weasels and that pair of pants. ---maf