Has anyone else experienced issues where Telia won't withdraw (though will happily accept an overriding) prefixes for the past week, at least? e.g. 2620:6e:a003::/48 was a test prefix and should not now appear in any DFZ; it has not been announced for a few days at least, but shows up in Telia's LG and RIPE RIS as transiting Telia. Telia's LG traceroute doesn't, of course, go anywhere; traces die immediately after a hop or with a !N. Wouldn't be a problem except that I needed to withdraw another route due to a separate issue, which wouldn't budge out of Telia's tables until it was replaced with something else of higher pref. Matt
This same issue happened in Los Angeles a number of years ago, but for both IPv4 and v6. They need to set up sane BGP timers, and/or advocate the use of BFD for BGP sessions, both customer-facing and internal. Ryan On Nov 15 2020, at 5:58 pm, Matt Corallo <nanog@as397444.net> wrote:
Has anyone else experienced issues where Telia won't withdraw (though will happily accept an overriding) prefixes for the past week, at least?
e.g. 2620:6e:a003::/48 was a test prefix and should not now appear in any DFZ; it has not been announced for a few days at least, but shows up in Telia's LG and RIPE RIS as transiting Telia. Telia's LG traceroute doesn't, of course, go anywhere; traces die immediately after a hop or with a !N.
Wouldn't be a problem except that I needed to withdraw another route due to a separate issue, which wouldn't budge out of Telia's tables until it was replaced with something else of higher pref.
Matt
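As a rough illustration of Ryan's point about timers (a sketch with illustrative numbers, not a tuning recommendation): with common default BGP timers a silently dead peer can pin stale routes for minutes, while BFD detects the failure in under a second.

```python
# Worst-case failure-detection arithmetic; numbers are illustrative only.

def bgp_detection_worst_case(hold_time_s: float) -> float:
    """Without BFD, a silently dead peer is only declared down when the
    BGP hold timer expires, so stale routes can linger this long."""
    return hold_time_s

def bfd_detection_worst_case(tx_interval_ms: float, multiplier: int) -> float:
    """BFD declares the session down after `multiplier` consecutive
    missed control packets."""
    return tx_interval_ms * multiplier / 1000.0

# Common defaults: keepalive 60s, hold 180s.
print(bgp_detection_worst_case(180.0))      # 180.0 seconds
# A typical aggressive BFD setting: 300ms interval, multiplier 3.
print(bfd_detection_worst_case(300.0, 3))   # 0.9 seconds
```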
Probably a ghost route. Such things happen :( https://labs.ripe.net/Members/romain_fontugne/bgp-zombies Their (nice) LG shows that it's still advertised from a router of theirs in Frankfurt (iBGP next hop ::ffff:2.255.251.224 – so, by the way, they use 6PE). Your best option would probably be to re-advertise the exact same prefix, then re-withdraw it, then yell at Telia's NOC if it fails... Some years ago we experienced something similar (it was a router of TI Sparkle still advertising a prefix of ours in Asia to their clients, which they were previously receiving from our former transit GTT – we were advertising it in Europe...).
Le 16 nov. 2020 à 02:58, Matt Corallo <nanog@as397444.net> a écrit :
Has anyone else experienced issues where Telia won't withdraw (though will happily accept an overriding) prefixes for the past week, at least?
e.g. 2620:6e:a003::/48 was a test prefix and should not now appear in any DFZ; it has not been announced for a few days at least, but shows up in Telia's LG and RIPE RIS as transiting Telia. Telia's LG traceroute doesn't, of course, go anywhere; traces die immediately after a hop or with a !N.
Wouldn't be a problem except that I needed to withdraw another route due to a separate issue, which wouldn't budge out of Telia's tables until it was replaced with something else of higher pref.
Matt
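Olivier's "zombie" scenario and his re-advertise-then-re-withdraw workaround can be sketched with a toy RIB. This is purely illustrative: the `lost` flag stands in for whatever swallowed the original withdrawal inside the transit network, and the attributes are made up.

```python
# Toy model of a BGP "zombie" route: a withdrawal is lost somewhere in
# the network, leaving a stale route behind. Re-advertising the exact
# same prefix refreshes the state, and a second withdrawal (this time
# delivered) finally removes it.

class ToyRib:
    def __init__(self):
        self.routes = {}  # prefix -> path attributes

    def announce(self, prefix, attrs):
        self.routes[prefix] = attrs

    def withdraw(self, prefix, lost=False):
        if lost:
            return  # models the withdrawal never reaching this router
        self.routes.pop(prefix, None)

rib = ToyRib()
rib.announce("2620:6e:a003::/48", {"as_path": [1299, 397444]})
rib.withdraw("2620:6e:a003::/48", lost=True)    # the zombie is born
assert "2620:6e:a003::/48" in rib.routes        # still visible in the LG

rib.announce("2620:6e:a003::/48", {"as_path": [1299, 397444]})  # re-advertise
rib.withdraw("2620:6e:a003::/48")                               # re-withdraw
assert "2620:6e:a003::/48" not in rib.routes
```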
Yeah, I did try that on that test prefix, but it just stuck around anyway. I don't care too much; it's just some stale test prefix. Sadly, I now see it again with 2620:6e:a002::/48, which, somewhat more impressively, is now generating a routing loop Ashburn <-> NYC, and which has always been announced from other places and was dropped/re-announced as well. Must just be something with my particular prefixes, oh well. Matt On 11/15/20 10:40 PM, Olivier Benghozi wrote:
Probably a ghost route. Such things happen :(
https://labs.ripe.net/Members/romain_fontugne/bgp-zombies
Their (nice) LG shows that it's still advertised from a router of theirs in Frankfurt (iBGP next hop ::ffff:2.255.251.224 – so by the way they use 6PE).
Your best option would probably be to re-advertise the exact same prefix, then re-withdraw it, then yell at Telia's NOC if it fails...
Some years ago we experienced something similar (it was a router of TI Sparkle still advertising a prefix of ours in Asia to their clients, which they were previously receiving from our former transit GTT – we were advertising it in Europe...).
Maybe one of the routers on the path doesn't like the large communities inside those routes? :) By the way, we currently see 2620:6e:a002::/48 at LINX LON1 from Choopa and HE...
Le 16 nov. 2020 à 04:44, Matt Corallo <nanog@as397444.net> a écrit :
Yeah, I did try that on that test prefix, but it just stuck around anyway. I don't care too much; it's just some stale test prefix.
Sadly, I now see it again with 2620:6e:a002::/48, which, somewhat more impressively, is now generating a routing loop Ashburn <-> NYC, and which has always been announced from other places and was dropped/re-announced as well.
Must just be something with my particular prefixes, oh well.
Matt
Maybe? It's never been an issue before. In this case the route does have a depref community on Telia, which is why one wouldn't expect it via the same path, but the other ghost route in question never had anything similar. Matt
On Nov 15, 2020, at 23:07, Olivier Benghozi <olivier.benghozi@wifirst.fr> wrote:
Maybe one of the routers on the path doesn't like the large communities inside those routes? :) By the way, we currently see 2620:6e:a002::/48 at LINX LON1 from Choopa and HE...
Le 16 nov. 2020 à 04:44, Matt Corallo <nanog@as397444.net> a écrit :
Yeah, I did try that on that test prefix, but it just stuck around anyway. I don't care too much; it's just some stale test prefix.
Sadly, I now see it again with 2620:6e:a002::/48, which, somewhat more impressively, is now generating a routing loop Ashburn <-> NYC, and which has always been announced from other places and was dropped/re-announced as well.
Must just be something with my particular prefixes, oh well.
Matt
For those curious, Johan indicated on Twitter this was a JunOS bug. https://twitter.com/gustawsson/status/1328298914785730561 Matt
On Nov 15, 2020, at 23:13, Matt Corallo <nanog@as397444.net> wrote:
Maybe? It's never been an issue before. In this case the route does have a depref community on Telia, which is why one wouldn't expect it via the same path, but the other ghost route in question never had anything similar.
Matt
On Nov 15, 2020, at 23:07, Olivier Benghozi <olivier.benghozi@wifirst.fr> wrote:
Maybe one of the routers on the path doesn't like the large communities inside those routes? :) By the way, we currently see 2620:6e:a002::/48 at LINX LON1 from Choopa and HE...
----- On Nov 15, 2020, at 5:58 PM, Matt Corallo nanog@as397444.net wrote:
Has anyone else experienced issues where Telia won't withdraw (though will happily accept an overriding) prefixes for the past week, at least?
I have seen issues like this in a network that I operated. In that particular case, it was an internal IPv4 10/8 route which was withdrawn, along with a few hundred other routes. The withdrawal was initiated on a DC exit router, in a Clos network with leaf, spine, and superspine layers. On the spine layer, I observed that BGP withdrawals, although being received, were not processed by the control plane. Further investigation, and working with the vendor's TAC, revealed that on that particular platform, the BGP process would stop processing withdrawals in a very nasty race condition that was very difficult to reproduce. This was the first (and so far only) time in my 20+ years of working with BGP that I've observed such a weird bug. Since I operated the entire network, it was fairly easy to find the culprit. The why took some more time. If I were in your shoes, I'd ping Telia's NOC to see what's going on. I would not be surprised if they'd be hitting a similar issue. Thanks, Sabri
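The class of bug Sabri describes, where withdrawals are received but never acted on, often comes down to update queueing or coalescing logic rather than the protocol itself. A hypothetical sketch (not the actual vendor bug) of how naive per-prefix coalescing can eat a withdrawal:

```python
# Hypothetical sketch: coalescing a burst of BGP messages per prefix.
# BGP semantics say the *latest* message for a prefix wins (a new UPDATE
# implicitly replaces earlier state). A "first message wins" coalescer
# silently drops the withdrawal, leaving a stuck route.

def coalesce_buggy(messages):
    """Wrong: keeps the first message seen per prefix."""
    out = {}
    for kind, prefix in messages:
        out.setdefault(prefix, kind)
    return out

def coalesce_correct(messages):
    """Right: the latest message per prefix wins."""
    out = {}
    for kind, prefix in messages:
        out[prefix] = kind
    return out

burst = [("announce", "10.0.0.0/8"), ("withdraw", "10.0.0.0/8")]
assert coalesce_buggy(burst) == {"10.0.0.0/8": "announce"}    # route stuck
assert coalesce_correct(burst) == {"10.0.0.0/8": "withdraw"}  # route gone
```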
See my latest response from this morning. Telia's "Head of Network Engineering & Architecture" confirmed on Twitter this was due to a (now-worked-around) bug in JunOS. https://twitter.com/gustawsson/status/1328298914785730561 Matt On 11/16/20 2:13 PM, Sabri Berisha wrote:
----- On Nov 15, 2020, at 5:58 PM, Matt Corallo nanog@as397444.net wrote:
Has anyone else experienced issues where Telia won't withdraw (though will happily accept an overriding) prefixes for the past week, at least?
I have seen issues like this in a network that I operated. In that particular case, it was an internal IPv4 10/8 route which was withdrawn, along with a few hundred other routes. The withdrawal was initiated on a DC exit router, in a Clos network with leaf, spine, and superspine layers. On the spine layer, I observed that BGP withdrawals, although being received, were not processed by the control plane.
Further investigation, and working with the vendor's TAC, revealed that on that particular platform, the BGP process would stop processing withdrawals in a very nasty race condition that was very difficult to reproduce.
This was the first (and so far only) time in my 20+ years of working with BGP that I've observed such a weird bug. Since I operated the entire network, it was fairly easy to find the culprit. The why took some more time.
If I were in your shoes, I'd ping Telia's NOC to see what's going on. I would not be surprised if they'd be hitting a similar issue.
Thanks,
Sabri
----- On Nov 16, 2020, at 11:45 AM, Matt Corallo nanog@as397444.net wrote: Hi,
See my latest response from this morning. Telia's "Head of Network Engineering & Architecture" confirmed on Twitter this was due to a (now-worked-around) bug in JunOS.
Interesting. A long time ago, in a galaxy far, far away, when I was a JTAC engineer, policy was that once a PR was hit in the field, it would be marked public. Also, in the case that I described, it wasn't a Junos device. Makes me wonder how bugs like that get introduced. One would expect that after 20+ years of writing BGP code, handling a withdrawal would be easy-peasy. Thanks, Sabri
On Mon, 16 Nov 2020 17:36:58 -0800, Sabri Berisha said:
Also, in the case that I described, it wasn't a Junos device. Makes me wonder how bugs like that get introduced. One would expect that after 20+ years of writing BGP code, handling a withdrawal would be easy-peasy.
Handling a withdrawal is easy. Handling one correctly, without race conditions, when you're seeing withdrawals and additions from multiple BGP sessions concurrently, while also maintaining RIB and FIB consistency and continuing to forward customer packets, is a little bit harder.
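To make Valdis's point concrete: even a stripped-down Loc-RIB sketch (illustrative only, nothing like a production implementation) has to track which session contributed each path and re-run best-path selection on every withdrawal:

```python
# Toy Loc-RIB: paths are keyed by (prefix, session), and any withdrawal
# forces a fresh best-path decision. Real implementations must do this
# concurrently across sessions while keeping the FIB consistent.

class LocRib:
    def __init__(self):
        self.paths = {}  # prefix -> {session_id: local_pref}

    def update(self, prefix, session, local_pref):
        self.paths.setdefault(prefix, {})[session] = local_pref

    def withdraw(self, prefix, session):
        sessions = self.paths.get(prefix, {})
        sessions.pop(session, None)
        if not sessions:
            self.paths.pop(prefix, None)

    def best(self, prefix):
        """Best path = highest local preference (only tiebreaker modeled)."""
        sessions = self.paths.get(prefix)
        if not sessions:
            return None
        return max(sessions, key=sessions.get)

rib = LocRib()
rib.update("10.0.0.0/8", "spine1", 100)
rib.update("10.0.0.0/8", "spine2", 200)
assert rib.best("10.0.0.0/8") == "spine2"
rib.withdraw("10.0.0.0/8", "spine2")
assert rib.best("10.0.0.0/8") == "spine1"   # falls back to next-best path
rib.withdraw("10.0.0.0/8", "spine1")
assert rib.best("10.0.0.0/8") is None       # fully withdrawn
```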
Surely they can just put them in an array. ;) On Mon, Nov 16, 2020, 21:54 Valdis Klētnieks <valdis.kletnieks@vt.edu> wrote:
On Mon, 16 Nov 2020 17:36:58 -0800, Sabri Berisha said:
Also, in the case that I described, it wasn't a Junos device. Makes me wonder how bugs like that get introduced. One would expect that after 20+ years of writing BGP code, handling a withdrawal would be easy-peasy.
Handling a withdrawal is easy.
Handling one correctly, without race conditions, when you're seeing withdrawals and additions from multiple BGP sessions concurrently, while also maintaining RIB and FIB consistency and continuing to forward customer packets, is a little bit harder.
On 17.11.2020 around 02:36 Sabri Berisha wrote:
Interesting. A long time ago, in a galaxy far, far away, when I was a JTAC engineer, policy was that once a PR was hit in the field, it would be marked public.
Also, in the case that I described, it wasn't a Junos device. Makes me wonder how bugs like that get introduced. One would expect that after 20+ years of writing BGP code, handling a withdrawal would be easy-peasy.
New code, new features, new problems. E.g. public PR1323306 describes a stuck-BGP situation. (And the fixed code should also address a hidden PR that causes down/stale sessions, leading to stuck routes even without a both-side GRES event.) All very, very special cases... but some of us will find / get hit by them (unfortunately). Markus
On Tue, 17 Nov 2020 at 03:40, Sabri Berisha <sabri@cluecentral.net> wrote: Hey Sabri,
Also, in the case that I described, it wasn't a Junos device. Makes me wonder how bugs like that get introduced. One would expect that after 20+ years of writing BGP code, handling a withdrawal would be easy-peasy.
I don't think this is related to skill, or that there was some hard programming problem that DE couldn't solve. These are honest mistakes. I've not seen the frequency of these bugs change at all in my tenure; NOS bugs are as common now as they were in the 90s.

I put most of the blame on the market: we've modelled the commercial router market so that a poor-quality NOS is good for business and a good-quality NOS is bad for business. I don't think this is in anyone's formal business plan, or that companies even realise they are not trying to make a good NOS; I think it's emergent behaviour due to the market, and people follow that market demand unknowingly. If we suddenly had one commercial NOS which was 100% bug-free, many of their customers would stop buying support and would rely on spare HW and Internet forums for configuration help. A lot of us only need contracts to deal with the novel bugs all of us find on a regular basis, so a good NOS would immediately reduce revenue. For some reason Windows, macOS or Linux almost never have novel bugs that the end user finds, and when those are found, it's big news. Meanwhile, we don't go a month without hitting a novel bug in one of our NOS, and no one cares; it's business as usual.

I also put a lot of blame on C. It was a terrific language when compiling had to be fast; basically a macro assembler. Now the utility of being 'close to HW' is gone: the CPU does so much that the C compiler has no control over that it's not really even executing the same code as written anymore. MSFT estimated that >70% of their bugs are related to memory safety. We could accomplish significant improvements in software quality if we'd ditch C, allow the computer to do more formal correctness checks at compile time, and design languages which lend themselves to this.

We constantly misattribute problems (like in this post) to config or HW, while the most common reasons for outages are pilot error and SW defect, and very little engineering time is spent on those.
And often the time spent improving the first two increases the risk of the latter two, reducing mean availability over time. -- ++ytti
On 11/17/20 08:54, Saku Ytti wrote:
I put most of the blame on the market: we've modelled the commercial router market so that a poor-quality NOS is good for business and a good-quality NOS is bad for business. I don't think this is in anyone's formal business plan, or that companies even realise they are not trying to make a good NOS; I think it's emergent behaviour due to the market, and people follow that market demand unknowingly. If we suddenly had one commercial NOS which was 100% bug-free, many of their customers would stop buying support and would rely on spare HW and Internet forums for configuration help.
Not to mention that many of us would not need to be around to babysit all this dodgy software. Definitely bad for business :-). Mark.
On Behalf Of Mark Tinka Sent: Tuesday, November 17, 2020 4:32 PM
On 11/17/20 08:54, Saku Ytti wrote:
I put most of the blame on the market: we've modelled the commercial router market so that a poor-quality NOS is good for business and a good-quality NOS is bad for business. I don't think this is in anyone's formal business plan, or that companies even realise they are not trying to make a good NOS; I think it's emergent behaviour due to the market, and people follow that market demand unknowingly. If we suddenly had one commercial NOS which was 100% bug-free, many of their customers would stop buying support and would rely on spare HW and Internet forums for configuration help.
Not to mention that many of us would not need to be around to babysit all this dodgy software.
Definitely bad for business :-).
We're already being obsoleted by "self-driving networks"; there's no limit to what one can automate... But then one needs someone to babysit all the automation systems. adam
Saku Ytti Sent: Tuesday, November 17, 2020 6:55 AM
On Tue, 17 Nov 2020 at 03:40, Sabri Berisha <sabri@cluecentral.net> wrote:
Hey Sabri,
Also, in the case that I described, it wasn't a Junos device. Makes me wonder how bugs like that get introduced. One would expect that after 20+ years of writing BGP code, handling a withdrawal would be easy-peasy.
I don't think this is related to skill, or that there was some hard programming problem that DE couldn't solve. These are honest mistakes. I've not seen the frequency of these bugs change at all in my tenure; NOS bugs are as common now as they were in the 90s.
I put most of the blame on the market: we've modelled the commercial router market so that a poor-quality NOS is good for business and a good-quality NOS is bad for business. I don't think this is in anyone's formal business plan, or that companies even realise they are not trying to make a good NOS; I think it's emergent behaviour due to the market, and people follow that market demand unknowingly. If we suddenly had one commercial NOS which was 100% bug-free, many of their customers would stop buying support and would rely on spare HW and Internet forums for configuration help. A lot of us only need contracts to deal with the novel bugs all of us find on a regular basis, so a good NOS would immediately reduce revenue. For some reason Windows, macOS or Linux almost never have novel bugs that the end user finds, and when those are found, it's big news. Meanwhile, we don't go a month without hitting a novel bug in one of our NOS, and no one cares; it's business as usual.
I also put a lot of blame on C. It was a terrific language when compiling had to be fast; basically a macro assembler. Now the utility of being 'close to HW' is gone: the CPU does so much that the C compiler has no control over that it's not really even executing the same code as written anymore. MSFT estimated that >70% of their bugs are related to memory safety. We could accomplish significant improvements in software quality if we'd ditch C, allow the computer to do more formal correctness checks at compile time, and design languages which lend themselves to this.
We constantly misattribute problems (like in this post) to config or HW, while the most common reasons for outages are pilot error and SW defect, and very little engineering time is spent on those. And often the time spent improving the first two increases the risk of the latter two, reducing mean availability over time.
I agree with everything but the last statement.
From my experience, most of the SPs spend considerable time testing for SW defects on the features (and combinations of features) that will be used, and at the intended scale; that's how you identify most of the bugs. What you're left with afterwards are special packets of death or some slow memory leaks (basically the more exotic stuff).
adam
On 11/18/20 14:58, adamv0025@netconsultings.com wrote:
From my experience, most of the SPs spend considerable time testing for SW defects on the features (and combinations of features) that will be used, and at the intended scale,
I'm not so sure about that, actually. I'd say there are some ISPs that spend some (or a considerable) amount of time testing for software defects. My anecdotal experience is that most ISPs have neither the time, tools, nor resources to do significant testing of software. More like, "is the version anything after R1, has it been around long enough, has it been recommended by TAC, are the -nsp lists raving on about it, is it a maintenance release, is the caveat list too long, does my vendor SE approve", type-thing.
that's how you identify most of the bugs. What you're left with afterwards are special packets of death or some slow memory leaks (basically the more exotic stuff).
Which the majority of ISPs likely will never test for. Mark.
I also put a lot of blame on C, it was a terrific language when compiling had to be fast. Basically macro assembler. Now the utility of being 'close to HW' is gone, as the CPU does so much C compiler has no control over, it's not really even executing the same code as-written anymore. MSFT estimated >70% of their bugs are related to memory safety. We could accomplish significant improvements in software quality if we'd ditch C and allow the computer to do more formal correctness checks at compile time and design languages which lend towards this.
Agree 1000%. I think this is greatly compounded by current generations of programmers who come out of school without having had much experience with low-level memory management, having mostly worked in more modern languages that handle such things in a much better way. Moving from college Python to mature C code with a hellscape of pointers must be a pretty jarring transition. :) On Tue, Nov 17, 2020 at 1:56 AM Saku Ytti <saku@ytti.fi> wrote:
participants (11)
- adamv0025@netconsultings.com
- Mark Tinka
- Markus Weber (FvD)
- Matt Corallo
- Neil Hanlon
- Olivier Benghozi
- Ryan Hamel
- Sabri Berisha
- Saku Ytti
- Tom Beecher
- Valdis Klētnieks