Question about normal ops - BGP Flaps nightly
Howdy!
A question of interest to me, currently, is whether it's normal for providers to cause BGP flaps to their customers nightly... This seems, in my case, to be the provider PROBABLY updating prefix-filters on my session(s).
Particularly, AS56554 is currently getting v4/v6 transit from 2 providers, one of which we have 2 links toward. That provider appears to flap both of our IPv6 (only) BGP peers at about the same time each night. This smells like "filter updates", but something that's different than the v4 filter update? (Or perhaps they have no v4 filtering to update?)
In the end, should customers expect to see their sessions bounce nightly (or on a regular cadence)? It hasn't been my experience in other situations...
-chris
On Nov 21, 2019, at 4:45 AM, Christopher Morrow <morrowc.lists@gmail.com> wrote:
In the end, should customers expect nightly (or on a regular cadence) to see their sessions bounce? It hasn't been my experience in other situations...
This seems unusual, perhaps a bug in their tooling or their config where it’s doing a hard clear vs soft clear on the session?
- Jared
On Thu, Nov 21, 2019 at 4:48 AM Jared Mauch <jared@puck.nether.net> wrote:
This seems unusual, perhaps a bug in their tooling or their config where it’s doing a hard clear vs soft clear on the session?
This was sort of my thinking, but I was unsure if there was some new process and/or bug which other edge-y folk were dealing with of late. I can/will ask the provider in question (a local APAC provider) if they are aware of the actions they are taking.
No. There should be no reason to bounce the session. Do you have soft updates turned on?
-mel via cell
I agree that this sounds like an automated process in some way.
I would suspect that either a vendor code update changed something such that a given command that would not cause a session reset now does, or they changed their automation to include a command that would cause a reset without realizing it (it slipped through the cracks, etc.).
On Thu, Nov 21, 2019 at 9:18 AM Mel Beckman <mel@beckman.org> wrote:
No. There should be no reason to bounce the session. Do you have soft updates turned on?
-mel via cell
On Fri, Nov 22, 2019 at 12:54 AM Tom Beecher <beecher@beecher.cc> wrote:
I agree that this sounds like an automated process in some way.
I would suspect that either a vendor code update changed something such that a given command that would not cause session reset now does, or they changed their automation to include a command that would cause a reset without realizing it/slipped through the cracks / etc.
Thanks to some private chat with another NANOG participant, it was noted the reason for failure is:
"Error event Operation timed out(60) for I/O session - closing it"
This is fine, I suppose, except that I have v4/v6 sessions on the same ptp link/path. So, if v6 times out, I'd have expected v4 to also time out.
Strangely, I had thought we were told the 2 links we have land on 2 different devices, but router-id tells me that's false as well. :( The sessions appear to reset on both devices (according to syslog) at the same time; I had thought (because our alerter is telling me) the sessions had a gap between the 2 drops.
The physical layer is some bidi fiber path across an L2 (ether) network to the provider; perhaps the problem isn't on the L3/BGP parts here, but in the L2 network between. We are at the end of our time here, so I think I'll gather some logs and see if the provider can make sense of the issues.
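When gathering logs, one way to check whether the sessions really drop together (as syslog suggests) or with a gap (as the alerter suggests) is to group the session-down timestamps. This is only a minimal sketch: it assumes the events have already been parsed into (timestamp, peer) pairs, and the peer labels, timestamps, and 5-second window are made up for illustration, not taken from the real logs.

```python
from datetime import datetime, timedelta


def group_drops(events, window=timedelta(seconds=5)):
    """Group (timestamp, peer) events that fall within `window` of a group's first event."""
    groups = []
    for ts, peer in sorted(events):
        if groups and ts - groups[-1][0][0] <= window:
            groups[-1].append((ts, peer))
        else:
            groups.append([(ts, peer)])
    return groups


# Hypothetical session-down events for the two IPv6 sessions.
events = [
    (datetime(2019, 11, 20, 3, 0, 1), "v6-link1"),
    (datetime(2019, 11, 20, 3, 0, 2), "v6-link2"),
    (datetime(2019, 11, 21, 3, 0, 3), "v6-link1"),
    (datetime(2019, 11, 21, 3, 0, 3), "v6-link2"),
]

for group in group_drops(events):
    peers = sorted(peer for _, peer in group)
    print(group[0][0].isoformat(), peers)
```

If every group contains both peers, the drops are effectively simultaneous, which would point toward something shared (the far-end device or the L2 path) rather than per-session filter pushes.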
A BGP reset can cause routing trouble for as much as 15 minutes. Since you have two sessions, that mitigates the problem somewhat. But nevertheless this will not be acceptable.
Regards,
Baldur
On Thu, 21 Nov 2019 at 19:44, Baldur Norddahl <baldur.norddahl@gmail.com> wrote:
A BGP reset can cause routing trouble for as much as 15 minutes. Since you have two sessions that mitigates the problem somewhat. But nevertheless this will not be acceptable.
As there are best path algorithms which consider route age, BGP reset impact may be indefinite. -- ++ytti
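To illustrate the route-age point: some best path implementations prefer the oldest path when everything else ties. The following is a toy sketch under that assumption only; the provider labels and timestamps are hypothetical, and real implementations compare many other attributes first.

```python
def best_path(paths):
    """Toy tie-break: among otherwise equal paths, prefer the oldest (smallest learned_at)."""
    return min(paths, key=lambda p: p["learned_at"])


# Hypothetical paths to the same prefix, equal in every attribute except age.
paths = [
    {"via": "provider-1", "learned_at": 100},
    {"via": "provider-2", "learned_at": 200},
]
before = best_path(paths)["via"]

# provider-1's session resets and comes back later: its path is now the newest.
paths[0]["learned_at"] = 1060
after = best_path(paths)["via"]

print(before, "->", after)
```

After the reset the selection stays on the other path until something else changes, which is why the impact can outlast the flap itself.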
On Fri, Nov 22, 2019 at 2:01 AM Saku Ytti <saku@ytti.fi> wrote:
As there are best path algorithms which consider route age, BGP reset impact may be indefinite.
fortunately we have a second actual provider... so this all isn't super impacting to us, just weird and unexpected on my part.
On Fri, Nov 22, 2019 at 1:21 AM Christopher Morrow <morrowc.lists@gmail.com> wrote:
fortunately we have a second actual provider... so this all isn't super impacting to us, just weird and unexpected on my part.
No, that is not helping. When the BGP session flaps, your routes via that provider are withdrawn. Everyone out there who was using those routes will need to switch. But consider the following:
ISP A has routes from both of your providers
ISP B has A as uplink
BGP works so that ISP A only announces to ISP B the route that it is actually using. ISP B therefore does not have both of your routes. When the active route is withdrawn, ISP B will momentarily be without any route to your network. It can take some time after the withdrawal before ISP A announces that it is now using the alternative route. This gets worse with longer chains. Also, some ISPs use route flap limiting techniques that can prolong this process.
As I said, my experience is that you can expect as much as 15 minutes of flaky internet after a BGP reset. This is with multiple transit providers.
I cannot say too much about why you have BGP resets, but I can say that you really want it fixed. It will affect your connectivity.
Regards,
Baldur
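The ISP A / ISP B scenario can be sketched as a toy model. This is not a BGP implementation: the `Router` class, the "shortest path, then lowest label" selection rule, and the provider labels P1/P2 are all assumptions made for illustration. It only shows that B ever holds a single path via A, and that B's view changes only once A's update is delivered.

```python
from dataclasses import dataclass, field


@dataclass
class Router:
    name: str
    # Paths learned per neighbor: neighbor label -> AS path (tuple of labels).
    adj_rib_in: dict = field(default_factory=dict)

    def best(self):
        """Toy selection: shortest path wins, ties broken by path contents."""
        if not self.adj_rib_in:
            return None
        return min(self.adj_rib_in.values(), key=lambda path: (len(path), path))

    def advertisement(self):
        """Only the best path is passed on, with this router prepended."""
        path = self.best()
        return None if path is None else (self.name,) + path


# ISP A hears the prefix from both (hypothetical) providers P1 and P2.
isp_a = Router("A", {"P1": ("P1", "AS56554"), "P2": ("P2", "AS56554")})
# ISP B has A as its uplink and receives only A's best path.
isp_b = Router("B", {"A": isp_a.advertisement()})
print("B knows:", isp_b.adj_rib_in)

# The session toward P1 flaps: A loses that path and reselects.
del isp_a.adj_rib_in["P1"]
stale = isp_b.best()
print("B before A's update arrives:", stale)

# Only when A's update is delivered does B learn the alternative.
isp_b.adj_rib_in["A"] = isp_a.advertisement()
fresh = isp_b.best()
print("B after A's update arrives:", fresh)
```

How long B sits in the in-between state depends on update timers and any flap damping along the chain, which this sketch does not model.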
On Fri, Nov 22, 2019 at 12:32 PM Baldur Norddahl <baldur.norddahl@gmail.com> wrote:
As I said, my experience is that you can expect as much as 15 minutes of flaky internet after a BGP reset. This is with multiple transit providers.
Yup, I'm sensitive to flapping causing problems. This was why I started the thread, which really should have been: "Is there a well known bug people are working around? or is this a new problem I should chase with the provider? or 'nah, everyone does this, you just aren't normally paying attention'"
I can not say too much about why you have BGP resets, but I can say that you really want it fixed. It will affect your connectivity.
fortunately 3am local time is not prime-internet-use time :) phew! (not a great excuse though, of course)
I'll be chasing up the provider to see what's up. thanks!
-chris
On Fri, Nov 22, 2019 at 12:40 PM Christopher Morrow <morrowc.lists@gmail.com> wrote:
fortunately 3am local time is not prime-internet-use time :) phew! (not a great excuse though, of course)
The other saving grace / "meh" is that this is for a conference network, and we are picking up sticks and leaving tomorrow... so, we will let the provider know that there is something that should be fixed, but a: our pain will have stopped :-P and b: we won't really have a good way to know if they have fixed the issue (other than perhaps watching for a spike of withdraws / reannouncements every 24 hours through this AS path)
W
-- I don't think the execution is relevant when it was obviously a bad idea in the first place. This is like putting rabid weasels in your pants, and later expressing regret at having chosen those particular rabid weasels and that pair of pants. ---maf
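The idea of watching for a spike of withdraws every 24 hours could be sketched roughly as follows. It assumes withdrawal timestamps for the prefix have already been extracted from some route collector feed; the function name, sample data, and the three-day threshold are hypothetical.

```python
from collections import Counter
from datetime import datetime


def recurring_hours(withdrawals, min_days=3):
    """Return hours of the day in which withdrawals were seen on at least `min_days` distinct days."""
    days_per_hour = Counter()
    # Deduplicate to one (day, hour) pair so a burst within one hour counts once.
    for _day, hour in {(ts.date(), ts.hour) for ts in withdrawals}:
        days_per_hour[hour] += 1
    return sorted(h for h, n in days_per_hour.items() if n >= min_days)


# Hypothetical withdrawal timestamps seen through the AS path of interest.
withdrawals = [
    datetime(2019, 11, 18, 19, 1),
    datetime(2019, 11, 19, 19, 0),
    datetime(2019, 11, 19, 7, 30),  # one-off event, not part of the pattern
    datetime(2019, 11, 20, 19, 2),
    datetime(2019, 11, 21, 19, 1),
]
print(recurring_hours(withdrawals))
```

Bucketing by whole hours is crude: a nightly event that straddles an hour boundary would be split across two buckets, so a real check would want a sliding window.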
“Someday we’ll find it: the stable connection, my providers, my routers and me.” - Kermit the Frog
https://networkphil.com/
Couldn’t resist, having read this and then almost immediately seeing a link to the above “Rainbow Connection” remix via a LinkedIn post.
Happy Friday!
Cheers,
Yoni
On Nov 21, 2019, at 11:50 PM, Warren Kumari <warren@kumari.net> wrote:
The other saving grace / "meh" is that this is for a conference network, and we are picking up sticks and leaving tomorrow... so, we will let the provider know that there is something that should be fixed, but a: our pain will have stopped :-P and b: we won't really have a good way to know if they have fixed the issue (other than perhaps watching for a spike of withdraws / reannouncements every 24 hours through this AS path)
W
On 21/Nov/19 19:59, Saku Ytti wrote:
As there are best path algorithms which consider route age, BGP reset impact may be indefinite.
A practical problem we've seen with Cisco's BGP-SD implementation is that 0/0 and ::/0, when learned via BGP, are installed last.
So consider a situation where BGP flaps a session on IOS or IOS XE running BGP-SD. Even though the full BGP table is being held in RIB only (which can take about 10 minutes to fully download with the CPU performance of, say, an ME3600X or an ASR920), a default route coming in over an iBGP session will get loaded only after all more specific routes have been installed and a best path algorithm run against them.
If you write only default into FIB on these platforms, you're basically blackholing traffic for as long as it takes for BGP to reconverge.
So yes, while the fundamental design for this by Cisco is inherently flawed, unnecessary session resets are not ideal.
Mark.
participants (9)
- Baldur Norddahl
- Christopher Morrow
- Jared Mauch
- Mark Tinka
- Mel Beckman
- Saku Ytti
- Tom Beecher
- Warren Kumari
- Yoni Radzin