Soliciting your opinions on Internet routing: A survey on BGP convergence
Hi NANOG,
We often read that the Internet (i.e. BGP) is "slow to converge". But how slow is it really? Do you care anyway? And can we (researchers) do anything about it? Please help us find out by answering our short anonymous survey (<10 minutes).
Survey URL: https://goo.gl/forms/JZd2CK0EFpCk0c272 <https://goo.gl/forms/WW7KX5kT45m6UUM82>
** Background:
While existing fast-reroute mechanisms enable sub-second convergence upon local outages (planned or not), they do not apply to remote outages happening further away from your AS, as their detection and protection mechanisms only work locally.
Remote outages therefore mandate a "BGP-only" convergence which tends to be slow, as long streams of BGP UPDATEs (containing up to 100,000s of them) must be propagated router-by-router. Our initial measurements indicate that it can take state-of-the-art BGP routers dozens of seconds to process and propagate these large streams of BGP UPDATEs. During this time, traffic for important destinations can be lost.
** This survey:
This survey aims at evaluating the impact of slow BGP convergence on operational practices. We expect the findings to increase the understanding of the perceived BGP convergence in the Internet, which could then help researchers to design better fast-reroute mechanisms.
We expect the questionnaire to be filled out by network operators whose job relates to BGP operations. It has a total of 17 questions and should take less than 10 minutes to answer. The survey and the collected data are anonymous (so please do *not* include information that may help to identify you or your organization). All questions are optional, so if you don't like a question or don't know the answer, please skip it.
A summary of the aggregate results will be published as a part of a scientific article later this year.
Thank you so much in advance, and we look forward to reading your responses!
Laurent Vanbever (ETH Zürich, Switzerland)
PS: It goes without saying that we would also be extremely grateful if you could forward this email to any operator you might know who may not read NANOG.
Hello
I find that the type of outage that affects our network the most is neither of the two options you describe. As is probably typical for smaller networks, we do not have redundant uplinks to all of our transits. If a transit link goes, for example because we had to reboot a router, traffic is supposed to reroute to the remaining transit links. Internally our network handles this fairly fast for egress traffic.
However the problem is the ingress traffic - it can be 5 to 15 minutes before everything has settled down. This is the time before everyone else on the internet has processed that they will have to switch to your alternate transit.
The only solution I know of is to have redundant links to all transits. Going forward I will make sure we have this because it is a huge disadvantage not being able to take a router out of service without causing downtime for all users. Not to mention a router crash or link failure, which should take seconds at most to reroute around but instead causes at least 5 minutes of unstable internet.
Regards,
Baldur
Dear Baldur,
I find that the type of outage that affects our network the most is neither of the two options you describe. As is probably typical for smaller networks, we do not have redundant uplinks to all of our transits. If a transit link goes, for example because we had to reboot a router, traffic is supposed to reroute to the remaining transit links. Internally our network handles this fairly fast for egress traffic.
However the problem is the ingress traffic - it can be 5 to 15 minutes before everything has settled down. This is the time before everyone else on the internet has processed that they will have to switch to your alternate transit.
Thanks a lot for your input. Indeed, that case is a bit special. I’d say it is a kind of remote outage that remote ASes experience towards your prefix and, as such, requires a "BGP-only" convergence. I guess if your prefixes going via alternate transit are not visible at all prior to the switch (and I guess not), this is a kind of "extreme" convergence where routes have to be withdrawn/updated Internet-wide.
This reminds me of the paper by Craig Labovitz et al. (http://conferences.sigcomm.org/sigcomm/2000/conf/paper/sigcomm2000-5-2.pdf), which I think classifies these events as Tlong ("An active route with a short ASPath is implicitly replaced with a new route possessing a longer ASPath. This represents both a route failure and failover"). And indeed, these are the second slowest, right behind the Internet-wide withdrawal of a prefix.
You’re right that our survey targets more the case in which large bursts of UPDATEs/WITHDRAWs are exchanged. I guess a parallel case to the one you mention could be that your primary transit performs a planned maintenance (or experiences a failure) that triggers the sending of WITHDRAWs for your prefixes.
The only solution I know of is to have redundant links to all transits. Going forward I will make sure we have this because it is a huge disadvantage not being able to take a router out of service without causing downtime for all users. Not to mention a router crash or link failure, which should take seconds at most to reroute around but instead causes at least 5 minutes of unstable internet.
Maybe you could advertise better routes (i.e., with shorter AS-PATHs/longer prefixes) via the alternate transit prior to the takedown? Ideally, if you could somehow make your primary transit switch to an alternate transit prior to the maintenance (maybe with a special community?), you could completely avoid a disruption. This would go in the direction of minimizing the number of WITHDRAWs in favor of UPDATEs. But, of course, this would only work in the case of planned maintenance.
We would definitely welcome more input on the convergence issue you face!
Best,
Laurent
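To make the "longer prefixes" option above concrete, here is a minimal Python sketch of longest-prefix-match forwarding. It only illustrates why more-specifics announced via the alternate transit attract traffic regardless of AS-PATH length. The prefix 10.10.0.0/22, the labels "primary" and "alternate", and the helper names are all hypothetical, and the sketch ignores the prefix-length filtering that real networks apply.

```python
import ipaddress

def more_specifics(prefix):
    """Split a prefix into its two halves (each one bit longer)."""
    return list(ipaddress.ip_network(prefix).subnets(prefixlen_diff=1))

def next_hop(dest, routes):
    """Return the label attached to the longest prefix that matches dest."""
    addr = ipaddress.ip_address(dest)
    matches = [(net, via) for net, via in routes if addr in net]
    if not matches:
        raise LookupError(f"no route to {dest}")
    # Longest prefix wins before any BGP attribute is even compared.
    return max(matches, key=lambda m: m[0].prefixlen)[1]

# Hypothetical setup: the aggregate is reachable via the primary transit.
routes = [(ipaddress.ip_network("10.10.0.0/22"), "primary")]
print(next_hop("10.10.3.7", routes))  # primary

# Before the maintenance, announce both halves via the alternate transit.
routes += [(half, "alternate") for half in more_specifics("10.10.0.0/22")]
print(next_hop("10.10.3.7", routes))  # alternate
```

Once the more-specifics are visible, the aggregate via the primary transit no longer carries traffic, so taking it down should cause no further route changes for remote networks.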
On Tue, Jan 10, 2017 at 03:51:04AM +0100, Baldur Norddahl wrote:
If a transit link goes, for example because we had to reboot a router, traffic is supposed to reroute to the remaining transit links. Internally our network handles this fairly fast for egress traffic.
However the problem is the ingress traffic - it can be 5 to 15 minutes before everything has settled down. This is the time before everyone else on the internet has processed that they will have to switch to your alternate transit.
The only solution I know of is to have redundant links to all transits.
Alternatively, if you reboot a router, perhaps you could first shut down the eBGP sessions, then wait 5 to 10 minutes for the traffic to drain away (should be visible in your NMS stats), and then proceed with the maintenance?
Of course this only works for planned reboots, not surprise reboots.
Kind regards,
Job
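The "wait for the traffic to drain" step above could be automated along these lines. This is a minimal sketch, not a tested operational tool: `sample_bps` stands in for whatever your NMS exposes (a hypothetical callable returning the current ingress rate in bits per second), and the threshold, timeout and polling interval are arbitrary assumptions.

```python
import time

def wait_for_drain(sample_bps, threshold_bps, timeout_s=600.0,
                   interval_s=30.0, sleep=time.sleep, clock=time.monotonic):
    """Poll the traffic rate until it falls to threshold_bps or timeout_s elapses.

    Returns True if the link drained in time, False on timeout.
    sleep and clock are injectable so the loop can be exercised without waiting.
    """
    deadline = clock() + timeout_s
    while True:
        if sample_bps() <= threshold_bps:
            return True
        if clock() >= deadline:
            return False
        sleep(interval_s)

# Dry run with made-up readings and a fake clock (no real waiting).
readings = iter([900e6, 400e6, 20e6])
fake_now = [0.0]

def fake_sleep(seconds):
    fake_now[0] += seconds

drained = wait_for_drain(
    sample_bps=lambda: next(readings),
    threshold_bps=50e6,
    sleep=fake_sleep,
    clock=lambda: fake_now[0],
)
print(drained)  # True
```

In practice the eBGP sessions would be shut down first, and the reboot would proceed only if the function returns True.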
On Tue 2017-Jan-10 20:58:02 +0100, Job Snijders <job@instituut.net> wrote:
Of course this only works for planned reboots, not surprise reboots.
...or link failures.
-- Hugo Slabbert | email, xmpp/jabber: hugo@slabnet.com pgp key: B178313E | also on Signal
On Jan 10, 2017, at 3:14 PM, Hugo Slabbert <hugo@slabnet.com> wrote:
...or link failures.
One other comment: there has been a long history of poorly behaving BGP stacks that would take quite some time to hunt through the paths. While this can still occur with people who still have near-ancient software and hardware in use, many of the modern software/hardware options enable things like BGP-PIC (in your survey) by default.
Many of the options you document as best practices, like path MTU discovery, are well-known fixes for networks, as is using jumbo MTU internally to obtain a 9k+ MSS for high-performance TCP. Vendors have not always chosen to enable the TCP options by default the way they have the protocol features (e.g. BGP-PIC) and, like Jakob’s response, tout other solutions instead of fixing the TCP stack first.
Many of these performance settings were documented in 2002 and are considered best practices by many networks, but because the knobs are obscure they may not be widely deployed, or may be seen as risky to configure. (We had a vendor panic when we discovered a bug in their TCP-SACK code; they were almost frozen in not fixing the code because touching TCP felt dangerous and there was an inadequate testing culture around something seen as ‘stable’.)
Here’s the presentation from IETF 53; I don’t see it in the proceedings handily: http://morse.colorado.edu/~epperson/courses/routing-protocols/handouts/bgp_s...
- Jared
On 10 January 2017 at 19:58, Job Snijders <job@instituut.net> wrote:
Alternatively, if you reboot a router, perhaps you could first shut down the eBGP sessions, then wait 5 to 10 minutes for the traffic to drain away (should be visible in your NMS stats), and then proceed with the maintenance?
Of course this only works for planned reboots, not surprise reboots.
If I tear down my eBGP sessions the upstream router withdraws the route and the traffic just stops. Are your upstreams propagating withdraws without actually updating their own routing tables?
I believe the simple explanation of the problem can be seen by firing up an inbound mtr from a distant network and then withdrawing the route from the path it is taking. It should show either destination unreachable or a routing loop which "retreats" (under the right circumstances I have observed it distinctly move 1 hop at a time) until it finds an alternate path.
My observed convergence times for a single withdraw are however in the sub-10 second range, to get all the networks in the original path pointing at a new one.
My view on the problem is that if you are failing over frequently enough for a customer to notice and report it, you have bigger problems than convergence times.
- Mike Jones
On 1/9/17 2:56 PM, Laurent Vanbever wrote:
Remote outages therefore mandate a "BGP-only" convergence which tends to be slow, as long streams of BGP UPDATEs (containing up to 100,000s of them) must be propagated router-by-router. Our initial measurements indicate that it can take state-of-the-art BGP routers dozens of seconds to process and propagate these large streams of BGP UPDATEs. During this time, traffic for important destinations can be lost.
One of the phenomena that is relatively easy to observe by withdrawing a prefix entirely is the convergence towards longer and longer AS paths until the route disappears entirely. That is, providers that are further away will keep advertising the route, and in the interim their neighbors will ingest the available path until they too process the withdraw. It can take a comically long time (like 5 minutes) to see the prefix ultimately disappear from the internet. When withdrawing a prefix from a peer with which you have a single adjacency, this can easily happen in miniature.
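The path hunting described above can be reproduced in a toy model. The sketch below is a simplified path-vector simulation under loudly stated assumptions: synchronous rounds, shortest-AS-path selection only (ties broken alphabetically), no MRAI timers or policies, and a hypothetical topology in which origin O attaches to A only while A, B, C and D are fully meshed. It is not a model of any real BGP implementation.

```python
def simulate_withdrawal(mesh, attach, origin="O"):
    """Withdraw origin's prefix and record each node's successive best paths."""
    neighbors = {n: [m for m in mesh if m != n] for n in mesh}
    # Converged state before the withdrawal: everyone routes via `attach`.
    best = {n: (origin,) if n == attach else (attach, origin) for n in mesh}
    # rib_in[n][m] is the path neighbor m last announced to n (None = withdrawn).
    rib_in = {n: {m: (m,) + best[m] for m in neighbors[n]} for n in mesh}
    history = {n: [best[n]] for n in mesh}

    def select(node):
        # Loop prevention: ignore any path that already contains this node.
        usable = [p for p in rib_in[node].values()
                  if p is not None and node not in p]
        return min(usable, key=lambda p: (len(p), p)) if usable else None

    while True:
        updates = {}
        for node in mesh:  # the direct route at `attach` is gone from now on
            new_best = select(node)
            if new_best != best[node]:
                best[node] = new_best
                history[node].append(new_best)
                updates[node] = None if new_best is None else (node,) + new_best
        if not updates:
            return history
        # Deliver all announcements and withdrawals at the end of the round.
        for sender, announced in updates.items():
            for peer in neighbors[sender]:
                rib_in[peer][sender] = announced

history = simulate_withdrawal(["A", "B", "C", "D"], attach="A")
for node, paths in history.items():
    print(node, " -> ".join(" ".join(p) if p else "withdrawn" for p in paths))
# D: A O -> B A O -> B C A O -> withdrawn
```

Even in this four-node mesh, D walks through two stale, progressively longer paths before giving up, which is the same behavior seen at Internet scale, only in miniature.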
Hi Joel,
On 10 Jan 2017, at 06:51, joel jaeggli <joelja@bogus.com> wrote:
One of the phenomena that is relatively easy to observe by withdrawing a prefix entirely is the convergence towards longer and longer AS paths until the route disappears entirely. That is, providers that are further away will keep advertising the route, and in the interim their neighbors will ingest the available path until they too process the withdraw. It can take a comically long time (like 5 minutes) to see the prefix ultimately disappear from the internet. When withdrawing a prefix from a peer with which you have a single adjacency, this can easily happen in miniature.
Thanks! Yes, definitely. This relates to the issue Baldur was raising, in which a less-preferred prefix (or no prefix at all in your case) has to take over from a more preferred one. That case is definitely bad for BGP convergence. Our survey/study is more geared towards cases where there is diversity available (alternate paths are there and at least partially visible). We are especially interested in finding out whether, even when you take all the precautionary measures required by the book, long BGP convergence can still bite you and… whether we can do anything about it.
Laurent
PS: Thanks so much to the 21 operators who have answered already! If you haven’t done so already, please help us find out about troublesome BGP convergence by answering our short anonymous survey (<10 minutes): https://goo.gl/forms/JZd2CK0EFpCk0c272
participants (7)
- Baldur Norddahl
- Hugo Slabbert
- Jared Mauch
- Job Snijders
- joel jaeggli
- Laurent Vanbever
- Mike Jones