Re: Soliciting your opinions on Internet routing: A survey on BGP convergence

10 Jan 2017

On 10 January 2017 at 19:58, Job Snijders <job@instituut.net> wrote:
...
On Tue, Jan 10, 2017 at 03:51:04AM +0100, Baldur Norddahl wrote:
...
If a transit link goes, for example because we had to reboot a router,
traffic is supposed to reroute to the remaining transit links.
Internally our network handles this fairly fast for egress traffic.
However the problem is the ingress traffic - it can be 5 to 15 minutes
before everything has settled down. This is the time before everyone
else on the internet has processed that they will have to switch to
your alternate transit.
The only solution I know of is to have redundant links to all transits.
Alternatively, if you reboot a router, perhaps you could first shutdown
the eBGP sessions, then wait 5 to 10 minutes for the traffic to drain
away (should be visible in your NMS stats), and then proceed with the
maintenance?
Of course this only works for planned reboots, not surprise reboots.
Kind regards,
Job
If I tear down my eBGP sessions, the upstream router withdraws the
route and the traffic just stops. Are your upstreams propagating
withdraws without actually updating their own routing tables?

I believe the simple explanation of the problem can be seen by firing
up an inbound mtr from a distant network then withdrawing the route
from the path it is taking. It should show either destination
unreachable or a routing loop which "retreats" (under the right
circumstances I have observed it distinctly move 1 hop at a time)
until it finds an alternate path.
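
As a rough illustration of the three states described above
(destination reached, routing loop, destination unreachable), here is
a minimal Python sketch that classifies one trace snapshot. It assumes
the hop list has already been collected by some other means. The name
classify_trace and the 192.0.2.x / 198.51.100.x documentation
addresses are hypothetical and not taken from mtr itself.

```python
from collections import Counter


def classify_trace(hops, destination):
    """Classify one trace snapshot as 'reached', 'loop' or 'unreachable'.

    hops: responding hop addresses in order, with None for a hop that
          did not answer (hypothetical input format).
    """
    seen = [hop for hop in hops if hop is not None]
    # The last responding hop being the destination means a path exists.
    if seen and seen[-1] == destination:
        return "reached"
    # The same router answering more than once means packets are
    # bouncing between routers that still point at each other.
    if any(count > 1 for count in Counter(seen).values()):
        return "loop"
    return "unreachable"


if __name__ == "__main__":
    looping = ["192.0.2.1", "192.0.2.2", "192.0.2.3", "192.0.2.2", "192.0.2.3"]
    print(classify_trace(looping, "198.51.100.10"))
```

Running this against successive snapshots taken during a withdraw
would show the state changing over time, though it says nothing about
why a given router still holds the stale route.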

However, my observed convergence times for a single withdraw are in
the sub-10-second range for all the networks on the original path to
point at a new one. My view on the problem is that if you are
failing over frequently enough for a customer to notice and report it,
you have bigger problems than convergence times.
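
For anyone wanting to put a number on their own failover, one simple
approach is to probe the prefix from outside at a fixed interval and
time the gap between the first lost probe and the next successful one.
The sketch below assumes the probe results are already available as
(timestamp, reached) pairs. The function name and the sample data are
made up for illustration and are not measurements from this thread.

```python
def outage_seconds(samples):
    """Return the length of the first outage in a probe log.

    samples: (timestamp_in_seconds, reached) pairs sorted by time
             (hypothetical input format).
    Returns 0.0 if no probe was lost, or None if reachability never
    came back within the log.
    """
    first_loss = None
    for timestamp, reached in samples:
        if not reached and first_loss is None:
            first_loss = timestamp          # the outage starts here
        elif reached and first_loss is not None:
            return timestamp - first_loss   # first success after the loss
    return 0.0 if first_loss is None else None


if __name__ == "__main__":
    # Illustrative data only: probes lost at t=1 and t=2, back at t=8.
    log = [(0.0, True), (1.0, False), (2.0, False), (8.0, True)]
    print(outage_seconds(log))
```

The result is only as precise as the probe interval, and it reflects
the one vantage point doing the probing rather than the whole
internet.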

- Mike Jones