On 10 January 2017 at 19:58, Job Snijders <job@instituut.net> wrote:
On Tue, Jan 10, 2017 at 03:51:04AM +0100, Baldur Norddahl wrote:
If a transit link goes, for example because we had to reboot a router, traffic is supposed to reroute to the remaining transit links. Internally our network handles this fairly fast for egress traffic.
However the problem is the ingress traffic - it can be 5 to 15 minutes before everything has settled down. This is the time before everyone else on the internet has processed that they will have to switch to your alternate transit.
The only solution I know of is to have redundant links to all transits.
Alternatively, if you reboot a router, perhaps you could first shut down the eBGP sessions, then wait 5 to 10 minutes for the traffic to drain away (should be visible in your NMS stats), and then proceed with the maintenance?
Of course this only works for planned reboots, not surprise reboots.
Kind regards,
Job
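For what it's worth, the "shut down eBGP, wait for the drain, then reboot" sequence above could be gated by a small script along these lines. This is only a rough sketch: `read_ingress_bps` is a hypothetical callable standing in for whatever your NMS or SNMP poller exposes, and the threshold, timeout and poll interval are made-up defaults, not recommendations.

```python
import time
from typing import Callable


def wait_for_drain(
    read_ingress_bps: Callable[[], float],  # hypothetical: returns current ingress rate on the transit link
    threshold_bps: float,                   # assumed "drained" level, pick from your own baseline
    timeout_s: float = 600.0,               # give up after roughly 10 minutes
    poll_s: float = 10.0,                   # seconds between samples
    consecutive: int = 3,                   # quiet samples required in a row
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> bool:
    """Return True once ingress stays at or below threshold_bps for
    `consecutive` samples in a row, or False if timeout_s passes first."""
    deadline = clock() + timeout_s
    quiet = 0
    while True:
        if read_ingress_bps() <= threshold_bps:
            quiet += 1
            if quiet >= consecutive:
                return True
        else:
            quiet = 0  # traffic came back, start counting again
        if clock() >= deadline:
            return False
        sleep(poll_s)
```

The idea would be to call this after the sessions are administratively shut down and only proceed with the reboot when it returns True. Requiring several quiet samples in a row is just a guard against a single low reading.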
If I tear down my eBGP sessions the upstream router withdraws the route and the traffic just stops. Are your upstreams propagating withdraws without actually updating their own routing tables?

I believe the simple explanation of the problem can be seen by firing up an inbound mtr from a distant network then withdrawing the route from the path it is taking. It should show either destination unreachable or a routing loop which "retreats" (under the right circumstances I have observed it distinctly move 1 hop at a time) until it finds an alternate path.

My observed convergence times for a single withdraw are however in the sub-10 second range, to get all the networks in the original path pointing at a new one.

My view on the problem is that if you are failing over frequently enough for a customer to notice and report it, you have bigger problems than convergence times.

- Mike Jones
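To put a number on convergence from a single vantage point, one option is to log timestamped probe results from the distant network while the route is withdrawn and measure the gap between the first lost probe and the first one that gets through again. A minimal sketch, assuming the probe log has already been reduced to `(timestamp_seconds, reached_destination)` pairs; both the function name and that input format are invented for illustration, and an unreachable reply or a routing loop both simply count as a failed probe here.

```python
from typing import Iterable, Optional, Tuple


def outage_seconds(probes: Iterable[Tuple[float, bool]]) -> Optional[float]:
    """Given time-ordered (timestamp, reached_destination) samples, return
    the time from the first failed probe to the next successful one.
    Returns None if there was no loss, or if the path never recovered
    within the samples provided."""
    first_loss: Optional[float] = None
    for ts, ok in probes:
        if not ok and first_loss is None:
            first_loss = ts          # start of the outage window
        elif ok and first_loss is not None:
            return ts - first_loss   # first probe that got through again
    return None


# Example: probes once per second, loss starting at t=2, recovery seen at t=9
samples = [(0, True), (1, True), (2, False), (3, False), (9, True), (10, True)]
print(outage_seconds(samples))  # 7
```

The resolution is limited by the probe interval, and this is the view from one source network only; other networks may settle on the new path sooner or later.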