On Mon, Feb 10, 2020 at 5:42 PM <adamv0025@netconsultings.com> wrote:

>> To be explicit: Router R1 has connections to transits T1 and T2.
>> Router R2 also has connections to the same transits T1 and T2. When
>> router R1 goes down, only small internal changes at T1 and T2 happen.
>> Nobody notices and the recovery is sub-second.
>>
> Good point again.
> Though if I had only T1 on R1 and only T2 on R2, then convergence wouldn't happen inside each transit but between T1 and T2, which would add to the convergence time.
> So, thinking about it, the optimal design pattern for a distributed (horizontally scaled-out) edge seems to be pairing up, i.e. at least two edge nodes per transit (or peer, for that matter), to allow potentially faster intra-transit convergence rather than arguably slower inter-transit convergence.

I am assuming R1 and R2 are connected and announcing the same routes. Each transit is therefore receiving the same routes from two independent routers (R1 and R2). When R1 goes down, something internally at the transit will change to reflect that. But peers, other customers at that transit and higher tier transits will see no difference at all. Assuming R1 and R2 both announce a default route internally in your network, your internal convergence will be as fast as your detection of the dead router.
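To make the difference between the two designs concrete, here is a toy Python sketch (not real BGP, just bookkeeping). It models each BGP session as a hypothetical (our_router, transit) pair and asks which transits are left with no direct session to us after one of our routers dies. Those are the transits whose change becomes visible outside their own network.

```python
from typing import FrozenSet, Set, Tuple

# A BGP session is modelled as (our_router, transit). Router and transit names
# follow the thread; the two topologies are illustrative assumptions.
Session = Tuple[str, str]

# Every edge router has a session to every transit (dual links per transit).
DUAL_LINKED: FrozenSet[Session] = frozenset(
    {("R1", "T1"), ("R1", "T2"), ("R2", "T1"), ("R2", "T2")}
)
# Each edge router has a session to only one transit.
SINGLE_LINKED: FrozenSet[Session] = frozenset({("R1", "T1"), ("R2", "T2")})


def transits_losing_customer_route(sessions: FrozenSet[Session], failed_router: str) -> Set[str]:
    """Return the transits left with no direct session to us after failed_router dies.

    Such a transit loses its customer-learned route to us; anything it still
    knows comes via another network, so the change is seen beyond its own network.
    """
    before = {transit for _, transit in sessions}
    after = {transit for router, transit in sessions if router != failed_router}
    return before - after


if __name__ == "__main__":
    for name, topo in (("dual", DUAL_LINKED), ("single", SINGLE_LINKED)):
        print(name, sorted(transits_losing_customer_route(topo, "R1")))
```

Running it prints `dual []` and `single ['T1']`: with dual links, losing R1 stays an internal event at each transit, while with single links T1 loses its direct route and the rest of the Internet has to notice.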

This scheme also protects against link failure or failure at the provider end (if you make sure the transit is also using two routers).

Therefore, even if R1 and R2 are in the same physical location, maybe in the same rack mounted on top of each other, that is a better solution than one big hunky router with redundant hardware. Having them at different locations is better, of course, but not always feasible.

Many dual-homed companies may start out with two routers and two transits but without dual links to each transit, as you describe above. That will cause significant disruption if one link goes down. It is not just about convergence between T1 and T2; a major part of the Internet has to reconverge. Been there, done that: yes, you can be down for up to several minutes before everything is back to normal. Assume both are tier 1 transits and that the link to T1 was lost. T1 will have a peering session with T2 somewhere, but T1 will not allow peer-to-peer traffic to go via that link. All of T1's peers will then need to find a different way to reach you, one that does not transit T1 (unless they have a contract with T1).
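The "T1 will not carry peer-to-peer traffic" behaviour is the usual valley-free export policy, often described as the Gao-Rexford rules. The sketch below is a simplified model of that policy; real networks configure their own filters, so treat it as an assumption about typical behaviour, not as what any particular transit does.

```python
from enum import Enum


class Rel(Enum):
    """Business relationship of a BGP neighbour, seen from the exporting network."""
    CUSTOMER = "customer"
    PEER = "peer"
    PROVIDER = "provider"


def may_export(learned_from: Rel, send_to: Rel) -> bool:
    """Simplified Gao-Rexford ("valley-free") export rule.

    Routes learned from a customer are exported to everyone; routes learned
    from a peer or provider are exported only to customers.
    """
    return learned_from is Rel.CUSTOMER or send_to is Rel.CUSTOMER


if __name__ == "__main__":
    # Before the failure T1 learns our prefix directly from us (its customer).
    # After our link to T1 is lost, T1 only learns it over its peering with T2.
    for learned_from in (Rel.CUSTOMER, Rel.PEER):
        exported = [n.value for n in Rel if may_export(learned_from, n)]
        print(f"T1 learned route from {learned_from.value}: announces to {exported}")
```

Before the failure T1 announces us to customers, peers and providers; afterwards only to its customers. That is why T1's peers must withdraw the path via T1 and search for another one, while networks that buy transit from T1 can still reach us through it.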

Therefore, if being down for several minutes is not OK, you should invest in dual links to your transits and connect them to two different routers. If possible, get a guarantee that the transits use two routers at their end, that diverse fiber paths are used, etc.

Regards,

Baldur