Tom Beecher wrote on 22/12/2024 02:51:

Definitely will be interesting to read the list discussion about this. My first reaction was why would you even need this, so def curious.

There are two main target situations. Firstly, when a router unexpectedly drops off an IXP platform, this won't be explicitly signaled to the other routers on the fabric, which can mean that packets to that device will be black-holed until the other routers' BGP hold timers expire. This would typically happen after 90-180s (e.g. keepalive interval: 30-60s). The second situation is forwarding plane incongruence on IXPs, i.e. where router A can reach the RS, router B can reach the RS, but router A cannot reach router B due to a problem on the IXP fabric itself. Thankfully this style of problem has become quite unusual over the last several years.
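
To put rough numbers on the first case: BGP implementations conventionally set the hold time to three times the keepalive interval, which is where the 90-180s range comes from. A minimal Python sketch of that arithmetic, assuming the 3x ratio (real configurations may differ); the function name is made up for illustration:

```python
def worst_case_blackhole_seconds(keepalive_s: int, hold_multiplier: int = 3) -> int:
    """Upper bound on how long peers keep forwarding to a router that vanished silently.

    Assumes the conventional hold time = 3 x keepalive interval.
    The worst case is a router that dies just after sending a keepalive,
    so its peers wait out the full hold time before tearing the session down.
    """
    return keepalive_s * hold_multiplier


for keepalive in (30, 60):
    hold = worst_case_blackhole_seconds(keepalive)
    print(f"keepalive {keepalive}s -> session declared dead after up to {hold}s")
```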

I'm not sure it's an all-round good solution to either of these problems, in the "be careful what you wish for, because you might get it" sense. There are going to be router platforms out there which won't handle hundreds of BFD sessions reliably, so if the protocol were widely supported, it's not clear whether it would help or harm interdomain routing stability, particularly in situations where all the sessions could be triggered simultaneously.

As a separate issue, hold timers should generally be of a comparable order of magnitude to the non-availability effect they're attempting to mitigate. Inter-domain routing convergence is often measured in minutes rather than seconds. So even if the protocol layer worked at IXPs without causing control plane meltdown, it's still a mechanism which has a trigger timer two orders of magnitude faster than the general case of DFZ reconvergence. I can't see that this would help overall inter-domain routing stability.
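
The "two orders of magnitude" gap can be sanity-checked with some arithmetic. The sketch below assumes a BFD transmit interval of 300 ms and a detect multiplier of 3, and uses 90-180s as a stand-in for "minutes" of convergence. All of these values are illustrative assumptions, not figures from this thread:

```python
import math

# Assumed BFD parameters (illustrative only, not taken from the thread).
BFD_TX_INTERVAL_MS = 300
BFD_DETECT_MULTIPLIER = 3


def bfd_detection_ms(tx_interval_ms: int, multiplier: int) -> int:
    # Detection time is roughly the detect multiplier times the transmit interval.
    return tx_interval_ms * multiplier


def orders_of_magnitude(slow_ms: float, fast_ms: float) -> float:
    # log10 of the ratio between the slow and fast timescales.
    return math.log10(slow_ms / fast_ms)


detect_ms = bfd_detection_ms(BFD_TX_INTERVAL_MS, BFD_DETECT_MULTIPLIER)
for convergence_s in (90, 180):  # assumed stand-in for "minutes rather than seconds"
    gap = orders_of_magnitude(convergence_s * 1000, detect_ms)
    print(f"{convergence_s}s vs {detect_ms}ms: about {gap:.1f} orders of magnitude")
```

With sub-second detection on one side and minute-scale convergence on the other, the ratio lands at roughly 100x or more under these assumptions.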

Nick