Hello Adam,

On Mon, 10 Feb 2020 at 13:37, <adamv0025@netconsultings.com> wrote:
> Would like to take a poll on whether you folks tend to treat your transit/peering connections (BGP sessions in particular) as pets or rather as cattle.
Cattle every day of the week. I don't trust control-plane resiliency and things like ISSU any farther than I can throw the big boxes they run on.

The entire network is engineered so that my customers *do not* feel the loss of one node (*). That is the design principle here, and as traffic grows and we keep adding capacity it is something we always keep in mind. How hard that is to achieve depends on the particular situation, and it can be quite difficult in some, but not here.

That is why I can upgrade software on those nodes (no customers on them, just transit and peers) quite frequently. I can do it with mostly zero packet loss because of the design and because we drain traffic all around using graceful shutdown and friends. We had quite a few issues draining traffic from nodes in the past (brownouts caused by FIB mismatches between routers, since per-VRF label allocation means an IP lookup on both the ingress and the egress node), but since we switched to "per-ce" - meaning per-nexthop - label allocation, things work great.

On the other side, transit with support for graceful-shutdown is of course great, but even without it, maintenance on your box or your transit's box is something you know about beforehand, so you can manually drain your egress traffic (your peer doesn't have to support RFC8326 for you to drop YOUR loc-pref to zero). Many transit providers also have some kind of "set loc-pref below peer" community, which lets you do basically the same thing for the other direction without actual RFC8326 support on the far side (rough sketch of both knobs in the P.S. below). That said, for ingress traffic, unless you are announcing *A LOT* of routes, convergence is usually *very* fast anyway.

I can see the benefit of internal HW redundancy on nodes where customers are connected (shorter maintenance windows, fewer outages in some single-HW-failure scenarios, theoretically better overall service uptime), but it never covers everything, and it may just introduce unnecessary complexity that ends up being the root cause of outages itself. Maybe I'm just a lucky fellow, but the hardware has been so reliable here that I'm pretty sure the complexity of Dual-RSP, ISSU and friends would have caused more issues over time than what I'm seeing with some good old honest HW failures.

Regarding HW redundancy itself: Dual RSP doesn't have any benefit when the guy in the MMR pulls the wrong fiber and brings down my transit. It will still be BGP that has to converge. We don't have PIC today - maybe that is something to look into in the future - but it isn't something that internal HW redundancy fixes.

A straightforward, KISS design, where the engineers actually know "what happens when" and how to do things properly (like draining traffic), combined with, quite frankly, accepting some brownouts for uncommon events, is the strategy that has worked best for us.

(*) Sure, if the node holding 700k best-paths towards a transit dies non-gracefully (HW or power failure), there will be a brownout of the affected prefixes for some minutes. But after convergence my network will be fine and my customers will stop feeling it. They will ask what happened and I will be able to explain.

cheers,
lukas
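P.S. For illustration only, a rough Junos-style sketch of the manual drain I mean. Policy and group names are made up, and you would fold these into whatever import/export chains you already run:

    policy-options {
        /* RFC 8326 GRACEFUL_SHUTDOWN well-known community */
        community GSHUT members 65535:0;

        /* import on the session being drained: received routes lose
           best-path locally, so egress traffic moves to other exits */
        policy-statement DRAIN-IN {
            then {
                local-preference 0;
                accept;
            }
        }

        /* export on the same session: tag announcements with GSHUT
           (or swap in the transit's documented "set loc-pref below
           peer" community) so the far side de-prefers us and ingress
           traffic shifts away as well */
        policy-statement DRAIN-OUT {
            then {
                community add GSHUT;
                accept;
            }
        }
    }
    protocols {
        bgp {
            group TRANSIT-EXAMPLE {
                import [ DRAIN-IN ];    /* prepend to your existing chain */
                export [ DRAIN-OUT ];   /* likewise */
            }
        }
    }

Commit, watch the interface counters fall off, do the maintenance, then roll the policies back.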