No matter how much money you put into your peering router, the session will be no more stable than whatever the peer did on their end. Plus, at some point you will need to reboot for a software upgrade or other reasons.

If you care at all, you should build redundancy with multiple locations and multiple routers. You can then save money on each router, because a router failure will not cause any change in what the internet sees through BGP.

Also, transits are way more important than peers. Losing a transit will cause massive route changes around the globe, and it will take a few minutes to stabilize. Losing a peer usually just means the peer switches to the transit route that they already had available.

Peers are not equal. You may want to ensure redundancy to your biggest peers, while the small fish will be fine without.

To be explicit: Router R1 has connections to transits T1 and T2. Router R2 also has connections to the same transits T1 and T2. When router R1 goes down, only small internal changes at T1 and T2 happen. Nobody notices and the recovery is sub-second.

Peers are less important: R1 has a connection to internet exchange IE1 and R2 to a different internet exchange IE2. When R1 goes down, the small peers at IE1 are lost but will quickly reroute through transit. Large peers may be present at both internet exchanges and so will instantly switch the traffic to IE2.

Regards,

Baldur

On Mon, Feb 10, 2020 at 1:38 PM <adamv0025@netconsultings.com> wrote:
Hi,
Would like to take a poll on whether you folks tend to treat your transit/peering connections (BGP sessions in particular) as pets or rather as cattle.
And I appreciate the answer could differ for transit vs peering connections.
However, I’d like to ask this question through a lens of redundant vs non-redundant Internet edge devices.
To explain,
1. The “pet” case:
Would you rather try improving the failure rate of your transit/peering connections by using resilient Control-Plane (REs/RSPs/RPs) or even designing these as link bundles over separate cards and optical modules?
Is this on the basis that it doesn’t matter how hard you try on your end (i.e. distributing your traffic across a multitude of transit and peering connections, or using BFD or even BGP-PIC Edge to shuffle things around fast), any disruption to the eBGP session itself will still hurt you in some way (i.e. at least a partial outage for some proportion of the traffic for a not insignificant period of time) until things converge in the direction from the Internet back to you?
2. The “cattle” case:
Or would you instead rely on small-ish non-redundant HW at your internet edge rather than trying to enhance MTBF with big chassis full of redundant HW?
Is this because the MTBF figure for a particular transit/peering eBGP session eventually boils down to the MTBF of the single card or even the single optical module hosting the link (and when creating bundles over separate cards, you can never be quite sure what the setup looks like on the other end of that connection)?
Or is it because the effect of a smaller/non-resilient border edge device failure is not that bad in your particular (maybe horizontally scaled) setup?
Would appreciate any pointers, thank you.
Thank you
adam
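
For anyone who wants to play with the R1/R2 layout described in the reply at the top, here is a minimal Python sketch of it. The router, transit and exchange names come from the email; the two peers (`small_peer`, `large_peer`) and the helper functions are hypothetical and only there for illustration. It models reachability only and says nothing about BGP convergence time.

```python
# Topology from the email: both routers reach both transits,
# but each router sits at a different internet exchange.
TOPOLOGY = {
    "R1": {"transits": {"T1", "T2"}, "exchanges": {"IE1"}},
    "R2": {"transits": {"T1", "T2"}, "exchanges": {"IE2"}},
}

# Assumed peers (not from the email): the exchanges each peer is present at.
PEERS = {
    "small_peer": {"IE1"},
    "large_peer": {"IE1", "IE2"},
}


def surviving(topology, failed_router):
    """Return the transits and exchanges still connected after one router fails."""
    transits, exchanges = set(), set()
    for router, links in topology.items():
        if router == failed_router:
            continue
        transits |= links["transits"]
        exchanges |= links["exchanges"]
    return transits, exchanges


def peer_paths(topology, peers, failed_router):
    """Classify how each peer is reached after the failure."""
    transits, exchanges = surviving(topology, failed_router)
    result = {}
    for peer, present_at in peers.items():
        if present_at & exchanges:
            result[peer] = "peering"      # still shares an exchange with us
        elif transits:
            result[peer] = "transit"      # falls back to the transit route
        else:
            result[peer] = "unreachable"  # no path left at all
    return result


if __name__ == "__main__":
    print(surviving(TOPOLOGY, "R1"))
    print(peer_paths(TOPOLOGY, PEERS, "R1"))
```

With R1 failed, both transits are still connected through R2, the small peer at IE1 falls back to transit, and the large peer stays on direct peering via IE2, which matches the behaviour described in the reply.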