Peering/Transit eBGP sessions - pets or cattle?
Hi,

Would like to take a poll on whether you folks tend to treat your transit/peering connections (BGP sessions in particular) as pets or rather as cattle. And I appreciate the answer could differ for transit vs peering connections. However, I'd like to ask this question through a lens of redundant vs non-redundant Internet edge devices.

To explain,

1. The "pet" case:
Would you rather try improving the failure rate of your transit/peering connections by using resilient Control-Plane (REs/RSPs/RPs) or even designing these as link bundles over separate cards and optical modules?
Is this on the basis that, no matter how hard you try on your end (i.e. distribute your traffic across a multitude of transit and peering connections, or use BFD or even BGP-PIC Edge to shuffle things around fast), any disruption to the eBGP session itself will still hurt you in some way (i.e. at least a partial outage for some proportion of the traffic, for a not insignificant period of time), until things converge in the direction from the Internet back to you?

2. The "cattle" case:
Or would you instead rely on small-ish non-redundant HW at your internet edge, rather than trying to enhance MTBF with a big chassis full of redundant HW?
Is this because, eventually, the MTBF figure for a particular transit/peering eBGP session boils down to the MTBF of the single card or even the single optical module hosting the link (and with bundles over separate cards you can never be quite sure what the setup looks like on the other end of that connection)?
Or is it because the effect of a smaller/non-resilient border edge device failure is not that bad in your particular (maybe horizontally scaled) setup?

Would appreciate any pointers, thank you.

Thank you
adam
No matter how much money you put into your peering router, the session will be no more stable than whatever the peer did to their end. Plus at some point you will need to reboot due to software upgrade or other reasons.

If you care at all, you should be doing redundancy by having multiple locations and multiple routers. You can then save the money spent on each router, because a router failure will not cause any change in what the Internet sees through BGP.

Also, transits are way more important than peers. Losing a transit will cause massive route changes around the globe, and it will take a few minutes to stabilize. Losing a peer usually just means the peer switches to the transit route that they already had available. Peers are not equal. You may want to ensure redundancy to your biggest peers, while the small fish will be fine without.

To be explicit: Router R1 has connections to transits T1 and T2. Router R2 also has connections to the same transits T1 and T2. When router R1 goes down, only small internal changes at T1 and T2 happen. Nobody notices, and the recovery is sub-second.

Peers are less important: R1 has a connection to internet exchange IE1 and R2 to a different internet exchange IE2. When R1 goes down, the small peers at IE1 are lost but will quickly reroute through transit. Large peers may be present at both internet exchanges and so will instantly switch the traffic to IE2.

Regards,
Baldur

On Mon, Feb 10, 2020 at 1:38 PM <adamv0025@netconsultings.com> wrote:
Hi,
Would like to take a poll on whether you folks tend to treat your transit/peering connections (BGP sessions in particular) as pets or rather as cattle.
And I appreciate the answer could differ for transit vs peering connections.
However, I’d like to ask this question through a lens of redundant vs non-redundant Internet edge devices.
To explain,
1. The “pet” case:
Would you rather try improving the failure rate of your transit/peering connections by using resilient Control-Plane (REs/RSPs/RPs) or even designing these as link bundles over separate cards and optical modules?
Is this on the basis that, no matter how hard you try on your end (i.e. distribute your traffic across a multitude of transit and peering connections, or use BFD or even BGP-PIC Edge to shuffle things around fast), any disruption to the eBGP session itself will still hurt you in some way (i.e. at least a partial outage for some proportion of the traffic, for a not insignificant period of time), until things converge in the direction from the Internet back to you?
2. The "cattle" case:
Or would you instead rely on small-ish non-redundant HW at your internet edge, rather than trying to enhance MTBF with a big chassis full of redundant HW?
Is this because, eventually, the MTBF figure for a particular transit/peering eBGP session boils down to the MTBF of the single card or even the single optical module hosting the link (and with bundles over separate cards you can never be quite sure what the setup looks like on the other end of that connection)?
Or is it because the effect of a smaller/non-resilient border edge device failure is not that bad in your particular (maybe horizontally scaled) setup?
Would appreciate any pointers, thank you.
Thank you
adam
From: Baldur Norddahl
Sent: Monday, February 10, 2020 3:06 PM
No matter how much money you put into your peering router, the session will be no more stable than whatever the peer did to their end.
Agreed, that's a fair point.
Plus at some point you will need to reboot due to software upgrade or other reasons.
There are ways of draining traffic for planned maintenance.
If you care at all, you should be doing redundancy by having multiple locations and multiple routers. You can then save the money spent on each router, because a router failure will not cause any change in what the Internet sees through BGP.
I think a router failure will cause a change in what the Internet sees, as you rightly outlined below:
Also, transits are way more important than peers. Losing a transit will cause massive route changes around the globe, and it will take a few minutes to stabilize. Losing a peer usually just means the peer switches to the transit route that they already had available.
Agreed, and I suppose the question is whether folks tend to try to minimize these impacts by all means possible, or just take them as a necessary evil that will eventually happen.
Peers are not equal. You may want to ensure redundancy to your biggest peers, while the small fish will be fine without.
To be explicit: Router R1 has connections to transits T1 and T2. Router R2 also has connections to the same transits T1 and T2. When router R1 goes down, only small internal changes at T1 and T2 happen. Nobody notices, and the recovery is sub-second.
Good point again. Though if I had only T1 on R1 and only T2 on R2, then convergence won't happen inside each transit, but instead between T1 and T2, which will add to the convergence time.

So, thinking about it, the optimal design pattern in a distributed (horizontally scaled-out) edge seems to be to pair up, i.e. have at least two edge nodes per transit (or peer, for that matter), to allow for potentially faster intra-transit convergence rather than the arguably slower inter-transit convergence.

adam
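A minimal sketch of that comparison (plain standard-library Python, hypothetical two-router/two-transit topology, nobody's production tooling), just to make the trade-off concrete:

    # Which transits still hear our announcement after one edge router dies?
    # Hypothetical topology, standard-library Python only.

    paired = {"R1": {"T1", "T2"}, "R2": {"T1", "T2"}}   # each transit on two routers
    split  = {"R1": {"T1"},       "R2": {"T2"}}         # one transit per router

    def transits_still_announced(topology, failed_router):
        """Transits that still receive our prefix directly after a router failure."""
        return set().union(*(links for router, links in topology.items()
                             if router != failed_router))

    for name, topo in (("paired", paired), ("split", split)):
        surviving = transits_still_announced(topo, "R1")
        withdrawn = {"T1", "T2"} - surviving
        print(f"{name}: R1 fails -> still announced to {sorted(surviving)}, "
              f"withdrawn from {sorted(withdrawn) or 'none'}")

    # paired: T1 and T2 both keep a session; only intra-transit convergence happens.
    # split:  T1 loses the prefix entirely and must withdraw it towards the rest of
    #         the Internet, which then reconverges onto paths via T2.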
On Mon, Feb 10, 2020 at 5:42 PM <adamv0025@netconsultings.com> wrote:
To be explicit: Router R1 has connections to transits T1 and T2. Router R2 also has connections to the same transits T1 and T2. When router R1 goes down, only small internal changes at T1 and T2 happen. Nobody notices, and the recovery is sub-second.
Good point again. Though if I had only T1 on R1 and only T2 on R2, then convergence won't happen inside each transit, but instead between T1 and T2, which will add to the convergence time. So, thinking about it, the optimal design pattern in a distributed (horizontally scaled-out) edge seems to be to pair up, i.e. have at least two edge nodes per transit (or peer, for that matter), to allow for potentially faster intra-transit convergence rather than the arguably slower inter-transit convergence.
I am assuming R1 and R2 are connected and announcing the same routes. Each transit is therefore receiving the same routes from two independent routers (R1 and R2). When R1 goes down, something internal at the transit will change to reflect that, but peers, other customers of that transit, and higher-tier transits will see no difference at all. Assuming R1 and R2 both announce a default route internally in your network, your internal convergence will be as fast as your detection of the dead router.

This scheme also protects against link failure or failure at the provider end (if you make sure the transit is also using two routers). Therefore, even if R1 and R2 are in the same physical location, maybe in the same rack mounted on top of each other, that is a better solution than one big hunky router with redundant hardware. Having them at different locations is better, of course, but not always feasible.

Many dual-homed companies may start out with two routers and two transits but without dual links to each transit, as you describe above. That will cause significant disruption if one link goes down. It is not just about convergence between T1 and T2, but for a major part of the Internet. Been there, done that: yes, you can be down for up to several minutes before everything is normal again. Assume tier 1 transits and that contact to T1 was lost. This means T1 will have a peering session with T2 somewhere, but T1 will not allow peer-to-peer traffic to go via that link. All those peers will need to search for a different way to reach you, a way that does not transit T1 (unless they have a contract with T1).

Therefore, if being down for several minutes is not OK, you should invest in dual links to your transits, and connect those to two different routers. If possible, with a guarantee that the transits use two routers at their end and that diverse fiber paths are used, etc.

Regards,
Baldur
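A rough sketch of the valley-free export logic behind that last point (Gao-Rexford-style rules in plain Python; the AS names and relationships are made up for illustration):

    # Why T1's peers lose you when your T1 link dies: a route learned from a peer
    # (here, your announcement arriving at T1 via its peering with T2) is only
    # re-exported to customers, never to other peers or to upstreams.
    # Relationships below are hypothetical.

    rel = {
        ("T1", "C1"): "customer",   # C1 buys transit from T1
        ("T1", "P3"): "peer",       # P3 peers settlement-free with T1
    }

    def exports(learned_via, relationship_to_neighbor):
        """Valley-free export rule: customer routes go to everyone,
        peer/provider routes only go downhill to customers."""
        if learned_via == "customer":
            return True
        return relationship_to_neighbor == "customer"

    # Your prefix now reaches T1 only as a "peer" route (via the T1-T2 peering):
    print(exports("peer", rel[("T1", "C1")]))  # True:  T1's customers still reach you
    print(exports("peer", rel[("T1", "P3")]))  # False: T1's peers must find another path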
Hello Baldur,

On Mon, 10 Feb 2020 at 19:57, Baldur Norddahl <baldur.norddahl@gmail.com> wrote:
Many dual-homed companies may start out with two routers and two transits but without dual links to each transit, as you describe above. That will cause significant disruption if one link goes down. It is not just about convergence between T1 and T2, but for a major part of the Internet. Been there, done that: yes, you can be down for up to several minutes before everything is normal again. Assume tier 1 transits and that contact to T1 was lost. This means T1 will have a peering session with T2 somewhere, but T1 will not allow peer-to-peer traffic to go via that link. All those peers will need to search for a different way to reach you, a way that does not transit T1 (unless they have a contract with T1).
Therefore, if being down for several minutes is not OK, you should invest in dual links to your transits, and connect those to two different routers. If possible, with a guarantee that the transits use two routers at their end and that diverse fiber paths are used, etc.
That is not my experience *at all*. I have always seen my prefixes converge in a couple of seconds upstream (vs. 2 different Tier 1s). That is with a double-digit number of announcements. Maybe if you announce tens of thousands of prefixes as a large Tier 2, things are more problematic; that I can't tell. Or maybe you hit some old-school route dampening somewhere down the path. Maybe there is another reason for this. But even if 3 AS hops are involved, I don't really understand how they would spend *minutes* to converge after receiving your BGP withdraw message.

When I saw *minutes* of brownouts in connectivity, it was always because of ingress prefix convergence (or the lack thereof, due to slow FIB programming, then temporary internal routing loops, nasty things like that), but never external.

I agree there are a number of reasons (including best convergence) to have completely diversified connections to a single transit AS. Another reason is that when you manually reroute traffic for a certain AS path (say transit 2 has an always-congested PNI towards a third-party ASN), you may not have an alternative to the congested path when your other transit provider goes away.

But I never saw minutes of brownout because of upstream -> downstream -> downstream convergence (or whatever the scenario looks like).

lukas
On Tue, Feb 11, 2020 at 12:33 AM Lukas Tribus <lists@ltri.eu> wrote:
Therefore, if being down for several minutes is not OK, you should invest in dual links to your transits, and connect those to two different routers. If possible, with a guarantee that the transits use two routers at their end and that diverse fiber paths are used, etc.
That is not my experience *at all*. I have always seen my prefixes converge in a couple of seconds upstream (vs. 2 different Tier 1s).
This is a bit old but probably still holds: https://labs.ripe.net/Members/vastur/the-shape-of-a-bgp-update

Quote: "To conclude, we observe that BGP route updates tend to converge globally in just a few minutes. The propagation of newly announced prefixes happens almost instantaneously, reaching 50% visibility in just under 10 seconds, revealing a highly responsive global system. Prefix withdrawals take longer to converge and generate nearly 4 times more BGP traffic, with the visibility dropping below 10% only after approximately 2 minutes".

Unfortunately, they did not test the case of a withdrawal from one router while the prefix is still active at another.
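For what it's worth, that specific case could be measured against RIS data during a planned maintenance on one's own prefix. A rough sketch, assuming the pybgpstream library (the filter syntax and element fields are from memory and worth checking against its documentation); the prefix, time window and collector are placeholders:

    # Sketch: list announce/withdraw events seen by RIS peers for your prefix
    # around a window where you dropped one transit session while the prefix
    # stayed announced elsewhere. Assumes the pybgpstream package; prefix,
    # times and collector below are placeholders, not real data.
    import pybgpstream

    stream = pybgpstream.BGPStream(
        from_time="2020-02-10 12:00:00", until_time="2020-02-10 12:30:00",
        collectors=["rrc00"],
        record_type="updates",
        filter="prefix more 192.0.2.0/24",      # substitute your own prefix
    )

    events = sorted((elem.time, elem.peer_asn, elem.type)
                    for elem in stream if elem.type in ("A", "W"))

    for ts, peer_asn, kind in events:
        print(ts, peer_asn, "announce" if kind == "A" else "withdraw")
    # Plotting the withdraw count over time shows how long the partial withdrawal
    # takes to settle when an alternate announcement is still in place.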
When I saw *minutes* of brownouts in connectivity, it was always because of ingress prefix convergence (or the lack thereof, due to slow FIB programming, then temporary internal routing loops, nasty things like that), but never external.
That is also a significant problem. In the case of a single transit connection per router, with two routers and two providers, there will be a lot of internal convergence between your two routers after a link failure. That is also avoided by having both routers hold the same provider connections: a router may still have to invalidate many routes, but there will be no loops, and the router already has loop-free alternatives (to the other provider) loaded into memory. Plus, you can use the simple trick of having a default route as a fallback.

Regards,
Baldur
From: Baldur Norddahl
Sent: Wednesday, February 12, 2020 7:57 PM
On Tue, Feb 11, 2020 at 12:33 AM Lukas Tribus <lists@ltri.eu> wrote:
Therefore, if being down for several minutes is not OK, you should invest in dual links to your transits, and connect those to two different routers. If possible, with a guarantee that the transits use two routers at their end and that diverse fiber paths are used, etc.
That is not my experience *at all*. I have always seen my prefixes converge in a couple of seconds upstream (vs. 2 different Tier 1s).
This is a bit old but probably still holds:
https://labs.ripe.net/Members/vastur/the-shape-of-a-bgp-update
Quote: "To conclude, we observe that BGP route updates tend to converge globally in just a few minutes. The propagation of newly announced prefixes happens almost instantaneously, reaching 50% visibility in just under 10 seconds, revealing a highly responsive global system. Prefix withdrawals take longer to converge and generate nearly 4 times more BGP traffic, with the visibility dropping below 10% only after approximately 2 minutes".
Unfortunately, they did not test the case of a withdrawal from one router while the prefix is still active at another.
Yes, that's unfortunate. Although I'm thinking that the convergence time would be highly dependent on the first-hop upstream providers involved in the "local repair" for the affected AS: once that is done, it doesn't matter that the whole world still routes traffic for the affected AS towards the original first-hop upstream AS, as long as that AS has a valid detour route.

And I guess the topology of this first-hop outskirt of the affected AS involved in the "local repair" would dictate the convergence time. E.g. if your upstream A box happens to have a direct (usable) link/session to the upstream B box, winner; however, the higher the number of boxes involved in the "local repair" detour that need to be told "A no more, now B is the way to go", the longer the convergence time.

But if a significant portion of the Internet gets the withdraw in 2 minutes, I'm wondering how long it could take for a typical "local repair" string of BGP speakers to all get the memo. Realistically, how many BGP speakers could that be, ranging from a minimum of 2 to a maximum of... say ~6?
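As a back-of-the-envelope only (all timer values below are illustrative placeholders, not measurements; real MRAI and processing times vary a lot by implementation and configuration), the per-hop arithmetic looks something like this:

    # Each speaker in the "local repair" string has to receive the update, rerun
    # best-path selection, update its FIB and re-advertise, typically rate-limited
    # by its advertisement interval (MRAI). Numbers are illustrative placeholders.

    def repair_time_seconds(hops, mrai=30.0, processing=1.0, mrai_fraction=0.5):
        """Serial estimate: each hop waits some fraction of an MRAI plus processing."""
        return hops * (mrai * mrai_fraction + processing)

    for hops in (2, 4, 6):
        typical = repair_time_seconds(hops)                      # ~half an MRAI per hop
        worst = repair_time_seconds(hops, mrai_fraction=1.0)     # full MRAI per hop
        print(f"{hops} speakers: ~{typical:.0f}s typical, ~{worst:.0f}s worst case")

    # 2 speakers: ~32s typical, ~62s worst case
    # 4 speakers: ~64s typical, ~124s worst case
    # 6 speakers: ~96s typical, ~186s worst case
    # ...in the same ballpark as the ~2 minute withdrawal figure in the RIPE study.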
When I saw *minutes* of brownouts in connectivity, it was always because of ingress prefix convergence (or the lack thereof, due to slow FIB programming, then temporary internal routing loops, nasty things like that), but never external.
That is also a significant problem. In the case of a single transit connection per router, with two routers and two providers, there will be a lot of internal convergence between your two routers after a link failure. That is also avoided by having both routers hold the same provider connections: a router may still have to invalidate many routes, but there will be no loops, and the router already has loop-free alternatives (to the other provider) loaded into memory. Plus, you can use the simple trick of having a default route as a fallback.
This is a very good point, actually. Indeed, since the box has two transit sessions, in case of a failure of only one of them it will still retain all the prefixes in the FIB; it will just need to reprogram a few next-hops to point towards the other eBGP/iBGP speakers, whichever offers the best path. And reprogramming next-hops is significantly faster (with hierarchical FIBs anyway).

adam
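A toy model of that indirection (plain Python; the prefix counts and next-hop names are made up), just to show why the repair cost is one shared path-list write instead of one write per prefix:

    # Flat FIB: every prefix stores its next-hop directly, so a transit failure
    # means rewriting every affected entry.
    flat_fib = {f"prefix{i}": "T1_nexthop" for i in range(700_000)}

    # Hierarchical FIB: prefixes point at a shared path-list (BGP PIC Edge style).
    pathlists = {"via_transits": ["T1_nexthop", "T2_nexthop"]}   # primary + backup
    hier_fib = {f"prefix{i}": "via_transits" for i in range(700_000)}

    def fail_nexthop_flat(fib, dead, backup):
        """Rewrite every prefix that pointed at the dead next-hop."""
        touched = 0
        for prefix, nh in fib.items():
            if nh == dead:
                fib[prefix] = backup
                touched += 1
        return touched

    def fail_nexthop_hier(pathlists, dead):
        """Remove the dead next-hop from the shared path-list(s) only."""
        touched = 0
        for nhs in pathlists.values():
            if dead in nhs:
                nhs.remove(dead)
                touched += 1
        return touched

    print(fail_nexthop_flat(flat_fib, "T1_nexthop", "T2_nexthop"))  # 700000 writes
    print(fail_nexthop_hier(pathlists, "T1_nexthop"))               # 1 write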
On 10/Feb/20 17:06, Baldur Norddahl wrote:
Also, transits are way more important than peers. Losing a transit will cause massive route changes around the globe, and it will take a few minutes to stabilize. Losing a peer usually just means the peer switches to the transit route that they already had available.
Not in our case, where only 15% of our traffic is handled by our transit providers; 85% of our traffic comes from peering. Then again, we have a single connection to each of the 7 major transit providers, spread across multiple cities. But I appreciate that not many operators can be in this position.

Mark.
Hello Adam,

On Mon, 10 Feb 2020 at 13:37, <adamv0025@netconsultings.com> wrote:
Would like to take a poll on whether you folks tend to treat your transit/peering connections (BGP sessions in particular) as pets or rather as cattle.
Cattle, every day of the week. I don't trust control-plane resiliency and things like ISSU any farther than I can throw the big boxes they run on.

The entire network is engineered so that my customers *do not* feel the loss of one node (*). That is the design principle here, and while traffic grows and we keep adding more capacity, this is something we always consider. How difficult it is to achieve that depends on the particular situation, and it may be quite difficult in some situations, but not here. That is why I can upgrade releases on those nodes (without customers, just transit and peers) quite frequently. I can achieve that with mostly zero packet loss because of the design and all-around traffic draining using graceful shutdown and friends. We had quite some issues draining traffic from nodes in the past (brownouts due to FIB mismatches between routers, caused by IP lookups on both the ingress and egress node with per-VRF label allocation), but since we switched to "per-ce" (meaning per-nexthop) label allocation, things work great.

On the other side, transit with support for graceful shutdown is of course great, but even if there is no support for it, for maintenance on your or your transit's box you still know about the maintenance beforehand, so you can manually drain your egress traffic (your peer doesn't have to support RFC 8326 for you to drop YOUR loc-pref to zero), and many transit providers have some kind of "set loc-pref below peer" community, which allows you to do basically the same thing manually without actual RFC 8326 support on the other side. That said, for ingress traffic, unless you are announcing *A LOT* of routes, convergence is usually *very* fast anyway.

I can see the benefit of having internal HW redundancy for nodes where customers are connected (shorter maintenance windows, fewer outages in some single-HW-failure scenarios, overall theoretically better service uptime), but it never covers everything, and it may just introduce unnecessary complexity that actually ends up root-causing outages. Maybe I'm just a lucky fellow, but the hardware has been so reliable here that I'm pretty sure the complexity of dual RSPs, ISSU and friends would have caused more issues over time than what I'm seeing with some good old and honest HW failures.

Regarding HW redundancy itself: a dual RSP doesn't have any benefit when the guy in the MMR pulls the wrong fiber, bringing down my transit. It will still be BGP that has to converge. We don't have PIC today; maybe this is something to look into in the future, but it isn't something that internal HW redundancy fixes.

A straightforward and KISS design, where the engineers actually know "what happens when" and how to do things properly (like draining traffic), and also quite frankly accepting some brownouts for uncommon events, is the strategy that worked best for us.

(*) Sure, if the node with 700k best paths towards a transit dies non-gracefully (HW or power failure), there will be a brownout of the affected prefixes for some minutes. But after convergence my network will be fine and my customers will stop feeling it. They will ask what happened and I will be able to explain.

cheers,
lukas
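For reference, the receive-side draining Lukas describes is what RFC 8326 standardizes with the well-known GRACEFUL_SHUTDOWN community (65535:0). A minimal stand-in for the import policy in plain Python (not any vendor's actual policy syntax; the route structure is made up for illustration):

    # RFC 8326 receive-side behaviour: if a route carries the GRACEFUL_SHUTDOWN
    # community (65535:0), set LOCAL_PREF to 0 so traffic drains onto other paths
    # before the session is torn down. Route structure here is a made-up stand-in.

    GRACEFUL_SHUTDOWN = (65535, 0)   # well-known community from RFC 8326

    def import_policy(route):
        """route: dict with 'communities' (set of (asn, value) tuples) and 'local_pref'."""
        if GRACEFUL_SHUTDOWN in route["communities"]:
            route["local_pref"] = 0   # least preferred; any alternate path now wins
        return route

    route = {"prefix": "203.0.113.0/24",
             "communities": {(65535, 0), (64500, 100)},
             "local_pref": 200}
    print(import_policy(route)["local_pref"])   # 0 -> traffic shifts away before maintenance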
On 10/Feb/20 14:37, adamv0025@netconsultings.com wrote:
The “cattle” case:
Or would you instead rely on small-ish non-redundant HW at your internet edge, rather than trying to enhance MTBF with a big chassis full of redundant HW?
Is this because, eventually, the MTBF figure for a particular transit/peering eBGP session boils down to the MTBF of the single card or even the single optical module hosting the link (and with bundles over separate cards you can never be quite sure what the setup looks like on the other end of that connection)?
Or is it because the effect of a smaller/non-resilient border edge device failure is not that bad in your particular (maybe horizontally scaled) setup?
This, for us.

We pick up transit and peering in multiple cities around the world. The cute boxes at the time were the MX80 and ASR9001. These have since run out of steam for us, and in many sites our best option was the MX480, as there was no other high-performance, non-redundant device that made sense to us. However, just months after upgrading to the MX480, the MX204 launched. So now we focus on the MX204 for peering and transit.

It's so massively distributed that it doesn't make sense to aggregate multiple exchange points or transit providers in a single location. And if a device in one location were to fail, there is sufficient coverage across the backbone to pick up the slack. We also use separate devices for transit and peering.

Of course, transit providers (and some exchange points) don't really enjoy this model with us, as they'd like to sell us a multi-site contract, which doesn't make any sense to us.

Mark.