On Tue, Dec 18, 2018 at 6:40 PM Randy Bush <randy@psg.com> wrote:
do you have rfd on? with what parms?
I assume rfd in this context means "Route Flap Dampening". NTT / AS 2914 does *not* have Route Flap Dampening configured, as documented here: https://us.ntt.net/support/policy/routing.cfm#routedampening
Kind regards,
Job
Route Flap Damping via https://tools.ietf.org/html/rfc2439 for everyone. On Tue, Dec 18, 2018 at 11:42 AM Randy Bush <randy@psg.com> wrote:
do you have rfd on? with what parms?
randy
-- - Andrew "lathama" Latham -
I always wondered why it has to be so binary. I don't want to decide for my customers whether partial visibility is better than a busy CPU, but I do appreciate stability. Why can't we have a local-pref penalty for a flapping route? If it's the only option, keep offering it; if there are other, more stable options, offer those.
-- ++ytti
Mainly because propagating a flapping route across the entire Internet is damaging to the performance of things other than your own equipment and that of your customer. It is just "bad manners" to propagate a flapping route to your peers, and damping helps maintain a minimum level of stability that is required to keep you "on the Internet". Imagine a table where 1000s of providers are each sending 100s of unstable routes and that those unstable routes might be redistributed into various IGPs that may not respond very gracefully to rapid table changes (like most distance vector IGPs). Also think of this scenario: your link to your customer might be flapping, but that same customer might have other carriers advertising the same address space over a stable link. In that case you would be doing a disservice by not withdrawing that route, and having a local-pref does not help since you don't necessarily have visibility into all of your customer's other carrier networks.
You do have the ability to clear the RFD timers for a route if you need to manually intervene, for example when you know for a fact that you fixed the problem. That means that if no one is watching or intervening, the network will "do the safe thing".
Steven Naslund Chicago IL
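The suppress-and-reuse behaviour being debated here (RFC 2439, linked earlier in the thread), including the manual clearing of RFD timers, can be sketched roughly as follows. This is only a toy model for illustration: the class and method names are hypothetical, and the half-life, per-flap penalty, and thresholds are illustrative values chosen for the sketch, not taken from the thread or from any particular implementation. It also omits a maximum suppress time.

```python
from dataclasses import dataclass


@dataclass
class RouteState:
    penalty: float = 0.0      # accumulated flap penalty
    last_update: float = 0.0  # time (seconds) the penalty was last decayed
    suppressed: bool = False  # True while the route is withheld


class FlapDamper:
    """Toy damping model: fixed penalty per flap, exponential decay."""

    def __init__(self, half_life=900.0, per_flap=1000.0,
                 suppress_at=2000.0, reuse_at=750.0):
        # All four values are illustrative assumptions.
        self.half_life = half_life
        self.per_flap = per_flap
        self.suppress_at = suppress_at
        self.reuse_at = reuse_at
        self.routes = {}

    def _decay(self, state, now):
        # Halve the penalty for every half_life seconds that have passed.
        elapsed = max(0.0, now - state.last_update)
        state.penalty *= 0.5 ** (elapsed / self.half_life)
        state.last_update = now
        # A suppressed route is reused once the penalty decays far enough.
        if state.suppressed and state.penalty < self.reuse_at:
            state.suppressed = False

    def flap(self, prefix, now):
        """Record one flap; return True if the route is now suppressed."""
        state = self.routes.setdefault(prefix, RouteState(last_update=now))
        self._decay(state, now)
        state.penalty += self.per_flap
        if state.penalty >= self.suppress_at:
            state.suppressed = True
        return state.suppressed

    def is_suppressed(self, prefix, now):
        state = self.routes.get(prefix)
        if state is None:
            return False
        self._decay(state, now)
        return state.suppressed

    def clear(self, prefix):
        """Manual intervention: forget all damping history for a prefix."""
        self.routes.pop(prefix, None)
```

With these assumed values, three flaps one minute apart are enough to suppress a prefix, and it stays suppressed for roughly two half-lives unless an operator calls `clear()`.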
Hi Steve,
Lowering the LP would achieve the outcome you desire, provided there are (stable) alternative paths.
What you advocate results in absolute outages in what may already be precarious situations (natural disasters?) - what Saku Ytti suggests seems like a less painful alternative with desirable properties.
Kind regards,
Job
On Tue, Dec 18, 2018 at 21:56 Naslund, Steve <SNaslund@medline.com> wrote:
Mainly because propagating a flapping route across the entire Internet is damaging to the performance of things other than your own equipment and that of your customer. It is just "bad manners" to propagate a flapping route to your peers, and damping helps maintain a minimum level of stability that is required to keep you "on the Internet". Imagine a table where 1000s of providers are each sending 100s of unstable routes and that those unstable routes might be redistributed into various IGPs that may not respond very gracefully to rapid table changes (like most distance vector IGPs). Also think of this scenario: your link to your customer might be flapping, but that same customer might have other carriers advertising the same address space over a stable link. In that case you would be doing a disservice by not withdrawing that route, and having a local-pref does not help since you don't necessarily have visibility into all of your customer's other carrier networks.
You do have the ability to clear the RFD timers for a route if you need to manually intervene for example when you know for a fact that you fixed the problem. That means that if no one is watching or intervening the network will "do the safe thing".
Steven Naslund Chicago IL
Remember always that the local pref is just that, YOUR local preference. Sending that flapping route upstream does not give your peer the option to ignore it. In any case, the downside is that you have to process that route and then choose whether or not to use it. It’s like saying “now that you have processed this unstable route and burned your CPU cycles, I am now giving you the option not to install it into your table”. Remember also that we are only talking about default behavior here. You always have the option to override it by changing timers or penalties, or by shutting down RFD altogether. We are only talking about day-to-day operation here.
Also, keep in mind that when we are talking about alternative stable paths we are only talking about what your network sees, not the entire Internet. If you as a service provider are experiencing major issues, you may see a route to me as stable or unstable, but making global routing decisions based on that is not sound. What might be best for your customer or your business might not be best for the Internet community as a whole. It is a matter of scale: how many service providers can allow how many unstable routes before the entire network becomes regionally or globally unstable? It’s important to remember that flapping routes leave a certain amount of data in flight with no destination, which is detrimental to overall performance. As we move into a V6 world we are again worried about the size of the global routing tables and about pushing routing performance. Instability of routes is dangerous to systems running near their limits. Propagating a known unstable route would be a major shift in routing policy. Today, you either say you can reach something or you don’t say anything. Using the suggested alternative adds the option of “I might be able to reach this but not reliably”, which then brings about metrics of “how reliably?”, and that is a huge shift in how global routing works. We have been struggling with a backbone routing protocol that does not really do a good job of understanding bandwidth and multiple paths, so I would suggest that adding “maybe” routes is not a good idea.
At least using RFD you can explain to your customer why they are not reachable rather than explaining how you made a manual decision to dump them for the “good of the Internet”. There is also a business penalty to the service provider that exposes instability to the network. People don’t want to peer or send traffic through unstable network regions.
Steve
What would really be of interest to me would be for those that run RFD to measure its impact on their network (positive or otherwise) so we have something scientific to base this on.
The theory (and practice of old) tells us that RFD is either very good, or very bad. There are probably more folk that have turned it off than run it, or vice versa. Ultimately, if we can get the state of RFD's performance in 2018 on an axis, our words will likely carry more weight.
Mark.
On 18/Dec/18 23:24, Naslund, Steve wrote:
Remember always that the local pref is just that, YOUR local preference. Sending that flapping route upstream does not give your peer the option to ignore it. In any case, the downside is that you have to process that route and then choose whether or not to use it. It’s like saying “now that you have processed this unstable route and burned your CPU cycles, I am now giving you the option not to install it into your table”. Remember also that we are only talking about default behavior here. You always have the option to override it by changing timers or penalties, or by shutting down RFD altogether. We are only talking about day-to-day operation here.
Also, keep in mind that when we are talking about alternative stable paths we are only talking about what your network sees, not the entire Internet. If you as a service provider are experiencing major issues, you may see a route to me as stable or unstable, but making global routing decisions based on that is not sound. What might be best for your customer or your business might not be best for the Internet community as a whole. It is a matter of scale: how many service providers can allow how many unstable routes before the entire network becomes regionally or globally unstable? It’s important to remember that flapping routes leave a certain amount of data in flight with no destination, which is detrimental to overall performance. As we move into a V6 world we are again worried about the size of the global routing tables and about pushing routing performance. Instability of routes is dangerous to systems running near their limits. Propagating a known unstable route would be a major shift in routing policy. Today, you either say you can reach something or you don’t say anything. Using the suggested alternative adds the option of “I might be able to reach this but not reliably”, which then brings about metrics of “how reliably?”, and that is a huge shift in how global routing works. We have been struggling with a backbone routing protocol that does not really do a good job of understanding bandwidth and multiple paths, so I would suggest that adding “maybe” routes is not a good idea.
At least using RFD you can explain to your customer why they are not reachable rather than explaining how you made a manual decision to dump them for the “good of the Internet”. There is also a business penalty to the service provider that exposes instability to the network. People don’t want to peer or send traffic through unstable network regions.
Steve
I think you will find that very hard to evaluate, since the value of RFD will be different in different network regions. For example, it is probably good practice to run RFD toward a customer on an unstable access link. It might not be a good idea to run it on a major backbone link that could possibly flap a large number of times in a very short period due to something like a maintenance activity. Also, areas that are largely on a fiber infrastructure will see RFD in a much different light than areas on a largely wireless infrastructure that might be subject to momentary interference or interruptions. I think it is safest to say that RFD needs to be evaluated and tuned for what you want it to do. Penalties are never a pleasant thing but they prevent lawlessness. That is exactly what RFD does. You are the cop that decides how to enforce the laws.
In fact, in my experience people could also get much better network performance overall by properly tuning BGP timers, but very few actually do it. I bet you could improve Internet stability way more by doing that.
Steven Naslund Chicago IL
Dear Steve,
No worries, I have not forgotten the transitive properties of the LOCAL_PREF BGP Path Attribute! :-) You are right that any LOCAL_PREF modifications (and the attribute itself) are local to the Autonomous System in which they were set, but the effects of such settings can percolate further into the routing system.
A great example is the "BGP Graceful Shutdown" mechanism (science partially documented in https://tools.ietf.org/html/rfc6198, actual specification here https://tools.ietf.org/html/rfc8326). What is interesting is that by considering a path (any path, could be flapping) less preferable, my network will propagate alternative paths to my neighboring networks, or possibly even *withdraw* my announcement in favor of alternative (stable?) paths via competitors.
By attaching a lower LOCAL_PREF value to a given path for a period of time as a 'penalty' for flapping, I suspect the visibility of that flapping will be greatly reduced. This of course doesn't hold true when the only origin of the path is flapping, but in many flapping cases I triaged it was clear that only one out of many links was the root of the flapping.
I'm not sure I share your concerns about scale; it appears that so far we seem to be doing just fine without "route flap dampening, penalty type: suppress". No customers ask for it, in fact many are relieved we don't use it. None of our peering partners ask for it either. When we see oscillating paths we reach out to the offending party and ask them to fix it, or take unilateral action within a specific time frame.
Kind regards,
Job
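The difference between "penalty type: suppress" and a temporary LOCAL_PREF penalty can be illustrated with a small best-path sketch. This is a minimal model under stated assumptions, not how any router implements it: the `Path` type, the `best_path` function, the `mode` names, and the penalty size of 50 are all hypothetical, and best-path selection is reduced to LOCAL_PREF followed by AS path length.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class Path:
    neighbor: str
    local_pref: int
    as_path_len: int
    flapping: bool = False  # assumed to come from some local flap detector


def best_path(paths: Sequence[Path], mode: str = "depref",
              depref_by: int = 50) -> Optional[Path]:
    """Pick a best path; mode is 'suppress' or 'depref' (both hypothetical)."""
    candidates = []
    for p in paths:
        if p.flapping and mode == "suppress":
            continue  # classic damping: the path is not considered at all
        lp = p.local_pref
        if p.flapping and mode == "depref":
            lp -= depref_by  # penalty: less preferred, still usable
        # Higher LOCAL_PREF wins, then shorter AS path.
        candidates.append((lp, -p.as_path_len, p))
    if not candidates:
        return None  # prefix unreachable from this router's point of view
    return max(candidates, key=lambda c: (c[0], c[1]))[2]
```

With a stable alternative present, both modes move traffic off the flapping path. When the flapping path is the only one, `suppress` yields no route while `depref` keeps the prefix reachable, which is the case the thread disagrees about.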
I will grant you that no customer ever asked for route dampening. I also realize that RFD is much less important now than in the past. I come from the ARPANET/DDN ages of the Internet and can tell you that RFD was absolutely critical in the days of very underpowered routers and very unstable data links. I remember when it was quite hard to maintain a 64k link to some locations at all. There might be less of a need for such a simple RFD but it did serve its purpose. In fact, my main argument on this whole topic is that RFD is not relevant enough to waste a lot of effort on a globally accepted mechanism. It is just not the low hanging fruit of routing performance improvements. I see two major improvements to global routing: congestion avoidance (which goes a little bit with bandwidth awareness but not exactly) and multipath load balancing (which kind of requires a congestion avoidance awareness). Both of these are going to be extremely difficult issues on a global scale of adoption, but that's what is needed.
Steven Naslund Chicago IL
In general I agree with the idea here, but I would also be interested in the possibility of running the local route policy engine against routes that are locally detected to meet a damping condition (user configurable of course). This would potentially yield the ability to change local_pref as well as other attributes that may be useful, such as MED/metric (which can be transitive) and/or communities.
On Tue, Dec 18, 2018 at 4:55 PM Job Snijders <job@ntt.net> wrote:
Dear Steve,
No worries, I have not forgotten the transitive properties of the LOCAL_PREF BGP Path Attribute! :-) You are right that any LOCAL_PREF modifications (and the attribute itself), are local to the Autonomous System in which they were set, but the effects of such settings can percolate further into the routing system.
A great example is the "BGP Graceful Shutdown" mechanism (science partially documented in https://tools.ietf.org/html/rfc6198, actual specification here https://tools.ietf.org/html/rfc8326). What is interesting is that by considering a path (any path, could be flapping) my network will propagate alternative paths to my neighboring networks, or possibly even *withdraw* my announcement in favor of alternative (stable?) paths via competitors.
By attaching a lower LOCAL_PREF value to a given path for a period of time as a 'penalty' for flapping, I suspect the visibility of that flapping will be greatly reduced. This of course doesn't hold true when the only origin of the path is flapping, but in many flapping cases I triaged it was clear that only one out of many links was the root of the flapping.
I'm not sure I share your concerns about scale, it appears that so far we seem to be doing just fine without "route flap dampening, penalty type: suppress". No customers ask for it, in fact many are relieved we don't use it. None of our peering partners ask for it either. When we see oscillating paths we reach out to the offending party and ask them to fix it, or take unilateral action within a specific time frame.
Kind regards,
Job
-- [stillwaxin@gmail.com ~]$ cat .signature cat: .signature: No such file or directory [stillwaxin@gmail.com ~]$
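The idea of handing damped routes to the local policy engine, rather than hard-coding "suppress", might look something like the sketch below. Everything in it is an assumption made for illustration: the `Route` type, the threshold, the attribute changes, and the community value `65000:999` are hypothetical and do not describe any existing router feature.

```python
from dataclasses import dataclass, replace
from typing import Callable, FrozenSet


@dataclass(frozen=True)
class Route:
    prefix: str
    local_pref: int = 100
    med: int = 0
    communities: FrozenSet[str] = frozenset()


Policy = Callable[[Route], Route]


def flap_policy(route: Route) -> Route:
    """Example operator policy for routes that met the damping condition."""
    return replace(
        route,
        local_pref=route.local_pref - 50,               # prefer other paths
        med=route.med + 100,                            # hint to neighbors
        communities=route.communities | {"65000:999"},  # hypothetical tag
    )


def apply_damping_policy(route: Route, penalty: float,
                         threshold: float = 2000.0,
                         policy: Policy = flap_policy) -> Route:
    """Run the operator's policy only when the flap penalty is high enough."""
    if penalty >= threshold:
        return policy(route)
    return route
```

Because the policy is an ordinary function, an operator could swap in one that only sets a community, or one that returns the route unchanged for particular customers.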
Randy Bush Sent: Tuesday, December 18, 2018 5:40 PM
do you have rfd on? with what parms?
randy
If I remember correctly, the industry has gone back and forth on this several times now. First it was deemed good; then some studies came out proving the penalty is worse than the crime; a couple of years later another study came out suggesting that if correct parameters are used it should be alright. But I guess by that time no one could have cared less, having already switched it on and off and on again...
With regards to the comments made here on the number of unstable routes until the whole system or significant parts of it collapse, I could easily reverse that argument and ask how many badly configured rfd deployments it takes until the whole system shuts/dampens itself down... (positive vs negative feedback loop). I guess the ideal solution is somewhere in between.
Personally I think rfd is just the aspirin, i.e. not treating the cause but merely helping with the headaches. And I suspect that Interface State Dampening would address 80% of the route flaps out there (it works exactly like rfd but treats the cause), with the remainder being true protocol flaps caused either by misconfiguration of the max prefix limit (sessions should stay down), by BGP error handling (which again can be solved by the enhanced BGP error handling), or by genuine bugs.
adam
participants (10)
- adamv0025@netconsultings.com
- Andrew Latham
- Jared Mauch
- Job Snijders
- Job Snijders
- Mark Tinka
- Michael Still
- Naslund, Steve
- Randy Bush
- Saku Ytti