On Fri, Mar 20, 2020 at 3:09 PM Saku Ytti <saku@ytti.fi> wrote:
Hey Nimrod,
I was contacted by my NOC to investigate a LAG that was not distributing traffic evenly among the members to the point where one member was congested while the utilization on the LAG was reasonably low. Looking at my netflow data, I was able to confirm that this was caused by a single large flow of ESP traffic. Fortunately, I was able to shift this flow to another path that had enough headroom available so that the flow could be accommodated on a single member link.
With the increase in remote workers and VPN traffic that won't hash across multiple paths, I thought this anecdote might help someone else track down a problem that might not be so obvious.
This is known as the elephant flow problem. Some vendors have a solution for it: they dynamically monitor member utilisation and remap the hashResult => egressInt table, creating a bias that offsets the elephant flow.
One particular example:
https://www.juniper.net/documentation/en_US/junos/topics/reference/configura...
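To make the hashResult => egressInt remapping idea concrete, here is a minimal Python sketch. It is not any vendor's algorithm: the bucket count, the per-bucket rates, and the greedy policy in the hypothetical rebalance() function are all assumptions chosen for illustration. Each hash bucket points at one LAG member; when one member runs hot, smaller buckets are re-pointed at the coolest member, so the elephant's bucket is left alone and the other traffic is biased away from it.

```python
def member_load(table, bucket_bps, n_members):
    """Sum the measured rate of every hash bucket mapped to each member."""
    load = [0] * n_members
    for bucket, member in enumerate(table):
        load[member] += bucket_bps[bucket]
    return load


def rebalance(table, bucket_bps, n_members):
    """Greedy illustration: repeatedly move one bucket from the hottest
    member to the coolest, but only a bucket no larger than half the gap
    between them, so a move never overshoots. An elephant bucket that is
    too big to move stays where it is; the mice are moved around it."""
    table = list(table)
    while True:
        load = member_load(table, bucket_bps, n_members)
        hot = max(range(n_members), key=load.__getitem__)
        cold = min(range(n_members), key=load.__getitem__)
        gap = load[hot] - load[cold]
        movable = [b for b, m in enumerate(table)
                   if m == hot and 0 < bucket_bps[b] <= gap / 2]
        if not movable:
            return table
        biggest = max(movable, key=bucket_bps.__getitem__)
        table[biggest] = cold


# Hypothetical 2-member LAG with 8 hash buckets, initially split evenly.
table = [b % 2 for b in range(8)]
# Assumed per-bucket rates in Mb/s; bucket 0 carries the elephant flow.
bucket_bps = [900, 10, 20, 10, 20, 10, 20, 10]

new_table = rebalance(table, bucket_bps, 2)
print("before:", member_load(table, bucket_bps, 2))
print("after: ", member_load(new_table, bucket_bps, 2))
```

Note that a real implementation also has to worry about packet reordering when a bucket is moved, which this sketch ignores.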
Ideally VPN providers would be defensive and would use the UDP source port (SPORT) for entropy, like MPLSoUDP does.
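A rough Python sketch of why source-port entropy helps, under stated assumptions: CRC32 stands in for the router's real hash function, the addresses and the member_for() and entropy_sport() helpers are hypothetical, and the tunnel is assumed to be able to vary its UDP source port per inner flow (many VPN implementations do not). A plain ESP tunnel presents the same outer header fields for every inner flow, so it always lands on one member; a UDP encapsulation that derives SPORT from the inner flow gives the LAG hash something to work with.

```python
import zlib


def member_for(fields, n_members):
    """Pick a LAG member from the outer header fields (CRC32 as a stand-in hash)."""
    key = "|".join(map(str, fields)).encode()
    return zlib.crc32(key) % n_members


def entropy_sport(inner_flow):
    """Derive a UDP source port in 49152-65535 from a hash of the inner flow."""
    key = "|".join(map(str, inner_flow)).encode()
    return 49152 + zlib.crc32(key) % 16384


N_MEMBERS = 4
OUTER_SRC, OUTER_DST = "192.0.2.1", "198.51.100.1"  # documentation addresses

# 64 hypothetical inner TCP flows carried inside the tunnel.
inner_flows = [("10.0.0.%d" % i, "10.1.0.1", 6, 40000 + i, 443)
               for i in range(64)]

# Plain ESP (IP protocol 50): the outer fields are identical for every inner flow.
esp_members = {member_for((OUTER_SRC, OUTER_DST, 50), N_MEMBERS)
               for _ in inner_flows}

# UDP encapsulation (protocol 17) with a per-inner-flow source port.
udp_members = {member_for((OUTER_SRC, OUTER_DST, 17, entropy_sport(f), 4500),
                          N_MEMBERS)
               for f in inner_flows}

print("ESP tunnel uses members:", sorted(esp_members))
print("UDP tunnel uses members:", sorted(udp_members))
```

Because all packets of one inner flow map to the same source port, each inner flow still stays on a single member, so ordering within a flow is preserved.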
-- ++ytti
There are *several* caveats to doing dynamic monitoring and remapping of flows; one of the biggest challenges is that it puts extra demands on the line cards tracking the flows, especially as the number of flows rises to large values.

I recommend reading https://www.juniper.net/documentation/en_US/junos/topics/topic-map/load-bala... before configuring it.

"Although the feature performance is high, it consumes significant amount of line card memory. Approximately, 4000 logical interfaces or 16 aggregated Ethernet logical interfaces can have this feature enabled on supported MPCs. However, when the Packet Forwarding Engine hardware memory is low, depending upon the available memory, it falls back to the default load balancing mechanism."

What is that old saying? Oh, right--There Ain't No Such Thing As A Free Lunch. ^_^;;

Matt