Baldur Norddahl <baldur.norddahl@gmail.com> writes:
Hello
What is the best current practice for buffer size? For customer facing ports, core network ports and transit links?
We have a buffer problem, discovered by a customer that moved their servers to a cloud service some distance away. That resulted in a drastically reduced transfer speed between their office and the cloud service. Nothing much could be done since we, like so many others, have switches with extremely fast port speeds (48x 10G, 4x 100G) and a tiny shared 12 MB buffer.
Now the time has come to upgrade that hardware to something that does have plenty of buffer capacity, so I am planning out what the settings should be.
I have read this paper http://web.stanford.edu/class/cs244/papers/sizing-router-buffers-redux.pdf which claims not much buffer is needed at all. And I think they are completely wrong. In the paper they assume we are trying to get as much throughput as possible on a congested core or transit port. But we always make sure those are not congested, so that misses the mark completely. For core ports we are concerned about microbursts, and for customer ports we do care about a single TCP session being able to get the max throughput. The paper assumes there will be a lot of TCP sessions sharing the bandwidth, but that is not always the case with customer ports. It might be a guy downloading an ISO image, with him being the only guy at the office.
The common wisdom is to set the buffer size to one bandwidth-delay product, and that buffers bigger than this are harmful. But that raises the question of what distance I should tune for. Amsterdam is 10 ms away. East coast USA is 100 ms.
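To put numbers on the bandwidth-delay product rule mentioned above, here is a minimal sketch. It assumes the 10 ms and 100 ms figures are round-trip times, and the 1 Gbit/s rate is only an illustrative customer-port speed that does not appear in the original message; `bdp_bytes` is a hypothetical helper name.

```python
def bdp_bytes(rate_bps: float, rtt_s: float) -> float:
    """Bandwidth-delay product in bytes: link rate (bit/s) times RTT (s), divided by 8."""
    return rate_bps * rtt_s / 8


# Assumption: the quoted distances are treated as round-trip times.
destinations = [("Amsterdam", 10), ("US east coast", 100)]

for name, rtt_ms in destinations:
    for rate_gbps in (1, 10):  # 1 Gbit/s is an illustrative customer-port rate
        size = bdp_bytes(rate_gbps * 1e9, rtt_ms / 1e3)
        print(f"{name:14s} {rtt_ms:4d} ms @ {rate_gbps:3d} Gbit/s: {size / 1e6:7.2f} MB")
```

Under these assumptions, a single 10G port at 10 ms already needs about 12.5 MB for one bandwidth-delay product, roughly the whole 12 MB shared buffer of the existing switches, and tuning for 100 ms needs ten times that.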
There are a couple of trends here to be aware of: One is that the proliferation of CDNs and localised clouds means RTTs for a lot of bandwidth-heavy traffic are quite low these days. The second is that newer TCP congestion control algorithms such as BBR make heavy use of packet pacing, which all but eliminates the microbursts of older TCPs. BBR will run quite happily across a shallow-buffered link. Google is using this pretty much across all their infrastructure, so that's one major source of traffic (youtube, gcloud, etc) taken care of; not sure about other CDNs, but I do believe several of them have at least been experimenting with it...
What this means is that *if* buffer size is the only config knob you have to twiddle, you're likely better off erring on the side of too small a buffer than too big. The bufferbloat induced by overbuffering is going to hurt your customers more than the slight loss of single-flow legacy TCP performance caused by the occasional too-shallow buffer will.
Also the new hardware (Juniper ACX710) does support more than one queue per port. Would it be possible to have that ISO download go into a queue for heavy streams and allow smaller streams to skip the line, so we do not see the downside of a heavy buffer (buffer bloat)? It is no longer just a simple matter of buffer size.
You're basically describing the FQ-CoDel algorithm (RFC8290) here: it does per-flow queueing, and combines it with AQM on each queue + automatic no-knobs prioritisation of short (and thus often latency-sensitive) flows. It's really the gold standard, but unfortunately it hasn't made it into big-iron routers yet.
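As a rough illustration of how that prioritisation of short flows falls out of the scheduler, here is a simplified sketch of the new-flows/old-flows deficit round-robin described in RFC 8290. It is not a full FQ-CoDel: the CoDel AQM on each queue is left out, and flows are keyed directly by an ID instead of being hashed into a fixed set of queues. The class and method names are hypothetical.

```python
from collections import deque

QUANTUM = 1514  # bytes of credit added per round (the default quantum in RFC 8290)


class FlowQueue:
    def __init__(self):
        self.packets = deque()  # entries are (flow_id, size_bytes)
        self.deficit = 0        # remaining byte credit for this round
        self.active = False     # True while on the new_flows or old_flows list


class SparseFlowScheduler:
    """Per-flow queues served by deficit round-robin with new/old flow lists."""

    def __init__(self):
        self.queues = {}          # flow_id -> FlowQueue (no hashing, no collisions)
        self.new_flows = deque()  # flows that just became active: served first
        self.old_flows = deque()  # flows that have used up their first quantum

    def enqueue(self, flow_id, size_bytes):
        q = self.queues.setdefault(flow_id, FlowQueue())
        q.packets.append((flow_id, size_bytes))
        if not q.active:
            # A flow with no backlog enters as "new" and gets a fresh quantum.
            q.active = True
            q.deficit = QUANTUM
            self.new_flows.append(q)

    def dequeue(self):
        while self.new_flows or self.old_flows:
            from_new = bool(self.new_flows)
            flows = self.new_flows if from_new else self.old_flows
            q = flows[0]
            if q.deficit <= 0:
                # Credit used up: refill and go to the back of the old list.
                q.deficit += QUANTUM
                flows.popleft()
                self.old_flows.append(q)
                continue
            if not q.packets:
                # Empty queue: a new flow is demoted to old, an old flow goes idle.
                flows.popleft()
                if from_new:
                    self.old_flows.append(q)
                else:
                    q.active = False
                continue
            packet = q.packets.popleft()
            q.deficit -= packet[1]
            return packet
        return None  # nothing queued
```

A short usage example: with five 1500-byte packets from one bulk flow queued and three of them already sent, a single 100-byte packet from a newly active flow is served next, ahead of the remaining backlog.

```python
s = SparseFlowScheduler()
for _ in range(5):
    s.enqueue("elephant", 1500)
first = [s.dequeue()[0] for _ in range(3)]   # three elephant packets
s.enqueue("mouse", 100)
print(s.dequeue())                           # ('mouse', 100)
```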
What are others doing to deliver the best possible performance to customers with regards to buffering?
As you note yourself above, just tuning the buffer size is a very blunt tool which can't really be made to work for all scenarios. You really want some kind of queue management algorithm enabled, i.e., an AQM which will start dropping packets as the queue *starts* building up, possibly combined with flow queueing.
Looking at the data sheet for that Juniper box, unfortunately it doesn't seem to offer much in this space. It lists the RED AQM, which can be made to work, but requires specific tuning to the link speed (and will cripple the link if set wrong). Theoretically, Juniper should be able to implement PIE (RFC8033) as a firmware update leveraging their existing RED machinery; you could ask them for that?
As for flow queueing, the data sheet does mention Weighted Fair Queueing, and you mention you can have more than one queue per port. Configuring this with as many queues per port as you can, using flow-based hashing to divide up traffic between them, would not be unreasonable. It won't have the smart prioritisation of FQ-CoDel, but even a standard round-robin scheme can help separate out elephant flows from the rest of the traffic and alleviate the bad effects of bufferbloat.
-Toke