Shady areas of TCP window autotuning?
Hi all,

TCP window autotuning is part of several OSs today. However, the actual implementations behind this buzzword differ significantly and may impose negative side-effects on our networks, which I'd like to discuss here. There seem to be two basic approaches, which differ in their main principle:

#1: autotuning tries to set the RX window to a sensible value for a given RTT
#2: autotuning just ensures that the RX window is always bigger than the sender's congestion window, i.e. it never limits the flow

While both approaches achieve high throughput on high-RTT paths, their behaviour on low-RTT paths is very different, mainly because #2 suffers from a "spiraling death" syndrome: when RTT increases due to queueing at the bottleneck point, autotuning reacts by increasing the advertised window, which again increases RTT... So the net effect of #2 is that after a very short TCP connection lifetime it may advertise an extremely large RX window compared to the BDP of the path:

   RTT when idle    Max advertised window #1    Max advertised window #2
   ----------------------------------------------------------------------
    < 1 msec              66560 byte                  3 Mbyte
      3 msec              66560 byte                  3 Mbyte
     12 msec             243200 byte                  3 Mbyte
     24 msec             482048 byte                  3 Mbyte

(The above data were taken from the same host, connected to the network by 100 Mbps Ethernet, while running two OSs with different approaches and transferring 1 GByte of data.)

It's obvious that #2 has a significant impact on the network. Since it advertises a really huge window, it will fill up all buffers at the bottleneck point, may increase latency to >100 msec levels even in a LAN context, and prevents classic TCP implementations with a fixed window size from getting a fair share of the bandwidth.

This, however, doesn't seem to be of any concern to the TCP maintainers of #2, who claim that the receiver is not supposed to assist in congestion control in any way. Instead, they advise everyone to use advanced queue management, RED or other congestion-control mechanisms at the sender and at every network device to avoid this behaviour.

My personal opinion is that this looks more like passing the problem to someone else, not to mention the fact that absolutely trusting the sender to do everything right and removing all safety-belts at the receiver could be very dangerous.

What do people here think about this topic?

Thanks & kind regards,

   M.

--------------------------------------------------------------------------
   Marian Ďurkovič                           network manager
   Slovak Technical University               Tel: +421 2 571 041 81
   Computer Centre, Nám. Slobody 17          Fax: +421 2 524 94 351
   812 43 Bratislava, Slovak Republic        E-mail/sip: md@bts.sk
--------------------------------------------------------------------------
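For concreteness, a back-of-the-envelope comparison of those advertised windows against the path's bandwidth-delay product (a rough Python sketch; the 100 Mbps link speed comes from the measurement above, everything else is illustrative):

    # Bandwidth-delay product vs. the advertised windows from the table above.
    # Assumption: a 100 Mbps access link; RTT values match the "idle" column.

    def bdp_bytes(link_bps, rtt_s):
        """Bandwidth-delay product in bytes: the window actually needed
        to keep the pipe full at this RTT."""
        return link_bps * rtt_s / 8

    link = 100e6  # 100 Mbps Ethernet, as in the measurement above

    for rtt_ms in (1, 3, 12, 24):
        bdp = bdp_bytes(link, rtt_ms / 1000)
        print(f"RTT {rtt_ms:3d} ms: BDP = {bdp/1024:8.1f} KiB, "
              f"approach #2 advertises ~3 MiB "
              f"({3 * 1024 * 1024 / bdp:5.0f}x the BDP)")

At sub-millisecond RTTs the BDP is only a dozen kilobytes, so a 3 MByte advertised window is hundreds of times larger than the path can usefully hold.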
Briefly? They're correct -- the rx advertised window has nothing to do with congestion control and everything to do with flow control.

The problem you've described *is* a problem, but not because of its effects on congestion control -- the problem it causes is one we call a lack of agility: it takes longer for control signals to take effect if you're doing things like fast-forwarding a YouTube movie that's being delivered over TCP. If you want patches for Linux that properly decrease the window size, I can send them to you out-of-band.

But in general, TCP's proper behavior is to try to fill up the bottleneck buffer. This isn't a huge problem *in general*, but it can be fairly annoying on, e.g., cable modems with oversized buffers, which are fairly common. That's pretty fundamental to the way TCP is designed; otherwise, you WILL sacrifice throughput at other times.

-Dave
In a message written on Mon, Mar 16, 2009 at 10:15:37AM +0100, Marian Ďurkovič wrote:
This however doesn't seem to be of any concern for TCP maintainers of #2, who claim that receiver is not supposed to anyhow assist in congestion control. Instead, they advise everyone to use advanced queue management, RED or other congestion-control mechanisms at the sender and at every network device to avoid this behaviour.
I think the advice here is good, but it actually overlooks the larger problem: many edge devices have queues that are way too large.

What appears to happen is that vendors don't auto-size queues. Something like a cable or DSL modem may be designed for a maximum speed of 10 Mbps, and the vendor sizes the queue appropriately. The service provider then deploys the device at 2.5 Mbps, which means roughly (as it can be more complex) the queue should be 1/4th the size. However, the software doesn't auto-size the buffer to the link speed, and the operator doesn't adjust the buffer size in their config. The result is that if the vendor targeted 100 ms of buffer you now have 400 ms of buffer, and really bad lag.

As network operators we have to get out of the mind set that "packet drops are bad". While that may be true when planning the backbone to have sufficient bandwidth, it's the exact opposite of true when managing congestion at the edge. Reducing the buffer to ~50 ms of bandwidth makes the users a lot happier, and allows TCP to work. TCP needs drops to manage to the right speed.

My wish is for the vendors to step up. I would love to be able to configure my router/cable modem/DSL box with "queue-size 50ms" and have it compute, for the current link speed, 50 ms of buffer. Sure, I can do that by hand and turn it into "queue 20 packets", but that is very manual and must be done for every different link speed (at least, at slower speeds). Operators don't adjust because it is too much work.

If network operators could get the queue sizes fixed then it might be worth worrying about the behavior you describe; however, I suspect 90% of the problem you describe would also go away.

-- Leo Bicknell - bicknell@ufp.org - CCIE 3440 PGP keys at http://www.ufp.org/~bicknell/
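As a rough sketch of what such a "queue-size 50ms" knob would have to compute (illustrative Python; the 1500-byte MTU and the listed link speeds are assumptions, not anything a particular vendor ships):

    # Convert a delay target into a byte/packet budget for the actual
    # link speed, which is what "queue-size 50ms" would have to do.

    MTU = 1500  # bytes, typical full-size Ethernet payload (assumption)

    def queue_budget(link_bps, target_delay_s, mtu=MTU):
        """Return (bytes, packets) of buffering that adds at most
        target_delay_s of queueing delay when draining at link_bps."""
        q_bytes = link_bps * target_delay_s / 8
        return q_bytes, max(1, int(q_bytes // mtu))

    for mbps in (1, 2.5, 10, 100):
        b, p = queue_budget(mbps * 1e6, 0.050)
        print(f"{mbps:5} Mbps: 50 ms of buffer = {b/1024:7.1f} KiB "
              f"(~{p} full-size packets)")

At a few Mbps this works out to roughly 10-40 packets, which is in the same ballpark as the hand-tuned "queue 20 packets" mentioned above.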
On Mon, Mar 16, 2009 at 09:09:35AM -0500, Leo Bicknell wrote:
The result is that if the vendor targeted 100ms of buffer you now have 400ms of buffer, and really bad lag.
Well, this is one of the reasons why I hate the fact that we're effectively stuck in a 1500-byte MTU world. My customers are vastly concerned with the quantity of data they can transmit per unit of latency; you may be more familiar with this as "throughput". Customers beat us operators and engineers up over it every day. TCP window tuning does help with that if you can manage the side effects. A larger default layer 2 MTU (why we didn't change this when GE came out, I will never understand) would help even more by reducing the total number of frames necessary to move the data across a given wire.
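A quick illustration of that frame-count point (a sketch with assumed header sizes of 40 bytes for IPv4+TCP and 38 bytes of Ethernet framing per frame; no options or retransmissions):

    # Frames and header overhead needed to move 1 GByte over TCP at a
    # 1500-byte vs. a 9000-byte layer 2 MTU.

    DATA = 1 * 10**9          # bytes of application data
    IP_TCP_HDR = 40           # IPv4 + TCP, no options (assumption)
    ETH_OVERHEAD = 38         # preamble + header + FCS + inter-frame gap

    for mtu in (1500, 9000):
        payload = mtu - IP_TCP_HDR
        frames = -(-DATA // payload)          # ceiling division
        overhead = frames * (IP_TCP_HDR + ETH_OVERHEAD)
        print(f"MTU {mtu}: {frames:8d} frames, "
              f"{overhead / DATA * 100:4.1f}% header overhead on the wire")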
As network operators we have to get out of the mind set that "packet drops are bad"
Well, that's easier said than done, and arguably not realistic. I got started in this business when 1-3% packet loss was normal and expected. As the network has grown, the expectation of 0% loss in all cases has grown with it.

You have to remember that in the early days, the network itself was expected to guarantee data delivery (i.e. X.25). Then the network improved and that burden was cast onto the host devices. Well, technology has continued to improve to the point where you literally can expect 0% packet loss in relatively confined areas (say, Provider X in Los Angeles to user Y in San Jose). But as you go further afield, such as from LAX to Israel, expectations have to change. Today, that mindset is not always there.

As you allude to, this has also bred applications that are almost entirely intolerant of packet loss and extremely sensitive to jitter. (VOIP people, are you listening?) Real-time gaming is a great example. Back in the days when 99% of us were on modems, any loss or varying delay between the client and the user made the difference between an enjoyable session and nothing but frustration, and it was often hit and miss. A congested or dirty link in the middle of the path destroyed the user's experience.

This is further compounded by the increasingly international participation in some of these services, which means that 24x7 requirements make the customers and their users more and more sensitive to maintenance activities. (There can be areas where there is no "after hours" in which to do this stuff.) Add to this that, as media companies expand their use of the network, customers have forced providers to write performance-based metrics into their SLAs that, rather than simple uptime, now require often arbitrary guarantees of latency and data loss, and you've got a real problem for operations and engineering.

Techniques that can help improve network integrity are worth exploring. The difficulty is in proving these techniques under a wide array of circumstances, getting them properly adopted, and not having vendors or customers arbitrarily break them because of improper understanding, poor implementations, or bad configs (PMTUD, anyone?). Going forward, this sort of thing is going to be more and more important and harder and harder to get right.

I'm actually glad to see this particular thread appear and will be quite interested in what people have to say on the matter.

-Wayne

---
Wayne Bouchard
web@typo.org
Network Dude
http://www.typo.org/~web/
Hi, On 2009-3-16, at 7:09, Leo Bicknell wrote:
My wish is for the vendors to step up. I would love to be able to configure my router/cable modem/dsl box with "queue-size 50ms" and have it compute, for the current link speed, 50ms of buffer.
if the vendors got active and deployed better queueing schemes, that'd be great. In the meantime, we've also started some work that will allow bulk transfer applications to transmit in a way that is designed to minimize queue lengths: http://www.ietf.org/html.charters/ledbat-charter.html Lars
It was my understanding that (most) cable modems are L2 devices -- how is it that they have a buffer, other than what the network processor needs to switch it?

Frank
On Mon, Mar 16, 2009 at 10:48:42PM -0500, Frank Bulk - iName.com wrote:
It was my understanding that (most) cable modems are L2 devices -- how it is that they have a buffer, other than what the network processor needs to switch it?
The Ethernet is typically faster than the upstream cable channel. So it needs some place to put the data that arrives from the Ethernet port until it gets sent upstream. This has nothing to do with layer 2 / layer 3. Any device connecting between media of different speeds (or connecting more than two ports -- creating the possibility of contention) would need some amount of buffering. -- Brett
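A small illustration of that speed mismatch (the 100 Mbps LAN side and 2.5 Mbps upstream are assumed figures matching the earlier examples in this thread):

    # Data arrives on a fast Ethernet port but leaves on a slow upstream
    # channel; the difference has to sit in the modem's buffer.

    in_bps  = 100e6   # LAN side (assumption)
    out_bps = 2.5e6   # upstream cable/DSL channel (assumption)
    backlog_growth = (in_bps - out_bps) / 8   # bytes of queue per second

    for buf_kb in (64, 256, 1024):
        t = buf_kb * 1024 / backlog_growth
        delay = buf_kb * 1024 * 8 / out_bps   # queueing delay once it is full
        print(f"{buf_kb:5d} KB buffer: fills in {t*1000:6.1f} ms, "
              f"then adds {delay*1000:6.0f} ms of queueing delay")

Even a modest buffer fills within tens of milliseconds, and once full it adds hundreds of milliseconds (or more) of standing delay at the upstream rate.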
On Mon, 16 Mar 2009, Leo Bicknell wrote:
What appears to happen is vendors don't auto-size queues. Something
In my mind, the problem is that they tend to use FIFO, not that the queues are too large. This is most likely due to the enormous price competition in the market, where you might lose a DSL CPE deal because you charged $1 per unit more than the competition.

What we need is ~100 ms of buffer and fair-queue or equivalent, at both ends of the end-user link (unless it's 100 meg or more, where 5 ms buffers and FIFO tail-drop seem to work just fine), because a 1 meg uplink (ADSL) with a 200 ms buffer is just bad for the customer experience. And if they can't figure out how to do fair-queue properly, they might as well just do WRED with a 30 ms minimum and a 50 ms maximum threshold (100% drop probability at 50 ms), or even tail-drop at 50 ms. It's very rare today that an end user is helped by anything buffering their packets for more than 50 ms.

I've done some testing with fairly fast links with big buffers (T3/OC3 and real routers), and with FIFO and tuned TCP windows (single session) it's easy to get 100 ms of buffering, which is just pointless. So: either smaller buffers and FIFO, or large buffers and some kind of intelligent queue handling.

-- Mikael Abrahamsson email: swmike@swm.pp.se
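For reference, here is what those millisecond thresholds would work out to in bytes at a few access speeds (a sketch; real gear is configured in packets or bytes, and the chosen speeds are just examples):

    # Translate "start dropping at 30 ms, drop everything at 50 ms" into
    # queue-depth thresholds in bytes for a given link speed.

    def ms_to_bytes(link_bps, ms):
        return int(link_bps * ms / 1000 / 8)

    for mbps in (1, 8, 24, 100):
        link = mbps * 1e6
        lo, hi = ms_to_bytes(link, 30), ms_to_bytes(link, 50)
        print(f"{mbps:4d} Mbps uplink: start dropping above ~{lo:7d} bytes, "
              f"drop everything above ~{hi:7d} bytes of queue")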
On Mon, Mar 16, 2009 at 09:09:35AM -0500, Leo Bicknell wrote:
Many edge devices have queues that are way too large.
What appears to happen is vendors don't auto-size queues. Something like a cable or DSL modem may be designed for a maximum speed of 10Mbps, and the vendor sizes the queue appropriately. The service provider then deploys the device at 2.5Mbps, which means roughly (as it can be more complex) the queue should be 1/4th the size. However the software doesn't auto-size the buffer to the link speed, and the operator doesn't adjust the buffer size in their config.
The result is that if the vendor targeted 100ms of buffer you now have 400ms of buffer, and really bad lag.
This is a very good point. Let me add that the same thing happens on every autosensing 10/100/1000Base-T Ethernet port, which typically does not auto-reduce its buffers when the negotiated speed is less than 1 Gbps.
As network operators we have to get out of the mind set that "packet drops are bad". While that may be true in planning the backbone to have sufficient bandwidth, it's the exact opposite of true when managing congestion at the edge. Reducing the buffer to be ~50ms of bandwidth makes the users a lot happier, and allows TCP to work. TCP needs drops to manage to the right speed.
My wish is for the vendors to step up. I would love to be able to configure my router/cable modem/dsl box with "queue-size 50ms" and have it compute, for the current link speed, 50ms of buffer.
Reducing buffers to 50 msec clearly avoids excessive queueing delays, but let's look at this from a wider perspective:

1) Initially we had a system where hosts were using fixed 64 kB buffers. This was unable to achieve good performance over high-BDP paths.

2) OS maintainers fixed this by means of buffer autotuning, so the host buffer size is no longer the problem.

3) The above fix introduces unacceptable delays into networks and users are complaining, especially if autotuning approach #2 is used.

4) Network operators will fix the problem by reducing buffers to e.g. 50 msec.

So at the end of the day, we'll again have a system which is unable to achieve good performance over high-BDP paths, since with reduced buffers we'll have an underbuffered bottleneck in the path which will prevent full link utilization if RTT > 50 msec. Thus all the above exercises will end up in almost the same situation as before (of course YMMV). Something is seriously wrong, isn't it?

And yes, I opened this topic last week on the Linux netdev mailing list and tried hard to persuade those people that some less aggressive approach is probably necessary to achieve a good balance between the requirements of fastest possible throughput and fairness in the network. But the maintainers simply didn't want to listen :-(

   M.
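To put rough numbers on the underbuffering concern, here is a sketch of the usual rule of thumb: a single long-lived Reno flow keeps the bottleneck busy across a window halving only if the bottleneck buffer is at least about one bandwidth-delay product. The 100 Mbps bottleneck and the RTT values below are assumptions, not measurements:

    # Compare a fixed 50 ms bottleneck buffer against the BDP for a few
    # path RTTs, for a lone TCP Reno flow.

    link_bps  = 100e6    # 100 Mbps bottleneck (assumption)
    buffer_ms = 50       # operator-configured buffer target

    buf_bytes = link_bps * buffer_ms / 1000 / 8

    for rtt_ms in (10, 50, 100, 200):
        bdp = link_bps * rtt_ms / 1000 / 8
        verdict = "OK" if buf_bytes >= bdp else "underbuffered for a single Reno flow"
        print(f"RTT {rtt_ms:4d} ms: BDP = {bdp/1024:8.1f} KiB, "
              f"buffer = {buf_bytes/1024:7.1f} KiB -> {verdict}")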
In a message written on Tue, Mar 17, 2009 at 08:46:50AM +0100, Mikael Abrahamsson wrote:
In my mind, the problem is that they tend to use FIFO, not that the queues are too large.
We could quickly get lost in queuing science, but at a high level you are most correct that both are a problem.
What we need is ~100ms of buffer and fair-queue or equivalent, at both ends of the end-user link (unless it's 100 meg or more, where 5ms buffers and FIFO tail-drop seems to work just fine), because 1 meg uplink (ADSL) and 200ms buffer is just bad for the customer experience, and if they can't figure out how to do fair-queue properly, they might as well just to WRED 30 ms 50 ms (100% drop probability at 50ms) or even taildrop at 50ms. It's very rare today that an end user is helped by anything buffering their packet more than 50ms.
Some of this technology exists, just not where it can do a lot of good. Some fancier CPE devices know how to queue VOIP in a priority queue and elevate some games. This works great when the cable modem or DSL modem is integrated, but when you buy a "router" and hook it to your provider-supplied DSL or cable modem it's doing no good. I hate to suggest such a thing, but perhaps a protocol for a modem to communicate a committed rate to a router would be a good thing...

I'd also like to point out that where this technology exists today it's almost never used. How many 2600's and 3600's have you seen terminating T1's or DS-3's that don't have anything changed from the default FIFO queue? I am particularly fond of the DS-3 frame circuits with 100 PVC's, each with 40 packets of buffer. 4000 packets of buffer on a DS-3. No wonder performance is horrid.

In a message written on Tue, Mar 17, 2009 at 09:47:39AM +0100, Marian Ďurkovič wrote:
Reducing buffers to 50 msec clearly avoids excessive queueing delays, but let's look at this from the wider perspective:
1) initially we had a system where hosts were using fixed 64 kB buffers This was unable to achieve good performance over high BDP paths
Note that the host buffer, which generally should be 2 * Bandwidth * Delay, is, well, basically unrelated to the hop-by-hop network buffers.
2) OS maintainers have fixed this by means of buffer autotuning, where the host buffer size is no longer the problem.
3) the above fix introduces unacceptable delays into networks and users are complaining, especially if autotuning approach #2 is used
4) network operators will fix the problem by reducing buffers to e.g. 50 msec
So at the end of the day, we'll again have a system which is unable to achieve good performance over high BDP paths, since with reduced buffers we'll have an underbuffered bottleneck in the path which will prevent full link utilization if RTT>50 msec. Thus all the above exercises will end up in having almost the same situation as before (of course YMMV).
This is an incorrect conclusion. The host buffer has to wait an RTT for an ack to return, so it has to buffer a full RTT of data and then some. Hop-by-hop buffers only have to buffer until an output port on the same device is free. This is why a router with 20 10GE interfaces can have a 75-packet-deep queue on each interface and work fine: the packet only sits there until a 10GE output interface is available (a few microseconds).

The problems are related: as TCP goes faster there is an increased probability it will fill the buffer at any particular hop, but that means a link is full and TCP is hitting the maximum speed for that path anyway. Reducing the buffer size (to a point) /does not slow/ TCP; it reduces the feedback loop time. It provides less jitter to the user, which is good for VoIP and ssh and the like.

However, if the hop-by-hop buffers are filling and there is lag and jitter, that's a sign the hop-by-hop buffers were always too large. 99.99% of devices ship with buffers that are too large.

-- Leo Bicknell - bicknell@ufp.org - CCIE 3440 PGP keys at http://www.ufp.org/~bicknell/
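Putting rough numbers on that contrast (a sketch using the 75-packet 10GE queue mentioned above, plus an assumed 1500-byte packet size, 100 Mbps end-to-end rate and 100 ms RTT for the host-buffer side):

    # Hop-by-hop buffer vs. host socket buffer, using the figures above.

    pkt   = 1500     # bytes per packet (assumption)
    tenge = 10e9     # 10 Gbps output interface

    hop_queue_bytes = 75 * pkt
    hop_drain_us = hop_queue_bytes * 8 / tenge * 1e6
    print(f"Hop queue: {hop_queue_bytes} bytes, drains in ~{hop_drain_us:.0f} us")

    # Host buffer rule of thumb quoted above: 2 * Bandwidth * Delay
    rate = 100e6     # assumed end-to-end rate, 100 Mbps
    rtt  = 0.100     # assumed 100 ms path RTT
    host_buf = 2 * rate * rtt / 8
    print(f"Host buffer (2*BW*RTT): ~{host_buf/1024/1024:.1f} MB")

The hop queue drains in the order of 100 microseconds, while the host buffer for the same path is measured in megabytes; the two really are sized against different things.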
On Tue, 17 Mar 2009 10:39:13 -0500, Leo Bicknell wrote:
So at the end of the day, we'll again have a system which is unable to achieve good performance over high BDP paths, since with reduced buffers we'll have an underbuffered bottleneck in the path which will prevent full link utilization if RTT>50 msec. Thus all the above exercises will end up in having almost the same situation as before (of course YMMV).
This is an incorrect conclusion. The host buffer has to wait for an RTT for an ack to return, so it has to buffer a full RTT of data and then some. Hop by hop buffers only have to buffer until an output port on the same device is free.
[snip]
However, if the hop-by-hop buffers are filling and there is lag and jitter, that's a sign the hop-by-hop buffers were always too large. 99.99% of devices ship with buffers that are too large.
Vendors size the buffers according to principles outlined e.g. here: http://tiny-tera.stanford.edu/~nickm/papers/sigcomm2004.pdf

It's fine to have smaller buffers in the high-speed core, but at the edge you still need to buffer for a full RTT if you want to fully utilize the link with TCP Reno. Thus my conclusion holds - if we reduce buffers at the bottleneck point to 50 msec, flows with RTT > 50 msec will suffer from reduced throughput.

Anyway, we probably have no other choice in situations where the only available queueing is FIFO. And if this gets implemented on a larger scale, it could even have a positive side-effect - it might finally motivate OS maintainers to seriously consider deploying some delay-sensitive variant of TCP, since Reno will no longer give them the best results.

   M.
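The cited paper's small-buffer rule (roughly RTT * C / sqrt(N) for a link carrying N long-lived flows) versus the full-BDP rule for a single-flow edge link, with purely illustrative numbers:

    # Core links carrying many flows can get away with far less buffer than
    # an edge link carrying one flow, per the sizing rule in the paper above.

    from math import sqrt

    rtt = 0.100   # assumed 100 ms average RTT

    cases = [
        ("10G core link, 10000 flows", 10e9, 10000),
        ("1G aggregation, 100 flows",  1e9,  100),
        ("10M edge link, 1 flow",      10e6, 1),
    ]

    for name, cap, n in cases:
        full_bdp = cap * rtt / 8
        small    = full_bdp / sqrt(n)
        print(f"{name:30s}: full BDP {full_bdp/1e6:8.2f} MB, "
              f"BDP/sqrt(N) {small/1e6:8.3f} MB")

For the edge link with a single flow the two rules coincide, which is exactly why the 50 ms shortcut hurts long-RTT flows there.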
In a message written on Wed, Mar 18, 2009 at 09:04:42AM +0100, Marian Ďurkovič wrote:
It's fine to have smaller buffers in the high-speed core, but at the edge you still need to buffer for full RTT if you want to fully utilize the link with TCP Reno. Thus my conclusion holds - if we reduce buffers at the bottleneck point to 50 msec, flows with RTT>50 msec would suffer from reduced throughput.
Ah, I understand your point now. There is a balance to be struck at the edge: the tuning to support a single TCP stream at full bandwidth and the tuning to reduce latency and jitter are on some level incompatible, so one must strike a balance between the two. Of course...
Anyway we probably have no other chance in situations when the only available queueing is FIFO. And if this gets implemented on larger scale, it could even have a positive side-effect - it might finally motivate OS maintainers to seriously consider deploying some delay-sensitive variant of TCP since Reno will no longer give them the best results.
Many of the problems can be mitigated with well-known queueing strategies. WRED, priority queues, and other options have been around long enough that I refuse to believe they add significant cost. Rather, I think the problem is one of awareness: far too few network engineers seem to really understand what effect the queueing options have on traffic.

-- Leo Bicknell - bicknell@ufp.org - CCIE 3440 PGP keys at http://www.ufp.org/~bicknell/
Leo Bicknell wrote:
As network operators we have to get out of the mind set that "packet drops are bad".
They are bad.
TCP needs drops to manage to the right speed.
This is what's bad. TCP should be slightly more intelligent and start considering RTT jitter as its primary source of congestion information. Designing L2 network performance to optimize an L3 protocol is backwards.
Or use a transmission-layer protocol that optimizes delay end-to-end. http://tools.ietf.org/html/draft-shalunov-ledbat-congestion-00 On 2009Mar17, at 12:47 PM, Joe Maimon wrote:
Leo Bicknell wrote:
TCP needs drops to manage to the right speed.
This is whats bad. TCP should be slightly more intelligent and start considering rtt jitter as its primary source of congestion information.
Designing L2 network performance to optimize a l3 protocol is backwards.
On Tue, 17 Mar 2009, Joe Maimon wrote:
TCP needs drops to manage to the right speed.
This is whats bad. TCP should be slightly more intelligent and start considering rtt jitter as its primary source of congestion information.
TCP Vegas did this but sadly it never became popular. (It doesn't compete well with Reno.) Tony. -- f.anthony.n.finch <dot@dotat.at> http://dotat.at/ GERMAN BIGHT HUMBER: SOUTHWEST 5 TO 7. MODERATE OR ROUGH. SQUALLY SHOWERS. MODERATE OR GOOD.
On 2009-3-17, at 12:10, Tony Finch wrote:
On Tue, 17 Mar 2009, Joe Maimon wrote:
TCP needs drops to manage to the right speed.
This is whats bad. TCP should be slightly more intelligent and start considering rtt jitter as its primary source of congestion information.
TCP Vegas did this but sadly it never became popular. (It doesn't compete well with Reno.)
FWIW, Compound TCP does this (shipping with Vista, but disabled by default.) There are other delay-based or delay-sensitive TCP flavors, too. Lars
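For readers unfamiliar with the delay-based idea being discussed, here is a toy sketch in the spirit of TCP Vegas (not the actual Vegas or Compound TCP algorithm; the alpha/beta values and the example RTTs are illustrative):

    # Compare expected vs. actual throughput and back off before loss
    # occurs, instead of waiting for a drop.

    def vegas_like_update(cwnd, base_rtt, current_rtt, alpha=2, beta=4):
        """Return an adjusted congestion window (in packets).

        diff estimates how many packets this flow itself is keeping
        queued at the bottleneck; keep it between alpha and beta."""
        expected = cwnd / base_rtt                  # pkts/s with no queueing
        actual   = cwnd / current_rtt               # pkts/s actually achieved
        diff     = (expected - actual) * base_rtt   # ~packets sitting in queue
        if diff < alpha:
            return cwnd + 1                 # queue is short: probe for more
        if diff > beta:
            return max(2, cwnd - 1)         # queue is building: back off
        return cwnd                         # in the sweet spot: hold steady

    # Example: 10 ms base RTT, RTT inflated to 15 ms by queueing -> back off
    print(vegas_like_update(cwnd=100, base_rtt=0.010, current_rtt=0.015))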
* Marian Ďurkovič:
TCP window autotuning is part of several OSs today. However, the actual implementations behind this buzzword differ significantly and might impose negative side-effects to our networks - which I'd like to discuss here. There seem to be two basic approaches which differ in the main principle:
This has been discussed previously on the netdev list: <http://thread.gmane.org/gmane.linux.network/121674> You may want to review the discussion over there before replying on NANOG.
participants (12)
- Brett Frankenberger
- David Andersen
- Florian Weimer
- Frank Bulk - iName.com
- Joe Maimon
- John Schnizlein
- Lars Eggert
- Leo Bicknell
- Marian Ďurkovič
- Mikael Abrahamsson
- Tony Finch
- Wayne E. Bouchard