
I made a private reply to Curtis on his posting earlier this week, and he gave a nice analysis and cc'd end2end-interest rather than nanog. For those that don't care to read all this, here's the summary:
Which would you prefer? 140 msec and 0% loss or 70 msec and 5% loss?
So we get to choose between large delay or large lossage. Doesn't sound wonderful... I thought you folks in nanog might be interested, so with Curtis' permission, here's the full exchange (the original posting by Curtis is at the very end).
-- Jim
Here's what I wrote:
In message <199511272220.OAA01151@stilton.cisco.com>, Jim Forster writes:
Curtis,
I think these days for lots of folks the interesting question is not what happens when a single or a few high-rate TCPs get in equilibrium, but rather what happens when a DS-3 or higher is filled with 56k or slower flows, each of which only lasts for an average of 20 packets or so. Unfortunately, these 20-packet TCP flows are what's driving the stats these days, due I guess to the silly WWW (TCP per file; file per graphic; many graphics per page) that's been so successful.
And Curtis's reply:
The analysis below also applies to just under 800 TCP flows each getting 1/800th of a DS3 link, or about 56 Kb/s. The loss rate on the link should be about one packet in 11 if the delay can be increased to 250 msec. If the delay is held at 70 msec, lots of timeouts, terrible fairness, and poor overall performance will result.
Do we need an ISP to prove this to you by exhibiting terrible performance? If so, please speak to Jon Crowcroft. His case is 400 flows on 4 Mb/s, which is far worse, since the delay would have to be increased to over 3 seconds or the segment size reduced below 552 bytes. :-(
I could try to derive the results but I'm sure you or others would do better :-). How many of the packets in the 20 packet flow are at equilibrium? What's the drop rate? Hmmm, very simple-minded analysis says that it will be large: exponential growth (doubling cwnd every ACK) should get above the best case pretty quickly, certainly within the 20 packet flow. Assume it's only above optimum once; then the packet loss rate is 1 in 20. Sounds grim. Vegas TCP sounds better for these reasons, since it tracks actual bandwidth, but I'm not really qualified to judge.
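A quick Python sketch of this back-of-the-envelope estimate (assuming cwnd doubles once per RTT and each flow overshoots and loses roughly one packet):

    # Slow start on a 20-packet transfer: cwnd doubles every RTT,
    # so the whole flow finishes in about 5 RTTs if nothing is lost.
    packets_left, cwnd, rtts = 20, 1, 0
    while packets_left > 0:
        rtts += 1
        sent = min(cwnd, packets_left)
        print("RTT %d: %d packets (cwnd=%d)" % (rtts, sent, cwnd))
        packets_left -= sent
        cwnd *= 2
    # If each flow loses about one packet out of its 20,
    # the loss rate is roughly 1 in 20, i.e. 5%.
    print("approx loss rate: %.0f%%" % (100.0 / 20))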
-- Jim
Jim,
The end2end-interest thread was quite long and I didn't want to repeat the whole thing. The initial topic was very tiny TCP flows of 3 to 4 packets. That is a really bad problem, but should no longer be a realistic problem once HTTP is modified to allow it to pick up both the HTML page and all inline images in one TCP connection.
Your example is quite reasonable. At 20 packets per flow, with no loss you get 1, 2, 4, 8, 5 packets per RTT, or a complete transfer in about 5 RTTs. On average each TCP flow will get 20 packets / 5 RTTs of bandwidth until congestion, or 4 packets/RTT (with 552-byte segments and a 70 msec RTT, one packet per RTT is about 63 Kb/s, so roughly 250 Kb/s per flow). If the connection is temporarily overloaded by a factor of 2, this must be reduced to 2 packets/RTT. If we drop 1 packet in 20, roughly 35% of the flows go completely untouched (0.95^20). Some 15% will drop one of the first 3 packets, time out, and slow start, resulting in less than 20 packets / 3 seconds (3 seconds >> 5*RTT). The remaining 50% or so will drop one of the 4th through 20th packets, resulting in fast retransmit, no timeout, and linear growth of the window. If the 4th is dropped, the window is cut to 2, so over the next few RTTs you get 2, 3, 4, 5, 3 packets, or 8 RTTs total (2 initial, 1 for the drop, 5 more). This is probably not quite enough to slow things down.
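A small Python sketch of the percentages above, assuming each packet is dropped independently with probability 1/20:

    p = 1.0 / 20                                    # drop probability per packet
    untouched = (1 - p) ** 20                       # no drop in the whole flow
    timeout   = 1 - (1 - p) ** 3                    # a drop in the first 3 packets:
                                                    #   too few dup ACKs, so RTO + slow start
    fast_rtx  = (1 - p) ** 3 * (1 - (1 - p) ** 17)  # first 3 survive, a later packet lost:
                                                    #   fast retransmit, window cut, linear growth
    print("untouched:       %4.0f%%" % (100 * untouched))   # ~36%
    print("timeout/restart: %4.0f%%" % (100 * timeout))     # ~14%
    print("fast retransmit: %4.0f%%" % (100 * fast_rtx))    # ~50%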
On a DS3 with 70 msec RTT and 1500 simultaneous flows of 20 packets each (steady state such that the number of active flows remains about 1500, roughly twice what a DS3 could support) you would need a drop rate on the order of 5% or more. Alternatively, you could queue things up, doubling the delay to 140 msec, and give every flow the same slower rate (perfect fairness in your example) with a zero drop rate.
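The DS3 arithmetic, roughly (taking a DS3 as about 45 Mb/s; the numbers are approximate):

    ds3 = 44.736e6                      # DS3 rate, b/s
    print(round(ds3 / 56e3))            # ~800 flows fit at ~56 Kb/s each
    print(round(ds3 / 1500 / 1e3))      # with 1500 flows the fair share is only
                                        # ~30 Kb/s per flow, so either drop ~5% of
                                        # packets to push flows down, or let the queue
                                        # grow, doubling the 70 msec RTT to 140 msec
                                        # and halving each flow's rate with no loss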
Which would you prefer? 140 msec and 0% loss, or 70 msec and 5% loss? Delay is good. We want delay for elastic traffic! But not for real-time traffic - use RSVP, admission control, police at the ingress, and stick it at the front of the queue.
In practice, I'd expect overload to be due to lots of flows, but not so many little guys that they alone overload the link (if that happens, get a bigger pipe; we can say that and put it into practice). The overload will be due to a high baseline of little guys (20 packet flows, or a range of fairly small ones), plus some percentage of longer duration flows capable of sucking up the better part of a T1 given half a chance. It is the latter that you want to slow down, and these are the ones that you *can* slow down with a fairly low drop rate.
I leave it as an exercise to the reader to determine how RED fits into this picture (either one: my overload scenario, or Jim's, where all the flows are 20 packets long).
The 400 flows on 4 Mb/s is an interesting (and difficult) case. I've suggested both allowing delay to get very large (i.e., as high as 2 seconds) and hacking the host implementation to reduce the segment size to as low as 128 bytes when the RTT gets huge or cwnd drops below 4 segments, holding the window to no less than 512 bytes (4 segments), in hopes that fast retransmit will almost always work even in 15-20% loss situations.
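The arithmetic behind those two knobs, as a rough Python sketch (taking 4 segments as the minimum workable window, per the fast retransmit point in the original posting below):

    link, flows = 4e6, 400
    share = link / flows                      # ~10 Kb/s per flow

    def rtt_for_min_window(rate_bps, seg_bytes, min_segs=4):
        # RTT at which a window of min_segs segments matches rate_bps
        return min_segs * seg_bytes * 8 / rate_bps

    print(round(share / 1e3))                          # 10 (Kb/s per flow)
    print(round(rtt_for_min_window(share, 552), 2))    # ~1.8 s with 552-byte segments
    print(round(rtt_for_min_window(share, 128), 2))    # ~0.4 s with 128-byte segments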
Curtis
Curtis's original posting:
In order to get X bandwidth on a given TCP flow you need to have an average window size of X * RTT. This is expressed in terms of TCP segments N = (X * RTT) / MSS (or more correctly the segment size in use rather than MSS). To sustain an average window of N segments, you must ideally reach a steady state where you cut cwnd (current window) in half, then grow linearly, fluctuating between 2/3 and 4/3 of the target size. This would mean one drop in 2/3 N windows or DropRate in terms of time is 2/3 N * RTT. In one RTT on average X * RTT amount of data flows. In practice, you rarely drop at the perfect time, so the constant 2/3 (call it K) can be raised to 1-2. Since N = (X * RTT) / MSS, DropRate = K * X * RTT * X * RTT / MSS. Units are b/s * sec * b/s * sec / b, or b. The DropRate expressed in bits can be converted to seconds or packets (divide by X or by MSS). This type of analysis is courtesy of the good folks at PSC (Matt, Jamshid, et al).
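The same relationship in a few lines of Python (the names are mine, not Curtis's):

    def drop_interval_bits(x_bps, rtt_s, mss_bytes, k=1.0):
        # Bits transferred between drops for a flow sustaining x_bps:
        # DropRate = K * (X * RTT)^2 / MSS, everything in bits.
        return k * (x_bps * rtt_s) ** 2 / (mss_bytes * 8)

    def drop_interval(x_bps, rtt_s, mss_bytes, k=1.0):
        bits = drop_interval_bits(x_bps, rtt_s, mss_bytes, k)
        return bits / x_bps, bits / (mss_bytes * 8)   # (seconds, packets) between drops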
For example, to get 40 Mb/s at 70 msec RTT and a 4096-byte MSS, you get one error about every 6 seconds (K=1), or 1 in 7,300 packets. If you look at 56 Kb/s and a 512-byte MSS you get a very interesting result. You need one error every 66 msec, or 1 error in 0.9 packets. This gives a good incentive to increase delay. At 250 msec, you get a result of one error in 11.7 packets (much better!).
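Plugging the three cases into the sketch above reproduces these figures:

    print(drop_interval(40e6, 0.070, 4096))   # ~(6.0 s, 7300 packets) between drops
    print(drop_interval(56e3, 0.070, 512))    # ~(0.07 s, 0.9 packets) -- hopeless
    print(drop_interval(56e3, 0.250, 512))    # ~(0.85 s, 11.7 packets) -- much better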
Another interesting point to note is that you need 3 duplicate ACKs for TCP fast retransmit to work, so your window must be at least 4 segments (and should be more). If you have a very large number of TCP flows, where on average people get less than 1200 baud or so, the delay you need to make TCP work well starts to exceed the magic 3 second boundary. This was discussed ad nauseam on end2end-interest. An important result is that you need more queueing than the delay-bandwidth product for severely congested links. Another is that there is a limit to the number of active TCP flows that can be supported per unit of bandwidth. One suggestion to address the latter problem is to further drop the segment size if cwnd is less than 4 segments and/or when the estimated RTT gets into the seconds range.
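A rough sketch of that limit on active flows per bandwidth, assuming each flow needs at least 4 segments in flight and the RTT is allowed to grow to about 3 seconds (an optimistic ceiling; real workable numbers are lower):

    def max_flows(link_bps, max_rtt_s=3.0, seg_bytes=552, min_segs=4):
        # Each flow needs min_segs segments in flight; the pipe plus queue
        # holds at most link_bps * max_rtt_s bits.
        return int(link_bps * max_rtt_s / (min_segs * seg_bytes * 8))

    print(max_flows(44.736e6))              # DS3, 552-byte segments: ~7,600
    print(max_flows(4e6))                   # 4 Mb/s, 552-byte segments: ~680
    print(max_flows(4e6, seg_bytes=128))    # 4 Mb/s, 128-byte segments: ~2,900
    print(4 * 128 * 8 / 1200.0)             # at ~1200 b/s per flow even 128-byte
                                            # segments need an RTT of ~3.4 seconds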
This sort of analysis of how much loss TCP can tolerate would not be out of place in an informational RFC, but so far none exists.
Curtis