
I made a private reply to Curtis on his posting earlier this week, and he gave a nice analysis and cc'd end2end-interest rather than nanog. For those that don't care to read all this, here's the summary:
Which would you prefer? 140 msec and 0% loss or 70 msec and 5% loss?
So we get to choose between large delay or large lossage. Doesn't sound wonderful... I thought you folks in nanog might be interested, so with Curtis' permission, here's the full exchange (the original posting by Curtis is at the very end).
-- Jim
Here's what I wrote:
In message <199511272220.OAA01151@stilton.cisco.com>, Jim Forster writes:
Curtis,
I think these days for lots of folks the interesting question is not what happens when a single or a few high-rate TCPs get in equilibrium, but rather what happens when a DS-3 or higher is filled with 56k or slower flows, each of which only lasts for an average of 20 packets or so. Unfortunately, these 20-packet TCP flows are what's driving the stats these days, due I guess to the silly WWW (TCP per file; file per graphic; many graphics per page) that's been so successful.
And Curtis's reply:
The analysis below also applies to just under 800 TCP flows each getting 1/800th of a DS3 link, or about 56 Kb/s. The loss rate on the link should be about one packet in 11 if the delay can be increased to 250 msec. If the delay is held at 70 msec, lots of timeouts, terrible fairness, and poor overall performance will result.
Do we need an ISP to prove this to you by exhibiting terrible performance? If so, please speak to Jon Crowcroft. His case is 400 flows on 4 Mb/s, which is far worse, since the delay would have to be increased to over 3 seconds or the segment size reduced below 552 bytes. :-(
I could try to derive the results but I'm sure you or others would do better :-). How many of the packets in the 20 packet flow are at equilibrium? What's the drop rate? Hmmm, very simple-minded analysis says that it will be large: exponential growth (doubling cwnd every ACK) should get above the best case pretty quickly, certainly within the 20 packet flow. Assume it's only above optimum once; then the packet loss rate is 1 in 20. Sounds grim. Vegas TCP sounds better for these reasons, since it tracks actual bandwidth, but I'm not really qualified to judge.
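A quick Python sketch of this back-of-the-envelope estimate (assuming cwnd doubles once per RTT and each flow overshoots and loses roughly one packet):

    # Slow start on a 20-packet transfer: cwnd doubles every RTT,
    # so the whole flow finishes in about 5 RTTs if nothing is lost.
    packets_left, cwnd, rtts = 20, 1, 0
    while packets_left > 0:
        rtts += 1
        sent = min(cwnd, packets_left)
        print("RTT %d: %d packets (cwnd=%d)" % (rtts, sent, cwnd))
        packets_left -= sent
        cwnd *= 2
    # If each flow loses about one packet out of its 20,
    # the loss rate is roughly 1 in 20, i.e. 5%.
    print("approx loss rate: %.0f%%" % (100.0 / 20))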
-- Jim
Jim,
The end2end-interest thread was quite long and I didn't want to repeat the whole thing. The initial topic was very tiny TCP flows of 3 to 4 packets. That is a really bad problem, but should no longer be a realistic problem once HTTP is modified to allow it to pick up both the HTML page and all inline images in one TCP connection.
Your example is quite reasonable. At 20 packets per flow, with no loss you get 1, 2, 4, 8, 5 packets per RTT, or a complete transfer in about 5 RTTs. On average each TCP flow will get 20 packets / 5 RTTs of bandwidth until congestion, or 4 packets/RTT (with 552-byte segments and a 70 msec RTT, one packet per RTT is about 63 Kb/s, so roughly 250 Kb/s per flow). If the connection is temporarily overloaded by a factor of 2, this must be reduced to 2 packets/RTT. If we drop 1 packet in 20, roughly 35% of the flows go completely untouched (0.95^20). Some 15% will drop one of the first 3 packets, time out, and slow start, resulting in less than 20 packets / 3 seconds (3 seconds >> 5*RTT). The remaining 50% or so will drop one of the 4th through 20th packets, resulting in fast retransmit, no timeout, and linear growth of the window. If the 4th is dropped, the window is cut to 2, so over the next few RTTs you get 2, 3, 4, 5, 3 packets, or 8 RTTs total (2 initial, 1 for the drop, 5 more). This is probably not quite enough to slow things down.
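A small Python sketch of the percentages above, assuming each packet is dropped independently with probability 1/20:

    p = 1.0 / 20                                    # drop probability per packet
    untouched = (1 - p) ** 20                       # no drop in the whole flow
    timeout   = 1 - (1 - p) ** 3                    # a drop in the first 3 packets:
                                                    #   too few dup ACKs, so RTO + slow start
    fast_rtx  = (1 - p) ** 3 * (1 - (1 - p) ** 17)  # first 3 survive, a later packet lost:
                                                    #   fast retransmit, window cut, linear growth
    print("untouched:       %4.0f%%" % (100 * untouched))   # ~36%
    print("timeout/restart: %4.0f%%" % (100 * timeout))     # ~14%
    print("fast retransmit: %4.0f%%" % (100 * fast_rtx))    # ~50%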
On a DS3 with 70 msec RTT and 1500 simultaneous flows of 20 packets each (steady state such that the number of active flows remains about 1500, roughly twice what a DS3 could support) you would need a drop rate on the order of 5% or more. Alternatively, you could queue things up, doubling the delay to 140 msec, and give every flow the same slower rate (perfect fairness in your example) with a zero drop rate.
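The DS3 arithmetic, roughly (taking a DS3 as about 45 Mb/s; the numbers are approximate):

    ds3 = 44.736e6                      # DS3 rate, b/s
    print(round(ds3 / 56e3))            # ~800 flows fit at ~56 Kb/s each
    print(round(ds3 / 1500 / 1e3))      # with 1500 flows the fair share is only
                                        # ~30 Kb/s per flow, so either drop ~5% of
                                        # packets to push flows down, or let the queue
                                        # grow, doubling the 70 msec RTT to 140 msec
                                        # and halving each flow's rate with no loss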
Which would you prefer? 140 msec and 0% loss, or 70 msec and 5% loss? Delay is good. We want delay for elastic traffic! But not for real-time traffic - use RSVP, admission control, police at the ingress, and stick it at the front of the queue.
In practice, I'd expect overload to be due to lots of flows, but not so many little guys that they alone overload the link (if that happens, get a bigger pipe; we can say that and put it into practice). The overload will be due to a high baseline of little guys (20 packet flows, or a range of fairly small ones), plus some percentage of longer duration flows capable of sucking up the better part of a T1 given half a chance. It is the latter that you want to slow down, and these are the ones that you *can* slow down with a fairly low drop rate.
I leave it as an exercise to the reader to determine how RED fits into this picture (either one: my overload scenario, or Jim's, where all the flows are 20 packets long).
The 400 flows on 4 Mb/s is an interesting (and difficult) case. I've suggested both allowing delay to get very large (i.e., as high as 2 seconds) and hacking the host implementation to reduce the segment size to as low as 128 bytes when the RTT gets huge or cwnd drops below 4 segments, holding the window to no less than 512 bytes (4 segments), in hopes that fast retransmit will almost always work even in 15-20% loss situations.
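The arithmetic behind those two knobs, as a rough Python sketch (taking 4 segments as the minimum workable window, per the fast retransmit point in the original posting below):

    link, flows = 4e6, 400
    share = link / flows                      # ~10 Kb/s per flow

    def rtt_for_min_window(rate_bps, seg_bytes, min_segs=4):
        # RTT at which a window of min_segs segments matches rate_bps
        return min_segs * seg_bytes * 8 / rate_bps

    print(round(share / 1e3))                          # 10 (Kb/s per flow)
    print(round(rtt_for_min_window(share, 552), 2))    # ~1.8 s with 552-byte segments
    print(round(rtt_for_min_window(share, 128), 2))    # ~0.4 s with 128-byte segments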
Curtis
Curtis's original posting:
In order to get X bandwidth on a given TCP flow you need to have an average window size of X * RTT. This is expressed in terms of TCP segments N = (X * RTT) / MSS (or more correctly the segment size in use rather than MSS). To sustain an average window of N segments, you must ideally reach a steady state where you cut cwnd (current window) in half, then grow linearly, fluctuating between 2/3 and 4/3 of the target size. This would mean one drop in 2/3 N windows or DropRate in terms of time is 2/3 N * RTT. In one RTT on average X * RTT amount of data flows. In practice, you rarely drop at the perfect time, so the constant 2/3 (call it K) can be raised to 1-2. Since N = (X * RTT) / MSS, DropRate = K * X * RTT * X * RTT / MSS. Units are b/s * sec * b/s * sec / b, or b. The DropRate expressed in bits can be converted to seconds or packets (divide by X or by MSS). This type of analysis is courtesy of the good folks at PSC (Matt, Jamshid, et al).
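The same relationship in a few lines of Python (the names are mine, not Curtis's):

    def drop_interval_bits(x_bps, rtt_s, mss_bytes, k=1.0):
        # Bits transferred between drops for a flow sustaining x_bps:
        # DropRate = K * (X * RTT)^2 / MSS, everything in bits.
        return k * (x_bps * rtt_s) ** 2 / (mss_bytes * 8)

    def drop_interval(x_bps, rtt_s, mss_bytes, k=1.0):
        bits = drop_interval_bits(x_bps, rtt_s, mss_bytes, k)
        return bits / x_bps, bits / (mss_bytes * 8)   # (seconds, packets) between drops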
For example, to get 40 Mb/s at 70 msec RTT and a 4096-byte MSS, you get one error about every 6 seconds (K=1), or 1 in 7,300 packets. If you look at 56 Kb/s and a 512-byte MSS you get a very interesting result. You need one error every 66 msec, or 1 error in 0.9 packets. This gives a good incentive to increase delay. At 250 msec, you get a result of one error in 11.7 packets (much better!).
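Plugging the three cases into the sketch above reproduces these figures:

    print(drop_interval(40e6, 0.070, 4096))   # ~(6.0 s, 7300 packets) between drops
    print(drop_interval(56e3, 0.070, 512))    # ~(0.07 s, 0.9 packets) -- hopeless
    print(drop_interval(56e3, 0.250, 512))    # ~(0.85 s, 11.7 packets) -- much better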
Another interesting point to note is that you need 3 duplicate ACKs for TCP fast retransmit to work, so your window must be at least 4 segments (and should be more). If you have a very large number of TCP flows, where on average people get less than 1200 baud or so, the delay you need to make TCP work well starts to exceed the magic 3 second boundary. This was discussed ad nauseam on end2end-interest. An important result is that you need more queueing than the delay-bandwidth product for severely congested links. Another is that there is a limit to the number of active TCP flows that can be supported per unit of bandwidth. One suggestion to address the latter problem is to further drop the segment size if cwnd is less than 4 segments and/or when the estimated RTT gets into the seconds range.
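A rough sketch of that limit on active flows per bandwidth, assuming each flow needs at least 4 segments in flight and the RTT is allowed to grow to about 3 seconds (an optimistic ceiling; real workable numbers are lower):

    def max_flows(link_bps, max_rtt_s=3.0, seg_bytes=552, min_segs=4):
        # Each flow needs min_segs segments in flight; the pipe plus queue
        # holds at most link_bps * max_rtt_s bits.
        return int(link_bps * max_rtt_s / (min_segs * seg_bytes * 8))

    print(max_flows(44.736e6))              # DS3, 552-byte segments: ~7,600
    print(max_flows(4e6))                   # 4 Mb/s, 552-byte segments: ~680
    print(max_flows(4e6, seg_bytes=128))    # 4 Mb/s, 128-byte segments: ~2,900
    print(4 * 128 * 8 / 1200.0)             # at ~1200 b/s per flow even 128-byte
                                            # segments need an RTT of ~3.4 seconds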
This sort of analysis of how much loss TCP can tolerate would not be out of place in an informational RFC, but so far none exists.
Curtis