For each second that goes by, you remove X ports from the available pool for new connections, for however long TCP_TIMEWAIT_LEN is set (60 seconds by default on Linux). In this case it's quick connections for HTTP requests, most of which finish in under a second.

Say you have a pool of 30,000 ports and 500 new connections per second (typical):

1 second goes by: 29,500 ports available
10 seconds: 25,000 available
30 seconds: 15,000 available
59 seconds: 500 available (29,500 tied up in TIME_WAIT)
60 seconds: the first second's 500 ports are released

From then on you get back 500 ports per second while using 500 per second, so you hold steady at 29,500 in TIME_WAIT. Everyone is happy. Now say you're averaging 550 connections per second: suddenly there are no available ports left to use.

So your first option is to bump up the range of allowed local ports (the net.ipv4.ip_local_port_range sysctl). Easy enough, but even if you open it up as far as it goes, 1025 to 65535, that's still only about 64,000 ports; with the 60-second TCP_TIMEWAIT_LEN you can sustain an average of roughly 1,000 connections per second.

Our problem is that our busy sites easily peak at that 1,000-connection-per-second average, and when we enable TCP_TW_RECYCLE we see them push past it to 1,500 or so connections per second sustained. Unfortunately, TCP_TW_RECYCLE is a little too blunt a hammer and breaks TCP.
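To make the arithmetic explicit, here's a trivial back-of-the-envelope calculator (C just for illustration; nothing here is measured from a live box, the numbers are the ones above):

    /* Steady-state TIME_WAIT math: ports tied up = rate * TCP_TIMEWAIT_LEN,
     * so the sustainable connection rate is pool_size / TCP_TIMEWAIT_LEN. */
    #include <stdio.h>

    int main(void)
    {
        const int timewait_len = 60;            /* seconds, the Linux default */
        const int pools[] = { 30000, 64000 };   /* port pool sizes from above */

        for (int i = 0; i < 2; i++)
            printf("%d ports / %d s TIME_WAIT -> ~%d conn/sec sustainable\n",
                   pools[i], timewait_len, pools[i] / timewait_len);
        return 0;
    }

That's where the 500/second and roughly 1,000/second ceilings come from.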
From what I've read and heard from others, in a high-connection-rate environment the key is really to lower TCP_TIMEWAIT_LEN.
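For reference, TCP_TIMEWAIT_LEN is a compile-time constant rather than a sysctl, which is why lowering it means a kernel rebuild. At least in kernels of this vintage it's a one-line define in include/net/tcp.h:

    /* include/net/tcp.h -- the constant behind the 60-second timer;
     * rebuilding with, say, (20*HZ) gives the 20-second timeout
     * discussed below. */
    #define TCP_TIMEWAIT_LEN (60*HZ)  /* how long to wait to destroy
                                       * TIME-WAIT state, about 60 seconds */

If memory serves, TCP_FIN_TIMEOUT's default is tied to the same constant, so it's worth grepping for other users of it before changing it.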
My question is basically: how low can you go? There seems to be consensus that 20 seconds is safe, 15 is OK 99% of the time, and 10 or less is problematic. So if I rebuild the kernel to use a 20-second timeout, that 30,000-port pool can sustain 1,500 connections per second, and a 60,000-port pool can sustain 3,000. The software could be rewritten to round-robin through source IP addresses for outgoing requests (a rough sketch of that idea follows the quoted thread below), but we're trying to avoid that.

On Wed, Dec 5, 2012 at 1:58 PM, William Herrin <bill@herrin.us> wrote:
On Wed, Dec 5, 2012 at 12:09 PM, Ray Soucy <rps@maine.edu> wrote:
Like most web traffic, the majority of these connections open and close in under a second. When we get to the point where there is enough traffic from users behind the proxy to generate over 500 new outgoing connections per second, sustained, users start experiencing an error where there are no local ports available for Squid to use, since they're all tied up in the TIME_WAIT state.
Here is an example of netstat totals on a box we're seeing the behavior on:
481947 TIME_WAIT
Stupid question, but how does 500 x 60 = 481947? To have that many connections in TIME_WAIT on a 60-second timer, you'd need more like 8,000 connections per second, wouldn't you?
Regards, Bill Herrin
--
William D. Herrin ................ herrin@dirtside.com  bill@herrin.us
3005 Crane Dr. ...................... Web: <http://bill.herrin.us/>
Falls Church, VA 22042-3004
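As for the round-robin idea mentioned above, here's roughly what the rewrite would look like: bind each outgoing socket to the next address in a list of locally configured source IPs before connecting, so every source address gets its own ephemeral port pool. The address list and helper are made up for illustration, and real code would want proper error reporting; Squid has its own machinery for this (tcp_outgoing_address), so this is just the shape of the idea:

    /* Sketch: rotate outgoing connections across several local source
     * addresses so each address gets its own pool of ephemeral ports.
     * The 192.0.2.x addresses are documentation examples, not ours. */
    #include <arpa/inet.h>
    #include <netinet/in.h>
    #include <string.h>
    #include <sys/socket.h>
    #include <unistd.h>

    static const char *src_addrs[] = { "192.0.2.10", "192.0.2.11", "192.0.2.12" };
    static const int n_src = 3;
    static int next_src = 0;

    /* Open a TCP connection to dst:port, binding the local side to the
     * next source address in the rotation. Returns an fd, or -1 on error. */
    int connect_round_robin(const char *dst, unsigned short port)
    {
        int fd = socket(AF_INET, SOCK_STREAM, 0);
        if (fd < 0)
            return -1;

        struct sockaddr_in src;
        memset(&src, 0, sizeof(src));
        src.sin_family = AF_INET;
        src.sin_port = 0;                  /* kernel picks the ephemeral port */
        inet_pton(AF_INET, src_addrs[next_src], &src.sin_addr);
        next_src = (next_src + 1) % n_src;

        struct sockaddr_in dst_sa;
        memset(&dst_sa, 0, sizeof(dst_sa));
        dst_sa.sin_family = AF_INET;
        dst_sa.sin_port = htons(port);
        inet_pton(AF_INET, dst, &dst_sa.sin_addr);

        if (bind(fd, (struct sockaddr *)&src, sizeof(src)) < 0 ||
            connect(fd, (struct sockaddr *)&dst_sa, sizeof(dst_sa)) < 0) {
            close(fd);
            return -1;
        }
        return fd;
    }

Each source address brings its own ~64,000 ephemeral ports, so three addresses triple the ceiling without touching the kernel; the cost is that anything doing policy on the proxy's traffic now has to know about several addresses.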
--
Ray Patrick Soucy
Network Engineer
University of Maine System
T: 207-561-3526 F: 207-561-3531
MaineREN, Maine's Research and Education Network
www.maineren.net