For each second that goes by, you remove X ports from the available pool for new connections, for however long TCP_TIMEWAIT_LEN is set (60 seconds by default on Linux). In this case it's quick connections for HTTP requests, most of which finish in under a second.

Say you have a pool of 30,000 ports and 500 new connections per second (typical):

1 second goes by: 29,500 ports available
10 seconds: 25,000 available
30 seconds: 15,000 available
59 seconds: 500 available (29,500 tied up in TIME_WAIT)
60 seconds: the first second's 500 ports are released

From then on you get back 500 ports per second while using 500 per second, so you hold steady at 29,500 in TIME_WAIT. Everyone is happy. Now say you're averaging 550 connections per second: suddenly there are no available ports left to use.

So your first option is to bump up the range of allowed local ports (the net.ipv4.ip_local_port_range sysctl). Easy enough, but even if you open it up as far as it goes, 1025 to 65535, that's still only about 64,000 ports; with the 60-second TCP_TIMEWAIT_LEN you can sustain an average of roughly 1,000 connections per second.

Our problem is that our busy sites easily peak at that 1,000-connection-per-second average, and when we enable TCP_TW_RECYCLE we see them push past it to 1,500 or so connections per second sustained. Unfortunately, TCP_TW_RECYCLE is a little too blunt a hammer and breaks TCP.
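To make the arithmetic explicit, here's a trivial back-of-the-envelope calculator (C just for illustration; nothing here is measured from a live box, the numbers are the ones above):

    /* Steady-state TIME_WAIT math: ports tied up = rate * TCP_TIMEWAIT_LEN,
     * so the sustainable connection rate is pool_size / TCP_TIMEWAIT_LEN. */
    #include <stdio.h>

    int main(void)
    {
        const int timewait_len = 60;            /* seconds, the Linux default */
        const int pools[] = { 30000, 64000 };   /* port pool sizes from above */

        for (int i = 0; i < 2; i++)
            printf("%d ports / %d s TIME_WAIT -> ~%d conn/sec sustainable\n",
                   pools[i], timewait_len, pools[i] / timewait_len);
        return 0;
    }

That's where the 500/second and roughly 1,000/second ceilings come from.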
From what I've read and heard from others, in a high-connection-rate environment the key is really to lower TCP_TIMEWAIT_LEN.
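For reference, TCP_TIMEWAIT_LEN is a compile-time constant rather than a sysctl, which is why lowering it means a kernel rebuild. At least in kernels of this vintage it's a one-line define in include/net/tcp.h:

    /* include/net/tcp.h -- the constant behind the 60-second timer;
     * rebuilding with, say, (20*HZ) gives the 20-second timeout
     * discussed below. */
    #define TCP_TIMEWAIT_LEN (60*HZ)  /* how long to wait to destroy
                                       * TIME-WAIT state, about 60 seconds */

If memory serves, TCP_FIN_TIMEOUT's default is tied to the same constant, so it's worth grepping for other users of it before changing it.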
My question is basically: how low can you go? There seems to be consensus that 20 seconds is safe, 15 is OK 99% of the time, and 10 or less is problematic. So if I rebuild the kernel to use a 20-second timeout, that 30,000-port pool can sustain 1,500 connections per second, and a 60,000-port pool can sustain 3,000. The software could be rewritten to round-robin through source IP addresses for outgoing requests (a rough sketch of that idea follows the quoted thread below), but we're trying to avoid that.

On Wed, Dec 5, 2012 at 1:58 PM, William Herrin <bill@herrin.us> wrote:
On Wed, Dec 5, 2012 at 12:09 PM, Ray Soucy <rps@maine.edu> wrote:
Like most web traffic, the majority of these connections open and close in under a second. When we get to the point where there is enough traffic from users behind the proxy to generate over 500 new outgoing connections per second, sustained, users start experiencing an error where there are no local ports available for Squid to use, since they're all tied up in the TIME_WAIT state.
Here is an example of netstat totals on a box we're seeing the behavior on:
481947 TIME_WAIT
Stupid question, but how does 500 x 60 = 481947? To have that many connections in TIME_WAIT on a 60-second timer, you'd need more like 8,000 connections per second, wouldn't you?
Regards, Bill Herrin
--
William D. Herrin ................ herrin@dirtside.com  bill@herrin.us
3005 Crane Dr. ...................... Web: <http://bill.herrin.us/>
Falls Church, VA 22042-3004
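As for the round-robin idea mentioned above, here's roughly what the rewrite would look like: bind each outgoing socket to the next address in a list of locally configured source IPs before connecting, so every source address gets its own ephemeral port pool. The address list and helper are made up for illustration, and real code would want proper error reporting; Squid has its own machinery for this (tcp_outgoing_address), so this is just the shape of the idea:

    /* Sketch: rotate outgoing connections across several local source
     * addresses so each address gets its own pool of ephemeral ports.
     * The 192.0.2.x addresses are documentation examples, not ours. */
    #include <arpa/inet.h>
    #include <netinet/in.h>
    #include <string.h>
    #include <sys/socket.h>
    #include <unistd.h>

    static const char *src_addrs[] = { "192.0.2.10", "192.0.2.11", "192.0.2.12" };
    static const int n_src = 3;
    static int next_src = 0;

    /* Open a TCP connection to dst:port, binding the local side to the
     * next source address in the rotation. Returns an fd, or -1 on error. */
    int connect_round_robin(const char *dst, unsigned short port)
    {
        int fd = socket(AF_INET, SOCK_STREAM, 0);
        if (fd < 0)
            return -1;

        struct sockaddr_in src;
        memset(&src, 0, sizeof(src));
        src.sin_family = AF_INET;
        src.sin_port = 0;                  /* kernel picks the ephemeral port */
        inet_pton(AF_INET, src_addrs[next_src], &src.sin_addr);
        next_src = (next_src + 1) % n_src;

        struct sockaddr_in dst_sa;
        memset(&dst_sa, 0, sizeof(dst_sa));
        dst_sa.sin_family = AF_INET;
        dst_sa.sin_port = htons(port);
        inet_pton(AF_INET, dst, &dst_sa.sin_addr);

        if (bind(fd, (struct sockaddr *)&src, sizeof(src)) < 0 ||
            connect(fd, (struct sockaddr *)&dst_sa, sizeof(dst_sa)) < 0) {
            close(fd);
            return -1;
        }
        return fd;
    }

Each source address brings its own ~64,000 ephemeral ports, so three addresses triple the ceiling without touching the kernel; the cost is that anything doing policy on the proxy's traffic now has to know about several addresses.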
--
Ray Patrick Soucy
Network Engineer
University of Maine System
T: 207-561-3526 F: 207-561-3531
MaineREN, Maine's Research and Education Network
www.maineren.net