TCP time_wait and port exhaustion for servers
RFC 793 arbitrarily defines 2MSL (how long to hold a socket in TIME_WAIT state before cleaning up) as 4 minutes. Linux is a little more reasonable here and has it baked into the source as 60 seconds in "/usr/src/linux/include/net/tcp.h":

#define TCP_TIMEWAIT_LEN (60*HZ)

Since there is no way to change this through /proc (probably a good idea to keep users from messing with it), I am considering re-building a kernel with a lower TCP_TIMEWAIT_LEN to deal with the following issue.

With a 60 second timeout on TIME_WAIT, local port identifiers are tied up from being used for new outgoing connections (in this case a proxy server). The default local port range on Linux can easily be adjusted; but even when bumped up to a range of 32K ports, the 60 second timeout means you can only sustain about 500 new connections per second before you run out of ports.

There are two options to try and deal with this, tcp_tw_reuse and tcp_tw_recycle, but both seem to be less than ideal. tcp_tw_reuse doesn't appear to be effective in situations where you're sustaining 500+ new connections per second rather than a small burst. tcp_tw_recycle seems like too big of a hammer and has been reported to cause problems with NATed connections.

The best solution seems to be keeping TIME_WAIT in place, but being faster about it. 30 seconds would get you to 1000 connections a second; 15 to 2000; and 10 seconds to about 3000 a second.

A few questions: Does anyone have any data on how typical it is for TIME_WAIT to be necessary beyond 10 seconds on a modern network? Has anyone done research on how low you can make TIME_WAIT safely? Is this a terrible idea? What alternatives are there? Keep in mind this is a proxy server making outgoing connections as the source of the problem, so things like SO_REUSEADDR, which work for reusing sockets for incoming connections, don't seem to do much in this situation.

Anyone running large proxies or load balancers have this situation? If so, what is your solution?

-- Ray Patrick Soucy Network Engineer University of Maine System T: 207-561-3526 F: 207-561-3531 MaineREN, Maine's Research and Education Network www.maineren.net
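The arithmetic above reduces to usable local ports divided by the TIME_WAIT interval. A minimal sketch of that calculation, assuming a round 30,000-port pool as in the figures discussed in this thread; the numbers are illustrative, not measurements:

#include <stdio.h>

int main(void)
{
    const int ports = 30000;                        /* assumed usable local port pool */
    const int timewait_len[] = { 60, 30, 15, 10 };  /* candidate TCP_TIMEWAIT_LEN values, seconds */
    int i;

    for (i = 0; i < 4; i++)
        printf("TIME_WAIT = %2d s  ->  ~%d new connections/s sustained\n",
               timewait_len[i], ports / timewait_len[i]);
    return 0;
}

Running it reproduces the 500 / 1000 / 2000 / 3000 connections-per-second figures quoted above.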
Ray,
With a 60 second timeout on TIME_WAIT, local port identifiers are tied up from being used for new outgoing connections (in this case a proxy server). The default local port range on Linux can easily be adjusted; but even when bumped up to a range of 32K ports, the 60 second timeout means you can only sustain about 500 new connections per second before you run out of ports.
Is that 500 new connections per second per {protocol, remote address, remote port} tuple that's too few for your proxy? (OK, this tuple is more or less equivalent to just {remote address} if we're talking about a web proxy.) Just curious.

Regards, András
This would be outgoing connections sourced from the IP of the proxy, destined to whatever remote website (so 80 or 443) requested by the user.

Essentially it's a modified Squid service that is used to filter HTTP for CIPA compliance (required by the government) to keep children in public schools from stumbling onto inappropriate content.

Like most web traffic, the majority of these connections open and close in under a second. When we get to the point that there is enough traffic from users behind the proxy to generate over 500 new outgoing connections per second, sustained, users start seeing an error where there are no local ports available for Squid to use, since they're all tied up in a TIME_WAIT state.

Here is an example of netstat totals on a box where we're seeing the behavior:

10 LAST_ACK
32 LISTEN
5 SYN_RECV
5 CLOSE_WAIT
756 ESTABLISHED
26 FIN_WAIT1
40 FIN_WAIT2
5 CLOSING
10 SYN_SENT
481947 TIME_WAIT

As a band-aid we've opened up the local port range to allow up to 50K local ports with /proc/sys/net/ipv4/ip_local_port_range, but they're brushing up against that limit again at peak times. It's a shame because memory- and CPU-wise the box isn't breaking a sweat.

Enabling TW_REUSE doesn't seem to have any effect for this case (/proc/sys/net/ipv4/tcp_tw_reuse). Using TW_RECYCLE drops the TIME_WAIT count to about 10K instead of 50K, but everything I read online says to avoid TW_RECYCLE because it will break things horribly.

Someone responded off-list saying that TIME_WAIT is controlled by /proc/sys/net/ipv4/tcp_fin_timeout, but that is just incorrect information that has been parroted on a lot of blogs. There is no relation between fin_timeout and TCP_TIMEWAIT_LEN.

This level of use seems to translate into about 250 Mbps of traffic on average, FWIW.

On Wed, Dec 5, 2012 at 11:56 AM, JÁKÓ András <jako.andras@eik.bme.hu> wrote:
Ray,
With a 60 second timeout on TIME_WAIT, local port identifiers are tied up from being used for new outgoing connections (in this case a proxy server). The default local port range on Linux can easily be adjusted; but even when bumped up to a range of 32K ports, the 60 second timeout means you can only sustain about 500 new connections per second before you run out of ports.
Is that 500 new connections per second per {protocol, remote address, remote port} tuple, that's too few for your proxy? (OK, this tuple is more or less equivalent with only {remote address} if we talk about a web proxy.) Just curious.
Regards, András
-- Ray Patrick Soucy Network Engineer University of Maine System T: 207-561-3526 F: 207-561-3531 MaineREN, Maine's Research and Education Network www.maineren.net
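To reproduce state counts like the netstat totals above without netstat, one can read /proc/net/tcp directly. A rough sketch, assuming the standard procfs column layout where the fourth field is the hex socket state and 06 means TIME_WAIT (IPv4 only; /proc/net/tcp6 would need the same treatment):

#include <stdio.h>
#include <string.h>

int main(void)
{
    char line[512], st[8];
    int total = 0, timewait = 0;
    FILE *f = fopen("/proc/net/tcp", "r");

    if (f == NULL) {
        perror("fopen /proc/net/tcp");
        return 1;
    }
    fgets(line, sizeof(line), f);                     /* skip the header row */
    while (fgets(line, sizeof(line), f) != NULL) {
        /* columns: sl local_address rem_address st ... ; st is the hex state */
        if (sscanf(line, "%*s %*s %*s %7s", st) == 1) {
            total++;
            if (strcmp(st, "06") == 0)                /* 06 == TCP_TIME_WAIT */
                timewait++;
        }
    }
    fclose(f);
    printf("%d IPv4 TCP sockets, %d in TIME_WAIT\n", total, timewait);
    return 0;
}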
On 12/5/12 9:09 AM, Ray Soucy wrote:
This would be outgoing connections sourced from the IP of the proxy, destined to whatever remote website (so 80 or 443) requested by the user.
Essentially it's a modified Squid service that is used to filter HTTP for CIPA compliance (required by the government) to keep children in public schools from stumbling onto inappropriate content.
Like most web traffic, the majority of these connections open and close in under a second. When we get to a point that there is enough traffic from users behind the proxy to be generating over 500 new outgoing connections per second, sustained, we start having users experience an error where there are no local ports available to Squid to use since they're all tied up in a TIME_WAIT state.
Here is an example of netstat totals on a box we're seeing the behavior on:
10 LAST_ACK
32 LISTEN
5 SYN_RECV
5 CLOSE_WAIT
756 ESTABLISHED
26 FIN_WAIT1
40 FIN_WAIT2
5 CLOSING
10 SYN_SENT
481947 TIME_WAIT
As a band-aid we've opened up the local port range to allow up to 50K local ports with /proc/sys/net/ipv4/ip_local_port_range, but they're brushing up against that limit again at peak times.

We've found it necessary to use address pools to source outgoing connections from our DC devices in order to prevent collisions with ports in timewait for some particularly high-traffic destinations. It kinda sucks to burn a /28 or shorter per outbound proxy, but there you have it.
It's a shame because memory and CPU-wise the box isn't breaking a sweat.
Enabling TW_REUSE doesn't seem to have any effect for this case (/proc/sys/net/ipv4/tcp_tw_reuse) Using TW_RECYCLE drops the TIME_WAIT count to about 10K instead of 50K, but everything I read online says to avoid using TW_RECYCLE because it will break things horribly.
Someone responded off-list saying that TIME_WAIT is controlled by /proc/sys/net/ipv4/tcp_fin_timeout, but that is just incorrect information that has been parroted on a lot of blogs. There is no relation between fin_timeout and TCP_TIMEWAIT_LEN.
This level of use seems to translate into about 250 Mbps of traffic on average, FWIW.
On Wed, Dec 5, 2012 at 11:56 AM, JÁKÓ András <jako.andras@eik.bme.hu> wrote:
Ray,
With a 60 second timeout on TIME_WAIT, local port identifiers are tied up from being used for new outgoing connections (in this case a proxy server). The default local port range on Linux can easily be adjusted; but even when bumped up to a range of 32K ports, the 60 second timeout means you can only sustain about 500 new connections per second before you run out of ports.

Is that 500 new connections per second per {protocol, remote address, remote port} tuple that's too few for your proxy? (OK, this tuple is more or less equivalent to just {remote address} if we're talking about a web proxy.) Just curious.
Regards, András
On Wed, Dec 5, 2012 at 12:09 PM, Ray Soucy <rps@maine.edu> wrote:
Like most web traffic, the majority of these connections open and close in under a second. When we get to a point that there is enough traffic from users behind the proxy to be generating over 500 new outgoing connections per second, sustained, we start having users experience an error where there are no local ports available to Squid to use since they're all tied up in a TIME_WAIT state.
Here is an example of netstat totals on a box we're seeing the behavior on:
481947 TIME_WAIT
Stupid question, but how does 500 x 60 = 481947? To have that many connections in TIME_WAIT on a 60 second timer, you'd need more like 8000 connections per second, wouldn't you?

Regards, Bill Herrin

-- William D. Herrin ................ herrin@dirtside.com bill@herrin.us 3005 Crane Dr. ...................... Web: <http://bill.herrin.us/> Falls Church, VA 22042-3004
On Dec 5, 2012, at 10:58 AM, William Herrin <bill@herrin.us> wrote:
On Wed, Dec 5, 2012 at 12:09 PM, Ray Soucy <rps@maine.edu> wrote:
Like most web traffic, the majority of these connections open and close in under a second. When we get to a point that there is enough traffic from users behind the proxy to be generating over 500 new outgoing connections per second, sustained, we start having users experience an error where there are no local ports available to Squid to use since they're all tied up in a TIME_WAIT state.
Here is an example of netstat totals on a box we're seeing the behavior on:
481947 TIME_WAIT
Stupid question but how does 500 x 60 = 481947? To have that many connections in TIME_WAIT on a 60 second timer, you'd need more like 8000 connections per second, wouldn't you?
Isn't TIME_WAIT based on disconnections, not connections? Sure, assuming all connections are of equal duration, the disconnection rate will be roughly equal to the connection rate, and of course over the long term they trend toward equality, but that doesn't mean that the peak number of connections in TIME_WAIT can't be greater than what the incoming connection rate would suggest.

Owen
For each second that goes by you remove X ports from the available pool for new connections, for however long TCP_TIMEWAIT_LEN is set to (60 seconds by default in Linux). In this case it's making quick connections for HTTP requests (most of which finish in less than a second).

Say you have a pool of 30,000 ports and 500 new connections per second (typical): after 1 second you have 29,500 left; after 10 seconds, 25,000; after 30 seconds, 15,000; at 59 seconds you're down to 500, and at 60 the first 500 come back, so you hover there with 29,500 ports tied up in TIME_WAIT. Everyone is happy.

Now say that you're seeing an average of 550 connections a second. Suddenly there aren't any available ports to use.

So, your first option is to bump up the range of allowed local ports; easy enough, but even if you open it up as much as you can and go from 1025 to 65535, that's still only 64,000 ports; with a 60 second TCP_TIMEWAIT_LEN you can sustain an average of about 1,000 connections a second.

Our problem is that our busy sites are easily peaking at that 1,000 connection a second average, and when we enable TCP_TW_RECYCLE we see them go past that to 1,500 or so connections per second sustained. Unfortunately, TCP_TW_RECYCLE is a little too blunt a hammer and breaks TCP.
From what I've read and heard from others, in a high connection environment the key is really to drop down the TCP_TIMEWAIT_LEN.
My question is basically, "how low can you go?" There seems to be consensus around 20 seconds being safe, 15 being 99% OK, and 10 or less being problematic.

So if I rebuild the kernel to use a 20 second timeout, that 30,000 port pool can sustain 1,500 connections per second, and a 60,000 port pool can sustain 3,000.

The software could be re-written to round-robin through IP addresses for outgoing requests, but I'm trying to avoid that.

On Wed, Dec 5, 2012 at 1:58 PM, William Herrin <bill@herrin.us> wrote:
On Wed, Dec 5, 2012 at 12:09 PM, Ray Soucy <rps@maine.edu> wrote:
Like most web traffic, the majority of these connections open and close in under a second. When we get to a point that there is enough traffic from users behind the proxy to be generating over 500 new outgoing connections per second, sustained, we start having users experience an error where there are no local ports available to Squid to use since they're all tied up in a TIME_WAIT state.
Here is an example of netstat totals on a box we're seeing the behavior on:
481947 TIME_WAIT
Stupid question but how does 500 x 60 = 481947? To have that many connections in TIME_WAIT on a 60 second timer, you'd need more like 8000 connections per second, wouldn't you?
Regards, Bill Herrin
-- William D. Herrin ................ herrin@dirtside.com bill@herrin.us 3005 Crane Dr. ...................... Web: <http://bill.herrin.us/> Falls Church, VA 22042-3004
-- Ray Patrick Soucy Network Engineer University of Maine System T: 207-561-3526 F: 207-561-3531 MaineREN, Maine's Research and Education Network www.maineren.net
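A toy model of the walkthrough in Ray's message above: a fixed pool of local ports, a steady rate of new connections, and each port held for TCP_TIMEWAIT_LEN seconds after its (near-instant) connection closes. The 30,000-port pool and 550/s rate are the example numbers from the message, not measurements:

#include <stdio.h>

int main(void)
{
    const int pool = 30000;    /* usable local ports (example figure from the thread) */
    const int tw_len = 60;     /* seconds a port stays in TIME_WAIT */
    const int rate = 550;      /* new outgoing connections per second */
    int in_time_wait = 0, t;

    for (t = 1; t <= 120; t++) {
        in_time_wait += rate;            /* ports consumed this second */
        if (t > tw_len)
            in_time_wait -= rate;        /* ports released as their TIME_WAIT expires */
        if (in_time_wait > pool) {
            printf("t=%ds: need %d ports but only %d exist -> connect() starts failing\n",
                   t, in_time_wait, pool);
            return 0;
        }
        printf("t=%3ds: %5d ports free\n", t, pool - in_time_wait);
    }
    return 0;
}

At 550/s the toy model runs out of ports before the first TIME_WAIT interval even expires; set the rate back to 500/s and it hovers right at the edge of the pool, matching the "everyone is happy" case above.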
On Wed, Dec 5, 2012 at 2:55 PM, Ray Soucy <rps@maine.edu> wrote:
For each second that goes by you remove X addresses from the available pool of ports for new connections for whatever the TCP_TIMEWAIT_LEN is set to (60 seconds by default in Linux).
In this case it's making quick connections for HTTP requests (most of which finish in less than a second).
Say you have a pool of 30,000 ports and 500 new connections per second (typical): after 1 second you have 29,500 left; after 10 seconds, 25,000; after 30 seconds, 15,000; at 59 seconds you're down to 500, and at 60 the first 500 come back, so you hover there with 29,500 ports tied up in TIME_WAIT. Everyone is happy.
The thing is, Linux doesn't behave quite that way.

If you do an anonymous connect(), that is you socket() and then connect() without a bind() in the middle, then the limit applies *per destination IP:port pair*. So, you should be able to do 30,000 connections to 192.168.1.1 port 80, another 30,000 connections to 192.168.1.2 port 80, and so on. You should only fail if you A) bump against the top of NR_OPEN or B) try to do a massive number of TCP connections to the same remote IP address.

Try it: set up a listener on discard that just closes the connection and repeat connect() to 127.0.0.5 until you get an error. Then confirm that you're out of ports:

telnet 127.0.0.5 9
Trying 127.0.0.5...
telnet: Unable to connect to remote host: Cannot assign requested address

And confirm that you can still make outbound connections to a different IP address:

telnet 127.0.0.4 9
Trying 127.0.0.4...
Connected to 127.0.0.4.
Escape character is '^]'.
Connection closed by foreign host.

Regards, Bill Herrin

-- William D. Herrin ................ herrin@dirtside.com bill@herrin.us 3005 Crane Dr. ...................... Web: <http://bill.herrin.us/> Falls Church, VA 22042-3004
In message <CAP-guGW6oXo=UfTfg+SDiFjB4=qxPShO+YfK6vxnLkCC58PvgQ@mail.gmail.com>, William Herrin writes:
On Wed, Dec 5, 2012 at 2:55 PM, Ray Soucy <rps@maine.edu> wrote:
For each second that goes by you remove X addresses from the available pool of ports for new connections for whatever the TCP_TIMEWAIT_LEN is set to (60 seconds by default in Linux).
In this case it's making quick connections for HTTP requests (most of which finish in less than a second).
Say you have a pool of 30,000 ports and 500 new connections per second (typical): after 1 second you have 29,500 left; after 10 seconds, 25,000; after 30 seconds, 15,000; at 59 seconds you're down to 500, and at 60 the first 500 come back, so you hover there with 29,500 ports tied up in TIME_WAIT. Everyone is happy.
The thing is, Linux doesn't behave quite that way.
If you do an anonymous connect(), that is you socket() and then connect() without a bind() in the middle, then the limit applies *per destination IP:port pair*. So, you should be able to do 30,000 connections to 192.168.1.1 port 80, another 30,000 connections to 192.168.1.2 port 80, and so on.
The socket api is missing a bind + connect call which restricts the source address when making the connect. This is needed when you are required to use a fixed source address.
You should only fail if you A) bump against the top of NR_OPEN or B) try to do a massive number of TCP connections to the same remote IP address.
Try it: set up a listener on discard that just closes the connection and repeat connect() to 127.0.0.5 until you get an error. Then confirm that you're out of ports:
telnet 127.0.0.5 9 Trying 127.0.0.5... telnet: Unable to connect to remote host: Cannot assign requested address
And confirm that you can still make outbound connections to a different IP address:
telnet 127.0.0.4 9 Trying 127.0.0.4... Connected to 127.0.0.4. Escape character is '^]'. Connection closed by foreign host.
Regards, Bill Herrin
-- William D. Herrin ................ herrin@dirtside.com bill@herrin.us 3005 Crane Dr. ...................... Web: <http://bill.herrin.us/> Falls Church, VA 22042-3004
-- Mark Andrews, ISC 1 Seymour St., Dundas Valley, NSW 2117, Australia PHONE: +61 2 9871 4742 INTERNET: marka@isc.org
On Wed, Dec 5, 2012 at 5:01 PM, Mark Andrews <marka@isc.org> wrote:
In message <CAP-guGW6oXo=UfTfg+SDiFjB4=qxPShO+YfK6vxnLkCC58PvgQ@mail.gmail.com>, William Herrin writes:
The thing is, Linux doesn't behave quite that way.
If you do an anonymous connect(), that is you socket() and then connect() without a bind() in the middle, then the limit applies *per destination IP:port pair*. So, you should be able to do 30,000 connections to 192.168.1.1 port 80, another 30,000 connections to 192.168.1.2 port 80, and so on.
The socket api is missing a bind + connect call which restricts the source address when making the connect. This is needed when you are required to use a fixed source address.
Hi Mark,

There are ways around this problem in Linux. For example, you can mark a packet with iptables based on the uid of the process which created it, and then you can NAT the source address based on the mark. A little messy, but the tools are there.

Anyway, Ray didn't indicate that he needed a fixed source address other than the one the machine would ordinarily choose for itself.

Regards, Bill Herrin

-- William D. Herrin ................ herrin@dirtside.com bill@herrin.us 3005 Crane Dr. ...................... Web: <http://bill.herrin.us/> Falls Church, VA 22042-3004
In message <CAP-guGVSMXgt-xhnqC191-Mfh2-W38Gg5mZf1MegCgwObVKK-Q@mail.gmail.com>, William Herrin writes:
On Wed, Dec 5, 2012 at 5:01 PM, Mark Andrews <marka@isc.org> wrote:
In message <CAP-guGW6oXo=UfTfg+SDiFjB4=qxPShO+YfK6vxnLkCC58PvgQ@mail.gmail.c om>, William Herrin writes:
The thing is, Linux doesn't behave quite that way.
If you do an anonymous connect(), that is you socket() and then connect() without a bind() in the middle, then the limit applies *per destination IP:port pair*. So, you should be able to do 30,000 connections to 192.168.1.1 port 80, another 30,000 connections to 192.168.1.2 port 80, and so on.
The socket api is missing a bind + connect call which restricts the source address when making the connect. This is needed when you are required to use a fixed source address.
Hi Mark,
There are ways around this problem in Linux. For example you can mark a packet with iptables based on the uid of the process which created it and then you can NAT the source address based on the mark. Little messy but the tools are there.
And not available to the ordinary user. Nameservers potentially run into this limit. This is something The Open Group needs to address when updating the next revision of the socket API in POSIX.
Anyway, Ray didn't indicate that he needed a fixed source address other than the one the machine would ordinarily choose for itself.
But he didn't say it wasn't required either.

Mark

-- Mark Andrews, ISC 1 Seymour St., Dundas Valley, NSW 2117, Australia PHONE: +61 2 9871 4742 INTERNET: marka@isc.org
It does require a fixed source address. The box is also a router and firewall, so it has many IP addresses available to it.

On Wed, Dec 5, 2012 at 5:24 PM, William Herrin <bill@herrin.us> wrote:
On Wed, Dec 5, 2012 at 5:01 PM, Mark Andrews <marka@isc.org> wrote:
In message <CAP-guGW6oXo=UfTfg+SDiFjB4=qxPShO+YfK6vxnLkCC58PvgQ@mail.gmail.com>, William Herrin writes:
The thing is, Linux doesn't behave quite that way.
If you do an anonymous connect(), that is you socket() and then connect() without a bind() in the middle, then the limit applies *per destination IP:port pair*. So, you should be able to do 30,000 connections to 192.168.1.1 port 80, another 30,000 connections to 192.168.1.2 port 80, and so on.
The socket api is missing a bind + connect call which restricts the source address when making the connect. This is needed when you are required to use a fixed source address.
Hi Mark,
There are ways around this problem in Linux. For example you can mark a packet with iptables based on the uid of the process which created it and then you can NAT the source address based on the mark. Little messy but the tools are there.
Anyway, Ray didn't indicate that he needed a fixed source address other than the one the machine would ordinarily choose for itself.
Regards, Bill Herrin
-- William D. Herrin ................ herrin@dirtside.com bill@herrin.us 3005 Crane Dr. ...................... Web: <http://bill.herrin.us/> Falls Church, VA 22042-3004
-- Ray Patrick Soucy Network Engineer University of Maine System T: 207-561-3526 F: 207-561-3531 MaineREN, Maine's Research and Education Network www.maineren.net
In article <xs4all.20121205220127.7F6F12CA0F17@drugs.dv.isc.org> you write:
In message <CAP-guGW6oXo=UfTfg+SDiFjB4=qxPShO+YfK6vxnLkCC58PvgQ@mail.gmail.com>, William Herrin writes:
The thing is, Linux doesn't behave quite that way.
If you do an anonymous connect(), that is you socket() and then connect() without a bind() in the middle, then the limit applies *per destination IP:port pair*. So, you should be able to do 30,000 connections to 192.168.1.1 port 80, another 30,000 connections to 192.168.1.2 port 80, and so on.
The socket api is missing a bind + connect call which restricts the source address when making the connect. This is needed when you are required to use a fixed source address.
William was talking about the destination address. Linux (and I would hope any other network stack) can really open a million connections from one source address, as long as it's not to one destination address but to lots of different ones. It's not the (srcip,srcport) tuple that needs to be unique; it's the (srcip,srcport,dstip,dstport) tuple.

Anyway, you can actually bind to a source address and still have a dynamic source port; just use port 0. Lots of tools do this. (For example, strace nc -s 127.0.0.2 127.0.0.1 22 and see what it does.)

Mike.
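A minimal sketch of the pattern Miquel describes: bind() pins the source address but leaves the port as 0 so the kernel still chooses the ephemeral port, then connect() as usual. The addresses are documentation-range placeholders, not anything from this thread:

#include <arpa/inet.h>
#include <netinet/in.h>
#include <stdio.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

int main(void)
{
    struct sockaddr_in src, dst;
    int s = socket(AF_INET, SOCK_STREAM, 0);

    if (s < 0) { perror("socket"); return 1; }

    memset(&src, 0, sizeof(src));
    src.sin_family = AF_INET;
    src.sin_port = htons(0);                           /* 0 = kernel picks the source port */
    inet_pton(AF_INET, "192.0.2.10", &src.sin_addr);   /* fixed source IP (placeholder) */
    if (bind(s, (struct sockaddr *)&src, sizeof(src)) < 0) {
        perror("bind");
        return 1;
    }

    memset(&dst, 0, sizeof(dst));
    dst.sin_family = AF_INET;
    dst.sin_port = htons(80);
    inet_pton(AF_INET, "203.0.113.5", &dst.sin_addr);  /* remote web server (placeholder) */
    if (connect(s, (struct sockaddr *)&dst, sizeof(dst)) < 0) {
        perror("connect");
        return 1;
    }

    close(s);
    return 0;
}

Worth noting, and consistent with Mark's test further down: once the source is pinned this way the ephemeral port is picked at bind() time, before the kernel knows the destination, so the per-destination port reuse Bill describes no longer applies and the local port range becomes the hard limit again.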
In message <201212052325.qB5NPrZe005631@xs8.xs4all.nl>, "Miquel van Smoorenburg" writes:
In article <xs4all.20121205220127.7F6F12CA0F17@drugs.dv.isc.org> you write:
In message <CAP-guGW6oXo=UfTfg+SDiFjB4=qxPShO+YfK6vxnLkCC58PvgQ@mail.gmail.com>, William Herrin writes:
The thing is, Linux doesn't behave quite that way.
If you do an anonymous connect(), that is you socket() and then connect() without a bind() in the middle, then the limit applies *per destination IP:port pair*. So, you should be able to do 30,000 connections to 192.168.1.1 port 80, another 30,000 connections to 192.168.1.2 port 80, and so on.
The socket api is missing a bind + connect call which restricts the source address when making the connect. This is needed when you are required to use a fixed source address.
William was talking about the destination address. Linux (and I would hope any other network stack) can really open a million connections from one source address, as long as it's not to one destination address but to lots of different ones. It's not the (srcip,srcport) tuple that needs to be unique; it's the (srcip,srcport,dstip,dstport) tuple.
Anyway, you can actually bind to a source address and still have a dynamic source port; just use port 0. Lots of tools do this.
(for example, strace nc -s 127.0.0.2 127.0.0.1 22 and see what it does)
Mike.
Eventually the bind call fails. Below was a counter: dest address in hex

16376: 1a003ff9
16377: 1a003ffa
bind: before bind: Can't assign requested address
16378: 1a003ffb
connect: Can't assign requested address
bind: before bind: Can't assign requested address

and if you remove the bind() the connect fails

16378: 1a003ffb
16379: 1a003ffc
connect: Can't assign requested address
16380: 1a003ffd

this is with a simple loop

socket()
ioctl(FIONBIO)
bind(addr++:80)
connect()

I had a firewall dropping the connection attempts

-- Mark Andrews, ISC 1 Seymour St., Dundas Valley, NSW 2117, Australia PHONE: +61 2 9871 4742 INTERNET: marka@isc.org
In message <20121206004909.B302F2CA212F@drugs.dv.isc.org>, Mark Andrews writes:
In message <201212052325.qB5NPrZe005631@xs8.xs4all.nl>, "Miquel van Smoorenburg" writes:
In article <xs4all.20121205220127.7F6F12CA0F17@drugs.dv.isc.org> you write:
In message <CAP-guGW6oXo=UfTfg+SDiFjB4=qxPShO+YfK6vxnLkCC58PvgQ@mail.gmail.com>, William Herrin writes:
The thing is, Linux doesn't behave quite that way.
If you do an anonymous connect(), that is you socket() and then connect() without a bind() in the middle, then the limit applies *per destination IP:port pair*. So, you should be able to do 30,000 connections to 192.168.1.1 port 80, another 30,000 connections to 192.168.1.2 port 80, and so on.
The socket api is missing a bind + connect call which restricts the source address when making the connect. This is needed when you are required to use a fixed source address.
William was talking about the destination address. Linux (and I would hope any other network stack) can really open a million connections from one source address, as long as it's not to one destination address but to lots of different ones. It's not the (srcip,srcport) tuple that needs to be unique; it's the (srcip,srcport,dstip,dstport) tuple.
Anyway, you can actually bind to a source address and still have a dynamic source port; just use port 0. Lots of tools do this.
(for example, strace nc -s 127.0.0.2 127.0.0.1 22 and see what it does)
Mike.
Eventually the bind call fails. Below was a
counter: dest address in hex
16376: 1a003ff9
16377: 1a003ffa
bind: before bind: Can't assign requested address
16378: 1a003ffb
connect: Can't assign requested address
bind: before bind: Can't assign requested address
and if you remove the bind() the connect fails
16378: 1a003ffb
16379: 1a003ffc
connect: Can't assign requested address
16380: 1a003ffd
this is with a simple loop
socket() ioctl(FIONBIO) bind(addr++:80) connect()
I had a firewall dropping the connection attempts
To get more, one needs to setsockopt SO_REUSEADDR, but that consumes all the port space, so applications that need to listen for incoming connections on the same machine will break.

If you also set IP_PORTRANGE_HIGH, and have configured the system so that the high range does not match the default range, then you can avoid the above issue. This is the default configuration for some but not all platforms. Sockets with IP_PORTRANGE_HIGH set are not expected to accept incoming traffic.

So if you are making outbound connections you should set the SO_REUSEADDR and IP_PORTRANGE_HIGH options on the socket to avoid local port limits. Note that most applications do not do this, which is fine until you are using tens of thousands of outgoing sockets.

Mark
-- Mark Andrews, ISC 1 Seymour St., Dundas Valley, NSW 2117, Australia PHONE: +61 2 9871 4742 INTERNET: marka@isc.org
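A sketch of the two socket options Mark mentions. IP_PORTRANGE / IP_PORTRANGE_HIGH are BSD-style options (FreeBSD and relatives) with no direct Linux equivalent, so the call is guarded and the whole thing should be read as illustrating the idea rather than portable code:

#include <netinet/in.h>
#include <stdio.h>
#include <sys/socket.h>
#include <unistd.h>

int make_outbound_socket(void)
{
    int s = socket(AF_INET, SOCK_STREAM, 0);
    int on = 1;

    if (s < 0) {
        perror("socket");
        return -1;
    }

    /* per the note above: allow local ports still sitting in TIME_WAIT to be reused */
    if (setsockopt(s, SOL_SOCKET, SO_REUSEADDR, &on, sizeof(on)) < 0)
        perror("setsockopt SO_REUSEADDR");

#ifdef IP_PORTRANGE
    {
        int range = IP_PORTRANGE_HIGH;
        /* draw the ephemeral port from the "high" range, kept apart from the
         * default range so listening services are not disturbed (BSD only) */
        if (setsockopt(s, IPPROTO_IP, IP_PORTRANGE, &range, sizeof(range)) < 0)
            perror("setsockopt IP_PORTRANGE");
    }
#endif

    return s;
}

int main(void)
{
    int s = make_outbound_socket();

    if (s < 0)
        return 1;
    close(s);
    return 0;
}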
On Wed, Dec 5, 2012 at 7:49 PM, Mark Andrews <marka@isc.org> wrote:
counter: dest address in hex
16376: 1a003ff9 16377: 1a003ffa bind: before bind: Can't assign requested address 16378: 1a003ffb connect: Can't assign requested address bind: before bind: Can't assign requested address
and if you remove the bind() the connect fails
16378: 1a003ffb 16379: 1a003ffc connect: Can't assign requested address 16380: 1a003ffd
Tried it. When I removed the bind() I made it to a quarter million connections before I ran out of ram on the test machine and the out of memory killer swung into action. Don't know what your problem is.

for (count=0; count<1000000; count++) {
    s=sockets[count]=socket(AF_INET,SOCK_STREAM,0);
    if (s<0) {
        printf ("\nCould not get socket #%d\n",count);
        sleep (900);
        return 1;
    }
    if (connect(sockets[count], (struct sockaddr *) &sa, sizeof(sa))<0) {
        if (errno != 115) {
            printf ("\nErrno %d on socket #%d\n",(int) errno, count);
            sleep (900);
            return 1;
        }
    }
    sa.sin_addr.s_addr = htonl(ntohl(sa.sin_addr.s_addr)+1);
    now = time(NULL);
    if (now!=before) {
        before=now;
        fprintf (stdout,"%d\r",count);
        fflush (stdout);
    }
}

Regards, Bill Herrin

-- William D. Herrin ................ herrin@dirtside.com bill@herrin.us 3005 Crane Dr. ...................... Web: <http://bill.herrin.us/> Falls Church, VA 22042-3004
On Wed, 5 Dec 2012, Ray Soucy wrote:
So if I rebuild the kernel to use a 20 second timeout, then that 30000 port pool can sustain 1500, and a 60000 port pool can sustain 3000 connections per second.
The software could be re-written to round-robin though IP addresses for outgoing requests, but trying to avoid that.
It's kind of a hack, but you don't have to rewrite the software to get different source IPs for different connections. On Linux, you could do the following:

*) Keep your normal default route
*) Configure extra IPs as aliases (eth0:0, eth0:1, ...) on the proxy
*) Split up the internet into however many subnets you have proxy host IPs
*) Route each part of the internet to your default gateway tacking on "dev eth0:n"

This will make the default IP for reaching each subnet of the internet the IP from eth0:n.

Of course you probably won't get very good load balancing of connections over your IPs that way, but it's better than nothing and a really quick fix that would give you immediate additional capacity.

I was going to also suggest, that to get better balancing, you could periodically (for some relatively short period) rotate the internet subnet routes such that you'd change which parts of the internet were pointed at which dev eth0:n every so many seconds or minutes, but that's kind of annoying to people like me (similar to the problem I recently posted about with AT&T 3G data web proxy). Having your software round robin the source IPs would probably introduce the same problem/effect.

---------------------------------------------------------------------- Jon Lewis, MCP :) | I route Senior Network Engineer | therefore you are Atlantic Net | _________ http://www.lewis.org/~jlewis/pgp for PGP public key_________
If you want to get into software rewriting, the simplest thing I might come up with would be to put TCBs in some form of LRU list and, at a point where you need a port back, close the TCB that least recently did anything. My understanding is that this was implemented 15 years ago to manage SYN attacks, and could be built on to manage this form of "attack".

Or, change the period of time a TCB is willing to stay in time-wait. Instead of 60 seconds, make it 10.

On Dec 5, 2012, at 1:11 PM, Jon Lewis wrote:
On Wed, 5 Dec 2012, Ray Soucy wrote:
So if I rebuild the kernel to use a 20 second timeout, then that 30000 port pool can sustain 1500, and a 60000 port pool can sustain 3000 connections per second.
The software could be re-written to round-robin though IP addresses for outgoing requests, but trying to avoid that.
It's kind of a hack, but you don't have to rewrite the software to get different source IPs for different connections. On linux, you could do the following:
*) Keep your normal default route *) Configure extra IPs as aliases (eth0:0, eth0:1,...) on the proxy *) Split up the internet into however many subnets you have proxy host IPs *) route each part of the internet to your default gateway tacking on "dev eth0:n".
This will make the default IP for reaching each subnet of the internet the IP from eth0:n.
Of course you probably won't get very good load balancing of connections over your IPs that way, but it's better than nothing and a really quick fix that would give you immediate additional capacity.
I was going to also suggest, that to get better balancing, you could periodically (for some relatively short period) rotate the internet subnet routes such that you'd change which parts of the internet were pointed at which dev eth0:n every so many seconds or minutes, but that's kind of annoying to people like me (similar to the problem I recently posted about with AT&T 3G data web proxy). Having your software round robin the source IPs would probably introduce the same problem/effect.
---------------------------------------------------------------------- Jon Lewis, MCP :) | I route Senior Network Engineer | therefore you are Atlantic Net | _________ http://www.lewis.org/~jlewis/pgp for PGP public key_________
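A toy, user-space illustration of the bookkeeping Fred describes above: connections kept on an LRU list, with the least-recently-active one closed when a local port is needed. The real mechanism would live in the kernel's TCB handling; this only shows the list discipline:

#include <stdio.h>
#include <stdlib.h>

struct conn {
    int fd;                    /* would be a real socket in practice */
    struct conn *prev, *next;
};

static struct conn *head, *tail;   /* most recently used at head, least at tail */

static void touch(struct conn *c)  /* move to front on any activity */
{
    if (c == head) return;
    if (c->prev) c->prev->next = c->next;
    if (c->next) c->next->prev = c->prev;
    if (c == tail) tail = c->prev;
    c->prev = NULL;
    c->next = head;
    if (head) head->prev = c;
    head = c;
    if (!tail) tail = c;
}

static struct conn *track(int fd)  /* register a new connection as most recently used */
{
    struct conn *c = calloc(1, sizeof(*c));
    if (!c) return NULL;
    c->fd = fd;
    c->next = head;
    if (head) head->prev = c;
    head = c;
    if (!tail) tail = c;
    return c;
}

static void evict_lru(void)        /* reclaim the least recently used entry */
{
    struct conn *c = tail;
    if (!c) return;
    tail = c->prev;
    if (tail) tail->next = NULL; else head = NULL;
    printf("closing idle connection fd=%d to free its local port\n", c->fd);
    free(c);                       /* a real version would close(c->fd) here */
}

int main(void)
{
    struct conn *a = track(3);
    track(4);
    track(5);
    touch(a);       /* fd 3 becomes most recently used */
    evict_lru();    /* evicts fd 4, the least recently touched */
    return 0;
}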
On Dec 5, 2012, at 2:06 PM, Fred Baker (fred) <fred@cisco.com> wrote:
If you want to get into software rewriting, the simplest thing I might come up with would be to put TCBs in some form of LRU list and, at a point where you need a port back, close the TCB that least recently did anything. My understanding is that this was implemented 15 years ago to manage SYN attacks, and could be built on to manage this form of "attack".
I can say for certain that it was implemented (at least) twice that long ago (circa 1983) in a TCP implementation for a particular memory constrained environment ("640K should be good enough for anybody") :). Regards, -drc
On Wed, 5 Dec 2012, Ray Soucy wrote:
My question is basically, "how low can you go?"
There seems to be consensus around 20 seconds being safe, 15 being a 99% OK, and 10 or less being problematic.
I'm trying to imagine how even 10 could be problematic nowadays. Have you found people reporting specific issues with 10? -Terry
There is an extra 7 on that number; it should have been 48194 (I was sitting at a different PC so I typed it instead of copy-pasting).

On Wed, Dec 5, 2012 at 1:58 PM, William Herrin <bill@herrin.us> wrote:
On Wed, Dec 5, 2012 at 12:09 PM, Ray Soucy <rps@maine.edu> wrote:
Like most web traffic, the majority of these connections open and close in under a second. When we get to a point that there is enough traffic from users behind the proxy to be generating over 500 new outgoing connections per second, sustained, we start having users experience an error where there are no local ports available to Squid to use since they're all tied up in a TIME_WAIT state.
Here is an example of netstat totals on a box we're seeing the behavior on:
481947 TIME_WAIT
Stupid question but how does 500 x 60 = 481947? To have that many connections in TIME_WAIT on a 60 second timer, you'd need more like 8000 connections per second, wouldn't you?
Regards, Bill Herrin
-- William D. Herrin ................ herrin@dirtside.com bill@herrin.us 3005 Crane Dr. ...................... Web: <http://bill.herrin.us/> Falls Church, VA 22042-3004
-- Ray Patrick Soucy Network Engineer University of Maine System T: 207-561-3526 F: 207-561-3531 MaineREN, Maine's Research and Education Network www.maineren.net
You could simply add another IP address to the server's source-address pool, which effectively gives you another 32K (or whatever value you have for the local port range) identifiers.

Owen

On Dec 5, 2012, at 7:59 AM, Ray Soucy <rps@maine.edu> wrote:
RFC 793 arbitrarily defines 2MSL (how long to hold a socket in TIME_WAIT state before cleaning up) as 4 min.
Linux is a little more reasonable in this and has it baked into the source as 60 seconds in "/usr/src/linux/include/net/tcp.h": #define TCP_TIMEWAIT_LEN (60*HZ)
Since there is no way to change this through /proc (probably a good idea to keep users from messing with it), I am considering re-building a kernel with a lower TCP_TIMEWAIT_LEN to deal with the following issue.
With a 60 second timeout on TIME_WAIT, local port identifiers are tied up from being used for new outgoing connections (in this case a proxy server). The default local port range on Linux can easily be adjusted; but even when bumped up to a range of 32K ports, the 60 second timeout means you can only sustain about 500 new connections per second before you run out of ports.
There are two options to try and deal with this, tcp_tw_reuse and tcp_tw_recycle; but both seem to be less than ideal. With tcp_tw_reuse, it doesn't appear to be effective in situations where you're sustaining 500+ new connections per second rather than a small burst. With tcp_tw_recycle it seems like too big of a hammer and has been reported to cause problems with NATed connections.
The best solution seems to be trying to keep TIME_WAIT in place, but being faster about it.
30 seconds would get you to 1000 connections a second; 15 to 2000, and 10 seconds to about 3000 a second.
A few questions:
Does anyone have any data on how typical it is for TIME_WAIT to be necessary beyond 10 seconds on a modern network? Has anyone done some research on how low you can make TIME_WAIT safely? Is this a terrible idea? What alternatives are there? Keep in mind this is a proxy server making outgoing connections as the source of the problem; so things like SO_REUSEADDR which work for reusing sockets for incoming connections don't seem to do much in this situation.
Anyone running large proxies or load balancers have this situation? If so what is your solution?
-- Ray Patrick Soucy Network Engineer University of Maine System
T: 207-561-3526 F: 207-561-3531
MaineREN, Maine's Research and Education Network www.maineren.net
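A rough sketch of the "more source addresses" approach Owen suggests above and joel described earlier in the thread: rotate outgoing connections across a small pool of local IPs so each address gets its own ephemeral-port space. The pool, the documentation-range addresses, and the round-robin policy are all assumptions for illustration:

#include <arpa/inet.h>
#include <netinet/in.h>
#include <stdio.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

static const char *src_pool[] = { "192.0.2.10", "192.0.2.11", "192.0.2.12" };
#define POOL_SIZE (sizeof(src_pool) / sizeof(src_pool[0]))
static unsigned next_src;

int connect_from_pool(const char *dst_ip, unsigned short dst_port)
{
    struct sockaddr_in src, dst;
    int s = socket(AF_INET, SOCK_STREAM, 0);

    if (s < 0) { perror("socket"); return -1; }

    memset(&src, 0, sizeof(src));
    src.sin_family = AF_INET;
    src.sin_port = htons(0);    /* kernel still picks the source port */
    inet_pton(AF_INET, src_pool[next_src++ % POOL_SIZE], &src.sin_addr);

    memset(&dst, 0, sizeof(dst));
    dst.sin_family = AF_INET;
    dst.sin_port = htons(dst_port);
    inet_pton(AF_INET, dst_ip, &dst.sin_addr);

    if (bind(s, (struct sockaddr *)&src, sizeof(src)) < 0 ||
        connect(s, (struct sockaddr *)&dst, sizeof(dst)) < 0) {
        perror("bind/connect");
        close(s);
        return -1;
    }
    return s;
}

int main(void)
{
    int s = connect_from_pool("203.0.113.5", 80);   /* example destination */

    if (s >= 0)
        close(s);
    return 0;
}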
On 5 Dec 2012, rps@maine.edu wrote:
Where there is no way to change this though /proc
10:17PM lenovo:~% sudo sysctl -a | grep wait
net.netfilter.nf_conntrack_tcp_timeout_fin_wait = 120
net.netfilter.nf_conntrack_tcp_timeout_close_wait = 60
net.netfilter.nf_conntrack_tcp_timeout_time_wait = 120
net.ipv4.netfilter.ip_conntrack_tcp_timeout_fin_wait = 120
net.ipv4.netfilter.ip_conntrack_tcp_timeout_close_wait = 60
net.ipv4.netfilter.ip_conntrack_tcp_timeout_time_wait = 120
10:17PM lenovo:~%

?

We use this to work around the default limit on our internal load balancers.

HIH.

-- Cyril Bouthors - Administration Système, Infogérance ISVTEC SARL, 14 avenue de l'Opéra, 75001 Paris 1 rue Émile Zola, 69002 Lyon Tél : 01 84 16 16 17 - Fax : 01 77 72 57 24 Ligne directe : 0x7B9EE3B0E
On Wed, 5 Dec 2012, Cyril Bouthors wrote:
On 5 Dec 2012, rps@maine.edu wrote:
Where there is no way to change this though /proc
10:17PM lenovo:~% sudo sysctl -a | grep wait
net.netfilter.nf_conntrack_tcp_timeout_fin_wait = 120
net.netfilter.nf_conntrack_tcp_timeout_close_wait = 60
net.netfilter.nf_conntrack_tcp_timeout_time_wait = 120
net.ipv4.netfilter.ip_conntrack_tcp_timeout_fin_wait = 120
net.ipv4.netfilter.ip_conntrack_tcp_timeout_close_wait = 60
net.ipv4.netfilter.ip_conntrack_tcp_timeout_time_wait = 120
Those netfilter connection tracking tunables have nothing to do with the kernel's TCP socket handling. ---------------------------------------------------------------------- Jon Lewis, MCP :) | I route Senior Network Engineer | therefore you are Atlantic Net | _________ http://www.lewis.org/~jlewis/pgp for PGP public key_________
This tunes conntrack, not local TCP on the server itself.

On Wed, Dec 5, 2012 at 4:18 PM, Cyril Bouthors <cyril@bouthors.org> wrote:
On 5 Dec 2012, rps@maine.edu wrote:
Where there is no way to change this though /proc
10:17PM lenovo:~% sudo sysctl -a | grep wait
net.netfilter.nf_conntrack_tcp_timeout_fin_wait = 120
net.netfilter.nf_conntrack_tcp_timeout_close_wait = 60
net.netfilter.nf_conntrack_tcp_timeout_time_wait = 120
net.ipv4.netfilter.ip_conntrack_tcp_timeout_fin_wait = 120
net.ipv4.netfilter.ip_conntrack_tcp_timeout_close_wait = 60
net.ipv4.netfilter.ip_conntrack_tcp_timeout_time_wait = 120
10:17PM lenovo:~%
?
We use this to work around the default limit on our internal load balancers.
HIH. -- Cyril Bouthors - Administration Système, Infogérance ISVTEC SARL, 14 avenue de l'Opéra, 75001 Paris 1 rue Émile Zola, 69002 Lyon Tél : 01 84 16 16 17 - Fax : 01 77 72 57 24 Ligne directe : 0x7B9EE3B0E
-- Ray Patrick Soucy Network Engineer University of Maine System T: 207-561-3526 F: 207-561-3531 MaineREN, Maine's Research and Education Network www.maineren.net
participants (12):
- Cyril Bouthors
- David Conrad
- Fred Baker (fred)
- joel jaeggli
- Jon Lewis
- JÁKÓ András
- Mark Andrews
- Miquel van Smoorenburg
- Owen DeLong
- Ray Soucy
- Terry Baranski
- William Herrin