Re: TCP time_wait and port exhaustion for servers
On 5 Dec 2012, rps@maine.edu wrote:
Where there is no way to change this through /proc
...
Those netfilter connection tracking tunables have nothing to do with the kernel's TCP socket handling.
No, but these do...

net.ipv4.tcp_keepalive_intvl = 15
net.ipv4.tcp_keepalive_probes = 3
net.ipv4.tcp_keepalive_time = 90
net.ipv4.tcp_fin_timeout = 30

I think the OP was wrong, and missed something.

I'm no TCP/IP expert, but IME connections go into TIME_WAIT for a period pertaining to the above tunables (X number of probes at Y interval until the remote end is declared likely dead and gone), and then go into FIN_WAIT and then IIRC FIN_WAIT2 or some other state like that before they are finally killed off. Those tunables certainly seem to have actually worked in the real world for me, whether they are right "in theory" or not is possibly another matter.

Broadly speaking I agree with the other posters who've suggested adding other IP addresses and opening up the local port range available.

I'm assuming the talk of 30k connections is because the OP's proxy has a 'one in, one out' situation going on with connections, and that's why your ~65k pool for connections is halved.

K.
net.ipv4.tcp_keepalive_intvl = 15
net.ipv4.tcp_keepalive_probes = 3
net.ipv4.tcp_keepalive_time = 90
net.ipv4.tcp_fin_timeout = 30
As discussed, those do not affect TCP_TIMEWAIT_LEN.

There is a lot of misinformation out there on this subject so please don't just Google for 5 min. and chime in with a "solution" that you haven't verified yourself.

We can expand the ephemeral port range to be a full 60K (and we have as a band-aid), but that only delays the issue as use grows. I can verify that changing it via:

echo 1025 65535 > /proc/sys/net/ipv4/ip_local_port_range

does work for the full range, as a spot check shows ports as low as 2000 and as high as 64000 being used.

While this works fine for the majority of our sites, as they average well below that, for a handful peak hours can spike above 1000 connections per second. We would really like to be able to provide closer to 2000 or 2500 connections a second for the amount of bandwidth being delivered through the unit (a full gigabit), but ideally we would find a way to significantly reduce the number of ports being chewed up for outgoing connections. On the incoming side everything just makes use of the server port locally, so it's not an issue.

We're trying to avoid using multiple source addresses for this, as it would involve a fairly large configuration change to 100+ units, each requiring coordination with the end user, but it is a last-resort option. The other issue is that this is all essentially Squid, so a drastic redesign of how it handles networking is not ideal either.
--
Ray Patrick Soucy
Network Engineer
University of Maine System
T: 207-561-3526 F: 207-561-3531
MaineREN, Maine's Research and Education Network
www.maineren.net
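To put rough numbers on the ceiling described above: with one source address, every outgoing connection holds an ephemeral port for TCP_TIMEWAIT_LEN (60 seconds by default) after it closes, so the sustainable outbound rate is roughly the size of the port range divided by the TIME_WAIT length. A back-of-the-envelope sketch in C, using the 1025-65535 range quoted above (the 30-second figure is shown only for comparison):

/* Back-of-the-envelope: sustainable outbound connection rate from a
 * single source IP, assuming each connection occupies its source port
 * for the full TIME_WAIT period after close. */
#include <stdio.h>

int main(void)
{
    const int ports = 65535 - 1025 + 1;  /* expanded ephemeral range     */
    const int tw_60 = 60;                /* default TCP_TIMEWAIT_LEN (s) */
    const int tw_30 = 30;                /* a halved value, for contrast */

    printf("usable source ports:   %d\n", ports);
    printf("ceiling at 60s:        ~%d connections/sec\n", ports / tw_60);
    printf("ceiling at 30s:        ~%d connections/sec\n", ports / tw_30);
    return 0;
}

That works out to roughly 1075 connections per second at 60 seconds and about twice that at 30, which lines up with the 1000/s pain point and the 2000-2500/s target mentioned above.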
Quoting Ray Soucy <rps@maine.edu>:
net.ipv4.tcp_keepalive_intvl = 15
net.ipv4.tcp_keepalive_probes = 3
net.ipv4.tcp_keepalive_time = 90
net.ipv4.tcp_fin_timeout = 30
As discussed, those do not affect TCP_TIMEWAIT_LEN.
There is a lot of misinformation out there on this subject so please don't just Google for 5 min. and chime in with a "solution" that you haven't verified yourself.
...
Those tunables certainly seem to have actually worked in the real world for me, whether they are right "in theory" or not is possibly another matter.
TLDR? They worked for me, to reduce connections in a TIME_WAIT state, in a real situation, after well over 5 minutes of Googling. Exactly as I said. Further, they differed from the (netfilter) ones posted previously that were stated as not having anything to do with it by someone or other. There's no cause at all for your snotty message back.

What you didn't state in your email was whether these connections were being left in TIME_WAIT because they had not been closed (e.g. mobile devices or similar that are somewhat notorious for not closing connections properly), or whether the "normal" close process was taking too long. I suspect that if you had clarified that point initially, things would have made more sense all round.

The tunables listed above, AIUI, handle connections that were not properly terminated and are idling out, whereas I believe (having had the opportunity to consider it in more depth) your situation has more to do with "properly" terminated connections whose behaviour is hard-coded in the kernel. Perhaps you can clarify for the benefit of the masses.

Also, if you are going to hack the kernel to make that change, I urge you to make it part of the sysctl mechanism as well, and to send a patch back to the kernel developers to help out others who might be in a similar situation to you. This is both to help the community, and give you an easier means to tweak the setting as needed in future without a further kernel recompile.

K.

--
Kev Green, aka Kyrian. E: kyrian@ore.org WWW: http://kyrian.ore.org/
ISP/Perl/PHP/Linux/Security Contractor, via http://www.orenet.co.uk/
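For anyone curious what that suggestion would involve: TCP_TIMEWAIT_LEN is a compile-time constant, so exposing it through sysctl would mean turning it into a variable and registering an entry in the ipv4 sysctl table. A rough, purely hypothetical fragment of what such a patch might look like (the names tcp_timewait_len and sysctl_tcp_timewait_len are invented here; no such tunable exists in the stock kernel):

/* Hypothetical sketch only -- the stock kernel has no such tunable.
 * The TCP_TIMEWAIT_LEN macro would have to be replaced by a variable: */
int sysctl_tcp_timewait_len __read_mostly = 60 * HZ;

/* ...and an entry added to the ipv4 sysctl table so it shows up as
 * /proc/sys/net/ipv4/tcp_timewait_len (value given in seconds): */
static struct ctl_table tcp_timewait_sketch[] = {
	{
		.procname	= "tcp_timewait_len",
		.data		= &sysctl_tcp_timewait_len,
		.maxlen		= sizeof(int),
		.mode		= 0644,
		.proc_handler	= proc_dointvec_jiffies, /* seconds <-> jiffies */
	},
	{ }
};

This is only a sketch of the shape of the change, not a working patch; every place the kernel currently uses the TCP_TIMEWAIT_LEN macro would also need to read the variable instead.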
On 12/6/12 10:20 AM, Kyrian wrote:
Also, if you are going to hack the kernel to make that change, I urge you to make it part of the sysctl mechanism as well, and to send a patch back to the kernel developers to help out others who might be in a similar situation to you. This is both to help the community, and give you an easier means to tweak the setting as needed in future without a further kernel recompile.
Of course, this whole problem would have gone away years ago, had more folks implemented RFC6013. Or prior recommendations going back 15+ years. Meanwhile, my experience with the Linux kernel team is that about 1/2 of the tweak will go in, and the rest will fall by the wayside. Only about 1/3 of RFC6013 made it into 2.6.32, even though I started feeding them code 6 months before publication.
Question:

If a TCP connection is left hanging and continues to hoard the port for some time before it times out, shouldn't the work be focused on finding out why the connection is not properly closed, instead of trying to support a greater number of hung connections waiting to time out?
This issue is really for connections that close properly and without any issue. The application closes the socket and doesn't care about it, but the OS keeps it in the TIME_WAIT state as required by the RFC for TCP, in case data arrives after the connection has closed (out-of-order transmission).

I think we're going to go with dropping it to 30 seconds instead of 60 seconds and seeing how that goes. It seems to be the direction taken by people who have implemented high-traffic load balancers and proxy servers.

I was hoping someone would have real data on what a realistic time window is for keeping a socket in the TIME_WAIT state, but it doesn't seem like anyone has collected data on it.
--
Ray Patrick Soucy
Network Engineer
University of Maine System
T: 207-561-3526 F: 207-561-3531
MaineREN, Maine's Research and Education Network
www.maineren.net
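For readers wondering what "dropping it to 30 seconds" actually involves: TCP_TIMEWAIT_LEN lives in include/net/tcp.h and is not runtime-tunable, so the change is a one-line edit followed by a kernel rebuild. Roughly (paraphrasing the header; check your own kernel tree for the exact wording):

/* include/net/tcp.h -- the stock definition, roughly: */
#define TCP_TIMEWAIT_LEN (60*HZ)   /* how long to wait to destroy TIME-WAIT
                                    * state, about 60 seconds */

/* the band-aid discussed above amounts to rebuilding with: */
/* #define TCP_TIMEWAIT_LEN (30*HZ) */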
On Thu, Dec 06, 2012 at 08:58:10AM -0500, Ray Soucy wrote:
net.ipv4.tcp_keepalive_intvl = 15
net.ipv4.tcp_keepalive_probes = 3
net.ipv4.tcp_keepalive_time = 90
net.ipv4.tcp_fin_timeout = 30
As discussed, those do not affect TCP_TIMEWAIT_LEN.
There is a lot of misinformation out there on this subject so please don't just Google for 5 min. and chime in with a "solution" that you haven't verified yourself.
We can expand the ephemeral port range to be a full 60K (and we have as a band-aid), but that only delays the issue as use grows. I can verify that changing it via:
echo 1025 65535 > /proc/sys/net/ipv4/ip_local_port_range
Does work for the full range, as a spot check shows ports as low as 2000 and as high as 64000 being used.
I can attest to the effectiveness of this method; however, be sure to add any ports in that range that you use as incoming ports for services to /proc/sys/net/ipv4/ip_local_reserved_ports, otherwise the first time you restart a service that uses a high port (*cough*NRPE*cough*), its port will probably get snarfed for an outgoing connection and then you're in a sad, sad place.

- Matt

--
[An ad for Microsoft] uses the musical theme of the "Confutatis Maledictis" from Mozart's Requiem. "Where do you want to go today?" is on the screen, while the chorus sings "Confutatis maledictis, flammis acribus addictis,". Translation: "The damned and accursed are convicted to the flames of hell."
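For reference, ip_local_reserved_ports takes a comma-separated list of ports and ranges, and anything listed is skipped when the kernel chooses an ephemeral source port (the file only exists on newer kernels, as the next message notes). A small illustrative C equivalent of echoing a port into the file, using NRPE's default port 5666 purely as an example:

/* Illustrative only: reserve a service port (NRPE's default 5666 here)
 * so it is never handed out as an ephemeral source port.  Equivalent to:
 *   echo 5666 > /proc/sys/net/ipv4/ip_local_reserved_ports
 * The file accepts a comma-separated list such as "5666,8000-8100". */
#include <stdio.h>

int main(void)
{
    FILE *f = fopen("/proc/sys/net/ipv4/ip_local_reserved_ports", "w");

    if (f == NULL) {
        perror("ip_local_reserved_ports");
        return 1;
    }
    fprintf(f, "5666\n");
    return fclose(f) == 0 ? 0 : 1;
}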
+1 Thanks for the tip, this looks very useful.

Looks like it was only introduced in 2.6.35, and we're still on 2.6.32... It might be worth the upgrade; it just takes so long to test new kernel versions in this application.

We ended up dropping TCP_TIMEWAIT_LEN to 30 seconds as a band-aid for now, along with the expanded port range. In talking to others, 20 seconds seems to be 99%+ safe, with the sweet spot seeming to be around 24 seconds, so we opted to just go with 30 seconds and be cautious, even though others claim to have gone as low as 10 or 5 seconds without issue. I'll let people know if it introduces any problems.

In talking with the author of HAproxy, he seems to be in the camp that using SO_LINGER of 0 might be the way to go, but is unsure of how servers would respond to it; we'll likely try a build with that method and see what happens at some point.
--
Ray Patrick Soucy
Network Engineer
University of Maine System
T: 207-561-3526 F: 207-561-3531
MaineREN, Maine's Research and Education Network
www.maineren.net
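On the SO_LINGER point: setting SO_LINGER with a zero timeout makes close() abort the connection with a RST rather than going through the normal FIN exchange, so no TIME_WAIT state (and no held port) is left on the closing side. A minimal sketch of what that looks like on an outgoing socket; whether servers tolerate being reset this way is exactly the open question raised above:

/* Minimal sketch: abortive close via SO_LINGER {1, 0}.  close() then
 * sends a RST instead of a FIN, and the local port is freed immediately
 * with no TIME_WAIT entry on this side.  The peer sees a reset. */
#include <sys/socket.h>
#include <unistd.h>

int close_with_rst(int fd)
{
    struct linger lg = { .l_onoff = 1, .l_linger = 0 };

    /* If setting the option fails, we still close, just without the RST. */
    (void)setsockopt(fd, SOL_SOCKET, SO_LINGER, &lg, sizeof(lg));
    return close(fd);
}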
participants (5)
- Jean-Francois Mezei
- Kyrian
- Matthew Palmer
- Ray Soucy
- William Allen Simpson