PMTU-D: remember, your load balancer is broken
This is your monthly PMTU-D horkage rant.

Chances are that if you are using a load balancer for TCP connections, then it does not properly handle Path MTU Discovery. Examples of devices that, last I knew, do not handle this properly are localdirectors and arrowpoints. F5 claimed that they fixed their big/ip product to do this properly some time ago (remember when they broke NSI's whois service this way?), but I haven't seen it in action yet or know what version is required, and their support channels don't seem to know much about it when asked, giving nonsensical answers like "it is built into the BSD/OS system that our product is built on". I would love to know about any such load balancers that actually do handle this right.

For an explanation of PMTU-D, see http://users.worldgate.ca/~marcs/mtu/

What happens with most load balancers is that when the server behind them tries to use PMTU-D, the ICMP "can't fragment" that may come back from a router between the server and the client will not make it to the load-balanced server, because the load balancer will throw it away. The result is that most users with a path MTU less than min(client MTU, server MTU) will be unable to receive data from the server.

The fix is to bitch at your vendor to fix their broken system and to tell them to hire someone who knows something about how TCP works. If you are a vendor, then make sure your load balancing software works right: it needs to either send the "can't fragment" on to just the backend servers that have connections from the remote IP, or flood it to all of them.

The workaround for the person using such load balancers is to disable PMTU-D on your backend servers. This is your only option if the vendor of your load balancer doesn't care or takes a while to release a fix.

If you have complaints from that small subset of clients who can open a TCP connection to your load-balanced IP but can't receive any response to their request, this could be what is up. (Yes, www.slashdot.org seems to be broken in this way as I type. Oh well, slashdot isn't always a good thing...)

This has been your monthly PMTU-D horkage rant.
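For vendors wondering what "works right" means in practice, here is a minimal sketch of the dispatch logic, in Python for readability. The packet layout is straight out of RFC 792/1191; everything else (the connection table, the backend objects and their forward() method) is a hypothetical stand-in for whatever state your balancer already keeps.

    # Sketch: route an ICMP "fragmentation needed" (type 3, code 4) message to
    # the backend server that owns the connection it refers to, instead of
    # dropping it on the floor.
    import socket
    import struct

    def parse_frag_needed(ip_packet):
        """Return (src, dst, sport, dport, next_hop_mtu) for the connection an
        ICMP frag-needed error refers to, or None if it is something else."""
        ihl = (ip_packet[0] & 0x0F) * 4              # outer IP header length
        icmp = ip_packet[ihl:]
        if (icmp[0], icmp[1]) != (3, 4):             # not "frag needed and DF set"
            return None
        next_hop_mtu = struct.unpack("!H", icmp[6:8])[0]
        inner = icmp[8:]                             # quoted original IP header + 8 bytes
        inner_ihl = (inner[0] & 0x0F) * 4
        src = socket.inet_ntoa(inner[12:16])         # original sender (our side)
        dst = socket.inet_ntoa(inner[16:20])         # original destination = the client
        sport, dport = struct.unpack("!HH", inner[inner_ihl:inner_ihl + 4])
        return src, dst, sport, dport, next_hop_mtu

    def dispatch_icmp(ip_packet, connections_by_client, backends):
        """connections_by_client and backends are hypothetical balancer state."""
        parsed = parse_frag_needed(ip_packet)
        if parsed is None:
            return
        _, client_ip, _, _, _ = parsed
        backend = connections_by_client.get(client_ip)
        for b in ([backend] if backend else backends):   # unknown client? flood it
            b.forward(ip_packet)                         # hypothetical send method

Either branch (the targeted forward or the flood) satisfies the requirement above; flooding is just the lazy-but-correct version.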
On Tue, 13 Jun 2000 17:04:19 MDT, Marc Slemko <marcs@znep.com> said:
Chances are that if you are using a load balancer for TCP connections, then it does not properly handle Path MTU Discovery. Examples of devices
Does anybody have any field experience on how much PMTU-D actually helps?

I just checked 'netstat -s' on an AIX box that runs a stratum-2 NTP server, which accidentally had it enabled for several weeks. Abridged output follows:

ip:
        16357209 total packets received
        18411 fragments received
        5314999 path MTU discovery packets sent
        0 path MTU discovery decreases detected
icmp:
        Input histogram:
                echo reply: 3635421
                destination unreachable: 271455

AIX sends a test ICMP Echo to detect PMTU for UDP (which is where the high icmp numbers came from). The main interface on the box is a 10BaseT, so the MTU gets nailed to 1500. As a result, I do *not* have figures on how often we would have used a bigger MTU than 1500 - only on whether there's still sub-1500 links out there. On the other hand, at least in today's Internet, the Other End is still quite likely to be 10BaseT or PPP.

Approximately 80% of the traffic this machine sees is from off-campus, all over the US. We only got about 60% replies on the test ICMP Echo, which constituted a good 40% of the entire traffic. In spite of this, not once did the PMTU get fragmented below 1500.

Admittedly, PMTU-D for TCP is a lot less resource intensive (just set the DF bit and see who salutes). However, it should be tripped roughly the same percent of the time (if a packet needs fragmenting, it needs fragmenting - it's probably rare that a TCP packet of a given size would fit but the same size UDP would fragment).

It looks to me like a better Rule Of Thumb is just:

a) If you know that the path to a specific net has an MTU over 1500 all the way, set a route specifying the MTU.

b) If you're a webserver or something else providing service Out There to random users, just nail the MTU at 1500, which will work for any Ethernet/PPP/SLIP out there. And if you're load balancing to geographically disparate servers, then your users are probably Out There, with an MTU almost guaranteed to be 1500.

I assert that the chances of PMTU-D helping are in direct ratio to the number of end users who have connections with MTU>1500 - it's almost a sure thing that you probably won't have users with an MTU on their last-hop that's bigger than their campus backbone and/or Internet connection's MTU.

Is anybody seeing any documentable wins by using PMTU-D?

--
Valdis Kletnieks
Operating Systems Analyst
Virginia Tech
On Tue, 13 Jun 2000 Valdis.Kletnieks@vt.edu wrote:
On Tue, 13 Jun 2000 17:04:19 MDT, Marc Slemko <marcs@znep.com> said:
Chances are that if you are using a load balancer for TCP connections, then it does not properly handle Path MTU Discovery. Examples of devices
Does anybody have any field experience on how much PMTU-D actually helps? I just checked 'netstat -s' on an AIX box that runs a stratum-2 NTP server, which accidentally had it enabled for several weeks. Abridged output follows:
ip:
        16357209 total packets received
        18411 fragments received
        5314999 path MTU discovery packets sent
        0 path MTU discovery decreases detected
Mmm. I don't trust AIX, especially with a "0". A 1 or 2 would make me trust it more.

I'll throw in some numbers from a FreeBSD machine (a day or so's worth):

        73658076 packets sent
                59036492 data packets (2258619726 bytes)
                1916471 data packets (1875195237 bytes) retransmitted
                290 resends initiated by MTU discovery
                9082213 ack-only packets (3047476 delayed)
                0 URG only packets
                81937 window probe packets
                842836 window update packets
                2698127 control packets
        2881141 connections established (including accepts)

This machine mostly serves HTTP, with a bit of random junk thrown in, has a 1500 byte MTU, and 99% of connections are from remote clients. So this could be... 5000 connections that get a win from PMTU-D over hardcoding a 1460 MSS, as a rough guess (assuming that each of the 290 resends initiated by MTU discovery represents a host that makes x connections over the time that the result is cached, and it only takes one try to get it right). Whatever the numbers, they aren't a very high percentage.
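The arithmetic behind that rough guess, written out; the connections-per-cached-host factor is the same unknown "x" as in the text, picked here only to land near the 5000 figure:

    # Back-of-the-envelope check of the "5000 connections" guess above, using
    # the netstat numbers as given.  conns_per_cached_host is a hypothetical
    # value (the "x" in the text), not a measurement.
    resends_from_pmtud = 290            # "resends initiated by MTU discovery"
    total_connections = 2881141         # "connections established (including accepts)"
    conns_per_cached_host = 17          # assumed repeat connections while the PMTU is cached

    winners = resends_from_pmtud * conns_per_cached_host
    print(f"~{winners} of {total_connections} connections "
          f"({100.0 * winners / total_connections:.2f}%) would see a PMTU-D win")
    # prints roughly 4930 connections, i.e. under 0.2% of the total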
icmp:
        Input histogram:
                echo reply: 3635421
                destination unreachable: 271455
AIX sends a test ICMP Echo to detect PMTU for UDP (which is where the high icmp numbers came from). The main interface on the box is a 10BaseT, so the MTU gets nailed to 1500. As a result, I do *not* have figures on how often we would have used a bigger MTU than 1500 - only on whether there's still sub-1500 links out there. On the other hand, at least in today's Internet, the Other End is still quite likely to be 10BaseT or PPP.
Approximately 80% of the traffic this machine sees is from off-campus, all over the US. We only got about 60% replies on the test ICMP Echo, which constituted a good 40% of the entire traffic. In spite of this, not once did the PMTU get fragmented below 1500.
I shouldn't get started here. I have trouble buying into HP's way of doing things (I was only aware that HPUX did this; but it seems that AIX does too...). If you run a high traffic DNS server on an AIX box without disabling this "feature" then you must just be spewing ICMP echo requests. It could add up to more bytes than your DNS responses...

And, obviously, ICMP pings don't work too well much of the time anyway. And I'm concerned about the possibility of some nasty DoS potential by exploiting this. I haven't looked into this in depth, and it depends on how it handles cache replacement, etc. But I don't know the details of exactly how AIX does it; it may differ from HPUX, which I have looked into in more detail but still don't know completely.
Admittedly, PMTU-D for TCP is a lot less resource intensive (just set the DF bit and see who salutes). However, it should be tripped roughly the same percent of the time (if a packet needs fragmenting, it needs fragmenting - it's probably rare that a TCP packet of a given size would fit but the same size UDP would fragment).
The difference is that if you are sending a small amount of data, then "normal" PMTU-D (i.e. as per the RFC) will not result in any extra bits flying across the wire.
It looks to me like a better Rule Of Thumb is just:
a) If you know that the path to a specific net has an MTU over 1500 all the way, set a route specifying the MTU.
b) If you're a webserver or something else providing service Out There to random users, just nail the MTU at 1500, which will work for any Ethernet/PPP/SLIP out there. And if you're load balancing to geographically disparate servers, then your users are probably Out There, with an MTU almost guaranteed to be 1500.
Except that, technically, you are not permitted to just blindly send segments of such size. Well, you can but systems in the middle don't have to handle them. No?

It is also a concern that, in my experience, many of the links with MTUs <1500 are also the links with greater packet loss, etc. so you really don't want fragmentation on them.

However, I have to admit, hardcoding the server to a 1460 MSS is what I do and recommend. I started doing this a few years ago when more servers started supporting PMTU-D and there were just too many stupid broken networks that didn't deal with it properly due to filtering or what have you. I think enough servers do it now that it is "safe" to leave it enabled, barring things like broken load balancers.
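For anyone wanting to hardcode that 1460 MSS per service rather than system-wide, here is a minimal sketch using the TCP_MAXSEG socket option. Whether a listening socket's MSS clamp is inherited by accepted connections varies by OS, so treat this as illustrative rather than portable; the port number is arbitrary.

    # Sketch: clamp the advertised/used MSS on a server socket so that no
    # segment exceeds what a 1500-byte path MTU can carry, even if ICMP
    # "frag needed" messages never make it back.
    import socket

    MSS = 1460   # 1500-byte Ethernet MTU minus 20 bytes IP + 20 bytes TCP headers

    srv = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    srv.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    srv.setsockopt(socket.IPPROTO_TCP, socket.TCP_MAXSEG, MSS)
    srv.bind(("0.0.0.0", 8080))
    srv.listen(128)

The system-wide equivalent is an MSS or MTU sysctl/ndd setting, which is probably what most people mean by "hardcoding the server to a 1460 MSS".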
I assert that the chances of PMTU-D helping are in direct ratio to the number of end users who have connections with MTU>1500 - it's almost a sure thing that you probably won't have users with an MTU on their last-hop that's bigger than their campus backbone and/or Internet connection's MTU.
Is anybody seeing any documentable wins by using PMTU-D?
The current situation is such that it is rare for the PMTU to be lower than min(client MTU, server MTU). In such situations, PMTU-D obviously will never come into effect. If we see more and more FDDI or gigabit ethernet w/jumbograms etc., this will change. Surprisingly few servers are using such technologies w/MTUs >1500 now in my experience; I think FDDI use has significantly dropped in terms of percent of servers in the past few years. The tunnelling that smb brings up is an important issue, and there are other issues surrounding that too.

There are definitely situations where PMTU-D gives huge wins. They are, however, all specialized situations. I think it is simply that the net is in a state of somewhat amazing homogeneity right now. I don't think this will continue, but who knows. I do think that PMTU-D is an important feature, and people should be encouraged to leave it enabled wherever possible, so that one day, if networks do change to make it more useful in the general case, it will be there...
On Tue, 13 Jun 2000 22:36:08 MDT, Marc Slemko said:
I shouldn't get started here. I have trouble buying into HP's way of doing things (I was only aware that HPUX did this; but it seems that AIX does too...). If you run a high traffic DNS server
AIX started supporting PMTU-D for both TCP and UDP in 4.2.1. The gotcha was it being on by default in 4.3.3.
on an AIX box without disabling this "feature" then you must just be spewing ICMP echo requests. It could add up to more bytes than your DNS responses...
Well, as I said, it was done in error, and yes, the bytes for the ICMP *were* running almost as high as the actual NTP traffic... The surprising part is that it was broken for close to 3 months before somebody noticed (yesterday, just a few hours before this discussion started, in fact). As noted, PMTU-D for TCP is a lot lighter weight, and has an actual chance of winning sometimes.

Does anybody know of a UDP-based application that is able to *do* anything with PMTU-D? A co-worker had heard of research at PSC that dealt with TCP-friendly multicast, but that was all we could think of...
And, obviously, ICMP pings don't work too well much of the time anyway. And I'm concerned about the possibility of some nasty DoS potential by exploiting this. I haven't looked into this in depth, and it depends on how it handles cache replacement, etc.
Except that, technically, you are not permitted to just blindly send segments of such size. Well, you can but systems in the middle don't have to handle them. No?
Hmm... either I did a bad job of explaining or I haven't had enough caffeine to parse what you said. Given that you also suggest going to a 1460 MSS, I suspect that we're actually violently in agreement here. Now if I can remember why I chose 1396 for a default MSS.... ;)
It is also a concern that, in my experience, many of the links with MTUs <1500 are also the links with greater packet loss, etc. so you really don't want fragmentation on them.
The worst part here is that I suspect that most of these links (just on sheer numbers of shipped product) are the aforementioned Win98 576-MTU. However, in this case, the fragmentation happens in a terminal server on the last hop, and hopefully the case of a terminal server running out of queueing buffers and having to drop one of the 2 remaining fragments of a 1500->576 split after sending the first one is pretty rare....

I seem to remember that the *original* motivation for slow-start and all that was Van Jacobson's observation that the most common cause of a TCP retransmit was that an *entire* packet had been silently dropped due to queueing congestion, and could thus be treated identically to an ICMP Source Quench.

Has this changed? Has "fragmentation" become a Great Evil, rather than an annoyance that some links have to deal with?
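For concreteness, the arithmetic of that 1500->576 split (plain RFC 791 fragmentation; nothing here is specific to any particular terminal server):

    # Sketch: how a 1500-byte IP packet is fragmented at a 576-byte next-hop
    # MTU.  Fragment data lengths (except the last) must be multiples of 8.
    IP_HDR = 20

    def fragments(packet_len, mtu, ip_hdr=IP_HDR):
        payload = packet_len - ip_hdr              # 1480 bytes of data to carry
        per_frag = ((mtu - ip_hdr) // 8) * 8       # 556 rounded down to 552
        sizes = []
        while payload > 0:
            chunk = min(per_frag, payload)
            sizes.append(chunk + ip_hdr)           # each fragment re-adds an IP header
            payload -= chunk
        return sizes

    print(fragments(1500, 576))    # -> [572, 572, 396]: one first fragment plus
                                   #    the "2 remaining fragments" above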
I think it is simply that the net is in a state of somewhat amazing homogeneity right now. I don't think this will continue, but who knows. I do think that PMTU-D is an important feature, and people should be encouraged to leave it enabled wherever possible, so that one day, if networks do change to make it more useful in the general case, it will be there...
At least for TCP. I'm still unconvinced for UDP ;)

--
Valdis Kletnieks
Operating Systems Analyst
Virginia Tech
Valdis.Kletnieks@vt.edu: Wednesday, June 14, 2000 8:07 AM
On Tue, 13 Jun 2000 22:36:08 MDT, Marc Slemko said:
It is also a concern that, in my experience, many of the links with MTUs <1500 are also the links with greater packet loss, etc. so you really don't want fragmentation on them.
The worst part here is that I suspect that most of these links (just on sheer numbers of shipped product) are the aforementioned Win98 576-MTU.
I just set my dial PPP ports to MTU=512+40=552; is this wrong? Where does the MTU=576 number come from?
I seem to remember that the *original* motivation for slow-start and all that was Van Jacobson's observation that the most common cause of a TCP retransmit was that an *entire* packet had been silently dropped due to queueing congestion, and could thus be treated identically to an ICMP Source Quench.
Has this changed? Has "fragmentation" become a Great Evil, rather than an annoyance that some links have to deal with?
I'm having some trouble getting full throughput from a GigE pipe. Even in the 100baseTX/FDX down-stream, I'm not getting full link utilization (everything on switches, Cat6509 and 3512XLs). I'm considering increasing MTU sizes to MTU=4096+40, or even larger. Most of the data transmissions fall into the 5KB-50KB range. The site can be considered a large portal. What would be the effect on my upstream? Would it create problems? The only systems that see the Internet are the web-servers (dual NICs).
I'm having some trouble getting full throughput from a GigE pipe. Even in the 100baseTX/FDX down-stream, I'm not getting full link utilization (everything on switches, Cat6509 and 3512XLs). I'm considering increasing MTU sizes to MTU=4096+40, or even larger. Most of the data transmissions fall into the 5KB-50KB range. The site can be considered a large portal. What would be the effect on my upstream? Would it create problems? The only systems that see the Internet are the web-servers (dual NICs).
The one thing to consider is whether all devices on the ends of the GigaE support "Giant Frames." We have had no problems going with 4500+40 MTU server to server (direct) on GigaE, but going to an HP 4000M switch it was a different story. It seems that the HPs don't support giant frames.

John Fraizer
EnterZone, Inc
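As a rough feel for what a bigger MTU buys on transfers in the 5KB-50KB range mentioned above (setting aside whether every device in the path really handles giant frames), a quick sketch of packet counts and header overhead at a few MTUs:

    # Sketch: packets needed and IP+TCP header overhead for a transfer at
    # different MTUs.  Ignores TCP options, ACKs, and slow start; it only shows
    # why jumbo frames matter mostly for large transfers.
    import math

    HDR = 40   # 20-byte IP header + 20-byte TCP header

    def overhead(transfer_bytes, mtu):
        mss = mtu - HDR
        packets = math.ceil(transfer_bytes / mss)
        wire = transfer_bytes + packets * HDR
        return packets, 100.0 * packets * HDR / wire

    for size in (5000, 50000):
        for mtu in (1500, 4096 + 40, 9000):
            pkts, pct = overhead(size, mtu)
            print(f"{size:>6}B @ MTU {mtu:>4}: {pkts:>3} packets, {pct:.1f}% header overhead")

Note that if the web servers send with an MTU above 1500 toward the Internet, essentially every remote path will either need PMTU-D to work end to end or fall back to fragmentation, which circles back to the load-balancer and ICMP-filtering caveats earlier in the thread.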
2000-06-14-00:36:08 Marc Slemko:
b) If you're a webserver or something else providing service Out There to random users, just nail the MTU at 1500, which will work for any Ethernet/PPP/SLIP out there. And if you're load balancing to geographically disparate servers, then your users are probably Out There, with an MTU almost guaranteed to be 1500.
Except that, technically, you are not permitted to just blindly send segments of such size. Well, you can but systems in the middle don't have to handle them. No?
No? I thought traffic only failed to flow when PMTU discovery was attempted (don't-fragment bit set on first packet) but the needed ICMP to make it work was being blocked. If you don't even try to do PMTU-D, then people who have paths where middle links have MTUs smaller than the smaller of the two end-points' MTUs will just have to fragment. And as long as they're rare, that shouldn't be much of a problem, no?

-Bennett
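The two behaviours being contrasted here map onto a per-socket knob on some stacks. A minimal sketch, assuming Linux-style IP_MTU_DISCOVER semantics (the numeric constants are from <linux/in.h>, spelled out because not every Python build exports them; other systems use a sysctl instead, e.g. net.inet.tcp.path_mtu_discovery on FreeBSD):

    # Sketch: choose between "attempt PMTU-D" (DF set, rely on ICMP coming
    # back) and "don't even try" (DF clear, let narrow middle links fragment).
    import socket

    IP_MTU_DISCOVER = 10     # from <linux/in.h>
    IP_PMTUDISC_DONT = 0     # never set DF; routers may fragment
    IP_PMTUDISC_DO = 2       # always set DF; depends on ICMP "frag needed" getting back

    s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)

    # PMTU-D on: if the ICMP is filtered somewhere, big packets just vanish.
    s.setsockopt(socket.IPPROTO_IP, IP_MTU_DISCOVER, IP_PMTUDISC_DO)

    # PMTU-D off: slower and uglier on narrow links, but nothing black-holes.
    s.setsockopt(socket.IPPROTO_IP, IP_MTU_DISCOVER, IP_PMTUDISC_DONT)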
On 14-Jun-2000 Valdis.Kletnieks@vt.edu wrote:
b) If you're a webserver or something else providing service Out There to random users, just nail the MTU at 1500, which will work for any Ethernet/PPP/SLIP out there. And if you're load balancing to geographically disparate servers, then your users are probably Out There, with an MTU almost guaranteed to be 1500.
I assert that the chances of PMTU-D helping are in direct ratio to the number of end users who have connections with MTU>1500 - it's almost a sure thing that you probably won't have users with an MTU on their last-hop that's bigger than their campus backbone and/or Internet connection's MTU.
www.bt.com drops (or at least used to) all ICMP silently, and this can cause problems - one of our ISPs (U-Net) runs a Frame Relay network internally for some customers that had an MTU of 1496 (the default MTU for FR on some equipment, including (earlier?) Cisco IOSes, apparently).

Symptom - web site unreachable. Complained to bt.com, got the usual "everything is fine here" response. :-( Similar symptoms accessing other sites, although it was intermittent. Apparently, the problem is more often seen on NT servers (no surprise there, then) as they set the DF bit on outbound packets.

Managed to persuade U-Net to change their Frame Relay network to have an MTU of 1500, which was quite nice of them as it wasn't really their system that was broken! Improved performance noticeably, however.

--
Ryan O'Connell - http://www.complicity.co.uk/ - <nemesis@eh.org>

You are the Dancing Queen, young and sweet, only seventeen
Dancing Queen, feel the beat from the tambourine
You can dance, you can jive, having the time of your life
See that girl, watch that scene, dig in the Dancing Queen
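For diagnosing this kind of black hole, a crude path-MTU probe along the following lines would have flagged the 1496-byte Frame Relay hop. This is a sketch assuming Linux behaviour and constants (IP_MTU_DISCOVER and IP_MTU from <linux/in.h>); the target host and port are placeholders.

    # Sketch: send DF-flagged UDP probes sized for a 1500-byte MTU and then ask
    # the kernel what path MTU it has learned.  If a hop is narrower and its
    # ICMP "frag needed" gets back, the reported MTU drops (e.g. to 1496); if
    # the ICMP is filtered (the bt.com case), it never drops and the probe data
    # simply vanishes: the black hole described above.
    import socket
    import time

    IP_MTU_DISCOVER, IP_PMTUDISC_DO, IP_MTU = 10, 2, 14   # from <linux/in.h>

    s = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    s.setsockopt(socket.IPPROTO_IP, IP_MTU_DISCOVER, IP_PMTUDISC_DO)
    s.connect(("www.example.com", 33434))     # placeholder target and port

    probe = b"\0" * 1472                      # 1500 minus 28 bytes of IP+UDP headers
    for _ in range(5):
        try:
            s.send(probe)
        except OSError:                       # EMSGSIZE once a smaller PMTU is cached
            break
        time.sleep(1)

    print("path MTU the kernel believes in:",
          s.getsockopt(socket.IPPROTO_IP, IP_MTU))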
participants (6)

- Bennett Todd
- John Fraizer
- Marc Slemko
- Roeland Meyer (E-mail)
- Ryan O'Connell
- Valdis.Kletnieks@vt.edu