On Tue, 13 Jun 2000 Valdis.Kletnieks@vt.edu wrote:
> On Tue, 13 Jun 2000 17:04:19 MDT, Marc Slemko <marcs@znep.com> said:
>> Chances are that if you are using a load balancer for TCP connections, then it does not properly handle Path MTU Discovery. Examples of devices [...]
> Does anybody have any field experience on how much PMTU-D actually helps? I just checked 'netstat -s' on an AIX box that runs a stratum-2 NTP server, which accidentally had it enabled for several weeks. Abridged output follows:
> ip:
>         16357209 total packets received
>         18411 fragments received
>         5314999 path MTU discovery packets sent
>         0 path MTU discovery decreases detected
Mmm. I don't trust AIX, especially with a "0"; a 1 or 2 would make me trust it more. I'll throw in some numbers from a FreeBSD machine (a day or so's worth):

        73658076 packets sent
        59036492 data packets (2258619726 bytes)
        1916471 data packets (1875195237 bytes) retransmitted
        290 resends initiated by MTU discovery
        9082213 ack-only packets (3047476 delayed)
        0 URG only packets
        81937 window probe packets
        842836 window update packets
        2698127 control packets
        2881141 connections established (including accepts)

This machine mostly serves HTTP, with a bit of random junk thrown in; it has a 1500 byte MTU, and 99% of its connections are from remote clients. So, as a rough guess, this could be around 5000 connections that get a win from PMTU-D over hardcoding a 1460 MSS (assuming that each of the 290 resends represents a host that makes x connections over the time the discovered result is cached, and that it only takes one try to get it right). Whatever the exact numbers, they aren't a very high percentage.
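To make the back-of-envelope arithmetic explicit (the 17 connections per host is purely an assumed illustrative value for x, not something I measured):

        290 resends * ~17 connections per host while cached =~ 5000 connections helped
        5000 / 2881141 connections established             =~ 0.17%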
> icmp:
>         Input histogram:
>                 echo reply: 3635421
>                 destination unreachable: 271455
> AIX sends a test ICMP Echo to detect the PMTU for UDP (which is where the high icmp numbers came from). The main interface on the box is 10BaseT, so the MTU gets nailed to 1500. As a result, I do *not* have figures on how often we would have used a bigger MTU than 1500 - only on whether there are still sub-1500 links out there. On the other hand, at least in today's Internet, the Other End is still quite likely to be 10BaseT or PPP.
> Approximately 80% of the traffic this machine sees is from off-campus, all over the US. We only got about 60% replies to the test ICMP Echoes, which constituted a good 40% of the entire traffic. In spite of this, not once was a PMTU below 1500 detected.
I shouldn't get started here. I have trouble buying into HP's way of doing things (I was only aware that HPUX did this, but it seems that AIX does too...). If you run a high-traffic DNS server on an AIX box without disabling this "feature", you must just be spewing ICMP echo requests; they could add up to more bytes than your DNS responses... And, obviously, ICMP echoes don't get through much of the time anyway. I'm also concerned about the possibility of some nasty DoS potential in exploiting this; whether that is real depends on how the cache replacement is handled, etc. I haven't looked into it in depth, and I don't know the details of exactly how AIX does it. It may differ from HPUX, which I still don't know all the details of but have looked into more closely.
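(For anyone stuck running one of these boxes: AIX exposes this through the `no` network-options command, so something along these lines should shut the probing off. I'm going from the docs rather than a box in front of me, so treat the option names as a sketch and check your release:

        no -o udp_pmtu_discover=0
        no -o tcp_pmtu_discover=0

The first kills the ICMP-echo probing for UDP; the second disables DF-based discovery for TCP, if you want that off too.)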
> Admittedly, PMTU-D for TCP is a lot less resource intensive (just set the DF bit and see who salutes). However, it should be tripped roughly the same percent of the time (if a packet needs fragmenting, it needs fragmenting - it's probably rare that a TCP packet of a given size would fit but the same size UDP would fragment).
The difference is that if you are sending a small amount of data, then "normal" PMTU-D (i.e., as per the RFC) will not result in any extra bits flying across the wire.
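To illustrate the "set the DF bit and see who salutes" mechanism: most stacks do this for TCP automatically, but on Linux, for example, it can be requested per socket. A minimal sketch (IP_MTU_DISCOVER is a Linux-specific knob; other systems spell it differently):

        #include <sys/socket.h>
        #include <netinet/in.h>

        /* Ask the stack to set DF on this socket's packets and do RFC 1191
         * PMTU discovery.  If an ICMP "fragmentation needed" comes back,
         * the kernel shrinks its cached path MTU and retransmits smaller
         * segments - no probe traffic beyond the data itself. */
        int enable_pmtud(int fd)
        {
            int val = IP_PMTUDISC_DO;
            return setsockopt(fd, IPPROTO_IP, IP_MTU_DISCOVER,
                              &val, sizeof(val));
        }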
> It looks to me like a better Rule Of Thumb is just:
> a) If you know that the path to a specific net has an MTU over 1500 all the way, set a route specifying the MTU.
> b) If you're a webserver or something else providing service Out There to random users, just nail the MTU at 1500, which will work for any Ethernet/PPP/SLIP out there. And if you're load balancing to geographically disparate servers, then your users are probably Out There, with an MTU almost guaranteed to be 1500.
Except that, technically, you are not permitted to just blindly send segments of that size. Well, you can, but systems in the middle don't have to handle them, no? It is also a concern that, in my experience, many of the links with MTUs <1500 are also the links with greater packet loss, etc., so you really don't want fragmentation on them. However, I have to admit, hardcoding the server to a 1460 MSS is what I do and recommend. I started doing this a few years ago, when more servers started supporting PMTU-D and there were just too many stupidly broken networks that didn't deal with it properly due to filtering or what have you. I think enough servers do it now that it is "safe" to leave it enabled, barring things like broken load balancers.
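For concreteness, here's roughly what the knobs look like (sketched from memory; names and defaults vary by OS and release). On FreeBSD, pinning the server-wide default MSS and setting a per-net MTU route for rule (a) above look something like:

        # default MSS used for non-local destinations
        sysctl -w net.inet.tcp.mssdflt=1460
        # rule (a): per-destination MTU for a path known clean end to end
        # (the 10.1.0.0/16 net and 192.0.2.1 gateway are made-up examples)
        route add -net 10.1.0.0 -netmask 255.255.0.0 192.0.2.1 -mtu 9000

and an individual application can clamp a socket it creates:

        #include <sys/socket.h>
        #include <netinet/in.h>
        #include <netinet/tcp.h>

        /* 1460 = 1500-byte Ethernet MTU minus 40 bytes of IP+TCP headers */
        int mss = 1460;
        setsockopt(fd, IPPROTO_TCP, TCP_MAXSEG, &mss, sizeof(mss));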
> I assert that the chances of PMTU-D helping are in direct ratio to the number of end users who have connections with MTU >1500 - it's almost a sure thing that you won't have users whose last-hop MTU is bigger than their campus backbone's and/or Internet connection's MTU.
> Is anybody seeing any documentable wins by using PMTU-D?
The current situation is such that it is rare for the PMTU to be lower than min(client MTU, server MTU), and in such situations PMTU-D obviously never comes into effect. If we see more and more FDDI or gigabit ethernet w/jumbograms, etc., this will change. Surprisingly few servers are using such technologies with MTUs >1500 now, in my experience; I think FDDI use has dropped significantly as a percentage of servers in the past few years. The tunnelling that smb brings up is an important issue, and there are other issues surrounding that too.

There are definitely situations where PMTU-D gives huge wins; they are, however, all specialized situations. I think it is simply that the net is in a state of somewhat amazing homogeneity right now. I don't think that will continue, but who knows. I do think that PMTU-D is an important feature, and people should be encouraged to leave it enabled wherever possible, so that one day, if networks do change in ways that make it more useful in the general case, it will be there...