On Tue, 13 Jun 2000 Valdis.Kletnieks@vt.edu wrote:
> On Tue, 13 Jun 2000 17:04:19 MDT, Marc Slemko <marcs@znep.com> said:
>> Chances are that if you are using a load balancer for TCP connections, then it does not properly handle Path MTU Discovery. Examples of devices [...]
> Does anybody have any field experience on how much PMTU-D actually helps? I just checked 'netstat -s' on an AIX box that runs a stratum-2 NTP server, which accidentally had it enabled for several weeks. Abridged output follows:
> ip:
>         16357209 total packets received
>         18411 fragments received
>         5314999 path MTU discovery packets sent
>         0 path MTU discovery decreases detected
Mmm. I don't trust AIX, especially with a "0"; a 1 or 2 would make me trust it more. I'll throw in some numbers from a FreeBSD machine (a day or so's worth):

        73658076 packets sent
        59036492 data packets (2258619726 bytes)
        1916471 data packets (1875195237 bytes) retransmitted
        290 resends initiated by MTU discovery
        9082213 ack-only packets (3047476 delayed)
        0 URG only packets
        81937 window probe packets
        842836 window update packets
        2698127 control packets
        2881141 connections established (including accepts)

This machine mostly serves HTTP, with a bit of random junk thrown in; it has a 1500 byte MTU, and 99% of its connections are from remote clients. So, as a rough guess, this could be around 5000 connections that get a win from PMTU-D over hardcoding a 1460 MSS (assuming that each of the 290 resends represents a host that makes x connections over the time the discovered result is cached, and that it only takes one try to get it right). Whatever the exact numbers, they aren't a very high percentage.
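To make the back-of-envelope arithmetic explicit (the 17 connections per host is purely an assumed illustrative value for x, not something I measured):

        290 resends * ~17 connections per host while cached =~ 5000 connections helped
        5000 / 2881141 connections established             =~ 0.17%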
> icmp:
>         Input histogram:
>                 echo reply: 3635421
>                 destination unreachable: 271455
> AIX sends a test ICMP Echo to detect the PMTU for UDP (which is where the high icmp numbers came from). The main interface on the box is 10BaseT, so the MTU gets nailed to 1500. As a result, I do *not* have figures on how often we would have used a bigger MTU than 1500 - only on whether there are still sub-1500 links out there. On the other hand, at least in today's Internet, the Other End is still quite likely to be 10BaseT or PPP.
> Approximately 80% of the traffic this machine sees is from off-campus, all over the US. We only got about 60% replies to the test ICMP Echoes, which constituted a good 40% of the entire traffic. In spite of this, not once was a PMTU below 1500 detected.
I shouldn't get started here. I have trouble buying into HP's way of doing things (I was only aware that HPUX did this, but it seems that AIX does too...). If you run a high-traffic DNS server on an AIX box without disabling this "feature", you must just be spewing ICMP echo requests; they could add up to more bytes than your DNS responses... And, obviously, ICMP echoes don't get through much of the time anyway. I'm also concerned about the possibility of some nasty DoS potential in exploiting this; whether that is real depends on how the cache replacement is handled, etc. I haven't looked into it in depth, and I don't know the details of exactly how AIX does it. It may differ from HPUX, which I still don't know all the details of but have looked into more closely.
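(For anyone stuck running one of these boxes: AIX exposes this through the `no` network-options command, so something along these lines should shut the probing off. I'm going from the docs rather than a box in front of me, so treat the option names as a sketch and check your release:

        no -o udp_pmtu_discover=0
        no -o tcp_pmtu_discover=0

The first kills the ICMP-echo probing for UDP; the second disables DF-based discovery for TCP, if you want that off too.)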
> Admittedly, PMTU-D for TCP is a lot less resource intensive (just set the DF bit and see who salutes). However, it should be tripped roughly the same percent of the time (if a packet needs fragmenting, it needs fragmenting - it's probably rare that a TCP packet of a given size would fit but the same size UDP would fragment).
The difference is that if you are sending a small amount of data, then "normal" PMTU-D (i.e., as per the RFC) will not result in any extra bits flying across the wire.
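To illustrate the "set the DF bit and see who salutes" mechanism: most stacks do this for TCP automatically, but on Linux, for example, it can be requested per socket. A minimal sketch (IP_MTU_DISCOVER is a Linux-specific knob; other systems spell it differently):

        #include <sys/socket.h>
        #include <netinet/in.h>

        /* Ask the stack to set DF on this socket's packets and do RFC 1191
         * PMTU discovery.  If an ICMP "fragmentation needed" comes back,
         * the kernel shrinks its cached path MTU and retransmits smaller
         * segments - no probe traffic beyond the data itself. */
        int enable_pmtud(int fd)
        {
            int val = IP_PMTUDISC_DO;
            return setsockopt(fd, IPPROTO_IP, IP_MTU_DISCOVER,
                              &val, sizeof(val));
        }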
> It looks to me like a better Rule Of Thumb is just:
> a) If you know that the path to a specific net has an MTU over 1500 all the way, set a route specifying the MTU.
> b) If you're a webserver or something else providing service Out There to random users, just nail the MTU at 1500, which will work for any Ethernet/PPP/SLIP out there. And if you're load balancing to geographically disparate servers, then your users are probably Out There, with an MTU almost guaranteed to be 1500.
Except that, technically, you are not permitted to just blindly send segments of that size. Well, you can, but systems in the middle don't have to handle them, no? It is also a concern that, in my experience, many of the links with MTUs <1500 are also the links with greater packet loss, etc., so you really don't want fragmentation on them. However, I have to admit, hardcoding the server to a 1460 MSS is what I do and recommend. I started doing this a few years ago, when more servers started supporting PMTU-D and there were just too many stupidly broken networks that didn't deal with it properly due to filtering or what have you. I think enough servers do it now that it is "safe" to leave it enabled, barring things like broken load balancers.
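For concreteness, here's roughly what the knobs look like (sketched from memory; names and defaults vary by OS and release). On FreeBSD, pinning the server-wide default MSS and setting a per-net MTU route for rule (a) above look something like:

        # default MSS used for non-local destinations
        sysctl -w net.inet.tcp.mssdflt=1460
        # rule (a): per-destination MTU for a path known clean end to end
        # (the 10.1.0.0/16 net and 192.0.2.1 gateway are made-up examples)
        route add -net 10.1.0.0 -netmask 255.255.0.0 192.0.2.1 -mtu 9000

and an individual application can clamp a socket it creates:

        #include <sys/socket.h>
        #include <netinet/in.h>
        #include <netinet/tcp.h>

        /* 1460 = 1500-byte Ethernet MTU minus 40 bytes of IP+TCP headers */
        int mss = 1460;
        setsockopt(fd, IPPROTO_TCP, TCP_MAXSEG, &mss, sizeof(mss));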
> I assert that the chances of PMTU-D helping are in direct ratio to the number of end users who have connections with MTU >1500 - it's almost a sure thing that you won't have users whose last-hop MTU is bigger than their campus backbone's and/or Internet connection's MTU.
> Is anybody seeing any documentable wins by using PMTU-D?
The current situation is such that it is rare for the PMTU to be lower than min(client MTU, server MTU), and in such situations PMTU-D obviously never comes into effect. If we see more and more FDDI or gigabit ethernet w/jumbograms, etc., this will change. Surprisingly few servers are using such technologies with MTUs >1500 now, in my experience; I think FDDI use has dropped significantly as a percentage of servers in the past few years. The tunnelling that smb brings up is an important issue, and there are other issues surrounding that too.

There are definitely situations where PMTU-D gives huge wins; they are, however, all specialized situations. I think it is simply that the net is in a state of somewhat amazing homogeneity right now. I don't think that will continue, but who knows. I do think that PMTU-D is an important feature, and people should be encouraged to leave it enabled wherever possible, so that one day, if networks do change in ways that make it more useful in the general case, it will be there...