[ On Thursday, June 15, 2000 at 10:15:22 (-0400), Greg A. Woods wrote: ]
Subject: Re: PMTU-D: remember, your load balancer is broken
Since discovering that servers with an MSS default of 512 bytes cannot possibly ever deliver good TCP throughput to local high-speed customers (eg. on a cable or DSL plant), I've also been hard-coding a TCP MSS default of 1460 on most systems I control (though on cable modem squid servers, etc., it could probably safely be raised to 1500, but of course on my GRE tunnel this is the maximum I can use without fragmentation).
No, silly me -- it has to be lowered to 1410 on my GRE tunnel when the tunnel MTU is 1450... I *still* keep getting the MSS and MTU confused. I do like the way some folks have been saying 1460+40 to express the MTU as that does eliminate some of the confusion by stating the obvious....
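For anyone else who keeps mixing the two up, here is the "1460+40" arithmetic as a minimal sketch. It assumes plain IPv4 and TCP headers of 20 bytes each with no options; the function names are made up for illustration.

```python
# MSS <-> MTU arithmetic, assuming 20-byte IPv4 and 20-byte TCP headers
# (no IP or TCP options). Function names are illustrative only.
IPV4_HEADER = 20
TCP_HEADER = 20


def mss_for_mtu(mtu):
    """Largest TCP payload that fits in one unfragmented packet."""
    return mtu - IPV4_HEADER - TCP_HEADER


def mtu_for_mss(mss):
    """Packet size on the wire for a full-sized segment (the 'MSS+40' form)."""
    return mss + IPV4_HEADER + TCP_HEADER


if __name__ == "__main__":
    for label, mtu in [("Ethernet", 1500), ("GRE tunnel, MTU 1450", 1450)]:
        print(f"{label}: MTU {mtu} -> MSS {mss_for_mtu(mtu)}")
```

So an MSS of 1460 fills a 1500-byte MTU exactly, and a tunnel MTU of 1450 leaves room for an MSS of 1410.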
In fact I think I'm having this very problem with segue.merit.edu [198.108.1.41] trying to deliver some NANOG messages to my server ever since yesterday or the day before! (Another server at theplanet.co.uk is definitely giving me these headaches -- I still have to capture a failed connection from segue.merit.edu to prove the latter though....)
A whole bunch of tcpdump'ing on my upstream router later, I was finally able to duplicate the problem using a remote host where I knew the path was open to all ICMP, and where I could run tcpdump, turn on Path-MTU-discovery, and do some FTPs through my tunnel.

It turns out the "needs frag" packets were arriving just fine at the remote host, and these packets were correctly specifying the maximum size (which at the time was 1448 bytes). In fact I could send a ping packet of exactly that size, and no larger, from the same test server through my tunnel without it being fragmented or rejected.

However it seems that there's a bug somewhere deep in, or below, the GRE tunnel code on NetBSD (1.4ZD) that causes it to silently drop maximum-sized packets if they have the DF bit set. It may be that the MTU of the GRE interface is by default one or two bytes too large, and based on that hypothesis we manually forced the tunnel to have a lower MTU of 1400 and, voila!, it works like a charm now! All my NANOG mail came flooding in in short order! ;-)

So Path-MTU-discovery is still the problem -- but at least in my current scenario it can sometimes be made to work, if really necessary.

I still have to wonder though why people seem to think they need to use PMTU in the first place. Certainly it may be of some advantage if you want the majority of your traffic to be carried in "giant frames" but you still need to communicate with some hosts that have interfaces with more traditional sized MTUs *and* you don't want your gateway router to have to fragment all the remote traffic (and then of course the remote hosts have to reassemble the fragments).
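To put a rough number on what that router-side fragmentation costs, here is a small sketch of how a full-sized packet without DF set would be split to fit a smaller tunnel MTU. The 1500-byte packet is an assumed example, the IPv4 header is assumed to be 20 bytes with no options, and the function name is made up; the only rule relied on is that every fragment except the last carries a multiple of 8 data bytes.

```python
# Sketch of IPv4 fragmentation sizes, assuming a 20-byte header (no options)
# and a packet that carries at least one byte of data.
IPV4_HEADER = 20


def fragment_sizes(packet_len, mtu, header=IPV4_HEADER):
    """Return the on-the-wire size of each fragment, header included."""
    payload = packet_len - header
    max_data = mtu - header          # most data any one fragment can hold
    per_frag = max_data // 8 * 8     # non-final fragments: multiple of 8 bytes
    sizes = []
    while payload > max_data:
        sizes.append(per_frag + header)
        payload -= per_frag
    sizes.append(payload + header)   # final (or only) fragment
    return sizes


if __name__ == "__main__":
    # Assumed example: a 1500-byte packet crossing a tunnel forced to MTU 1400.
    print(fragment_sizes(1500, 1400))
```

Under those assumptions every full-sized packet becomes two: one of 1396 bytes and a small trailer of 124 bytes, which the far end then has to reassemble.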
I'm guessing though that this exact scenario is extremely rare, and that the improved throughput for bulk transfers that most people see when using PMTU can be achieved with far fewer headaches (and indeed on far more servers, where PMTU is not available in the first place) by simply increasing the default MSS to 1460 (or 1360 to be friendly to users of PPPoE and GRE and similar :-). (It's almost always possible to increase the default MSS for a server, even if it's not easy.)

So, how about it everyone? Can we please all disable PMTU everywhere and try just increasing our default MSS where necessary? I.e. whether or not you're using a load balancer? Pretty please?

The extra fragmentation is only going to be a problem for those people who live behind tunnels of one sort or another. I certainly don't mind paying for a bit of extra fragmentation in order to use my low-cost high-bandwidth tunnel!

-- 
Greg A. Woods

+1 416 218-0098      VE3TCP      <gwoods@acm.org>      <robohack!woods>
Planix, Inc. <woods@planix.com>; Secrets of the Weird <woods@weird.com>