
Iljitsch van Beijnum wrote:
A more common approach is to rewrite the MSS option in all TCP SYNs
[snip]

Yeah, we do this now, but the software that we use for PPPoE termination, as well as on a huge portion of our client installs (MikroTik RouterOS), doesn't do it correctly in my estimation. When you flip on the automatic "change-tcp-mss" option, it rewrites the MSS in ALL SYNs passing through it, coming OR going. This breaks communication with remote hosts that actually have a SMALLER MSS than our PPPoE customers, since our client gets a SYN+ACK from the remote host that we have rewritten to advertise a larger MSS than the remote host is capable of dealing with. Because MikroTik rewrote both the SYNs generated by us and the ones received by us, our customer's host is now under the impression that the lowest MSS between the two hosts matches its own. At least that's the best theory I've come up with.

We can write (and have written) custom IP manglers on the MikroTik boxes that only touch SYNs generated by our clients, and only when the MSS is larger than a certain value (in order to honor MSS values even lower than what their PPPoE gateway allows). But it's a PITA to deal with. I'd just rather everyone follow protocol. :-P

Although we can't always expect everyone to do it by the book, I don't think it is too much to ask that those who operate sizable networks that nearly everyone is required to interact with on a daily basis (read: Microsoft) act responsibly.
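For illustration, the custom mangling rule described above (touch only SYNs generated by our clients, and only lower the MSS, never raise it) could be sketched roughly like this. It is plain Python, not RouterOS syntax; the function name, the `from_client` flag, and the 1492-byte PPPoE MTU default are assumptions made for the example.

```python
IP_TCP_OVERHEAD = 40  # 20-byte IPv4 header + 20-byte TCP header, no options


def clamp_mss(advertised_mss: int, from_client: bool, link_mtu: int = 1492) -> int:
    """Return the MSS to leave in a SYN passing through the gateway.

    Hypothetical helper: link_mtu=1492 assumes a typical PPPoE link.
    """
    ceiling = link_mtu - IP_TCP_OVERHEAD
    if not from_client:
        # SYNs and SYN+ACKs from the remote side are left untouched.
        return advertised_mss
    # Only ever lower the value; a smaller advertised MSS is honored as-is.
    return min(advertised_mss, ceiling)
```

With these assumptions, a client SYN advertising 1460 goes out as 1452, a client SYN advertising 536 is left alone, and anything arriving from the remote host passes through unchanged.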
All of this even went so far that the IETF came up with RFC 4821, which will do path MTU discovery by correlating lost packets with packet sizes to determine the path MTU rather than depend on ICMP messages.
What's funny is that I ran my tests from a Windows XP host with the recently-released Service Pack 3 installed, which is supposed to activate Microsoft's "PMTUD Black Hole Router Detection" by default (available pre-SP3 but apparently not turned on without a registry change). I haven't read up on exactly how it's supposed to work, but I think the basic idea is that if the TCP connection is negotiated properly but it doesn't get a response beyond that, it will try lower and lower MSSes until it does. However it works (or doesn't, as the case may be), it didn't make a lick of difference. I waited and waited for content to be delivered to me until eventually Microsoft's end sent me a TCP RST.

While I was poking at this, though, I had a thought...most IP stacks, I believe, keep a path MTU cache of some sort. I know Windows does: if I send an ICMP packet with DF set that is larger than the PPPoE gateway can handle, I get something similar to the following:

    C:\Documents and Settings\nathana>ping 64.126.160.1 -f -l 1472

    Pinging 64.126.160.1 with 1472 bytes of data:

    Reply from 64.126.142.249: Packet needs to be fragmented but DF set.
    Packet needs to be fragmented but DF set.
    Packet needs to be fragmented but DF set.
    Packet needs to be fragmented but DF set.
    [...]

Next time that I try the same thing, Windows doesn't even bother trying to send the packet. It looks at its PMTU table for that IP, and already KNOWS it is too big:

    C:\Documents and Settings\nathana>ping 64.126.160.1 -f -l 1472

    Pinging 64.126.160.1 with 1472 bytes of data:

    Packet needs to be fragmented but DF set.
    Packet needs to be fragmented but DF set.
    [...]

However, even when trying this with www.msnbc.msn.com, and with the MSNBC entry in its PMTU cache (and its IP set statically in my 'hosts' file so that Akamai/MS round-robin DNS doesn't screw with me during the test), when I tried to build a TCP connection to MSNBC from this same host, Windows told the remote host it had a 1460 MSS.
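The sizes in that test are worth spelling out. A `ping -l 1472` carries 1472 bytes of payload plus an 8-byte ICMP header and a 20-byte IPv4 header, so it is a full 1500-byte packet; a 1460 MSS likewise corresponds to a 1500-byte MTU. A small sketch of the arithmetic, assuming IPv4 without options and the usual 1492-byte PPPoE MTU (the PPPoE figure is an assumption, it is not stated above):

```python
IPV4_HEADER = 20  # bytes, no IP options
ICMP_HEADER = 8   # bytes, echo request/reply header
TCP_HEADER = 20   # bytes, no TCP options


def icmp_packet_size(payload: int) -> int:
    """Total IPv4 packet size for a ping with the given payload length."""
    return payload + ICMP_HEADER + IPV4_HEADER


def max_ping_payload(mtu: int) -> int:
    """Largest ping payload that fits in one unfragmented packet."""
    return mtu - IPV4_HEADER - ICMP_HEADER


def mss_for_mtu(mtu: int) -> int:
    """TCP MSS that corresponds to a given MTU."""
    return mtu - IPV4_HEADER - TCP_HEADER


print(icmp_packet_size(1472))  # 1500: too big for a 1492-byte link with DF set
print(max_ping_payload(1492))  # 1464
print(mss_for_mtu(1500))       # 1460: what Windows advertised
print(mss_for_mtu(1492))       # 1452: what would fit the assumed PPPoE link
```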
Now, although that makes sense, in order to avoid issues like the one we are facing with Microsoft, would it not make _more_ sense for the stack to look at the PMTU cache first, and then adjust its own MSS just for connections to that one host? Maybe even send out an MTU - 40 ICMP packet FIRST to the host that we want to build a TCP connection with, to get an ICMP type 3 code 4 response from the router in between with the smaller MTU? That would put the burden of PMTUD on the host requesting the TCP session rather than on the one responding, but if hosts were "smarter" like this, it seems to me it might smooth out some of these issues. The remote end could be "broken" with respect to PMTUD but it wouldn't matter.

Thoughts?

--
Nathan Anderson
First Step Internet, LLC
nathana@fsr.com
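As a rough sketch of the host-side idea above: before sending a SYN, the stack would consult its PMTU cache for the destination and advertise an MSS derived from the smaller of the interface MTU and the cached path MTU. This is only an illustration of the proposal, not how Windows or any other stack actually behaves; the function name, the dictionary standing in for the cache, and the example address are all hypothetical.

```python
from typing import Dict

IP_TCP_OVERHEAD = 40  # 20-byte IPv4 header + 20-byte TCP header, no options


def mss_to_advertise(dest_ip: str, interface_mtu: int,
                     pmtu_cache: Dict[str, int]) -> int:
    """Pick the MSS to put in an outgoing SYN for dest_ip.

    pmtu_cache is a stand-in for the stack's per-destination PMTU table.
    """
    # Fall back to the interface MTU when nothing is cached for this host.
    path_mtu = min(interface_mtu, pmtu_cache.get(dest_ip, interface_mtu))
    return path_mtu - IP_TCP_OVERHEAD


# Hypothetical cache entry learned from an earlier ICMP type 3 code 4 message.
cache = {"192.0.2.10": 1492}
print(mss_to_advertise("192.0.2.10", 1500, cache))  # 1452
print(mss_to_advertise("192.0.2.99", 1500, cache))  # 1460
```

Because the smaller MSS is advertised by the connecting host, the remote end never sends segments that need the ICMP feedback it may be filtering.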