Re: [NANOG] Microsoft.com PMTUD black hole?
Has anyone else here seen problems with microsoft/msn/hotmail/live.com sites not performing PMTUD correctly?
I used to see it a lot when hosting on windows was popular and people realised they needed a firewall or decided to add a load balancer but broke PMTUD by leaving it enabled on the servers. I've not heard of it for some time so those people got a clue or moved to something else (or everyone worked around them)

brandon
On 6 mei 2008, at 21:58, Brandon Butterworth wrote:
Has anyone else here seen problems with microsoft/msn/hotmail/live.com sites not performing PMTUD correctly?
I used to see it a lot when hosting on windows was popular and people realised they needed a firewall or decided to add a load balancer but broke PMTUD by leaving it enabled on the servers.
I've not heard of it for some time so those people got a clue or moved to something else (or everyone worked around them)
Many years ago I had occasion to terminate dial-up service over L2TP from modem pools operated by a service provider who shall remain nameless to protect the guilty. This service had the unfortunate tendency to drop all packets larger than 576 bytes. So we needed to negotiate a 576-byte MTU over PPP. We then got many complaints from users who dialed in using ISDN routers (yes this was a while ago) because of broken path MTU discovery. The behavior that Microsoft exhibits was EXTREMELY common in those days, and I have no reason to assume it's any less common today. (I also see it regularly with IPv6.)

What I did was clear the DF bit on packets going out to the L2TP virtual interfaces so the packets could be fragmented. A more common approach is to rewrite the MSS option in all TCP SYNs with a smaller value so there won't be TCP segments large enough to trigger the problem. AFAIK, all boxes that do PPPoE do this.

All of this even went so far that the IETF came up with RFC 4821, which does path MTU discovery by correlating lost packets with packet sizes to determine the path MTU rather than depending on ICMP messages.
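For readers who have not looked at RFC 4821, the idea of inferring the path MTU from which packet sizes get through can be sketched roughly as below. This is only an illustration under assumptions: `search_path_mtu` and `simulated_probe` are hypothetical names, the binary search is one possible strategy rather than the one the RFC mandates, and a real implementation would also have to tell size-related loss apart from ordinary congestion loss.

```python
from typing import Callable


def search_path_mtu(probe_ok: Callable[[int], bool],
                    low: int = 576, high: int = 1500) -> int:
    """Binary-search the largest packet size for which probe_ok(size) is True.

    Assumes `low` is known to work and that delivery is monotonic in size
    (every packet at or below the path MTU gets through).
    """
    while low < high:
        mid = (low + high + 1) // 2   # round up so the loop always makes progress
        if probe_ok(mid):
            low = mid                 # probe was acknowledged: path MTU >= mid
        else:
            high = mid - 1            # probe was lost: treat mid as too large
    return low


def simulated_probe(size: int) -> bool:
    """Stand-in for a real probe: a path that silently drops packets
    larger than 1492 bytes and never sends an ICMP error."""
    return size <= 1492


print(search_path_mtu(simulated_probe))  # 1492
```

No ICMP message is consulted anywhere in the search, which is the whole point: a filtered "fragmentation needed" message no longer stalls the connection.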
Iljitsch van Beijnum wrote:
A more common approach is to rewrite the MSS option in all TCP SYNs
[snip]

Yeah, we do this now, but the software that we have been using for PPPoE termination as well as for a huge portion of our clients (MikroTik RouterOS) doesn't do it correctly in my estimation when you flip on the automatic "change-tcp-mss" option...it rewrites the MSS in ALL SYNs passing through it, either coming OR going. This has the effect of breaking communication with other hosts that actually have a SMALLER MSS than our PPPoE customers since our client will get a SYN+ACK from the remote host that we have rewritten to reflect a larger MSS than the remote host is capable of dealing with. Because MikroTik rewrote both the SYNs generated by us as well as received by us, our customer's host is now under the impression that the lowest MSS between the two hosts matches its own. At least that's the best theory I've come up with.

We can write (and have written) custom IP manglers on the MikroTik boxes that only touch SYNs generated by our clients, and only when the MSS is larger than a certain value (in order to honor MSSes even lower than that allowed by their PPPoE gateway). But it's a PITA to deal with. I'd just rather everyone follow protocol. :-P

Although we can't always expect everyone to do it by the book, I don't think it is too much to ask that those who operate sizable networks that nearly everyone is required to interact with on a daily basis (read: Microsoft) act responsibly.
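The "only when the MSS is larger than a certain value" rule can be sketched in a few lines. This is not the MikroTik mangle rule itself, just a rough Python illustration of the logic on a raw TCP options field; `clamp_mss_option` is a hypothetical name, the 1452 ceiling assumes a 1492-byte PPPoE MTU, and a real rewriter would also have to match only SYNs in the intended direction and fix up the TCP checksum afterwards.

```python
import struct

MSS_KIND = 2  # TCP option kind for Maximum Segment Size (option length 4)


def clamp_mss_option(options: bytes, ceiling: int) -> bytes:
    """Return the TCP options with the MSS lowered to `ceiling` if it is larger.

    A smaller advertised MSS is left untouched, so a remote host that
    announces e.g. 536 is never rewritten upwards.
    """
    out = bytearray(options)
    i = 0
    while i < len(out):
        kind = out[i]
        if kind == 0:            # End of Option List
            break
        if kind == 1:            # No-Operation, a single byte
            i += 1
            continue
        if i + 1 >= len(out):
            break                # truncated option, stop parsing
        length = out[i + 1]
        if length < 2 or i + length > len(out):
            break                # malformed length, stop parsing
        if kind == MSS_KIND and length == 4:
            (mss,) = struct.unpack_from("!H", out, i + 2)
            if mss > ceiling:
                struct.pack_into("!H", out, i + 2, ceiling)
        i += length
    return bytes(out)


# MSS 1460, NOP, NOP, SACK-permitted
syn_1460 = bytes([2, 4]) + struct.pack("!H", 1460) + bytes([1, 1, 4, 2])
# MSS 536 only
syn_536 = bytes([2, 4]) + struct.pack("!H", 536)

print(struct.unpack_from("!H", clamp_mss_option(syn_1460, 1452), 2)[0])  # 1452
print(struct.unpack_from("!H", clamp_mss_option(syn_536, 1452), 2)[0])   # 536
```

An unconditional rewrite would turn the 536 into 1452 as well, which is the failure mode described above.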
All of this even went so far that the IETF came up with RFC 4821, which will do path MTU discovery by correlating lost packets with packet sizes to determine the path MTU rather than depend on ICMP messages.
What's funny is that I ran my tests from a Windows XP host with the recently-released Service Pack 3 installed, which is supposed to activate Microsoft's "PMTUD Black Hole Router Detection" by default (available pre-SP3 but apparently not turned on without a registry change). I haven't read up on exactly how it's supposed to work, but I think the basic idea is that if the TCP connection is negotiated properly but it doesn't get a response beyond that, it will try lower and lower MSSes until it does. However it works (or doesn't as the case may be), it didn't make a lick of difference. I waited and waited for content to be delivered to me until eventually Microsoft's end sent me a TCP RST.

While I was poking at this, though, I had a thought...most IP stacks I believe keep a path MTU cache of some sort. I know Windows does: if I send an ICMP packet with DF set that is larger than the PPPoE gateway can handle, I get something similar to the following:

    C:\Documents and Settings\nathana>ping 64.126.160.1 -f -l 1472

    Pinging 64.126.160.1 with 1472 bytes of data:

    Reply from 64.126.142.249: Packet needs to be fragmented but DF set.
    Packet needs to be fragmented but DF set.
    Packet needs to be fragmented but DF set.
    Packet needs to be fragmented but DF set.
    [...]

Next time that I try the same thing, Windows doesn't even bother trying to send the packet. It looks at its PMTU table for that IP, and already KNOWS it is too big:

    C:\Documents and Settings\nathana>ping 64.126.160.1 -f -l 1472

    Pinging 64.126.160.1 with 1472 bytes of data:

    Packet needs to be fragmented but DF set.
    Packet needs to be fragmented but DF set.
    [...]

However, even when trying this with www.msnbc.msn.com, and with the MSNBC entry in its PMTU cache (and its IP set statically in my 'hosts' file so that Akamai/MS round-robin DNS doesn't screw with me during the test), when I tried to build a TCP connection to MSNBC from this same host, Windows told the remote host it had a 1460 MSS.

Now, although that makes sense, in order to avoid issues like the one we are facing with Microsoft, would it not make _more_ sense for the stack to look at the PMTU cache first, and then adjust its own MSS just for connections to that one host? Maybe even send out an MTU - 40 ICMP packet to the host that we want to build a TCP connection with FIRST to get an ICMP type 3 code 4 response from the router in-between with the smaller MTU? That would put the burden of PMTUD on the host requesting the TCP session rather than on the one responding, but if hosts were "smarter" like this it seems to me it might smooth out some of these issues. The remote end could be "broken" with respect to PMTUD but it wouldn't matter.

Thoughts?

-- 
Nathan Anderson
First Step Internet, LLC
nathana@fsr.com
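For anyone following the numbers in the message above, the relationship between the ping payload size, the MTU, and the MSS is plain header arithmetic. The sketch below assumes IPv4 and TCP headers without options; the 1492-byte figure is the usual PPPoE MTU and is an assumption here, since the thread does not state the actual value on this link.

```python
IPV4_HEADER = 20   # bytes, IPv4 header without options
ICMP_HEADER = 8    # bytes, ICMP echo request header
TCP_HEADER = 20    # bytes, TCP header without options


def ping_packet_size(payload: int) -> int:
    """Total IP packet size produced by `ping -l <payload>`."""
    return payload + ICMP_HEADER + IPV4_HEADER


def max_ping_payload(mtu: int) -> int:
    """Largest ping payload that fits in one unfragmented packet."""
    return mtu - IPV4_HEADER - ICMP_HEADER


def mss_for_mtu(mtu: int) -> int:
    """TCP MSS corresponding to a given MTU (the 'MTU - 40' rule)."""
    return mtu - IPV4_HEADER - TCP_HEADER


print(ping_packet_size(1472))   # 1500: fills a standard Ethernet MTU exactly
print(mss_for_mtu(1500))        # 1460: the MSS Windows advertised
print(max_ping_payload(1492))   # 1464: largest DF ping over an assumed PPPoE link
print(mss_for_mtu(1492))        # 1452: MSS that would fit the same link
```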
On 6 mei 2008, at 23:29, Nathan Anderson/FSR wrote:
Now, although that makes sense, in order to avoid issues like the one we are facing with Microsoft, would it not make _more_ sense for the stack to look at the PMTU cache first, and then adjust its own MSS just for connections to that one host? Maybe even send out an MTU - 40 ICMP packet to the host that we want to build a TCP connection with FIRST to get an ICMP type 3 code 4 response from the router in-between with the smaller MTU?
No. This would add significant delay because you'd have to give the other side enough time to respond to the large packet (also sending a large packet on something like GPRS/EDGE is a waste of bandwidth and battery power) while if there is ICMP filtering, there won't be a response, which is exactly the reason why we're in this bind in the first place (along with the stupid idea that DF should be set for ALL packets rather than just once in a while).

And adjusting the MSS based on ephemeral information is the wrong thing to do in the first place. The path MTU can vary. Once you've advertised a small MSS you can never increase it.

It is incredibly unprofessional that people enable PMTUD, then break it and require the rest of the world to implement workarounds. Either use PMTUD properly by accepting the ICMP messages or turn PMTUD off.
Iljitsch van Beijnum wrote:
No. This would add significant delay because you'd have to give the other side enough time to respond to the large packet (also sending a large packet on something like GPRS/EDGE is a waste of bandwidth and battery power) while if there is ICMP filtering, there won't be a response, which is exactly the reason why we're in this bind in the first place
I admit the idea needs tweaking (at best), and it was just a stray thought :-), but 1) even if there is ICMP filtering happening way at the other end, I (the TCP initiator) will still get a response from the router in the middle (RITM) that is reducing the total path MTU if I try to send a packet through it larger than the actual path MTU, and 2) if I don't get a response to my single large packet (either from a RITM or the other end) in a timely fashion (less than a second?), then the client/initiator may just assume that path MTU == local MTU and will set its MSS accordingly (which is no different than what is happening now), until it has a reason to think differently.

Also, if there is already something in the local PMTU cache for a single host address, I'm not sure I follow why it would be a bad idea for the TCP initiator to consult that cache when preparing the SYN. Although, on second thought, I suppose it is possible (and, in more than a few cases, likely) that in instances of route path asymmetry, the PMTU of the path from the initiator to the server may be different than the PMTU of the path back from the server to the client. Hmmm. Okay, scratch that idea then. :-P

-- 
Nathan Anderson
First Step Internet, LLC
nathana@fsr.com
Iljitsch van Beijnum <iljitsch@muada.com> writes:
Many years ago I had occasion to terminate dial-up service over L2TP from modem pools operated by a service provider who shall remain nameless to protect the guilty. This service had the unfortunate tendency to drop all packets larger than 576 bytes. So we needed to negotiate a 576-byte MTU over PPP.
We then got many complaints from users who dialed in using ISDN routers (yes this was a while ago) because of broken path MTU discovery. The behavior that Microsoft exhibits was EXTREMELY common in those days, and I have no reason to assume it's any less common today. (I also see it regularly with IPv6.) What I did was clear the DF bit on packets going out to the L2TP virtual interfaces so the packets could be fragmented.
Right. I once stumbled across a SOHO-router doing just that. I never understood why, but now you've given at least one explanation how it could appear to be a good idea.

I can also provide the reason why we found it to be an extremely bad idea at the time: Some (most? all?) systems won't set both the DF flag and the identification field at the same time. If you clear the DF flag without changing the identification field, you might end up with fragmented packets that are impossible to reassemble.

Which was why I stumbled across the DF-clearing SOHO-router in the first place. The random problems it generated were extremely difficult to debug, and when we started we truly believed that we had a problem with a layer 4 load balancing switch.

Note: There are solutions that will both clear the DF flag and generate a new id. E.g. http://www.openbsd.org/faq/pf/scrub.html

This is the proper way to clear DF, if you must. Never just clear it.

Bjørn
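A toy model may help show why clearing DF without assigning a fresh identification breaks reassembly. It assumes the sending host puts a constant identification (0 here) on every DF packet, which is one way to read the behaviour described above. The names `fragment` and `reassemble` are hypothetical, offsets are plain byte offsets rather than the 8-byte units real IPv4 uses, and the same source, destination, and protocol are assumed for all fragments.

```python
from collections import defaultdict


def fragment(payload: bytes, ident: int, mtu_payload: int):
    """Split a payload into (ident, offset, more_fragments, data) tuples."""
    frags = []
    for off in range(0, len(payload), mtu_payload):
        chunk = payload[off:off + mtu_payload]
        more = off + mtu_payload < len(payload)
        frags.append((ident, off, more, chunk))
    return frags


def reassemble(frags):
    """Group fragments by identification and rebuild each datagram.

    A later fragment at the same offset overwrites an earlier one; real
    stacks differ in how they handle overlaps, this is only a toy.
    """
    buckets = defaultdict(dict)
    for ident, off, _more, data in frags:
        buckets[ident][off] = data
    return {ident: b"".join(parts[o] for o in sorted(parts))
            for ident, parts in buckets.items()}


# Two packets that left the host with DF set and the same identification,
# then had DF cleared and were fragmented on the way.
a = fragment(b"AAAAAAAA", ident=0, mtu_payload=4)
b = fragment(b"BBBBBBBB", ident=0, mtu_payload=4)
print(reassemble([a[0], b[0], b[1], a[1]]))   # {0: b'BBBBAAAA'}

# Same traffic when the DF-clearing box also assigns a fresh identification.
a = fragment(b"AAAAAAAA", ident=1, mtu_payload=4)
b = fragment(b"BBBBBBBB", ident=2, mtu_payload=4)
print(reassemble([a[0], b[0], b[1], a[1]]))   # {1: b'AAAAAAAA', 2: b'BBBBBBBB'}
```

In the first case the receiver ends up with one datagram stitched together from two different packets, and the second packet is lost. With interleaved arrival this only happens some of the time, which fits the "random problems" description.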
Brandon Butterworth wrote:
I used to see it a lot when hosting on windows was popular and people realised they needed a firewall or decided to add a load balancer but broke PMTUD by leaving it enabled on the servers.
Yeah, but this is Microsoft's OWN server farm we are talking about here, not some small podunk IIS-based hosting provider. ...well, you may be right. I am probably giving MS too much credit here.

On another note, someone pointed out to me off-list that I apparently tyop'd "hostmaster" when I sent the e-mail to MS. I have since re-sent it to the properly-spelled address and again promptly received a "User unknown" bounceback.

-- 
Nathan Anderson
First Step Internet, LLC
nathana@fsr.com
participants (4)
- Bjørn Mork
- Brandon Butterworth
- Iljitsch van Beijnum
- Nathan Anderson/FSR