Hello,

Has anyone else here seen problems with microsoft/msn/hotmail/live.com sites not performing PMTUD correctly? For a while now, people on our network have complained of poor reachability to microsoft.com. We discovered that we can work around the issue by rewriting the MSS on all TCP SYNs as they leave our network.

I recently watched a whole conversation between msn.com and a host on our network, with the MSS rewrite disabled. If I'm reading it right, we are following PMTUD correctly by sending back ICMP type 3 code 4. However, all Microsoft hosts seem to ignore this and keep sending our host packets sized to an MSS that is too large for the path.

I hope I'm wrong and that we are the ones doing something stupid. But after cruising Google for a while, I found a multitude of complaints from people on other ISPs specifically about not being able to reach Microsoft web sites. It seems crazy, though, that MS could have PMTUD broken for so long without anyone raising a complaint with them directly, which makes me wonder whether there is another answer here that I'm missing.

I sent the following message to a couple of addresses that I gleaned from ARIN WHOIS for the IP block in question, and threw hostmaster in there just in case it went somewhere. However, noc@microsoft.com appears to be defunct, and I have yet to receive an acknowledgment of receipt from the other address.

Are there any microsoft.com admins who hang out here who can comment on this or get in touch with me? Or is there perhaps someone on here with connections to the Microsoft NOC?

(BTW, I stripped the referenced libpcap attachment off of this message to the list so that I wouldn't accidentally incur the wrath of NANOG... if y'all want to see it, I'm happy to post it.)

Thanks,

--
Nathan Anderson
First Step Internet, LLC
nathana@fsr.com

-------- Original Message --------
Subject: Microsoft/MSN/Live!/Hotmail behind blackhole router?
Date: Thu, 01 May 2008 19:00:46 -0700
From: Nathan Anderson/FSR <nathana@fsr.com>
To: hostmoaster@microsoft.com, noc@microsoft.com, iprrms@microsoft.com

To microsoft.com NOC admins:

I work for a regional ISP in the inland Pacific Northwest. Many of our customers' connections have MTUs of less than 1500, and we get routine complaints from them that they have trouble reaching web sites under your administration. Usually we can fix the problem by "mangling" the TCP SYNs that originate from our customers and are headed out to the world so that they advertise a lower MSS value. However, we would rather not have to do that.

The fact that we are REQUIRED to do this for your sites to be reachable by our customers strongly suggests one of two things. The first possibility is that the servers answering HTTP requests for www.microsoft.com, www.msn.com, www.hotmail.com and www.live.com sit behind routers that block ALL ICMP traffic sent their way. This would include even ICMP type 3 code 4 (fragmentation needed and DF set), which Path MTU Discovery depends on. The second possibility is that the servers themselves are ignoring the ICMP messages we send whenever our routers are forced to drop a packet from you that is too large to forward to one of our customers.

I set up a test connection "on the bench," so to speak, and had our router capture a copy of the conversation between our test client and www.msnbc.msn.com. The router then forwarded that conversation, encapsulated in TZSP, to the same test client over a different interface. The capture clearly shows our test client establishing the TCP connection with MSNBC (SYN, SYN+ACK, ACK). It then shows MSNBC sending Ethernet-MTU-sized (1500-byte) packets our way, which an intermediate router of ours drops, responding with "fragmentation needed, DF set." Despite this, MSNBC keeps retransmitting the original packet with the same payload and the same size. We keep responding with "fragmentation needed, DF set," but the MSNBC server never seems to get the message (literally).
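For reference, the message the intermediate router sends back has a small fixed layout. RFC 792 defines the ICMP "destination unreachable, fragmentation needed and DF set" format, and RFC 1191 (which calls these "Datagram Too Big" messages) puts the next-hop MTU in the low 16 bits of the second header word. The following is a minimal sketch, assuming IPv4, that only builds and parses the bytes and does not send anything. The helper names (`build_frag_needed`, `parse_frag_needed`) and the dummy packet are hypothetical and illustrative, not taken from any router's implementation.

```python
import struct


def internet_checksum(data: bytes) -> int:
    """RFC 1071 Internet checksum: one's complement of the one's-complement sum."""
    if len(data) % 2:
        data += b"\x00"  # pad odd-length input with a zero byte
    total = sum(word for (word,) in struct.iter_unpack("!H", data))
    while total >> 16:  # fold carries back into the low 16 bits
        total = (total & 0xFFFF) + (total >> 16)
    return ~total & 0xFFFF


def build_frag_needed(next_hop_mtu: int, dropped_packet: bytes) -> bytes:
    """Hypothetical helper: ICMP type 3 code 4 = type, code, checksum,
    16 unused bits, 16-bit Next-Hop MTU, then the dropped packet's
    IP header plus the first 8 bytes of its payload."""
    ihl_bytes = (dropped_packet[0] & 0x0F) * 4  # IPv4 header length from the IHL field
    quoted = dropped_packet[: ihl_bytes + 8]
    unsigned = struct.pack("!BBHHH", 3, 4, 0, 0, next_hop_mtu) + quoted
    checksum = internet_checksum(unsigned)  # computed with the checksum field zeroed
    return struct.pack("!BBHHH", 3, 4, checksum, 0, next_hop_mtu) + quoted


def parse_frag_needed(message: bytes) -> tuple:
    """Hypothetical helper: return (type, code, next_hop_mtu) from an ICMP message."""
    icmp_type, code, _checksum, _unused, mtu = struct.unpack("!BBHHH", message[:8])
    return icmp_type, code, mtu


if __name__ == "__main__":
    # Dummy 1500-byte IPv4 packet: version 4, IHL 5 (20-byte header), other bytes zero.
    dropped = bytes([0x45]) + bytes(1499)
    msg = build_frag_needed(1492, dropped)
    print(len(msg), parse_frag_needed(msg))  # 36 (3, 4, 1492)
    print(internet_checksum(msg))  # 0, since a valid checksum sums to zero
```

The quoted header and first 8 payload bytes are what let the original sender match the ICMP message to the TCP connection whose segment was dropped. That is why a sender that receives these messages has everything it needs to shrink its segments.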
We see the same behavior with every site in the 207.46.0.0/16 space, regardless of the actual hostname/FQDN. We also find this ironic, considering that Microsoft published a TechNet article a few years back on black hole routers and the problems they pose, found at http://technet.microsoft.com/en-us/library/bb878081.aspx (which we can't access unless we are mangling the MSS).

We would appreciate it if Microsoft NOC admins would look into the matter and take the appropriate corrective action. Specifically, please allow ICMP type 3 code 4 messages through your routers/firewalls, and make sure that your servers respond to them as defined in RFC 1191.

I have attached the capture we made of the conversation to this e-mail message in libpcap format for your analysis. The test client itself had a 1500-byte MTU to a desktop router, which in turn had an MTU of 1492 on its uplink to us. I am available to answer any additional clarifying questions you may have.

Thank you for your time and attention to this matter.

Regards,

--
Nathan Anderson
First Step Internet, LLC
nathana@fsr.com
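To make the arithmetic concrete, here is a rough sketch of what an RFC 1191 sender is expected to do on receiving such a message, using the 1500/1492 MTUs from the test setup above. It compares that behavior with what the SYN MSS rewrite achieves without any ICMP at all. This sketch assumes IPv4 and TCP with no options (20-byte headers each). The function names are hypothetical. The plateau table is the one RFC 1191 suggests for routers that report a Next-Hop MTU of zero.

```python
# Sketch of RFC 1191 sender-side PMTU handling, plus the MSS-clamp workaround.
# Assumes IPv4 and TCP with no options (20-byte headers each).
IPV4_HEADER = 20
TCP_HEADER = 20
MIN_IPV4_MTU = 68  # RFC 1191: a host MUST never reduce its PMTU estimate below this

# RFC 1191 plateau table, used only when a router reports a Next-Hop MTU of zero.
PLATEAUS = (65535, 32000, 17914, 8166, 4352, 2002, 1492, 1006, 508, 296, 68)


def update_pmtu(current_pmtu: int, next_hop_mtu: int, dropped_total_length: int) -> int:
    """Hypothetical helper: new path-MTU estimate after an ICMP type 3 code 4 arrives."""
    if next_hop_mtu:
        candidate = next_hop_mtu
    else:
        # Older routers leave the field zero: guess the largest plateau
        # smaller than the Total Length of the datagram that was dropped.
        candidate = next((p for p in PLATEAUS if p < dropped_total_length), MIN_IPV4_MTU)
    # The estimate may only shrink in response to the message, never grow.
    return max(MIN_IPV4_MTU, min(current_pmtu, candidate))


def mss_for_mtu(mtu: int) -> int:
    """Largest TCP payload per segment that fits in one IPv4 packet of this size."""
    return mtu - IPV4_HEADER - TCP_HEADER


def clamp_syn_mss(advertised_mss: int, link_mtu: int) -> int:
    """What an MSS-rewriting router does to a SYN: lower the MSS, never raise it."""
    return min(advertised_mss, mss_for_mtu(link_mtu))


if __name__ == "__main__":
    # The server sends 1500-byte packets that are dropped at the 1492-byte uplink.
    pmtu = update_pmtu(current_pmtu=1500, next_hop_mtu=1492, dropped_total_length=1500)
    print(pmtu, mss_for_mtu(pmtu))  # 1492 1452
    # Without working PMTUD, clamping the client's SYN gets the same segment size.
    print(clamp_syn_mss(1460, 1492))  # 1452
```

A compliant server would drop to 1452-byte segments after the first ICMP message. The capture instead shows it retransmitting the same 1500-byte packet. Clamping the client's advertised MSS (1460 on a 1500-byte Ethernet) down to 1452 sidesteps the problem because the server never sends an oversized packet in the first place.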