SOLVED! The cause of puzzling TCP (eg. WHOIS) connection failures with some InterNIC.net hosts
 
[[ NOTE: this message is cross-posted to NANOG and tech-net, as well as Cc'ed to Mark Kosters (because I don't know if Mark reads either list). Please reply either directly to me, or to *one* of those lists as appropriate (it's probably not relevant to discuss on NANOG now that the problem has been identified unless it's a problem with some particular piece of equipment, in which case it would be good to identify it so others can fix similar problems). ]]

I've discovered the cause of those problems with TCP connections to/from some InterNIC.net hosts (and some other hosts, one of which was trying to send me e-mail and thus necessitated that I debug it in more detail).

Now that I know the cause I can say that this problem is usually indicative of a firewall with a non-compliant TCP/IP implementation, though it may also indicate an unwise firewall filtering policy too.

The problem has to do with the failure of a host to fragment larger packets on demand (i.e. when the other host sends an ICMP "needs frag" notification). This may be because the ICMP packet never gets through (perhaps someone who didn't understand TCP/IP and ICMP and everything else related implemented a filter on all "abnormal" ICMP packets); or it may be because the receiving host doesn't understand the ICMP "needs frag" request (and also doesn't implement path MTU discovery, or have I got that backwards?).

No matter what the problem really is, I'm sure a *lot* of people would be much happier if this problem were fixed, specifically for the WHOIS service (though I've also had troubles receiving HTTP too). I got quite a few replies about similar experiences when I first posted about this on NANOG recently.
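To make the mechanism described above concrete, here is a minimal Python sketch of how a sender that sets DF is expected to react to a "needs frag" ICMP message, and what happens when that message never reaches it. It is a toy simulation, not any real stack's code: the function names, the 20-byte IP and TCP header sizes (no options) and the retry limit are assumptions made for illustration, and a real sender would re-segment its data rather than simply shrink one segment.

```python
IP_HEADER = 20   # bytes; assumes no IP options
TCP_HEADER = 20  # bytes; assumes no TCP options


def first_too_small_mtu(packet_len, link_mtus):
    """Return the MTU of the first link the DF packet cannot cross, else None."""
    for mtu in link_mtus:
        if packet_len > mtu:
            return mtu
    return None


def send_segment(payload_len, link_mtus, icmp_arrives, max_tries=7):
    """Simulate (re)transmitting one DF-marked TCP segment along a path.

    Returns (delivered, tries, final_payload_len).
    """
    tries = 0
    while tries < max_tries:
        tries += 1
        packet_len = payload_len + IP_HEADER + TCP_HEADER
        blocked_by = first_too_small_mtu(packet_len, link_mtus)
        if blocked_by is None:
            return True, tries, payload_len
        # The router drops the packet and reports its next-hop MTU via ICMP.
        if icmp_arrives:
            # Path MTU discovery: shrink the segment to fit the reported MTU.
            payload_len = blocked_by - IP_HEADER - TCP_HEADER
        # Otherwise the sender learns nothing and retransmits the same size.
    return False, tries, payload_len


# ICMP reaches the sender: the second attempt fits the 1006-byte link.
print(send_segment(1460, [1500, 1006], icmp_arrives=True))
# ICMP is filtered or ignored: every retransmission is dropped.
print(send_segment(1460, [1500, 1006], icmp_arrives=False))
```

With a 1460-byte segment and a 1006-byte link on the path, the first case settles on 966 bytes of payload (1006 minus 40 bytes of headers), while the second never gets through.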
Here's a sample trace collected from the PPP router upstream which shows the outgoing ICMP packets and the incoming TCP retransmissions, un-fragmented, even after the first request to fragment:

15:02:56.097980 204.92.254.2.4721 > 198.41.0.6.43: S 1660910424:1660910424(0) win 16384 <mss 1460,nop,wscale 0,nop,nop,timestamp 5148233 0> (ttl 62, id 25948)
15:02:56.420273 198.41.0.6.43 > 204.92.254.2.4721: S 1062510833:1062510833(0) ack 1660910425 win 8760 <mss 1460> (DF) (ttl 245, id 4189)
15:02:56.674783 204.92.254.2.4721 > 198.41.0.6.43: . ack 1 win 17520 (ttl 62, id 25951)
15:02:56.677143 204.92.254.2.4721 > 198.41.0.6.43: P 1:6(5) ack 1 win 17520 (ttl 62, id 25952)
15:02:57.175854 198.41.0.6.43 > 204.92.254.2.4721: . ack 6 win 8760 (DF) (ttl 245, id 4190)
15:02:59.393169 198.41.0.6.43 > 204.92.254.2.4721: P 1:4(3) ack 6 win 8760 (DF) (ttl 245, id 4191)
15:02:59.532326 204.92.254.2.4721 > 198.41.0.6.43: . ack 4 win 17517 (ttl 62, id 25994)
15:03:00.326761 198.41.0.6.43 > 204.92.254.2.4721: . 4:1464(1460) ack 6 win 8760 (DF) (ttl 245, id 4192)
15:03:00.327688 204.29.161.41 > 198.41.0.6: icmp: 204.92.254.2 unreachable - need to frag (mtu 1006) (DF) (ttl 255, id 19390)
15:03:00.420416 198.41.0.6.43 > 204.92.254.2.4721: . 1464:2914(1450) ack 6 win 8760 (DF) (ttl 245, id 4193)
15:03:00.421157 204.29.161.41 > 198.41.0.6: icmp: 204.92.254.2 unreachable - need to frag (mtu 1006) (DF) (ttl 255, id 19391)
15:03:03.381245 198.41.0.6.43 > 204.92.254.2.4721: . 4:1464(1460) ack 6 win 8760 (DF) (ttl 245, id 4194)
15:03:03.382120 204.29.161.41 > 198.41.0.6: icmp: 204.92.254.2 unreachable - need to frag (mtu 1006) (DF) (ttl 255, id 19392)
15:03:10.619116 198.41.0.6.43 > 204.92.254.2.4721: . 4:1464(1460) ack 6 win 8760 (DF) (ttl 245, id 4195)
15:03:10.620110 204.29.161.41 > 198.41.0.6: icmp: 204.92.254.2 unreachable - need to frag (mtu 1006) (DF) (ttl 255, id 19411)
15:03:24.974732 198.41.0.6.43 > 204.92.254.2.4721: . 4:1464(1460) ack 6 win 8760 (DF) (ttl 245, id 4196)
15:03:24.975626 204.29.161.41 > 198.41.0.6: icmp: 204.92.254.2 unreachable - need to frag (mtu 1006) (DF) (ttl 255, id 19413)
15:03:53.941690 198.41.0.6.43 > 204.92.254.2.4721: . 4:1464(1460) ack 6 win 8760 (DF) (ttl 245, id 4197)
15:03:53.942656 204.29.161.41 > 198.41.0.6: icmp: 204.92.254.2 unreachable - need to frag (mtu 1006) (DF) (ttl 255, id 19418)
15:04:50.256764 198.41.0.6.43 > 204.92.254.2.4721: . 4:1464(1460) ack 6 win 8760 (DF) (ttl 245, id 52333)
15:04:50.257959 204.29.161.41 > 198.41.0.6: icmp: 204.92.254.2 unreachable - need to frag (mtu 1006) (DF) (ttl 255, id 19425)
15:05:46.509834 198.41.0.6.43 > 204.92.254.2.4721: . 4:1464(1460) ack 6 win 8760 (DF) (ttl 245, id 43047)
15:05:46.510716 204.29.161.41 > 198.41.0.6: icmp: 204.92.254.2 unreachable - need to frag (mtu 1006) (DF) (ttl 255, id 19433)
15:06:29.615496 198.41.0.6.43 > 204.92.254.2.4721: R 4:4(0) ack 6 win 0 (ttl 55, id 23874)

Note that ICMP packets get through correctly, seemingly because NetBSD fragments them on the way through (I suppose a "needs frag" packet could be sent in this case too, but that does seem a little too intertwined to be reliable).
Here's the trace of two big (-s 1400) packet pings from the router's POV:

15:17:21.092624 204.92.254.2 > 198.41.0.6: icmp: echo request (ttl 253, id 41845)
15:17:21.366494 198.41.0.6 > 204.92.254.2: icmp: echo reply (ttl 246, id 25176)
15:17:21.978679 204.92.254.2 > 198.41.0.6: icmp: echo request (ttl 253, id 41855)
15:17:22.227824 198.41.0.6 > 204.92.254.2: icmp: echo reply (ttl 246, id 25351)

And here's what I see on my end of the link corresponding to the above:

15:17:17.466591 204.92.254.2 > 198.41.0.6: icmp: echo request (ttl 255, id 41845)
15:17:18.467986 204.92.254.2 > 198.41.0.6: icmp: echo request (ttl 255, id 41855)
15:17:18.487006 198.41.0.6 > 204.92.254.2: icmp: echo reply (frag 25176:984@0+) (ttl 244)
15:17:18.489940 198.41.0.6 > 204.92.254.2: (frag 25176:24@984) (ttl 244)
15:17:19.251136 198.41.0.6 > 204.92.254.2: icmp: echo reply (frag 25351:984@0+) (ttl 244)
15:17:19.263880 198.41.0.6 > 204.92.254.2: (frag 25351:24@984) (ttl 244)

Perhaps routers (i.e. NetBSD when it's routing, in this case) should also fragment TCP packets on the way through if there are "too many" retransmissions of over-sized packets (one's too many for me, but I guess on high-latency links there might be two or three in the pipe -- perhaps a timer would help adjust when to do local fragmenting).

--
Greg A. Woods

+1 416 218-0098 VE3TCP <gwoods@acm.org> <robohack!woods>
Planix, Inc. <woods@planix.com>; Secrets of the Weird <woods@weird.com>
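The fragment sizes in the ping trace above can be reproduced with a little arithmetic. The Python sketch below assumes a 20-byte IP header with no options and uses a hypothetical helper name. It relies on the rule that every fragment except the last must carry a multiple of 8 bytes of data. The 1008-byte total is read off the fragment lengths in the trace (984 + 24), not derived from the ping command line.

```python
IP_HEADER = 20  # bytes; assumes no IP options


def fragment(payload_len, mtu):
    """Split an IP payload into (length, offset, more_fragments) tuples.

    Every fragment except the last carries a multiple of 8 bytes,
    because the IP fragment offset field counts 8-byte units.
    """
    max_data = (mtu - IP_HEADER) // 8 * 8
    fragments = []
    offset = 0
    while offset < payload_len:
        length = min(max_data, payload_len - offset)
        more = offset + length < payload_len
        fragments.append((length, offset, more))
        offset += length
    return fragments


# A 1008-byte ICMP message crossing the 1006-byte link:
for length, offset, more in fragment(1008, 1006):
    # Same notation as tcpdump: length@offset, with '+' for "more fragments".
    print(f"{length}@{offset}{'+' if more else ''}")
```

The two fragments come out as 984@0+ and 24@984, matching the echo replies seen on the far side of the PPP link. The TCP segments in the earlier trace carry DF, so the router may not split them this way and has to send the "need to frag" ICMP message instead.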
 
On Fri, 20 Nov 1998, Greg A. Woods wrote:

[...]
> The problem has to do with the failure of a host to fragment larger packets on demand (i.e. when the other host sends an ICMP "needs frag" notification). This may be because the ICMP packet never gets through (perhaps someone who didn't understand TCP/IP and ICMP and everything else related implemented a filter on all "abnormal" ICMP packets); or it may be because the receiving host doesn't understand the ICMP "needs frag" request (and also doesn't implement path MTU discovery, or have I got that backwards?).

No, if they don't implement PMTU-D then they normally wouldn't be sending packets with DF set. If DF isn't set, then normally the packets will be fragmented by the router so there is no problem.

I really don't think it would be very wise for a router to try to keep track of packets it is routing that have DF set and then ignore it if it thinks it should.

You can implement PMTU-D blackhole discovery on the sender, ie. if it keeps trying to send and doesn't succeed or get any can't fragment ICMP back, then it will try backing down.

The problem in this case is probably the load balancing systems that NSI is now using, which have known bogus behaviour when interacting with PMTU-D. NSI should disable PMTU-D on their servers until they can fix this problem; fixing this problem probably involves having the vendor for their load balancing boxes fixing their broken software.

> Here's a sample trace collected from the PPP router upstream which shows the outgoing ICMP packets and the incoming TCP retransmissions, un-fragmented, even after the first request to fragment:

I thought you had said that there were no differences in the traffic dumps between working and non-working connections...

As always, see http://www.worldgate.com/~marcs/mtu/ for discussion of PMTU-D and how things break it.
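The sender-side blackhole discovery mentioned above can be sketched roughly as follows, in Python. This is a toy model under stated assumptions rather than any real stack's algorithm: the function name, the retry threshold and the fallback sizes (966, chosen to fit the 1006-byte link in the trace, and 536, the classic 576-byte datagram minus 40 bytes of headers) are all hypothetical.

```python
IP_HEADER = 20   # bytes; assumes no IP options
TCP_HEADER = 20  # bytes; assumes no TCP options

# Hypothetical ladder of smaller segment sizes to fall back to.
FALLBACK_MSS = (966, 536)


def send_with_blackhole_detection(mss, path_mtu,
                                  retries_before_backoff=2, max_tries=10):
    """Simulate a DF sender on a path where all ICMP errors are lost.

    After `retries_before_backoff` unanswered transmissions at one size,
    the sender guesses that it is facing a PMTU black hole and backs
    down to the next smaller size. Returns (delivered, tries, final_mss).
    """
    ladder = [size for size in FALLBACK_MSS if size < mss]
    tries = 0
    failures_at_size = 0
    while tries < max_tries:
        tries += 1
        if mss + IP_HEADER + TCP_HEADER <= path_mtu:
            return True, tries, mss
        # Dropped silently: no ACK and no "can't fragment" ICMP comes back.
        failures_at_size += 1
        if failures_at_size >= retries_before_backoff and ladder:
            mss = ladder.pop(0)
            failures_at_size = 0
    return False, tries, mss


print(send_with_blackhole_detection(1460, 1006))
print(send_with_blackhole_detection(1460, 600))
```

In this model the connection from the trace would recover on the third transmission instead of retransmitting the same 1460-byte segment until the reset. The cost is that the sender may end up using a smaller segment size than the path actually requires.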
 
> The problem in this case is probably the load balancing systems that NSI is now using, which have known bogus behaviour when interacting with PMTU-D. NSI should disable PMTU-D on their servers until they can fix this problem; fixing this problem probably involves having the vendor for their load balancing boxes fixing their broken software.

What load balancing systems are known to mess up with PMTU-D ?

Rubens Kuhl Jr.
 
On Fri, 20 Nov 1998, Rubens Kuhl Jr. wrote:

>> The problem in this case is probably the load balancing systems that NSI is now using, which have known bogus behaviour when interacting with PMTU-D. NSI should disable PMTU-D on their servers until they can fix this problem; fixing this problem probably involves having the vendor for their load balancing boxes fixing their broken software.
>
> What load balancing systems are known to mess up with PMTU-D ?

I would suggest that it appears that NSI is, as has previously been mentioned on the list and f5's web page, using f5's BIG/ip boxes.
 
I would guess bigIP as that's what their sales folks said they used.

Our luck with them was less than that of what they said it would be. It's now no longer in our network. *big* mistake and big headache.

- jared

On Fri, Nov 20, 1998 at 09:42:43PM -0200, Rubens Kuhl Jr. wrote:
>> The problem in this case is probably the load balancing systems that NSI is now using, which have known bogus behaviour when interacting with PMTU-D. NSI should disable PMTU-D on their servers until they can fix this problem; fixing this problem probably involves having the vendor for their load balancing boxes fixing their broken software.
>
> What load balancing systems are known to mess up with PMTU-D ?
>
> Rubens Kuhl Jr.

--
Jared Mauch | pgp key available via finger from jared@puck.nether.net
clue++; | http://puck.nether.net/~jared/
 
On Fri, Nov 20, 1998 at 04:25:11PM -0500, Greg A. Woods wrote:
> The problem has to do with the failure of a host to fragment larger packets on demand (i.e. when the other host sends an ICMP "needs frag" notification). This may be because the ICMP packet never gets through (perhaps someone who didn't understand TCP/IP and ICMP and everything else related implemented a filter on all "abnormal" ICMP packets); or it may be because the receiving host doesn't understand the ICMP "needs frag" request (and also doesn't implement path MTU discovery, or have I got that backwards?).
>
> No matter what the problem really is, I'm sure a *lot* of people would be much happier if this problem were fixed, specifically for the WHOIS service (though I've also had troubles receiving HTTP too). I got quite a few replies about similar experiences when I first posted about this on NANOG recently.

Thanks Greg for the good information. The InterNIC load balancers (BigIP made by F5 Labs) do have a problem with path MTU discovery. We have taken a short term fix of turning off path MTU discovery on the hosts behind BigIP until F5 issues a fix.

Regards,
Mark

--
Mark Kosters markk@internic.net InterNIC Registration Services
PGP Key fingerprint = 1A 2A 92 F8 8E D3 47 F9 15 65 80 87 68 13 F6 48
I am not a spokesperson for NSI. Anything I write or say is my personal opinion and in no way should be interpreted as NSI's official position.
participants (5)
- Jared Mauch
- Marc Slemko
- Mark Kosters
- Rubens Kuhl Jr.
- woods@most.weird.com