On Sep 17, 2009, at 7:45 PM, Richard A Steenbergen wrote: [ SNIP ]
Story 2. Had a customer report that they were getting extremely slow transfers to another network, despite not being able to find any packet loss. Shifting the traffic to a different port to reach the same network resolved the problem. After removing the traffic and attempting to ping the far side, I got the following:
<drop>
64 bytes from x.x.x.x: icmp_seq=1 ttl=61 time=0.194 ms
64 bytes from x.x.x.x: icmp_seq=2 ttl=61 time=0.196 ms
64 bytes from x.x.x.x: icmp_seq=3 ttl=61 time=0.183 ms
64 bytes from x.x.x.x: icmp_seq=0 ttl=61 time=4.159 ms
<drop>
64 bytes from x.x.x.x: icmp_seq=5 ttl=61 time=0.194 ms
64 bytes from x.x.x.x: icmp_seq=6 ttl=61 time=0.196 ms
64 bytes from x.x.x.x: icmp_seq=7 ttl=61 time=0.183 ms
64 bytes from x.x.x.x: icmp_seq=4 ttl=61 time=4.159 ms
After a little bit more testing, it turned out that every 4th packet being sent to the peer's router was being queued until another "4th packet" came along and knocked it out. If you increased the ping interval, you would see the amount of time the packet spent in the queue increase. At one point I had it up to over 350 seconds (not milliseconds) that the packet stayed in the other router's queue before the next 4th packet came along and knocked it free. I suspect it could have gone higher, but random scanning traffic from the internet kept coming in. When there was a lot of traffic on the interface you would never see the packet loss, just reordering of every 4th packet and thus slow TCP transfers. :)
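If you want to spot that kind of thing without eyeballing the ping output, here is a minimal sketch (mine, assuming the usual "64 bytes from ...: icmp_seq=N ttl=T time=X ms" reply format) that flags replies arriving out of sequence and shows how much longer they took than the typical RTT:

#!/usr/bin/env python3
# Read ping output on stdin and flag replies that show up after a later
# sequence number has already arrived -- i.e. reordering rather than loss.
import re
import sys

PAT = re.compile(r"icmp_seq=(\d+) .*time=([\d.]+) ms")

last_seq = -1
rtts = []
for line in sys.stdin:
    m = PAT.search(line)
    if not m:
        continue
    seq, rtt = int(m.group(1)), float(m.group(2))
    rtts.append(rtt)
    typical = sorted(rtts)[len(rtts) // 2]   # running median RTT
    if seq < last_seq:
        print(f"seq {seq} arrived late: {rtt:.3f} ms vs typical {typical:.3f} ms")
    last_seq = max(last_seq, seq)

Run it as something like "ping -i 5 x.x.x.x | python3 check_reorder.py" (check_reorder.py being whatever name you save the sketch under); the longer the ping interval, the longer the stuck packet sits, as described above.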
Story 1:
-----------
I had a router where I was suddenly unable to reach certain hosts on the (/24) ethernet interface -- pinging from the router worked fine, transit traffic wouldn't. I decided to try and figure out if there was any sort of rhyme or reason to which hosts had gone unreachable. I could successfully reach:

xxx.yyy.zzz.1
xxx.yyy.zzz.2
xxx.yyy.zzz.3
xxx.yyy.zzz.5
xxx.yyy.zzz.7
xxx.yyy.zzz.11
xxx.yyy.zzz.13
xxx.yyy.zzz.17
...
xxx.yyy.zzz.197
xxx.yyy.zzz.199

There were only 200 hosts on the LAN, but I'd bet dollars to donuts that I know what the next reachable one would have been if there had been more (there's a quick sketch of the pattern below, after story 2). Unfortunately the box rebooted itself (when I tried to view the FIB) before I could collect more info.

Story 2:
----------
Had a small router connecting a remote office over a multilink PPP[1] interface (4xE1). Site starts getting massive packet loss, so I figure one of the circuits has gone bad but didn't get removed from the bundle. I'm having a hard time reaching the remote side, so I pull the interfaces from protocols and try pinging the remote router -- no replies.... Luckily I didn't hit Ctrl-C on the ping, because suddenly I start getting replies with no drops:

64 bytes from x.x.x.x: icmp_seq=1 ttl=120 time=30132.148 ms
64 bytes from x.x.x.x: icmp_seq=2 ttl=120 time=30128.178 ms
64 bytes from x.x.x.x: icmp_seq=3 ttl=120 time=30133.231 ms
64 bytes from x.x.x.x: icmp_seq=4 ttl=120 time=30112.571 ms
64 bytes from x.x.x.x: icmp_seq=5 ttl=120 time=30132.632 ms

What?! I figure it's gotta be MLPPP stupidity and / or depref of ICMP, so I connect OOB and A: remove MLPPP and use just a single interface, and B: start pinging a host behind the router instead...

64 bytes from x.x.x.x: icmp_seq=1 ttl=120 time=30142.323 ms
64 bytes from x.x.x.x: icmp_seq=2 ttl=120 time=30144.571 ms
64 bytes from x.x.x.x: icmp_seq=3 ttl=120 time=30141.632 ms
64 bytes from x.x.x.x: icmp_seq=4 ttl=120 time=30142.420 ms
64 bytes from x.x.x.x: icmp_seq=5 ttl=120 time=30159.706 ms

I fire up tcpdump and try ssh to a host on the remote side -- I see the SYN leave my machine and then, 30 *seconds* later, I get back a SYN-ACK. I change the queuing on the interface from FIFO to something else and the problem goes away. I change the queuing back to FIFO and it's a 30 second RTT again. Somehow it seems to be buffering as much traffic as it can (and anything more than one copy of ping running, or ping with anything larger than the default packet size, makes it start dropping badly). I ran "show buffers" to try to get more of an idea of what was happening, but it didn't like that and reloaded. Came back up fine though...
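To put a rough number on story 2: for a FIFO queue to add ~30 seconds of delay, it has to be holding roughly 30 seconds' worth of line rate. A back-of-the-envelope sketch (my own figures, nothing measured on the box):

# bytes that must be sitting in the queue to add ~30 s of delay
E1_BPS = 2.048e6                                # one E1, bits per second
for label, bps in [("4xE1 MLPPP", 4 * E1_BPS), ("single E1", E1_BPS)]:
    queued = bps * 30 / 8                       # 30 s of traffic, in bytes
    print(f"{label}: ~{queued / 1e6:.1f} MB queued")
# prints: 4xE1 MLPPP: ~30.7 MB queued / single E1: ~7.7 MB queued

So somewhere between roughly 8 and 30 MB of packets sitting in one interface queue, which is an awful lot of buffer for a small router to be cheerfully filling.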
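And going back to story 1: the reachable last octets are 1 plus every prime below 200, so the next reachable host in that pattern would have been .211. A quick sketch (mine, just checking the pattern):

def is_prime(n):
    return n > 1 and all(n % d for d in range(2, int(n ** 0.5) + 1))

reachable = [1] + [n for n in range(2, 255) if is_prime(n)]
print(reachable[:10])                        # [1, 2, 3, 5, 7, 11, 13, 17, 19, 23]
print(min(n for n in reachable if n > 199))  # 211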
Story 3:
----------
Running a network that had a large number of L3 switches from a vendor (let's call them "X") in a single OSPF area. This area also contained a large number of poor quality international circuits that would flap often, so there was *lots* of churn. Apparently vendor X's OSPF implementation didn't much like this and would become unhappy. The way it would express its displeasure was by corrupting a pointer to / in the LSDB so it was off-by-one, and you'd get:

Nov 24 22:23:53.633 EST: %OSPF-4-BADLSAMASK: Bad LSA mask: Type 5, LSID 0.9.32.5 Mask 10.160.8.0 from 10.178.255.252 NOTE: This route will not be installed in the routing table.
Nov 26 11:01:32.997 EST: %OSPF-4-BADLSAMASK: Bad LSA mask: Type 5, LSID 0.4.2.3 Mask 10.2.153.0 from 10.178.255.252 NOTE: This route will not be installed in the routing table.
Nov 27 23:14:00.660 EST: %OSPF-4-BADLSAMASK: Bad LSA mask: Type 5, LSID 0.4.2.3 Mask 10.2.153.0 from 10.178.255.252 NOTE: This route will not be installed in the routing table.

(This network was addressed out of 10/8 -- 10.178.255.252 is one of vendor X's boxes, and 10.160.8.0 is a valid subnet but, surprisingly enough, not a valid mask...)

To make matters even more fun, the OSPF adjacency would go down and then come back up -- and the grumpy box would flood all of its (corrupt) LSAs...

W

[1]: Hey, not my idea...
--
Richard A Steenbergen <ras@e-gerbil.net>       http://www.e-gerbil.net/ras
GPG Key ID: 0xF8B12CBC (7535 7F59 8204 ED1F  CC1C 53AF 4C41 5ECA F8B1 2CBC)
-- "Real children don't go hoppity-skip unless they are on drugs." -- Susan, the ultimate sensible governess (Terry Pratchett, Hogfather)