[NANOG] Strange network behaviour
We had a very strange problem today. Two of our hosts could not reach a server, but only those two hosts. All of our other hosts could reach that server fine. (OK, I didn't try ALL of our IPs, but the half dozen I did try worked fine.) I checked all of our firewalls and routers, and everywhere I looked the traffic was exiting our network just fine. I saw the traffic going out on our edge routers, just no traffic coming back to the two hosts in question. (We had good bidirectional traffic to all of our other hosts.) And the two hosts in question were only having problems connecting to ftp.agnewsonline.com.

Let's start with a traceroute from a working host; the originating host is 12.192.92.14:

[~]% traceroute -I ftp.agnewsonline.com
traceroute to agnewsonline.com (64.46.45.226), 64 hops max, 60 byte packets
 1  12.192.92.3 (12.192.92.3)  0.257 ms  0.171 ms  0.163 ms
 2  pluto-0 (12.192.93.13)  0.401 ms  0.296 ms  0.294 ms
 3  ixion-att (12.192.93.244)  1.260 ms  0.463 ms  1.116 ms
 4  12.87.125.249 (12.87.125.249)  14.838 ms  9.314 ms  9.755 ms
 5  tbr2.cgcil.ip.att.net (12.122.99.122)  24.528 ms  24.788 ms  23.009 ms
 6  ggr2.cgcil.ip.att.net (12.123.6.69)  22.362 ms  23.410 ms  22.335 ms
 7  192.205.33.186 (192.205.33.186)  23.448 ms  24.074 ms  29.405 ms
 8  ae-31-53.ebr1.Chicago1.Level3.net (4.68.101.94)  22.800 ms  32.598 ms  36.093 ms
 9  ae-68.ebr3.Chicago1.Level3.net (4.69.134.58)  23.446 ms  21.599 ms  34.060 ms
10  ae-3.ebr2.Denver1.Level3.net (4.69.132.61)  61.517 ms  57.482 ms  56.606 ms
11  ae-2.ebr2.Seattle1.Level3.net (4.69.132.53)  96.484 ms  114.264 ms  96.984 ms
12  ae-23-52.car3.Seattle1.Level3.net (4.68.105.36)  91.295 ms  88.700 ms  89.705 ms
13  BIG-PIPE-IN.car3.Seattle1.Level3.net (4.71.152.26)  90.053 ms  90.511 ms  92.072 ms
14  rc1wh-pos14-0.vc.shawcable.net (66.163.76.1)  90.062 ms  93.489 ms  90.757 ms
15  rc2wh-pos0-15-2-0.vc.shawcable.net (66.163.69.181)  96.527 ms  91.743 ms  97.254 ms
16  rd1ht-tge1-1-1.ok.shawcable.net (66.163.77.18)  101.412 ms  114.160 ms  100.530 ms
17  ra1ht-ge3-1.ok.shawcable.net (66.163.72.134)  105.651 ms  101.336 ms  101.628 ms
18  rx0ht-rack-force-2.ok.bigpipeinc.com (64.251.64.50)  111.960 ms  101.535 ms  116.136 ms
19  rf1.01.rackforce.net (69.10.128.198)  583.192 ms  491.170 ms  598.406 ms
20  64.46.45.226 (64.46.45.226)  110.207 ms  108.718 ms  107.279 ms

A traceroute from one of the hosts that doesn't work would reach ae-3.ebr2.Denver1.Level3.net but go no further. I then tried pinging the routers I couldn't reach. I could not ping:

ae-3.ebr2.Denver1.Level3.net (4.69.132.61)
ae-2.ebr2.Seattle1.Level3.net (4.69.132.53)
ae-23-52.car3.Seattle1.Level3.net (4.68.105.36)
BIG-PIPE-IN.car3.Seattle1.Level3.net (4.71.152.26)

But when I started pinging rc1wh-pos14-0.vc.shawcable.net (66.163.76.1), not only did I start getting responses, but everything started working to ftp.agnewsonline.com too, though just from that host. It really seemed that pinging that router somehow fixed my problem.

Well, I'm not sure I really believed that, but I still had another host that couldn't reach ftp.agnewsonline.com, so on that host I started a ping. I'll add comments in /* */ to describe what I was doing in another window:

[~]% ping ftp.agnewsonline.com
PING agnewsonline.com (64.46.45.226): 56 data bytes
/* At this point in another window I started another ping: */
/*   ping 66.163.76.1
     and immediately this ping started working ...
*/
64 bytes from 64.46.45.226: icmp_seq=18 ttl=108 time=104.617 ms
64 bytes from 64.46.45.226: icmp_seq=19 ttl=108 time=105.775 ms
64 bytes from 64.46.45.226: icmp_seq=20 ttl=108 time=101.569 ms

--- agnewsonline.com ping statistics ---
22 packets transmitted, 3 packets received, 86% packet loss
round-trip min/avg/max/stddev = 101.569/103.987/105.775/1.774 ms

It was like I threw a switch. The single outbound ICMP packet to rc1wh-pos14-0.vc.shawcable.net (66.163.76.1) fixed everything for that host.

I was wondering if anybody has any clue what might be going on. I've never experienced a problem like this before.
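(For anyone who wants to repeat the hop-by-hop test, what I did by hand amounts to roughly this loop, run from a broken host. The hop list is copied from the traceroute above; ping's timeout flags vary by OS, so this sketch just uses -c and the default wait.)

#!/bin/sh
# Ping every hop from the working host's traceroute, in order,
# to find where the return path dies for this source address.
for hop in 12.192.92.3 12.192.93.13 12.192.93.244 12.87.125.249 \
           12.122.99.122 12.123.6.69 192.205.33.186 4.68.101.94 \
           4.69.134.58 4.69.132.61 4.69.132.53 4.68.105.36 \
           4.71.152.26 66.163.76.1
do
    if ping -c 3 "$hop" > /dev/null 2>&1; then
        echo "$hop  reachable"
    else
        echo "$hop  NO RESPONSE"
    fi
done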
In the popular tradition of replying to my own post ...

It seems that this problem started right around the time I changed our BGP configuration. I did:

config term
route-map att_out permit 9999
set as-path prepend 19317 19317
exit
clear ip bgp 12.87.125.249 out

This change was to increase the prepending of our own AS path from one to two. It was:

set as-path prepend 19317

And we /could/ have had another host with a similar problem but with a different end point, this time in Canada; the (outbound) route to our Canadian server does not transit either Level3 or Shaw Cable. We did not identify exactly when it started working again: we were poking around the problem and then happened to check, and it was working. It is *possible*, but by no means confirmed, that a traceroute allowed to run through all its timeouts to the Canadian server may have switched that problem off too.

Doug "Still searching for answers" Rand.
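(For clarity, here is the before-and-after state of the route-map as a sketch. The route-map name, prepend values, and neighbor address are from the commands above; the neighbor statement showing where att_out is applied is my assumption, inferred from the clear command.)

! Before the change: our AS (19317) prepended once toward AT&T
route-map att_out permit 9999
 set as-path prepend 19317
!
! After the change: prepended twice
route-map att_out permit 9999
 set as-path prepend 19317 19317
!
! Assumed application point, not shown in my original post:
router bgp 19317
 neighbor 12.87.125.249 route-map att_out out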
Did your inbound path change as a result? Sounds like a path asymmetry issue might be involved.
Deepak Jain wrote:

> Did your inbound path change as a result?
Yes, I was trying to re-balance our inbound traffic a bit better. The route-map change resulted in about 30% of our traffic coming in via our other provider. The change was made around 16:00 (CDT) last Friday, about 72 hours before this problem was brought to my attention.
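(For anyone curious, a quick way to see the effect from the outside is a public route server; this is a sketch, and the prefix here is assumed to be one of ours based on the source address in the traceroute.)

[~]% telnet route-views.routeviews.org
...
route-views> show ip bgp 12.192.92.0

Paths learned through AT&T (AS 7018) should now show 19317 three times at the end (... 7018 19317 19317 19317), while the path through our other provider stays one AS hop shorter; that shorter AS path is what pulls more inbound traffic over there.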
> Sounds like a path asymmetry issue might be involved.
But why would the problem last almost 72 hours and be solved by a single ICMP packet to a particular router? And why would only two of our hosts be affected while all of our other systems worked just fine? I don't mean to whine, really! And I really don't mean to disagree; I'm no expert in this stuff. I just don't understand what seems like a very fine-grained problem.
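(The only way I can picture a problem this fine-grained is something per-flow: if a router along the way load-shares across parallel links by hashing the source/destination address pair, two of our hosts can ride a different link than everyone else to the very same server. A toy illustration follows; this is purely speculation on my part, nothing like a real vendor hash, and 12.192.92.15 is a made-up second source address.)

#!/bin/sh
# Toy per-flow hash: pick one of two parallel links from the
# last octets of source and destination. Pure illustration.
dst=64.46.45.226
for src in 12.192.92.14 12.192.92.15; do
    link=$(( (${src##*.} + ${dst##*.}) % 2 ))
    echo "flow $src -> $dst hashes to link $link"
done

Two hosts on the same subnet land on different links to the same destination, so a problem on just one link shows up for just some host pairs.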