Since this list likes to speculate with few facts on a regular basis (and I'll admit to being as guilty as anyone), I throw this one out for opinions:

We were seeing very odd behavior on a Cogent circuit following a software upgrade to tol01.atlas.

Two traceroutes:

mark@angola-gw> traceroute 74.125.226.6
traceroute to 74.125.226.6 (74.125.226.6), 30 hops max, 40 byte packets
 1  * * gi1-1.ccr01.tol01.atlas.cogentco.com (38.104.148.5)  110.315 ms
 2  te4-2.ccr01.sbn01.atlas.cogentco.com (154.54.7.154)  139.520 ms  196.910 ms  5.728 ms
 3  * * *
 4  * te0-5-0-5.ccr21.ord03.atlas.cogentco.com (154.54.44.174)  8.310 ms
    te0-0-0-7.ccr21.ord03.atlas.cogentco.com (154.54.25.70)  8.752 ms
 5  te0-0-0-0.ccr22.ord03.atlas.cogentco.com (154.54.24.214)  8.983 ms
    te0-1-0-0.ccr22.ord03.atlas.cogentco.com (66.28.4.66)  7.948 ms *
 6  * * te-9-1.car4.Chicago1.Level3.net (4.68.127.129)  26.127 ms
 7  GOOGLE-INC.car4.Chicago1.Level3.net (4.71.100.22)  38.132 ms  25.120 ms *
 8  * * 209.85.254.122 (209.85.254.122)  24.539 ms
 9  * 72.14.237.130 (72.14.237.130)  26.134 ms
    72.14.237.108 (72.14.237.108)  25.021 ms
     MPLS Label=666803 CoS=4 TTL=1 S=1
10  216.239.46.161 (216.239.46.161)  31.816 ms  35.702 ms  32.249 ms
11  72.14.233.142 (72.14.233.142)  32.897 ms * *
12  * yyz06s05-in-f6.1e100.net (74.125.226.6)  33.319 ms *

and a ping over the same path:

--- www.l.google.com ping statistics ---
675 packets transmitted, 323 packets received, 52.1% packet loss
round-trip min/avg/max/stddev = 12.834/28.831/129.743/28.987 ms

and at the same time:

mark@angola-gw> traceroute 38.100.128.10
traceroute to 38.100.128.10 (38.100.128.10), 30 hops max, 40 byte packets
 1  gi1-1.ccr01.tol01.atlas.cogentco.com (38.104.148.5)  4.445 ms  1.841 ms  1.713 ms
 2  te7-7.ccr02.cle04.atlas.cogentco.com (154.54.5.230)  5.318 ms
    te3-2.ccr02.cle04.atlas.cogentco.com (154.54.28.86)  4.755 ms
    te7-7.ccr02.cle04.atlas.cogentco.com (154.54.5.230)  4.982 ms
 3  te4-2.ccr01.pit02.atlas.cogentco.com (154.54.30.10)  7.997 ms
    te3-2.ccr01.pit02.atlas.cogentco.com (154.54.30.6)  7.736 ms
    te4-2.ccr01.pit02.atlas.cogentco.com (154.54.30.10)  8.177 ms
 4  te0-0-0-5.mpd21.dca01.atlas.cogentco.com (154.54.40.81)  17.197 ms
    te0-0-0-5.ccr22.dca01.atlas.cogentco.com (154.54.30.230)  16.907 ms
    te0-0-0-5.mpd21.dca01.atlas.cogentco.com (154.54.40.81)  17.008 ms
 5  te0-1-0-0.mpd22.dca01.atlas.cogentco.com (154.54.2.193)  17.358 ms
    te0-0-0-0.mpd22.dca01.atlas.cogentco.com (154.54.31.38)  17.196 ms
    te0-1-0-0.mpd22.dca01.atlas.cogentco.com (154.54.2.193)  18.690 ms
 6  te4-2.mpd01.iad03.atlas.cogentco.com (154.54.29.122)  17.885 ms *  18.537 ms
 7  cogentco.com (38.100.128.10)  17.836 ms !<10>  17.918 ms !<10>  17.833 ms !<10>

--- 38.100.128.10 ping statistics ---
236 packets transmitted, 236 packets received, 0.0% packet loss
round-trip min/avg/max/stddev = 22.717/27.942/128.011/12.236 ms
sh-3.2#

Works perfectly.

There is no asymmetric routing in this scenario (only 1 BGP peer running during this test), and it is not due to traffic congestion.

Initial speculation over the dropped packets in the trace to 74.125.226.6 was ICMP deprioritization. The results are too consistent for that to make sense (I have dozens of traceroutes to the same destination - they all appear similar).

I realize there is a long history of Cogent/L3 ugliness but I'm pretty sure that this issue has nothing to do with that subject.

Traceroutes and pings from the control plane of tol01.atlas sourced from 38.104.148.5 do not show any odd behavior. Inbound traffic (to us) is not affected by this.

Our workaround while resolving this issue was to change local-pref on the affected prefixes to send traffic out our other providers.

The issue started after a software upgrade to tol01.atlas and resolved after a (reported) reboot of tol01.atlas.

The question is: How does a router break in this manner? It appears to unintentionally be doing something different with traffic based on the source address, not the destination address. I realize this can be done intentionally - but that is not the case here (unless somebody isn't telling me something).

--
Mark Radabaugh
Amplex
mark@amplex.net  419.837.5015
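For readers unfamiliar with the local-pref workaround mentioned in the post above, here is a minimal sketch of why it moves traffic. It models only the first two comparisons of BGP best-path selection (highest local-pref, then shortest AS path); real implementations have more tie-breakers. The prefix, the second provider name, the AS-path lengths, and the lowered value of 50 are all made up for illustration and do not come from the post.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Route:
    prefix: str
    provider: str
    local_pref: int
    as_path_len: int


def best_route(routes):
    # Simplified: highest local-pref wins, shorter AS path breaks ties.
    return max(routes, key=lambda r: (r.local_pref, -r.as_path_len))


def apply_workaround(routes, affected_prefixes, bad_provider, lowered_pref=50):
    # Lower local-pref only for the affected prefixes learned from bad_provider.
    return [
        Route(r.prefix, r.provider, lowered_pref, r.as_path_len)
        if r.provider == bad_provider and r.prefix in affected_prefixes
        else r
        for r in routes
    ]


# Hypothetical routing table: same prefix learned from two providers.
routes = [
    Route("74.125.226.0/24", "cogent", 100, 3),
    Route("74.125.226.0/24", "provider-b", 100, 4),
]

before = best_route(routes).provider
after = best_route(
    apply_workaround(routes, {"74.125.226.0/24"}, "cogent")
).provider
print(before, "->", after)
```

Because local-pref is compared before AS-path length, the longer path through the other provider wins as soon as the Cogent route's local-pref is lowered, and only for the prefixes listed.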
On (2011-11-23 09:41 -0500), Mark Radabaugh wrote:
The question is: How does a router break in this manner? It appears to unintentionally be doing something different with traffic based on the source address, not the destination address. I realize this can be done intentionally - but that is not the case here (unless somebody isn't telling me something).
I don't think we can determine that it has anything to do with source address based on data shown. 38.104.148.5 could very well be 6500 and somehow broken adjacency to 74.125.226.6, perhaps hardware adjacency having MTU of 0B, causing punt which is rate-limited by different policer than TTL exceeded policer.
--
  ++ytti
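To make the punt-policer hypothesis above concrete, here is a rough sketch of a token-bucket rate limiter applied to packets that leave the hardware path. It is not how any particular platform implements its policers; the class name, the rate of 10 pps, and the burst of 5 packets are assumptions picked only to show the effect. Integer milli-token arithmetic is used so the result does not depend on floating-point rounding.

```python
class PuntPolicer:
    """Token bucket in integer units: 1000 milli-tokens buy one packet."""

    COST = 1000

    def __init__(self, rate_pps, burst_packets):
        self.rate_pps = rate_pps
        self.capacity = burst_packets * self.COST
        self.tokens = self.capacity
        self.last_ms = 0

    def allow(self, now_ms):
        # rate_pps packets/s equals rate_pps milli-tokens per millisecond.
        refill = (now_ms - self.last_ms) * self.rate_pps
        self.tokens = min(self.capacity, self.tokens + refill)
        self.last_ms = now_ms
        if self.tokens >= self.COST:
            self.tokens -= self.COST
            return True
        return False


def simulate(offered_pps, rate_pps, burst_packets, seconds):
    """Return (forwarded, sent) for evenly spaced punted packets."""
    policer = PuntPolicer(rate_pps, burst_packets)
    interval_ms = 1000 // offered_pps  # assumes offered_pps divides 1000
    sent = offered_pps * seconds
    forwarded = sum(policer.allow(i * interval_ms) for i in range(sent))
    return forwarded, sent


# Made-up numbers: policer at 10 pps, traffic offered below and above it.
for offered in (5, 20):
    forwarded, sent = simulate(offered, rate_pps=10, burst_packets=5, seconds=10)
    print(f"offered {offered} pps: {forwarded}/{sent} forwarded")
```

With these made-up numbers, traffic below the policed rate passes untouched, while traffic at twice the rate loses about half its packets once the burst is spent. That is the kind of steady partial loss a rate-limited punt path would produce, regardless of who the sender is.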
2011/11/23 Saku Ytti <saku@ytti.fi>
On (2011-11-23 09:41 -0500), Mark Radabaugh wrote:
The question is: How does a router break in this manner? It appears to unintentionally be doing something different with traffic based on the source address, not the destination address. I realize this can be done intentionally - but that is not the case here (unless somebody isn't telling me something).
I don't think we can determine that it has anything to do with source address based on data shown. 38.104.148.5 could very well be 6500 and somehow broken adjacency to 74.125.226.6, perhaps hardware adjacency having MTU of 0B, causing punt which is rate-limited by different policer than TTL exceeded policer.
Agree. I've seen similar effects with a different ISP who had one side of an ether-channel go south without the port showing down. Stuff hashed over the good link was fine, stuff hashed over the bad link wasn't. Led to some painful support calls from customers. I agree this list is a haven of speculation and OT comments. In order to avoid making a bad problem worse you should probably contact Cogent.
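The ether-channel failure described above can be sketched as follows. Bundles typically pick a member link per flow by hashing header fields, so a silently broken member breaks some flows every time and leaves others untouched. The CRC32-of-addresses hash here is only a stand-in (real hardware uses its own hash and field selection), and the 192.0.2.x source addresses are hypothetical.

```python
import zlib


def member_link(src, dst, n_links):
    # Stand-in hash: real platforms use their own algorithm and fields.
    key = f"{src}>{dst}".encode()
    return zlib.crc32(key) % n_links


def flow_outcome(src, dst, n_links, bad_links):
    # A flow always lands on the same member, so its fate never changes.
    return "lost" if member_link(src, dst, n_links) in bad_links else "ok"


# Hypothetical sources, two-member bundle, member 1 silently dropping traffic.
flows = [(f"192.0.2.{host}", "74.125.226.6") for host in range(1, 9)]
for src, dst in flows:
    print(src, "->", dst, flow_outcome(src, dst, n_links=2, bad_links={1}))
```

Running it repeatedly gives the same outcome for each source/destination pair, which is why this kind of fault can look like the router treating traffic differently based on address.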
On 11/23/11 11:41 AM, Keegan Holley wrote:
2011/11/23 Saku Ytti <saku@ytti.fi>
On (2011-11-23 09:41 -0500), Mark Radabaugh wrote:
The question is: How does a router break in this manner? It appears to unintentionally be doing something different with traffic based on the source address, not the destination address. I realize this can be done intentionally - but that is not the case here (unless somebody isn't telling me something).
I don't think we can determine that it has anything to do with source address based on data shown. 38.104.148.5 could very well be 6500 and somehow broken adjacency to 74.125.226.6, perhaps hardware adjacency having MTU of 0B, causing punt which is rate-limited by different policer than TTL exceeded policer.
Agree. I've seen similar effects with a different ISP who had one side of an ether-channel go south without the port showing down. Stuff hashed over the good link was fine, stuff hashed over the bad link wasn't. Led to some painful support calls from customers. I agree this list is a haven of speculation and OT comments. In order to avoid making a bad problem worse you should probably contact Cogent.
It's fixed at this point. You are correct in that it was quite painful getting this escalated far enough to get it fixed. The tools that are available (at least that I know of) to try to prove the issue to level 1 and 2 support just don't get the job done. It's the eternal problem of convincing L1/2 support that you really have a problem not of your own making.
Mark
On 11/23/11 11:33 AM, Saku Ytti wrote:
On (2011-11-23 09:41 -0500), Mark Radabaugh wrote:
The question is: How does a router break in this manner? It appears to unintentionally be doing something different with traffic based on the source address, not the destination address. I realize this can be done intentionally - but that is not the case here (unless somebody isn't telling me something).
I don't think we can determine that it has anything to do with source address based on data shown. 38.104.148.5 could very well be 6500 and somehow broken adjacency to 74.125.226.6, perhaps hardware adjacency having MTU of 0B, causing punt which is rate-limited by different policer than TTL exceeded policer.
I was told the router was reloaded to resolve a CEF issue. Not sure what was wrong with 'clear cef linecard'.
--
Mark Radabaugh
Amplex
mark@amplex.net  419.837.5015
-----Original Message-----
From: Mark Radabaugh [mailto:mark@amplex.net]
Sent: 23 November 2011 16:53
To: NANOG list
Subject: Re: Odd router brokenness
On 11/23/11 11:33 AM, Saku Ytti wrote:
On (2011-11-23 09:41 -0500), Mark Radabaugh wrote:
The question is: How does a router break in this manner? It appears to unintentionally be doing something different with traffic based on the source address, not the destination address. I realize this can be done intentionally - but that is not the case here (unless somebody isn't telling me something).
I don't think we can determine that it has anything to do with source address based on data shown. 38.104.148.5 could very well be 6500 and somehow broken adjacency to 74.125.226.6, perhaps hardware adjacency having MTU of 0B, causing punt which is rate-limited by different policer than TTL exceeded policer.
I was told the router was reloaded to resolve a CEF issue. Not sure what was wrong with 'clear cef linecard'.
Now *that* brings back memories!
--
Leigh
On (2011-11-23 11:45 -0500), Mark Radabaugh wrote:
I was told the router was reloaded to resolve a CEF issue. Not sure what was wrong with 'clear cef linecard'.
Or just fixing the broken prefixes/adjacencies and opening CTAC case about what was wrong with them.
http://www.quickmeme.com/meme/35cet6/
--
  ++ytti
participants (4)
- Keegan Holley
- Leigh Porter
- Mark Radabaugh
- Saku Ytti