This is just a WAG but what the hell. Jon Lewis wrote:
I've got this private line DS3. It connects cisco 7206 routers in Orlando (at our data center) and in Ocala (a colo rack in the Embarq CO).
According to the DLR, it's a real circuit, various portions of it ride varying sized OC circuits, and then it's handed off to us at each end the usual way (copper/coax) and plugged into PA-2T3 cards.
Are you sure that they are not crossing some channels in the middle and accidentally handing them to a different customer? You mention above that various portions of the DS3 ride different transport circuits in the middle. That always creates the potential for someone to not put it back together correctly on either end. I've seen DLCs get crossed before. I could easily see a transport provider crossing portions of a circuit, especially if they break it into pieces in the middle and have to put it back together on the ends.
I think it makes sense too. Somebody's getting traffic off a T1 that isn't destined for them. Their router sees it, says WTF and sends an ICMP dest unreachable via their default route through Sprint. Same thing goes for a traceroute; it simply follows its default route to reply to your packets with the expiring TTL. Taking a path through a different provider would be expected since it doesn't have a connected route to the source of the traceroute (since it's not the far end of your T1 that you're expecting). The site getting your crossed T1 could be using the T1 as a PtP to a branch office and has Internet through a different circuit that hasn't been hosed.
I would be curious to hear if Sprint is having any problems with a circuit connected to sl-bb20-dc-6-0-0.sprintlink.net, what the router is and if any directly connected customers are having T1 problems. If nothing else, Sprint should be able to track down the source of the traceroute return packets and contact the customer. The T1 could be part of a bundle at their site and they may not even realize that the bundle dropped a path.
Last Tuesday, at about 2:30PM, "something bad happened." We saw a serious jump in traffic to Ocala, and in particular we noticed one customer's connection (a group of load sharing T1s) was just totally full. We quickly assumed it was a DDoS aimed at that customer, but looking at the traffic, we couldn't pinpoint anything that wasn't expected flows.
Are you sure that the traffic being received by each of the T1s is theirs? Do you have any way of getting flows or packets off of individual T1s and not the bundle as a whole? Tracing through you to your upstream...
 7 andc-br-3-f2-0.atlantic.net (209.208.9.138) 47.951 ms 56.096 ms 56.154 ms
 8 ocalflxa-br-1-s1-0.atlantic.net (209.208.112.98) 56.199 ms 56.320 ms 56.196 ms
 9 * * *
Circuit gets crossed onto the wrong customer. Wrong site receives a packet with an expiring TTL and goes to send a reply. Destination IP isn't on a connected route so the site sends the reply via its default route on Sprint.
10 sl-bb20-dc-6-0-0.sprintlink.net (144.232.8.174) 80.774 ms 81.030 ms 81.821 ms
11 sl-st20-ash-10-0.sprintlink.net (144.232.20.152) 75.731 ms 75.902 ms 77.128 ms
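To make the routing step concrete, here is a rough Python sketch of the lookup I'm imagining on the wrong site's router. The prefixes and the source address are made up (documentation ranges), and the route labels are only illustrative; I have no idea what that router's table really looks like.

```python
import ipaddress

# Hypothetical routing table for the site that received the crossed T1.
# Both prefixes and labels are assumptions for illustration only.
ROUTES = [
    ("192.0.2.0/30", "connected: crossed T1"),
    ("0.0.0.0/0", "default via Sprint"),
]

def pick_route(dest, routes):
    """Return (prefix, next_hop) for the longest prefix that contains dest."""
    addr = ipaddress.ip_address(dest)
    matches = []
    for prefix, next_hop in routes:
        net = ipaddress.ip_network(prefix)
        if addr in net:
            matches.append((net, next_hop))
    # Longest-prefix match: the most specific route wins.
    best = max(matches, key=lambda m: m[0].prefixlen)
    return str(best[0]), best[1]

# The traceroute source (hypothetical address) is not on the connected
# /30, so the TTL-exceeded reply falls through to the default route.
print(pick_route("198.51.100.10", ROUTES))
```

With no connected route covering the source, the only match is the default, which is why the reply would head out through whatever provider that site uses for Internet.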
Reply traverses Sprint to L3 and on to you.
12 te-10-1-0.edge2.Washington4.level3.net (4.68.63.209) 46.548 ms 53.200 ms 45.736 ms
13 vlan69.csw1.Washington1.Level3.net (4.68.17.62) 42.918 ms vlan79.csw2.Washington1.Level3.net (4.68.17.126) 55.438 ms vlan69.csw1.Washington1.Level3.net (4.68.17.62) 42.693 ms
14 ae-81-81.ebr1.Washington1.Level3.net (4.69.134.137) 48.935 ms ae-61-61.ebr1.Washington1.Level3.net (4.69.134.129) 49.317 ms ae-91-91.ebr1.Washington1.Level3.net (4.69.134.141) 48.865 ms
15 ae-2.ebr3.Atlanta2.Level3.net (4.69.132.85) 59.642 ms 56.278 ms 56.671 ms
16 ae-61-60.ebr1.Atlanta2.Level3.net (4.69.138.2) 47.401 ms 62.980 ms 62.640 ms
17 ae-1-8.bar1.Orlando1.Level3.net (4.69.137.149) 40.300 ms 40.101 ms 42.690 ms
18 ae-6-6.car1.Orlando1.Level3.net (4.69.133.77) 40.959 ms 40.963 ms 41.016 ms
19 unknown.Level3.net (63.209.98.66) 246.744 ms 240.826 ms 239.758 ms
20 andc-br-3-f2-0.atlantic.net (209.208.9.138) 39.725 ms 37.751 ms 42.262 ms
21 ocalflxa-br-1-s1-0.atlantic.net (209.208.112.98) 43.524 ms 45.844 ms 43.392 ms
22 * * *
23 sl-bb20-dc-6-0-0.sprintlink.net (144.232.8.174) 63.752 ms 61.648 ms 60.839 ms
24 sl-st20-ash-10-0.sprintlink.net (144.232.20.152) 66.923 ms 65.258 ms 70.609 ms
25 te-10-1-0.edge2.Washington4.level3.net (4.68.63.209) 67.106 ms 93.415 ms 73.932 ms
26 vlan99.csw4.Washington1.Level3.net (4.68.17.254) 88.919 ms 75.306 ms vlan79.csw2.Washington1.Level3.net (4.68.17.126) 75.048 ms
27 ae-61-61.ebr1.Washington1.Level3.net (4.69.134.129) 69.508 ms 68.401 ms ae-71-71.ebr1.Washington1.Level3.net (4.69.134.133) 79.128 ms
28 ae-2.ebr3.Atlanta2.Level3.net (4.69.132.85) 64.048 ms 67.764 ms 67.704 ms
29 ae-71-70.ebr1.Atlanta2.Level3.net (4.69.138.18) 68.372 ms 67.025 ms 68.162 ms
30 ae-1-8.bar1.Orlando1.Level3.net (4.69.137.149) 65.112 ms 65.584 ms 65.525 ms
I can't explain the continuous loop or the dupes. I'm not sure if my theory fits those symptoms or not.
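If it helps when comparing traces, the loop can at least be spotted mechanically. Below is a small Python sketch that lists every address appearing at more than one hop. The sample is an excerpt of the trace above, trimmed to one probe per hop with hops 11-19 left out; the regex assumes the usual "hop name (ip) rtt" layout and only looks at the first responder on each line.

```python
import re

# Matches "<hop> <hostname> (<ipv4>)" at the start of a traceroute line.
HOP_RE = re.compile(r"^\s*(\d+)\s+\S+\s+\((\d+\.\d+\.\d+\.\d+)\)")

def find_repeats(trace_text):
    """Map each IP seen at more than one hop to the hop numbers it appeared at."""
    seen = {}
    for line in trace_text.splitlines():
        m = HOP_RE.match(line)
        if m:  # "* * *" lines do not match and are skipped
            seen.setdefault(m.group(2), []).append(int(m.group(1)))
    return {ip: hops for ip, hops in seen.items() if len(hops) > 1}

# Excerpt of the trace above (one probe per hop, hops 11-19 omitted).
TRACE = """\
 7 andc-br-3-f2-0.atlantic.net (209.208.9.138) 47.951 ms
 8 ocalflxa-br-1-s1-0.atlantic.net (209.208.112.98) 56.199 ms
 9 * * *
10 sl-bb20-dc-6-0-0.sprintlink.net (144.232.8.174) 80.774 ms
20 andc-br-3-f2-0.atlantic.net (209.208.9.138) 39.725 ms
21 ocalflxa-br-1-s1-0.atlantic.net (209.208.112.98) 43.524 ms
22 * * *
23 sl-bb20-dc-6-0-0.sprintlink.net (144.232.8.174) 63.752 ms
"""

for ip, hops in find_repeats(TRACE).items():
    print(ip, hops)
```

In this excerpt each repeated address comes back 13 hops later, which is the loop shape visible in the full trace.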
Our circuit provider's support people have basically just maintained that this behavior isn't possible and so there's nothing they can do about it. i.e. that the problem has to be something other than the circuit.
Can you have them put the circuit into maintenance and have them test it end to end? They can't deny it when their TDR says that there's a problem.
I got tired of talking to their brick wall, so I contacted Sprint and was able to confirm with them that the traffic in question really was inexplicably appearing on their network...and not terribly close geographically to the Orlando/Ocala areas.
Which supports my theory of a crossed circuit. Crossing a DS1 onto the wrong DS3 or OCx could easily make it pop up anywhere. Somewhere is another site that's having T1 problems.
So, I have a circuit that's bleeding duplicate packets onto an unrelated IP network, a circuit provider who's got their head in the sand and keeps telling me "this can't happen, we can't help you", and customers who were getting tired of receiving all their packets in triplicate (or more) saturating their connections and confusing their applications. After a while, I had to give up on finding the problem and focus on just making it stop. After trying a couple of things, the solution I found was to change the encapsulation we use at each end of the DS3. I haven't gotten confirmation of this from Sprint, but I assume they're now seeing massive input errors on the one or more circuits where our packets were/are appearing. The important thing (for me) is that this makes the packets invalid to Sprint's routers and so it keeps them from forwarding the packets to us. Cisco TAC finally got back to us the day after I "fixed" the circuit...but since it was obviously not a problem with our cisco gear, I haven't pursued it with them.
Right. By changing the encap you've basically killed the circuit. With that T1 effectively down on your end you won't be sending any packets down the problem path and aren't able to see that problem anymore with your traceroutes. However, your customer with the bundle of T1s is down a circuit. It makes sense in my mind that it's simply a crossed circuit in the middle. Your transport provider for whatever reason pulled out a DS1 and sent it down a different path. They accidentally crossed DS1s in the middle and are handing your DS1 to a Sprint customer and their DS1 to your customer. That's my theory at least.
Justin