cross connect reliability
All,
Today I had yet another cross-connect fail at our colo provider. From memory, this is the 6th cross-connect to fail while in service in 4 years, and recently there was a bad SFP on their end as well. This seems like a high failure rate to me. When I asked about the high failure rate, they said that they run a lot of cables and there is a lot of jiggling and wiggling... lots of chances to get bent out of whack from activity near my patches and cables. Until a few years ago my time was spent mostly in single-tenant data centers, and it may be true that we made fewer cabling changes and less of a ruckus when cabling... but this still seems like a pretty high failure rate at the colo. I am curious: what do you expect the average reliability of your FastE or GigE copper cross-connects at a colo?

Thanks,
Mike

--
************************************************************
Michael J. McCafferty
Principal, Security Engineer
M5 Hosting
http://www.m5hosting.com

You can have your own custom Dedicated Server up and running today!
RedHat Enterprise, CentOS, Ubuntu, Debian, OpenBSD, FreeBSD, and more
************************************************************
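For a rough sense of scale, a back-of-the-envelope failures-per-circuit-year calculation helps frame the question. The circuit count below is purely an assumed number for illustration; the thread never says how many cross-connects M5 has in service.

failures = 6            # cross-connect failures reported over the period
years = 4
circuits = 10           # ASSUMED number of in-service cross-connects
rate = failures / (circuits * years)
print(f"{rate:.3f} failures per cross-connect-year")
print(f"roughly one failure every {1 / rate:.1f} circuit-years")

With those assumed numbers the rate works out to 0.15 failures per cross-connect-year, or about one failure every 6.7 circuit-years; the replies below generally argue that passive copper should do far better than that.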
Michael J McCafferty wrote:
Today I had yet another cross-connect fail at our colo provider. [...] what do you expect the average reliability of your FastE or GigE copper cross-connects at a colo?
Never to fail? Seriously; if you're talking about a passive connection (optical or electrical) like a patch panel, I'd expect it to keep going forever unless someone damages it. ~Seth
Seth Mattinen wrote:
Michael J McCafferty wrote:
Today I had yet another cross-connect fail at our colo provider. [...] what do you expect the average reliability of your FastE or GigE copper cross-connects at a colo?
Never to fail? Seriously; if you're talking about a passive connection (optical or electrical) like a patch panel, I'd expect it to keep going forever unless someone damages it.
That's truly wishful thinking, as are the assumptions that insulate it from damaging factors. Nothing lasts forever. -- Alex Balashov - Principal Evariste Systems Web : http://www.evaristesys.com/ Tel : (+1) (678) 954-0670 Direct : (+1) (678) 954-0671
Alex Balashov wrote:
Seth Mattinen wrote:
Michael J McCafferty wrote:
Today I had yet another cross-connect fail at our colo provider. [...] what do you expect the average reliability of your FastE or GigE copper cross-connects at a colo?
Never to fail? Seriously; if you're talking about a passive connection (optical or electrical) like a patch panel, I'd expect it to keep going forever unless someone damages it.
That's truly wishful thinking, as are the assumptions that insulate it from damaging factors. Nothing lasts forever.
What the OP is describing is abnormally high in my view. Based purely on my own personal experience, the structured wiring I put in at my parents' house in the mid-90s has never suffered a failure, is still in use today, and it's in a residential environment with dogs and cats. I'd expect a properly managed environment to fare at least as well as that. ~Seth
[lots of stuff deleted]

We've seen cross-connects fail at sites like "E" and others. Generally speaking, it is a human-error issue and not a component-failure one. Either people are being sloppy and aren't reading labels, or the labels aren't there. In a cabinet situation, every cabinet does not necessarily home back to its own patch panel, so some trashing may occur -- it can be avoided with good design [cables in the back stay there, etc].

When you are talking about optics failing and they are providing "smart" cross-connects, almost anything is possible. The true telltale is whether you have to call when the cross-connect goes down, or if it just "bounces". Either way, have them take you to their cross-connect room and show you their mess. Once you see it, you'll know what to expect going forward.

Deepak
On 09/17/2009 06:37 PM, Deepak Jain wrote:
[lots of stuff deleted].
A famous one that can happen with some techs is that they make jumpers from solid wire with generic RJ45 plugs (yes, I've seen this recently from several folks who should know better). These will last somewhere around a year (long enough to forget when they were installed), then randomly fail from just fan vibration or slight breezes. There are RJ45 plugs made for solid wire (they have 3 little prongs instead of 2, offset to straddle the wire), but I feel that even these can go bad. I know that if the techs are properly educated this "will never happen" (tm)... (till someone needs a custom-length jumper on a Sunday...) (for which, one colo building has an Ace Hardware with most of the right stuff, but unfortunately most don't).

As we all (should) know, all solid-wire cable should terminate in a panel, with proper short jumpers (preferably with molded strain relief) used for the rest.

-- Pete
On Sep 17, 2009, at 5:52 PM, Seth Mattinen wrote:
Michael J McCafferty wrote:
Today I had yet another cross-connect fail at our colo provider. [...] what do you expect the average reliability of your FastE or GigE copper cross-connects at a colo?
Never to fail? Seriously; if you're talking about a passive connection (optical or electrical) like a patch panel, I'd expect it to keep going forever unless someone damages it.
Or until someone pulls out the wrong cable (which has happened to me). Regards Marshall
~Seth
Marshall Eubanks wrote:
On Sep 17, 2009, at 5:52 PM, Seth Mattinen wrote:
Michael J McCafferty wrote:
Today I had yet another cross-connect fail at our colo provider. [...] what do you expect the average reliability of your FastE or GigE copper cross-connects at a colo?
Never to fail? Seriously; if you're talking about a passive connection (optical or electrical) like a patch panel, I'd expect it to keep going forever unless someone damages it.
Or until someone pulls out the wrong cable (which has happened to me).
That's not a failure though. It's a disconnection. It happens but is readily attributable to a cause. Random failures of a single port's connectivity.... bizarre and annoying. Whole switches? Seen it. Whole panels? Seen it. Whole blades? Seen it. Single port on a switch or patch panel? Never.
On Thu, Sep 17, 2009 at 03:35:37PM -0700, Charles Wyble wrote:
Random failures of a single port's connectivity.... bizarre and annoying. Whole switches? Seen it. Whole panels? Seen it. Whole blades? Seen it.
Single port on a switch or patch panel? Never.
You've never seen a single port go bad on a switch? I can't even count the number of times I've seen that happen. Not that I'm suggesting the OP wasn't the victim of human error like someone unplugging the wrong port and just lying to him; that happens even more.

My favorite bizarre random failure story is a toss-up between one of these two:

Story 1. Had a customer report that they weren't able to transfer this one particular file over their connection. The transfer would start and then at a certain point the TCP session would just lock up. After a lot of head scratching, it turned out that for 8 ports on a 24-port FastE switch blade, a certain combination of bytes caused the packet to be dropped on this otherwise perfectly normal and functioning card, thus stalling the TCP session while leaving everything around it unaffected. If you moved them to a different port outside this group of 8, or used https, or uuencoded the file, it would go through fine.

Story 2. Had a customer report that they were getting extremely slow transfers to another network, despite not being able to find any packet loss. Shifting the traffic to a different port to reach the same network resolved the problem. After removing the traffic and attempting to ping the far side, I got the following:

<drop>
64 bytes from x.x.x.x: icmp_seq=1 ttl=61 time=0.194 ms
64 bytes from x.x.x.x: icmp_seq=2 ttl=61 time=0.196 ms
64 bytes from x.x.x.x: icmp_seq=3 ttl=61 time=0.183 ms
64 bytes from x.x.x.x: icmp_seq=0 ttl=61 time=4.159 ms
<drop>
64 bytes from x.x.x.x: icmp_seq=5 ttl=61 time=0.194 ms
64 bytes from x.x.x.x: icmp_seq=6 ttl=61 time=0.196 ms
64 bytes from x.x.x.x: icmp_seq=7 ttl=61 time=0.183 ms
64 bytes from x.x.x.x: icmp_seq=4 ttl=61 time=4.159 ms

After a little bit more testing, it turned out that every 4th packet being sent to the peer's router was being queued until another "4th packet" would come along and knock it out. If you increased the interval time of the ping, you would see the amount of time the packet spent in the queue increase. At one point I had it up to over 350 seconds (not milliseconds) that the packet stayed in the other router's queue before that 4th packet came along and knocked it free. I suspect it could have gone higher, but random scanning traffic on the internet was coming in. When there was a lot of traffic on the interface you would never see the packet loss, just reordering of every 4th packet and thus slow TCP transfers. :)

--
Richard A Steenbergen <ras@e-gerbil.net>       http://www.e-gerbil.net/ras
GPG Key ID: 0xF8B12CBC (7535 7F59 8204 ED1F CC1C 53AF 4C41 5ECA F8B1 2CBC)
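A minimal sketch of the kind of slow-ping test described above: send widely spaced pings and flag out-of-order icmp_seq values or RTT outliers. The target address, packet count, and thresholds are placeholders rather than values from the thread, and it assumes a Unix ping whose output matches the lines quoted above.

import re
import subprocess

HOST = "192.0.2.1"   # placeholder; use the far-side address being tested
COUNT = "20"

# Slow pings make a queued packet stand out; parse icmp_seq and RTT.
out = subprocess.run(["ping", "-i", "1", "-c", COUNT, HOST],
                     capture_output=True, text=True).stdout
pings = [(int(m.group(1)), float(m.group(2)))
         for m in re.finditer(r"icmp_seq=(\d+).*time=([\d.]+) ms", out)]

highest = -1
for seq, rtt in pings:
    if seq < highest:
        print(f"reordered: icmp_seq={seq} arrived after icmp_seq={highest}")
    highest = max(highest, seq)

if pings:
    rtts = sorted(r for _, r in pings)
    median = rtts[len(rtts) // 2]
    for seq, rtt in pings:
        if rtt > 10 * median and rtt > 1.0:   # arbitrary outlier threshold
            print(f"icmp_seq={seq} RTT {rtt:.3f} ms looks like it sat in a queue")

Lengthening the -i interval, as in the story, makes the stuck packet's queue time grow and the pattern easier to see.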
In message <20090917234547.GT51443@gerbil.cluepon.net>, Richard A Steenbergen writes:
On Thu, Sep 17, 2009 at 03:35:37PM -0700, Charles Wyble wrote:
Random failures of a single port's connectivity.... bizarre and annoying. Whole switches? Seen it. Whole panels? Seen it. Whole blades? Seen it.
Single port on a switch or patch panel? Never.
You've never seen a single port go bad on a switch? I can't even count the number of times I've seen that happen. Not that I'm suggesting the OP wasn't the victim of human error like someone unplugging the wrong port and just lying to him; that happens even more.
My favorite bizarre random failure story is a toss-up between one of these two:
Story 1. Had a customer report that they weren't able to transfer this one particular file over their connection. The transfer would start and then at a certain point the tcp session would just lock up. After a lot of head scratching, it turned out that for 8 ports on a 24 port FastE switch blade, this certain combination of bytes caused the packet to be dropped on this otherwise perfectly normal and functioning card, thus stalling the tcp session while leaving everything around it unaffected. If you moved them to a different port outside this group of 8, or used https, or uuencoded it, it would go through fine.
Seen that more than once. It's worse when it's in some router on the other side of the planet and you're just a lowly customer.
-- Mark Andrews, ISC 1 Seymour St., Dundas Valley, NSW 2117, Australia PHONE: +61 2 9871 4742 INTERNET: marka@isc.org
On Sep 17, 2009, at 7:45 PM, Richard A Steenbergen wrote: [ SNIP ]
Story 2. Had a customer report that they were getting extremely slow transfers to another network, despite not being able to find any packet loss. Shifting the traffic to a different port to reach the same network resolved the problem. After removing the traffic and attempting to ping the far side, I got the following:
<drop>
64 bytes from x.x.x.x: icmp_seq=1 ttl=61 time=0.194 ms
64 bytes from x.x.x.x: icmp_seq=2 ttl=61 time=0.196 ms
64 bytes from x.x.x.x: icmp_seq=3 ttl=61 time=0.183 ms
64 bytes from x.x.x.x: icmp_seq=0 ttl=61 time=4.159 ms
<drop>
64 bytes from x.x.x.x: icmp_seq=5 ttl=61 time=0.194 ms
64 bytes from x.x.x.x: icmp_seq=6 ttl=61 time=0.196 ms
64 bytes from x.x.x.x: icmp_seq=7 ttl=61 time=0.183 ms
64 bytes from x.x.x.x: icmp_seq=4 ttl=61 time=4.159 ms
After a little bit more testing, it turned out that every 4th packet being sent to the peer's router was being queued until another "4th packet" would come along and knock it out. If you increased the interval time of the ping, you would see the amount of time the packet spent in the queue increase. At one point I had it up to over 350 seconds (not milliseconds) that the packet stayed in the other router's queue before that 4th packet came along and knocked it free. I suspect it could have gone higher, but random scanning traffic on the internet was coming in. When there was a lot of traffic on the interface you would never see the packet loss, just reordering of every 4th packet and thus slow TCP transfers. :)
Story 1:
-----------
I had a router where I was suddenly unable to reach certain hosts on the (/24) ethernet interface -- pinging from the router worked fine, transit traffic wouldn't. I decided to try and figure out if there was any sort of rhyme or reason to which hosts had gone unreachable. I could successfully reach:

xxx.yyy.zzz.1
xxx.yyy.zzz.2
xxx.yyy.zzz.3
xxx.yyy.zzz.5
xxx.yyy.zzz.7
xxx.yyy.zzz.11
xxx.yyy.zzz.13
xxx.yyy.zzz.17
...
xxx.yyy.zzz.197
xxx.yyy.zzz.199

There were only 200 hosts on the LAN, but I'd bet dollars to donuts that I know what the next reachable one would have been if there had been more. Unfortunately the box rebooted itself (when I tried to view the FIB) before I could collect more info.

Story 2:
----------
Had a small router connecting a remote office over a multilink PPP[1] interface (4xE1). Site starts getting massive packet loss, so I figure one of the circuits has gone bad but didn't get removed from the bundle. I'm having a hard time reaching the remote side, so I pull the interfaces from protocols and try to ping the remote router -- no replies.... Luckily I didn't hit Ctrl-C on the ping, because suddenly I start getting replies with no drops:

64 bytes from x.x.x.x: icmp_seq=1 ttl=120 time=30132.148 ms
64 bytes from x.x.x.x: icmp_seq=2 ttl=120 time=30128.178 ms
64 bytes from x.x.x.x: icmp_seq=3 ttl=120 time=30133.231 ms
64 bytes from x.x.x.x: icmp_seq=4 ttl=120 time=30112.571 ms
64 bytes from x.x.x.x: icmp_seq=5 ttl=120 time=30132.632 ms

What?! I figure it's gotta be MLPPP stupidity and / or depref of ICMP, so I connect OOB and A: remove MLPPP and use just a single interface, and B: start pinging a host behind the router instead...

64 bytes from x.x.x.x: icmp_seq=1 ttl=120 time=30142.323 ms
64 bytes from x.x.x.x: icmp_seq=2 ttl=120 time=30144.571 ms
64 bytes from x.x.x.x: icmp_seq=3 ttl=120 time=30141.632 ms
64 bytes from x.x.x.x: icmp_seq=4 ttl=120 time=30142.420 ms
64 bytes from x.x.x.x: icmp_seq=5 ttl=120 time=30159.706 ms

I fire up tcpdump and try ssh to a host on the remote side -- I see the SYN leave my machine and then, 30 *seconds* later, I get back a SYN-ACK. I change the queuing on the interface from FIFO to something else and the problem goes away. I change the queuing back to FIFO and it's a 30-second RTT again. Somehow it seems to be buffering as much traffic as it can (and anything more than one copy of ping running, or ping with anything larger than the default packet size, makes it start dropping badly). I ran "show buffers" to try to get more of an idea of what was happening, but it didn't like that and reloaded. Came back up fine though...

Story 3:
----------
Running a network that had a large number of L3 switches from a vendor (let's call them "X") in a single OSPF area. This area also contained a large number of poor-quality international circuits that would flap often, so there was *lots* of churn. Apparently this vendor X's OSPF implementation didn't much like this and so would become unhappy. The way it would express its displeasure was by corrupting a pointer to / in the LSDB so it was off by one, and you'd get:

Nov 24 22:23:53.633 EST: %OSPF-4-BADLSAMASK: Bad LSA mask: Type 5, LSID 0.9.32.5 Mask 10.160.8.0 from 10.178.255.252 NOTE: This route will not be installed in the routing table.
Nov 26 11:01:32.997 EST: %OSPF-4-BADLSAMASK: Bad LSA mask: Type 5, LSID 0.4.2.3 Mask 10.2.153.0 from 10.178.255.252 NOTE: This route will not be installed in the routing table.
Nov 27 23:14:00.660 EST: %OSPF-4-BADLSAMASK: Bad LSA mask: Type 5, LSID 0.4.2.3 Mask 10.2.153.0 from 10.178.255.252 NOTE: This route will not be installed in the routing table.

(This network was addressed out of 10/8 -- 10.178.255.252 is one of vendor X's boxes, and 10.160.8.0 is a valid subnet but, surprisingly enough, not a valid mask..... )

To make matters even more fun, the OSPF adjacency would go down and then come back up -- and the grumpy box would flood all of its (corrupt) LSAs...

W

[1]: Hey, not my idea...
-- "Real children don't go hoppity-skip unless they are on drugs." -- Susan, the ultimate sensible governess (Terry Pratchett, Hogfather)
Once upon a time, Warren Kumari <warren@kumari.net> said:
xxx.yyy.zzz.1
xxx.yyy.zzz.2
xxx.yyy.zzz.3
xxx.yyy.zzz.5
xxx.yyy.zzz.7
xxx.yyy.zzz.11
xxx.yyy.zzz.13
xxx.yyy.zzz.17
...
xxx.yyy.zzz.197
xxx.yyy.zzz.199
Oh, come on; everybody knows 1 doesn't belong in that list! :-) -- Chris Adams <cmadams@hiwaay.net> Systems and Network Administrator - HiWAAY Internet Services I don't speak for anybody but myself - that's enough trouble.
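For anyone who wants to check the pattern Warren and Chris are pointing at, a trivial primality test over the responding last octets does it; the list below is copied from Warren's post, and nothing else here comes from the thread.

# Last octets that still answered in Warren's story 1.
reachable = [1, 2, 3, 5, 7, 11, 13, 17, 197, 199]

def is_prime(n: int) -> bool:
    if n < 2:                 # 1 is not prime, hence the joke above
        return False
    return all(n % d for d in range(2, int(n ** 0.5) + 1))

for octet in reachable:
    print(f".{octet}: {'prime' if is_prime(octet) else 'not prime'}")

Every responding octet except .1 comes back prime, which is why the next reachable host on a bigger LAN would presumably have been .211.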
Richard A Steenbergen <ras@e-gerbil.net> writes:
You've never seen a single port go bad on a switch? I can't even count the number of times I've seen that happen. Not that I'm suggesting the OP wasn't the victim of human error like someone unplugging the wrong port and just lying to him; that happens even more.
I know it happens; it's happened to me, and I have probably touched fewer switches than you. Still, from what I can understand, it can be prevented/minimized by the use of a grounded port. from: http://support.3com.com/documents/switches/baseline/3Com-Switch-Family_Safet... "CAUTION: If you want to install the Switch using a Category 5E or Category 6 cable, 3Com recommends that you briefly connect the cable to a grounded port before you connect to the network equipment. If you do not, the cable’s electrostatic discharge (ESD) may damage the Switch's port. You can create a grounded port by connecting all wires at one end of a UTP cable to an earth ground point, and the other end to a female RJ-45 connector located, for example, on a Switch rack or patch panel. The RJ-45 connector is now a grounded port."
Luke S Crawford wrote:
Richard A Steenbergen <ras@e-gerbil.net> writes:
You've never seen a single port go bad on a switch? I can't even count the number of times I've seen that happen. Not that I'm suggesting the OP wasn't the victim of human error like someone unplugging the wrong port and just lying to him; that happens even more.
I know it happens; it's happened to me, and I have probably touched fewer switches than you. Still, from what I can understand, it can be prevented/minimized by the use of a grounded port.
from: http://support.3com.com/documents/switches/baseline/3Com-Switch-Family_Safet...
"CAUTION: If you want to install the Switch using a Category 5E or Category 6 cable, 3Com recommends that you briefly connect the cable to a grounded port before you connect to the network equipment. If you do not, the cable’s electrostatic discharge (ESD) may damage the Switch's port.
You can create a grounded port by connecting all wires at one end of a UTP cable to an earth ground point, and the other end to a female RJ-45 connector located, for example, on a Switch rack or patch panel. The RJ-45 connector is now a grounded port."
HP chassis switches ship with a grounding jack accessory you attach to the DB9 port (I assume it ties all RJ-45 pins to shield/ground) explicitly for this purpose. The instructions say to always plug a cable into the grounding device before connecting it to a switch port. ~Seth
On Thu, 2009-09-17 at 17:59 -0400, Marshall Eubanks wrote:
Or until someone pulls out the wrong cable (which has happened to me).
Not that I would know from experience, but it is rumored that certain telco techs in the NYC area can be persuaded to "borrow" other people's pairs for less than a hundred dollars. Richard Golodner
Not really. That's all too easy to diagnose and fix. Poorly terminated and/or mistreated cabling is far more likely. I wrote a long post about all the crap termination and poor treatment I've seen... but canceled the message.

On Thu, 17 Sep 2009, Mike Lieman wrote:
We have a winner!
On Thu, Sep 17, 2009 at 5:59 PM, Marshall Eubanks <tme@americafree.tv> wrote:
Or until someone pulls out the wrong cable (which has happened to me).
Regards Marshall
~Seth
----------------------------------------------------------------------
 Jon Lewis                   | I route
 Senior Network Engineer     | therefore you are
 Atlantic Net                |
_________ http://www.lewis.org/~jlewis/pgp for PGP public key_________
Because no-one is stealing pairs anymore? On Thu, Sep 17, 2009 at 7:23 PM, Jon Lewis <jlewis@lewis.org> wrote:
Not really. That's all too easy to diagnose and fix. Poorly terminated and or mistreated cabling is far more likely. I wrote a long post about all the crap termination and poor treatment I've seen...but canceled the message.
On Thu, 17 Sep 2009, Mike Lieman wrote:
We have a winner!
On Thu, Sep 17, 2009 at 5:59 PM, Marshall Eubanks <tme@americafree.tv> wrote:
Or until someone pulls out the wrong cable (which has happened to me).
Regards Marshall
~Seth
It's just not as interesting or hard to troubleshoot as a poorly made patch cable with one conductor that goes open only when the wire is tugged in a certain direction, nicked wires shorting, a switch port with its RX side burned out, or an RJ45 plug whose mistreated tab no longer works, so that although it looks inserted in the port, it's really just kind of hanging there not making full/good contact, etc. I would hope that in any data center, "stealing pairs" doesn't happen as much as any of the above, or as much as someone taking pairs the tech genuinely thought were dead.

On Thu, 17 Sep 2009, Mike Lieman wrote:
Because no-one is stealing pairs anymore?
On Thu, Sep 17, 2009 at 7:23 PM, Jon Lewis <jlewis@lewis.org> wrote:
Not really. That's all too easy to diagnose and fix. Poorly terminated and or mistreated cabling is far more likely. I wrote a long post about all the crap termination and poor treatment I've seen...but canceled the message.
On Thu, 17 Sep 2009, Mike Lieman wrote:
We have a winner!
On Thu, Sep 17, 2009 at 5:59 PM, Marshall Eubanks <tme@americafree.tv> wrote:
Or until someone pulls out the wrong cable (which has happened to me).
Regards Marshall
~Seth
----------------------------------------------------------------------
 Jon Lewis                   | I route
 Senior Network Engineer     | therefore you are
 Atlantic Net                |
_________ http://www.lewis.org/~jlewis/pgp for PGP public key_________
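One hedged way to catch the kind of intermittent fault Jon describes above before a customer does is to watch the port's error counters for movement. A minimal sketch using the net-snmp snmpget command; the switch address, community string, and ifIndex are placeholders rather than details from the thread, and it assumes SNMP read access to the switch.

import subprocess
import time

HOST = "192.0.2.1"       # placeholder switch address
COMMUNITY = "public"     # placeholder SNMP community
IF_INDEX = "10"          # placeholder ifIndex of the cross-connect port
OID_IF_IN_ERRORS = "1.3.6.1.2.1.2.2.1.14"   # IF-MIB::ifInErrors

def read_in_errors() -> int:
    # -Oqv prints just the value of the requested OID.
    out = subprocess.check_output(
        ["snmpget", "-v2c", "-c", COMMUNITY, "-Oqv", HOST,
         f"{OID_IF_IN_ERRORS}.{IF_INDEX}"], text=True)
    return int(out.strip())

previous = read_in_errors()
while True:
    time.sleep(60)
    current = read_in_errors()
    if current > previous:
        print(f"ifInErrors grew by {current - previous} in the last minute")
    previous = current

A cable that only goes open when tugged in a certain direction tends to show up as bursts of input errors long before it fails outright, which is the sort of movement this would flag.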
-----Original Message-----
From: Michael J McCafferty [mailto:mike@m5computersecurity.com]
Sent: Thursday, September 17, 2009 2:46 PM
To: nanog
Subject: cross connect reliability
Hello Michael:

Today I had yet another cross-connect fail at our colo provider. From memory, this is the 6th cross-connect to fail while in service in 4 years [...] what do you expect the average reliability of your FastE or GigE copper cross-connects at a colo?

Thanks, Mike
I agree with their Reason for Outage, but it sounds like a design issue. We prewire all of our switches to patch panels so they don't get touched once they're installed. The patch panels are much more friendly to insertions and removals than a 48-port 1U switch. We also have multiple connections on the fiber side to avoid those failures. With all of that, we still have failures, but their effect and frequency are minimized.

Mike

--
Michael K. Smith - CISSP, GISP
Chief Technical Officer - Adhost Internet LLC
mksmith@adhost.com
w: +1 (206) 404-9500 f: +1 (206) 404-9050
PGP: B49A DDF5 8611 27F3 08B9 84BB E61E 38C0 (Key ID: 0x9A96777D)
From: Michael J McCafferty <mike@m5computersecurity.com>
Organization: M5Hosting
Date: Thu, 17 Sep 2009 14:45:36 -0700
To: nanog <nanog@nanog.org>
Subject: cross connect reliability

Today I had yet another cross-connect fail at our colo provider. [...] what do you expect the average reliability of your FastE or GigE copper cross-connects at a colo?

Does the colo let anyone run cables, or do they have approved contractors? It sounds like a design issue to me in the way the cables are treated. In 4 years at a busy colo we have had one copper cross-connect not act right. It would pass data but was flaky. We replaced it because it was an easy run, just to rule it out. I am assuming you are in shared space. If so, I would investigate your weak points (which I am sure you already are doing).

Justin
On Thu, Sep 17, 2009 at 02:45:36PM -0700, Michael J McCafferty wrote:
All, Today I had yet another cross-connect fail at our colo provider. From memory, this is the 6th cross-connect to fail while in service in 4 years, and recently there was a bad SFP on their end as well. This seems like a high failure rate to me. When I asked about the high failure rate, they said that they run a lot of cables and there is a lot of jiggling and wiggling... lots of chances to get bent out of whack from activity near my patches and cables.
I once had a circuit go down because the fiber connector wasn't crimped on correctly, and the fiber pulled out of the connector while a tech was working in the cable tray nearby. After we opened a ticket about the issue, said tech "fixed" it by shoving the fiber back into the connector by hand, and walking away. Needless to say it went down again the next day. Names withheld to protect the guilty and keep them from raising my prices for heckling them in public, but the moral of the story is never underestimate the laziness or stupidity of the cable monkeys some of these places hire and let touch your routers. :)

--
Richard A Steenbergen <ras@e-gerbil.net>       http://www.e-gerbil.net/ras
GPG Key ID: 0xF8B12CBC (7535 7F59 8204 ED1F CC1C 53AF 4C41 5ECA F8B1 2CBC)
On Thu, 17 Sep 2009, Richard A Steenbergen wrote:
I once had a circuit go down because the fiber connector wasn't crimped on correctly, and the fiber pulled out of the connector while a tech was working in the cable tray nearby. After we opened a ticket about the issue, said tech "fixed" it by shoving the fiber back into the connector by hand, and walking away. Needless to say it went down again the next day. Names withheld to protect the guilty and keep them from raising my prices for heckling them in public, but the moral of the story is never underestimate the laziness or stupidity of the cable monkeys some of these places hire and let touch your routers. :)
In their defense, that was clearly the fastest way to fix it. :) Just not a very long-term solution.

----------------------------------------------------------------------
 Jon Lewis                   | I route
 Senior Network Engineer     | therefore you are
 Atlantic Net                |
_________ http://www.lewis.org/~jlewis/pgp for PGP public key_________
Having worked in high-traffic colo spaces around the world for the last ten years or so, in my experience this type of issue is very rare. If you are having this type of "quality" issue, I would sit down with your sales rep and ask to be stepped through their processes; there is obviously something that has gone VERY VERY WRONG.

Shane

On Sep 17, 2009, at 10:45 PM, Michael J McCafferty wrote:
Today I had yet another cross-connect fail at our colo provider. [...] what do you expect the average reliability of your FastE or GigE copper cross-connects at a colo?

Thanks, Mike
Michael J McCafferty <mike@m5computersecurity.com> 9/17/2009 5:45 PM >>>

Today I had yet another cross-connect fail at our colo provider. [...] what do you expect the average reliability of your FastE or GigE copper cross-connects at a colo?

We've never had a fiber CC fail. We HAVE had DS3 and T1s fail. Those were due to other customer circuits being installed near ours and bumping them.
On Thu, Sep 17, 2009 at 5:45 PM, Michael J McCafferty <mike@m5computersecurity.com> wrote: [ clip ]
I am curious; what do you expect the average reliability of your FastE or GigE copper cross-connects at a colo?
At the physical layer, near zero. If "jiggling and wiggling" is causing cable failures, they have bigger problems. The last time I had problems with "jiggle and wiggle" was with techs walking by DS3 xcon farms and "testing" whether the cables were "taut" enough with respect to the termination hardware. If you pull on them enough, they will come apart and they will fail. I'd love to see a picture of their xcon frames.

-M<

--
Martin Hannigan                               martin@theicelandguy.com
p: +16178216079
Power, Network, and Costs Consulting for Iceland Datacenters and Occupants
participants (21)
- Alex Balashov
- Brandon Palmer
- Charles Wyble
- Chris Adams
- Deepak Jain
- Jon Lewis
- Justin Wilson - MTIN
- Luke S Crawford
- Mark Andrews
- Marshall Eubanks
- Martin Hannigan
- Michael J McCafferty
- Michael K. Smith - Adhost
- Mike Lieman
- Pete Carah
- Richard A Steenbergen
- Richard Golodner
- Seth Mattinen
- Shane Ronan
- Valdis.Kletnieks@vt.edu
- Warren Kumari