cross connect reliability
All,
Today I had yet another cross-connect fail at our colo provider. From memory, this is the 6th cross-connect to fail while in service in 4 years, and recently there was a bad SFP on their end as well. This seems like a high failure rate to me. When I asked about the high failure rate, they said that they run a lot of cables and there is a lot of jiggling and wiggling... lots of chances to get bent out of whack from activity near my patches and cables. Until a few years ago my time was spent mostly in single-tenant data centers, and it may be true that we made fewer cabling changes and less of a ruckus when cabling... but this still seems like a pretty high failure rate at the colo. I am curious: what do you expect the average reliability of your FastE or GigE copper cross-connects at a colo?

Thanks,
Mike

--
************************************************************
Michael J. McCafferty
Principal, Security Engineer
M5 Hosting
http://www.m5hosting.com

You can have your own custom Dedicated Server up and running today!
RedHat Enterprise, CentOS, Ubuntu, Debian, OpenBSD, FreeBSD, and more
************************************************************
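For a rough sense of scale, a back-of-the-envelope failures-per-circuit-year calculation helps frame the question. The circuit count below is purely an assumed number for illustration; the thread never says how many cross-connects M5 has in service.

failures = 6            # cross-connect failures reported over the period
years = 4
circuits = 10           # ASSUMED number of in-service cross-connects
rate = failures / (circuits * years)
print(f"{rate:.3f} failures per cross-connect-year")
print(f"roughly one failure every {1 / rate:.1f} circuit-years")

With those assumed numbers the rate works out to 0.15 failures per cross-connect-year, or about one failure every 6.7 circuit-years; the replies below generally argue that passive copper should do far better than that.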
Michael J McCafferty wrote:
Today I had yet another cross-connect fail at our colo provider. [...] what do you expect the average reliability of your FastE or GigE copper cross-connects at a colo?
Never to fail? Seriously; if you're talking about a passive connection (optical or electrical) like a patch panel, I'd expect it to keep going forever unless someone damages it. ~Seth
Seth Mattinen wrote:
Michael J McCafferty wrote:
Today I had yet another cross-connect fail at our colo provider. [...] what do you expect the average reliability of your FastE or GigE copper cross-connects at a colo?
Never to fail? Seriously; if you're talking about a passive connection (optical or electrical) like a patch panel, I'd expect it to keep going forever unless someone damages it.
That's truly wishful thinking, as are the assumptions that insulate it from damaging factors. Nothing lasts forever. -- Alex Balashov - Principal Evariste Systems Web : http://www.evaristesys.com/ Tel : (+1) (678) 954-0670 Direct : (+1) (678) 954-0671
Alex Balashov wrote:
Seth Mattinen wrote:
Michael J McCafferty wrote:
Today I had yet another cross-connect fail at our colo provider. [...] what do you expect the average reliability of your FastE or GigE copper cross-connects at a colo?
Never to fail? Seriously; if you're talking about a passive connection (optical or electrical) like a patch panel, I'd expect it to keep going forever unless someone damages it.
That's truly wishful thinking, as are the assumptions that insulate it from damaging factors. Nothing lasts forever.
What the OP is describing is abnormally high in my view. Based purely on my own personal experience, the structured wiring I put in at my parents' house in the mid-90s has never suffered a failure, is still in use today, and it's in a residential environment with dogs and cats. I'd expect a properly managed environment to fare at least as well as that. ~Seth
[lots of stuff deleted]

We've seen cross-connects fail at sites like "E" and others. Generally speaking, it is a human-error issue and not a component-failure one. Either people are being sloppy and aren't reading labels, or the labels aren't there. In a cabinet situation, every cabinet does not necessarily home back to its own patch panel, so some trashing may occur -- it can be avoided with good design [cables in the back stay there, etc].

When you are talking about optics failing and they are providing "smart" cross-connects, almost anything is possible. The true telltale is whether you have to call when the cross-connect goes down, or if it just "bounces". Either way, have them take you to their cross-connect room and show you their mess. Once you see it, you'll know what to expect going forward.

Deepak
On 09/17/2009 06:37 PM, Deepak Jain wrote:
[lots of stuff deleted].
A famous one that can happen with some techs is that they make jumpers from solid wire with generic RJ45 plugs (yes, I've seen this recently from several folks who should know better). These will last somewhere around a year (long enough to forget when they were installed), then randomly fail from just fan vibration or slight breezes. There are RJ45 plugs made for solid wire (they have 3 little prongs instead of 2, offset to straddle the wire), but I feel that even these can go bad. I know that if the techs are properly educated this "will never happen" (tm)... (till someone needs a custom-length jumper on a Sunday...) (for which, one colo building has an Ace Hardware with most of the right stuff, but unfortunately most don't).

As we all (should) know, all solid-wire cable should terminate in a panel, with proper short jumpers (preferably with molded strain relief) used for the rest.

-- Pete
On Sep 17, 2009, at 5:52 PM, Seth Mattinen wrote:
Michael J McCafferty wrote:
Today I had yet another cross-connect fail at our colo provider. [...] what do you expect the average reliability of your FastE or GigE copper cross-connects at a colo?
Never to fail? Seriously; if you're talking about a passive connection (optical or electrical) like a patch panel, I'd expect it to keep going forever unless someone damages it.
Or until someone pulls out the wrong cable (which has happened to me). Regards Marshall
~Seth
Marshall Eubanks wrote:
On Sep 17, 2009, at 5:52 PM, Seth Mattinen wrote:
Michael J McCafferty wrote:
Today I had yet another cross-connect fail at our colo provider. [...] what do you expect the average reliability of your FastE or GigE copper cross-connects at a colo?
Never to fail? Seriously; if you're talking about a passive connection (optical or electrical) like a patch panel, I'd expect it to keep going forever unless someone damages it.
Or until someone pulls out the wrong cable (which has happened to me).
That's not a failure though. It's a disconnection. It happens but is readily attributable to a cause. Random failures of a single port's connectivity.... bizarre and annoying. Whole switches? Seen it. Whole panels? Seen it. Whole blades? Seen it. Single port on a switch or patch panel? Never.
On Thu, Sep 17, 2009 at 03:35:37PM -0700, Charles Wyble wrote:
Random failures of a single port's connectivity.... bizarre and annoying. Whole switches? Seen it. Whole panels? Seen it. Whole blades? Seen it.
Single port on a switch or patch panel? Never.
You've never seen a single port go bad on a switch? I can't even count the number of times I've seen that happen. Not that I'm suggesting the OP wasn't the victim of human error like someone unplugging the wrong port and just lying to him; that happens even more.

My favorite bizarre random failure story is a toss-up between one of these two:

Story 1. Had a customer report that they weren't able to transfer this one particular file over their connection. The transfer would start and then at a certain point the TCP session would just lock up. After a lot of head scratching, it turned out that for 8 ports on a 24-port FastE switch blade, a certain combination of bytes caused the packet to be dropped on this otherwise perfectly normal and functioning card, thus stalling the TCP session while leaving everything around it unaffected. If you moved them to a different port outside this group of 8, or used https, or uuencoded the file, it would go through fine.

Story 2. Had a customer report that they were getting extremely slow transfers to another network, despite not being able to find any packet loss. Shifting the traffic to a different port to reach the same network resolved the problem. After removing the traffic and attempting to ping the far side, I got the following:

<drop>
64 bytes from x.x.x.x: icmp_seq=1 ttl=61 time=0.194 ms
64 bytes from x.x.x.x: icmp_seq=2 ttl=61 time=0.196 ms
64 bytes from x.x.x.x: icmp_seq=3 ttl=61 time=0.183 ms
64 bytes from x.x.x.x: icmp_seq=0 ttl=61 time=4.159 ms
<drop>
64 bytes from x.x.x.x: icmp_seq=5 ttl=61 time=0.194 ms
64 bytes from x.x.x.x: icmp_seq=6 ttl=61 time=0.196 ms
64 bytes from x.x.x.x: icmp_seq=7 ttl=61 time=0.183 ms
64 bytes from x.x.x.x: icmp_seq=4 ttl=61 time=4.159 ms

After a little bit more testing, it turned out that every 4th packet being sent to the peer's router was being queued until another "4th packet" would come along and knock it out. If you increased the interval time of the ping, you would see the amount of time the packet spent in the queue increase. At one point I had it up to over 350 seconds (not milliseconds) that the packet stayed in the other router's queue before that 4th packet came along and knocked it free. I suspect it could have gone higher, but random scanning traffic on the internet was coming in. When there was a lot of traffic on the interface you would never see the packet loss, just reordering of every 4th packet and thus slow TCP transfers. :)

--
Richard A Steenbergen <ras@e-gerbil.net>       http://www.e-gerbil.net/ras
GPG Key ID: 0xF8B12CBC (7535 7F59 8204 ED1F CC1C 53AF 4C41 5ECA F8B1 2CBC)
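A minimal sketch of the kind of slow-ping test described above: send widely spaced pings and flag out-of-order icmp_seq values or RTT outliers. The target address, packet count, and thresholds are placeholders rather than values from the thread, and it assumes a Unix ping whose output matches the lines quoted above.

import re
import subprocess

HOST = "192.0.2.1"   # placeholder; use the far-side address being tested
COUNT = "20"

# Slow pings make a queued packet stand out; parse icmp_seq and RTT.
out = subprocess.run(["ping", "-i", "1", "-c", COUNT, HOST],
                     capture_output=True, text=True).stdout
pings = [(int(m.group(1)), float(m.group(2)))
         for m in re.finditer(r"icmp_seq=(\d+).*time=([\d.]+) ms", out)]

highest = -1
for seq, rtt in pings:
    if seq < highest:
        print(f"reordered: icmp_seq={seq} arrived after icmp_seq={highest}")
    highest = max(highest, seq)

if pings:
    rtts = sorted(r for _, r in pings)
    median = rtts[len(rtts) // 2]
    for seq, rtt in pings:
        if rtt > 10 * median and rtt > 1.0:   # arbitrary outlier threshold
            print(f"icmp_seq={seq} RTT {rtt:.3f} ms looks like it sat in a queue")

Lengthening the -i interval, as in the story, makes the stuck packet's queue time grow and the pattern easier to see.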
In message <20090917234547.GT51443@gerbil.cluepon.net>, Richard A Steenbergen writes:
On Thu, Sep 17, 2009 at 03:35:37PM -0700, Charles Wyble wrote:
Random failures of a single port's connectivity.... bizarre and annoying. Whole switches? Seen it. Whole panels? Seen it. Whole blades? Seen it.
Single port on a switch or patch panel? Never.
You've never seen a single port go bad on a switch? I can't even count the number of times I've seen that happen. Not that I'm suggesting the OP wasn't the victim of human error like someone unplugging the wrong port and just lying to him; that happens even more.
My favorite bizarre random failure story is a toss-up between one of these two:
Story 1. Had a customer report that they weren't able to transfer this one particular file over their connection. The transfer would start and then at a certain point the tcp session would just lock up. After a lot of head scratching, it turned out that for 8 ports on a 24 port FastE switch blade, this certain combination of bytes caused the packet to be dropped on this otherwise perfectly normal and functioning card, thus stalling the tcp session while leaving everything around it unaffected. If you moved them to a different port outside this group of 8, or used https, or uuencoded it, it would go through fine.
Seen that more than once. It's worse when it's in some router on the other side of the planet and you're just a lowly customer.
-- Mark Andrews, ISC 1 Seymour St., Dundas Valley, NSW 2117, Australia PHONE: +61 2 9871 4742 INTERNET: marka@isc.org
On Sep 17, 2009, at 7:45 PM, Richard A Steenbergen wrote: [ SNIP ]
Story 2. Had a customer report that they were getting extremely slow transfers to another network, despite not being able to find any packet loss. Shifting the traffic to a different port to reach the same network resolved the problem. After removing the traffic and attempting to ping the far side, I got the following:
<drop>
64 bytes from x.x.x.x: icmp_seq=1 ttl=61 time=0.194 ms
64 bytes from x.x.x.x: icmp_seq=2 ttl=61 time=0.196 ms
64 bytes from x.x.x.x: icmp_seq=3 ttl=61 time=0.183 ms
64 bytes from x.x.x.x: icmp_seq=0 ttl=61 time=4.159 ms
<drop>
64 bytes from x.x.x.x: icmp_seq=5 ttl=61 time=0.194 ms
64 bytes from x.x.x.x: icmp_seq=6 ttl=61 time=0.196 ms
64 bytes from x.x.x.x: icmp_seq=7 ttl=61 time=0.183 ms
64 bytes from x.x.x.x: icmp_seq=4 ttl=61 time=4.159 ms
After a little bit more testing, it turned out that every 4th packet being sent to the peer's router was being queued until another "4th packet" would come along and knock it out. If you increased the interval time of the ping, you would see the amount of time the packet spent in the queue increase. At one point I had it up to over 350 seconds (not milliseconds) that the packet stayed in the other router's queue before that 4th packet came along and knocked it free. I suspect it could have gone higher, but random scanning traffic on the internet was coming in. When there was a lot of traffic on the interface you would never see the packet loss, just reordering of every 4th packet and thus slow TCP transfers. :)
Story 1:
-----------
I had a router where I was suddenly unable to reach certain hosts on the (/24) ethernet interface -- pinging from the router worked fine, transit traffic wouldn't. I decided to try and figure out if there was any sort of rhyme or reason to which hosts had gone unreachable. I could successfully reach:

xxx.yyy.zzz.1
xxx.yyy.zzz.2
xxx.yyy.zzz.3
xxx.yyy.zzz.5
xxx.yyy.zzz.7
xxx.yyy.zzz.11
xxx.yyy.zzz.13
xxx.yyy.zzz.17
...
xxx.yyy.zzz.197
xxx.yyy.zzz.199

There were only 200 hosts on the LAN, but I'd bet dollars to donuts that I know what the next reachable one would have been if there had been more. Unfortunately the box rebooted itself (when I tried to view the FIB) before I could collect more info.

Story 2:
----------
Had a small router connecting a remote office over a multilink PPP[1] interface (4xE1). Site starts getting massive packet loss, so I figure one of the circuits has gone bad but didn't get removed from the bundle. I'm having a hard time reaching the remote side, so I pull the interfaces from protocols and try to ping the remote router -- no replies.... Luckily I didn't hit Ctrl-C on the ping, because suddenly I start getting replies with no drops:

64 bytes from x.x.x.x: icmp_seq=1 ttl=120 time=30132.148 ms
64 bytes from x.x.x.x: icmp_seq=2 ttl=120 time=30128.178 ms
64 bytes from x.x.x.x: icmp_seq=3 ttl=120 time=30133.231 ms
64 bytes from x.x.x.x: icmp_seq=4 ttl=120 time=30112.571 ms
64 bytes from x.x.x.x: icmp_seq=5 ttl=120 time=30132.632 ms

What?! I figure it's gotta be MLPPP stupidity and / or depref of ICMP, so I connect OOB and A: remove MLPPP and use just a single interface, and B: start pinging a host behind the router instead...

64 bytes from x.x.x.x: icmp_seq=1 ttl=120 time=30142.323 ms
64 bytes from x.x.x.x: icmp_seq=2 ttl=120 time=30144.571 ms
64 bytes from x.x.x.x: icmp_seq=3 ttl=120 time=30141.632 ms
64 bytes from x.x.x.x: icmp_seq=4 ttl=120 time=30142.420 ms
64 bytes from x.x.x.x: icmp_seq=5 ttl=120 time=30159.706 ms

I fire up tcpdump and try ssh to a host on the remote side -- I see the SYN leave my machine and then, 30 *seconds* later, I get back a SYN-ACK. I change the queuing on the interface from FIFO to something else and the problem goes away. I change the queuing back to FIFO and it's a 30-second RTT again. Somehow it seems to be buffering as much traffic as it can (and anything more than one copy of ping running, or ping with anything larger than the default packet size, makes it start dropping badly). I ran "show buffers" to try to get more of an idea of what was happening, but it didn't like that and reloaded. Came back up fine though...

Story 3:
----------
Running a network that had a large number of L3 switches from a vendor (let's call them "X") in a single OSPF area. This area also contained a large number of poor-quality international circuits that would flap often, so there was *lots* of churn. Apparently this vendor X's OSPF implementation didn't much like this and so would become unhappy. The way it would express its displeasure was by corrupting a pointer to / in the LSDB so it was off by one, and you'd get:

Nov 24 22:23:53.633 EST: %OSPF-4-BADLSAMASK: Bad LSA mask: Type 5, LSID 0.9.32.5 Mask 10.160.8.0 from 10.178.255.252 NOTE: This route will not be installed in the routing table.
Nov 26 11:01:32.997 EST: %OSPF-4-BADLSAMASK: Bad LSA mask: Type 5, LSID 0.4.2.3 Mask 10.2.153.0 from 10.178.255.252 NOTE: This route will not be installed in the routing table.
Nov 27 23:14:00.660 EST: %OSPF-4-BADLSAMASK: Bad LSA mask: Type 5, LSID 0.4.2.3 Mask 10.2.153.0 from 10.178.255.252 NOTE: This route will not be installed in the routing table.

(This network was addressed out of 10/8 -- 10.178.255.252 is one of vendor X's boxes, and 10.160.8.0 is a valid subnet but, surprisingly enough, not a valid mask..... )

To make matters even more fun, the OSPF adjacency would go down and then come back up -- and the grumpy box would flood all of its (corrupt) LSAs...

W

[1]: Hey, not my idea...
-- "Real children don't go hoppity-skip unless they are on drugs." -- Susan, the ultimate sensible governess (Terry Pratchett, Hogfather)
Once upon a time, Warren Kumari <warren@kumari.net> said:
xxx.yyy.zzz.1
xxx.yyy.zzz.2
xxx.yyy.zzz.3
xxx.yyy.zzz.5
xxx.yyy.zzz.7
xxx.yyy.zzz.11
xxx.yyy.zzz.13
xxx.yyy.zzz.17
...
xxx.yyy.zzz.197
xxx.yyy.zzz.199
Oh, come on; everybody knows 1 doesn't belong in that list! :-) -- Chris Adams <cmadams@hiwaay.net> Systems and Network Administrator - HiWAAY Internet Services I don't speak for anybody but myself - that's enough trouble.
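For anyone who wants to check the pattern Warren and Chris are pointing at, a trivial primality test over the responding last octets does it; the list below is copied from Warren's post, and nothing else here comes from the thread.

# Last octets that still answered in Warren's story 1.
reachable = [1, 2, 3, 5, 7, 11, 13, 17, 197, 199]

def is_prime(n: int) -> bool:
    if n < 2:                 # 1 is not prime, hence the joke above
        return False
    return all(n % d for d in range(2, int(n ** 0.5) + 1))

for octet in reachable:
    print(f".{octet}: {'prime' if is_prime(octet) else 'not prime'}")

Every responding octet except .1 comes back prime, which is why the next reachable host on a bigger LAN would presumably have been .211.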
Richard A Steenbergen <ras@e-gerbil.net> writes:
You've never seen a single port go bad on a switch? I can't even count the number of times I've seen that happen. Not that I'm suggesting the OP wasn't the victim of human error like someone unplugging the wrong port and just lying to him; that happens even more.
I know it happens; it's happened to me, and I have probably touched fewer switches than you. Still, from what I can understand, it can be prevented/minimized by the use of a grounded port. from: http://support.3com.com/documents/switches/baseline/3Com-Switch-Family_Safet... "CAUTION: If you want to install the Switch using a Category 5E or Category 6 cable, 3Com recommends that you briefly connect the cable to a grounded port before you connect to the network equipment. If you do not, the cable’s electrostatic discharge (ESD) may damage the Switch's port. You can create a grounded port by connecting all wires at one end of a UTP cable to an earth ground point, and the other end to a female RJ-45 connector located, for example, on a Switch rack or patch panel. The RJ-45 connector is now a grounded port."
Luke S Crawford wrote:
Richard A Steenbergen <ras@e-gerbil.net> writes:
You've never seen a single port go bad on a switch? I can't even count the number of times I've seen that happen. Not that I'm suggesting the OP wasn't the victim of human error like someone unplugging the wrong port and just lying to him; that happens even more.
I know it happens; it's happened to me, and I have probably touched fewer switches than you. Still, from what I can understand, it can be prevented/minimized by the use of a grounded port.
from: http://support.3com.com/documents/switches/baseline/3Com-Switch-Family_Safet...
"CAUTION: If you want to install the Switch using a Category 5E or Category 6 cable, 3Com recommends that you briefly connect the cable to a grounded port before you connect to the network equipment. If you do not, the cable’s electrostatic discharge (ESD) may damage the Switch's port.
You can create a grounded port by connecting all wires at one end of a UTP cable to an earth ground point, and the other end to a female RJ-45 connector located, for example, on a Switch rack or patch panel. The RJ-45 connector is now a grounded port."
HP chassis switches ship with a grounding jack accessory you attach to the DB9 port (I assume it ties all RJ-45 pins to shield/ground) explicitly for this purpose. The instructions say to always plug a cable into the grounding device before connecting it to a switch port. ~Seth
On Thu, 2009-09-17 at 17:59 -0400, Marshall Eubanks wrote:
Or until someone pulls out the wrong cable (which has happened to me).
Not that I would know from experience, but it is rumored that certain telco techs in the NYC area can be persuaded to "borrow" other people's pairs for less than a hundred dollars. Richard Golodner
Not really. That's all too easy to diagnose and fix. Poorly terminated and/or mistreated cabling is far more likely. I wrote a long post about all the crap termination and poor treatment I've seen... but canceled the message.

On Thu, 17 Sep 2009, Mike Lieman wrote:
We have a winner!
On Thu, Sep 17, 2009 at 5:59 PM, Marshall Eubanks <tme@americafree.tv> wrote:
Or until someone pulls out the wrong cable (which has happened to me).
Regards Marshall
~Seth
----------------------------------------------------------------------
 Jon Lewis                   | I route
 Senior Network Engineer     | therefore you are
 Atlantic Net                |
_________ http://www.lewis.org/~jlewis/pgp for PGP public key_________
Because no-one is stealing pairs anymore? On Thu, Sep 17, 2009 at 7:23 PM, Jon Lewis <jlewis@lewis.org> wrote:
Not really. That's all too easy to diagnose and fix. Poorly terminated and or mistreated cabling is far more likely. I wrote a long post about all the crap termination and poor treatment I've seen...but canceled the message.
On Thu, 17 Sep 2009, Mike Lieman wrote:
We have a winner!
On Thu, Sep 17, 2009 at 5:59 PM, Marshall Eubanks <tme@americafree.tv> wrote:
Or until someone pulls out the wrong cable (which has happened to me).
Regards Marshall
~Seth
It's just not as interesting or hard to troubleshoot as a poorly made patch cable with one conductor that goes open only when the wire is tugged in a certain direction, nicked wires shorting, a switch port with its RX side burned out, or an RJ45 plug whose mistreated tab no longer works, so that although it looks inserted in the port, it's really just kind of hanging there not making full/good contact, etc. I would hope that in any data center, "stealing pairs" doesn't happen as much as any of the above, or as much as someone taking pairs the tech genuinely thought were dead.

On Thu, 17 Sep 2009, Mike Lieman wrote:
Because no-one is stealing pairs anymore?
On Thu, Sep 17, 2009 at 7:23 PM, Jon Lewis <jlewis@lewis.org> wrote:
Not really. That's all too easy to diagnose and fix. Poorly terminated and or mistreated cabling is far more likely. I wrote a long post about all the crap termination and poor treatment I've seen...but canceled the message.
On Thu, 17 Sep 2009, Mike Lieman wrote:
We have a winner!
On Thu, Sep 17, 2009 at 5:59 PM, Marshall Eubanks <tme@americafree.tv> wrote:
Or until someone pulls out the wrong cable (which has happened to me).
Regards Marshall
~Seth
----------------------------------------------------------------------
 Jon Lewis                   | I route
 Senior Network Engineer     | therefore you are
 Atlantic Net                |
_________ http://www.lewis.org/~jlewis/pgp for PGP public key_________
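One hedged way to catch the kind of intermittent fault Jon describes above before a customer does is to watch the port's error counters for movement. A minimal sketch using the net-snmp snmpget command; the switch address, community string, and ifIndex are placeholders rather than details from the thread, and it assumes SNMP read access to the switch.

import subprocess
import time

HOST = "192.0.2.1"       # placeholder switch address
COMMUNITY = "public"     # placeholder SNMP community
IF_INDEX = "10"          # placeholder ifIndex of the cross-connect port
OID_IF_IN_ERRORS = "1.3.6.1.2.1.2.2.1.14"   # IF-MIB::ifInErrors

def read_in_errors() -> int:
    # -Oqv prints just the value of the requested OID.
    out = subprocess.check_output(
        ["snmpget", "-v2c", "-c", COMMUNITY, "-Oqv", HOST,
         f"{OID_IF_IN_ERRORS}.{IF_INDEX}"], text=True)
    return int(out.strip())

previous = read_in_errors()
while True:
    time.sleep(60)
    current = read_in_errors()
    if current > previous:
        print(f"ifInErrors grew by {current - previous} in the last minute")
    previous = current

A cable that only goes open when tugged in a certain direction tends to show up as bursts of input errors long before it fails outright, which is the sort of movement this would flag.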
-----Original Message-----
From: Michael J McCafferty [mailto:mike@m5computersecurity.com]
Sent: Thursday, September 17, 2009 2:46 PM
To: nanog
Subject: cross connect reliability
Hello Michael:

Today I had yet another cross-connect fail at our colo provider. From memory, this is the 6th cross-connect to fail while in service in 4 years [...] what do you expect the average reliability of your FastE or GigE copper cross-connects at a colo?

Thanks, Mike
I agree with their Reason for Outage, but it sounds like a design issue. We prewire all of our switches to patch panels so they don't get touched once they're installed. The patch panels are much more friendly to insertions and removals than a 48-port 1U switch. We also have multiple connections on the fiber side to avoid those failures. With all of that, we still have failures, but their effect and frequency are minimized.

Mike

--
Michael K. Smith - CISSP, GISP
Chief Technical Officer - Adhost Internet LLC
mksmith@adhost.com
w: +1 (206) 404-9500 f: +1 (206) 404-9050
PGP: B49A DDF5 8611 27F3 08B9 84BB E61E 38C0 (Key ID: 0x9A96777D)
From: Michael J McCafferty <mike@m5computersecurity.com>
Organization: M5Hosting
Date: Thu, 17 Sep 2009 14:45:36 -0700
To: nanog <nanog@nanog.org>
Subject: cross connect reliability

Today I had yet another cross-connect fail at our colo provider. [...] what do you expect the average reliability of your FastE or GigE copper cross-connects at a colo?

Does the colo let anyone run cables, or do they have approved contractors? It sounds like a design issue to me in the way the cables are treated. In 4 years at a busy colo we have had one copper cross-connect not act right. It would pass data but was flaky. We replaced it because it was an easy run, just to rule it out. I am assuming you are in shared space. If so, I would investigate your weak points (which I am sure you already are doing).

Justin
On Thu, Sep 17, 2009 at 02:45:36PM -0700, Michael J McCafferty wrote:
All, Today I had yet another cross-connect fail at our colo provider. From memory, this is the 6th cross-connect to fail while in service in 4 years, and recently there was a bad SFP on their end as well. This seems like a high failure rate to me. When I asked about the high failure rate, they said that they run a lot of cables and there is a lot of jiggling and wiggling... lots of chances to get bent out of whack from activity near my patches and cables.
I once had a circuit go down because the fiber connector wasn't crimped on correctly, and the fiber pulled out of the connector while a tech was working in the cable tray nearby. After we opened a ticket about the issue, said tech "fixed" it by shoving the fiber back into the connector by hand, and walking away. Needless to say it went down again the next day. Names withheld to protect the guilty and keep them from raising my prices for heckling them in public, but the moral of the story is never underestimate the laziness or stupidity of the cable monkeys some of these places hire and let touch your routers. :)

--
Richard A Steenbergen <ras@e-gerbil.net>       http://www.e-gerbil.net/ras
GPG Key ID: 0xF8B12CBC (7535 7F59 8204 ED1F CC1C 53AF 4C41 5ECA F8B1 2CBC)
On Thu, 17 Sep 2009, Richard A Steenbergen wrote:
I once had a circuit go down because the fiber connector wasn't crimped on correctly, and the fiber pulled out of the connector while a tech was working in the cable tray nearby. After we opened a ticket about the issue, said tech "fixed" it by shoving the fiber back into the connector by hand, and walking away. Needless to say it went down again the next day. Names withheld to protect the guilty and keep them from raising my prices for heckling them in public, but the moral of the story is never underestimate the laziness or stupidity of the cable monkeys some of these places hire and let touch your routers. :)
In their defense, that was clearly the fastest way to fix it. :) Just not a very long-term solution.

----------------------------------------------------------------------
 Jon Lewis                   | I route
 Senior Network Engineer     | therefore you are
 Atlantic Net                |
_________ http://www.lewis.org/~jlewis/pgp for PGP public key_________
Having worked in high-traffic colo spaces around the world for the last ten years or so, in my experience this type of issue is very rare. If you are having this type of "quality" issue, I would sit down with your sales rep and ask to be stepped through their processes; there is obviously something that has gone VERY VERY WRONG.

Shane

On Sep 17, 2009, at 10:45 PM, Michael J McCafferty wrote:
Today I had yet another cross-connect fail at our colo provider. [...] what do you expect the average reliability of your FastE or GigE copper cross-connects at a colo?

Thanks, Mike
Michael J McCafferty <mike@m5computersecurity.com> 9/17/2009 5:45 PM >>>

Today I had yet another cross-connect fail at our colo provider. [...] what do you expect the average reliability of your FastE or GigE copper cross-connects at a colo?

We've never had a fiber CC fail. We HAVE had DS3 and T1s fail. Those were due to other customer circuits being installed near ours and bumping them.
On Thu, Sep 17, 2009 at 5:45 PM, Michael J McCafferty <mike@m5computersecurity.com> wrote: [ clip ]
I am curious; what do you expect the average reliability of your FastE or GigE copper cross-connects at a colo?
At the physical layer, near zero. If "jiggling and wiggling" is causing cable failures, they have bigger problems. The last time I had problems with "jiggle and wiggle" was with techs walking by DS3 xcon farms and "testing" whether the cables were "taut" enough with respect to the termination hardware. If you pull on them enough, they will come apart and they will fail. I'd love to see a picture of their xcon frames.

-M<

--
Martin Hannigan                               martin@theicelandguy.com
p: +16178216079
Power, Network, and Costs Consulting for Iceland Datacenters and Occupants
participants (21)
- Alex Balashov
- Brandon Palmer
- Charles Wyble
- Chris Adams
- Deepak Jain
- Jon Lewis
- Justin Wilson - MTIN
- Luke S Crawford
- Mark Andrews
- Marshall Eubanks
- Martin Hannigan
- Michael J McCafferty
- Michael K. Smith - Adhost
- Mike Lieman
- Pete Carah
- Richard A Steenbergen
- Richard Golodner
- Seth Mattinen
- Shane Ronan
- Valdis.Kletnieks@vt.edu
- Warren Kumari