I have heard rumors that S&D has been having persistent problems with their switches at PAIX (Palo Alto), and I was kind of wondering if anyone actually cared?
On Wed, Apr 27, 2005 at 10:45:15AM -0400, Jay Patel wrote:
I have heard rumors that S&D has been having persistent problems with their switches at PAIX (Palo Alto), and I was kind of wondering if anyone actually cared?
Personally I tend to suspect the general lack of uproar is a rather unfortunate (for them) sign that PAIX is no longer relevant when it comes to critical backbone infrastructure. It looks like different folks have been seeing different levels of outages depending upon which switch/card they are connected to, but I haven't been able to find anyone who has seen fewer than 30 hits between April 16th and the two this morning. Our ports have seen just under 28 hours of total downtime so far this month, while some lucky people have only seen around 6 hours.

I'm not sure if anyone at S&D or Extreme actually has any real idea what the problem is with these current switches, but given this amount of downtime, they should have replaced every last component by now. If Extreme can't fix them, there should be a pile of Black Diamonds sitting on the curb waiting for trash day. In fact, 9/10ths of the way through writing this e-mail, I got a call from S&D stating that they are doing exactly that. :)

In the meantime, here are some of the more interesting snippets of what has been tried on the current switches:

16 Apr 2005 20:19:53 GMT
We are currently experiencing problems with 2 network cards in our Palo Alto peering switch. This might be causing service degradation. Switch Engineers are expecting new cards to replace the 2 suspected faulty network cards. These cards should be arriving in around 1 hour. Right after the cards arrive, we will be scheduling an emergency maintenance window to get these cards replaced.

19 Apr 2005 14:16:07 GMT
The purpose of this emergency maintenance window is for Switch Engineers to replace a faulty processor module card affecting the Bay Area peering customers. The estimated downtime will be 15 minutes. (Actual downtime: several hours)

19 Apr 2005 19:27:49 GMT
This is the final update regarding the problems experienced today with the peering fabric. Our Switch Engineers corrected the problems during the emergency maintenance window by replacing two line cards and two processor cards in the Palo Alto switch. All peering sessions should be restored at this time.

22 Apr 2005 21:56:15 GMT
The purpose of this emergency maintenance window is for engineers to replace defective power supply units on the PAIX switch. No impact to your services is expected.

24 Apr 2005 21:25:48 GMT
Our Switch Engineers will be conducting an emergency processor card replacement at the Palo Alto site. The expected downtime while this maintenance is being conducted will be 2 hours.

24 Apr 2005 21:36:18 GMT
Our Switch Engineers will be conducting an emergency chassis replacement at the Palo Alto site. The expected downtime while this maintenance is being conducted will be 3 hours.

25 Apr 2005 19:17:41 GMT
Our engineers have escalated the problems with the peering switch in Palo Alto to 3rd-level support at Extreme, the switch vendor. More details will follow as they become available.

26 Apr 2005 03:00:34 GMT
Our Switch Engineers have advised us that the switch has been migrated to a different power bus to rule out any power variables. Power is being monitored for the next 24 hours.

28 Apr 2005 13:33:05 GMT
At approximately 6:05 AM local time, the peering switch rebooted itself. Our switch engineers are investigating this issue and believe all sessions are back to normal at this time. More details will be provided as they become available.
When I see a stable switching platform going forward, and some service credits for the massive outages we've all endured so far, I'll probably be a lot less cranky about the entire situation. Until then I have to say, if they keep this up they are going to need to change their name to "Switch or Data". Oh well, at least this didn't happen during the S&D-sponsored NANOG. :)

-- Richard A Steenbergen <ras@e-gerbil.net> http://www.e-gerbil.net/ras GPG Key ID: 0xF8B12CBC (7535 7F59 8204 ED1F CC1C 53AF 4C41 5ECA F8B1 2CBC)
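For perspective, here is a rough sketch of what those downtime figures work out to as availability percentages. The 28-hour and 6-hour numbers come from the post above; the April 1-28 month-to-date window is an assumption for illustration.

# Back-of-the-envelope availability math for the outage figures above.
# Assumes a month-to-date window of April 1-28 (672 hours).

HOURS_IN_WINDOW = 28 * 24  # April 1 through April 28

def availability(downtime_hours, window_hours=HOURS_IN_WINDOW):
    """Fractional availability over the measurement window."""
    return 1.0 - downtime_hours / window_hours

for label, down in [("worst-hit ports", 28.0), ("luckier ports", 6.0)]:
    print(f"{label}: {availability(down):.2%} "
          f"({down:g} h down out of {HOURS_IN_WINDOW} h)")

Even the "lucky" 6-hour figure is only about 99.1% availability, and 28 hours is roughly 95.8%; for comparison, even plain "three nines" (99.9%) allows only about 43 minutes of downtime in a 30-day month.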
In a message written on Thu, Apr 28, 2005 at 01:51:54PM -0400, Richard A Steenbergen wrote:
Personally I tend to suspect the general lack of uproar is a rather unfortunate (for them) sign that PAIX is no longer relevant when it comes to critical backbone infrastructure.
That, or a sign that operators are doing their job. There should be enough redundancy in the system that the loss of any one site, for whatever reason, doesn't cause a major, or even minor, disruption.

-- Leo Bicknell - bicknell@ufp.org - CCIE 3440 PGP keys at http://www.ufp.org/~bicknell/ Read TMBG List - tmbg-list-request@tmbg.org, www.tmbg.org
On Thu, Apr 28, 2005 at 02:11:40PM -0400, Leo Bicknell wrote:
In a message written on Thu, Apr 28, 2005 at 01:51:54PM -0400, Richard A Steenbergen wrote:
Personally I tend to suspect the general lack of uproar is a rather unfortunate (for them) sign that PAIX is no longer relevant when it comes to critical backbone infrastructure.
That, or a sign that operators are doing their job. There should be enough redundancy in the system that the loss of any one site, for whatever reason, doesn't cause a major, or even minor, disruption.
If you have a Cisco router that craps out on a regular basis, Cisco will tell you to get a second one. Some people find this to be a great solution, while other people go buy a Juniper.

This probably isn't the way they wanted to announce this, but PAIX is rolling out a new 10GE-capable platform (the Extreme Aspen series). Equinix is about to follow suit with their 10GE platform, and the only other two modern competitive IXs in the US have already deployed new 10GE-capable platforms (NYIIX with the Foundry MG8 and NOTA with Force10). Of course the Europeans have had customers up on 10GE for 6 months now, and at a fraction of the price that the US IXs will be charging, but let's ignore that and focus on our own backwater continent right now. :)

At the moment, the US IXs largely price their ports as high as the market will possibly bear (and then sometimes a few bucks more just as a kick in the teeth), and largely don't have 10GE ports available for either customers or multiple-site trunking. This means that most serious providers don't even have the option of public peering at interesting capacities, even if they weren't concerned about reliability issues. As the US IX market finally gets its act together and rolls out 10GE, many networks are going to start upgrading, and start putting much larger amounts of traffic on them to save on PNI costs. After all, we both know that due to current financial conditions not every network can afford all of the spare PNI ports they would like to ensure sufficiently diverse/redundant interconnections with their peers, yes? :)

With these IXs poised to take another order-of-magnitude step (remember the good old days when GE seemed too large?), they are about to get another shot in the arm as far as being used for mission-critical peering infrastructure is concerned. But no matter how good an idea it may be to make sure that you "always have diverse capacity at another location", if one IX is having significantly more disruptions than the rest, the network operators are going to go elsewhere (well, after their 5-year contracts are up, at any rate). Besides, I don't think "and for when we go down, there is an Equinix facility down the road" is really the marketing angle that Switch and Data had in mind.

-- Richard A Steenbergen <ras@e-gerbil.net> http://www.e-gerbil.net/ras GPG Key ID: 0xF8B12CBC (7535 7F59 8204 ED1F CC1C 53AF 4C41 5ECA F8B1 2CBC)
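The PNI-versus-public-peering economics above come down to effective cost per megabit. A minimal sketch follows; the monthly port fees and the 50% utilization target are hypothetical placeholders for illustration, not actual IX list prices.

# Illustrative port economics. The prices below are hypothetical,
# not real 2005 list prices for any particular IX.

def cost_per_mbps(monthly_fee_usd, port_mbps, utilization):
    """Effective monthly cost per Mbps actually carried on the port."""
    return monthly_fee_usd / (port_mbps * utilization)

# Hypothetical GE port at $2,500/mo vs. 10GE port at $8,000/mo,
# both run at a prudent 50% peak utilization:
print(f"GE:   ${cost_per_mbps(2500, 1_000, 0.5):.2f}/Mbps")   # $5.00
print(f"10GE: ${cost_per_mbps(8000, 10_000, 0.5):.2f}/Mbps")  # $1.60

The point of the sketch: if a 10GE port is priced at anything much less than ten times a GE port, the per-megabit cost of public peering drops sharply, which is exactly why the 10GE rollouts could pull large flows back onto the shared fabrics.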
On Thu, 28 April 2005 18:57:53 -0400, Richard A Steenbergen wrote:
At the moment, the US IXs largely price their ports as high as the market will possibly bear (and then sometimes a few bucks more just as a kick in the teeth)
Yeah, what's the issue? US public peering ports are absurdly overpriced. Anyone had a laugh at the PAIX list prices when seeing them for the first time? Considering LINX and AMSIX have been their own companies for some years now, they are doing an excellent job at being (too?) affordable, but it surely works.

Then when I think about the NYIIX woes some time ago and other stuff (like the current PAIX trouble), I cannot help but get rid of public peering, especially in the US. As another matter, I do not believe in public peering at all when you have flows to a single peer that are more than half of a full GE. Been there; it was not at all nice. I guess more and more operators will have fewer and fewer public IX ports, and the open peering coalition will start wondering at some point...

The AMSIX has a lot of 10G peers. While they just take two ports, and the AMSIX supposedly also being redundant (and cheap <g>), it is just a time bomb. How many times did either LINX or AMSIX have issues (actually very rare!) and we happily overloaded our peers' interfaces at the respective other IX...

Say what you want, but public peering (yes/no) has a lot to do with your amount of traffic, and your peers. Paying 3 to 4 times as much in the US for the very same thing, I am sure I get even less value - and I'm pulling out. (Well, since we stumbled upon this topic... thanks, ras!)

Alexander
and we happily overloaded our peers' interfaces at the respective other IX...
That sounds more like a planning issue than anything else. If you have traffic going through a pipe, then you need to make sure you have somewhere else to send it. If you are managing your peers properly, private or public, there should be no issue.
On Fri, 29 April 2005 13:04:05 +0100, Neil J. McRae wrote:
and we happily overloaded our peers' interfaces at the respective other IX...
That sounds more like a planning issue than anything else. If you have traffic going through a pipe, then you need to make sure you have somewhere else to send it. If you are managing your peers properly, private or public, there should be no issue.
With public peering you simply never know how much spare capacity your peer has free. And would you expect your peer with 400 Mbit/s total to have 400 reserved on his AMSIX port for you when you see 300 at LINX and LINX goes down? Been there, numerous times. I still tend to say - it depends on your type of peers and traffic per peer. Alexander
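Alexander's 300-at-LINX example is really a headroom check that you can only perform if you know the peer's port size and current load; with public peering you usually know neither, which is his point. A minimal sketch under assumed numbers (the port sizes and the 90% safety margin are illustrative):

# Failover headroom check. All figures are illustrative assumptions.

def failover_fits(peer_port_mbps, peer_current_mbps, shifted_mbps,
                  safety_margin=0.9):
    """True if traffic shifted from a failed IX fits on the peer's
    surviving port without pushing it past the safety margin."""
    return peer_current_mbps + shifted_mbps <= peer_port_mbps * safety_margin

# Peer with a GE port at AMSIX already carrying 100 Mbit/s there,
# absorbing 300 Mbit/s rerouted from a LINX outage:
print(failover_fits(1000, 100, 300))  # True  -- fits with headroom
# Same scenario if the peer only has ~400 Mbit/s of usable capacity:
print(failover_fits(400, 100, 300))   # False -- 400 > 360, congestion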
On Fri, Apr 29, 2005 at 02:08:13PM +0200, Alexander Koch wrote:
With public peering you simply never know how much spare capacity your peer has free.
You also never know with private peering: the peer's backbone links behind the interconnect are just as opaque.

Regards, Daniel -- CLUE-RIPE -- Jabber: dr@cluenet.de -- dr@IRCnet -- PGP: 0xA85C8AA0
With public peering you simply never know how much spare capacity your peer has free.
So with your key peers you talk to them and find out, but I don't see how this is any different if you have a private interconnect. Just because you have, say, an STM-1 into another peer doesn't mean they have the STM-1 to carry the traffic out; given your example below, I'd say it's even more unlikely.
And would you expect your peer with 400 Mbit/s total to have 400 reserved on his AMSIX port for you when you see 300 at LINX and LINX goes down?
Key ones, yes.
Been there, numerous times. I still tend to say - it depends on your type of peers and traffic per peer.
But your point on public versus private doesn't alter those facts.
On Fri, 29 Apr 2005, Alexander Koch wrote:
On Fri, 29 April 2005 13:04:05 +0100, Neil J. McRae wrote:
and we happily overloaded our peers' interfaces at the respective other IX...
That sounds more like a planning issue than anything else. If you have traffic going through a pipe, then you need to make sure you have somewhere else to send it. If you are managing your peers properly, private or public, there should be no issue.
With public peering you simply never know how much spare capacity your peer has free. And would you expect your peer with 400 Mbit/s total to have 400 reserved on his AMSIX port for you when you see 300 at LINX and LINX goes down?
What makes this a public peering issue? I see a couple of folks already made the point I wanted to make, but just because you have capacity to a peer (on a public interface or a dedicated PI) doesn't mean they aren't aggregating at their side and/or have enough capacity to carry the traffic where it needs to go.

This is also about scale: I would hope you aren't peering 400Mb flows across a 1Gb port at an IX; this would IMHO not be good practice. If your example were 40Mb then it would be different, or perhaps 400Mb on a 10Gb port. You might even argue there is more incentive to ensure public IX ports have capacity, as congestion will affect multiple peers.

Steve
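Steve's scale argument amounts to keeping any single peer's flow a small fraction of the shared port. A sketch with an assumed 10% cutoff; the threshold is illustrative, chosen so that his 400Mb/1Gb and 400Mb/10Gb examples fall on opposite sides of it.

# Flow-to-port ratio check. The 10% cutoff is an assumed threshold,
# not a number from the post.

def flow_ok(peak_flow_mbps, port_mbps, max_fraction=0.10):
    """True if a single peer's peak flow is a comfortably small
    fraction of the shared IX port it rides on."""
    return peak_flow_mbps / port_mbps <= max_fraction

print(flow_ok(400, 1_000))   # False -- 40% of a GE port
print(flow_ok(40, 1_000))    # True  --  4% of a GE port
print(flow_ok(400, 10_000))  # True  --  4% of a 10GE port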
On Thu, Apr 28, 2005 at 01:51:54PM -0400, Richard A Steenbergen wrote:
Personally I tend to suspect the general lack of uproar is a rather unfortunate (for them) sign that PAIX is no longer relevant when it comes to critical backbone infrastructure.
I'm not so sure you can draw that conclusion. At this point, everyone's high-traffic peering is over private interconnects anyway. The site is likely more important to them than the public fabric within the site. A facility power outage would probably be a lot more painful than some public fabric issues. Any traffic that works its way down to a public fabric probably has other public fabrics to go to as well. --msa
participants (9)
- Alexander Koch
- Daniel Roesen
- Jay Patel
- Leo Bicknell
- Majdi S. Abbas
- Neil J. McRae
- Randy Bush
- Richard A Steenbergen
- Stephen J. Wilcox