Most of the SF Bay Area portion of BBN Planet's network has been offline since late last night. I've finally heard, via a friend who is a major customer of theirs, that they are having power problems in Palo Alto.
Now, I haven't heard a peep about this on any of the mailing lists I'm on... and since knowing about BIG outages like this would make it easier for me to answer my customers' questions... am I missing something?
Is there a mailing list that I should be on? Or is this just a private BBN issue that I'd need to call their ops center people about? (People who, I'm sure, are very, very busy without small network operators calling them.)
-matthew kaufman matthew@scruz.net
I just spoke to their NOC and was told that a power switch that is supposed to be able to switch them between 3 different power utilities failed at 12:30 this morning (Friday). They bought and installed 2 large diesel generators today and are hoping to be back up using the generators within 30 minutes.

Rob
Rob Liebschutz <rob@rjl.com> says:
I just spoke to their NOC and was told that a power switch that is supposed to be able to switch them between 3 different power utilities failed at 12:30 this morning (Friday). They bought and installed 2 large diesel generators today and are hoping to be back up using the generators within 30 minutes.
We're trying to figure out how long BBN has been down. Are you talking 12:30am PDT? If so, then they've been down 16 hours. I started seeing mail failures to Stanford (a big BBN site) at 9:33am PDT and first got email about the problem from another ISP at around 11am.

--charles charles@etak.com
On Fri, 11 Oct 1996, Rob Liebschutz wrote:
I just spoke to their NOC and was told that a power switch that is supposed to be able to switch them between 3 different power utilities failed at 12:30 this morning (Friday). They bought and installed 2 large diesel generators today and are hoping to be back up using the generators within 30 minutes.
Perhaps there is a lesson in redundancy here. Based on this account, it would appear that although they had arranged for 3 power sources, two mistakes were made. One was that switching between sources relied on a single piece of hardware, thus negating the redundancy of the 3 sources. The other was that they did not have alternate types of power source, i.e. they did not have generators on site but expected that 3 different utility feeds were adequate redundancy. Some NOC engineers (or perhaps NOC managers) aren't paranoid enough.

Michael Dillon - ISP & Internet Consulting
Memra Software Inc. - Fax: +1-604-546-3049
http://www.memra.com - E-mail: michael@memra.com
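Michael's point about the single switch negating the redundancy can be made concrete with a little series/parallel availability arithmetic. The sketch below is purely illustrative; every availability figure in it is an assumption, not a number reported for BBN's plant:

    # Availability sketch (illustrative figures only, not BBN's actual numbers).
    def parallel(*avail):
        # Redundant sources: the group fails only if every source fails at once.
        p_fail = 1.0
        for a in avail:
            p_fail *= (1.0 - a)
        return 1.0 - p_fail

    def series(*avail):
        # Components that must all work for power to flow.
        p = 1.0
        for a in avail:
            p *= a
        return p

    utility = 0.999     # assumed availability of one utility feed
    switch = 0.9995     # assumed availability of the lone transfer switch

    feeds_only = parallel(utility, utility, utility)
    whole_chain = series(feeds_only, switch)

    print("three utility feeds alone: %.9f" % feeds_only)    # ~0.999999999
    print("feeds behind one switch:   %.9f" % whole_chain)   # ~0.999499999

However good the individual feeds are, the chain can never be more available than the one switch they all pass through, which is exactly the single point of failure being described.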
"md" == Michael Dillon <michael@memra.com> writes: md> Perhaps there is a lesson in redundancy here. [...] md> The other was that they did not have alternate types of power md> source, i.e. they did not have generators on site but expected md> that 3 different utility power sources was adequate redundancy. This reminds me of a passage from Bruce Sterling's _The Hacker Crackdown_. Parked outside the back is a power-generation truck. The generator strikes me as rather anomalous. Don't they already have their own generators in this eight-story monster? Then the suspicion strikes me that NYNEX must have heard of the September 17 AT&T power-outage which crashed New York City. Belt-and-suspenders, this generator. Very telco. michael
On Fri, 11 Oct 1996, Rob Liebschutz wrote:
I just spoke to their NOC and was told that a power switch that is supposed to be able to switch them between 3 different power utilities failed at 12:30 this morning (Friday). They bought and installed 2 large diesel generators today and are hoping to be back up using the generators within 30 minutes.
Yep, this happens all the time: the transfer switch dies and then you are screwed because you don't switch to backup power. Your UPS system then runs out of power and you are dead. That is why we are building a manual maintenance wraparound around the UPS AND the transfer switch, so that if the switch does die you can have someone manually bypass it.

Nathan Stratton    CEO, NetRail, Inc.    Tracking the future today!
---------------------------------------------------------------------------
Phone (703)524-4800              NetRail, Inc.
Fax   (703)534-5033              2007 N. 15 St. Suite 5
Email sales@netrail.net          Arlington, Va. 22201
WWW   http://www.netrail.net/    Access: (703) 524-4802 guest
---------------------------------------------------------------------------
"Therefore do not worry about tomorrow, for tomorrow will worry about
itself. Each day has enough trouble of its own." Matthew 6:34
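Nathan's "your UPS then runs out of power" failure mode is really a deadline problem: once the transfer switch is stuck, the batteries set a hard clock. A back-of-the-envelope runtime estimate follows; the battery size, load, and efficiency figures are invented for illustration and have nothing to do with BBN's or NetRail's actual plant:

    # Rough UPS runtime estimate (all figures are assumptions).
    def ups_runtime_hours(battery_kwh, load_kw, usable_fraction=0.8, inverter_eff=0.9):
        # Usable stored energy divided by the load drawn from the inverter.
        usable_kwh = battery_kwh * usable_fraction * inverter_eff
        return usable_kwh / load_kw

    # Assumed example: a 40 kWh battery string carrying a 15 kW equipment room.
    print("%.1f hours" % ups_runtime_hours(40, 15))   # about 1.9 hours

An hour or two of battery is nowhere near the roughly 16 hours of outage reported earlier in the thread, which is the case for a manual bypass you can throw without waiting on the automatic gear.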
On Sat, 12 Oct 1996, Nathan Stratton wrote:
On Fri, 11 Oct 1996, Rob Liebschutz wrote:
I just spoke to their NOC and was told that a power switch that is supposed to be able to switch them between 3 different power utilities failed at 12:30 this morning (Friday). They bought and installed 2 large diesel generators today and are hoping to be back up using the generators within 30 minutes.
Greetings,

I don't understand how they could have 3 different power systems fail without a serious operations procedural error. No one buys more than one generator, and it appears that they didn't have one at all, so that's ruled out. They have one utility company and, let's say, two UPS's to feed the dual power bussed routers. At best they have a DC plant. Overloaded UPS's are easy to spot. Weak batteries are found during routine, in-service tests. So... it looks like they abhorrently ignored their power situation. This is an "act of stupidity", not an "act of god" as one other person mentioned.
Yep, this happens all the time: the transfer switch dies and then you are screwed because you don't switch to backup power.
In 18 years of telecom management I have never seen a "Transfer Switch" be the component that failed. I have seen quite a few overloaded battery strings, and UPS's backed by rusty generators.
Your UPS system then runs out of power and you are dead. That is why we are building a manual maintenance wraparound around the UPS AND the transfer switch, so that if the switch does die you can have someone manually bypass it.
All quality Transfer Switches should have manual activation as a root function. Even a relatively small 15 kW transfer switch's automatic functions work by moving a manual switch.

As a rule, power system maintenance for critical equipment should comprise the following:

UPS - Never exceed 80% load. Replace batteries (good or not) per the manufacturer's guidelines. Initiate load transfer tests once per quarter.

Batteries - Perform cell maintenance quarterly, to include individual cell voltage and gravity tests and the surface cleaning of all terminal hardware.

Generators - Find a good generator maintenance contractor for routine maintenance needs. Exercise the unit under load each month.

Regards,

Patrick J. Chicas
Email: pjc@unix.off-road.com
URL: http://www.Off-Road.com
--------------------------------
The Off-Road Center of The 'Net!
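Patrick's rules translate directly into a checklist a NOC script could run. Here is a small sketch of the 80% load rule and the quarterly/monthly intervals; the example load, UPS rating, and dates are made up for illustration:

    # Checklist sketch for the maintenance rules above (example figures are assumptions).
    from datetime import date, timedelta

    def ups_load_ok(load_kw, rating_kw, max_fraction=0.80):
        # Never exceed 80% of the UPS rating.
        return load_kw <= rating_kw * max_fraction

    def overdue(last_done, interval_days, today):
        # True if a routine task (cell maintenance, generator exercise) is past due.
        return today - last_done > timedelta(days=interval_days)

    today = date(1996, 10, 14)
    print(ups_load_ok(18, 20))                    # False: 18 kW on a 20 kW UPS is 90%
    print(overdue(date(1996, 7, 1), 90, today))   # True: quarterly battery work missed
    print(overdue(date(1996, 9, 20), 31, today))  # False: generator exercised this month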
On Mon, 14 Oct 1996, Patrick J. Chicas wrote:
Greetings,
I don't understand how they could have 3 different power systems fail without a serious operations procedural error.
They did not, the transfer switch did.
Yep, this happens all the time: the transfer switch dies and then you are screwed because you don't switch to backup power.
In 18 years of telecom management I have never seen a "Transfer Switch" be the component that failed. I have seen quite a few overloaded battery strings, and UPS's backed by rusty generators.
Well, this is one, and I have only been on this planet 20 years and have seen two of them be the failure.
Your UPS system then runs out of power and you are dead. That is why we are building a manual maintenance wraparound around the UPS AND the transfer switch, so that if the switch does die you can have someone manually bypass it.
All quality Transfer Switches should have manual activation as a root function. Even a relatively small 15 kW transfer switch's automatic functions work by moving a manual switch.
True, all things can break.
As a rule, power system maintenance for critical equipment should comprise the following:
UPS - Never exceed 80% load. Replace batteries (good or not) per the manufacturer's guidelines. Initiate load transfer tests once per quarter.
Batteries - Perform cell maintenance quarterly, to include individual cell voltage and gravity tests and the surface cleaning of all terminal hardware.
Also, keep them at or around 75°.
Generators - Find a good generator maintenance contractor for routine maintenance needs. Exercise the unit under load each month.
Nathan Stratton    CEO, NetRail, Inc.    Tracking the future today!
---------------------------------------------------------------------------
Phone (703)524-4800              NetRail, Inc.
Fax   (703)534-5033              2007 N. 15 St. Suite 5
Email sales@netrail.net          Arlington, Va. 22201
WWW   http://www.netrail.net/    Access: (703) 524-4802 guest
---------------------------------------------------------------------------
"Therefore do not worry about tomorrow, for tomorrow will worry about
itself. Each day has enough trouble of its own." Matthew 6:34
On Mon, 14 Oct 1996, Nathan Stratton wrote:

Greetings,
On Mon, 14 Oct 1996, Patrick J. Chicas wrote:
I don't understand how they could have 3 different power systems fail without a serious operations procedural error.
They did not, the transfer switch did.
I think you have missed my point. From the RISKS message, it appears that BBN relied blindly on whatever power solution Stanford had in place. This is an operational failure.
In 18 years of telecom management I have never seen a "Transfer Switch" be the component that failed. I have seen quite a few overloaded battery strings, and UPS's backed by rusty generators.
Well, this is one, and I have only been on this planet 20 years and have seen two of them be the failure.
I have seen automatic transfer mechanisms fail, but never the manual portion of the switch.

Regards,

Patrick J. Chicas
Email: pjc@unix.off-road.com
URL: http://www.Off-Road.com
--------------------------------
The Off-Road Center of The 'Net!
On Mon, 14 Oct 1996, Patrick J. Chicas wrote:
I don't understand how they could have 3 different power systems fail without a serious operations procedural error.
They did not, the transfer switch did.
In 18 years of telecom management I have never seen a "Transfer Switch" be the component that failed.
I have seen automatic transfer mechanisms fail, but never the manual portion of the switch.
Someone mentioned that they had seen Stanford's generator plant and it was all dusty. Is it possible that no one knew how the transfer switch worked, or possibly no one knew that it had a manual override? Is there anyone close enough to Stanford to check up on this stuff?

Michael Dillon - ISP & Internet Consulting
Memra Software Inc. - Fax: +1-604-546-3049
http://www.memra.com - E-mail: michael@memra.com
On Mon, 14 Oct 1996, Michael Dillon wrote:
Someone mentioned that they had seen Stanford's generator plant and it was all dusty. Is it possible that no one knew how the transfer switch worked, or possibly no one knew that it had a manual override? Is there anyone close enough to Stanford to check up on this stuff?
The message led me to believe that the generator was not attached to the transfer switch. Also, almost all transfer switches have a manual handle on the front that levers the contacts inside the switch box. The automatic portion is commonly a very, very large relay-type winding that moves the same contacts. It would be hard to imagine that the rodent did its thing on the transfer switch itself, which should be upstream of the building's main AC switchgear. Referencing the messages so far, I am led to believe that the main switchgear smoked and the generator was not attached to any piece of transfer switchgear.

It's also tough to comprehend why BBN didn't upgrade the hub after purchase. It's just not that much of a capital expense for them to do so. I'll bet they now make changes on the order of a dedicated generator and UPS or battery plant for their servers, routers and terminal equipment. The stuff just isn't that expensive. As an example, I just had our third Lorain DC-AC 10 kVA inverter installed. The total price, turnkey, was under $13,000 with AC runs into the equipment areas. Big APC UPS's with loads of batteries are not much more, and you can plug them into the wall.

Regards,

Patrick J. Chicas
Email: pjc@unix.off-road.com
URL: http://www.Off-Road.com
--------------------------------
The Off-Road Center of The 'Net!
On Mon, 14 Oct 1996, Patrick J. Chicas wrote:
So... it looks like they abhorrently ignored their power situation. This is an "act of stupidity", not an "act of god" as one other person mentioned.
Well, the original outage was caused by a rat gnawing a cable... There are always layers of complexity, layers of causation and layers of blame in a situation like this. You might end up finding that 20 people were each 5% to blame, but it just so happened that all 20 made the wrong mistake at the wrong time to cause 100% failure.

But you do have some good advice here regarding power. Thanks.

Michael Dillon - ISP & Internet Consulting
Memra Software Inc. - Fax: +1-604-546-3049
http://www.memra.com - E-mail: michael@memra.com
On Mon, 14 Oct 1996, Michael Dillon wrote:

Greetings,
Well, the original outage was caused by a rat gnawing a cable... There are always layers of complexity, layers of causation and layers of blame in a situation like this. You might end up finding that 20 people were each 5% to blame, but it just so happened that all 20 made the wrong mistake at the wrong time to cause 100% failure.
And the generator was rusting away next door.
But you do have some good advice here regarding power. Thanks.
My mind was molded early in my career by the Bell System practice. Pity me... Seriously, working in the wireless business with lots of money, and in the ISP business with just enough money to keep pace with growth, I have seen both sides of the coin. I completely understand the economic constraints of small to medium ISPs and their struggle to make ends meet. On the other hand, I believe the large IAPs (almost all publicly traded companies) owe their subscribers only the best level of performance.

Patrick J. Chicas
Email: pjc@unix.off-road.com
URL: http://www.Off-Road.com
--------------------------------
The Off-Road Center of The 'Net!
participants (6)
- Charles R. Hoynowski
- Michael Dillon
- michael shiplett
- Nathan Stratton
- Patrick J. Chicas
- Rob Liebschutz