Just a heads up to anyone on the list: PG&E has just sustained a large outage in San Francisco that has caused a few hiccups (network, electrical, infrastructural, etc.) around the city. I've confirmed that customers in 365 Main and parts of Telecom 1 have both sustained brief blackouts. No word yet from 200 Paul.

Anyone in the area who could use a hand with anything: I'll probably be wrapping up fixes for my own stuff soon, and would be glad to help however I can.

Cheers,
jonathan

--
Jonathan Lassoff
echo thejof | sed 's/^/jof@/;s/$/.com/'
http://thejof.com
415-215-2464
GPG: 0xC8579EE5
Jonathan Lassoff wrote:
Just a heads up to anyone on the list: PG&E has just sustained a large outage in San Francisco that has caused a few hiccups (network, electrical, infrastructural, etc.) around the city.
I've confirmed that customers in 365 Main and parts of Telecom 1 have both sustained brief blackouts. No word yet from 200 Paul.
Anyone in the area that could use a hand with anything, I'll probably be wrapping up fixes for my stuff soon, and would be glad to help however I can.
I have a question: does anyone seriously accept "oh, power trouble" as a reason your servers went offline? Where's the generators? UPS? Testing said combination of UPS and generators? What if it was important? I honestly find it hard to believe anyone runs a facility like that and people actually *pay* for it.

If you do accept this is a good reason for failure, why?

~Seth
On Tue, Jul 24, 2007, Seth Mattinen wrote:
I have a question: does anyone seriously accept "oh, power trouble" as a reason your servers went offline? Where's the generators? UPS? Testing said combination of UPS and generators? What if it was important? I honestly find it hard to believe anyone runs a facility like that and people actually *pay* for it.
If you do accept this is a good reason for failure, why?
Didn't you read? He paid extra for super-reliable power from his electricity provider...

Adrian
They should have generators running... I can't foresee any good datacenter not having multiple generators to keep their customers' servers online with UPS.

-Ray

On Tue, Jul 24, 2007, Seth Mattinen wrote:
I have a question: does anyone seriously accept "oh, power trouble" as a reason your servers went offline? Where's the generators? UPS? Testing
said combination of UPS and generators? What if it was important? I honestly find it hard to believe anyone runs a facility like that and people actually *pay* for it.
If you do accept this is a good reason for failure, why?
Didn't you read? He paid extra for super-reliable power from his electricity provider.. Adrian
365 I believe has flywheels... from what I'm gathering it wasn't a full building outage. Static switch issues again, anyone? Either way, I'm happy I moved out of there. It was overpriced even when it was working.

I hear they had a scheduled power outage for maintenance this coming weekend. I'll give them the benefit of the doubt and assume it was for something else, not that they knew they had an issue and had their fingers crossed[1].

On a related note - one of my clients came within 5 minutes of the DC UPSs running out today before power came back. The generator truck was still en route, but hey, power's back! So they cancelled it. *sigh*

John

1: ...but not crossed tight enough.

On Tue, Jul 24, 2007 at 08:36:59PM -0400, Raymond L. Corbin wrote:
They should have generators running... I can't foresee any good datacenter not having multiple generators to keep their customers' servers online with UPS.
-Ray
On Tue, Jul 24, 2007, Seth Mattinen wrote:
I have a question: does anyone seriously accept "oh, power trouble" as a reason your servers went offline? Where's the generators? UPS? Testing
said combination of UPS and generators? What if it was important? I honestly find it hard to believe anyone runs a facility like that and people actually *pay* for it.
If you do accept this is a good reason for failure, why?
Didn't you read? He paid extra for super-reliable power from his electricity provider..
Adrian
But as George mentions... Sh*t happens. There are things you can't foresee, or that would take way too much engineering to overcome for that one-in-a-million "oops".

I've been at Telehouse 25B a few times when the "I never expected something like that would happen" happened. (I remember two guys with VERY LONG screwdrivers poking a live transfer switch to get it to reset properly, and was told to step back 20 feet, as that's how far they expected to get thrown if they did something wrong.) (I also remember them resetting the switch, then TRIPPING it again just to make sure it could be reset again!)

Tuc/TBOH
On Tue, 24 Jul 2007, Tuc at T-B-O-H.NET wrote:
(I remember two guys with VERY LONG screwdrivers poking a live transfer switch to get it to reset properly, and was told to step back 20 feet as that's how far they expected to get thrown if they did something wrong). (I also remember them resetting the switch, then TRIPPING it again just to make sure it could be reset again!)
Ahhh, a trip down memory lane :)

The ISP I used to work at had a small ping-and-power colo space, and we also housed a large dial/DSL POP in the same building. A customer went in to do hardware maintenance on one of their colo boxes. Two important notes here:

1. The machine was still plugged in to the power outlet when they decided to do this work.
2. They decided to stick a screwdriver into the power supply WHILE said machine was plugged into said power outlet.

I guess those "no user serviceable parts inside" warning labels are just friendly recommendations and nothing more...

While the machine was fed from a circuit that other colo customers were on, the breaker apparently didn't trip quickly enough to keep the resulting short from sending the 20 kVA Liebert UPS at the back of the room into a fit. It alarmed, then shut down within 1-2 seconds of this customer doing the trick with the screwdriver. This UPS also fed said large dial and DSL POP.

Nothing quite like the sound of a whole machine room spinning down at the same time. It gives you that lovely "oh shit" feeling in the pit of your stomach.

I do remember fighting back the urge to stab said customer with that screwdriver...

jms
From: "Justin M. Streiner" <streiner@cluebyfour.org> Sent: Tuesday, July 24, 2007 5:58 PM Subject: Re: San Francisco Power Outage
Nothing quite like the sound of a whole machine room spinning down at the same time. It gives you that lovely "oh shit" feeling in the pit of your stomach.
Yep. I plugged in my soldering iron and (coincidentally) the whole room at the State of Calif. Franchise Tax EPO'd. Everyone immediately started staring at me, of course.

--Michael
On Jul 24, 2007, at 6:54 PM, Seth Mattinen wrote:
I have a question: does anyone seriously accept "oh, power trouble" as a reason your servers went offline? Where's the generators? UPS? Testing said combination of UPS and generators? What if it was important? I honestly find it hard to believe anyone runs a facility like that and people actually *pay* for it.
Sad that the little Telcove DC here in Lancaster, PA (which Level3 bought a few months ago) has weekly full-on generator tests where 100% of the load is transferred to the generator, while large DCs that charge premium rates apparently do not.

Cordially

Patrick Giagnocavo
patrick@zill.net
On Tue, 2007-07-24 at 19:57 -0400, Patrick Giagnocavo wrote:
On Jul 24, 2007, at 6:54 PM, Seth Mattinen wrote:
I have a question: does anyone seriously accept "oh, power trouble" as a reason your servers went offline? Where's the generators? UPS? Testing said combination of UPS and generators? What if it was important? I honestly find it hard to believe anyone runs a facility like that and people actually *pay* for it.
Sad that the little Telcove DC here in Lancaster, PA, that Level3 bought a few months ago, has weekly full-on generator tests where 100% of the load is transferred to the generator, while apparently large DCs that are charging premium rates, do not.
Perhaps they do. Wouldn't have mattered in this case if the big-red-button rumor is real. ;-) -Jim P.
Sad that the little Telcove DC here in Lancaster, PA, that Level3 bought a few months ago, has weekly full-on generator tests where 100% of the load is transferred to the generator, while apparently large DCs that are charging premium rates, do not.
Perhaps they do. Wouldn't have mattered in this case if the big-red-button rumor is real. ;-)
Also, doing a "full-on generator test where 100% of the load is transferred" is not always the best option. Unneeded use of VRLAs (valve-regulated lead-acid batteries) only shortens their lives.
We also have weekly backup-power tests where 100% of the load for our entire company is put on the three generators. Everything inside the building is put onto generator power; this way we can test for faulty UPSs, etc., and ensure the generators are working. I find it hard to believe that they don't have a similar setup.

Ray Corbin
rcorbin@hostmysite.com
On Jul 24, 2007, at 4:57 PM, Patrick Giagnocavo wrote:
On Jul 24, 2007, at 6:54 PM, Seth Mattinen wrote:
I have a question: does anyone seriously accept "oh, power trouble" as a reason your servers went offline? Where's the generators? UPS? Testing said combination of UPS and generators? What if it was important? I honestly find it hard to believe anyone runs a facility like that and people actually *pay* for it.
Sad that the little Telcove DC here in Lancaster, PA, that Level3 bought a few months ago, has weekly full-on generator tests where 100% of the load is transferred to the generator, while apparently large DCs that are charging premium rates, do not.
I am not familiar with the operational details of 365 Main, but I suspect that they, like most datacenters, probably do have weekly generator and transfer test procedures. However, there are lots of things that can go wrong that are not covered by generator and transfer tests.

It is possible to cascade-fail a power distribution system in a number of ways. It is possible for someone to connect things out of phase during a maintenance procedure in such a way that everything is fine until a transfer occurs, and then all hell breaks loose (ever seen what happens when a large CRAC unit starts trying to run backwards because the 3-phase rotation is out of order?). There are also things that can go wrong in the transfer process itself (like putting the UPS and generators on the bus together some degrees out of phase). Most of these things become far more likely and far harder to avoid as the amount of power and the number of units in the system increases.

I'm not defending the situation at 365 Main. I don't have any first-hand knowledge. I'm just saying that the mere fact that they were dark for several hours today does not necessarily mean that they don't do weekly full-on generator tests.

I have no idea what the root cause of today's outage is. I will be interested in hearing from any credible source as to any actual details, but I'm betting that right now, any such credible source is a bit busy.

Owen
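To put a rough number on the "some degrees out of phase" scenario: when two equal, same-frequency sources are paralleled with a phase error of delta degrees, the voltage that appears across the tie is 2*V*sin(delta/2). The Python sketch below is a toy calculation with generic 480 V numbers, not anything specific to 365 Main's plant:

import math

def differential_voltage(v_rms, phase_error_deg):
    # RMS voltage across the tie when two equal, same-frequency sources
    # are paralleled with the given phase error (in degrees).
    return 2.0 * v_rms * math.sin(math.radians(phase_error_deg) / 2.0)

for err in (5, 10, 30, 120):  # 120 deg is roughly what a phase-rotation mixup looks like
    print("%3d deg error on a 480 V bus -> %6.1f V RMS across the tie"
          % (err, differential_voltage(480, err)))

Even a small closing error dumps a large circulating current into gear that was otherwise healthy, which is part of why a clean transfer test doesn't prove much about a bad phasing job waiting in the wings.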
On 7/24/07, Owen DeLong <owen@delong.com> wrote:
I have no idea what the root cause of today's outage is. I will be interested in hearing from any credible source as to any actual details, but, I'm betting that right now, any such credible source is a bit busy.
Owen
It appears that 365 is using the Hitec Continuous Power System [http://hitec.pageprocessor.nl/p3.php?RubriekID=2016], which is a motor, generator, flywheel, clutch, and diesel engine all on the same shaft. They don't use batteries. If the flywheels spend their energy before the generators come online, they don't have the ability to start the generators without utility power (unless they purchased the Dark Start option, which is simply extra batteries).

-brandon
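For a rough sense of how little time a flywheel-only design buys: the usable energy between full speed and the minimum speed the machine can still make good power at is E = 1/2 * I * (w_full^2 - w_min^2), and ride-through is just that energy divided by the load. The inertia, speeds, and loads in the sketch below are illustrative guesses, not Hitec or 365 Main figures:

import math

def usable_energy_j(inertia_kg_m2, rpm_full, rpm_min):
    # Kinetic energy released as the flywheel slows from rpm_full to rpm_min.
    w_full = rpm_full * 2.0 * math.pi / 60.0
    w_min = rpm_min * 2.0 * math.pi / 60.0
    return 0.5 * inertia_kg_m2 * (w_full ** 2 - w_min ** 2)

energy = usable_energy_j(inertia_kg_m2=5000.0, rpm_full=1800, rpm_min=1500)
for load_kw in (250, 500, 1000):
    print("%4d kW load -> about %4.1f s of ride-through"
          % (load_kw, energy / (load_kw * 1000.0)))

Tens of seconds at best, so everything hinges on the diesel starting and taking load on the first try.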
It appears that 365 is using the Hitec Continuous Power System [http://hitec.pageprocessor.nl/p3.php?RubriekID=2016], which is a motor, generator, flywheel, clutch, and diesel engine all on the same shaft. They don't use batteries. If the flywheels spent their energy before the generators came online, they don't have the ability to start the generators up without utility power (unless they purchased the Dark Start option, which is simply extra batteries). -brandon
They claim in the video tour that they do not have any battery systems on the site. They rely solely on the flywheels. Randy
:They claim in the video tour that they do not have any battery systems on
:the site. They rely solely on the flywheels.

And, there's nothing wrong with that...

Bottom line, regardless of the colo outage, any network that suffered downtime did so due to their own lack of diligence. The crazy drunkard headline seems to have been debunked, so this is nothing more than your random run-of-the-mill outage.

The obvious has been stated and restated throughout this thread. This isn't fodder for nanog, let's move on...

cheers,
brian
On Tue, Jul 24, 2007 at 09:57:09PM -0500, Brandon Galbraith wrote:
It appears that 365 is using the Hitec Continuous Power System [http://hitec.pageprocessor.nl/p3.php?RubriekID=2016], which is a motor, generator, flywheel, clutch, and diesel engine all on the same shaft. They don't use batteries.
Yes. I used to work for the company that originally built the 365 Main datacenter and remember touring it near the end of the construction phase. The collection of power units up on the roof was impressive, as were the seismic isolators in the basement. But even when you try to do everything right, Murphy usually finds a way to sneak up behind you and whisper "BOHICA" in your ear.

For example, we had a failure at another datacenter that uses Piller units, which operate on the same basic principle as the Hitec ones. While running on generator, one of the engines overheated due to an oil-flow problem and threw a rod. When the on-duty electrician responded to the alarm, there were red-hot chunks of engine *outside* of the enclosure, and there was a hole in the side of the unit large enough to stick your arm in. The facility manager kept the damaged piston as a memento. :-)

I don't remember whether this was due to a design flaw, improper installation, or what, but the important points are that (1) this is the real world and shit happens, and (2) it wasn't until the generator was worked long enough that the reduction in oil flow caused enough friction to trigger a catastrophic failure. I.e., there's no guarantee that you will catch this kind of problem in your monthly tests.

On Tue, Jul 24, 2007 at 05:39:34PM -0700, George William Herbert wrote:
Unfortunate real-world lesson: there is a functional difference between pushing the UPS test cutover button, and some of the stuff that can happen out on the power lines (including rapid voltage swings, harmonics, etc).
Precisely. --Jeff
jaitken@aitken.com (Jeff Aitken) writes:
..., we had a failure at another datacenter that uses Piller units, which operate on the same basic principle as the Hitec ones. ...
i guess i never understood why anyone would install a piller that far from the equator. (it spins like a top, on a vertical axis, and the angular momentum is really quite gigantic for its size -- it's heavy and it spins really really fast -- and i remember asking a piller tech why his machine wasn't tipped slightly southward to account for Coriolis, and he said i was confused. probably i am.) but for north america, whenever i had a choice, i chose hitec. (which spins with an axis parallel to gravity.) -- Paul Vixie
It is not as exciting as valleywag suggests.

--- cut here ---

Hello,

The Internap NOC has confirmed with 365 Main that, at approximately 13:50 PDT, they experienced a loss of utility power to their San Francisco facility. The facility's backup generators did not automatically react and fail over upon the loss of utility power, resulting in loss of power to numerous customer cabinets. At 14:24 PDT our logs show customer circuits were restored, as the 365 Main facility was able to bring their backup generators online. 365 Main will continue to run on backup generators until they are confident it is safe to run on utility power again.

Internap will continue to follow up with 365 Main throughout the evening for updates concerning this event. As we receive additional updates we will be sure to relay them to your team. Internap will continue to track this event under ticket 243443.

Again, we apologize for any inconvenience this may have caused your team. If you have any questions or concerns please contact us at noc@internap.com or call 877-843-4662.

Thank you.
--On July 24, 2007 7:57:28 PM -0400 Patrick Giagnocavo <patrick@zill.net> wrote:
On Jul 24, 2007, at 6:54 PM, Seth Mattinen wrote:
I have a question: does anyone seriously accept "oh, power trouble" as a reason your servers went offline? Where's the generators? UPS? Testing said combination of UPS and generators? What if it was important? I honestly find it hard to believe anyone runs a facility like that and people actually *pay* for it.
Sad that the little Telcove DC here in Lancaster, PA, that Level3 bought a few months ago, has weekly full-on generator tests where 100% of the load is transferred to the generator, while apparently large DCs that are charging premium rates, do not.
There's graceful startup testing, then there's dark start testing. During a recent dark start test one of the other customers in the facility I'm in found out their Juniper was not even plugged into their batteries.
-- "Genius might be described as a supreme capacity for getting its possessors into trouble of all kinds." -- Samuel Butler
Well, the fact still remains that operating a datacenter smack-dab in the center of some of the most inflated real estate in recent history is quite a costly endeavor. I really wouldn't be all that surprised if 365 Main cut some corners here and there behind the scenes to save costs while saving face.

As it is, they don't have remotely enough power to fill that facility to capacity, and they've suffered some pretty nasty outages in the recent past. I'm strongly considering the possibility of completely moving out of there.

--j
--
Jonathan Lassoff
echo thejof | sed 's/^/jof@/;s/$/.com/'
http://thejof.com
GPG: 0xC8579EE5
Heh. I am moving about 500 boxes out of there by the end of September. Anyone want a temporary job? :-) I could use the help.

j.

Jonathan Lassoff wrote:
As it is, they don't have remotely enough power to fill that facility to capacity, and they've suffered some pretty nasty outages in the recent past. I'm strongly considering the possibility of completely moving out of there.
jof@thejof.com ("Jonathan Lassoff") writes:
Well, the fact still remains that operating a datacenter smack-dab in the center of some of the most inflated real estate in recent history is quite a costly endeavor.
yes. (speaking for both 365 main, and 529 bryant.)
I really wouldn't be all that surprised if 365 Main cut some corners here and there behind the scenes to save costs while saving face.
no expense was spared in the conversion of this tank turret factory into a modern data center. if there was a dark start option, MFN ordered it. (but if it required maintenance, MFN's bankruptcy interrupted that, but the current owner has never been bankrupt.)
As it is, they don't have remotely enough power to fill that facility to capacity, and they've suffered some pretty nasty outages in the recent past. I'm strongly considering the possibility of completely moving out of there.
2MW/floor seemed like a lot at the time. ~6kW/rack wasn't contemplated. (is it time to build out the land adjacent to 200 paul, then?)

--
Paul Vixie
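The arithmetic behind that: at 2 MW per floor, rack density decides how much of the room you can actually fill. A quick sketch follows; the cooling/loss overhead factor and the assumption that the 2 MW budget has to cover that overhead are mine, not 365 Main's published numbers:

# Back-of-envelope only; overhead factor and power-budget interpretation are assumptions.
floor_budget_kw = 2000.0      # 2 MW per floor
overhead = 1.4                # assume ~40% on top of IT load for cooling and losses

for kw_per_rack in (2, 4, 6):
    racks = floor_budget_kw / (kw_per_rack * overhead)
    print("%d kW/rack -> roughly %d racks per 2 MW floor" % (kw_per_rack, racks))

At ~6 kW/rack the same floor supports roughly a third of the racks it could hold at the 2 kW/rack densities the building was converted around.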
On Jul 24, 2007, at 6:54 PM, Seth Mattinen wrote:
I have a question: does anyone seriously accept "oh, power trouble" as a reason your servers went offline? Where's the generators? UPS? Testing said combination of UPS and generators? What if it was important? I honestly find it hard to believe anyone runs a facility like that and people actually *pay* for it.
Sad that the little Telcove DC here in Lancaster, PA, that Level3 bought a few months ago, has weekly full-on generator tests where 100% of the load is transferred to the generator, while apparently large DCs that are charging premium rates, do not.
And I could tell you about large DCs that charge premium rates and had (admittedly) quarterly generator tests that ended up failing and causing downtime MULTIPLE TIMES too.

Meanwhile, the generator I had installed at my parents' house has weekly tests and runs fine, but I'm waiting for that unbelievably cold, unbelievably harsh winter's day where the power goes out and the generator fails... because it's a machine. It has wear, it breaks. I don't know that I'd be comfortable with a full load every time. I'd rather it be load banks...

Tuc/TBOH
On 7/24/07, Seth Mattinen <sethm@rollernet.us> wrote:
I have a question: does anyone seriously accept "oh, power trouble" as a reason your servers went offline? Where's the generators? UPS? Testing said combination of UPS and generators? What if it was important? I honestly find it hard to believe anyone runs a facility like that and people actually *pay* for it.
If you do accept this is a good reason for failure, why?
~Seth
I'm unable to find a link at the moment, but many moons ago power was lost at the 350 E Cermak Equinix facility in Chicago. At the time, we didn't have production equipment there (only a firewall in a shared colo cage/cabinet). This occurred on a Friday evening and lasted for quite some time into Saturday morning because their generators would start up but would refuse to continue running. I believe the root cause was a problem related to insulation on the power cables somewhere.

I understand testing is done frequently, but I'm also aware that if I want full redundancy, I'm going to have two physically separate locations. There are some events you can't plan for, as well as failure modes that aren't easily/quickly resolved.

-brandon
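The case for two sites is easy to put numbers on, under the (generous) assumption that their failures are independent. The availability figures in the sketch below are illustrative, not measurements of any particular facility:

HOURS_PER_YEAR = 8766.0

for a in (0.999, 0.9999):
    single_down_h = (1.0 - a) * HOURS_PER_YEAR
    both_down_h = ((1.0 - a) ** 2) * HOURS_PER_YEAR  # both independent sites dark at once
    print("one site at %.2f%%: ~%.1f h/yr dark; two independent sites: ~%.2f min/yr dark"
          % (a * 100.0, single_down_h, both_down_h * 60.0))

Shared carriers, shared software, and correlated regional events eat into that in practice, but the point stands: one very good building is still one building.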
sethm@rollernet.us (Seth Mattinen) writes:
I have a question: does anyone seriously accept "oh, power trouble" as a reason your servers went offline? Where's the generators? UPS? Testing said combination of UPS and generators? What if it was important? I honestly find it hard to believe anyone runs a facility like that and people actually *pay* for it.
If you do accept this is a good reason for failure, why?
sometimes the problem is in the redundancy gear itself. PAIX lost power twice during its first five years of operation, and both times it was due to faulty GFI in the UPS+redundancy gear, which had passed testing during construction and subsequently, but eventually some component just wore out.

--
Paul Vixie
On Tue, Jul 24, 2007 at 11:57:37PM +0000, Paul Vixie wrote:
sethm@rollernet.us (Seth Mattinen) writes:
I have a question: does anyone seriously accept "oh, power trouble" as a reason your servers went offline? Where's the generators? UPS? Testing said combination of UPS and generators? What if it was important? I honestly find it hard to believe anyone runs a facility like that and people actually *pay* for it.
If you do accept this is a good reason for failure, why?
sometimes the problem is in the redundancy gear itself. PAIX lost power twice during its first five years of operation, and both times it was due to faulty GFI in the UPS+redundancy gear. which had passed testing during construction and subsequently, but eventually some component just wore out.
I had an issue with exactly that 7 or 8 years ago at Via Networks... the switchover gear shorted and died horrifically, leading to an outage that lasted well through the night (something like 16 hours in total). Being on a Friday evening, it was difficult to get people on site promptly.

The lesson learned was 'the big switch'... a huge thing that took the weight of two adults to move, but it did mean that should something similar occur, we could manually transfer the whole building power directly to the generator. I doubt such a beast would scale to the power loads of a large datacentre, though, but then they are generally not on a single grid/UPS feed.

Steve
participants (20)
- Adrian Chadd
- Alex Rubenstein
- Brandon Galbraith
- Brian Wallingford
- Jason Matthews
- Jeff Aitken
- Jim Popovitch
- John Kinsella
- Jonathan Lassoff
- Justin M. Streiner
- Michael Loftis
- Michael Painter
- Owen DeLong
- Patrick Giagnocavo
- Paul Vixie
- Randy Epstein
- Raymond L. Corbin
- Seth Mattinen
- Stephen Wilcox
- Tuc at T-B-O-H.NET