Re: HE.net, Fremont-2 outage?
Yeah. They had yet another power outage. The fourth in 16 months. Luckily we have already begun plans to leave their facility.

William

------Original Message------
From: Tico
To: nanog@nanog.org
Subject: HE.net, Fremont-2 outage?
Sent: Nov 3, 2009 1:50 PM

Hey guys,

I can't get through to Hurricane Electric, and they seem to be having an outage at their Fremont-2 facility again (as of 17:30 UTC or thereabouts) -- ticket system is unanswered, phones go to voicemail, all equipment is unreachable.

Does anyone here have a presence at 48233 Warm Springs Blvd, that can provide any information about this? I got hit by the ATS failure last month, so I guess it's possible that that equipment may have flaked again.

-t

--
William Pitcock
SystemInPlace - Simple Hosting Solutions
1-866-519-6149
FWIW: http://www.he.net/releases/release18.html Jeff On Tue, Nov 3, 2009 at 2:28 PM, William Pitcock <nenolod@systeminplace.net> wrote:
Yeah. They had yet another power outage. The fourth in 16 months.
Luckily we have already begun plans to leave their facility.
William

------Original Message------
From: Tico
To: nanog@nanog.org
Subject: HE.net, Fremont-2 outage?
Sent: Nov 3, 2009 1:50 PM
Hey guys,
I can't get through to Hurricane Electric, and they seem to be having an outage at their Fremont-2 facility again (as of 17:30 UTC or thereabouts) -- ticket system is unanswered, phones go to voicemail, all equipment is unreachable.
Does anyone here have a presence at 48233 Warm Springs Blvd, that can provide any information about this? I got hit by the ATS failure last month, so I guess it's possible that that equipment may have flaked again.
-t
-- William Pitcock SystemInPlace - Simple Hosting Solutions 1-866-519-6149
-- Jeffrey Lyon, Leadership Team jeffrey.lyon@blacklotus.net | http://www.blacklotus.net Black Lotus Communications of The IRC Company, Inc. Platinum sponsor of HostingCon 2010. Come to Austin, TX on July 19 - 21 to find out how to "protect your booty."
Anyone hear of any issues on the T-Mobile network? Seems as if we cannot reach anyone with a T-Mobile cell phone. Dialing out works sporadically, but calls drop frequently. Thanks, Charles
We are also seeing this in the Metro NY market. Ben On Nov 3, 2009, at 8:18 PM, <Charles.Jouglard@cox.com> wrote:
Anyone hear of any issues on the T-Mobile network? Seems as if we cannot reach anyone with a T-Mobile cell phone. Dialing out works sporadically, but calls drop frequently.
Thanks, Charles
Voice service works in the Austin market but SMS is down. This thread suggests something major died: http://www.howardforums.com/showthread.php?t=1585834 On Nov 3, 2009, at 7:19 PM, Ben Carleton wrote:
We are also seeing this in the Metro NY market.
Ben
On Nov 3, 2009, at 8:18 PM, <Charles.Jouglard@cox.com> wrote:
Anyone hear of any issues on the T-Mobile network? Seems as if we cannot reach anyone with a T-Mobile cell phone. Dialing out works sporadically, but calls drop frequently.
Thanks, Charles
-- Jonathan Bishop -- moonwick@lasthome.net http://lasthome.net/~moonwick/ "Embrace the company of those that seek the truth, and run from those who have found it." -- Václav Havel
Illinois is experiencing the outage as well; looks to be all of the U.S. -- Is ait an mac an sol. Life is strange (such is life). On Nov 3, 2009, at 7:19 PM, Ben Carleton <bc-list@beztech.net> wrote:
We are also seeing this in the Metro NY market.
Ben
On Nov 3, 2009, at 8:18 PM, <Charles.Jouglard@cox.com> wrote:
Anyone hear of any issues on the T-Mobile network? Seems as if we cannot reach anyone with a T-Mobile cell phone. Dialing out works sporadically, but calls drop frequently.
Thanks, Charles
I also noticed an outage here in Colorado; however, the service appears to be back for us now, at least for incoming and outgoing calls, and data is available. -DS On Tue, Nov 3, 2009 at 18:23, Michael Schuler <mike_schuler@me.com> wrote:
Illinois is experiencing the outage as well; looks to be all of the U.S.
-- Is ait an mac an sol. Life is strange (such is life).
On Nov 3, 2009, at 7:19 PM, Ben Carleton <bc-list@beztech.net> wrote:
We are also seeing this in the Metro NY market.
Ben
On Nov 3, 2009, at 8:18 PM, <Charles.Jouglard@cox.com> wrote:
Anyone hear of any issues on the T-Mobile network? Seems as if we cannot
reach anyone with a T-Mobile cell phone. Dialing out works sporadically, but calls drop frequently.
Thanks, Charles
This problem also seems to be affecting tmo users in NYC. I'm unable to reach any tmo user from the AT&T network. Sent from my iPhone 3GS. On Nov 3, 2009, at 8:23 PM, Michael Schuler <mike_schuler@me.com> wrote:
Illinois is experiencing the outage as well; looks to be all of the U.S.
-- Is ait an mac an sol. Life is strange (such is life).
On Nov 3, 2009, at 7:19 PM, Ben Carleton <bc-list@beztech.net> wrote:
We are also seeing this in the Metro NY market.
Ben
On Nov 3, 2009, at 8:18 PM, <Charles.Jouglard@cox.com> wrote:
Anyone hear of any issues on the T-Mobile network? Seems as if we cannot reach anyone with a T-Mobile cell phone. Dialing out works sporadically, but calls drop frequently.
Thanks, Charles
On Nov 3, 2009, at 8:18 PM, <Charles.Jouglard@cox.com> wrote:
Anyone hear of any issues on the T-Mobile network? Seems as if we cannot reach anyone with a T-Mobile cell phone. Dialing out works sporadically, but calls drop frequently.
T-Mobile has admitted the outage. Apparently T-Mobile phones cannot receive calls or texts, even from other T-Mobile phones: <http://twitter.com/TMobile_USA> But the good news is that they apparently added a 'feature' when you go to pay your bill: <http://consumerist.com/5395978/reader-paid-my-t+mobile-bill-saw-some-boobs>
-- TTFN, patrick
Thank you for posting this. I have a G1 and have randomly seen this exact picture pop up on my screen without explanation. I had previously assured myself I was going crazy since I was never able to reproduce the glitch in front of anyone. Jeff
But the good news is that they apparently added a 'feature' when you go to pay your bill:
<http://consumerist.com/5395978/reader-paid-my-t+mobile-bill-saw-some-boobs>
-- TTFN, patrick
-- Jeffrey Lyon, Leadership Team jeffrey.lyon@blacklotus.net | http://www.blacklotus.net Black Lotus Communications of The IRC Company, Inc. Platinum sponsor of HostingCon 2010. Come to Austin, TX on July 19 - 21 to find out how to "protect your booty."
Patrick W. Gilmore wrote:
On Nov 3, 2009, at 8:18 PM, <Charles.Jouglard@cox.com> wrote:
Anyone hear of any issues on the T-Mobile network? Seems as if we cannot reach anyone with a T-Mobile cell phone. Dialing out works sporadically, but calls drop frequently.
T-Mobile has admitted the outage. Apparently T-Mobile phones cannot receive calls or texts, even from other T-Mobile phones:
Twitter is over capacity. Too many tweets! Please wait a moment and try again. @ 5:37 PST
We're using T-Mobile here, no issues reported. I have staff at sites in Virginia, California, and Arizona. Jeff On Tue, Nov 3, 2009 at 8:18 PM, <Charles.Jouglard@cox.com> wrote:
Anyone hear of any issues on the T-Mobile network? Seems as if we cannot reach anyone with a T-Mobile cell phone. Dialing out works sporadically, but calls drop frequently.
Thanks, Charles
-- Jeffrey Lyon, Leadership Team jeffrey.lyon@blacklotus.net | http://www.blacklotus.net Black Lotus Communications of The IRC Company, Inc. Platinum sponsor of HostingCon 2010. Come to Austin, TX on July 19 - 21 to find out how to "protect your booty."
Hard down in Fairfax, VA here. Inbound calls are met with a fast busy (no path) signal. Outbound calls fail. Inbound/outbound SMS text has not worked since at least 6:00PM EST. Surprisingly the IP network seems to still be up. BlackBerry specific functions such as BlackBerry email (BIS) and BBM still work. Web browsing still works. Seems their UMA (GSM over TCP/IP) servers are down too. Pretty bad. - Cary On Tue, Nov 3, 2009 at 8:20 PM, Jeffrey Lyon <jeffrey.lyon@blacklotus.net>wrote:
We're using T-Mobile here, no issues reported. I have staff at sites in Virginia, California, and Arizona.
Jeff
On Tue, Nov 3, 2009 at 8:18 PM, <Charles.Jouglard@cox.com> wrote:
Anyone hear of any issues on the T-Mobile network? Seems as if we cannot reach anyone with a T-Mobile cell phone. Dialing out works sporadically, but calls drop frequently.
Thanks, Charles
-- Jeffrey Lyon, Leadership Team jeffrey.lyon@blacklotus.net | http://www.blacklotus.net Black Lotus Communications of The IRC Company, Inc.
Platinum sponsor of HostingCon 2010. Come to Austin, TX on July 19 - 21 to find out how to "protect your booty."
This was brought up on the outages list. T-Mobile is currently experiencing large-scale outbound voice and data outages across the nation. No current ETA; estimates are several hours. Paul B. On Tue, Nov 3, 2009 at 7:18 PM, <Charles.Jouglard@cox.com> wrote:
Anyone hear of any issues on the T-Mobile network? Seems as if we cannot reach anyone with a T-Mobile cell phone. Dialing out works sporadically, but calls drop frequently.
Thanks, Charles
-- Paul H Bosworth CCNP, CCIP, CCDP, CCDA, CCNA, CCNA Security
Thanks. Experiencing the inability to reach several personnel.

Thanks, Charles

-----Original Message-----
From: Paul Bosworth [mailto:pbosworth@gmail.com]
Sent: Tuesday, November 03, 2009 7:22 PM
To: Jouglard, Charles (CCI-Louisiana)
Cc: nanog@nanog.org
Subject: Re: T-Mobile ?

This was brought up on the outages list. T-Mobile is currently experiencing large-scale outbound voice and data outages across the nation. No current ETA; estimates are several hours.

Paul B.

On Tue, Nov 3, 2009 at 7:18 PM, <Charles.Jouglard@cox.com> wrote:

Anyone hear of any issues on the T-Mobile network? Seems as if we cannot reach anyone with a T-Mobile cell phone. Dialing out works sporadically, but calls drop frequently.

Thanks, Charles

--
Paul H Bosworth
CCNP, CCIP, CCDP, CCDA, CCNA, CCNA Security
On 2009.11.03 17:18:06, Charles.Jouglard@cox.com wrote:
Anyone hear of any issues on the T-Mobile network? Seems as if we cannot reach anyone with a T-Mobile cell phone. Dialing out works sporadically, but calls drop frequently.
Thanks, Charles
There's been a little chatter about it over on outages. (outages.org)

Next time, can you start a new message instead of replying to something and then removing the body/subject? Most modern mail clients use message references that allow threaded conversations to happen. And now there's a t-mobile conversation happening under a thread about a he.net failure...

-sean
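For those unfamiliar with what Sean is describing, threading keys off the standard message-ID headers (RFC 5322), not the Subject line; a minimal illustration, with hypothetical message IDs:

    Message-ID: <4AF0D81E.5050302@example.net>
    In-Reply-To: <20091103181800.GA1234@example.com>
    References: <20091103181800.GA1234@example.com>

A "reply" that blanks the body and subject still carries In-Reply-To/References pointing at the old message, so threading clients file it under the old conversation, which is how a T-Mobile discussion ends up inside an HE.net thread.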
That release is from 10/31/01. On Tue, 2009-11-03 at 20:04 -0500, Jeffrey Lyon wrote:
FWIW: http://www.he.net/releases/release18.html
Jeff
On Tue, Nov 3, 2009 at 2:28 PM, William Pitcock <nenolod@systeminplace.net> wrote:
Yeah. They had yet another power outage. The fourth in 16 months.
Luckily we have already begun plans to leave their facility.
William

------Original Message------
From: Tico
To: nanog@nanog.org
Subject: HE.net, Fremont-2 outage?
Sent: Nov 3, 2009 1:50 PM
Hey guys,
I can't get through to Hurricane Electric, and they seem to be having an outage at their Fremont-2 facility again (as of 17:30 UTC or thereabouts) -- ticket system is unanswered, phones go to voicemail, all equipment is unreachable.
Does anyone here have a presence at 48233 Warm Springs Blvd, that can provide any information about this? I got hit by the ATS failure last month, so I guess it's possible that that equipment may have flaked again.
-t
-- William Pitcock SystemInPlace - Simple Hosting Solutions 1-866-519-6149
-- ************************************************************ Michael J. McCafferty Principal, Security Engineer M5 Hosting http://www.m5hosting.com You can have your own custom Dedicated Server up and running today ! RedHat Enterprise, CentOS, Ubuntu, Debian, OpenBSD, FreeBSD, and more ************************************************************
How long can they go on those 3000 gallons under their current load?
http://www.dieselserviceandsupply.com/Diesel_Fuel_Consumption.aspx On Tue, Nov 3, 2009 at 5:49 PM, Lyndon Nerenberg (VE6BBM/VE7TFX) <lyndon@orthanc.ca> wrote:
How long can they go on those 3000 gallons under their current load?
Well, they say it's a Cat unit, so probably one like this : http://www.cat.com/cda/components/securedFile/displaySecuredFileServletJSP?x=7&fileId=1081064
How long can they go on those 3000 gallons under their current load?
That engine is rated to consume 70.9 gal/hr at 50% load ... so using a conservative estimate, I'd say about 42 hours. Cheers, Michael Holstein Cleveland State University
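Sanity-checking that figure, a minimal sketch in Python, assuming the 70.9 gal/hr burn rate above holds constant for the whole run:

    # Back-of-the-envelope generator runtime, assuming a full 3000 gallon
    # tank and a constant 70.9 gal/hr burn rate (the 50%-load figure from
    # the Cat spec sheet linked above).
    tank_gallons = 3000.0
    burn_gal_per_hr = 70.9  # at 50% load

    runtime_hours = tank_gallons / burn_gal_per_hr
    print(f"Runtime at 50% load: {runtime_hours:.1f} hours")  # ~42.3 hours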
On Wed, Nov 4, 2009 at 8:19 AM, Michael Holstein <michael.holstein@csuohio.edu> wrote:
Well, they say it's a Cat unit, so probably one like this : http://www.cat.com/cda/components/securedFile/displaySecuredFileServletJSP?x=7&fileId=1081064
How long can they go on those 3000 gallons under their current load?
That engine is rated to consume 70.9 gal/hr at 50% load ... so using a conservative estimate, I'd say about 42 hours.
Wouldn't the conservative estimate be 21 hours? (3000 gallons, 142 gal/hr at 100% load); you'd get more hours out by guessing at what fraction of full load the generator is running, but anything longer than 21 hours is a fudge-factor guesstimate, and not to be counted on.
Cheers,
Michael Holstein Cleveland State University
Matt
On Wed, Nov 4, 2009 at 8:19 AM, Michael Holstein <michael.holstein@csuohio.edu> wrote:
Well, they say it's a Cat unit, so probably one like this : http://www.cat.com/cda/components/securedFile/displaySecuredFileServletJSP?x=7&fileId=1081064
How long can they go on those 3000 gallons under their current load?
That engine is rated to consume 70.9 gal/hr at 50% load ... so using a conservative estimate, I'd say about 42 hours.
Wouldn't the conservative estimate be 21 hours? (3000 gallons, 142 gal/hr at 100% load); you'd get more hours out by guessing at what fraction of full load the generator is running, but anything longer than 21 hours is a fudge-factor guesstimate, and not to be counted on.
The mildly conservative estimate is 21 hours minus the guaranteed turnaround time for your fuel vendor to show up, minus some more fudge factor to allow for someone to actually hook up and actually refuel, etc. The paranoid conservative estimate is more complex; you have to assume you call the primary vendor, they don't show, and then you have to call your backup(s). If you have a three-hour guarantee in the contract, you have to remember that this can still represent some scrambling by your vendor, and if you're lights out, it's quite possible that others are as well, and hospitals and city hall might rate as more urgent. It's also possible that the truck'll have a flat or mechanical problems, or try to rush through the railroad crossing about to be rendered impassable by a slow-moving freight train. It'll probably take you an additional hour to panic and call your backup supplier; now you are a bunch of hours shorter on capacity than you thought. Of course, a lot of this is simply how you look at the problem. If we're talking runtime-until-dry, yeah, 21 hours. If we're talking a practical number for how long you can go until it's proper for some panic to set in and calls to get made, it's more like half that. ;-)

With power:

N+1 is usually better than N
Best to assume full load when doing math
Things will go wrong, predict common failures
The best plans are still prone to failure
Safety margins can save your rear
etc

... JG -- Joe Greco - sol.net Network Services - Milwaukee, WI - http://www.sol.net "We call it the 'one bite at the apple' rule. Give me one chance [and] then I won't contact you again." - Direct Marketing Ass'n position on e-mail spam(CNN) With 24 million small businesses in the US alone, that's way too many apples.
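A sketch of that "practical number" reasoning, in the same spirit; every margin figure below is an illustrative assumption, not anyone's actual contract terms:

    # Practical refuel deadline: runtime until dry at worst-case burn,
    # minus the refueling-logistics margins discussed above.
    tank_gallons = 3000.0
    full_load_burn = 141.8       # gal/hr at 100% load (2x the 50% figure)
    vendor_turnaround_hr = 3.0   # contracted delivery guarantee (assumed)
    hookup_hr = 1.0              # time to connect and pump (assumed)
    backup_slip_hr = 4.0         # lost if the primary vendor no-shows (assumed)

    dry_tank_hr = tank_gallons / full_load_burn  # ~21.2 hours
    call_by_hr = dry_tank_hr - (vendor_turnaround_hr + hookup_hr + backup_slip_hr)
    print(f"Dry tank at hour {dry_tank_hr:.1f}; make the call by hour {call_by_hr:.1f}")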
Joe Greco wrote:
With power:
N+1 is usually better than N
Best to assume full load when doing math
Things will go wrong, predict common failures
The best plans are still prone to failure
Safety margins can save your rear
etc
I find that electrical panelboards, busways, transfer switches, etc. are often put in the category of things that don't need maintenance or routine inspections. Big deal if you can start your fancy generator once a month (I prefer on-load weekly) but the in-between stuff is in disrepair or full of mice. Even a simple dusty transfer switch could arc-weld itself to one side of the contacts. ~Seth
Joe Greco wrote:
With power:
N+1 is usually better than N
Best to assume full load when doing math
Things will go wrong, predict common failures
The best plans are still prone to failure
Safety margins can save your rear
etc
I find that electrical panelboards, busways, transfer switches, etc. are often put in the category of things that don't need maintenance or routine inspections. Big deal if you can start your fancy generator once a month (I prefer on-load weekly) but the in-between stuff is in disrepair or full of mice. Even a simple dusty transfer switch could arc-weld itself to one side of the contacts.
Yup. Related: "100% availability" is a marketing person's dream; it sounds good in theory but is unattainable in practice, and is a reliable sign of non-100%-reliability. The most common way to gain "100% availability" is to avoid testing under load. This surely protects the equipment against a whole slew of failures in the less-used portions of your power systems, but also protects you from detecting them outside your Hour(s) Of Greatest Need. And even for those who follow best practices... You can inspect and maintain things until you're blue in the face. One day a contractor will drop a wrench into a PDU or UPS or whatever and spectacular things will happen. Or a battery develops a strange fault. If you do live load testing, you'll lose now and then. It's best to simply assume no single circuit is 100% reliable. You should be able to get two circuits from separate power systems and the combination of the two should really closely approximate 100%, but even there... it isn't. ... JG -- Joe Greco - sol.net Network Services - Milwaukee, WI - http://www.sol.net "We call it the 'one bite at the apple' rule. Give me one chance [and] then I won't contact you again." - Direct Marketing Ass'n position on e-mail spam(CNN) With 24 million small businesses in the US alone, that's way too many apples.
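For what "really closely approximate 100%" works out to, a sketch assuming two truly independent feeds at an illustrative 99.9% availability each (the per-feed figure is made up):

    # Combined availability of two independent power feeds.
    # The 0.999 per-feed figure is an assumption for illustration only.
    feed_availability = 0.999
    combined = 1 - (1 - feed_availability) ** 2
    print(f"Combined: {combined:.6f}")  # 0.999999 ("six nines")
    # Real feeds share failure modes (common utility, common humans),
    # so true independence -- and this number -- is optimistic.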
Yup. Related: "100% availability" is a marketing person's dream; it sounds good in theory but is unattainable in practice, and is a reliable sign of non-100%-reliability.
You are confusing two different things. Availability != Reliability. For instance, an airplane is designed to be 100% reliable, but much less available. To keep a 747 from crashing (100% reliability), it needs significant downtime (not 100% available).
And even for those who follow best practices... You can inspect and maintain things until you're blue in the face. One day a contractor will drop a wrench into a PDU or UPS or whatever and spectacular things will happen.
That's where policies, procedures, and methods come in (read: SAS 70)
Or a battery develops a strange fault.
Get more than one string, or more than one UPS, with monitoring. Batteries are NOT the Achilles heel everyone wants to make you believe they are. "Question everything, assume nothing, discuss all, and resolve quickly." -- Alex Rubenstein, AR97, K2AHR, alex@nac.net, latency, Al Reuben -- -- Net Access Corporation, 800-NET-ME-36, http://www.nac.net --
Yup. Related: "100% availability" is a marketing person's dream; it sounds good in theory but is unattainable in practice, and is a reliable sign of non-100%-reliability.
You are confusing two different things.
No, I'm not. They're interrelated. That doesn't mean that they are the same thing, but to talk about them in terms of their relationship or their effect on service is perfectly fair.
And even for those who follow best practices... You can inspect and maintain things until you're blue in the face. One day a contractor will drop a wrench into a PDU or UPS or whatever and spectacular things will happen.
That's where policies, procedures, and methods come in (read: SAS 70)
Policies, procedures, and methods are nice. Unfortunately, it is not too uncommon for all of the above to be bent or broken for a whole slew of reasons. What about a problem that hasn't been planned for? It only takes one time ... one mistake ... of just the right kind.
Or a battery develops a strange fault.
Get more than one string, or more than one UPS, with monitoring. Batteries are NOT the Achilles heel everyone wants to make you believe they are.
I know you have a rather higher faith in batteries than some of us, but practical experience suggests that batteries are merely a mostly-reliable technology. ... JG -- Joe Greco - sol.net Network Services - Milwaukee, WI - http://www.sol.net "We call it the 'one bite at the apple' rule. Give me one chance [and] then I won't contact you again." - Direct Marketing Ass'n position on e-mail spam(CNN) With 24 million small businesses in the US alone, that's way too many apples.
I know you have a rather higher faith in batteries than some of us, but practical experience suggests that batteries are merely a mostly-reliable technology.
Agreed, batteries are unreliable. An alternative to battery-based UPSes is flywheel energy storage devices; they come either as an integrated solution with the diesel generator (I think Cat offers such a package) or as a standalone UPS (see: www.pentadyne.com/uploads/18/File/Pentadyne-VSS-Brochure.pdf). Another vendor is Active Power (which I think partners with Cat). They seem to be MUCH more reliable than batteries from what I read. HE probably acquired one of those solutions. -Raphael Carrier
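One practical difference worth keeping in mind: a flywheel stores far less energy than a battery string, so it only has to bridge the gap until the generator picks up the load. A rough sketch; both figures are illustrative assumptions, not Pentadyne or Active Power specs:

    # Rough flywheel UPS ride-through: usable stored energy over load.
    stored_kwh = 0.8   # usable kinetic energy per module (assumed)
    load_kw = 160.0    # critical load on that module (assumed)

    ride_through_s = stored_kwh / load_kw * 3600
    print(f"Ride-through: {ride_through_s:.0f} seconds")  # ~18 s
    # Enough to cover a generator start, not an extended outage, which
    # is why flywheel plants lean hard on generator reliability.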
On Nov 4, 2009, at 2:08 PM, Raphael Carrier wrote:
I know you have a rather higher faith in batteries than some of us, but practical experience suggests that batteries are merely a mostly-reliable technology.
Agreed, batteries are unreliable. An alternative to battery-based UPSes is flywheel energy storage devices; they come either as an integrated solution with the diesel generator (I think Cat offers such a package) or as a standalone UPS (see: www.pentadyne.com/uploads/18/File/Pentadyne-VSS-Brochure.pdf)
Apparently you do not remember 365 Main... Batteries are reliable. Flywheels are reliable. Both require proper maintenance and proper procedures to handle corner cases (like the multiple-outage corner-case that took out 365 Main). Both have their issues. In my experience working at and with a variety of datacenters, I have to say that I have had generally better luck with batteries than flywheels, but the key difference that suggests flywheels could actually be better technology is this: About 50% of battery failures traced back to human factors. 100% of the flywheel failures I experienced were human factors related. Owen Speaking as an individual, not representing any affiliation.
Sry for the top post... As more facilities are built/retrofitted with an eye toward overall efficiency using CCHP, we will start seeing more facilities (like Syracuse U's new datacenter) use systems like the Capstone turbines for primary power/secure power/CCHP. The main grid will become the backup. Not saying this approach replaces the need for batteries or some other storage device such as a flywheel system.

InLine> bryan king | Internet Department Director
InLine> Solutions Through Technology
600 Lakeshore Pkwy Birmingham AL, 35209
205-278-8139 [p] 205-314-7729 [f]
bking@inline.com www.InLine.com

From: Owen DeLong [mailto:owen@delong.com]
Sent: Wednesday, November 04, 2009 4:18 PM
To: Raphael Carrier
Cc: nanog@nanog.org; Joe Greco
Subject: Re: HE.net, Fremont-2 outage?

On Nov 4, 2009, at 2:08 PM, Raphael Carrier wrote:
I know you have a rather higher faith in batteries than some of us, but practical experience suggests that batteries are merely a mostly-reliable technology.
Agreed, batteries are unreliable. An alternative to battery-based UPSes is flywheel energy storage devices; they come either as an integrated solution with the diesel generator (I think Cat offers such a package) or as a standalone UPS (see: www.pentadyne.com/uploads/18/File/Pentadyne-VSS-Brochure.pdf)
Apparently you do not remember 365 Main... Batteries are reliable. Flywheels are reliable. Both require proper maintenance and proper procedures to handle corner cases (like the multiple-outage corner-case that took out 365 Main). Both have their issues. In my experience working at and with a variety of datacenters, I have to say that I have had generally better luck with batteries than flywheels, but the key difference that suggests flywheels could actually be better technology is this: About 50% of battery failures traced back to human factors. 100% of the flywheel failures I experienced were human factors related. Owen Speaking as an individual, not representing any affiliation.
On Wed, Nov 4, 2009 at 2:08 PM, Raphael Carrier <raphael.carrier@gmail.com>wrote:
Agreed, batteries are unreliable. An alternative to battery-based UPSes is flywheel energy storage devices; they come either as an integrated solution with the diesel generator (I think Cat offers such a package)
Yup, just ask 365 Main how reliable they are - http://365main.com/status_update.html I'm not saying that battery-based UPS's are better, but no matter what type of system you look at you're going to find failures. Scott
On Wed, Nov 4, 2009 at 2:08 PM, Raphael Carrier <raphael.carrier@gmail.com>wrote:
Agreed, batteries are unreliable. An alternative to battery-based UPSes is flywheel energy storage devices; they come either as an integrated solution with the diesel generator (I think Cat offers such a package)
Yup, just ask 365 Main how reliable they are - http://365main.com/status_update.html
I'm not saying that battery-based UPS's are better, but no matter what type of system you look at you're going to find failures.
I would point out that my cursory review of the document linked above leaves a very positive impression. I don't know the actual details well enough to know if there is any reason to doubt the document... I would, however, tend to trust a vendor who disclosed events in this manner. Even the best systems can fail. How a failure is handled is in many ways the more important factor; being transparent about it is good for confidence. Best to plan for the occasional issue. ... JG -- Joe Greco - sol.net Network Services - Milwaukee, WI - http://www.sol.net "We call it the 'one bite at the apple' rule. Give me one chance [and] then I won't contact you again." - Direct Marketing Ass'n position on e-mail spam(CNN) With 24 million small businesses in the US alone, that's way too many apples.
On Wed, Nov 4, 2009 at 2:56 PM, Joe Greco <jgreco@ns.sol.net> wrote:
Yup, just ask 365 Main how reliable they are - http://365main.com/status_update.html
I would point out that my cursory review of the document linked above leaves a very positive impression. I don't know the actual details well enough to know if there is any reason to doubt the document...
I would, however, tend to trust a vendor who disclosed events in this manner. Even the best systems can fail. How a failure is handled is in many ways the more important factor; being transparent about it is good for confidence.
Absolutely! 365 Main handled this outage very well, both at the time and, more importantly, with the followup at the URL above, which was made (very!) public by them at the time and not covered in "confidential/customer only/etc" warnings. Those that have (finally) received the notification from HE about yesterday's outage will notice the stark difference between the way they've handled it and the way 365 Main handled things... Scott.
Alex Rubenstein wrote:
Yup. Related: "100% availability" is a marketing person's dream; it sounds good in theory but is unattainable in practice, and is a reliable sign of non-100%-reliability.
You are confusing two different things.
Availability != Reliability.
Pardon the interruption... In the aforementioned statement, there appears an intense/flagrant compartmentalization/separation of terms without sufficient explanation. Note that in being available, 'a' criterion to ensure reliability is met. If one has the desire to delve into some of the nuanced operational perspective, see: http://ow.ly/zmQg (pdf) or http://ow.ly/zmTB (web friendly). The article is also available through the IEEE Portal at http://ow.ly/zn3a (if one of the other links appears to be unavailable, anytime).
For instance, an airplane is designed to be 100% reliable, but much less available. To keep a 747 from crashing (100% reliability), it needs significant downtime (not 100% available).
This explanation, aside from being unsatisfactory, is misleading. Operating times and maintenance times are very much separate quantities.
And even for those who follow best practices... You can inspect and maintain things until you're blue in the face. One day a contractor will drop a wrench into a PDU or UPS or whatever and spectacular things will happen.
That's where policies, procedures, and methods come in (read: SAS 70)
For the operationally minded -- on one hand, there is an assumption here that 'accidents' are not preventable; on the other hand, there is at least an assumption being made here that SAS 70 is the curative for 'accidents.' To be brief, accounting for human behavior as an underlying contributor to accidents can be a backbreaking and immensely messy endeavor. In this respect, SAS 70 can only be assistive. All the best, Robert Mathews. --
Alex Rubenstein wrote:
Yup. Related: "100% availability" is a marketing person's dream; it sounds good in theory but is unattainable in practice, and is a reliable sign of non-100%-reliability.
You are confusing two different things.
Availability != Reliability.
Pardon the interruption...
In the aforementioned statement, there appears an intense/flagrant compartmentalization/separation of terms without sufficient explanation.
Correct. It's even a bit more interesting than that; the implication is that marketing people, not really knowing the difference but having heard repeatedly about "high availability", may proceed to use "availability" as a buzzword... I guess I was a bit more oblique than intended.
Note that in being available, 'a' criterion to ensure reliability is met. If one has the desire to delve into some of the nuanced operational perspective, see: http://ow.ly/zmQg (pdf) or http://ow.ly/zmTB (web friendly). The article is also available through the IEEE Portal at http://ow.ly/zn3a (if one of the other links appears to be unavailable, anytime).
I doubt marketing people will care. :-)
For instance, an airplane is designed to be 100% reliable, but much less available. To keep a 747 from crashing (100% reliability), it needs significant downtime (not 100% available).
This explanation, aside from being unsatisfactory, is misleading. Operating times and maintenance times are very much separate quantities.
And airplanes aren't 100% reliable regardless... For a power system as a whole, though, one could see 100% availability as a prereq for 100% reliability. Of course, you more closely approach 100% through redundancies... oops, should we introduce another term to debate? :-)
And even for those who follow best practices... You can inspect and maintain things until you're blue in the face. One day a contractor will drop a wrench into a PDU or UPS or whatever and spectacular things will happen.
That's where policies, procedures, and methods come in (read: SAS 70)
For the operationally minded -- on one hand, there is an assumption here that 'accidents' are not preventable;
You cannot eliminate accidents. Accidents represent things which are by definition unforeseen and unplanned. Accidents may be reducible through the use of good planning and practices. On one hand, one can foresee a risk in resting a wrench near some energized busbars while needing one's hands to do something else; you can define good practices that forbid this sort of thing. Even that may not completely eliminate the practice; there are plenty of examples of companies having good policies that are disregarded by employees in the field. On the other hand, when Bruno is moving a construction excavator around next door, suffers a heart attack, and floors the controls such that the excavator rams your building and the boom arm penetrates your wall and shoves a guy face-first into the busbars, well, obviously we're talking extremely unlikely (I hope it's obvious I'm even trying to be a bit ridiculous), but that's an Accident. And they happen.
on the other hand, there is at least an assumption being made here that SAS 70 is the curative for 'accidents.' To be brief, accounting for human behavior as an underlying contributor to accidents can be a backbreaking and immensely messy endeavor. In this respect, SAS 70 can only be assistive.
Correct. We can only hope to reduce accidents. My original point was simply that I prefer people who recognize 100% as a desirable-but-unobtainable goal. ... JG -- Joe Greco - sol.net Network Services - Milwaukee, WI - http://www.sol.net "We call it the 'one bite at the apple' rule. Give me one chance [and] then I won't contact you again." - Direct Marketing Ass'n position on e-mail spam(CNN) With 24 million small businesses in the US alone, that's way too many apples.
Regarding Reliability and Availability:

1. Reliability and Availability are related, but not identical.
2. Systemic availability is, generally, the result of the combination of component reliability, component redundancy, policies, procedures, and discipline.
3. Policies, procedures, and discipline help to reduce and/or mitigate accidents.

In terms of accidents and human factors:

1. Accidents cannot be eliminated entirely, but, with proper procedures, policies, and discipline, most can be prevented.
2. Most accidents which cannot be eliminated can be mitigated, but doing so often comes at a cost which exceeds the product of benefit and likelihood.

We could learn a lot about this from Aviation. Nowhere in human history have more research, care, training, and discipline been applied to accident prevention, mitigation, and analysis than in aviation. A few examples:

NTSB investigations of EVERY US aircraft accident and published findings.

NASA Aviation Safety Reporting System

When NTSB finds a design flaw in an aircraft at fault for an accident, there is a process by which that error gets translated into an Airworthiness Directive forcing aircraft owners to have the flaw corrected to continue operating the aircraft.

When NTSB finds a training discrepancy, procedural problem, etc., there is a process by which those discrepancies are addressed through training, retraining, etc. For example, after a couple of accidents related to microbursts, NTSB and FAA determined that all pilots should undergo training on windshear and windshear avoidance, including microbursts.

(There are many more examples)

Owen
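To put a number on point 1, here is the textbook steady-state relationship between the two, with made-up figures:

    # Steady-state availability from MTBF (a reliability measure) and
    # MTTR (repair time). Both numbers are illustrative assumptions.
    mtbf_hr = 8760.0  # one failure per year (assumed)
    mttr_hr = 4.0     # four hours to repair (assumed)

    availability = mtbf_hr / (mtbf_hr + mttr_hr)
    print(f"Availability: {availability:.5f}")  # ~0.99954
    # Highly reliable but slow-to-repair gear can post worse availability
    # than flakier gear that swaps out in minutes -- the 747 example above.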
At 09:20 AM 11/5/2009, Owen DeLong wrote:
Regarding Reliability and Availability:
We could learn a lot about this from Aviation.
Owen, I think if we conducted a poll, a disproportionate percentage of NANOG folks are likely also pilots (compared to the general population anyway). I agree with you completely that aviation is a good model to follow if it is adapted where it makes sense.

All, the real problem is the same human factors we have in aviation which cause most accidents. Look at the list below and replace the word Pilot with Network Engineer or Support Tech or Programmer or whatever... and think about all the problems where something didn't work out right. It's because someone circumvented the rules, processes, and cross checks put in place to prevent the problem in the first place. Nothing can be made idiot proof because idiots are so creative.

-Robert
SEL/MEL Private Instrument

THE FIVE HAZARDOUS ATTITUDES

1. Anti-Authority: "Don't tell me." This attitude is found in people who do not like anyone telling them what to do. In a sense, they are saying, "No one can tell me what to do." They may be resentful of having someone tell them what to do, or may regard rules, regulations, and procedures as silly or unnecessary. However, it is always your prerogative to question authority if you feel it is in error.

2. Impulsivity: "Do it quickly." This is the attitude of people who frequently feel the need to do something, anything, immediately. They do not stop to think about what they are about to do; they do not select the best alternative, and they do the first thing that comes to mind.

3. Invulnerability: "It won't happen to me." Many people feel that accidents happen to others, but never to them. They know accidents can happen, and they know that anyone can be affected. But they never really feel or believe that they will be personally involved. Pilots who think this way are more likely to take chances and increase risk.

4. Macho: "I can do it." Pilots who are always trying to prove that they are better than anyone else are thinking, "I can do it - I'll show them." Pilots with this type of attitude will try to prove themselves by taking risks in order to impress others. While this pattern is thought to be a male characteristic, women are equally susceptible.

5. Resignation: "What's the use?" Pilots who think, "What's the use?" do not see themselves as being able to make a great deal of difference in what happens to them. When things go well, the pilot is apt to think that it is good luck. When things go badly, the pilot may feel that someone is out to get them, or attribute it to bad luck. The pilot will leave the action to others, for better or worse. Sometimes, such pilots will even go along with unreasonable requests just to be a "nice guy."

Tellurian Networks - A Dell Perot Systems Company
http://www.tellurian.com | 888-TELLURIAN | 973-300-9211
"Well done is better than well said." - Benjamin Franklin
On November 5, 2009, Robert Boyle wrote:
It's because someone circumvented the rules, processes, and cross checks put in place to prevent the problem in the first place. Nothing can be made idiot proof because idiots are so creative.
-Robert SEL/MEL Private Instrument
No, no commercial pilot ever flew overweight, or in weather below minimums, or more than the max hours in a month.. never happens ;) And there was never a boss that 'pushed' them into it, for the sake of expediency or financial gain, and the phrase 'Big Sky, Little Plane' was never uttered.. logbooks never fudged and rules are always followed..

C(om)255379

--
"Catch the Magic of Linux..."
Michael Peddemors - President/CEO - LinuxMagic
Products, Services, Support and Development
Visit us at http://www.linuxmagic.com
A Wizard IT Company - For More Info http://www.wizard.ca
"LinuxMagic" is a Registered TradeMark of Wizard Tower TechnoServices Ltd.
604-589-0037 Beautiful British Columbia, Canada
On Nov 5, 2009, at 4:30 PM, Michael Peddemors wrote:
On November 5, 2009, Robert Boyle wrote:
It's because someone circumvented the rules, processes, and cross checks put in place to prevent the problem in the first place. Nothing can be made idiot proof because idiots are so creative.
-Robert SEL/MEL Private Instrument
No, no commercial pilot ever flew overweight, or in weather below minimums, or more than the max hours in a month.. never happens ;) And there was never a boss that 'pushed' them into it, for the sake of expediency or financial gain, and the phrase 'Big Sky, Little Plane' was never uttered.. logbooks never fudged and rules are always followed..
C(om)255379
Of course, all of those things have happened. However, if we started treating networking errors more like the way we treat aviation errors, the reliability of networking would improve dramatically. OTOH, if we did that, the cost of networking would also probably gain a zero. Owen Commercial ASEL Instrument Airplane
Owen DeLong wrote:
We could learn a lot about this from Aviation. Nowhere in human history have more research, care, training, and discipline been applied to accident prevention, mitigation, and analysis than in aviation. A few examples:
NTSB investigations of EVERY US aircraft accident and published findings.
Ask any commercial pilot (and especially a commercial commuter flight pilot) what they think of NTSB investigations when the pilot had a "bad schedule" that doesn't allow enough time for adequate sleep. They will point out that lack of sleep can't be determined in an autopsy. The NTSB routinely puts an accident down to "pilot error" even when pilots who regularly fly those routes and shifts are convinced that exhaustion (lack of sleep, long working days) was clearly involved. And for even worse news - the smaller the plane the more complicated it is to fly and the LESS rest the pilots receive in their overnight stays because commuter airlines are covered under part 135 while major airlines are covered under part 121. My ex flew turbo-prop planes for American Eagle (American Airlines commuter flights). It was common to have the pilot get off duty near 10 pm and be required to report back at 6 am. That's just 8 hours for rest. The "rest period" starts with a wait for a shuttle to the hotel, then the drive to the hotel (often 15 minutes or more from the airport) then check-in - it can add up to 30-45 minutes before the pilot is actually inside a hotel room. These overnight stays are in smaller towns like Santa Rosa, Fresno, Bakersfield, etc. Usually the pilots are put up at hotels that don't have a restaurant open this late, and no neighboring restaurants (even fast food) so the pilot doesn't get dinner. (There is no time for dinner in the flight schedule - they get at most 20 minutes of free time between arrival and take-off - enough time to get a bio-break and hit a vending machine but not enough time to actually get a meal.) Take a shower, get to bed at about 11:30. Set the alarm for 4:45 am and catch the shuttle back to the airport at 5:15 to get there before the 6:00 reporting time. In that "8 hour" rest period you get less than 6 hours of sleep - if you can fall asleep easily in a strange hotel. Commuter route pilots have been fighting to get regulations changed to require longer overnight periods, and especially to get the required rest period changed to "behind the door" so that the airlines can't include the commute time to/from the airport in the "rest" period. This would force the airlines to select hotels closer to the airport or else allow longer overnight layovers - either way the pilots would get adequate rest. See: http://asrs.arc.nasa.gov/publications/directline/dl5_one.htm The NTSB does a great job with mechanical issues and with training issues, but they totally miss the boat when it comes to regulating adequate rest periods in the airline schedules. To bring this back to NANOG territory, how many times have you or one of your network admins made a mistake when working with inadequate sleep - due to extra early start hours (needless 8 am meetings), or working long/late hours, or being called to work in the middle of the night? Finally, having lived with a commercial aviation pilot for 5 years and having worked with network types for much longer, I can say that while there is some overlap between pilots and IT techs, there are also a LOT of people who go into computers (programming, network and system administration) who are totally unsuitable for the regimented environment required for commercial aviation - people who HATE following a lot of rules and regulations and fixed schedules. If you tried to impose FAA-type rules and regulations and airline schedules on an IT organization, you would have a revolt on your hands.
Tread carefully when you consider emulating Aviation. jc
On Nov 6, 2009, at 12:04 PM, JC Dill wrote:
Owen DeLong wrote:
We could learn a lot about this from Aviation. Nowhere in human history have more research, care, training, and discipline been applied to accident prevention, mitigation, and analysis than in aviation. A few examples:
NTSB investigations of EVERY US aircraft accident and published findings.
Ask any commercial pilot (and especially a commercial commuter flight pilot) what they think of NTSB investigations when the pilot had a "bad schedule" that doesn't allow enough time for adequate sleep. They will point out that lack of sleep can't be determined in an autopsy.
As a point of information, I _AM_ a commercial pilot.
The NTSB routinely puts an accident down to "pilot error" even when pilots who regularly fly those routes and shifts are convinced that exhaustion (lack of sleep, long working days) was clearly involved. And for even worse news - the smaller the plane the more complicated it is to fly and the LESS rest the pilots receive in their overnight stays because commuter airlines are covered under part 135 while major airlines are covered under part 121. My ex flew turbo-prop planes for American Eagle (American Airlines commuter flights). It was common to have the pilot get off duty near 10 pm and be required to report back at 6 am. That's just 8 hours for rest. The "rest period" starts with a wait for a shuttle to the hotel, then the drive to the hotel (often 15 minutes or more from the airport) then check-in - it can add up to 30-45 minutes before the pilot is actually inside a hotel room. These overnight stays are in smaller towns like Santa Rosa, Fresno, Bakersfield, etc. Usually the pilots are put up at hotels that don't have a restaurant open this late, and no neighboring restaurants (even fast food) so the pilot doesn't get dinner. (There is no time for dinner in the flight schedule - they get at most 20 minutes of free time between arrival and take-off - enough time to get a bio-break and hit a vending machine but not enough time to actually get a meal.) Take a shower, get to bed at about 11:30. Set the alarm for 4:45 am and catch the shuttle back to the airport at 5:15 to get there before the 6:00 reporting time. In that "8 hour" rest period you get less than 6 hours of sleep - if you can fall asleep easily in a strange hotel.
Flying in such a state of exhaustion is, whether you like it or not, a form of pilot error. A pilot who chooses to fly on such a schedule is making an error in judgment. Sure, there are all kinds of pressures and employment issues that need to be resolved to reduce and eliminate that pressure, and, I support the idea of updating the crew duty time regulations with that in mind. That does not change the fact that FAR 91.3 still applies:

Sec. 91.3 Responsibility and authority of the pilot in command.

(a) The pilot in command of an aircraft is directly responsible for, and is the final authority as to, the operation of that aircraft.

(b) In an in-flight emergency requiring immediate action, the pilot in command may deviate from any rule of this part to the extent required to meet that emergency.

(c) Each pilot in command who deviates from a rule under paragraph (b) of this section shall, upon the request of the Administrator, send a written report of that deviation to the Administrator.

A failure to declare him/herself to be incapable of safely completing the flight is a failure to meet the requirements of 91.3(a).
Commuter route pilots have been fighting to get regulations changed to require longer overnight periods, and especially to get the required rest period changed to "behind the door" so that the airlines can't include the commute time to/from the airport in the "rest" period. This would force the airlines to select hotels closer to the airport or else allow longer overnight layovers - either way the pilots would get adequate rest. See:
http://asrs.arc.nasa.gov/publications/directline/dl5_one.htm
And that would be a good change. In part, that change is supported by the number of times that the NTSB has made statments such as: We find the probable cause of the accident was pilot error. We believe that fatigue was likely a factor in the accident.
The NTSB does a great job with mechanical issues and with training issues, but they totally miss the boat when it comes to regulating adequate rest periods in the airline schedules.
No, you miss the boat on the relationship between the stakeholders. The NTSB has repeatedly commented on the need for better regulations and better studies of crew duty time requirements and fatigue as a factor in accidents and incidents. However, the NTSB CANNOT change regulations. They investigate accidents and make recommendations to the regulatory agencies. The FAA needs to be the one to change the regulations. The FAA has not done a particularly good job in addressing this topic, whereas they have done a better job in improving mechanical and training issues and have been more likely to follow up on NTSB recommendations in these areas. In part, that is the result of reduced pushback on the FAA in these areas from industry. After all, Boeing does NOT want to publicly say "We think that this mechanical factor the NTSB just determined as the cause of 400 fatalities isn't really an issue and the FAA should not issue an AD to make us correct it." On the other hand, it's much harder for the kind of public feedback loop that exists in the above statement to apply to crew fatigue issues. In any case, this has drifted well off the NANOG topic, and I would be happy to discuss the NTSB, FAA, etc. with you off-list if you wish.
To bring this back to NANOG territory, how many times have you or one of your network admins made a mistake when working with inadequate sleep - due to extra early start hours (needless 8 am meetings), or working long/late hours, or being called to work in the middle of the night?
Sure, this happens, but, it's not the only thing that happens.
Finally, having lived with a commercial aviation pilot for 5 years and having worked with network types for much longer, I can say that while there is some overlap between pilots and IT techs, there are also a LOT of people who go into computers (programming, network and system administration) who are totally unsuitable for the regimented environment required for commercial aviation - people who HATE following a lot of rules and regulations and fixed schedules. If you tried to impose FAA-type rules and regulations and airline schedules on an IT organization, you would have a revolt on your hands. Tread carefully when you consider emulating Aviation.
That's very true. I wasn't advocating that we should emulate aviation, so much as I was attempting to point out that if you want to reduce accidents/incidents, there is a proven model for doing so and that it comes at a cost. Today, we actually seem, and in my opinion, rightly so, to prefer to live with the existing situation. However, given that is the choice we are making, we should realize that is the choice we have made and accept the tradeoffs or make a different choice. Owen
Owen DeLong wrote:
On Nov 6, 2009, at 12:04 PM, JC Dill wrote:
Owen DeLong wrote:
We could learn a lot about this from Aviation. Nowhere in human history have more research, care, training, and discipline been applied to accident prevention, mitigation, and analysis than in aviation. A few examples:
NTSB investigations of EVERY US aircraft accident and published findings.
Ask any commercial pilot (and especially a commercial commuter flight pilot) what they think of NTSB investigations when the pilot had a "bad schedule" that doesn't allow enough time for adequate sleep. They will point out that lack of sleep can't be determined in an autopsy.
As a point of information, I _AM_ a commercial pilot.
There are commercial pilots who fly for a living, and there are those who have the certification but who don't fly for a living. Do you regularly fly for a commercial airline where your schedule is determined by the airline's needs, part 135 or part 121 rules, union rules, etc. with no ability to modify your work schedule to allow for adequate rest?
The NTSB routinely puts an accident down to "pilot error" even when pilots who regularly fly those routes and shifts are convinced that exhaustion (lack of sleep, long working days) was clearly involved. And for even worse news - the smaller the plane the more complicated it is to fly and the LESS rest the pilots receive in their overnight stays because commuter airlines are covered under part 135 while major airlines are covered under part 121. My ex flew turbo-prop planes for American Eagle (American Airlines commuter flights). It was common to have the pilot get off duty near 10 pm and be required to report back at 6 am. That's just 8 hours for rest. The "rest period" starts with a wait for a shuttle to the hotel, then the drive to the hotel (often 15 minutes or more from the airport) then check-in - it can add up to 30-45 minutes before the pilot is actually inside a hotel room. These overnight stays are in smaller towns like Santa Rosa, Fresno, Bakersfield, etc. Usually the pilots are put up at hotels that don't have a restaurant open this late, and no neighboring restaurants (even fast food) so the pilot doesn't get dinner. (There is no time for dinner in the flight schedule - they get at most 20 minutes of free time between arrival and take-off - enough time to get a bio-break and hit a vending machine but not enough time to actually get a meal.) Take a shower, get to bed at about 11:30. Set the alarm for 4:45 am and catch the shuttle back to the airport at 5:15 to get there before the 6:00 reporting time. In that "8 hour" rest period you get less than 6 hours of sleep - if you can fall asleep easily in a strange hotel.
Flying in such a state of exhaustion is, whether you like it or not, a form of pilot error.
There is no other effective option. Almost all the commuter airline schedules have these short overnights, and it's impossible for most pilots to avoid being scheduled to fly them. If you bid for these schedules you are expected to fly them. You can't just decide at 11:30 pm that you need more than 5 hours' rest and that you won't be getting up at 4:30 am to get to the airport by your 6:00 am report time, or decide when your alarm wakes you at 4:30 that you are too tired and are going to get another 2 hours' sleep, or decide at 7 pm that you are too exhausted from flying this schedule for 2 days and are not going to fly your last leg. If you do this *even once* you will get in very hot water with the company and if you do it repeatedly you will ultimately lose your job. They aren't going to change the schedule because it's "legal" under part 135.
A pilot who chooses to fly on such a schedule is making an error in judgment. Sure, there are all kinds of pressures and employment issues that need to be resolved to reduce and eliminate that pressure,
Right now there is no way to avoid putting your job in jeopardy by refusing to fly these unsafe schedules.
and, I support the idea of updating the crew duty time regulations with that in mind.
That does not change the fact that FAR 91.3 still applies:
The airlines don't care. They draw up these unsafe schedules and expect pilots to magically be capable of flying them safely. If there's an accident it goes down as pilot error, but if you try to claim exhaustion and refuse to fly citing 91.3 on a repeated basis you WILL be fired. Catch-22. Sounds a lot like working in IT with clueless management, doesn't it?
To bring this back to NANOG territory, how many times have you or one of your network admins made a mistake when working with inadequate sleep - due to extra early start hours (needless 8 am meetings), or working long/late hours, or being called to work in the middle of the night?
Sure, this happens, but, it's not the only thing that happens.
Finally, having lived with a commercial aviation pilot for 5 years and having worked with network types for much longer, I can say that while there is some overlap between pilots and IT techs, there are also a LOT of people who go into computers (programming, network and system administration) who are totally unsuitable for the regimented environment required for commercial aviation - people who HATE following a lot of rules and regulations and fixed schedules. If you tried to impose FAA-type rules and regulations and airline schedules on an IT organization, you would have a revolt on your hands. Tread carefully when you consider emulating aviation.
That's very true. I wasn't advocating that we should emulate aviation, so much as I was attempting to point out that if you want to reduce accidents/incidents, there is a proven model for doing so, and that it comes at a cost. Agreed. Today we actually seem to prefer, rightly so in my opinion, to live with the existing situation. However, given that is the choice we are making, we should realize that is the choice we have made and accept the tradeoffs, or make a different choice.
Fast (big/powerful), cheap, good - pick any two. :-) jc
Owen,
We could learn a lot about this from Aviation. Nowhere in human history have more research, care, training, and discipline been applied to accident prevention, mitigation, and analysis than in aviation. A few examples:
Others later in this thread duly noted the definite costs involved, which are clearly "worth it" given the particular application of these methods [snipped]. However, I assert this is warranted only because of the specific public trust that commercial aviation must be given. Additionally, this form of professional or industry "standard" isn't unique in the world; you can find (albeit small) parallels in most states' PE certification tracks and the like. In the case of the big-I internet, I assert we can't (yet) successfully argue that it's deserving of similar public trust. In short, I'm arguing that the big-I internet deserves special-pleading status in these sorts of "instrument -> record -> improve" proposals, and that we shouldn't apply similar concepts or regulation. (Robert B. then responded):
All, The real problem is the same human factors that cause most accidents in aviation. Look at the list below and replace the word Pilot with Network Engineer or Support Tech or Programmer or whatever... and think about all the problems where something didn't work out right. It's because someone circumvented the rules, processes, and cross-checks put in place to prevent the problem in the first place. Nothing can be made idiot-proof, because idiots are so creative.
I'd like to suggest we also swap "bug" for "software defect" or "hardware defect" - perhaps if operators started talking about problems like engineers, we'd get more global buy-in for a process-based solution. I certainly like the idea of improving the state of affairs where possible - especially in the operator->device direction (i.e., fat-fingering an ACL, prefix list, community list, etc.). When people make mistakes, it seems very wise to accurately record the entrance criteria, the results of their actions, and ways to avoid the mistake - and then share that with all operators (like at NANOG meetings!). The part I don't like is being ultimately responsible for, or having to "design around", a class of systemic problems which are entirely outside of an operator's sphere of control. What curve must we shift to get routers with hardware and software that's both a) fast b) reliable and c) cheap -- in the hopes that the only problems left to solve indeed are human ones? -Tk
Anton Kapela wrote:
What curve must we shift to get routers with hardware and software that's both a) fast b) reliable and c) cheap -- in the hopes that the only problems left to solve indeed are human ones?
Fast, Reliable, Cheap - pick any two. No, you can't have all three. The fastest (best) and most reliable *anything* can't be the cheapest one, because someone will quickly seize the market opportunity to make one that is lower quality (slower) or less reliable and sell it for a lower price. jc
Joe Greco wrote:
Yup. Related: "100% availability" is a marketing person's dream; it sounds good in theory but is unattainable in practice, and any claim of it is a reliable sign of non-100% reliability.
The most common way to gain "100% availability" is to avoid testing under load. This spares your equipment the whole slew of failures lurking in the less-used portions of your power systems - but it also guarantees you won't detect those failures until your Hour(s) Of Greatest Need.
Not testing under load is silly, IMHO. Does it work? Maybe. If it does something strange during testing it's attended, expected, and utility is available to fall back on. Starting your generator only means it'll turn over and idle, not that it'll provide power under load all the way to the racks. Some people may prefer a colo that never risks it and therefore never does more than idle the genset to claim 100% uptime. Others may prefer one that won't promise 100% everything but does load tests. I'd rather have a test go wrong while utility is available rather than a failed backup with no utility hoping the power comes back before the UPS dies or the room cooks itself. Both extremes are available to choose from if you do your research before picking a colo.
And even for those who follow best practices... You can inspect and maintain things until you're blue in the face. One day a contractor will drop a wrench into a PDU or UPS or whatever and spectacular things will happen. Or a battery develops a strange fault.
If you do live load testing, you'll lose now and then. It's best to simply assume no single circuit is 100% reliable. You should be able to get two circuits from separate power systems, and the combination of the two should very closely approximate 100% - but even then... it isn't.
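To put rough numbers on it (invented figures, and assuming the two feeds fail independently - which is exactly the assumption that shared infrastructure like a common ATS or a fire-system EPO quietly breaks):

# Back-of-the-envelope availability math for diverse power feeds.
# The 0.999 figures are illustrative, not anyone's SLA.

def parallel_availability(*feeds):
    """Availability of N feeds where any one keeps the load up."""
    unavail = 1.0
    for a in feeds:
        unavail *= (1.0 - a)
    return 1.0 - unavail

single = 0.999                            # one feed: ~8.8 hours down/year
dual = parallel_availability(0.999, 0.999)
hours = 24 * 365
print(f"one feed:  {single:.6f} ({(1 - single) * hours:.1f} h/yr down)")
print(f"two feeds: {dual:.6f} ({(1 - dual) * hours * 60:.1f} min/yr down)")

On paper that's six nines from two three-nine feeds; in practice the correlated failures are what get you.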
Separate power systems are overrated, especially if the fire department ends up being involved for some reason. (Re: the infamous gas leak story.) And of course with increased complexity comes increased risk of failure and longer downtime to diagnose and repair. There is no perfect balance. ~Seth
On Wed, 04 Nov 2009 12:26:15 CST, Joe Greco said:
With power:
- N+1 is usually better than N
- Best to assume full load when doing math
- Things will go wrong, predict common failures
And uncommon ones. :) So as part of a major compute-cluster install, we upgraded our UPS and diesel generator one weekend, and breathed a collective sigh of relief that we were now safe from power outages. We mostly dodged a bullet: we *did* have some scary moments when we discovered that (a) of the 400 or so disks on our Sun E10K, about 10 didn't spin up again, and (b) several of the boot disks on said box weren't mirrored. Fortunately, none of the 10 failures were on a non-mirrored disk. By Tuesday, all the non-mirrored boot disks were in fact mirrored.

That Friday, a bozo contractor relocating a doorway managed to set off the Halon. Only lost two disks on the E10K. Guess which two? ;)

And a month later, we discovered that the nice shiny new automatic cutover switch was wired in backwards, necessitating another power outage to re-wire it correctly. So much for safe from power outages... :)
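For what it's worth, the "assume full load when doing math" part is simple enough to script. A minimal sketch, with invented module ratings:

import math

def modules_needed(load_kw, module_kw, redundancy=1):
    """Smallest module count that still carries the full load with
    `redundancy` modules failed or pulled for maintenance."""
    return math.ceil(load_kw / module_kw) + redundancy

# e.g. 800 kW of critical load on 275 kW UPS modules:
n = modules_needed(800, 275)        # N+1 -> 4 modules
assert (n - 1) * 275 >= 800         # survives losing any one module
print(f"{n} modules; {(n - 1) * 275} kW still available with one out")

The trap is doing that math against today's measured load instead of the rated full load, and discovering the difference during a failover.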
Jeffrey Lyon wrote:
No date on that 'press release' but the way back machine helps put it somewhere in 2002. A lot of good this "Alameda" sized generator has done recently... http://web.archive.org/web/*/http://www.he.net/releases/release18.html Cheers, Stef
Jeffrey Lyon wrote:
No date on that 'press release' but the way back machine helps put it somewhere in 2002. A lot of good this "Alameda" sized generator has done recently...
http://web.archive.org/web/*/http://www.he.net/releases/release18.html
2MW isn't super huge or anything. I would expect that, given the size I have been led to believe HE is, they've got a lot more than that now.

My memory is that Alameda isn't huge, but it isn't small either. I'm not sure .. ah, here:

http://www.reuters.com/article/pressRelease/idUS179594+03-Apr-2009+BW2009040...

peak 70MW

I'm not sure what the basis for the claim is that a 2MW generator is "large enough to power the entire city of Alameda" ... 2MW gensets are common enough in this business and it's possible to burn through 2MW in a few hundred racks. It isn't *that* much power.

A more conventional comparison might be to something like a hospital; one of our local hospitals installed a 1.25MW generator which, IIRC, powers all critical circuits.

http://hhenergyservices.com/electrical/photos.php?category_id=2845&subcategory_id=5027&id=196&number=7

Sometimes it is easier to picture things that way.

... JG
--
Joe Greco - sol.net Network Services - Milwaukee, WI - http://www.sol.net
"We call it the 'one bite at the apple' rule. Give me one chance [and] then I won't contact you again." - Direct Marketing Ass'n position on e-mail spam(CNN)
With 24 million small businesses in the US alone, that's way too many apples.
Joe Greco wrote:
Jeffrey Lyon wrote:
No date on that 'press release' but the way back machine helps put it somewhere in 2002. A lot of good this "Alameda" sized generator has done recently...
http://web.archive.org/web/*/http://www.he.net/releases/release18.html
2MW isn't super huge or anything. I would expect that, given the size I have been led to believe HE is, they've got a lot more than that now.
My memory is that Alameda isn't huge, but it isn't small either. I'm not sure .. ah, here
http://www.reuters.com/article/pressRelease/idUS179594+03-Apr-2009+BW2009040...
peak 70MW
I'm not sure what the basis for the claim is that a 2MW generator is "large enough to power the entire city of Alameda" ... 2MW gensets are common enough in this business and it's possible to burn through 2MW in a few hundred racks. It isn't *that* much power.
A more conventional comparison might be to something like a hospital; one of our local hospitals installed a 1.25MW generator which, IIRC, powers all critical circuits.
Sometimes it is easier to picture things that way.
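Running that arithmetic with invented per-rack numbers shows how quickly 2MW disappears:

genset_kw = 2000          # the 2MW generator from the press release
rack_kw = 5.0             # assumed per-rack draw for reasonably dense gear
overhead = 1.6            # assumed multiplier for cooling and distribution

racks = genset_kw / (rack_kw * overhead)
print(f"~{racks:.0f} racks")   # ~250 racks before the genset is spoken for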
Regardless of generator sizing issues or disparities, if the ATS fails, then no amount of grid or generator power will keep the cabinets juiced up. Since this is the second time in recent history that this building has experienced a short power outage caused by ATS flakiness, perhaps keeping a small UPS in the cabinet isn't such a bad idea? Even if the distribution switches/routers lose power, at least the servers wouldn't have to go through fscks and DB integrity checks due to unplanned power loss, and the recovery time would be significantly faster. Hell, for a 5-minute power outage, some of my services were down for 20 minutes. I'll happily take a 75% reduction in downtime for the cost of a UPS, though clearly redundancy across more reliable datacenters is a better solution.
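A minimal sketch of that downtime arithmetic (the 10-minute battery runtime and 15-minute recovery tail below are assumptions, not measurements):

def downtime_min(outage_min, servers_on_ups, recovery_min):
    """Servers that ride out the outage on battery skip the fsck/DB
    recovery tail; they're just unreachable until upstream power returns."""
    if servers_on_ups:
        return outage_min
    return outage_min + recovery_min

print(downtime_min(5, False, 15))   # 20 min - what actually happened
print(downtime_min(5, True, 15))    # 5 min - the 75% reduction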
On Wed, Nov 04, 2009 at 07:09:48AM +0000, Tico wrote:
Since this is the second time in recent history that this building has experienced a short power outage caused by ATS flakiness, perhaps keeping a small UPS in the cabinet isn't such a bad idea?
It sounds like a great idea....until one of those small UPSes smokes out, triggering the fire suppression (or at least preaction), possibly also causing the power to be cut to the floor. The customer with the small UPS that smoked out generally does not like receiving the bill for everyone else's equipment cleaning, too. --msa
On Tue, Nov 3, 2009 at 11:09 PM, Tico <tico-nanog@raapid.net> wrote:
Since this is the second time in recent history that this building has experienced a short power outage caused by ATS flakiness, perhaps keeping a small UPS in the cabinet isn't such a bad idea? Even if
Although this time it was "short", the outage 5 weeks ago was about 90 minutes. Scott
Regardless of generator sizing issues or disparities, if the ATS fails, then no amount of grid or generator power will keep the cabinets juiced up.
Sure. Having no direct knowledge of the HE DC in question, I was merely commenting on the issue I replied to.
Since this is the second time in recent history that this building has experienced a short power outage caused by ATS flakiness,
Has this been verified?
perhaps keeping a small UPS in the cabinet isn't such a bad idea? Even if the distribution switches/routers lose power, at least the servers wouldn't have to go through fscks and DB integrity checks due to unplanned power loss, and the recovery time would be significantly faster.
Small UPS's have their own set of ugly failure modes. For example, we find that the APC Smart-UPS 1400's have a tendency to cook their batteries; if you don't have monitoring of some sort, you may not find out that your batteries are cooked until the UPS decides it is hopeless and shuts itself off. In the meantime, the lingering sulfur smell may panic someone... or cause a falsing of the fire system... Colos frequently forbid the use of small UPS's for a variety of reasons.
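If you must run them, at least watch them. A minimal poller - hypothetical hostname, and the OID is upsBatteryStatus from the standard UPS-MIB (RFC 1628), which not every unit implements, so verify against your gear:

import subprocess, sys

UPS_HOST = "ups1.example.net"            # hypothetical management address
OID = "1.3.6.1.2.1.33.1.2.1.0"           # upsBatteryStatus.0
# 1=unknown 2=batteryNormal 3=batteryLow 4=batteryDepleted

def battery_ok(host):
    out = subprocess.run(
        ["snmpget", "-v2c", "-c", "public", "-Oqv", host, OID],
        capture_output=True, text=True, timeout=10)
    # With MIBs installed snmpget prints the label, otherwise the number.
    return out.returncode == 0 and out.stdout.strip() in ("2", "batteryNormal")

if not battery_ok(UPS_HOST):
    sys.exit(f"{UPS_HOST}: battery not normal - go smell the rack")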
Hell, for a 5-minute power outage, some of my services were down for 20 minutes. I'll happily take a 75% reduction in downtime for the cost of a UPS, though clearly redundancy across more reliable datacenters is a better solution.
So is redundancy across power systems within the colo, but only for well- designed colos. Stories omitted. ... JG -- Joe Greco - sol.net Network Services - Milwaukee, WI - http://www.sol.net "We call it the 'one bite at the apple' rule. Give me one chance [and] then I won't contact you again." - Direct Marketing Ass'n position on e-mail spam(CNN) With 24 million small businesses in the US alone, that's way too many apples.
Colos frequently forbid the use of small UPS's for a variety of reasons.
In my experience they always need to be connected to the EPO switch, which poses its own risks. Plus, try to find a UPS with that feature at a reasonable price. Which leads me to this question: What questions do you ask any potential colocation provider to determine if they are built out to your needs? -Dave
Colos frequently forbid the use of small UPS's for a variety of reasons.
In my experience they always need to be connected to the EPO switch, which poses its own risks. Plus, try to find a UPS with that feature at a reasonable price.
APC says it's available on the SUA2200RM2U and SUA3000RM2U, and lists it as optional for the APC SUA1500RM2U. I would consider all of these to be reasonably priced.
Which leads me to this question: What questions do you ask any potential colocation provider to determine if they are built out to your needs?
See if they'll guarantee diverse power as part of the contract. :-) It's disappointing to find a colo that feeds you your primary and redundant power off the same UPS. ... JG -- Joe Greco - sol.net Network Services - Milwaukee, WI - http://www.sol.net "We call it the 'one bite at the apple' rule. Give me one chance [and] then I won't contact you again." - Direct Marketing Ass'n position on e-mail spam(CNN) With 24 million small businesses in the US alone, that's way too many apples.
On Wed, Nov 4, 2009 at 7:09 AM, Tico <tico-nanog@raapid.net> wrote:
Sometimes it is easier to picture things that way.
Regardless of generator sizing issues or disparities, if the ATS fails, then no amount of grid or generator power will keep the cabinets juiced up.
Since this is the second time in recent history that this building has experienced a short power outage caused by ATS flakiness, perhaps keeping a small UPS in the cabinet isn't such a bad idea? Even if the distribution switches/routers lose power, at least the servers wouldn't have to go through fscks and DB integrity checks due to unplanned power loss, and the recovery time would be significantly faster.
Hell, for a 5-minute power outage, some of my services were down for 20 minutes. I'll happily take a 75% reduction in downtime for the cost of a UPS, though clearly redundancy across more reliable datacenters is a better solution.
Maybe some of us [[soon-to-be-]ex-]customers of Hurricane can bake them a cake and beg for UPSes. Or reliable power. Or for someone to actually answer the voicemails much less phone calls within even a few hours of an outage. Or for there to be at the very least a status page notifying customers that they are, in fact, screwed, and for how long, and that it's useless to continue trying to get through at such time. Who's with me?
Regardless of generator sizing issues or disparities, if the ATS fails, then no amount of grid or generator power will keep the cabinets juiced up.
That is patently false. Assume N+1 UPS, with each UPS module having its own ATS fed from a utility and emergency bus. Then you can even individually maintain each UPS module and ATS. Bonus and score. And if it's a really good place, you have two of the above (2(N+1)) and each of your power cords goes to one of them.

"Question everything, assume nothing, discuss all, and resolve quickly."
-- Alex Rubenstein, AR97, K2AHR, alex@nac.net, latency, Al Reuben --
-- Net Access Corporation, 800-NET-ME-36, http://www.nac.net --
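P.S. Rough invented numbers make the point concrete. With an ATS and UPS in series on each path, and two independent A/B paths feeding dual-corded gear:

ats = 0.9995            # assumed availability of one ATS
ups = 0.9998            # assumed availability of one UPS string

path = ats * ups                      # one serial A (or B) feed
dual = 1 - (1 - path) * (1 - path)    # load drops only if both paths fail

print(f"single path: {path:.6f}")
print(f"dual corded: {dual:.9f}")

A single ATS failure takes out one path, not the load - provided the gear really is corded to both sides.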
On 11/4/09 11:44 AM, "Alex Rubenstein" <alex@corp.nac.net> wrote:
Regardless of generator sizing issues or disparities, if the ATS fails, then no amount of grid or generator power will keep the cabinets juiced up.
That is patently false.
At its root it's true - if an ATS fails, the power between the source and destination will be interrupted.
Assume N+1 UPS, with each UPS module having its own ATS fed from a utility and emergency bus. Then you can even individually maintain each UPS module and ATS. Bonus and score.
And if it's a really good place, you have two of the above (2(n+1)) and each of your power cords goes to one each.
Which doesn't address the failure of one piece of equipment. Of course, if you're dual-corded from your server through fully redundant switch gear to multiple, diverse vaults, then a single ATS failure shouldn't affect you. Regards, Mike
On Wednesday, November 4, 2009 10:00am, "dan syn" <dan.syn.ack@gmail.com> said:
Maybe some of us [[soon-to-be-]ex-]customers of Hurricane can bake them a cake and beg for UPSes. Or reliable power. Or for someone to actually answer the voicemails much less phone calls within even a few hours of an outage. Or for there to be at the very least a status page notifying customers that they are, in fact, screwed, and for how long, and that it's useless to continue trying to get through at such time.
Who's with me?
Yeah, after years of dealing with them, all I can say is: best of luck. While we still have some legacy systems in Fremont #1, we moved 98% of our operations out to other data centers back in 2005 because of the same lack of communication, even about scheduled events (which to this day I don't believe are posted anywhere). We were rapidly expanding at the time and were given the brush-off, so we moved. That was the only way to get good, timely, and detailed information about things taking place. Flash forward almost 5 years, and it seems their flagship Fremont #2, which was just being announced when we started moving, is still the same song, different year... -- Nevin Lyne -- Founder / Director of Technology -- EngineHosting.com
On Tuesday, November 3, 2009 10:03pm, "Joe Greco" <jgreco@ns.sol.net> said:
Jeffrey Lyon wrote:
No date on that 'press release' but the way back machine helps put it somewhere in 2002. A lot of good this "Alameda" sized generator has done recently...
http://web.archive.org/web/*/http://www.he.net/releases/release18.html
2MW isn't super huge or anything. I would expect that, given the size I have been led to believe HE is, they've got a lot more than that now.
My memory is that Alameda isn't huge, but it isn't small either. I'm not sure .. ah, here
The 2002 press release is talking about the Fremont 1 facility, not the newer Fremont 2 facility. Fremont 1 has a fixed power availability to each cabinet of just a single 15A circuit. You cannot modify or change that, and if you need more power your option is to add another cabinet. You are not allowed to route power cords between cabinets, so you are forever running a single circuit at 80% of your 15A circuit max. The data center was built in a different time. -- Nevin Lyne -- Founder / Director of Technology -- EngineHosting.com
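For anyone doing the budget math on that constraint, a quick sketch with typical US assumptions (120V branch circuit, the usual 80% continuous-load derating):

breaker_amps = 15
volts = 120
derate = 0.80

usable_watts = breaker_amps * volts * derate
print(f"{usable_watts:.0f} W usable per cabinet")      # 1440 W
print(f"~{int(usable_watts // 200)} x 200 W servers")  # ~7 modest 1U hosts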
nevin@enginehosting.com wrote:
The 2002 press release is talking about the Fremont 1 facility, not the newer Fremont 2 facility. Fremont 1 has a fixed power availability to each cabinet of just a single 15A circuit. You cannot modify or change that, and if you need more power your option is to add another cabinet. You are not allowed to route power cords between cabinets, so you are forever running a single circuit at 80% of your 15A circuit max. The data center was built in a different time.
The same is true of racks in most of the suites in the more recent Fremont 2 facility. Cheers, Stef
On Wed, Nov 4, 2009 at 6:38 PM, <nevin@enginehosting.com> wrote:
The 2002 press release is talking about the Fremont 1 facility, not the newer Fremont 2 facility. Fremont 1 has a fixed power availability to each cabinet of just a single 15A circuit. You cannot modify or change that, and if you need more power your option is to add another cabinet. You are not allowed to route power cords between cabinets, so you are forever running a single circuit at 80% of your 15A circuit max. The data center was built in a different time.
A different time, but obviously not that much different... Fremont 2 is still limited to either a single 15A or a single 20A circuit per rack. They are rebuilding one of the Fremont 2 wings and turning it into a single area rather than the existing suites, so it'll be interesting to see if things are done differently there. Scott
participants (37)
- Alex Rubenstein
- Anton Kapela
- Ben Carleton
- Bryan King
- Cary Wiedemann
- Charles.Jouglard@cox.com
- dan syn
- David B. Peterson
- David Stearns
- JC Dill
- Jeffrey Lyon
- Joe Greco
- Jonathan Bishop
- Lyndon Nerenberg (VE6BBM/VE7TFX)
- Majdi S. Abbas
- Matthew Petach
- Max Clark
- Michael Holstein
- Michael J McCafferty
- Michael K. Smith
- Michael Peddemors
- Michael Schuler
- nevin@enginehosting.com
- Owen DeLong
- Patrick W. Gilmore
- Paul Bosworth
- Raphael Carrier
- Robert Boyle
- Robert Mathews (OSIA)
- rodrick brown
- Scott Howard
- Sean Head
- Seth Mattinen
- Stef Walter
- Tico
- Valdis.Kletnieks@vt.edu
- William Pitcock