Data Center testing
Does anyone know of any data centers that do failure testing of their networking equipment regularly? I mean to verify that everything fails over properly after changes have been made over time. Are there any best practice guides for doing this? Thanks, Dan
"-----Original Message----- From: Dan Snyder [mailto:sliplever@gmail.com] Sent: Mon 8/24/2009 2:00 PM To: NANOG list Subject: [SPAM-HEADER] - Data Center testing - Email has different SMTP TO: and MIME TO: fields in the email addresses Does any one know of any data centers that do failure testing of their networking equipment regularly? I mean to verify that everything fails over properly after changes have been made over time. Is there any best practice guides for doing this? Thanks, Dan Dan" It is quite surprising how often data centres lose both primary and backup power.
I know Peer1 in Vancouver regularly send out notifications of "non-impacting" generator load testing, like monthly. Also, InterXion in Dublin, Ireland have occasionally sent me notifications that there was a power outage of less than a minute but their backup successfully took the load. I only remember one complete outage in Peer1 a few years ago... Never seen any outage in InterXion Dublin. Also, I don't ever remember any power failure at AiNet (Deepak will probably elaborate). 2009/8/24 Dan Snyder <sliplever@gmail.com>:
Does anyone know of any data centers that do failure testing of their networking equipment regularly? I mean to verify that everything fails over properly after changes have been made over time. Are there any best practice guides for doing this?
Thanks, Dan
We have done power tests before and had no problem. I guess I am looking for someone who does testing of the network equipment outside of just power tests. We had an outage due to a configuration mistake that became apparent when a switch failed. However, it didn't cause a problem when we did a power test for the whole data center. -Dan
Dan Snyder wrote:
We have done power tests before and had no problem. I guess I am looking for someone who does testing of the network equipment outside of just power tests. We had an outage due to a configuration mistake that became apparent when a switch failed. However, it didn't cause a problem when we did a power test for the whole data center.
The plus side of failure testing is that it can be controlled. The downside to failure testing is that you can induce a failure. Maintenance windows are cool, but some people really dislike failures of any type, which limits how often you can test. I personally try for once a year. However, a lot can go wrong in a year. Jack
Most provider-type datacenters I've worked with get a lot of flak from customers when they announce they're doing network failover testing, because there's always at least some chance of disruption. It's the exception to find a provider that does it, I think (or maybe just one that admits it when they're doing it). Power tests are a different thing.

As for testing your own equipment, there are a couple of ways to do that: regular failover tests (quarterly, or more likely at 6-month intervals), and/or routing traffic so that you have some of your traffic on all paths (i.e. internal traffic on one path, external traffic on another). The latter doesn't necessarily tell you that your failover will work perfectly, only that all your gear in the second path is functioning. I prefer doing both. When doing the failover tests, no matter how good your setup is, there's always a chance of taking a hit, so I always do this kind of work during a maintenance window, not too close to quarter end, etc. If you have your equipment set up correctly, of course, it goes like butter and is a total non-event. For test procedure, I usually pull cables. I'll go all the way to line cards or power cables if I really want to test, though that can be hard on equipment. E
On Mon, Aug 24, 2009 at 09:38:38AM -0400, Dan Snyder wrote:
We have done power tests before and had no problem. I guess I am looking for someone who does testing of the network equipment outside of just power tests. We had an outage due to a configuration mistake that became apparent when a switch failed. However, it didn't cause a problem when we did a power test for the whole data center.
Dan, With all due respect, if there are config changes being made to your devices that aren't authorized or in accordance with your standards (you *do* have config standards, right?) then you don't have a testing problem, you have a data integrity problem. Periodically inducing failures to catch them is sorta like using your smoke detector as an oven timer.

There are several tools that can help in this area; a good free one is rancid [1], which logs in to your routers and collects copies of configs and other info, all of which gets stored in a central repository. By default, you will be notified via email of any changes. An even better approach than scanning the hourly config diff emails is to develop scripts that compare the *actual* state of the network with the *desired* state and alert you if the two are not in sync. Obviously this is more work because you have to have some way of describing the desired state of the network in machine-parsable format, but the benefit is that you know in pseudo-realtime when something is wrong, as opposed to finding out the next time a device fails. Rancid diffs + TACACS logs will tell you who made the changes, and with that info you can get at the root of the problem.

Having said that, every planned maintenance activity is an opportunity to run through at least some failure cases. If one of your providers is going to take down a longhaul circuit, you can observe how traffic re-routes and verify that your metrics and/or TE are doing what you expect. Any time you need to load new code on a device you can test that things fail over appropriately. Of course, you have to be willing to just shut the device down without draining it first, but that's between you and your customers. Link and/or device failures will generate routing events that could be used to test convergence times across your network, etc.

The key is to be prepared. The more instrumentation you have in place prior to the test, the better you will be able to analyze the impact of the failure. An experienced operator can often tell right away when looking at a bunch of MRTG graphs that "something doesn't look right", but that doesn't tell you *what* is wrong. There are tools (free and commercial) that can help here, too. Have a central syslog server and some kind of log reduction tool in place. Have beacons/probes deployed, in both the control and data planes. If you want to record, analyze, and even replay routing system events, you might want to take a look at the Route Explorer product from Packet Design [2].

You said "switch failure" above, so I'm guessing that this doesn't apply to you, but there are also good network simulation packages out there. Cariden [3] and WANDL [4] can build models of your network based on actual router configs and let you simulate the impact of various scenarios, including device/link failures. However, these tools are more appropriate for design and planning than for catching configuration mistakes, so they may not be what you're looking for in this case. --Jeff [1] http://www.shrubbery.net/rancid/ [2] http://www.packetdesign.com/products/rex.htm [3] http://www.cariden.com/ [4] http://www.wandl.com/html/index.php
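As a rough illustration of the actual-versus-desired comparison Jeff describes, here is a minimal sketch in Python. It assumes configs are kept as plain text files (e.g. the copies rancid collects) next to a hand-maintained "golden" copy; the directory paths and layout are hypothetical.

#!/usr/bin/env python
# Minimal sketch: diff the running configs (e.g. as collected by rancid)
# against a "golden" (desired) copy and report any drift. Paths are hypothetical.
import difflib
import pathlib

ACTUAL_DIR = pathlib.Path("/var/rancid/configs")   # what was pulled from the devices
DESIRED_DIR = pathlib.Path("/var/netops/golden")   # what we intend to be there

def drift(device):
    actual = (ACTUAL_DIR / device).read_text().splitlines()
    desired = (DESIRED_DIR / device).read_text().splitlines()
    return list(difflib.unified_diff(desired, actual,
                                     fromfile=device + " (desired)",
                                     tofile=device + " (actual)",
                                     lineterm=""))

if __name__ == "__main__":
    for path in sorted(DESIRED_DIR.iterdir()):
        delta = drift(path.name)
        if delta:
            print("*** %s has drifted from the desired state:" % path.name)
            print("\n".join(delta))

Run from cron, something along these lines gives the pseudo-realtime alerting Jeff mentions, rather than waiting on the hourly diff mail.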
There's more to data integrity in a data center (well, anything powered, that is) than network configurations. There's the loading of individual power outlets, UPS loading, UPS battery replacement cycles, loading of circuits, backup lighting, etc. And the only way to know if something is really working like it's designed is to test it. That's why we have financial auditors, military exercises, fire drills, etc. So while your analogy emphasizes the importance of having good processes in place to catch the problems up front, it doesn't eliminate throwing the switch. Frank
On Tue, Aug 25, 2009 at 10:45:07PM -0500, Frank Bulk - iName.com wrote:
There's more to data integrity in a data center (well, anything powered, that is) than network configurations.
Understood and agreed. My point was that induced failure testing isn't the right way to catch incorrect or unauthorized config changes, which is what I understood the original poster to have said was his problem. My apologies if I misunderstood what he was asking.
So while your analogy emphasizes the importance of having good processes in place to catch the problems up front, it doesn't eliminate throwing the switch.
Yup, and it's precisely why I suggested using planned maintenance events as one way of doing at least limited failure testing. --Jeff
On Tue, Aug 25, 2009 at 7:53 AM, Jeff Aitken<jaitken@aitken.com> wrote:
[..] Periodically inducing failures to catch [...] them is sorta like using your smoke detector as an oven timer. [..] machine-parsable format, but the benefit is that you know in pseudo-realtime when something is wrong, as opposed to finding out the next time a device fails.
Config checking can't say much about silent hardware failures. Unanticipated problems are likely to arise in failover systems, especially complicated ones. A failover system that has not been periodically verified may not work as designed.

Simulations, config review, and change controls are not substitutes for testing; they address overlapping but different problems. Testing detects unanticipated error; config review is a preventive measure that helps avoid and correct apparent configuration issues. Config checking (of both software and hardware choices) also helps to keep out unnecessary complexity. A human still has to write the script and review its output -- eventually an operator error will be an accidental omission from both the current state and the "desired" state, and there is a chance that an erroneous entry escapes detection.

There can be other types of errors: possibly there is a damaged patch cable, dying port, failing power supply, or other hardware on the warm spare that has silently degraded, and its poor condition won't be detected until it actually tries to take a heavy workload, blows a fuse, eats a transceiver, and everything just falls apart. Perhaps you upgraded a hardware module or software image X months ago, to fix bug Y on the secondary unit, and the upgrade caused completely unanticipated side effect Z. -- -Mysid
James Hess wrote:
Config checking can't say much about silent hardware failures. Unanticipated problems are likely to arise in failover systems, especially complicated ones. A failover system that has not been periodically verified may not work as designed.
I've seen 3-4 failover failures in the last year alone on the SONET transport gear. In almost every case, the backup cards were dead when the primary either died or induced errors, causing the telco to switch to the backup card. I have no doubt that they haven't been testing. While it didn't affect most of my network, I have a few customers that aren't multihomed, and it wiped them out in the middle of the day for up to 3 hours.
There can be other types of errors: Possibly there is a damaged patch cable, dying port, failing power supply, or other hardware on the warm spare that has silently degraded and its poor condition won't be detected (until it actually tries to take a heavy workload, blows a fuse, eats a transceiver, and everything just falls apart).
Lots of weird things to test for. I remember once rebooting a c5500 that had been cruising along for 3 years and the bootup diag detected 1/2 a linecard as bad, which had been running decently up until the reload. Over the years, I think I've seen or detected everything you mentioned either during routine testing or in production "oh crap" events. Jack
On Tue, Aug 25, 2009 at 12:53:10PM +0000, Jeff Aitken wrote:
you have to have some way of describing the desired state of the network in machine-parsable format
Any suggested tools for describing the desired state of the network? NDL, the only option I'm familiar with, is just a brute-force approach to describing routers in XML. This is hardly better than a router-config, and the visualizations break down on any graph with more than a few nodes or edges. I'd need thousands to describe customer routers. Or do we just give up on describing all of those customer-facing interfaces, and only manage descriptions for the service-provider part of the network? This seems to be what people actually do with network descriptions (oversimplify), and that doesn't seem like much of a description to me. Is there a practical middle-ground between dismissing a multitude of relevant customer configuration and the data overload created by merely replicating the entire network config in a new language? Ross -- Ross Vandegrift ross@kallisti.us "If the fight gets hot, the songs get hotter. If the going gets tough, the songs get tougher." --Woody Guthrie
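One possible middle ground, sketched below purely as an illustration (the schema, names, and config lines are invented, not an existing tool or standard): keep a one-line inventory entry per customer port plus a small set of service profiles, and let a script expand that intent into the per-interface configuration you expect to see, rather than hand-describing every interface in XML.

# Hypothetical sketch: intent is recorded once per service profile, and a script
# expands it into per-interface expectations for the actual-vs-desired comparison.
from dataclasses import dataclass

@dataclass
class CustomerPort:
    device: str
    interface: str
    profile: str            # key into PROFILES below

# One entry per service offering, not one per customer port.
PROFILES = {
    "biz-10M":  {"bandwidth_mbps": 10,  "policer": "POLICE-10M"},
    "biz-100M": {"bandwidth_mbps": 100, "policer": "POLICE-100M"},
}

def expected_lines(port):
    """Expand the intent into the config lines we expect on the device."""
    p = PROFILES[port.profile]
    return [
        "interface %s" % port.interface,
        " description customer port (%d Mb/s)" % p["bandwidth_mbps"],
        " service-policy input %s" % p["policer"],
    ]

# The only hand-maintained, per-customer data is one inventory line like this:
print(expected_lines(CustomerPort("edge1.example.net", "GigabitEthernet1/0/12", "biz-10M")))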
I would hope that the data center engineers built and ran a suite of tests to find failure points before the network infrastructure was put into production. That said, changes are made constantly to the infrastructure and it can become very difficult very quickly to know if the failovers are still going to work. This is one place where the power and network in a datacenter diverge. The power systems may take on additional load over the course of the life of the facility, but the transfer switches and generators do not get many changes made to them. Also, network infrastructure tests are not going to be zero impact if there is a config problem.

Generator tests are much easier. You can start up the generator and do a load test. You can also load test the UPS systems. Then you can initiate your failover. Network tests are not going to be zero impact even if there isn't a problem. Let's say you wanted to power fail an edge router participating in BGP; it can take 30 seconds for that router's routes to be withdrawn from the BGP tables of the world.

The other problem is that network failures always seem to come from "unexpected" issues. I always love it when I get an outage report from my ISPs or datacenter and they say an "unexpected issue" or "unforeseen issue" caused the problem. Dylan
The idea of regular testing is essentially to detect failures on your time schedule rather than entropy's (or Murphy's). There can be flaws in your testing methodology too; this is why generic load bank tests and network load simulators rarely tell the whole story. Customers are rightfully displeased with any testing that affects their normal peace of mind, and doubly so when it affects actual operational effectiveness. However, since no system can operate indefinitely without maintenance, failover testing and the like, the question of taking a window is not negotiable. The only thing that is negotiable (somewhat) is when, and only in one direction (ahead of the item failing on its own).

So, taking this concept to networks: it's not negotiable whether a link or a device will fail; the question is only how long you are going to forward bits along the dead path before rerouting and how long that rerouting will take. SONET says about 50ms, standard BGP about 30-300 seconds. BFD and other things may improve these dramatically in your setup. You build your network around your business case and vice versa. Clearly, most of the known universe has decided that BGP time is "good enough" for the Internet as a whole right now. Most are aware of the costs in terms of overall jitter, CPU and stability if we reduce those times too far. It's intellectually dishonest to talk about never losing a packet or never forwarding along a dead path for even a nanosecond when the state-of-the-art says something very different indeed. Deepak Jain AiNET
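To put rough numbers on that trade-off, here is a trivial back-of-the-envelope sketch; the link speed and failover times are arbitrary examples in the range discussed above, not measurements.

# Rough illustration: traffic forwarded toward a dead path while waiting for
# failover, for a few example detection/reroute times. Values are examples only.
LINK_GBPS = 10.0          # example link speed

FAILOVER_TIMES = {
    "SONET APS (~50 ms)":             0.050,
    "BFD-triggered reroute (~1 s)":   1.0,
    "BGP hold-timer expiry (~180 s)": 180.0,
}

for name, seconds in FAILOVER_TIMES.items():
    lost_gbits = LINK_GBPS * seconds
    print("%-33s ~%.1f Gbit sent toward the dead path" % (name, lost_gbits))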
On Wed, Aug 26, 2009 at 03:32:42PM +0000, Dylan Ebner wrote:
I always love it when I get an outage report from my ISPs or datacenter and they say an "unexpected issue" or "unforeseen issue" caused the problem.
Well, at least it's better than "yeah, we knew about it, but didn't think it was worth worrying about". - Matt
On Aug 24, 2009, at 9:38 AM, Dan Snyder wrote:
We have done power tests before and had no problem. I guess I am looking for someone who does testing of the network equipment outside of just power tests. We had an outage due to a configuration mistake that became apparent when a switch failed.
So, one of the better ways to make sure that your failover system is working when you need it is just to do away with the concept of a failover system and make your "failover" system be part of your "primary" system. This means that your failover system is always passing traffic and you know that it is alive and well -- it also helps mitigate the pain when a device fails (you are sharing the load over both systems, so only half as much traffic gets disrupted). Scheduled maintenance is also simpler and less stressful, as you already know that your other path is alive and well.

Your design and use case dictate how exactly you implement this, but in general it involves things like tuning your IGP so you are using all your links, staggering VLANs if you rely on them, multiple VRRP groups per subnet, etc. This does require a tiny bit more planning during the design phase, and also requires that you check every now and then to make sure that you are actually using both devices (and didn't, for example, shift traffic to one device and then forget to shift it back :-)). It also requires that you keep capacity issues in mind -- in a primary-and-failover scenario you might be able to run devices fairly close to capacity, but if you are sharing the load you need to keep things under 50% (so when you *do* have a failure the remaining device can handle the full load) -- it's important to make this clear to the finance folks before going down this path :-) W
-- "Does Emacs have the Buddha nature? Why not? It has bloody well everything else!"
Thanks for the kind words, Ken. Power failure testing and network testing are very different disciplines. We operate from the point of view that if a failure occurs because we have scheduled testing, it is far better, since we have the resources on-site to address it (as opposed to an unplanned event during a hurricane). Not everyone has this philosophy.

This is one of the reasons we do monthly or bimonthly full live-load transfer tests on power at every facility we own and control during the morning hours (~10:00am local time on a weekday, run on gensets for up to two hours). Of course there is sufficient staff and contingency planning on-site to handle almost anything that comes up. The goal is to have a measurable "good" outcome at our highest reasonable load levels [temperature, data load, etc]. We don't hesitate to show our customers and auditors our testing and maintenance logs, go over our procedures, etc. They can even watch events if they want (we provide the ear protection). I don't think any facility of any significant size can operate differently and do it well. This is NOT advisable to folks who do not do proper preventative maintenance on their transfer busways, PDUs, switches, batteries, transformers and of course generators. The goal is to identify questionable relays, switches, breakers and other items that may fail in an actual emergency.

On the network side, during scheduled maintenance we do live failovers -- sometimes as dramatic as pulling the cable without preemptively removing traffic. Part of *our* procedures is to make sure it reroutes and heals the way it is supposed to before the work actually starts. Often network and topology changes happen over time and no one has had a chance to actually test that all the "glue" works right. Regular planned maintenance (if you have a fast-reroute capability in your network) is a very good way to handle it. For sensitive trunk links and non-invasive maintenance, it is nice to softly remove traffic via local pref or whatever in advance of the maintenance to minimize jitter during a major event.

As part of your plan, be prepared for things like connectors (or cables) breaking and have a plan for what you do if that occurs. Have a plan or a rain date if a connector takes a long time to get out or the blade it sits in gets damaged. This stuff looks pretty while it's running and you don't want something that has been friction-frozen to ruin your window. All of this works swimmingly until you find a vendor (X) bug. :) Not for the faint-of-heart. Anyone who has more specific questions, I'll be glad to answer off-line. Deepak Jain AiNET
Deepak Jain wrote:
Thanks for the kind words Ken.
Power failure testing and network testing are very different disciplines.
We operate from the point of view that if a failure occurs because we have scheduled testing, it is far better since we have the resources on-site to address it (as opposed to an unplanned event during a hurricane). Not everyone has this philosophy.
This is one of the reasons we do monthly or bimonthly, full live load transfer tests on power at every facility we own and control during the morning hours (~10:00am local time on a weekday, run on gensets for up to two hours). Of course there is sufficient staff and contingency planning on-site to handle almost anything that comes up. The goal is to have a measurable "good" outcome at our highest reasonable load levels [temperature, data load, etc].
At least once a year I like to go out and kick the service entrance breaker to give the whole enchilada an honest-to-$deity plugs-out test. As you said, not recommended if you don't maintain stuff, but that's how confident I feel that my system works. ~Seth
At least once a year I like to go out and kick the service entrance breaker to give the whole enchilada an honest-to-$deity plugs-out test. As you said, not recommended if you don't maintain stuff, but that's how confident I feel that my system works.
Nature has a way of testing it, even if you don't. :) For those who haven't seen this occur, make sure you have a plan in case your breaker doesn't flip back to the normal position, or your transfer switch stops switching (in either direction -- for example, it fuses itself into the "generator/emergency" position). For small supplies (say <1MW) it's not as big a deal, but when the breakers in a bigger facility can weigh hundreds of pounds each and can take months to replace, these are real issues and will test your sparing, consistency and other disciplines. Deepak Jain AiNET
participants (14)
- Dan Snyder
- Deepak Jain
- Dylan Ebner
- eric clark
- Frank Bulk - iName.com
- Jack Bates
- James Hess
- Jeff Aitken
- Ken Gilmour
- Matthew Palmer
- Rod Beck
- Ross Vandegrift
- Seth Mattinen
- Warren Kumari