On Mon, Aug 24, 2009 at 09:38:38AM -0400, Dan Snyder wrote:
> We have done power tests before and had no problem. I guess I am looking for someone who does testing of the network equipment outside of just power tests. We had an outage due to a configuration mistake that became apparent when a switch failed. It didn't cause a problem however when we did a power test for the whole data center.

Dan,

With all due respect, if there are config changes being made to your devices that aren't authorized or in accordance with your standards (you *do* have config standards, right?) then you don't have a testing problem, you have a data integrity problem. Periodically inducing failures to catch them is sorta like using your smoke detector as an oven timer.

There are several tools that can help in this area; a good free one is rancid [1], which logs in to your routers and collects copies of configs and other info, all of which gets stored in a central repository. By default, you will be notified via email of any changes.

An even better approach than scanning the hourly config diff emails is to develop scripts that compare the *actual* state of the network with the *desired* state and alert you if the two are not in sync. Obviously this is more work, because you need some way of describing the desired state of the network in a machine-parsable format, but the benefit is that you know in pseudo-realtime when something is wrong, as opposed to finding out the next time a device fails. Rancid diffs + TACACS logs will tell you who made the changes, and with that info you can get at the root of the problem.

Having said that, every planned maintenance activity is an opportunity to run through at least some failure cases. If one of your providers is going to take down a longhaul circuit, you can observe how traffic re-routes and verify that your metrics and/or TE are doing what you expect. Any time you need to load new code on a device, you can test that things fail over appropriately. Of course, you have to be willing to just shut the device down without draining it first, but that's between you and your customers. Link and/or device failures will generate routing events that can be used to test convergence times across your network, etc.

The key is to be prepared.
The more instrumentation you have in place prior to the test, the better you will be able to analyze the impact of the failure. An experienced operator can often tell right away when looking at a bunch of MRTG graphs that "something doesn't look right", but that doesn't tell you *what* is wrong. There are tools (free and commercial) that can help here, too. Have a central syslog server and some kind of log reduction tool in place. Have beacons/probes deployed, in both the control and data planes.

If you want to record, analyze, and even replay routing system events, you might want to take a look at the Route Explorer product from Packet Design [2].

You said "switch failure" above, so I'm guessing that this doesn't apply to you, but there are also good network simulation packages out there. Cariden [3] and WANDL [4] can build models of your network based on actual router configs and let you simulate the impact of various scenarios, including device/link failures. However, these tools are more appropriate for design and planning than for catching configuration mistakes, so they may not be what you're looking for in this case.

--Jeff

[1] http://www.shrubbery.net/rancid/
[2] http://www.packetdesign.com/products/rex.htm
[3] http://www.cariden.com/
[4] http://www.wandl.com/html/index.php
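
To make the actual-versus-desired comparison mentioned above a bit more concrete, here is a minimal sketch in Python. Everything in it is an assumption made for illustration: the data model (device name mapped to a dictionary of settings), the function name find_drift, and the example values are all hypothetical. How you extract the actual state (for example, by parsing the configs that rancid collects) is site-specific and not shown.

```python
from typing import Any, Dict, List, Tuple

# Hypothetical data model: device name -> {setting name -> value}.
State = Dict[str, Dict[str, Any]]
Drift = Tuple[str, str, Any, Any]  # (device, setting, desired, actual)

# Placeholder used when a device or setting exists on only one side.
_MISSING = "<missing>"


def find_drift(desired: State, actual: State) -> List[Drift]:
    """Return every setting whose actual value differs from the desired one."""
    drift: List[Drift] = []
    # Walk the union of devices so that unknown or vanished devices are reported too.
    for device in sorted(set(desired) | set(actual)):
        want = desired.get(device, {})
        have = actual.get(device, {})
        for setting in sorted(set(want) | set(have)):
            w = want.get(setting, _MISSING)
            h = have.get(setting, _MISSING)
            if w != h:
                drift.append((device, setting, w, h))
    return drift


if __name__ == "__main__":
    # Made-up example values, not taken from any real network.
    desired = {
        "sw1": {"uplink": "Gi0/1", "stp_priority": 4096},
        "sw2": {"uplink": "Gi0/1", "stp_priority": 8192},
    }
    actual = {
        "sw1": {"uplink": "Gi0/1", "stp_priority": 4096},
        "sw2": {"uplink": "Gi0/2", "stp_priority": 8192},
    }
    for device, setting, want, have in find_drift(desired, actual):
        print(f"{device}: {setting} should be {want!r} but is {have!r}")
```

Run from cron after each config collection, something along these lines could feed whatever alerting you already have. The hard part is still maintaining the desired-state description, not the comparison itself.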