On Tue, 26 Oct 2004 13:52:51 EDT, "Gregory (Grisha) Trubetskoy" said:
average your servers are 98% underutilized, you are wasting a lot of
Remember in your analysis to include premature hardware failure due to too many power cycles...

A server can *easily* be running "on average" at only 20-30% of capacity, simply because requests arrive at essentially random times - so you have to deal with the case where the "average" over a minute is 20% of capacity at 600 hits (10/sec), but some individual seconds have only 1 hit and others have 50 (at which point you're running with the meter spiked). Time-of-day issues also get involved - you may need enough iron to handle the peak load at 2PM, but be sitting mostly idle at 2AM.

Unfortunately, I've seen very few rack-mount boxes that support partial power-down to save energy - if it's got 2 Xeon processors and 2G of memory, both CPUs and all the memory cards are hot all the time...

There are also latency issues - if some CPUs on a node or some nodes in a cluster are powered down, there is a lag between when you start firing them up and when they're ready to go - so you need to walk the very fine line between "too short a spike powers stuff up needlessly" (very bad for the hardware) and "too much damping means you get bottlenecked while waiting for spin-up".

(Been there, done that - there's a 1200-node cluster across the hall, and there's no really good/easy way to ramp up all 1200 for big jobs and power down 800 nodes if there's only 400 nodes' worth of work handy. So we end up leaving it all fired up and letting each node's "idle loop" be "good enough".)

If it were as easy as all that, we'd all be doing it already. :)
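For anybody who wants to play with the spike-versus-damping trade-off, here's a rough toy model in Python. Everything in it is made up for illustration: the `simulate` function, the one-tick-per-second granularity, the 25 hits/sec per node, the 2-tick spin-up lag and the two load traces are all assumptions, not measurements from any real cluster. The controller only powers a node up (or down) after `damping` consecutive ticks of over- (or under-) load.

```python
import math

def simulate(load, node_capacity, spin_up_lag, damping):
    """Toy power controller. Returns (power_ups, bottlenecked_ticks).

    load          -- demand per tick (hits), hypothetical trace
    node_capacity -- hits one node can serve per tick (assumed)
    spin_up_lag   -- ticks between "power on" and "ready" (assumed)
    damping       -- consecutive over/underloaded ticks before acting
    """
    active = 1            # nodes ready to serve
    booting = []          # remaining ticks for nodes still spinning up
    over = under = 0      # consecutive overloaded / underloaded ticks
    power_ups = 0
    bottlenecked = 0

    for demand in load:
        # Nodes that have finished spinning up become active.
        booting = [t - 1 for t in booting]
        active += sum(1 for t in booting if t <= 0)
        booting = [t for t in booting if t > 0]

        needed = max(1, math.ceil(demand / node_capacity))
        if demand > active * node_capacity:
            bottlenecked += 1          # more work than ready nodes can take

        committed = active + len(booting)
        if needed > committed:
            over += 1
            under = 0
            if over >= damping:        # overload lasted long enough: power up
                extra = needed - committed
                booting.extend([spin_up_lag] * extra)
                power_ups += extra
                over = 0
        elif needed < active:
            under += 1
            over = 0
            if under >= damping:       # idle long enough: power down
                active = needed
                under = 0
        else:
            over = under = 0
    return power_ups, bottlenecked

# Two one-second spikes to 50 hits on a 10 hits/sec baseline.
spiky = [10] * 5 + [50] + [10] * 5 + [50] + [10] * 5
# A load that steps up to 50 hits/sec and stays there.
sustained = [10] * 5 + [50] * 10

for damping in (1, 3):
    print(damping,
          simulate(spiky, 25, 2, damping),
          simulate(sustained, 25, 2, damping))
# prints:
# 1 (2, 2) (1, 2)
# 3 (0, 2) (1, 4)
```

With `damping=1` the spiky trace causes two power-ups that arrive after the spike is already over, so they buy nothing and just cost you power cycles. With `damping=3` those needless power-ups go away, but the sustained load sits bottlenecked for 4 ticks instead of 2 while waiting for the extra node. Neither setting wins on both traces, which is the whole problem.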