On Mon, Feb 27, 2012 at 7:28 AM, William Herrin <bill@herrin.us> wrote:
On Sun, Feb 26, 2012 at 7:02 PM, Randy Carpenter <rcarpen@network1.net> wrote:
On Feb 26, 2012, at 4:56 PM, Randy Carpenter wrote:
1. Full redundancy with instant failover to other hypervisor hosts upon hardware failure (I thought this was a given!)
This is actually a much harder problem to solve than it sounds, and gets progressively harder depending on what you mean by "failover".
At the very least, having two physical hosts capable of running your VM requires that your VM be stored on some kind of SAN (usually iSCSI-based) storage system. Otherwise, the second host has no way of accessing your VM's data if the first were to die. This makes things an order of magnitude or more expensive.
This does not have to be true at all. Even having a fully fault-tolerant SAN in addition to spare servers should not cost much more than having separate RAID arrays inside each of the servers, when you are talking about 1,000s of servers (which Rackspace certainly has).
Randy,
You're kidding, right?
SAN storage costs the better part of an order of magnitude more than server storage, which itself is several times more expensive than workstation storage. That's before you duplicate the SAN and set up the replication process so that cabinet and room level failures don't take you out.
This is clearly becoming a not-NANOG-ish thread, however...

Failing to have central shared storage (iSCSI, NAS, SAN, whatever you prefer) fails the smell test on a local enterprise-grade virtualization cluster, much less a shared cloud service.

Some people have done tricks with distributing the data using one of the research-ish shared filesystems, rather than separate shared storage. That can be made to work if the host OS model and its available shared filesystems work for you. It doesn't work for VMware vCenter / vMotion-ish stuff as far as I know.

There are plenty of people doing non-enterprise-grade virtualization. There's no mandate that you have the ability to migrate a virtual machine to another node in realtime or restart it immediately on another node if the first node dies suddenly. But anyone saying "we have a cloud" and not providing that type of service is in marketing, not engineering.
From a systems architecture point of view, you can't do that.
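
To make the shared-storage point concrete, here is a minimal sketch, not anyone's real cluster manager. The host names, pool names, and the restart_candidates helper are all made up for illustration. It only models one rule: a VM can be restarted on a surviving node if that node can reach the storage holding the VM's disk.

```python
from dataclasses import dataclass, field


@dataclass
class Host:
    name: str
    alive: bool = True
    # Names of storage pools this host can reach (local disk or shared).
    storage: set = field(default_factory=set)


@dataclass
class VM:
    name: str
    disk_on: str  # name of the storage pool holding the VM's disk image


def restart_candidates(vm, hosts):
    """Return names of live hosts that can reach the VM's disk."""
    return [h.name for h in hosts if h.alive and vm.disk_on in h.storage]


# Hypothetical two-node cluster: each host has private local disk,
# and both are attached to one shared pool (iSCSI, NAS, SAN, ...).
host_a = Host("host-a", storage={"local-a", "shared-san"})
host_b = Host("host-b", storage={"local-b", "shared-san"})
hosts = [host_a, host_b]

vm_local = VM("vm-local", disk_on="local-a")
vm_shared = VM("vm-shared", disk_on="shared-san")

host_a.alive = False  # simulate sudden hardware failure

print(restart_candidates(vm_local, hosts))   # no surviving host sees local-a
print(restart_candidates(vm_shared, hosts))  # host-b can still reach the disk
```

The VM on local disk has nowhere to go once its host dies; the one on the shared pool does. The sketch ignores cost, replication, and fencing, which is where the real argument in this thread lives.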
-- 
-george william herbert
george.herbert@gmail.com