You don't say who the "clients" are. I presume this is a web-based application, so essentially you are trying to migrate a service in flight to another set of servers within the TCP/HTTP session timeout, without the client missing a beat?

If it is another kind of client, does it also have auto reconnect/retry logic built in to restore service if the connection times out?

Is the session/host state worth preserving only for communication between the servers in the cluster, or also for communication between the clients and the service?

I know of people who have been able to do this on LANs, using SANs to store shared host state and having a new VM pick up the connections. On an internet-wide scale, though, you are likely looking at only a probabilistic guarantee, one that assumes your routing always converges in time and packets start flowing to the Disaster Recovery (DR) site.

This is much easier if you can stick within a single AS, of course.

Others will be able to answer whether these routing changes will attract dampening penalties if you have to pick providers in different ASes.

Assuming all of that doesn't matter, a somewhat cleaner way to do this would be to advertise a less specific route from the DR location that covers the more specific route of the primary location. If the primary route is withdrawn, voila: traffic starts moving to the less specific route automatically, without you having to scramble at the time of the outage to inject a new route.
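
To illustrate why the covering route takes over on its own, here is a minimal Python sketch of longest-prefix-match route selection. The prefixes and site names are made up for illustration (they are not from the original question), and real BGP best-path selection and convergence involve far more than this lookup.

```python
import ipaddress


def best_route(routes, destination):
    """Return the (prefix, next_hop) pair with the longest matching prefix.

    `routes` maps ipaddress networks to a next-hop label.
    Returns None when no prefix covers the destination.
    """
    addr = ipaddress.ip_address(destination)
    matches = [(net, hop) for net, hop in routes.items() if addr in net]
    if not matches:
        return None
    # Longest prefix wins, as in ordinary IP forwarding.
    return max(matches, key=lambda match: match[0].prefixlen)


# Hypothetical prefixes and labels, chosen only for this example.
PRIMARY = ipaddress.ip_network("198.51.100.0/24")  # more specific, primary site
COVERING = ipaddress.ip_network("198.51.100.0/23")  # less specific, DR site

routes = {PRIMARY: "primary-site", COVERING: "dr-site"}

# Normal operation: the /24 from the primary site is preferred.
normal = best_route(routes, "198.51.100.10")

# Outage: the primary /24 is withdrawn, only the covering /23 remains.
after_withdrawal = {net: hop for net, hop in routes.items() if net != PRIMARY}
failover = best_route(after_withdrawal, "198.51.100.10")

print(normal[1], "->", failover[1])
```

Nothing has to be injected at failure time in this model: the DR announcement is already present, and removing the more specific entry is enough to change the lookup result. How quickly the withdrawal propagates in practice is a separate question.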



Andrew Warfield <andrew.warfield@cl.cam.ac.uk> wrote:

I've got a bit of a network reconfiguration question that I'm
wondering if anyone on NANOG might be able to provide a bit of advice
on:

I'm working on a project to provide failover of entire cluster-based
(and so multi-host) applications to a geographically distinct backup
site. The general idea is that as one datacentre burns down, a live
service may be moved over to an alternate site without any
interruption to clients. All of the host-state migration is done
using virtual machines and associated magic; I'm trying to get a more
clear understanding as to what is involved in terms of moving the IPs,
and how fast it can potentially be done.

I'm fairly sure that what I would like to do is to arrange what is
effectively dual-homing, but with two geographically distinct homes:
Assuming that I have an in-service primary site A, and an emergency
backup site B, each with a distinct link into a common provider AS, I
would configure B's link as redundant into the stub AS for A -- as if
the link to B were the redundant link in a (traditional single-site)
dual-homing setup. B would additionally host its own IP range, used
for control traffic between the two sites in normal operation.

When I desire to migrate hosts to the failover site, B would send a
BGP update advertizing that the redundant link should become
preferred, and (hopefully) the IGP in the provider AS would seamlessly
redirect traffic. Assuming that everything works okay with the
virtual machine migration, connections would continue as they were and
clients would be unaware of the reconfiguration.

Does the routing reconfiguration story here sound plausible? Does
anyone have any insight as to how long such a reconfiguration would
reasonably take and/or if it is something that I might be able to
negotiate a SLA for with a provider if I wanted to actually deploy
this sort of redundancy as a service? Is anyone aware of similar
high-speed failover schemes in use on the network today?

Thoughts appreciated, I hope this is reasonably on-topic for the list.

best,
a.