Hi All,

I'm attempting to devise a method which will provide continuous operation of certain resources in the event of a disaster at a single facility. The types of resources that need to be available in the event of a disaster are e-commerce applications and other business-critical resources.

Some of the questions I keep running into are: Should the additional sites be connected to the primary site (and/or the Internet directly)? What is the best way to handle the routing? Obviously two devices cannot occupy the same IP address at the same time, so how do you provide that instant 'cut-over'? I could see using application balancers to do this, but then what if the application balancers fail, etc.?

Any advice from folks on list or off who have done similar work is greatly appreciated.

Thanks,
-Drew
On Jun 3, 2009, at 7:09 PM, Drew Weaver wrote:
What is the best way to handle the routing? Obviously two devices cannot occupy the same IP address at the same time, so how do you provide that instant 'cut-over'?
Avoid 'cut-over' entirely - go active/active/etc., and use DNS-based GSLB for the various system elements.

-----------------------------------------------------------------------
Roland Dobbins <rdobbins@arbor.net> // <http://www.arbornetworks.com>

  Unfortunately, inefficiency scales really well.

    -- Kevin Lawton
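The DNS-based GSLB approach hinges on health checks deciding which site's addresses get served at any moment. A minimal sketch of that decision logic, assuming invented site names, RFC 5737 documentation addresses, and an injectable probe (this is not any vendor's GSLB API):

```python
# Hypothetical sketch of the health-check logic behind DNS-based GSLB:
# an authoritative server answers with A records only for sites that
# currently pass a TCP health check. Site names and IPs are made up.

import socket

SITES = {
    "dc-east": "192.0.2.10",     # RFC 5737 documentation addresses,
    "dc-west": "198.51.100.10",  # stand-ins for real site IPs
}

def site_is_healthy(ip, port=443, timeout=2.0):
    """Return True if a TCP connection to the service port succeeds."""
    try:
        with socket.create_connection((ip, port), timeout=timeout):
            return True
    except OSError:
        return False

def records_to_serve(sites, probe=site_is_healthy):
    """Answer with every healthy site; active/active means no 'cut-over'."""
    healthy = [ip for ip in sites.values() if probe(ip)]
    # If every probe fails, serve all records rather than an empty answer,
    # so a broken monitor can't black-hole the whole service.
    return healthy or list(sites.values())
```

With both sites active, a failed facility simply drops out of the answer set on the next probe; nothing has to be "switched over."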
Drew,

IMO, as your %availability goes up (99, 99.9, 99.99 ... 100%), your price to implement will go up exponentially. That being said, a majority of this will depend on what your budget is to achieve your desired availability.

On the low end you can:
- Fail over to backup servers using DNS (but it may not be instant)
- Consider a replication solution if your resources need fresh data

On the high end you can:
- Run a secondary mirror location, replicating data
- Set up BGP, announcing your IP block out of both locations
- Interconnect your sites with a private link for your iBGP session
On Wed, Jun 3, 2009 at 8:09 AM, Drew Weaver <drew.weaver@thenap.com> wrote:
I'm attempting to devise a method which will provide continuous operation of certain resources in the event of a disaster at a single facility.
Drew,

If you can afford it, stretch the LAN across the facilities via fiber and rebuild the critical services as a load-balanced active-active cluster. Then a facility failure and a routine server failure are identical and are handled by the load balancer. F5s if you like commercial solutions, Linux LVS if you're partial to open source as I am. Then make sure you have an Internet entry into each location with BGP.

BTW, this tends to make maintenance easier too. Just remove servers from the cluster when you need to work on them and add them back in when you're done. Really reduces the off-hours maintenance windows. This is how I did it when I worked at the DNC, and it worked flawlessly.

If you can't afford the fiber, or need to put the DR site too far away for fiber to be practical, you can still build a network which virtualizes your LAN. However, you then have to worry about issues with the broadcast domain and traffic demand between the clustered servers over the slower WAN.

It's doable. I've done it with VPNs over Internet T1s. But you'd better have your developers on board early, and provide them with a simulated environment so that they can get used to the idea of having little bandwidth between the clustered servers.

On Wed, Jun 3, 2009 at 9:25 AM, Ricky Duman <rduman@internap.com> wrote:
- Failover to backup servers using DNS (but may not be instant)
If your budget is more than a shoestring, save yourself some grief and don't go down this road. Even with the TTLs set to 5 minutes, it takes hours to get to two-nines recovery from a DNS change and months to get to five-nines. The DNS protocol is designed to be able to recover quickly, but the applications which use it, like web browsers, aren't. Google "DNS pinning."

Regards,
Bill Herrin

--
William D. Herrin ................ herrin@dirtside.com  bill@herrin.us
3005 Crane Dr. ...................... Web: <http://bill.herrin.us/>
Falls Church, VA 22042-3004
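The "hours to two-nines" point can be illustrated with a toy decay model. The parameters here (10% of clients pinning, a four-hour pin half-life as users restart their browsers) are invented for illustration, not measurements:

```python
# Back-of-envelope model of how slowly clients move off a changed DNS
# record. The 'pinned' fraction and half-life are assumptions made up
# for this sketch, not measured values.

def fraction_stale(t_minutes, ttl=5.0, pinned=0.10, pin_halflife=240.0):
    """Estimate the fraction of clients still using the old address
    t minutes after a DNS change.

    - Well-behaved caches expire roughly linearly over one TTL.
    - 'Pinned' clients (e.g. browsers doing DNS pinning) decay slowly,
      with the given half-life in minutes, as users restart them.
    """
    cache = max(0.0, 1.0 - t_minutes / ttl) * (1.0 - pinned)
    pins = pinned * 0.5 ** (t_minutes / pin_halflife)
    return cache + pins

# One TTL after the change, the well-behaved caches are clean, but the
# pinned ~10% are still hitting the dead site; under this model it takes
# many hours of browser restarts to get under 1% stale.
```

The exact numbers are invented, but the shape matches Herrin's point: the TTL governs only the well-behaved caches, and the long tail belongs to applications that ignore it.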
On Wed, Jun 3, 2009 at 9:37 AM, William Herrin <herrin-nanog@dirtside.com> wrote:
On Wed, Jun 3, 2009 at 8:09 AM, Drew Weaver <drew.weaver@thenap.com> wrote:
<snip>
If you can't afford the fiber or need to put the DR site too far away for fiber to be practical, you can still build a network which virtualizes your LAN. However, you then have to worry about issues with the broadcast domain and traffic demand between the clustered servers over the slower WAN.
It's doable. I've done it with VPNs over Internet T1s. But you'd better have your developers on board early, and provide them with a simulated environment so that they can get used to the idea of having little bandwidth between the clustered servers.
In most cases, the fiber is affordable (a certain bandwidth provider out there offers Layer 2 point-to-point anywhere on their network for very low four-digit prices). We recently put into place an active/active environment with one endpoint in the US and the other endpoint in Amsterdam, and both sides see the other as if they were on the same physical LAN segment.

I've found that, like you said, you *must* have the application developers on board early, as you can only do so much at the network level without the app being aware.

-brandon
--
Brandon Galbraith Mobile: 630.400.6992 FNAL: 630.840.2141
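The maintenance workflow Herrin describes (drain a server from the cluster, work on it, add it back, exactly as a facility failure would be handled) can be sketched as a toy backend pool. The class and method names are illustrative, not the F5 or LVS interface:

```python
# Toy model of a load-balanced active-active pool: taking a server out
# for maintenance and losing it to a failure are the same operation.
# Names are invented for this sketch.

class Pool:
    def __init__(self, servers):
        self.servers = list(servers)   # configured members, in order
        self.active = set(servers)     # members currently in rotation
        self._n = -1                   # round-robin counter

    def drain(self, server):
        """Take a server out of rotation (maintenance or failure)."""
        self.active.discard(server)

    def restore(self, server):
        """Put a configured server back once work on it is done."""
        if server in self.servers:
            self.active.add(server)

    def next_backend(self):
        """Round-robin over whatever is currently active."""
        live = [s for s in self.servers if s in self.active]
        if not live:
            raise RuntimeError("no live backends")
        self._n += 1
        return live[self._n % len(live)]
```

A facility failure just drains every server in that facility; traffic keeps flowing to the survivors with no cut-over step.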
On Jun 3, 2009, at 9:37 PM, William Herrin wrote:
If you can afford it, stretch the LAN across the facilities via fiber and rebuild the critical services as a load balanced active-active cluster.
I would advise strongly against stretching a layer-2 topology across sites, if at all possible - far better to go for layer-3 separation, work with the app/database/sysadmin folks to avoid dependence on direct adjacencies, and gain the topological freedom of routing.

Roland Dobbins <rdobbins@arbor.net>
On Wed, Jun 3, 2009 at 7:09 AM, Drew Weaver <drew.weaver@thenap.com> wrote:
Hi All,
I'm attempting to devise a method which will provide continuous operation of certain resources in the event of a disaster at a single facility.
The types of resources that need to be available in the event of a disaster are ecommerce applications and other business critical resources.
Some of the questions I keep running into are:
Should the additional sites be connected to the primary site (and/or the Internet directly)? What is the best way to handle the routing? Obviously two devices cannot occupy the same IP address at the same time, so how do you provide that instant 'cut-over'? I could see using application balancers to do this but then what if the application balancers fail, etc?
Any advice from folks on list or off who have done similar work is greatly appreciated.
Thanks, -Drew
In an environment where a DR site is deemed critical, it is my experience that critical business applications also have a test or development environment associated with the production one. If you look at the problem this way, then a DR site equipped with the test/devel systems, with one "instance" of production always available, would only be challenging in terms of data sync. Various SAN solutions can resolve that (SAN syncing over WAN/MAN/etc.).

Virtualization of critical systems may also add some benefits here: clone the critical VMs in the DR site, and in conjunction with the storage being available, you'll be able to bring up these machines in no time. Just make sure you have some sort of L2 available - maybe EoS, or tunneling over L3 connectivity; there is a ton of info when querying for virtual machine mobility and inter-site connectivity.

Voice has to be considered, also. For PSTN, make arrangements with your provider to re-route (8xx) numbers in case of disaster. VoIP may add some extra capabilities in terms of reachability over the Internet, in case your DR site cannot accommodate everyone. C/S people, for example, who are critical to interface with customers in case of disaster (no information means bigger losses and perception issues), have to be able to connect even from home.

As far as an "immediate" switch from one to the other, DNS is the primary concern (unless some wise people have hardcoded IPs all over), but there are other issues people tend to forget, at the core of some clients. Take the Oracle "fat" client and its TNS names: I've seen those associated with IPs instead of host names, etc.

Disclaimer: the above is one of many aspects. I have seen the DNS comments already, so I won't repeat those aspects.

HTH,
--
***Stefan
http://twitter.com/netfortius
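The Oracle TNS example suggests a useful pre-failover audit: scan client configuration for hard-coded IP literals, which DNS-based failover can never fix. A rough, regex-based sketch; the tnsnames snippet and addresses are invented:

```python
# Audit sketch: find hard-coded IPv4 literals in client config text
# (tnsnames.ora and friends) that would pin clients to a dead site.
# Purely illustrative; a real audit would walk the config directories.

import re

IPV4 = re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b")

def hardcoded_ips(config_text):
    """Return the IP literals found in a client configuration snippet."""
    return IPV4.findall(config_text)

# Invented tnsnames.ora fragment with one hard-coded address:
tns = """
PROD = (DESCRIPTION=
  (ADDRESS=(PROTOCOL=TCP)(HOST=192.0.2.15)(PORT=1521))
  (CONNECT_DATA=(SERVICE_NAME=prod)))
"""
```

Every hit this turns up is a client that will keep dialing the old facility no matter how quickly the DNS is repointed.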
On Wed, 3 Jun 2009, Drew Weaver wrote:

> Should the additional sites be connected to the primary site
> (and/or the Internet directly)?

Yes, because any out-of-band synchronization method between the servers at the production site and the servers at the DR site is likely to be more difficult to manage. You could do UUCP over a serial line, but...

> What is the best way to handle the routing? Obviously two devices
> cannot occupy the same IP address at the same time, so how do you
> provide that instant 'cut-over'?

This is one of the only instances in which I like NATs. Set up a NAT between the two sites to do static 1-to-1 mapping of each site into a different range for the other, so that the DR servers have the same IP addresses as their production masters, but have a different IP address to synchronize with.

-Bill
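The static 1-to-1 NAT idea amounts to a fixed offset translation between equal-sized prefixes: each site sees the other's servers through a distinct range, so a DR server can carry the same address as its production master yet still be reachable for synchronization. A sketch using Python's `ipaddress` module, with example prefixes chosen purely for illustration:

```python
# Sketch of static 1-to-1 NAT as offset translation between two
# equal-length prefixes. The prefixes below are illustrative only.

from ipaddress import IPv4Address, IPv4Network

def map_1to1(addr, real_net, mapped_net):
    """Translate addr from real_net to the same host offset in mapped_net."""
    addr = IPv4Address(addr)
    real = IPv4Network(real_net)
    mapped = IPv4Network(mapped_net)
    if addr not in real:
        raise ValueError(f"{addr} is not inside {real}")
    offset = int(addr) - int(real.network_address)
    return IPv4Address(int(mapped.network_address) + offset)

# Both sites number their servers out of 10.0.1.0/24; production
# reaches the DR copy of each server through a translated prefix,
# e.g. the DR twin of 10.0.1.25 is addressed as 10.200.1.25.
```

Because the mapping is static and 1-to-1, the translation state never changes during a failover; only which side is "live" does.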
On Wed, Jun 3, 2009 at 12:47 PM, Bill Woodcock <woody@pch.net> wrote:
On Wed, 3 Jun 2009, Drew Weaver wrote:
Should the additional sites be connected to the primary site (and/or the Internet directly)?
Yes, because any out-of-band synchronization method between the servers at the production site and the servers at the DR site is likely to be more difficult to manage. You could do UUCP over a serial line, but...
What is the best way to handle the routing? Obviously two devices cannot occupy the same IP address at the same time, so how do you provide that instant 'cut-over'?
This is one of the only instances in which I like NATs. Set up a NAT between the two sites to do static 1-to-1 mapping of each site into a different range for the other, so that the DR servers have the same IP addresses as their production masters, but have a different IP address to synchronize with.
Or you use RFC 1918 address space at each location, and NAT each side between public anycasted space and your private IP space. This prevents internal IP conflicts, having to deal with site-to-site NAT, etc.

-brandon
On Jun 4, 2009, at 12:53 AM, Brandon Galbraith wrote:
Or you use RFC1918 address space at each location, and NAT each side between public anycasted space and your private IP space. Prevents internal IP conflicts, having to deal with site to site NAT, etc.
With all due respect, both of these posited choices are quite ugly and tend to lead to huge operational difficulties, susceptibility to DDoS, etc. Definitely not recommended except as a last resort in a difficult situation, IMHO.

Roland Dobbins <rdobbins@arbor.net>
On Thu, 4 Jun 2009, Roland Dobbins wrote:

> With all due respect, both of these posited choices are quite ugly and
> tend to lead to huge operational difficulties, susceptibility to DDoS,
> etc. Definitely not recommended except as a last resort in a difficult
> situation, IMHO.

I wouldn't go quite so far as to say that they have security implications, but I definitely agree that these are solutions of last resort, and that any live load-balanced solution is infinitely preferable to a stand-by solution, which, IMHO, is unlikely ever to work as hoped. I was just answering the question at hand, rather than the meta-question of whether the question being asked was the right question. :-)

-Bill
participants (7)
-
Bill Woodcock
-
Brandon Galbraith
-
Drew Weaver
-
Ricky Duman
-
Roland Dobbins
-
Stefan
-
William Herrin