Re: ISC DHCP server failover

19 Mar 2010

On Wed, Mar 17, 2010 at 09:22:06AM -0500, Dan White wrote:
...
> The servers stop balancing their addresses, and one server starts to
> exhibit 'peer holds all free leases' in its logs, in which case we need to
> restart the dhcpd process(es) to force a rebalance.

If restarting one or both dhcpd processes corrects a pool balancing
problem, then I suspect what you're looking at is a bug where the
servers would fail to schedule a reconnection if the failover socket
is lost in a particular way.  Because the protocol also uses a message
exchange inside the TCP channel to determine if the socket is up
(rather than just TCP keepalives), this can sometimes happen even
without a network outage, for instance during load spikes or other
brief hiccups on the partner DHCP server.

So far as I know that particular problem is fixed in current
maintenance releases (4.1.1, 4.0.2), so I'm curious if you are
using them.

But it's also possible that your pools need rebalancing more often
than the default minimum rebalance interval.
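
A minimal sketch of one way to watch for the symptom described above,
so that it is noticed before clients are affected.  Only the message
text 'peer holds all free leases' comes from the report; the function
name and the sample lines are made up for illustration, and the real
log layout on your system may differ.

```python
from typing import Iterable

# Message text as quoted in the report; everything else is illustrative.
SYMPTOM = "peer holds all free leases"


def count_symptom(lines: Iterable[str], needle: str = SYMPTOM) -> int:
    """Return the number of log lines that contain the symptom text."""
    return sum(1 for line in lines if needle in line)


if __name__ == "__main__":
    # Hypothetical sample lines, not real dhcpd output.
    sample = [
        "<hypothetical log prefix>: peer holds all free leases",
        "<hypothetical log prefix>: some unrelated message",
    ]
    print(count_symptom(sample))
```

A steadily growing count on one server of the pair is the cue to check
whether the failover connection is still up.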
...
> In some cases, and I'm not sure which equipment may be to blame, if one
> server goes down then the other server will not hand out addresses to
> clients which had originally received addresses from the failed server.
> We've dealt with that by balancing our lease times with our MTTR for a
> failed server.

To me this sounds like another symptom of the failover connection
between the servers failing and not being reconnected.  It would
explain why the server doesn't have the partner's "recently active"
bindings: it wouldn't have received updated information while the
socket was inactive.

My own opinion of the failover software is that it may be clunky and
hard to use (we're working on that), but it is reliable.

One of the warnings I want to give when someone is interested in
deploying failover pairs of ISC DHCP servers is to still be prepared
to react to a server failure.  Unlike other fault tolerance protocols,
where losing communication with the peer can safely be taken to mean
that the peer is off-line, DHCP servers can go out of communication
with each other while still being able to reach clients.

To cope with this, failover separates the idea of entering a
"communications-interrupted" state from that of entering a
"partner-down" state, and many rules govern how servers can allocate
or extend leases to ensure there are no addressing conflicts (caused
by the two servers each giving the same IP address to a different
client without the other's knowledge).

Failover essentially bridges the gap of a server outage by giving each
server in the pair roughly half of the remaining pool of unallocated
addresses.  Each server allocates from its own half independently,
both in normal operation and while in communications-interrupted.
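
To make the "roughly half each" idea concrete, here is a toy sketch.
It is not how dhcpd actually balances a pool (the real server follows
its own balancing rules); the function name and the cut-in-the-middle
strategy are assumptions chosen only to illustrate the split.

```python
import ipaddress


def split_free_pool(free_addresses):
    """Split unallocated addresses into two roughly equal halves.

    Toy model: sort the addresses numerically and cut the list in the
    middle.  With an odd count the first server gets the extra address.
    """
    ordered = sorted(free_addresses, key=ipaddress.ip_address)
    middle = (len(ordered) + 1) // 2
    return ordered[:middle], ordered[middle:]


if __name__ == "__main__":
    mine, peers = split_free_pool(["192.0.2.10", "192.0.2.2", "192.0.2.3"])
    print(mine, peers)
```

Each server can keep admitting new clients from its own half even when
it cannot reach its peer, which is what buys time during an outage.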

Either server can continue to extend already active leases, letting
clients keep the addresses they already have, but if it runs out of
free* leases, or if the current clients' leases are allowed to expire,
it won't be able to admit any new clients or extend expired leases.

Because the software can't detect if its peer is truly off-line, the
operator must manually move the surviving server to partner-down to
inform it that it's operating alone.  Only then can it use expired
leases or leases in the peer's free pool.
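
The rules described in the last few paragraphs can be sketched as a
small lookup table.  This is a deliberately simplified model of the
two outage states only: the lease categories and names are invented
for illustration, and it ignores both the many finer rules of the
protocol and any waiting period that applies after entering
partner-down.

```python
# What a surviving server may hand out or extend in each outage state,
# as a simplified reading of the description above (hypothetical names).
ALLOWED = {
    "communications-interrupted": {"own-free", "active"},
    "partner-down": {"own-free", "active", "expired", "peer-free"},
}


def may_serve(state: str, lease_kind: str) -> bool:
    """Return True if a server in `state` may use a lease of `lease_kind`."""
    return lease_kind in ALLOWED[state]


if __name__ == "__main__":
    for state in ALLOWED:
        for kind in ("own-free", "active", "expired", "peer-free"):
            print(f"{state:28} {kind:10} {may_serve(state, kind)}")
```

The two rows differ only in the "expired" and "peer-free" columns,
which is exactly the capacity an operator unlocks by moving the
surviving server to partner-down.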

Most people have a bad experience with failover, and therefore form
a poor opinion of it, because they go through such an event without
knowing that they need to transition the server state explicitly
during an outage.  It isn't as automatic as the word 'failover' makes
it sound.

For now the lesson is that failover gives you precious hours or days
to find a terminal and repair the partner server or put the surviving
server into partner-down.  It's also quite a good idea to monitor the
failover state of your servers and ensure they aren't spending a great
deal of time in communications-interrupted.
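
As a rough illustration of that kind of monitoring, the sketch below
totals the time spent in one state from a list of state transitions.
How you collect the transitions (logs, a management interface, and so
on) is left open; the function name and the (timestamp, state) input
format are assumptions made for this example.

```python
def seconds_in_state(transitions, target, now):
    """Total seconds spent in `target`.

    `transitions` is a time-ordered list of (timestamp_seconds, state)
    pairs, each marking the moment a state was entered.  The last state
    is assumed to still be in effect at `now`.
    """
    total = 0
    # Pair each transition with the start of the next one (or `now`).
    ends = [start for start, _ in transitions[1:]] + [now]
    for (start, state), end in zip(transitions, ends):
        if state == target:
            total += end - start
    return total


if __name__ == "__main__":
    history = [
        (0, "normal"),
        (100, "communications-interrupted"),
        (400, "normal"),
        (1000, "communications-interrupted"),
    ]
    print(seconds_in_state(history, "communications-interrupted", now=1200))
```

Alerting when this total grows beyond a small fraction of the
observation window is one simple way to catch a pair that has quietly
stopped talking.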

NEW in 4.2.0:

There is a new configuration option intended for servers sharing a
"Heartbeat Cable", or similar situations where the operator is
confident that a failover socket disconnect implies the peer is truly
down.  A timeout can now be configured after which the server
automatically enters partner-down.  Note that the failover protocol
still requires an "MCLT delay" before the server is allowed to use the
peer's free leases, but this is not normally a problem.
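
To see how the two delays add up, here is a small timing sketch.  It
assumes that the MCLT delay is counted from the moment the server
enters partner-down; the function and parameter names are hypothetical
and are not dhcpd configuration keywords.

```python
from datetime import datetime, timedelta


def peer_free_leases_usable_at(disconnect, auto_partner_down, mclt):
    """Earliest time the survivor may use the peer's free leases.

    Assumption: the server enters partner-down `auto_partner_down`
    after the failover socket is lost, and must then wait a further
    `mclt` before touching the peer's free leases.
    """
    entered_partner_down = disconnect + auto_partner_down
    return entered_partner_down + mclt


if __name__ == "__main__":
    lost = datetime(2010, 3, 19, 12, 0, 0)
    usable = peer_free_leases_usable_at(
        lost,
        auto_partner_down=timedelta(minutes=10),  # illustrative value
        mclt=timedelta(minutes=30),               # illustrative value
    )
    print(usable.isoformat())
```

Until that time the surviving server still has its own half of the
free pool to draw on, which is why the extra wait rarely matters.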

There is a new optimization that significantly increases endurance
during communications-interrupted, and in many cases it could let you
remain in that state indefinitely.  It won't help you if, for example,
the active load requires the full free pool, because the server still
can't allocate the peer's leases.

So it doesn't replace entering partner-down for long-term outages.

These features were both provided in 4.2.0 (a1 and a2 respectively,
if memory serves), currently an alpha which I hope to move to its
first beta soon.

* I'm simplifying a lot to make this shorter and easier to read, and
  also using language that failover overloads.  If you really want to
  know all about failover internals, ask us on dhcp-users. :)

-- 
David W. Hankins	BIND 10 needs more DHCP voices,
Software Engineer		   there just aren't enough in our heads.
Internet Systems Consortium, Inc.	http://www.isc.org/bind10