On Wed, Mar 17, 2010 at 09:22:06AM -0500, Dan White wrote:
> The servers stop balancing their addresses, and one server starts to exhibit 'peer holds all free leases' in its logs, in which case we need to restart the dhcpd process(es) to force a rebalance.
If restarting one or both dhcpd processes corrects a pool balancing problem, then I suspect what you're looking at is a bug where the servers would fail to schedule a reconnection if the failover socket was lost in a particular way. Because the protocol also uses a message exchange inside the TCP channel to determine whether the socket is up (rather than just TCP keepalives), this can sometimes happen even without a network outage, during load spikes or other brief hiccups on the partner DHCP server.
So far as I know that particular problem is fixed in the current maintenance releases (4.1.1, 4.0.2), so I'm curious whether you are using them. But it's also possible that your pools need rebalancing more often than the default minimum rebalance interval.
> In some cases, and I'm not sure which equipment may be to blame, if one server goes down then the other server will not hand out addresses to clients which had originally received addresses from the failed server. We've dealt with that by balancing our lease times with our MTTR for a failed server.
To me this sounds like another symptom of the failover connection between the servers failing and not being reconnected. It would explain why the server doesn't have the partner's "recently active" bindings: it wouldn't have received updated information if the socket was inactive.
My own opinion of the failover software is that it may be clunky and hard to use (we're working on that), but it is reliable.
One of the warnings I want to give when someone is interested in deploying failover pairs of ISC DHCP servers is to still be prepared to react to a server failure. Unlike other fault tolerance protocols, where losing communication with the peer can safely be taken to mean that the peer is off-line, DHCP servers can go out of communication with each other while still being able to reach clients. To cope with this, failover separates the idea of entering a "communications-interrupted" state from a "partner-down" state, and many rules govern how servers can allocate or extend leases to ensure there are no addressing conflicts (caused by the servers giving one IP address to two different clients, each without the knowledge of the other server).
Failover essentially bridges the gap of a server outage by giving each server in the pair roughly half of the remaining pool of unallocated addresses, which they individually allocate from both in normal operation and when operating in communications-interrupted. Either server can continue to extend already active leases, letting clients keep the addresses they already have, but if it runs out of free* leases, or if the current clients' leases are allowed to expire, it won't be able to admit any new clients or extend expired leases.
Because the software can't detect whether its peer is truly off-line, the operator must manually move the surviving server to partner-down to inform it that it's operating alone. Only then can it use expired leases or leases in the peer's free pool.
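To make the allocation rules above a bit more concrete, here is a minimal Python sketch of them. It is only an illustration of the simplified description in this message, not ISC DHCP code: the names `ServerState`, `LeaseState` and `may_serve` are made up for the example, and the real protocol has more states and timing rules than are modeled here.

```python
from enum import Enum


class ServerState(Enum):
    # Only the two outage-related states discussed above are modeled.
    COMMUNICATIONS_INTERRUPTED = "communications-interrupted"
    PARTNER_DOWN = "partner-down"


class LeaseState(Enum):
    # Hypothetical labels for this sketch, not the server's internal names.
    MY_FREE = "free, in this server's half of the pool"
    PEER_FREE = "free, in the peer's half of the pool"
    ACTIVE = "active, held by a client"
    EXPIRED = "expired"


def may_serve(server: ServerState, lease: LeaseState) -> bool:
    """Return True if the surviving server may allocate or extend the lease."""
    if lease is LeaseState.ACTIVE:
        # Either server can keep extending leases that are already active.
        return True
    if lease is LeaseState.MY_FREE:
        # Each server allocates new clients from its own free half.
        return True
    # Expired leases and the peer's free leases are only usable once the
    # operator has told this server that it is operating alone.
    return server is ServerState.PARTNER_DOWN


if __name__ == "__main__":
    for server in ServerState:
        for lease in LeaseState:
            print(f"{server.value:28} {lease.name:10} {may_serve(server, lease)}")
```

Running it prints a small table. The only rows that differ between the two server states are `PEER_FREE` and `EXPIRED`, which is exactly the capacity you gain by moving the survivor to partner-down.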
Most people have a bad experience with failover, and therefore form a poor opinion of it, because they go through such an event without knowing that they need to transition the server state explicitly during an outage. It isn't as automatic as the word 'failover' makes it sound.
For now the lesson is that failover gives you precious hours or days to find a terminal and either repair the partner server or put the surviving server into partner-down.
It's also quite a good idea to monitor the failover state of your servers and ensure they aren't spending a great deal of time in communications-interrupted.
NEW in 4.2.0:
There is a new configuration option intended for servers sharing a "Heartbeat Cable", or similar situations where the operator is confident that a failover socket disconnect implies the peer is truly down. A configurable timeout can now be set, after which the server automatically enters partner-down. Note that the failover protocol still requires an "MCLT delay" before the server is allowed to use the peer's free leases, but this is not normally a problem in usual operation.
There is also a new optimization that significantly increases endurance during communications-interrupted, and in many cases it may let you remain in that state indefinitely. It won't help you if, for example, the active load requires the full free pool, because the server still can't allocate the peer's leases. So it doesn't replace entering partner-down for long-term outages.
These features were both provided in 4.2.0 (a1 and a2 respectively, as memory serves), currently an alpha which I hope to move to its first beta soon.
* I'm simplifying a lot to make this shorter and easier to read, and also using language that failover overloads. If you really want to know all about failover internals, ask us on dhcp-users. :)
-- 
David W. Hankins
Software Engineer
Internet Systems Consortium, Inc.
BIND 10 needs more DHCP voices, there just aren't enough in our heads.
http://www.isc.org/bind10