On 19/03/10 17:10 -0700, Mike wrote:
David W. Hankins wrote:
On Wed, Mar 17, 2010 at 09:22:06AM -0500, Dan White wrote:
The servers stop balancing their addresses, and one server starts to exhibit 'peer holds all free leases' in its logs, in which case we need to restart the dhcpd process(es) to force a rebalance.
If restarting one or both dhcpd processes corrects a pool balancing problem, then I suspect what you're looking at is a bug where the servers would fail to schedule a reconnection if the failover socket is lost in a particular way. Because the protocol also uses a message exchange inside the TCP channel to determine if the socket is up (rather than just TCP keepalives), this can sometimes happen even without a network outage, during load spikes or other brief hiccups on
<long explanation snipped>
With all due respect to, and acknowledgment of, the tremendous contributions of ISC and of you yourself, Mr. Hankins, I have to comment that failover in isc-dhcp is broken by design, because it requires the amount of hand-holding and operator thinking in the event of a failure that you explained to us at length. Failure needs to be handled automatically and without any intervention at all; otherwise you might as well not have it, and I think most network operators would agree.
I don't want to defend bad code where it may exist, but I view the problems we've encountered with ISC DHCP to be minor compared to the benefits.

It may not be fair to compare DHCP failover to redundancy in a routing scenario. In a routing failure, I'd be highly motivated to find the root cause, open tickets, and get the problem fixed. In a scenario where a couple of customers are unable to pull an IP address every few months, I'm OK with manual intervention as long as state is maintained. I'd argue that it's more important to maintain data integrity (no two servers think they own the same IP) than availability (where one server is too aggressive and corrupts data).

That's true of much of the open source software I use, such as cyrus (email) replication and openldap synchronization.

Given the resources I and others in my company have to deal with issues, it's always a matter of putting out the biggest fire. If/when problems with DHCP failover become a big enough issue, we'll spend the time to find out what in our network is causing the issue and fix it, or find out what the bug is in the software and open a bug report. All problems are fixable given enough resources and enough motivation.

-- Dan White
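
For what it's worth, the manual intervention discussed in this thread is easier to live with if someone notices the symptom early. Below is a rough sketch, not anything ISC ships, that scans log lines for the 'peer holds all free leases' message quoted above. The function names, the threshold parameter, and the idea that the message can be matched as a plain substring of a syslog line are all assumptions made for illustration; adjust them to whatever your own logs actually look like.

```python
import re
from typing import Iterable, List

# Message text as quoted earlier in this thread. The surrounding log
# format (timestamp, host, process name) is an assumption, so only the
# message itself is matched.
IMBALANCE_PATTERN = re.compile(r"peer holds all free leases", re.IGNORECASE)


def find_imbalance_lines(lines: Iterable[str]) -> List[str]:
    """Return the log lines that contain the pool-imbalance message."""
    return [line.rstrip("\n") for line in lines if IMBALANCE_PATTERN.search(line)]


def needs_attention(lines: Iterable[str], threshold: int = 1) -> bool:
    """Return True when at least `threshold` matching lines are present.

    `threshold` is a hypothetical knob: a single hit may be noise,
    while a steady stream suggests the pools have stopped balancing.
    """
    return len(find_imbalance_lines(lines)) >= threshold
```

Fed the lines of whatever file dhcpd logs to on your system, `needs_attention` gives a yes/no answer that a cron job or monitoring check could act on. It only reports the symptom; whether to restart dhcpd is still the operator's call.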