On Aug 10, 2007, at 1:55 AM, Paul Reubens wrote:
How do you engineer around enterprise and ISP recursors that don't honor TTL, instead caching DNS records for a week or more?
A friend of mine was working for a place that performed some service on data (not important what; you send them some data through this really ugly client app that they wrote in-house, and they send you back something...). Anyway, for various reasons they needed to move out of their current data-center to a new provider. They had this truly monumental plan for doing this that they had been working on for months -- MS Project printouts that covered entire walls in this huge rainbow of colors, 400 or so pages of plans, etc etc etc -- and it all boiled down to: decrease the TTL, then swap in the new A record at midnight on Friday. As soon as the TTL expired, everything would start working in the new place and it would all be transparent to the end users...

Anyway, my friend calls me at like 3 in the morning on Saturday -- they have updated DNS and none of their clients are connecting to the new place... It seems that they have burnt some bridges with the old provider and will be shut off on Saturday evening -- he's really desperate, so I agree to wander over and take a look...

I arrive to find utter confusion -- the CEO is screaming at the CTO, who appears to have decided that the best way to fix things is by getting drunk, random other people are screaming (apparently just for fun), etc.... I manage to get someone to calm down for long enough to explain the summary of the plan to me and run nslookup. Sure enough, the TTL is really low and the new IP is being handed out, etc. I ask how long it took for the client to fail over during their tests -- "Oh, no, we didn't test like that, we didn't want to impact the current service, so we tested with a different domain and checked how long it took for IE to pick up the change... It was less than 10 minutes..."

We track down one of the developers and talk to him. He explains this long and involved system with the client performing health-checks on the server and reconnecting with exponential back-off, etc etc etc.
It's all great -- apart from the fact that he calls gethostbyname() during startup, and then never again.... This is a *really* common issue....

W
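To make the failure mode above concrete, here is a minimal sketch of a client that repeats the name lookup before every reconnect attempt instead of resolving once at startup. It is not the client from the story (that code was never posted); the function names, the 5-second connect timeout, and the back-off limits are all assumptions chosen for illustration.

```python
import socket
import time


def resolve_now(host, port):
    """Ask the system resolver for the current addresses of host."""
    infos = socket.getaddrinfo(host, port, type=socket.SOCK_STREAM)
    # Each entry is (family, type, proto, canonname, sockaddr);
    # sockaddr[0] is the IP address as a string for both IPv4 and IPv6.
    return [info[4][0] for info in infos]


def connect_with_reresolve(host, port, resolve=resolve_now,
                           dial=socket.create_connection, sleep=time.sleep,
                           max_attempts=6, base_delay=1.0, max_delay=60.0):
    """Connect to host:port, looking the name up again before every attempt."""
    delay = base_delay
    for attempt in range(1, max_attempts + 1):
        try:
            # The lookup happens inside the retry loop, so a changed
            # A record is picked up once the caches in front of us expire.
            addresses = resolve(host, port)
        except OSError:
            addresses = []  # lookup failed; treat as "nothing to try yet"
        for ip in addresses:
            try:
                return dial((ip, port), 5.0)  # assumed 5 s connect timeout
            except OSError:
                continue  # try the next address, if any
        if attempt < max_attempts:
            sleep(delay)
            delay = min(delay * 2, max_delay)  # exponential back-off
    raise ConnectionError(f"could not connect to {host}:{port}")
```

The only point of the sketch is where the lookup sits: inside the loop. It still depends on whatever caching the OS and the upstream recursor do, so it shortens the outage to roughly the TTL rather than removing it.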
On 8/7/07, Patrick W. Gilmore <patrick@ianai.net> wrote:
On Aug 7, 2007, at 10:05 AM, Michal Krsek wrote:
5) User redirection - You have to implement a scalable mechanism that redirects users to the closest POP. You can use application redirect (fast, but not very scalable), DNS redirect (scalable, but not so fast) or anycasting (this needs cooperation with the ISP).
What is slow about handing back different answers to the same query via DNS, especially when they are pre-calculated? Seems very fast to me.
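As a rough illustration of "pre-calculated" answers, the sketch below picks a POP address from a table keyed by the querying resolver's network, using longest-prefix match. Everything in it is hypothetical: the table contents, the names PRECOMPUTED, DEFAULT_POP and answer_for, and the addresses (all taken from the RFC 5737 documentation ranges). No real DNS server's internals were described in this thread.

```python
import ipaddress

# Hypothetical table, built offline: resolver network -> address of the
# POP judged closest to it. All values are documentation addresses.
PRECOMPUTED = {
    ipaddress.ip_network("192.0.2.0/24"): "198.51.100.10",
    ipaddress.ip_network("192.0.2.128/25"): "198.51.100.20",
    ipaddress.ip_network("203.0.113.0/24"): "198.51.100.30",
}
DEFAULT_POP = "198.51.100.1"  # assumed fallback for unknown resolvers


def answer_for(resolver_ip):
    """Return the A record to hand back for a query from resolver_ip."""
    addr = ipaddress.ip_address(resolver_ip)
    matches = [net for net in PRECOMPUTED if addr in net]
    if not matches:
        return DEFAULT_POP
    # Longest-prefix match: the most specific network wins.
    best = max(matches, key=lambda net: net.prefixlen)
    return PRECOMPUTED[best]
```

For example, answer_for("192.0.2.200") selects the /25 entry, not the covering /24. The linear scan keeps the sketch short; a server handling real query volume would more likely use a trie or similar structure, but the per-query work stays a table lookup either way.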
Yes, DNS-based redirection scales very well.
But there are two problems: 1) The client may not be in the same network as its DNS server (I use my home DNS server even when I'm at an IETF or I2 meeting on the other side of the globe).
This has been discussed. Operational experience posted here by Owen shows < 10% of users are "far" from their recursive NS.
You are the tiny minority. (Don't feel bad, so am I. :) Most "users" either use the NS handed out by their local DHCP server, or they are VPN'ing anyway.
2) DNS TTL makes real-time traffic management impossible. Remember that you may be distributing not only network traffic, but sometimes also server load. If one server/POP fails or is overloaded, you need to redirect users to another one in real time.
Define "real time"? To do it in 1 second or less is nigh impossible. But I challenge you to fail anything over in 1 second when IP communication with end users not on your LAN is involved.
I've seen TTLs as low as 20s, giving you a mean fail-over time of 10 seconds. That's more than fast enough for most applications these days.
-- TTFN, patrick
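The "20s TTL, mean fail-over of 10 seconds" figure above follows if the record changes at a moment unrelated to when each resolver last refreshed its cache, so the cached copy's age is spread evenly between 0 and the TTL. A small sketch of that arithmetic, assuming resolvers honor the TTL (the function name and the sample count are made up for illustration):

```python
def mean_stale_time(ttl, samples=1000):
    """Average time a cache keeps the old record after it changes."""
    # Age of the cached copy at the moment of the change, spread evenly
    # over [0, ttl) using the midpoint of each of `samples` slices.
    ages = [(i + 0.5) * ttl / samples for i in range(samples)]
    # The old answer keeps being served until the cached copy expires.
    remaining = [ttl - age for age in ages]
    return sum(remaining) / samples
```

With ttl=20 this comes out at 10 seconds (up to floating-point rounding), and the worst case for any single resolver is the full 20. Recursors that ignore the TTL, as raised at the top of the thread, fall outside this estimate.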