Does anyone have any recommendations for a reliable cloud host?

We require 1 or 2 very small virtual hosts to host some remote services that serve as backup to our main datacenter. One of these services is a DNS server, so it is important that it is up all the time.

We have been using Rackspace Cloud Servers. We just realized that they have absolutely no redundancy or failover, after experiencing an outage that lasted more than 6 hours yesterday. I am appalled that they would offer something called "cloud" without having any failover at all.

Basic requirements:

1. Full redundancy with instant failover to other hypervisor hosts upon hardware failure (I thought this was a given!)
2. Actual support (with a phone number I can call)
3. Reasonable pricing (no, $800/month is not reasonable when I need a tiny 256 MB RAM server with <1 GB/mo of data transfer)

thanks,
-Randy
Godaddy? Servint.com? Amazon EC2? -mike Sent from my iPhone On Feb 26, 2012, at 12:57, Randy Carpenter <rcarpen@network1.net> wrote:
Does anyone have any recommendation for a reliable cloud host?
We require 1 or 2 very small virtual hosts to host some remote services to serve as backup to our main datacenter. One of these services is a DNS server, so it is important that it is up all the time.
We have been using Rackspace Cloud Servers. We just realized that they have absolutely no redundancy or failover, after experiencing an outage that lasted more than 6 hours yesterday. I am appalled that they would offer something called "cloud" without having any failover at all.
Basic requirements:
1. Full redundancy with instant failover to other hypervisor hosts upon hardware failure (I thought this was a given!)
2. Actual support (with a phone number I can call)
3. Reasonable pricing (no, $800/month is not reasonable when I need a tiny 256 MB RAM server with <1 GB/mo of data transfer)
thanks, -Randy
On 2/26/2012 6:04 PM, Mike Lyon wrote:
Godaddy? Servint.com? Amazon EC2?
-mike
Sent from my iPhone
On Feb 26, 2012, at 12:57, Randy Carpenter <rcarpen@network1.net> wrote:
Does anyone have any recommendation for a reliable cloud host?
We require 1 or 2 very small virtual hosts to host some remote services to serve as backup to our main datacenter. One of these services is a DNS server, so it is important that it is up all the time.
We have been using Rackspace Cloud Servers. We just realized that they have absolutely no redundancy or failover, after experiencing an outage that lasted more than 6 hours yesterday. I am appalled that they would offer something called "cloud" without having any failover at all.
Basic requirements:
1. Full redundancy with instant failover to other hypervisor hosts upon hardware failure (I thought this was a given!)
2. Actual support (with a phone number I can call)
3. Reasonable pricing (no, $800/month is not reasonable when I need a tiny 256 MB RAM server with <1 GB/mo of data transfer)
thanks, -Randy
With what some of those cloud providers are charging per instance, automatic hot standby is really not a given - but that could just be me :)

We use Amazon and are happy with them. With them, you would have to set up your own failover operation, but it's absolutely doable. They give you all the tools you need (load balancing, EBS, etc.), but it's up to you to make it happen. We use their load-balancing feature with HTTP, but it looks like you could do it with any service (DNS, etc.). As a result, when they had their last huge outage (a whole datacenter), we lost some of our instances but our customer-facing services remained available.

Their support options are pretty good, but you have to shell out for a package to get them on the phone. Pricing for that is tied to how much of their resources you are using.

-- Ben
On Feb 26, 2012, at 4:56 PM, Randy Carpenter wrote:
We have been using Rackspace Cloud Servers. We just realized that they have absolutely no redundancy or failover, after experiencing an outage that lasted more than 6 hours yesterday. I am appalled that they would offer something called "cloud" without having any failover at all.
Basic requirements:
1. Full redundancy with instant failover to other hypervisor hosts upon hardware failure (I thought this was a given!)
This is actually a much harder problem to solve than it sounds, and it gets progressively harder depending on what you mean by "failover".

At the very least, having two physical hosts capable of running your VM requires that your VM be stored on some kind of SAN (usually iSCSI-based) storage system. Otherwise, two hosts have no way of accessing your VM's data if one were to die. This makes things an order of magnitude or more expensive.

But then all you've really done is move your single point of failure to the SAN. Small SANs aren't economical, so you end up having tons of customers on one SAN. If it dies, tons of VMs are suddenly down. So you now need a redundant SAN capable of live-mirroring everyone's data. These aren't cheap either, and they add a lot of complexity (how to handle failover if it died mid-write, who has the most recent data after a total blackout, etc.).

And this is really just saying "if hardware fails, I want my VM to reboot on another host." If you define high availability to mean "even if a physical host fails, I don't want a second of downtime; my VM can't reboot," you want something like VMware's ESXi High Availability modules, where your VM is actually running on two hosts at once, in lock-step, so if one fails the other takes over transparently. Licenses for this are ridiculously expensive, and it requires some reasonably complex networking and storage systems.

And I still haven't touched on having to make sure both physical hosts capable of running your VM are on totally independent switches/power/etc., or that the SAN has multiple interfaces so it's not all going through one switch. I also haven't run into anyone deploying a high-availability/redundant system who hasn't accidentally ended up with a split-brain scenario (network isolation causes the backup node to think it's live while the primary is still running). Carefully synchronizing things to prevent this is hard and fragile.

I'm not saying you can't have this feature, but it's not typical in "reasonably priced" cloud services, and it's nearly unheard-of for it to be used automatically. Just moving your virtual machine from local storage to iSCSI-backed storage drastically increases disk latency and caps the whole physical host's disk speed at 1 Gb/s (not much deployment of 10GE adapters among the low-priced VM providers yet). Any provider who automatically provisions virtual machines this way will get complaints that their servers are slow, which is true compared to someone selling VMs that use local storage. The "running your VM on two hosts at once" system has such a performance penalty, and costs so much in licensing, that you really need to NEED it for it not to be a ridiculous waste of resources.

Amazon comes sort of close to this, in that their storage is mostly separate from the hosts running your code. But they have had failures knock out access to your storage, so it's still not where I think you're saying you want to be.

The moral of the story is that just because it's "in the cloud", it doesn't gain higher reliability unless you're specifically taking steps to ensure it. Most people solve this by taking things that are already distributable (like DNS) and setting up multiple servers in different places - that's where all this "cloud stuff" really shines.
(Please, no stories about how you were able to make a redundant virtual machine run using 5-year-old servers in your basement; I'm talking about something that's supportable at provider scale and isn't adding more single points of failure.) -- Kevin
----- Original Message -----
On Feb 26, 2012, at 4:56 PM, Randy Carpenter wrote:
We have been using Rackspace Cloud Servers. We just realized that they have absolutely no redundancy or failover, after experiencing an outage that lasted more than 6 hours yesterday. I am appalled that they would offer something called "cloud" without having any failover at all.
Basic requirements:
1. Full redundancy with instant failover to other hypervisor hosts upon hardware failure (I thought this was a given!)
This is actually a much harder problem to solve than it sounds, and gets progressively harder depending on what you mean by "failover".
At the very least, having two physical hosts capable of running your VM requires that your VM be stored on some kind of SAN (usually iSCSI based) storage system. Otherwise, two hosts have no way of accessing your VM's data if one were to die. This makes things an order of magnitude or higher more expensive.
This does not have to be true at all. Even having a fully fault-tolerant SAN in addition to spare servers should not cost much more than having separate RAID arrays inside each of the servers, when you are talking about thousands of servers (which Rackspace certainly has).
But then all you've really done is moved your single point of failure to the SAN. Small SANs aren't economical, so you end up having tons of customers on one SAN. If it dies tons of VMs are suddenly down. So you now need a redundant SAN capable of live-mirroring everyone's data. These aren't cheap either, and add a lot of complexity to things. (How to handle failover if it died mid-write, who has the most recent data after a total blackout, etc)
NetApp. HA heads. Done. Add a DR site with replication, and you can survive a site failure, and be back up and running in less than an hour. I would think that the big datacenter guys already have this type of thing set up.
And this is really just saying "If hardware fails, i want my VM to reboot on another host." If what you're defining high availability to mean "even if a physical host fails, i don't want a second of downtime, my VM can't reboot" you want something like VMware's ESXi High Availability modules where your VM is actually running on two hosts at once, running in lock-step with each other so if one fails the other takes over transparently. Licenses for this are ridiculously expensive, and requires some reasonably complex networking and storage systems.
I don't need that kind of HA, and I understand that it is not going to be available. 15 minutes of downtime is fine. 6 hours is completely unacceptable, and it is false advertising to say you have a "Cloud" service and then realize that you could have *indefinite* downtime.
And I still haven't touched on having to make sure both physical hosts capable of running your VM are on totally independent switches/power/etc, the SAN has multiple interfaces so it's not all going through one switch, etc.
That is all just basic datacenter design. I have that level of redundancy with my extremely small datacenter. I only have 2 hypervisor hosts running around 12 VMs.
I also haven't run into anyone deploying a high-availability/redundant system where they haven't accidentally ended up with a split-brain scenario (network isolation causes the backup node to think it's live, when the primary is still running). Carefully synchronizing things to prevent this is hard and fragile.
I've never had that happen, not when failover is set up properly (look at STONITH).
I'm not saying you can't have this feature, but it's not typical in "reasonably priced" cloud services, and nearly unheard-of to be something automatically used. Just moving your virtual machine from using local storage to ISCSI backed storage drastically increases disk latency and caps the whole physical host's disk speed to 1gbps
No, it doesn't. Haven't you heard of multipath? Using four 1 Gb/s paths gives me about the same I/O as a local RAID array, with the added feature of failover if a link drops. Four 1 Gb/s ports are ridiculously cheap. And 10 Gb is not nearly as expensive as it used to be.
(not much deployment for 10GE adapters on the low-priced VM provider yet). Any provider who automatically provisions a virtual machine this way will get complaints that their servers are slow, which is true compared to someone selling VMs that use local storage. The "running your VM on two hosts at once" system has such a performance penalty, and costs so much in licensing, you really need to NEED it for it not to be a ridiculous waste of resources.
I don't follow what you mean by "running the VM on two hosts." I just want my single virtual machine to be booted up on a spare hypervisor if there is a hypervisor failure. There are no license costs for that, and it should not have any performance implications at all.
Amazon comes sorta close to this, in that their storage is mostly-totally separate from the hosts running your code. But they have had failures knock out access to your storage, so it's still not where I think you're saying you want to be.
The moral of the story is that just because it's "in the cloud", it doesn't gain higher reliability unless you're specifically taking steps to ensure it. Most people solve this by taking things that are already distributable (like DNS) and setting up multiple DNS servers in different places - that's where all this "cloud stuff" really shines.
The funky problem with DNS specifically is that all the servers need to be up, or someone will get bad answers. Not having a preference system, like MX records have, hurts in this regard. Anycast fixes this to a certain degree, but anycast is another challenge for these hosting providers.
(Please, no stories about how you were able to make a redundant virtual machine run using 5-year-old servers in your basement; I'm talking about something that's supportable at provider scale and isn't adding more single points of failure.)
I have actually done this :-) But I also have a fully redundant system at our main office using very few components. We also have a DR site, connected with fiber. The challenge we have is if we run into routing issues upstream that are beyond our control. Hence the need to have a few things also hosted externally, both geographically and routing-wise.
On Sun, 26 Feb 2012, Randy Carpenter wrote:
I don't need that kind of HA, and I understand that it is not going to be available. 15 minutes of downtime is fine. 6 hours is completely unacceptable, and it is false advertising to say you have a "Cloud" service and then realize that you could have *indefinite* downtime.
Um. You and I apparently work in different clouds.

In my world, the SLAs I have agreed to state, roughly, that uptime is not guaranteed, nor is data recoverability. They suggest that that sort of thing is -my- problem to engineer and architect around. I don't use Rackspace's cloud solution, but I haven't seen anything to suggest that they advertise their service any differently. The "cloud" provides flexibility and rapid deployment at the expense of hands-on control and reliability (and SLAs).

Perhaps you forgot to read the SLA? Or can you show us where someone defines "cloud" as "highly available" and "without indefinite downtime"?

-- david raistrick http://www.netmeister.org/news/learn2quote.html drais@icantclick.org http://www.expita.com/nomime.html
-----Original Message----- From: david raistrick [mailto:drais@icantclick.org] Sent: Sunday, February 26, 2012 7:19 PM To: Randy Carpenter Cc: Nanog Subject: Re: Reliable Cloud host ?
On Sun, 26 Feb 2012, Randy Carpenter wrote:
I don't need that kind of HA, and I understand that it is not going to be available. 15 minutes of downtime is fine. 6 hours is completely unacceptable, and it is false advertising to say you have a "Cloud" service and then realize that you could have *indefinite* downtime.
Um. You and I apparently work in different clouds.
Since it is the weekend, I can't resist writing down a little equation:

Marketing(cloud) <> Technology(cloud)

For some values of "cloud", perhaps?

p.s. tongue firmly in cheek

- Tony
-----Original Message----- From: Tony Patti [mailto:tony@swalter.com] Sent: 27 February 2012 02:42 To: 'david raistrick'; 'Randy Carpenter' Cc: 'Nanog' Subject: RE: Reliable Cloud host ?
-----Original Message----- From: david raistrick [mailto:drais@icantclick.org] Sent: Sunday, February 26, 2012 7:19 PM To: Randy Carpenter Cc: Nanog Subject: Re: Reliable Cloud host ?
On Sun, 26 Feb 2012, Randy Carpenter wrote:
I don't need that kind of HA, and I understand that it is not going to be available. 15 minutes of downtime is fine. 6 hours is completely unacceptable, and it is false advertising to say you have a "Cloud" service and then realize that you could have *indefinite* downtime.
Um. You and I apparently work in different clouds.
Since it is the weekend, I can't resist writing down a little equation:
Marketing(cloud) <> Technology(cloud)
For some values of "cloud" perhaps?
Well, indeed, that is a valid point. All "cloud" means to me is that there is some abstracted instance of x that does not always relate to a particular physical device; indeed, it may well be spread around a few physical devices. I don't think there is any implied magic redundancy / automatic failover / move-your-instance-to-another-bit-of-metal-if-something-breaks in there unless that's specifically stated. Caveat emptor.

-- Leigh
On Sun, 26 Feb 2012, Randy Carpenter wrote:
This does not have to be true at all. Even having a fully fault-tolerant SAN in addition to spare servers should not cost much more than having separate RAID arrays inside each of the servers, when you are talking about thousands of servers (which Rackspace certainly has).
When is your cloud offering going to be available to the public?
I don't need that kind of HA, and I understand that it is not going to be available. 15 minutes of downtime is fine. 6 hours is completely unacceptable, and it is false advertising to say you have a "Cloud" service and then realize that you could have *indefinite* downtime.
I think you're assuming "cloud" means things that the provider does not. To me, "cloud" just means a VPS that I can create/destroy quickly whenever I feel like it, without any interaction with the provider's people - i.e., a few mouse clicks or an API call can "provision a 10 GB CentOS VM with 256 MB RAM". It's up and running before I could locate a CentOS install CD, and if I don't like it, a few clicks or an API call deletes it, reprovisions it, exchanges it for an Ubuntu server, etc.

Cloud doesn't mean that if the node my VM(s) are on dies or crashes, my VMs boot up on an alternate node. That would certainly be a nice feature, but that's just a form of redundancy in a cloud...not a defining attribute of cloud.
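To make "provision via an API call" concrete, here is a minimal sketch of that workflow using libcurl against a purely hypothetical endpoint - the URL, auth header, and JSON fields below are placeholders, not any real provider's API:

#include <stdio.h>
#include <curl/curl.h>

int main(void)
{
    CURL *curl;
    CURLcode res = CURLE_FAILED_INIT;
    struct curl_slist *hdrs = NULL;

    curl_global_init(CURL_GLOBAL_DEFAULT);
    curl = curl_easy_init();
    if (curl) {
        /* Hypothetical endpoint, token, and fields -- every provider's API differs. */
        hdrs = curl_slist_append(hdrs, "Content-Type: application/json");
        hdrs = curl_slist_append(hdrs, "X-Auth-Token: REPLACE_WITH_API_TOKEN");

        curl_easy_setopt(curl, CURLOPT_URL, "https://api.example-cloud.invalid/v1/servers");
        curl_easy_setopt(curl, CURLOPT_HTTPHEADER, hdrs);
        curl_easy_setopt(curl, CURLOPT_POSTFIELDS,
                         "{\"name\":\"ns2\",\"image\":\"centos\",\"ram_mb\":256,\"disk_gb\":10}");

        res = curl_easy_perform(curl);   /* one HTTP POST = one new VM */
        if (res != CURLE_OK)
            fprintf(stderr, "provision request failed: %s\n", curl_easy_strerror(res));

        curl_slist_free_all(hdrs);
        curl_easy_cleanup(curl);
    }
    curl_global_cleanup();
    return res == CURLE_OK ? 0 : 1;
}

Deleting or rebuilding the instance is the same pattern with a different HTTP verb; the point is simply that no human at the provider is in the loop.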
The funky problem with DNS specifically, is that all the servers need to be up, or someone will get bad answers. Not having a preference system,
DNS "handles" down servers. How would one of your DNS servers being down give someone bad answers? It won't give any answers, and another server will be queried. Or do you mean if storage "goes away", but your DNS server is still running, it'll either give nxdomain or stale data...depending on whether it had the data in memory or storage went away and updates began failing because of it? ---------------------------------------------------------------------- Jon Lewis, MCP :) | I route Senior Network Engineer | therefore you are Atlantic Net | _________ http://www.lewis.org/~jlewis/pgp for PGP public key_________
On Sun, Feb 26, 2012 at 7:02 PM, Randy Carpenter <rcarpen@network1.net> wrote:
On Feb 26, 2012, at 4:56 PM, Randy Carpenter wrote:
1. Full redundancy with instant failover to other hypervisor hosts upon hardware failure (I thought this was a given!)
This is actually a much harder problem to solve than it sounds, and gets progressively harder depending on what you mean by "failover".
At the very least, having two physical hosts capable of running your VM requires that your VM be stored on some kind of SAN (usually iSCSI based) storage system. Otherwise, two hosts have no way of accessing your VM's data if one were to die. This makes things an order of magnitude or higher more expensive.
This does not have to be true at all. Even having a fully fault-tolerant SAN in addition to spare servers should not cost much more than having separate RAID arrays inside each of the servers, when you are talking about thousands of servers (which Rackspace certainly has).
Randy,

You're kidding, right? SAN storage costs the better part of an order of magnitude more than server storage, which itself is several times more expensive than workstation storage. That's before you duplicate the SAN and set up the replication process so that cabinet- and room-level failures don't take you out.

DR sites then create a ferocious (read: expensive) bandwidth challenge. Data can't flush from the primary SAN's write cache until the DR SAN acknowledges receipt. If you don't have enough bandwidth to keep up under the heaviest daily loads, the cache quickly fills and the writes block.

I maintain 50ish VMs with about 30 different providers at the moment. Not one of them attempts to do anything like what you describe.
NetApp. HA heads. Done. Add a DR site with replication, and you can survive a site failure, and be back up and running in less than an hour. I would think that the big datacenter guys already have this type of thing set up.
That's expensive, and VMs are sold primarily on price. If you want high reliability, you start with a dedicated colo server. Customers who want DR in a VM environment buy two VMs and build data replication at the app layer.

On Mon, Feb 27, 2012 at 9:31 AM, Max <perldork@webwizarddesign.com> wrote:
Linode.com is not cloud-based, but they offer IP failover between VPS instances at no additional charge. Their pricing is excellent, I have had no downtime issues with them in 3+ years with 3 different customers using them, and they have nice OOB and programmatic API access for controlling VPS instances as well.
Hi Max,

I have had superb results from Linode and highly recommend them. However, they're facilitating application-level failover, not keeping your VM magically alive. And:

http://library.linode.com/linux-ha/ip-failover-heartbeat-pacemaker-ubuntu-10...

"Both Linodes must reside in the same datacenter for IP failover"

So they don't support a full DR capability even if you're smart at the app level.

On Mon, Feb 27, 2012 at 9:39 AM, Jared Mauch <jared@puck.nether.net> wrote:
Is the DNS service authoritative or recursive? If auth, you can solve this a few ways, for example by giving the DNS name people point to multiple AAAA (and A) records that point at a diverse set of instances. DNS is designed to work around a host being down. The same goes for MX and several other services. While it may make the service slightly slower, it's certainly not the end of the world.
Hi Jared,

How DNS is designed to work and how it actually works are not the same. Look up "DNS Pinning" for example. For most kinds of DR you need IP-level failover, where the IP address is rerouted to the available site.

Regards, Bill Herrin -- William D. Herrin ................ herrin@dirtside.com bill@herrin.us 3005 Crane Dr. ...................... Web: <http://bill.herrin.us/> Falls Church, VA 22042-3004
On Feb 27, 2012, at 10:28 AM, William Herrin wrote:
On Mon, Feb 27, 2012 at 9:39 AM, Jared Mauch <jared@puck.nether.net> wrote:
Is the DNS service authoritative or recursive? If auth, you can solve this a few ways, for example by giving the DNS name people point to multiple AAAA (and A) records that point at a diverse set of instances. DNS is designed to work around a host being down. The same goes for MX and several other services. While it may make the service slightly slower, it's certainly not the end of the world.
Hi Jared,
How DNS is designed to work and how it actually works are not the same. Look up "DNS Pinning" for example. For most kinds of DR you need IP-level failover, where the IP address is rerouted to the available site.
If you want a system with 0 loss and 0 delay, start building your private network. I never claimed your response would be perfect, but it will certainly work well enough to avoid major problems. Or you can pay someone to do it for you. I'm not sure what a hosted DNS solution costs, and I'm geeky and run my own DNS on beta/RC-quality software as well ;). What I do know is that my domain hasn't disappeared from the net wholesale, as the name servers are "diverse enough".

Is DNS performance important? Sure. Should everyone set their TTL to 30? No. Reaching a high percentage of the Internet doesn't require such a high SLA. Note, I didn't say reaching the top sites. While super-old, http://www.zooknic.com/Domains/counts.html says > 111m named sites in a few gTLDs. I'm sure there are better stats, but most of them don't need the same DNS infrastructure that a Google, Bing, Facebook, etc. require.

If your DNS fits on a VM in someone else's "cloud", you likely won't notice the difference. A few extra NS records will likely do the right thing and go unnoticed.

- Jared
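To make the mechanics concrete: the "work around a host being down" behavior described above is what a stock getaddrinfo() loop already gives a client - resolve the name, then try each returned address until one answers. A minimal generic sketch, not taken from any particular daemon:

#include <string.h>
#include <unistd.h>
#include <netdb.h>
#include <sys/types.h>
#include <sys/socket.h>

/* Resolve "host" and try every published A/AAAA record until one accepts
 * a connection.  Returns a connected socket, or -1 if every address fails. */
int connect_any(const char *host, const char *port)
{
    struct addrinfo hints, *res, *ai;
    int fd = -1;

    memset(&hints, 0, sizeof(hints));
    hints.ai_family = AF_UNSPEC;        /* v4 or v6, whatever the zone publishes */
    hints.ai_socktype = SOCK_STREAM;

    if (getaddrinfo(host, port, &hints, &res) != 0)
        return -1;

    for (ai = res; ai != NULL; ai = ai->ai_next) {
        fd = socket(ai->ai_family, ai->ai_socktype, ai->ai_protocol);
        if (fd < 0)
            continue;
        if (connect(fd, ai->ai_addr, ai->ai_addrlen) == 0)
            break;                      /* this server answered; use it */
        close(fd);                      /* dead or unreachable: try the next record */
        fd = -1;
    }
    freeaddrinfo(res);
    return fd;
}

A client that does this, and re-resolves on each new connection instead of caching the first answer forever, quietly skips over a dead server - which is exactly the failure mode the extra A/AAAA and NS records are there to cover.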
On Mon, Feb 27, 2012 at 12:09 PM, Jared Mauch <jared@puck.nether.net> wrote:
On Feb 27, 2012, at 10:28 AM, William Herrin wrote:
How DNS is designed to work and how it actually works is not the same. Look up "DNS Pinning" for example. For most kinds of DR you need IP level failover where the IP address is rerouted to the available site.
I never claimed your response would be perfect, but it will certainly work well enough to avoid major problems.
No, actually, it won't. In practice, most end-user applications disregard the DNS TTL.

In some cases this is because of carelessness: the application does a gethostbyname once when it starts, grabs the first IP address in the list and retains it indefinitely. The gethostbyname function doesn't even pass the TTL to the application. Ntpd is/used to be one of the notable offenders, continuing to poll the dead address for years after the server moved.

In other cases disregarding the TTL was a deliberate design decision. Web browser DNS Pinning is an example of this. All modern web browsers implement a form of DNS Pinning where they refuse to try an alternate IP address for a web server on subsequent TCP connections after making the first successful contact. This plugs a javascript security leak where a client-side application could be made to scan the interior of its user's firewall by switching the DNS back and forth between local and remote addresses. In some cases this stuck-address behavior can persist until the browser is completely closed and reopened, possibly when the PC is rebooted weeks later.

The net result is that when you switch the IP address of your server, a percentage of your users (declining over time) will be unable to access it for hours, days, weeks or even years regardless of the DNS TTL setting.

This isn't theoretical, by the way. I had to renumber a major web site once. 1-hour TTL at the beginning of the process. Three-month overlap in which both addresses were online and the DNS pointed to the new one. At the end of the three months a fraction of a percent of the *real user traffic* was _still_ coming in on the obsolete address, using the correct name in the Host: header, so the user wasn't deliberately picking the IP address.

If you want DR that *works*, reroute the IP address.

Regards, Bill Herrin -- William D. Herrin ................ herrin@dirtside.com bill@herrin.us 3005 Crane Dr. ...................... Web: <http://bill.herrin.us/> Falls Church, VA 22042-3004
On Mon, 27 Feb 2012 14:02:04 EST, William Herrin said:
The net result is that when you switch the IP address of your server, a percentage of your users (declining over time) will be unable to access it for hours, days, weeks or even years regardless of the DNS TTL setting.
Amen, brother.

So just for grins, after seeing William's post, I set up a listener on an address that had an NTP server on it many moons ago. As in, the machine was shut down around 2002/06/30 22:49 and we didn't re-assign the IP address ever since *because* it kept getting hit with NTP packets. Yes, a decade ago.

In the first 15 minutes, 234 different IPs have tried to NTP to that address. And the winner for "most confused host", which in addition to trying to NTP also did this:

14:23:24.518136 IP 74.254.73.90.68 > 128.173.14.71.123: BOOTP/DHCP, unknown (0xdb), length 48
14:23:57.395525 IP 74.254.73.90.53 > 128.173.14.71.123: 56064 [b2&3=0x6ee] [3494a] [0q] [307au] (48)
14:24:28.536351 IP 74.254.73.90.68 > 128.173.14.71.123: BOOTP/DHCP, unknown (0xdb), length 48
14:24:53.382719 IP 74.254.73.90.500 > 128.173.14.71.123: isakmp:
14:25:01.391268 IP 74.254.73.90.53 > 128.173.14.71.123: 56064 [b2&3=0x6ee] [3494a] [0q] [307au] (48)
14:25:32.522313 IP 74.254.73.90.68 > 128.173.14.71.123: BOOTP/DHCP, unknown (0xdb), length 48
14:26:05.399885 IP 74.254.73.90.53 > 128.173.14.71.123: 56064 [b2&3=0x6ee] [3494a] [0q] [307au] (48)
14:26:36.529713 IP 74.254.73.90.68 > 128.173.14.71.123: BOOTP/DHCP, unknown (0xdb), length 48
14:27:09.405922 IP 74.254.73.90.53 > 128.173.14.71.123: 56064 [b2&3=0x6ee] [3494a] [0q] [307au] (48)
14:27:40.528381 IP 74.254.73.90.68 > 128.173.14.71.123: BOOTP/DHCP, unknown (0xdb), length 48
14:28:13.393794 IP 74.254.73.90.53 > 128.173.14.71.123: 56064 [b2&3=0x6ee] [3494a] [0q] [307au] (48)
14:28:20.971269 IP 74.254.73.90.69 > 128.173.14.71.123: 48 tftp-#6914
14:28:37.907704 IP 74.254.73.90.161 > 128.173.14.71.123: [id?P/x/27]
14:28:44.525585 IP 74.254.73.90.68 > 128.173.14.71.123: BOOTP/DHCP, unknown (0xdb), length 48
14:29:17.399784 IP 74.254.73.90.53 > 128.173.14.71.123: 56064 [b2&3=0x6ee] [3494a] [0q] [307au] (48)
14:29:48.531804 IP 74.254.73.90.68 > 128.173.14.71.123: BOOTP/DHCP, unknown (0xdb), length 48
14:30:21.398360 IP 74.254.73.90.53 > 128.173.14.71.123: 56064 [b2&3=0x6ee] [3494a] [0q] [307au] (48)
14:30:52.530148 IP 74.254.73.90.68 > 128.173.14.71.123: BOOTP/DHCP, unknown (0xdb), length 48
14:31:25.403931 IP 74.254.73.90.53 > 128.173.14.71.123: 56064 [b2&3=0x6ee] [3494a] [0q] [307au] (48)
14:31:56.536594 IP 74.254.73.90.68 > 128.173.14.71.123: BOOTP/DHCP, unknown (0xdb), length 48
14:32:29.404457 IP 74.254.73.90.53 > 128.173.14.71.123: 56064 [b2&3=0x6ee] [3494a] [0q] [307au] (48)
14:33:00.534956 IP 74.254.73.90.68 > 128.173.14.71.123: BOOTP/DHCP, unknown (0xdb), length 48
14:33:33.402336 IP 74.254.73.90.53 > 128.173.14.71.123: 56064 [b2&3=0x6ee] [3494a] [0q] [307au] (48)

Somewhere in BellSouth territory, a machine desperately needs to be whacked upside the head.
On Feb 27, 2012, at 2:53 PM, Valdis.Kletnieks@vt.edu wrote:
On Mon, 27 Feb 2012 14:02:04 EST, William Herrin said:
The net result is that when you switch the IP address of your server, a percentage of your users (declining over time) will be unable to access it for hours, days, weeks or even years regardless of the DNS TTL setting.
Amen brother.
So just for grins, after seeing William's post, I set up a listener on an address that had an NTP server on it many moons ago. As in, the machine was shut down around 2002/06/30 22:49 and we didn't re-assign the IP address ever since *because* it kept getting hit with NTP packets. Yes, a decade ago.
In the first 15 minutes, 234 different IP's have tried to NTP to that address.
I hereby reject the principle that one cannot renumber a host/name and move it. Certainly some people will see breakage. This is because their software is defective, sometimes in a critical way, other times in a way that is non-obvious. But I reject the idea that you can't move a service, or that one MX, DNS, etc. host being down could be fatal, without something else being SERIOUSLY broken. If you were right, nobody could ever renumber anything, or ever take a service down, in the most absolute terms.

I've been involved in large-scale DNS server renumbering/moving/whatnot. It's harder these days than it was in the past, but it's feasible. I know those resolver addresses that have been retired still get queries from *very* broken hosts. Just because they're getting queries doesn't mean they are expecting an answer, or will handle it properly.

Sometimes you have to break the service worse for people to repair it. Look at the DCWG.org site and try to get an idea of whether you're infected. At some point those will go away. That doesn't mean those people aren't broken/infected and don't REQUIRE remediation.

- Jared
On Tue, Feb 28, 2012 at 9:02 AM, Jared Mauch <jared@puck.nether.net> wrote:
On Feb 27, 2012, at 2:53 PM, Valdis.Kletnieks@vt.edu wrote:
On Mon, 27 Feb 2012 14:02:04 EST, William Herrin said:
The net result is that when you switch the IP address of your server, a percentage of your users (declining over time) will be unable to access it for hours, days, weeks or even years regardless of the DNS TTL setting.
Amen brother.
So just for grins, after seeing William's post, I set up a listener on an address that had an NTP server on it many moons ago. As in, the machine was shut down around 2002/06/30 22:49 and we didn't re-assign the IP address ever since *because* it kept getting hit with NTP packets. Yes, a decade ago.
In the first 15 minutes, 234 different IP's have tried to NTP to that address.
I hereby reject the principle that one cannot renumber a host/name and move it. I reject the idea that you can't move a service, or that one MX, DNS, etc. host being down could be fatal, without something else being SERIOUSLY broken. If you were right, nobody could ever renumber anything, or ever take a service down, in the most absolute terms.
Something else IS seriously broken. Several something elses, actually:

1. DNS TTL at the application boundary, due in part to...
2. Pushing the name-to-layer-3-address mapping process up from layer 4 to layer 7, where each application has to (incorrectly) reinvent the process, and...
3. A layer 4 protocol which overloads the layer 3 address as an inseverable component of its transport identifier.

Even stuff like SMTP, which took care to respect the DNS TTL in its own standards, gets busted at the back end: too many antispam components rely on the source IP address, crushing large-scale servers that suddenly appear, transmitting large amounts of email from a fresh IP address.

Shockingly enough, we have a strongly functional network despite this brokenness. But it's broken all the same, and renumbering is majorly impaired as a consequence.

Renumbering in light of these issues isn't impossible. An overlap period is required in which both old and new addresses are operable. The duration of that overlap period is not defined by the protocol itself. Rather, it varies with the tolerable level of residual brokenness - literally, how many nines of users should be operating on the new address before the old address can go away.

Regards, Bill Herrin -- William D. Herrin ................ herrin@dirtside.com bill@herrin.us 3005 Crane Dr. ...................... Web: <http://bill.herrin.us/> Falls Church, VA 22042-3004
On Feb 28, 2012, at 10:22 AM, William Herrin wrote:
On Tue, Feb 28, 2012 at 9:02 AM, Jared Mauch <jared@puck.nether.net> wrote:
On Feb 27, 2012, at 2:53 PM, Valdis.Kletnieks@vt.edu wrote:
On Mon, 27 Feb 2012 14:02:04 EST, William Herrin said:
The net result is that when you switch the IP address of your server, a percentage of your users (declining over time) will be unable to access it for hours, days, weeks or even years regardless of the DNS TTL setting.
Amen brother.
So just for grins, after seeing William's post, I set up a listener on an address that had an NTP server on it many moons ago. As in, the machine was shut down around 2002/06/30 22:49 and we didn't re-assign the IP address ever since *because* it kept getting hit with NTP packets. Yes, a decade ago.
In the first 15 minutes, 234 different IP's have tried to NTP to that address.
I hereby reject the principle that one cannot renumber a host/name and move it. I reject the idea that you can't move a service, or that one MX, DNS, etc. host being down could be fatal, without something else being SERIOUSLY broken. If you were right, nobody could ever renumber anything, or ever take a service down, in the most absolute terms.
Something else IS seriously broken. Several something elses actually:
1. DNS TTL at the application boundary, due in part to...
DNS TTL shouldn't make it to the application boundary...
2. Pushing the name to layer 3 address mapping process up from layer 4 to layer 7 where each application has to (incorrectly) reinvent the process, and...
But they don't have to... They can simply use getaddrinfo()/getnameinfo() and let the OS libraries do it. The fact that some applications choose to use their own resolvers instead of system libraries is what is broken.
3. A layer 4 protocol which overloads the layer 3 address as an inseverable component of its transport identifier.
Even stuff like SMTP which took care to respect the DNS TTL in its own standards gets busted at the back end: too many antispam process components rely on the source IP address, crushing large scale servers that suddenly appear, transmitting large amounts of email from a fresh IP address.
I think this is orthogonal to DNS TTL issues.
Shockingly enough we have a strongly functional network despite this brokenness. But, it's broken all the same and renumbering is majorly impaired as a consequence.
In my experience, the biggest hurdle to renumbering has nothing to do with DNS, DNS TTLs, or respect or failure to respect them. The biggest renumbering challenges come from the number of configuration files which contain your IP addresses yet are not under your control:

VPNs (the configuration at the far side of the VPN)
Firewalls (vendors, clients, etc. that have put your IP addresses into exceptions)
Router configurations (vendors, clients, etc. that have special routing policy to reach you)
etc.

These are the things that make renumbering hard. The DNS stuff is usually fairly trivial to work around with a little time and planning.
Renumbering in light of these issues isn't impossible. An overlap period is required in which both old and new addresses are operable.
That's desirable even if you have a 5 second TTL and everyone did honor it.
The duration of that overlap period is not defined by the protocol itself. Rather, it varies with the tolerable level of residual brokenness - literally, how many nines of users should be operating on the new address before the old address can go away.
There is some truth to that. The combination of applications having their own (broken) resolver libraries and operating systems that provide even more broken resolvers (thanks, Redmond) has made this a bigger challenge than it should be. The ideal solution is to go back to using the OS resolver libraries and fix them. Best of luck actually achieving that. Owen
On Tue, 28 Feb 2012, Owen DeLong wrote:
But they don't have to... They can simply use getaddrinfo()/getnameinfo() and let the OS libraries do it. The fact that some applications choose to use their own resolvers instead of system libraries is what is broken.
Not always true - firewall software, for example, generally requires IP addresses in its rules (ipfw, pfsense, iptables, at least a few years ago), and for validly sane reasons (even some of our best kernel guys were not crazy enough to change that for ipfw). Proxy software that supports high connection rates and connection churn generally prefers to cache the IP address internally, because OS resolvers and the caches they read from just can't keep up [except in specifically well-designed systems - which proxy developers can't expect Joe Blow to know how to build]. A stress-test tool I'm working with just had to be modified for exactly that reason (and because adding more caches in front of AWS's semi-authoritative caches (due to split horizon) wouldn't solve anything; a short TTL is a short TTL is a short TTL....). Some of those proxy developers claim that within the chroot-whatchamajiggy their socket-handling code runs in, they don't have access to the resolvers - so they have to store the addresses at startup (see haproxy). -- david raistrick http://www.netmeister.org/news/learn2quote.html drais@icantclick.org http://www.expita.com/nomime.html
related to the topic: http://slashdot.org/story/12/02/29/153226/microsofts-azure-cloud-suffers-maj... -- -- ℱin del ℳensaje.
HP has built an OpenStack-based cloud. I got a beta account and things are surprisingly stable. hpcloud dot com On Wed, Feb 29, 2012 at 1:12 PM, Tei <oscar.vives@gmail.com> wrote:
related to the topic:
http://slashdot.org/story/12/02/29/153226/microsofts-azure-cloud-suffers-maj...
-- -- ℱin del ℳensaje.
Check out Firehost. Just came back from RSA 2012 and talked with them. They are a VPS provider using VMware ESX with Dell/Compellent (auto-tiered with SSD) for storage. They offer DDoS mitigation (they use Arbor) out of the box, along with a managed firewall and web application firewall. More expensive than EC2, but their high-touch features seem worthwhile. Live support is included. Rob -----Original Message----- From: Bobby Mac [mailto:bobbyjim@gmail.com] Sent: Wednesday, February 29, 2012 1:24 PM To: Tei Cc: Nanog Subject: Re: Reliable Cloud host ? HP has built an OpenStack-based cloud. I got a beta account and things are surprisingly stable. hpcloud dot com On Wed, Feb 29, 2012 at 1:12 PM, Tei <oscar.vives@gmail.com> wrote:
related to the topic:
http://slashdot.org/story/12/02/29/153226/microsofts-azure-cloud-suffers-maj...
-- -- ℱin del ℳensaje.
On Mon, 27 Feb 2012, William Herrin wrote:
In some cases this is because of carelessness: The application does a gethostbyname once when it starts, grabs the first IP address in the list and retains it indefinitely. The gethostbyname function doesn't even pass the TTL to the application. Ntpd is/used to be one of the notable offenders, continuing to poll the dead address for years after the server moved.
While, yes, it often is carelessness - it's been reported by hardcore development sorts whom I trust that there is no standardized API to obtain the TTL... What needs to get fixed is get[hostbyname,addrinfo,etc] so programmers have better tools. -- david raistrick http://www.netmeister.org/news/learn2quote.html drais@icantclick.org http://www.expita.com/nomime.html
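For completeness, the TTL is reachable if you drop below getaddrinfo()/gethostbyname() to the resolver library itself. A minimal sketch using libresolv's res_query() (link with -lresolv on glibc), purely to show where the TTL lives:

#include <stdio.h>
#include <netinet/in.h>
#include <arpa/inet.h>
#include <arpa/nameser.h>
#include <resolv.h>

/* Print each A record for a name along with its TTL -- information that
 * gethostbyname()/getaddrinfo() never hand back to the caller. */
int main(int argc, char **argv)
{
    const char *name = (argc > 1) ? argv[1] : "www.example.com";
    unsigned char answer[4096];
    char addr[INET_ADDRSTRLEN];
    ns_msg msg;
    ns_rr rr;
    int len, i;

    res_init();
    len = res_query(name, ns_c_in, ns_t_a, answer, sizeof(answer));
    if (len < 0) {
        fprintf(stderr, "res_query for %s failed\n", name);
        return 1;
    }
    if (ns_initparse(answer, len, &msg) < 0) {
        fprintf(stderr, "ns_initparse failed\n");
        return 1;
    }
    for (i = 0; i < ns_msg_count(msg, ns_s_an); i++) {
        if (ns_parserr(&msg, ns_s_an, i, &rr) < 0 || ns_rr_type(rr) != ns_t_a)
            continue;
        inet_ntop(AF_INET, ns_rr_rdata(rr), addr, sizeof(addr));
        printf("%s  A %s  ttl=%u\n", name, addr, (unsigned)ns_rr_ttl(rr));
    }
    return 0;
}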
On Mon, Feb 27, 2012 at 3:43 PM, david raistrick <drais@icantclick.org> wrote:
On Mon, 27 Feb 2012, William Herrin wrote:
In some cases this is because of carelessness: The application does a gethostbyname once when it starts, grabs the first IP address in the list and retains it indefinitely. The gethostbyname function doesn't even pass the TTL to the application. Ntpd is/used to be one of the notable offenders, continuing to poll the dead address for years after the server moved.
While yes it often is carelessness - it's been reported by hardcore development sorts that I trust that there is no standardized API to obtain the TTL... What needs to get fixed is get[hostbyname,addrinfo,etc] so programmers have better tools.
Meh. What should be fixed is that connect() should receive a name instead of an IP address. Having an application deal directly with the IP address should be the exception rather than the rule. Then, deal with the TTL issues once in the standard libraries instead of repeatedly in every single application. In theory, that'd even make the app code protocol agnostic so that it doesn't have to be rewritten yet again for IPv12. Regards, Bill Herrin -- William D. Herrin ................ herrin@dirtside.com bill@herrin.us 3005 Crane Dr. ...................... Web: <http://bill.herrin.us/> Falls Church, VA 22042-3004
On Feb 27, 2012, at 3:50 PM, William Herrin wrote:
On Mon, Feb 27, 2012 at 3:43 PM, david raistrick <drais@icantclick.org> wrote:
On Mon, 27 Feb 2012, William Herrin wrote:
In some cases this is because of carelessness: The application does a gethostbyname once when it starts, grabs the first IP address in the list and retains it indefinitely. The gethostbyname function doesn't even pass the TTL to the application. Ntpd is/used to be one of the notable offenders, continuing to poll the dead address for years after the server moved.
While yes it often is carelessness - it's been reported by hardcore development sorts that I trust that there is no standardized API to obtain the TTL... What needs to get fixed is get[hostbyname,addrinfo,etc] so programmers have better tools.
Meh. What should be fixed is that connect() should receive a name instead of an IP address. Having an application deal directly with the IP address should be the exception rather than the rule. Then, deal with the TTL issues once in the standard libraries instead of repeatedly in every single application.
In theory, that'd even make the app code protocol agnostic so that it doesn't have to be rewritten yet again for IPv12.
While I agree with the principle of what you are trying to say, I would argue that it should be dealt with in getnameinfo() / getaddrinfo() and not connect(). It is perfectly reasonable for connect() to deal with an address structure. If people are not using getnameinfo()/getaddrinfo() from the standard libraries, then, I don't see any reason to believe that they would use connect() from the standard libraries if it incorporated their functionality. Owen
On Mon, Feb 27, 2012 at 7:07 PM, Owen DeLong <owen@delong.com> wrote:
On Feb 27, 2012, at 3:50 PM, William Herrin wrote:
Meh. What should be fixed is that connect() should receive a name instead of an IP address. Having an application deal directly with the IP address should be the exception rather than the rule. Then, deal with the TTL issues once in the standard libraries instead of repeatedly in every single application.
In theory, that'd even make the app code protocol agnostic so that it doesn't have to be rewritten yet again for IPv12.
While I agree with the principle of what you are trying to say, I would argue that it should be dealt with in getnameinfo() / getaddrinfo() and not connect().
It is perfectly reasonable for connect() to deal with an address structure.
Yes, well, that's why we're still using a layer 4 protocol (TCP) that can't dynamically rebind to the protocol level below (IP). God help us when folks start overriding the ethernet MAC address to force machines to keep the same IPv6 address that's been hardcoded somewhere or is otherwise too much trouble to change. Regards, Bill Herrin -- William D. Herrin ................ herrin@dirtside.com bill@herrin.us 3005 Crane Dr. ...................... Web: <http://bill.herrin.us/> Falls Church, VA 22042-3004
On Mon, Feb 27, 2012 at 4:59 PM, William Herrin <bill@herrin.us> wrote:
.... Yes, well, that's why we're still using a layer 4 protocol (TCP) that can't dynamically rebind to the protocol level below (IP).
This is somewhat irritating, but on the scale of 0 (all is well) to 10 (you want me to do WHAT with DHCPv6???) this is about a 2. The application can re-connect from the TCP layer if something wiggy happens to the layer below. This is an application layer solution, is well established, and works fine. One just has to notice something's amiss and retry connection rather than abort the application.
God help us when folks start overriding the ethernet MAC address to force machines to keep the same IPv6 address that's been hardcoded somewhere or is otherwise too much trouble to change.
It could be worse. Back in the day I worked for a company that did one of the earlier two-on-motherboard ethernet chip servers. The Boot PROM (from another vendor) had no clue about multiple ethernet interfaces. It came up with both interfaces set to the same NVRAM-set MAC. We wanted to fix it in firmware but kept having issues with that. I had to get an init script to rotate the MAC for the second interface up one, and ensure that it was in the OS and run before the interfaces got plumbed, get it bundled into the OS distribution, and ensure that factory MACs were only set to even numbers to start with. One of these steps ultimately failed rather spectacularly. -- -george william herbert george.herbert@gmail.com
On Feb 27, 2012, at 19:10, Owen DeLong <owen@delong.com> wrote:
On Feb 27, 2012, at 3:50 PM, William Herrin wrote:
On Mon, Feb 27, 2012 at 3:43 PM, david raistrick <drais@icantclick.org> wrote:
On Mon, 27 Feb 2012, William Herrin wrote:
In some cases this is because of carelessness: The application does a gethostbyname once when it starts, grabs the first IP address in the list and retains it indefinitely. The gethostbyname function doesn't even pass the TTL to the application. Ntpd is/used to be one of the notable offenders, continuing to poll the dead address for years after the server moved.
While yes it often is carelessness - it's been reported by hardcore development sorts that I trust that there is no standardized API to obtain the TTL... What needs to get fixed is get[hostbyname,addrinfo,etc] so programmers have better tools.
Meh. What should be fixed is that connect() should receive a name instead of an IP address. Having an application deal directly with the IP address should be the exception rather than the rule. Then, deal with the TTL issues once in the standard libraries instead of repeatedly in every single application.
In theory, that'd even make the app code protocol agnostic so that it doesn't have to be rewritten yet again for IPv12.
While I agree with the principle of what you are trying to say, I would argue that it should be dealt with in getnameinfo() / getaddrinfo() and not connect().
It is perfectly reasonable for connect() to deal with an address structure.
If people are not using getnameinfo()/getaddrinfo() from the standard libraries, then, I don't see any reason to believe that they would use connect() from the standard libraries if it incorporated their functionality.
gai/gni do not return TTL values on any platform I'm aware of; the only way to get the TTL currently is to use a non-standard resolver (e.g. lwres). The issue is application developers not calling gai every time they connect (due to the aforementioned security concerns, at least in the browser realm), instead opting to hold onto the originally resolved address for unreasonable amounts of time. Modifying gai to provide the TTL has been proposed in the past (dnsop '04) but afaik was shot down to prevent inconsistencies in the API. Maybe when happy eyeballs stabilizes, someone will propose an API for inclusion in the standard library that implements HE-style connections. Looks like there was already some talk on v6ops headed this way, but as always there's resistance to standardizing it. ~Matt
getaddrinfo was designed to be extensible, as was struct addrinfo. Part of the problem with TTL is that not all data sources used by getaddrinfo have TTL information. Additionally, for many uses you want to reconnect to the same server rather than the same name. Note there is nothing to prevent a getaddrinfo implementation from maintaining its own cache, though if I were implementing such a cache I would have a flag to force a refresh. -- Mark Andrews, ISC 1 Seymour St., Dundas Valley, NSW 2117, Australia PHONE: +61 2 9871 4742 INTERNET: marka@isc.org
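As an illustration of that last point only: a single-slot cache wrapped around getaddrinfo(), with a caller-supplied refresh flag and an arbitrary fixed expiry standing in for the TTL the API cannot supply. The names and the 60-second figure are made up for the sketch:

#include <string.h>
#include <time.h>
#include <netdb.h>

/* A one-slot cache in front of getaddrinfo().  "refresh" forces a new
 * lookup; otherwise a cached answer is reused until CACHE_SECS elapse
 * (a stand-in for the TTL the API cannot provide).  The cache owns the
 * addrinfo list -- callers must not freeaddrinfo() it.  A real version
 * would key on (host, port) and be thread-safe. */
#define CACHE_SECS 60

static char             cached_host[256];
static struct addrinfo *cached_res;
static time_t           cached_at;

int cached_getaddrinfo(const char *host, const char *port,
                       const struct addrinfo *hints,
                       struct addrinfo **res, int refresh)
{
    time_t now = time(NULL);
    int rc;

    if (!refresh && cached_res != NULL &&
        strcmp(host, cached_host) == 0 &&
        now - cached_at < CACHE_SECS) {
        *res = cached_res;              /* still fresh enough: reuse it */
        return 0;
    }

    if (cached_res != NULL) {
        freeaddrinfo(cached_res);       /* drop the stale answer */
        cached_res = NULL;
    }

    rc = getaddrinfo(host, port, hints, res);
    if (rc == 0) {
        strncpy(cached_host, host, sizeof(cached_host) - 1);
        cached_host[sizeof(cached_host) - 1] = '\0';
        cached_res = *res;
        cached_at  = now;
    }
    return rc;
}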
On Feb 27, 2012, at 9:45 PM, Mark Andrews wrote:
getaddrinfo was designed to be extensible, as was struct addrinfo. Part of the problem with TTL is that not all data sources used by getaddrinfo have TTL information. Additionally, for many uses you want to reconnect to the same server rather than the same name. Note there is nothing to prevent a getaddrinfo implementation from maintaining its own cache, though if I were implementing such a cache I would have a flag to force a refresh.
Sorry if I wasn't clear... My point to Bill was that we should be using calls that don't have TTL information (GAI/GNI in their default forms). That we don't need to abuse connect() to achieve that. That if people use GAI/GNI(), then, any brokenness is system-wide brokenness in the system's resolver library and should be addressed there. Owen
On Tue, Feb 28, 2012 at 12:45 AM, Mark Andrews <marka@isc.org> wrote:
getaddrinfo was designed to be extensible, as was struct addrinfo. Part of the problem with TTL is that not [all] data sources used by getaddrinfo have TTL information.
Hi Mark,

By the time getaddrinfo replaced gethostbyname, NIS and similar systems were on their way out. It was reasonably well understood that many if not most of the calls would return information gained from the DNS. Depending on how you look at it, choosing not to propagate TTL knowledge was either a belligerent choice to continue disrespecting the DNS Time To Live or a fatalistic acceptance that the DNS TTL isn't and would not become functional at the application level.

Still works fine deeper in the query system, timing out which server holds the records, though.
Additionally for many uses you want to reconnect to the same server rather than the same name.
The SRV record was designed to solve that whole class of problems without damaging the operation of the TTL. No one uses it.

It's all really very unfortunate. The recipe for SOHO multihoming, the end of routing table bloat, and IP roaming without pivoting off a home base all boils down to two technologies: (1) a layer 4 protocol that can dynamically rebind to the layer 3 IP address, the same way IP uses ARP to rebind to a changing ethernet MAC, and (2) a DNS TTL that actually works, so that the DNS supports finding a connection's current IP address.

Regards, Bill Herrin -- William D. Herrin ................ herrin@dirtside.com bill@herrin.us 3005 Crane Dr. ...................... Web: <http://bill.herrin.us/> Falls Church, VA 22042-3004
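For readers who have never touched it: an SRV RRset carries a priority, weight, port, and target for each server, and clients are supposed to prefer the lowest priority and weight-balance within it (RFC 2782). A minimal sketch of just that selection step, over records hardcoded purely for illustration:

#include <stdio.h>
#include <stdlib.h>

/* One parsed SRV record: _service._proto.example.com. IN SRV prio weight port target */
struct srv_rr {
    int priority;          /* lower is preferred */
    int weight;            /* relative share within the same priority */
    int port;
    const char *target;
};

/* Pick a record: lowest priority wins, weighted-random choice within that
 * priority.  (RFC 2782 spells out a slightly more elaborate procedure;
 * this is the gist of it.) */
const struct srv_rr *pick_srv(const struct srv_rr *rrs, int n)
{
    int best = -1, total = 0, roll, i;

    for (i = 0; i < n; i++)
        if (best < 0 || rrs[i].priority < best)
            best = rrs[i].priority;

    for (i = 0; i < n; i++)
        if (rrs[i].priority == best)
            total += rrs[i].weight;

    roll = (total > 0) ? rand() % total : 0;
    for (i = 0; i < n; i++) {
        if (rrs[i].priority != best)
            continue;
        if (total == 0 || roll < rrs[i].weight)
            return &rrs[i];
        roll -= rrs[i].weight;
    }
    return NULL;           /* unreachable with a non-empty record set */
}

int main(void)
{
    /* Illustrative data only -- in real use these come from an SRV query. */
    struct srv_rr rrs[] = {
        { 10, 60, 5060, "sip1.example.com."       },
        { 10, 40, 5060, "sip2.example.com."       },
        { 20,  0, 5060, "sip-backup.example.com." },
    };
    const struct srv_rr *s = pick_srv(rrs, 3);
    if (s != NULL)
        printf("connect to %s port %d\n", s->target, s->port);
    return 0;
}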
In message <CAP-guGV09HF7in+vZbKpGk0RR1Q4gpMMo5jQREUZVEj+ewzmkg@mail.gmail.com>, William Herrin writes:
On Tue, Feb 28, 2012 at 12:45 AM, Mark Andrews <marka@isc.org> wrote:
getaddrinfo was designed to be extensible, as was struct addrinfo. Part of the problem with TTL is that not [all] data sources used by getaddrinfo have TTL information.
Hi Mark,
By the time getaddrinfo replaced gethostbyname, NIS and similar systems were on their way out. It was reasonably well understood that many if not most of the calls would return information gained from the DNS. Depending on how you look at it, choosing not to propagate TTL knowledge was either a belligerent choice to continue disrespecting the DNS Time To Live or it was fatalistic acceptance that the DNS TTL isn't and would not become functional at the application level.
No. Propagating the TTL is still an issue, especially when you do not always have one. You can't just wave the problem away. As for DNS TTLs, addresses are about the only thing which has multiple sources. You also don't have to use getaddrinfo. It really is designed to be the first step in connecting to a host. If you need to reconnect, you call it again.
Still works fine deeper in the query system, timing out which server holds the records though.
Additionally for many uses you want to reconnect to the same server rather than the same name.
The SRV record was designed to solve that whole class of problems without damaging the operation of the TTL. No one uses it.
You don't need to know the TTL to use SRV.
It's all really very unfortunate. The recipe for SOHO multihoming, the end of routing table bloat and IP roaming without pivoting off a home base all boils down to two technologies: (1) a layer 4 protocol that can dynamically rebind to the layer 3 IP address the same way IP uses ARP to rebind to a changing ethernet MAC and (2) a DNS TTL that actually works so that the DNS supports finding a connection's current IP address.
DNS TTL works. Applications that don't honour it aren't an indication that it doesn't work.
Regards, Bill Herrin
-- William D. Herrin ................ herrin@dirtside.com bill@herrin.us 3005 Crane Dr. ...................... Web: <http://bill.herrin.us/> Falls Church, VA 22042-3004 -- Mark Andrews, ISC 1 Seymour St., Dundas Valley, NSW 2117, Australia PHONE: +61 2 9871 4742 INTERNET: marka@isc.org
On Tue, Feb 28, 2012 at 4:06 PM, Mark Andrews <marka@isc.org> wrote:
DNS TTL works. Applications that don't honour it aren't an indication that it doesn't work.
Mark, If three people died and the building burned down then the sprinkler system didn't work. It may have sprayed water, but it didn't *work*. Regards, Bill Herrin -- William D. Herrin ................ herrin@dirtside.com bill@herrin.us 3005 Crane Dr. ...................... Web: <http://bill.herrin.us/> Falls Church, VA 22042-3004
In message <CAP-guGXK3WQGPLpmnVsnM0xnnU8==4zONK=UWTLkYWuduA6T9Q@mail.gmail.com>, William Herrin writes:
On Tue, Feb 28, 2012 at 4:06 PM, Mark Andrews <marka@isc.org> wrote:
DNS TTL works. Applications that don't honour it aren't an indication that it doesn't work.
Mark,
If three people died and the building burned down then the sprinkler system didn't work. It may have sprayed water, but it didn't *work*.
Not enough evidence to say if it worked or not. Sprinkler systems are designed to handle particular classes of fire, not every fire. A 0 TTL means use this information for this transaction. We don't tear down TCP sessions on DNS TTL going to zero. If one really wants to deprecate addresses, we need something a lot more complicated than A and AAAA records in the DNS. We need stuff like "use this address for new transactions", "this address is going away soon, don't use it unless no other works". One also has to use multiple addresses at the same time. Mark -- Mark Andrews, ISC 1 Seymour St., Dundas Valley, NSW 2117, Australia PHONE: +61 2 9871 4742 INTERNET: marka@isc.org
In message <CAP-guGXK3WQGPLpmnVsnM0xnnU8==4zONK=UWTLkYWuduA6T9Q@mail.gmail.com>, William Herrin writes:
On Tue, Feb 28, 2012 at 4:06 PM, Mark Andrews <marka@isc.org> wrote:
DNS TTL works. Applications that don't honour it aren't an indication that it doesn't work.
Mark,
If three people died and the building burned down then the sprinkler system didn't work. It may have sprayed water, but it didn't *work*.
Not enough evidence to say if it worked or not. Sprinkler systems are designed to handle particular classes of fire, not every fire.
It is also worth noting that many fire systems are not intended to put out the fire, but to provide warning and then provide an extended window for people to exit the affected building through use of sprinklers and other measures to slow the spread of the fire. As you suggest, most sprinkler systems are not actually designed to be able to completely extinguish fires - but that even applies to fires they are intended to be able to "handle". This is a common misunderstanding of the technology.
A 0 TTL means use this information for this transaction. We don't tear down TCP sessions on DNS TTL going to zero.
If one really wants to deprecate addresses, we need something a lot more complicated than A and AAAA records in the DNS. We need stuff like "use this address for new transactions", "this address is going away soon, don't use it unless no other works". One also has to use multiple addresses at the same time.
This has always been a weakness of the technology and documentation. The common usage scenario of static hosts and merely needing to be able to resolve a hostname to reach the traditional example of a "departmental server" or something like that is what most code and code examples are intended to tackle; very little of what developers are actually given (in real practical terms) even hints at needing to consider aspects such as TTL or periodically refreshing host->ip mappings, and most of the documentation I've seen fails to discuss the implications of overloading things like TTL for deliberate load-balancing or geo purposes. Shocking it's poorly understood by developers who just want their poor little program to connect over the Internet. It's funny how these two technologies are both often misunderstood. I would not have thought of comparing DNS to fire suppression. :-) ... JG -- Joe Greco - sol.net Network Services - Milwaukee, WI - http://www.sol.net "We call it the 'one bite at the apple' rule. Give me one chance [and] then I won't contact you again." - Direct Marketing Ass'n position on e-mail spam(CNN) With 24 million small businesses in the US alone, that's way too many apples.
On Wed, Feb 29, 2012 at 7:57 AM, Joe Greco <jgreco@ns.sol.net> wrote:
In message <CAP-guGXK3WQGPLpmnVsnM0xnnU8==4zONK=UWTLkYWuduA6T9Q@mail.gmail.com>, William Herrin writes:
On Tue, Feb 28, 2012 at 4:06 PM, Mark Andrews <marka@isc.org> wrote:
DNS TTL works. Applications that don't honour it aren't an indication that it doesn't work.
Mark,
If three people died and the building burned down then the sprinkler system didn't work. It may have sprayed water, but it didn't *work*.
Not enough evidence to say if it worked or not. Sprinkler systems are designed to handle particular classes of fire, not every fire.
It is also worth noting that many fire systems are not intended to put out the fire, but to provide warning and then provide an extended window for people to exit the affected building through use of sprinklers and other measures to slow the spread of the fire.
Hi Joe, The sprinkler system is designed to delay the fire long enough for everyone to safely escape. As a secondary objective, it reduces the fire damage that occurs while waiting for firefighters to arrive and extinguish the fire. If "three people died" then the system failed. Perhaps the design was inadequate. Perhaps some age-related issue prevented the sprinkler heads from melting. Perhaps someone stacked boxes to the ceiling and it blocked the water. Perhaps the water was shut off and nobody knew it. Perhaps an initial explosion damaged the sprinkler system so it could no longer work effectively. Whatever the exact details, that sprinkler system failed. Whoever you want to blame, DNS TTL dysfunction at the application level is the same way. It's a failed system. With the TTL on an A record set to 60 seconds, you can't change the address attached to the A record and expect that 60 seconds later no one will continue to connect to the old address. Nor 600 seconds later nor 6000 seconds later. The "system" for renumbering a service of which the TTL setting is a part consistently fails to reliably function in that manner. Regards, Bill Herrin -- William D. Herrin ................ herrin@dirtside.com bill@herrin.us 3005 Crane Dr. ...................... Web: <http://bill.herrin.us/> Falls Church, VA 22042-3004
On Feb 29, 2012, at 6:18 AM, William Herrin wrote:
On Wed, Feb 29, 2012 at 7:57 AM, Joe Greco <jgreco@ns.sol.net> wrote:
In message <CAP-guGXK3WQGPLpmnVsnM0xnnU8==4zONK=UWTLkYWuduA6T9Q@mail.gmail.com>, William Herrin writes:
On Tue, Feb 28, 2012 at 4:06 PM, Mark Andrews <marka@isc.org> wrote:
DNS TTL works. Applications that don't honour it aren't an indication that it doesn't work.
Mark,
If three people died and the building burned down then the sprinkler system didn't work. It may have sprayed water, but it didn't *work*.
Not enough evidence to say if it worked or not. Sprinkler systems are designed to handle particular classes of fire, not every fire.
It is also worth noting that many fire systems are not intended to put out the fire, but to provide warning and then provide an extended window for people to exit the affected building through use of sprinklers and other measures to slow the spread of the fire.
Hi Joe,
The sprinkler system is designed to delay the fire long enough for everyone to safely escape. As a secondary objective, it reduces the fire damage that occurs while waiting for firefighters to arrive and extinguish the fire. If "three people died" then the system failed. Perhaps the design was inadequate. Perhaps some age-related issue prevented the sprinkler heads from melting. Perhaps someone stacked boxes to the ceiling and it blocked the water. Perhaps the water was shut off and nobody knew it. Perhaps an initial explosion damaged the sprinkler system so it could no longer work effectively. Whatever the exact details, that sprinkler system failed.
Bill, you are blaming the sprinkler system for what could, in fact, be not a failure of the sprinkler system, but, of the 3 humans. If they were too intoxicated or stoned to react, for example, the sprinkler system is not to blame. If they were overcome by smoke before the sprinklers went off, that may be a failure of the smoke detectors, but, it is not a failure of the sprinklers. If they were killed or rendered unconscious and/or unresponsive in the preceding explosion you mentioned and did not die in the subsequent fire, then, that is not a failure in the sprinkler system.
Whoever you want to blame, DNS TTL dysfunction at the application level is the same way. It's a failed system. With the TTL on an A record set to 60 seconds, you can't change the address attached to the A record and expect that 60 seconds later no one will continue to connect to the old address. Nor 600 seconds later nor 6000 seconds later. The "system" for renumbering a service of which the TTL setting is a part consistently fails to reliably function in that manner.
Yes, the assumption by developers that gai/gni is a fire-and-forget mechanism and that the data received is static is a failure. It is not a failure of DNS TTL. It is a failure of the application developers that code that way. Further analysis of the underlying causes of that failure to properly understand name resolution technology and the environment in which it operates is left as an exercise for the reader. The fact that people playing interesting games with DNS TTLs don't necessarily understand or well document the situation to raise awareness among application developers could also be argued to be a failure on the part of those people. It is not, in either case, a failure of the technology. One should always call gni/gai in close temporal (and ideally close in the code as well) proximity to calling connect(). Obviously one should call these resolver functions prior to calling connect(). Most example code is designed for short-lived non-recovering flows, so, it's designed along the lines of resolve->(iterate through results calling connect() for each result until connect() succeeds)->process->close->exit. Examples for persistent connections and/or connections that recover or re-establish after a failure and/or browsers that stay running for a long time and connect to the same system again significantly later are few and far between. As a result, most code doing that ends up being poorly written. Further, DNS performance issues in the past have led developers of such applications to "take matters into their own hands" to try and improve the performance/behavior of their application in spite of DNS. This is one of the things that led to many of the TTL ignorant application-level DNS caches which you are complaining about. Again, not a failure of DNS technology, but, of the operators of that technology and the developers that tried to compensate for those failures. They introduced a cure that is often worse than the disease. Owen
Regards, Bill Herrin
-- William D. Herrin ................ herrin@dirtside.com bill@herrin.us 3005 Crane Dr. ...................... Web: <http://bill.herrin.us/> Falls Church, VA 22042-3004
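A minimal sketch of the discipline Owen describes: re-resolve immediately before every connect or reconnect, and never let the results outlive the attempt. The helper name and the bare-bones error handling are illustrative assumptions, not code from this thread:

#include <netdb.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

/* Hypothetical helper: resolve the name fresh and try each address in
 * turn.  Because it is called again on every reconnect, the resolver
 * (not the application) decides how long a name->address mapping lives. */
static int
dial(const char *host, const char *service)
{
    struct addrinfo hints, *res, *r;
    int fd = -1;

    memset(&hints, 0, sizeof(hints));
    hints.ai_family = AF_UNSPEC;
    hints.ai_socktype = SOCK_STREAM;

    if (getaddrinfo(host, service, &hints, &res) != 0)
        return (-1);

    for (r = res; r != NULL; r = r->ai_next) {
        fd = socket(r->ai_family, r->ai_socktype, r->ai_protocol);
        if (fd == -1)
            continue;
        if (connect(fd, r->ai_addr, r->ai_addrlen) == 0)
            break;                  /* connected */
        close(fd);
        fd = -1;
    }

    freeaddrinfo(res);              /* nothing from the lookup survives the call */
    return (fd);
}

The point is only that the addrinfo results never outlive the connect attempt; if the peer renumbers, the next reconnect picks up the new address without the application ever having seen a TTL.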
On 02/29/12 10:01, Owen DeLong wrote:
Further, DNS performance issues in the past have led developers of such applications to "take matters into their own hands" to try and improve the performance/behavior of their application in spite of DNS. This is one of the things that led to many of the TTL ignorant application-level DNS caches which you are complaining about.
I have found some carriers to run hacked nameservers. Several years ago I was moving a website and found that Cox was overriding the TTL for all "www" names. At least for their residential customers in Oklahoma. The TTL value our test subject was getting was larger than it had ever been set. -- Mr. Flibble King of the Potato People
On 2/29/2012 1:38 PM, Robert Hajime Lanning wrote:
On 02/29/12 10:01, Owen DeLong wrote:
Further, DNS performance issues in the past have led developers of such applications to "take matters into their own hands" to try and improve the performance/behavior of their application in spite of DNS. This is one of the things that led to many of the TTL ignorant application-level DNS caches which you are complaining about.
I have found some carriers to run hacked nameservers. Several years ago I was moving a website and found that Cox was overriding the TTL for all "www" names. At least for their residential customers in Oklahoma. The TTL value our test subject was getting was larger than it had ever been set.
Back in the day, the uu.net cache servers were set for 24 hours (can't remember if they claimed it was a performance issue or some other justification). Several other large ISPs of the day also did this, so you typically got the "allow 24 hours for full propagation of DNS changes ..." response when updating external DNS entries. Nominal best practice is to expect that and to run the service on old and new IPs for at least 24 hours, then start doing redirection (where possible by protocol) or stop servicing the protocols on the old IP. I'm sure other providers are doing the same to slow down fast flux entries being used for spam site hosting today. -- --- James M Keller
On Wed, Feb 29, 2012 at 7:57 AM, Joe Greco <jgreco@ns.sol.net> wrote:
In message <CAP-guGXK3WQGPLpmnVsnM0xnnU8==4zONK=UWTLkYWuduA6T9Q@mail.gmail.com>, William Herrin writes:
On Tue, Feb 28, 2012 at 4:06 PM, Mark Andrews <marka@isc.org> wrote:
DNS TTL works. Applications that don't honour it aren't an indication that it doesn't work.
Mark,
If three people died and the building burned down then the sprinkler system didn't work. It may have sprayed water, but it didn't *work*.
Not enough evidence to say if it worked or not. Sprinkler systems are designed to handle particular classes of fire, not every fire.
It is also worth noting that many fire systems are not intended to put out the fire, but to provide warning and then provide an extended window for people to exit the affected building through use of sprinklers and other measures to slow the spread of the fire.
Hi Joe,
The sprinkler system is designed to delay the fire long enough for everyone to safely escape.
Hi Bill, No, the sprinkler system is *intended* to delay the fire long enough for everyone to safely escape; however, in order to accomplish this, the designer chooses from some reasonable options to meet certain goals that are commonly accepted to allow that. For example, the suppression design applied to a multistory dwelling where people live, cook, and sleep is typically different from the single-story light office space. Neither design will be effective against all possible types of fire.
As a secondary objective, it reduces the fire damage that occurs while waiting for firefighters to arrive and extinguish the fire. If "three people died" then the system failed.
That's silly. The system fails if the system *fails* or doesn't behave as designed. No system is capable of guaranteeing survival. Just yesterday, here in Milwaukee, we had a child killed at a railroad crossing. The crossing was well-marked, with signals and gates. Visibility of approaching trains for close to a mile in either direction. The crew on the train saw him crossing, blew their horn, laid on the emergency brakes. CP Rail inspected the gates and signals for any possible faults, but eyewitness accounts were that the gates and signals were working, and the train made every effort to make itself known. The 11 year old kid had his hood up and earbuds in, and apparently didn't see the signals or look up and down the track before crossing, and for whatever reason, didn't hear the train horn blaring at him. At a certain point, you just can't protect against every possible bad thing that can happen. I have a hard time seeing this as a failure of the railroad's fully functional railroad crossing and related safety mechanisms. The system doesn't guarantee survival.
Whoever you want to blame, DNS TTL dysfunction at the application level is the same way. It's a failed system. With the TTL on an A record set to 60 seconds, you can't change the address attached to the A record and expect that 60 seconds later no one will continue to connect to the old address. Nor 600 seconds later nor 6000 seconds later. The "system" for renumbering a service of which the TTL setting is a part consistently fails to reliably function in that manner.
It's a failure because people don't understand the intent of the system, and it is pretty safe to argue that it is a multifaceted failure, due to failures by client implementations, server implementations, sample code, attempts to use the system for things it wasn't meant for, etc. This is by no means limited to TTL; we've screwed up multiple addresses, IPv6 handling, negative caching, um, do I need to go on...? In the specific case of TTL, the problem is made much worse due to the way most client code has hidden this data from developers, so that many developers don't even have any idea that such a thing exists. I'm not sure how to see that as a design failure of the TTL mechanism. I don't see developers ignoring DNS and hardcoding IP addresses into code as a failure of the DNS system. I see both as naive implementation errors. The difference with TTL is that the implementation errors are so widespread as to render any sane implementation relatively useless. ... JG -- Joe Greco - sol.net Network Services - Milwaukee, WI - http://www.sol.net "We call it the 'one bite at the apple' rule. Give me one chance [and] then I won't contact you again." - Direct Marketing Ass'n position on e-mail spam(CNN) With 24 million small businesses in the US alone, that's way too many apples.
On Wed, Feb 29, 2012 at 4:02 PM, Joe Greco <jgreco@ns.sol.net> wrote:
In the specific case of TTL, the problem is made much worse due to the way most client code has hidden this data from developers, so that many developers don't even have any idea that such a thing exists.
I'm not sure how to see that as a design failure of the TTL mechanism.
Hi Joe, You shouldn't see that as a design failure of the TTL mechanism. It isn't. It's a failure of the system of which DNS TTL is a component. The TTL component itself was reasonably designed. The failure can be likened to installing a well-designed sprinkler system (the DNS with a TTL) and then shutting off the water valve (gethostbyname/getaddrinfo).
I don't see developers ignoring DNS and hardcoding IP addresses into code as a failure of the DNS system.
It isn't. It's a failure of the sockets API design which calls on every application developer to (a) translate the name to a set of addresses with a mechanism that discards the TTL knowledge and (b) implement his own glue between name to address mapping and connect by address. It would be like telling an app developer: here's the ARP function and the SEND function. When you Send to an IP address, make sure you attach the right destination MAC. Of course the app developer gets it wrong most of the time. Regards, Bill Herrin -- William D. Herrin ................ herrin@dirtside.com bill@herrin.us 3005 Crane Dr. ...................... Web: <http://bill.herrin.us/> Falls Church, VA 22042-3004
On Wed, Feb 29, 2012 at 4:02 PM, Joe Greco <jgreco@ns.sol.net> wrote:
In the specific case of TTL, the problem is made much worse due to the way most client code has hidden this data from developers, so that many developers don't even have any idea that such a thing exists.
I'm not sure how to see that as a design failure of the TTL mechanism.
Hi Joe,
You shouldn't see that as a design failure of the TTL mechanism. It isn't. It's a failure of the system of which DNS TTL is a component. The TTL component itself was reasonably designed.
Think that's pretty much what I said.
The failure can be likened to installing a well-designed sprinkler system (the DNS with a TTL) and then shutting off the water valve (gethostbyname/getaddrinfo).
No, the water still works as intended. I think your analogy starts to fail here. It's more like expecting a water suppression system to put out a grease fire. The TTL mechanism is completely suitable for what it was originally meant for, and in an environment where everyone has followed the rules, it works fine. If you take a light office space with sprinklers and remodel it into a short order grill, the fire inspector will require you to rework the fire suppression system to an appropriate system. Problem is, TTL is a relatively light-duty system that people have felt free to ignore, overload for other purposes, etc., but there's no fire inspector to come around and tell people how and why what they've done is broken. In the case of TTL, the system is even largely hidden from users, so that it is rarely thought about except now and then on NANOG, dns-operations, etc. ;-) No wonder it is even poorly understood.
I don't see developers ignoring DNS and hardcoding IP addresses into code as a failure of the DNS system.
It isn't. It's a failure of the sockets API design which calls on every application developer to (a) translate the name to a set of addresses with a mechanism that discards the TTL knowledge and (b) implement his own glue between name to address mapping and connect by address.
It would be like telling an app developer: here's the ARP function and the SEND function. When you Send to an IP address, make sure you attach the right destination MAC. Of course the app developer gets it wrong most of the time.
That's correct - and it doesn't imply that the system that was engineered is faulty. In all likelihood, the fault lies with what the app developer was told. You originally said: "If three people died and the building burned down then the sprinkler system didn't work. It may have sprayed water, but it didn't *work*." That's not true. If it sprayed water in the manner it was designed to, then it worked. If three people took sleeping pills and didn't wake up when the alarms blared, and an arsonist poured ten gallons of gas everywhere before lighting the fire, the system still worked. It failed to save those lives or protect the building from burning down, but I am aware of no fire suppression system that realistically attempts to address that. It is an unreasonable expectation. I have a hard time seeing the many self-inflicted wounds of people who have attempted to abuse TTL for various purposes as a failure of the TTL design. The design is reasonable. ... JG -- Joe Greco - sol.net Network Services - Milwaukee, WI - http://www.sol.net "We call it the 'one bite at the apple' rule. Give me one chance [and] then I won't contact you again." - Direct Marketing Ass'n position on e-mail spam(CNN) With 24 million small businesses in the US alone, that's way too many apples.
On Thu, Mar 1, 2012 at 8:25 AM, Joe Greco <jgreco@ns.sol.net> wrote:
"If three people died and the building burned down then the sprinkler system didn't work. It may have sprayed water, but it didn't *work*."
That's not true. If it sprayed water in the manner it was designed to, then it worked.
That's like the old crack about ICBM interceptors. Why yes, our system performed swimmingly in the latest test achieving nine out of the ten criteria for success. Which criteria didn't it achieve? It missed the target. Regards, Bill Herrin -- William D. Herrin ................ herrin@dirtside.com bill@herrin.us 3005 Crane Dr. ...................... Web: <http://bill.herrin.us/> Falls Church, VA 22042-3004
On Thu, Mar 1, 2012 at 8:25 AM, Joe Greco <jgreco@ns.sol.net> wrote:
"If three people died and the building burned down then the sprinkler system didn't work. It may have sprayed water, but it didn't *work*."
That's not true. If it sprayed water in the manner it was designed to, then it worked.
That's like the old crack about ICBM interceptors. Why yes, our system performed swimmingly in the latest test achieving nine out of the ten criteria for success. Which criteria didn't it achieve? It missed the target.
Difference: the fire suppression system worked as designed, the ICBM didn't. That's kind of the whole point here. If you have something like an automobile that's designed to protect you against certain kinds of accidents, it isn't a failure if it does not protect you against an accident that is not reasonably within the protection envelope. For example, cars these days are designed to protect against many different types of impacts and provide survivability. It is a failure if my car is designed to protect against a head-on crash at 30MPH by use of engineered crumple zones and deploying air bags, and I get into such an accident and am killed regardless. However, if I fly my car into a bridge abutment at 150MPH and am instantly pulverized, I am not prepared to consider that a failure of the car. Likewise, if a freeway overpass slab falls on my car and crushes me as I drive underneath it, I am not going to consider that a failure of the car. There's a definite distinction between a system that fails when it is deployed and used in the intended manner, and a system that doesn't work as you'd like it to when it is used in some incorrect manner, which is really not a failure as the word is normally used. ... JG -- Joe Greco - sol.net Network Services - Milwaukee, WI - http://www.sol.net "We call it the 'one bite at the apple' rule. Give me one chance [and] then I won't contact you again." - Direct Marketing Ass'n position on e-mail spam(CNN) With 24 million small businesses in the US alone, that's way too many apples.
On Mon, Feb 27, 2012 at 10:57 PM, Matt Addison <matt.addison@lists.evilgeni.us> wrote:
gai/gni do not return TTL values on any platforms I'm aware of, the only way to get TTL currently is to use a non standard resolver (e.g. lwres). The issue is application developers not calling gai every time
GAI/GNI do not return TTL values, but this should not be a problem. If they were to return anything, it should not be a TTL, but a time() value, after which the result may no longer be used. One way to achieve that would be for GAI to return an opaque structure that contained the IP and such a value, in a manner consumable by the sockets API, and adjust connect() to return an error if passed a structure containing a ' returned time + TTL' in the past. TTL values are a DNS resolver function; the application consuming the sockets API should not be concerned about details of the DNS protocol. All the application developer should need to know is that you invoke GAI/GNI and wait for a response. Once you have that response, it is permissible to use the value immediately, but you may not store or re-use that value for more than a few seconds. If you require that value again later, then you invoke GAI/GNI again; any caching details are the concern of the resolver library developer who has implemented GAI/GNI. -- -JH
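A rough sketch of the shape Jimmy is describing; none of these types or calls exist in any shipping sockets API, and the struct layout and the choice of ESTALE are invented here purely for illustration:

#include <errno.h>
#include <time.h>
#include <sys/socket.h>

/* Hypothetical opaque lookup result: an address plus an absolute
 * "do not use after" time computed by the resolver as time() + TTL. */
struct name_result {
    struct sockaddr_storage addr;
    socklen_t               addrlen;
    time_t                  expires;
};

/* Hypothetical connect wrapper that enforces the expiry, forcing the
 * caller back to the resolver instead of reusing a stale address. */
static int
connect_checked(int fd, const struct name_result *nr)
{
    if (time(NULL) >= nr->expires) {
        errno = ESTALE;             /* caller must re-resolve */
        return (-1);
    }
    return (connect(fd, (const struct sockaddr *)&nr->addr, nr->addrlen));
}

The design point is that the expiry is absolute rather than a raw TTL, so it survives being stored, and the validity check stays a single comparison.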
GAI/GNI do not return TTL values, but this should not be a problem. If they were to return anything, it should not be a TTL, but a time() value, after which the result may no longer be used.
One way to achieve that would be for GAI to return an opaque structure that contained the IP and such a value, in a manner consumable by the sockets API, and adjust connect() to return an error if passed a structure containing a ' returned time + TTL' in the past.
AF_INET_TTL and AF_INET6_TTL, with correspondingly expanded struct sockaddr_* ? Code that explicitly requests AF_INET or AF_INET6 would get what it was expecting, code that requests AF_UNSPEC on a system with modified getaddrinfo() would get the expanded structs with the different ai_family set, and could pass them straight into a modified connect(). I'm sure I'm grossly oversimplifying somewhere though... Regards, Tim.
On Feb 29, 2012, at 10:15 PM, Jimmy Hess wrote:
On Mon, Feb 27, 2012 at 10:57 PM, Matt Addison <matt.addison@lists.evilgeni.us> wrote:
gai/gni do not return TTL values on any platforms I'm aware of, the only way to get TTL currently is to use a non standard resolver (e.g. lwres). The issue is application developers not calling gai every time
GAI/GNI do not return TTL values, but this should not be a problem. If they were to return anything, it should not be a TTL, but a time() value, after which the result may no longer be used.
One way to achieve that would be for GAI to return an opaque structure that contained the IP and such a value, in a manner consumable by the sockets API, and adjust connect() to return an error if passed a structure containing a ' returned time + TTL' in the past.
TTL values are a DNS resolver function; the application consuming the sockets API should not be concerned about details of the DNS protocol.
All the application developer should need to know is that you invoke GAI/GNI and wait for a response. Once you have that response, it is permissible to use the value immediately, but you may not store or re-use that value for more than a few seconds.
If you require that value again later, then you invoke GAI/GNI again; any caching details are the concern of the resolver library developer who has implemented GAI/GNI.
-- -JH
The simpler approach and perfectly viable without mucking up what is already implemented and working: Don't keep returns from GAI/GNI around longer than it takes to cycle through your connect() loop immediately after the GAI/GNI call. If you write your code to the standard of: getaddrinfo(); /* do something with the results */ freeaddrinfo(); with a very limited amount of time passing between getaddrinfo() and freeaddrinfo(), then, you don't need TTLs and it doesn't matter. The system resolver library should do the right thing with DNS TTLs for records retrieved from DNS and a subsequent call to getaddrinfo() within the DNS TTL for the previously retrieved record should be a relatively cheap, fast in-memory operation. Owen
On Thu, Mar 1, 2012 at 7:20 AM, Owen DeLong <owen@delong.com> wrote:
The simpler approach and perfectly viable without mucking up what is already implemented and working:
Don't keep returns from GAI/GNI around longer than it takes to cycle through your connect() loop immediately after the GAI/GNI call.
The even simpler approach: create an AF_NAME with a sockaddr struct that contains a hostname instead of an IPvX address. Then let connect() figure out the details of caching, TTLs, protocol and address selection, etc. Such a connect() could even support a revised TCP stack which is able to retry with the other addresses at the first subsecond timeout rather than camping on each address in sequence for the typical system default of two minutes. Regards, Bill Herrin -- William D. Herrin ................ herrin@dirtside.com bill@herrin.us 3005 Crane Dr. ...................... Web: <http://bill.herrin.us/> Falls Church, VA 22042-3004
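Nothing like AF_NAME exists in any shipping sockets API; purely as an illustration of the shape Bill's idea might take (the family constant, struct layout, and field names are all invented here), a sketch like this shows how little the application would have to know:

#include <string.h>
#include <unistd.h>
#include <arpa/inet.h>
#include <sys/socket.h>

#define AF_NAME 46                  /* hypothetical, unassigned family */

/* Hypothetical name-based socket address: resolution, caching, TTL
 * expiry, and v4/v6 selection would all live behind connect(). */
struct sockaddr_name {
    sa_family_t snm_family;         /* AF_NAME */
    in_port_t   snm_port;           /* service port, network byte order */
    char        snm_name[256];      /* NUL-terminated hostname */
};

/* Usage sketch -- only meaningful on a kernel that implemented AF_NAME. */
static int
dial_by_name(const char *host, unsigned short port)
{
    struct sockaddr_name sn;
    int fd;

    memset(&sn, 0, sizeof(sn));
    sn.snm_family = AF_NAME;
    sn.snm_port = htons(port);
    strncpy(sn.snm_name, host, sizeof(sn.snm_name) - 1);

    fd = socket(AF_NAME, SOCK_STREAM, 0);   /* fails today: no such family */
    if (fd == -1)
        return (-1);
    if (connect(fd, (struct sockaddr *)&sn, sizeof(sn)) == -1) {
        close(fd);
        return (-1);
    }
    return (fd);
}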
On 03/01/2012 06:26 AM, William Herrin wrote:
On Thu, Mar 1, 2012 at 7:20 AM, Owen DeLong <owen@delong.com> wrote:
The simpler approach and perfectly viable without mucking up what is already implemented and working:
Don't keep returns from GAI/GNI around longer than it takes to cycle through your connect() loop immediately after the GAI/GNI call.
The even simpler approach: create an AF_NAME with a sockaddr struct that contains a hostname instead of an IPvX address. Then let connect() figure out the details of caching, TTLs, protocol and address selection, etc. Such a connect() could even support a revised TCP stack which is able to retry with the other addresses at the first subsecond timeout rather than camping on each address in sequence for the typical system default of two minutes.
The effect of what you're recommending is to move all of this into the kernel, and in the process greatly expand its scope. Also: even if you did this, you'd be saddled with the same problem because nothing existing would use an AF_NAME. The real issue is that gethostbyxxx has been inadequate for a very long time. Moving it across the kernel boundary solves nothing and most likely causes even more trouble: what if I want, say, asynchronous name resolution? What if I want to use SRV records? What if a new DNS RR comes around -- do I have to recompile the kernel? It's for these reasons and probably a whole lot more that connect just confuses the actual issues. When I was writing the first version of DKIM I used a library that I scraped off the net called ARES. It worked adequately for me, but the most notable thing was the very fact that I had to scrape it off the net at all. As far as I could tell, standard distros don't have libraries with lower level access to DNS (in my case, it needed to not block). Before positing a super-deluxe gethostbyxx that does address picking, etc, etc, it would be better to lobby all of the distros to settle on a decomposed resolver library from which that and more could be built. Mike
On 03/01/2012 06:26 AM, William Herrin wrote:
On Thu, Mar 1, 2012 at 7:20 AM, Owen DeLong <owen@delong.com> wrote:
The simpler approach and perfectly viable without mucking up what is already implemented and working:
Don't keep returns from GAI/GNI around longer than it takes to cycle through your connect() loop immediately after the GAI/GNI call.
The even simpler approach: create an AF_NAME with a sockaddr struct that contains a hostname instead of an IPvX address. Then let connect() figure out the details of caching, TTLs, protocol and address selection, etc. Such a connect() could even support a revised TCP stack which is able to retry with the other addresses at the first subsecond timeout rather than camping on each address in sequence for the typical system default of two minutes.
The effect of what you're recommending is to move all of this into the kernel, and in the process greatly expand its scope. Also: even if you did this, you'd be saddled with the same problem because nothing existing would use an AF_NAME.
The real issue is that gethostbyxxx has been inadequate for a very long time. Moving it across the kernel boundary solves nothing and most likely causes even more trouble: what if I want, say, asynchronous name resolution? What if I want to use SRV records? What if a new DNS RR comes around -- do I have to recompile the kernel? It's for these reasons and probably a whole lot more that connect just confuses the actual issues.
When I was writing the first version of DKIM I used a library that I scraped off the net called ARES. It worked adequately for me, but the most notable thing was the very fact that I had to scrape it off the net at all. As far as I could tell, standard distros don't have libraries with lower level access to DNS (in my case, it needed to not block). Before positing a super-deluxe gethostbyxx that does address picking, etc, etc, it would be better to lobby all of the distros to settle on a decomposed resolver library from which that and more could be built.
It's deeper than just that, though. The whole paradigm is messy, from the point of view of someone who just wants to get stuff done. The examples are (almost?) all fatally flawed. The code that actually gets at least some of it right ends up being too complex and too hard for people to understand why things are done the way they are. Even in the "old days", before IPv6, geez, look at this: bcopy(host->h_addr_list[n], (char *)&addr->sin_addr.s_addr, sizeof(addr->sin_addr.s_addr)); That's real comprehensible - and it's essentially the data interface between the resolver library and the system's addressing structures for syscalls. On one hand, it's "great" that they wanted to abstract the dirty details of DNS away from users, but I'd say they failed pretty much even at that. ... JG -- Joe Greco - sol.net Network Services - Milwaukee, WI - http://www.sol.net "We call it the 'one bite at the apple' rule. Give me one chance [and] then I won't contact you again." - Direct Marketing Ass'n position on e-mail spam(CNN) With 24 million small businesses in the US alone, that's way too many apples.
On 03/01/2012 07:22 AM, Joe Greco wrote:
It's deeper than just that, though. The whole paradigm is messy, from the point of view of someone who just wants to get stuff done. The examples are (almost?) all fatally flawed. The code that actually gets at least some of it right ends up being too complex and too hard for people to understand why things are done the way they are.
Even in the "old days", before IPv6, geez, look at this:
bcopy(host->h_addr_list[n], (char *)&addr->sin_addr.s_addr, sizeof(addr->sin_addr.s_addr));
That's real comprehensible - and it's essentially the data interface between the resolver library and the system's addressing structures for syscalls.
On one hand, it's "great" that they wanted to abstract the dirty details of DNS away from users, but I'd say they failed pretty much even at that.
Yes, as simple as the normal kernel interface is for net io, getting to the point that you can do a connect() is both maddeningly messy and maddeningly inflexible -- the worst of all possible worlds. We shouldn't kid ourselves that DNS is a simple protocol though. It has layers of complexity and the policy decisions about address picking are not easy. But things like dealing with caching shouldn't be that painful if done correctly -- say, by discouraging copying addresses and providing a wrapper function that validates the TTL and hands you back a filled-out sockaddr. But not wanting to block -- which is needed for an event loop or run-to-completion like interface -- adds a completely new dimension. Maybe it's the intersection of all of these complexities that's at the root of why we're stuck with either gethostbyxx or roll your own. Mike
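A back-of-the-envelope sketch of the wrapper Michael mentions. Since getaddrinfo() exposes no TTL (the complaint running through this thread), the sketch has to assume a conservative fixed lifetime; the function name, the single-entry cache, and the 60-second bound are all illustrative assumptions:

#include <string.h>
#include <time.h>
#include <netdb.h>
#include <sys/socket.h>

#define ADDR_LIFETIME 60            /* assumed bound; no real TTL available */

/* Hand back a filled-out sockaddr for host:service, re-resolving
 * whenever the cached copy is older than ADDR_LIFETIME seconds.
 * (One cached entry only -- just enough to show the shape of the thing.) */
static int
lookup_cached(const char *host, const char *service,
              struct sockaddr_storage *out, socklen_t *outlen)
{
    static struct sockaddr_storage cached;
    static socklen_t cachedlen;
    static time_t expires;
    struct addrinfo hints, *res;

    if (cachedlen != 0 && time(NULL) < expires) {
        *out = cached;
        *outlen = cachedlen;
        return (0);
    }

    memset(&hints, 0, sizeof(hints));
    hints.ai_family = AF_UNSPEC;
    hints.ai_socktype = SOCK_STREAM;
    if (getaddrinfo(host, service, &hints, &res) != 0)
        return (-1);

    memcpy(&cached, res->ai_addr, res->ai_addrlen);     /* first result only */
    cachedlen = (socklen_t)res->ai_addrlen;
    expires = time(NULL) + ADDR_LIFETIME;
    freeaddrinfo(res);

    *out = cached;
    *outlen = cachedlen;
    return (0);
}

A resolver that surfaced the real TTL would let the expires calculation use it directly; as it stands, the constant is exactly the kind of guess the thread is complaining about.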
Hi, On Mar 1, 2012, at 7:22 AM, Joe Greco wrote:
On Mar 1, 2012, at 7:01 AM, Michael Thomas wrote:
The effect of what you're recommending is to move all of this into the kernel, and in the process greatly expand its scope. Also: even if you did this, you'd be saddled with the same problem because nothing existing would use an AF_NAME.
I always thought the right way to deal with IPv6 would have been to use a 32-bit number from the class E space as a 'network handle' where the actual address (be it IPv4 or IPv6) was handled by the kernel. I suspect this would have allowed the majority of network-utilizing applications to magically just work, regardless of whether the name supplied by gethostbyname/getnameinfo/etc. was mapped to an address with A or AAAA. Probably would make stuff faster too since you'd only have to deal with an unsigned int instead of (worst case) 16 bytes that have to be copied back and forth. Instead, we have forced application developers to use a really odd mixture of old and new, e.g. 'struct sockaddr_in6' and GNI/GAI. Seems this is the worst of both worlds -- no backwards compatibility yet an adherence to a really broken model that requires applications to know useless details like the length of an address ("what do you mean a sizeof(struct sockaddr) isn't big enough to hold an IPv6 address?") and even its bit patterns.
Moving it across the kernel boundary solves nothing
Actually, it does. Right now, applications effectively cache the address in their data space, requiring the application developer to go to quite a bit of work to deal with the address changing (or, far more typically, just pretend addresses never change). This has a lot of unfortunate side effects.
and most likely causes even more trouble: what if I want, say, asynchronous name resolution?
Set non-blocking on the socket?
What if I want to use SRV records? What if a new DNS RR comes around -- do i have do recompile the kernel?
I believe with the exception of A/AAAA, RDATA is typically returned as either opaque (to the DNS) data blobs or names. This means the only stuff the kernel would need to deal with would be the A/AAAA lookups, everything else would be passed back as data, presumably via a new system call.
As far as I could tell, standard distos don't have libraries with lower level access to DNS (in my case, it needed to not block).
There have been lower-level resolver APIs since (at least) BSD 4.3 (man resolver(3)).
It's deeper than just that, though. The whole paradigm is messy, from the point of view of someone who just wants to get stuff done. The
examples are (almost?) all fatally flawed. The code that actually gets at least some of it right ends up being too complex and too hard for people to understand why things are done the way they are.
Exactly. Even before IPv6, it was icky. Now, it's just crazy. We had an opportunity to fix this with IPv6 since IPv6 required non-trivial kernel hackage. Unfortunately, we didn't take advantage of that opportunity. Regards, -drc
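As an aside, a hedged sketch of the lower-level resolver(3) interface David mentions -- roughly how a program can see the TTL that getaddrinfo() throws away. The buffer size and the -lresolv link flag are platform-dependent assumptions (on most Linux systems res_query lives in libresolv; on the BSDs it is in libc):

/* cc ttl.c -lresolv */
#include <stdio.h>
#include <netdb.h>
#include <string.h>
#include <arpa/inet.h>
#include <arpa/nameser.h>
#include <netinet/in.h>
#include <resolv.h>

int
main(int argc, char **argv)
{
    unsigned char answer[4096];
    char addr[INET_ADDRSTRLEN];
    ns_msg msg;
    ns_rr rr;
    int len, i;

    if (argc != 2) {
        fprintf(stderr, "usage: %s hostname\n", argv[0]);
        return (1);
    }

    res_init();
    len = res_query(argv[1], ns_c_in, ns_t_a, answer, sizeof(answer));
    if (len < 0) {
        herror("res_query");
        return (1);
    }
    if (ns_initparse(answer, len, &msg) < 0)
        return (1);

    /* Walk the answer section and print each A record with its TTL. */
    for (i = 0; i < ns_msg_count(msg, ns_s_an); i++) {
        if (ns_parserr(&msg, ns_s_an, i, &rr) < 0 || ns_rr_type(rr) != ns_t_a)
            continue;
        inet_ntop(AF_INET, ns_rr_rdata(rr), addr, sizeof(addr));
        printf("%s  TTL=%u\n", addr, (unsigned int)ns_rr_ttl(rr));
    }
    return (0);
}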
On 2012-03-01 17:57 , David Conrad wrote:
Hi,
On Mar 1, 2012, at 7:22 AM, Joe Greco wrote:
On Mar 1, 2012, at 7:01 AM, Michael Thomas wrote:
The effect of what you're recommending is to move all of this into the kernel, and in the process greatly expand its scope. Also: even if you did this, you'd be saddled with the same problem because nothing existing would use an AF_NAME.
I always thought the right way to deal with IPv6 would have been to use a 32-bit number from the class E space as a 'network handle' where the actual address (be it IPv4 or IPv6) was handled by the kernel.
This is the case when you pass in a sockaddr. Note, not a sockaddr_in or a sockaddr_in6, but just a sockaddr. There is a nice 14 year old article about this: http://www.kame.net/newsletter/19980604/
I suspect this would have allowed the majority of network-utilizing applications to magically just work, regardless of whether the name supplied by gethostbyname/getnameinfo/etc. was mapped to an address with A or AAAA. Probably would make stuff faster too since you'd only have to deal with an unsigned int instead of (worst case) 16 bytes that have to be copied back and forth.
There is quite a bit more state than that. And actually those addresses are only 'copied' once: during accept() or connect(), there is no "speed-loss" per send/recv as the only thing being moved from user space to kernel space is the file descriptor and the actual data. [..]
Instead, we have forced application developers to use a really odd mixture of old and new, e.g. 'struct sockaddr_in6' and GNI/GAI. Seems this is the worst of both worlds -- no backwards compatibility yet an adherence to a really broken model that requires applications to know useless details like the length of an address ("what do you mean a sizeof(struct sockaddr) isn't big enough to hold an IPv6 address?") and even its bit patterns.
Ever heard of sockaddr_storage? It was made to solve that little issue. See also, that article above. [..]
Exactly. Even before IPv6, it was icky. Now, it's just crazy. We had an opportunity to fix this with IPv6 since IPv6 required non-trivial kernel hackage. Unfortunately, we didn't take advantage of that opportunity.
What you are talking about is an API wrapper. Depending on platform these have existed for years already. Quite a few do not expose addresses at all to the calling code. One of the many reasons why putting the IPv6 enabled winsock dll in place 14 years ago made various winsock applications understand IPv6. Greets, Jeroen
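For anyone following along, a small sketch of the pattern Jeroen is pointing at: with sockaddr_storage and getnameinfo() the application never needs to know how large an address is or what its bits look like (error handling trimmed; a sketch, not production code):

#include <stdio.h>
#include <netdb.h>
#include <sys/socket.h>

/* Accept a connection and report the peer without ever touching a
 * family-specific sockaddr_in or sockaddr_in6. */
static int
accept_and_report(int listener)
{
    struct sockaddr_storage ss;     /* big enough for any supported family */
    socklen_t sslen = sizeof(ss);
    char host[NI_MAXHOST], serv[NI_MAXSERV];
    int fd;

    fd = accept(listener, (struct sockaddr *)&ss, &sslen);
    if (fd == -1)
        return (-1);

    if (getnameinfo((struct sockaddr *)&ss, sslen, host, sizeof(host),
                    serv, sizeof(serv), NI_NUMERICHOST | NI_NUMERICSERV) == 0)
        printf("connection from %s port %s\n", host, serv);

    return (fd);
}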
Jeroen, On Mar 1, 2012, at 9:25 AM, Jeroen Massar wrote:
I always thought the right way to deal with IPv6 would have been to use a 32-bit number from the class E space as a 'network handle' where the actual address (be it IPv4 or IPv6) was handled by the kernel.
This is the case when you pass in a sockaddr. Note, not a sockaddr_in or a sockaddr_in6, but just a sockaddr.
Sorry? On which system? As far as I'm aware, there are no libraries that make use of class E addresses to act as a layer of indirection similar to file handles. Would love to know such exists.
There is a nice 14 year old article about this: http://www.kame.net/newsletter/19980604/
Quoting from that article: "This way the network address and address family is will not live together, and leads to bunch of if/switch statement and mistakes in programming. " which is exactly the point. It has been 14 years and people are _STILL_ discussing this.
And actually those addresses are only 'copied' once: during accept() or connect(),
Assuming the application doesn't need to copy the address, ever.
Ever heard of sockaddr_storage?
Oddly, yes. It still astonishes me that sizeof(struct sockaddr) < sizeof(struct sockaddr_storage).
It was made to solve that little issue. See also, that article above.
Thus requiring people to go in and muck with code thereby increasing the cost of migration with obvious effect.
What you are talking about is an API wrapper. Depending on platform these have existed for years already. Quite a few do not expose addresses at all to the calling code.
And yet, look at the code Mark Andrews just referenced as his recommended way of dealing with initiating connections. How many applications actually do anything like that? More to the point, how many books/articles/etc. exist that reference these APIs you're talking about vs. how many reference the traditional way one goes about dealing with networks? Rhetorical questions, no need to answer. Got tired of tilting at this windmill some time ago and I know nothing will change. I'm just amazed that people defend the abominable kludges that are the existing common sockets/resolver APIs. Regards, -drc
On 03/01/2012 08:57 AM, David Conrad wrote:
Moving it across the kernel boundary solves nothing Actually, it does. Right now, applications effectively cache the address in their data space, requiring the application developer to go to quite a bit of work to deal with the address changing (or, far more typically, just pretend addresses never change). This has a lot of unfortunate side effects.
My rule of thumb for this sort of thing is "does it *require* kernel-level access?" In this case, the answer is manifestly "no". As far as TTLs go in particular, most apps would work perfectly well always doing real DNS socket IO to a local resolver each time, which has the side effect that it would honor the TTL, as well as benefiting from cross-process caching. It could be done in the kernel, but it would be introducing a *lot* of complexity and inflexibility. Even if you did want super high performance local DNS resolution, there are still a lot of other ways to achieve that besides jamming it into the kernel. A lot of the beauty of UNIX is that the kernel system interface is simple... dragging more into the kernel is aesthetically wrong.
What if I want to use SRV records? What if a new DNS RR comes around -- do i have do recompile the kernel? I believe with the exception of A/AAAA, RDATA is typically returned as either opaque (to the DNS) data blobs or names. This means the only stuff the kernel would need to deal with would be the A/AAAA lookups, everything else would be passed back as data, presumably via a new system call.
SRV records? This is starting to get really messy inside the kernel and for no good reason that I can see.
As far as I could tell, standard distos don't have libraries with lower level access to DNS (in my case, it needed to not block). There have been lower-level resolver APIs since (at least) BSD 4.3 (man resolver(3)).
This is all getting sort of hazy since it was 8 years ago, but yes res_XX existed, and hence the ares_ analog that I used. Maybe all that's really needed for low level access primitives is a merger of res_ and ares_... asynchronous resolution is a fairly important feature for modern event loop like things. But I don't claim to be a DNS wonk so it might be worse than that. Mike
Michael, On Mar 1, 2012, at 10:00 AM, Michael Thomas wrote:
My rule of thumb is for this sort of thing "does it *require* kernel level access?" In this case, the answer is manifestly "no".
This is tilting at windmills since it's wildly unlikely anything will change, but... The idea is to add a level of indirection that does not currently exist, similar to the mapping of filename/file handle/inode in the filesystem. This layer of indirection allows the kernel to remap things as it sees fit without impacting the application. If such functionality existed, the kernel could manage the mapping between name and address to do things like honoring DNS TTL, transparently handling renumbering events, deal with protocol transitions even during a connection, etc. As things are now, it's like having to rewrite non-trivial sections of code for _all_ disk-aware applications because we've gone from a 32-bit file system to a 64-bit file system, even though the vast majority of those applications couldn't care less.
SRV records?
Do not have addresses in their RDATA, they have names. Regards, -drc
It's deeper than just that, though. The whole paradigm is messy, from the point of view of someone who just wants to get stuff done. The examples are (almost?) all fatally flawed. The code that actually gets at least some of it right ends up being too complex and too hard for people to understand why things are done the way they are.
Even in the "old days", before IPv6, geez, look at this:
bcopy(host->h_addr_list[n], (char *)&addr->sin_addr.s_addr, sizeof(addr->sin_addr.s_addr));
That's real comprehensible - and it's essentially the data interface between the resolver library and the system's addressing structures for syscalls.
On one hand, it's "great" that they wanted to abstract the dirty details of DNS away from users, but I'd say they failed pretty much even at that.
... JG -- Joe Greco - sol.net Network Services - Milwaukee, WI - http://www.sol.net "We call it the 'one bite at the apple' rule. Give me one chance [and] then I won't contact you again." - Direct Marketing Ass'n position on e-mail spam(CNN) With 24 million small businesses in the US alone, that's way too many apples.
I think that the modern set of getaddrinfo and connect is actually not that complicated:

/* Hints for getaddrinfo() (tell it what we want) */
memset(&addrinfo, 0, sizeof(addrinfo));     /* Zero out the buffer */
addrinfo.ai_family=PF_UNSPEC;               /* Any and all address families */
addrinfo.ai_socktype=SOCK_STREAM;           /* Stream Socket */
addrinfo.ai_protocol=IPPROTO_TCP;           /* TCP */

/* Ask the resolver library for the information. Exit on failure. */
/* argv[1] is the hostname passed in by the user. "demo" is the service name */
if (rval = getaddrinfo(argv[1], "demo", &addrinfo, &res) != 0)
{
    fprintf(stderr, "%s: Failed to resolve address information.\n", argv[0]);
    exit(2);
}

/* Iterate through the results */
for (r=res; r; r = r->ai_next)
{
    /* Create a socket configured for the next candidate */
    sockfd6 = socket(r->ai_family, r->ai_socktype, r->ai_protocol);
    /* Try to connect */
    if (connect(sockfd6, r->ai_addr, r->ai_addrlen) < 0)
    {
        /* Failed to connect */
        e_save = errno;
        /* Destroy socket */
        (void) close(sockfd6);
        /* Recover the error information */
        errno = e_save;
        /* Tell the user that this attempt failed */
        fprintf(stderr, "%s: Failed attempt to %s.\n", argv[0],
            get_ip_str((struct sockaddr *)r->ai_addr, buf, BUFLEN));
        /* Give error details */
        perror("Socket error");
    }
    else
    {
        /* Success! */
        /* Inform the user */
        snprintf(s, BUFLEN, "%s: Succeeded to %s.", argv[0],
            get_ip_str((struct sockaddr *)r->ai_addr, buf, BUFLEN));
        debug(5, argv[0], s);
        /* Flag our success */
        success++;
        /* Stop iterating */
        break;
    }
}

/* Out of the loop. Either we succeeded or ran out of possibilities */
if (success == 0)    /* If we ran out of possibilities... */
{
    /* Inform the user, free up the resources, and exit */
    fprintf(stderr, "%s: Failed to connect to %s.\n", argv[0], argv[1]);
    freeaddrinfo(res);
    exit(5);
}

/* Succeeded. Inform the user and continue with the application */
printf("%s: Successfully connected to %s at %s on FD %d.\n", argv[0], argv[1],
    get_ip_str((struct sockaddr *)r->ai_addr, buf, BUFLEN), sockfd6);

/* Free up the memory held by the resolver results */
freeaddrinfo(res);

It's really hard to make a case that this is all that complex.

I put a lot of extra comments in there to make it clear what's happening for people who may not be used to coding in C. It also contains a whole lot of extra user notification and debugging instrumentation because it is designed as an example people can use to learn with.

Yes, this was a lot messier and a lot stranger and harder to get right with get*by{name,addr}, but, those days are long gone and anyone still coding with those needs to move forward.

Owen
In message <CAC38B59-1F54-4788-87A2-A1A8BE453500@delong.com>, Owen DeLong writes:
It's deeper than just that, though. The whole paradigm is messy, from the point of view of someone who just wants to get stuff done. The examples are (almost?) all fatally flawed. The code that actually gets at least some of it right ends up being too complex and too hard for people to understand why things are done the way they are.

Even in the "old days", before IPv6, geez, look at this:

bcopy(host->h_addr_list[n], (char *)&addr->sin_addr.s_addr, sizeof(addr->sin_addr.s_addr));

That's real comprehensible - and it's essentially the data interface between the resolver library and the system's addressing structures for syscalls.

On one hand, it's "great" that they wanted to abstract the dirty details of DNS away from users, but I'd say they failed pretty much even at that.

... JG -- Joe Greco - sol.net Network Services - Milwaukee, WI - http://www.sol.net "We call it the 'one bite at the apple' rule. Give me one chance [and] then I won't contact you again." - Direct Marketing Ass'n position on e-mail spam(CNN) With 24 million small businesses in the US alone, that's way too many apples.
I think that the modern set of getaddrinfo and connect is actually not that complicated:

/* Hints for getaddrinfo() (tell it what we want) */
memset(&addrinfo, 0, sizeof(addrinfo));     /* Zero out the buffer */
addrinfo.ai_family=PF_UNSPEC;               /* Any and all address families */
addrinfo.ai_socktype=SOCK_STREAM;           /* Stream Socket */
addrinfo.ai_protocol=IPPROTO_TCP;           /* TCP */

/* Ask the resolver library for the information. Exit on failure. */
/* argv[1] is the hostname passed in by the user. "demo" is the service name */
if (rval = getaddrinfo(argv[1], "demo", &addrinfo, &res) != 0)
{
    fprintf(stderr, "%s: Failed to resolve address information.\n", argv[0]);
    exit(2);
}

/* Iterate through the results */
for (r=res; r; r = r->ai_next)
{
    /* Create a socket configured for the next candidate */
    sockfd6 = socket(r->ai_family, r->ai_socktype, r->ai_protocol);
    /* Try to connect */
    if (connect(sockfd6, r->ai_addr, r->ai_addrlen) < 0)
    {
        /* Failed to connect */
        e_save = errno;
        /* Destroy socket */
        (void) close(sockfd6);
        /* Recover the error information */
        errno = e_save;
        /* Tell the user that this attempt failed */
        fprintf(stderr, "%s: Failed attempt to %s.\n", argv[0],
            get_ip_str((struct sockaddr *)r->ai_addr, buf, BUFLEN));
        /* Give error details */
        perror("Socket error");
    }
    else
    {
        /* Success! */
        /* Inform the user */
        snprintf(s, BUFLEN, "%s: Succeeded to %s.", argv[0],
            get_ip_str((struct sockaddr *)r->ai_addr, buf, BUFLEN));
        debug(5, argv[0], s);
        /* Flag our success */
        success++;
        /* Stop iterating */
        break;
    }
}

/* Out of the loop. Either we succeeded or ran out of possibilities */
if (success == 0)    /* If we ran out of possibilities... */
{
    /* Inform the user, free up the resources, and exit */
    fprintf(stderr, "%s: Failed to connect to %s.\n", argv[0], argv[1]);
    freeaddrinfo(res);
    exit(5);
}

/* Succeeded. Inform the user and continue with the application */
printf("%s: Successfully connected to %s at %s on FD %d.\n", argv[0], argv[1],
    get_ip_str((struct sockaddr *)r->ai_addr, buf, BUFLEN), sockfd6);

/* Free up the memory held by the resolver results */
freeaddrinfo(res);
It's really hard to make a case that this is all that complex.
I put a lot of extra comments in there to make it clear what's happening for people who may not be used to coding in C. It also contains a whole lot of extra user notification and debugging instrumentation because it is designed as an example people can use to learn with.

Yes, this was a lot messier and a lot stranger and harder to get right with get*by{name,addr}, but, those days are long gone and anyone still coding with those needs to move forward.
Owen
These days you want something more complicated, as everyone is, or soon will be, multi-homed. The basic loop above has very bad error characteristics if the first machines are not reachable. I've got working select-, poll- and thread-based examples here: http://www.isc.org/community/blog/201101/how-to-connect-to-a-multi-homed-ser....
/*
 * Copyright (C) 2011 Internet Systems Consortium, Inc. ("ISC")
 *
 * Permission to use, copy, modify, and/or distribute this software for any
 * purpose with or without fee is hereby granted, provided that the above
 * copyright notice and this permission notice appear in all copies.
 *
 * THE SOFTWARE IS PROVIDED "AS IS" AND ISC DISCLAIMS ALL WARRANTIES WITH
 * REGARD TO THIS SOFTWARE INCLUDING ALL IMPLIED WARRANTIES OF MERCHANTABILITY
 * AND FITNESS. IN NO EVENT SHALL ISC BE LIABLE FOR ANY SPECIAL, DIRECT,
 * INDIRECT, OR CONSEQUENTIAL DAMAGES OR ANY DAMAGES WHATSOEVER RESULTING FROM
 * LOSS OF USE, DATA OR PROFITS, WHETHER IN AN ACTION OF CONTRACT, NEGLIGENCE
 * OR OTHER TORTIOUS ACTION, ARISING OUT OF OR IN CONNECTION WITH THE USE OR
 * PERFORMANCE OF THIS SOFTWARE.
 */

#define TIMEOUT 500    /* ms */

int connect_to_host(struct addrinfo *res0) {
    struct addrinfo *res;
    int fd = -1, n, i, j, flags, count, max = -1, *fds;
    struct timeval *timeout, timeout0 = { 0, TIMEOUT * 1000};
    fd_set fdset, wrset;

    /*
     * Work out how many possible descriptors we could use.
     */
    for (res = res0, count = 0; res; res = res->ai_next)
        count++;

    fds = calloc(count, sizeof(*fds));
    if (fds == NULL) {
        perror("calloc");
        goto cleanup;
    }

    FD_ZERO(&fdset);
    for (res = res0, i = 0, count = 0; res; res = res->ai_next) {
        fd = socket(res->ai_family, res->ai_socktype, res->ai_protocol);
        if (fd == -1) {
            /*
             * If AI_ADDRCONFIG is not supported we will get
             * EAFNOSUPPORT returned. Behave as if the address
             * was not there.
             */
            if (errno != EAFNOSUPPORT)
                perror("socket");
            else if (res->ai_next != NULL)
                continue;
        } else if (fd >= FD_SETSIZE) {
            close(fd);
        } else if ((flags = fcntl(fd, F_GETFL)) == -1) {
            perror("fcntl");
            close(fd);
        } else if (fcntl(fd, F_SETFL, flags | O_NONBLOCK) == -1) {
            perror("fcntl");
            close(fd);
        } else if (connect(fd, res->ai_addr, res->ai_addrlen) == -1) {
            if (errno != EINPROGRESS) {
                perror("connect");
                close(fd);
            } else {
                /*
                 * Record the information for this descriptor.
                 */
                fds[i] = fd;
                FD_SET(fd, &fdset);
                if (max == -1 || fd > max)
                    max = fd;
                count++;
                i++;
            }
        } else {
            /*
             * We connected without blocking.
             */
            goto done;
        }

        if (count == 0)
            continue;

        assert(max != -1);

        do {
            if (res->ai_next != NULL)
                timeout = &timeout0;
            else
                timeout = NULL;
            /* The write bit is set on both success and failure. */
            wrset = fdset;
            n = select(max + 1, NULL, &wrset, NULL, timeout);
            if (n == 0) {
                timeout0.tv_usec >>= 1;
                break;
            }
            if (n < 0) {
                if (errno == EAGAIN || errno == EINTR)
                    continue;
                perror("select");
                fd = -1;
                goto done;
            }
            for (fd = 0; fd <= max; fd++) {
                if (FD_ISSET(fd, &wrset)) {
                    socklen_t len;
                    int err;

                    for (j = 0; j < i; j++)
                        if (fds[j] == fd)
                            break;
                    assert(j < i);
                    /*
                     * Test to see if the connect
                     * succeeded.
                     */
                    len = sizeof(err);
                    n = getsockopt(fd, SOL_SOCKET, SO_ERROR, &err, &len);
                    if (n != 0 || err != 0) {
                        close(fd);
                        FD_CLR(fd, &fdset);
                        fds[j] = -1;
                        count--;
                        continue;
                    }
                    /* Connect succeeded. */
                    goto done;
                }
            }
        } while (timeout == NULL && count != 0);
    }

    /* We failed to connect. */
    fd = -1;

done:
    /* Close all other descriptors we have created. */
    for (j = 0; j < i; j++)
        if (fds[j] != fd && fds[j] != -1) {
            close(fds[j]);
        }

    if (fd != -1) {
        /* Restore default blocking behaviour. */
        if ((flags = fcntl(fd, F_GETFL)) != -1) {
            flags &= ~O_NONBLOCK;
            if (fcntl(fd, F_SETFL, flags) == -1)
                perror("fcntl");
        } else
            perror("fcntl");
    }

cleanup:
    /* Free everything. */
    if (fds)
        free(fds);
    return (fd);
}

-- Mark Andrews, ISC 1 Seymour St., Dundas Valley, NSW 2117, Australia PHONE: +61 2 9871 4742 INTERNET: marka@isc.org
On Thu, Mar 1, 2012 at 4:07 PM, Owen DeLong <owen@delong.com> wrote:
I think that the modern set of getaddrinfo and connect is actually not that complicated:
Owen,

It took you 50 lines of code to do 'socket=connect("www.google.com",80,TCP);' and you still managed to produce a version which, due to the timeout on dead addresses, is worthless for any kind of interactive program like a web browser. And because that code isn't found in a system library, every single application programmer has to write it all over again.

I'm a fan of Rube Goldberg machines but that was ridiculous.

Regards, Bill Herrin

-- 
William D. Herrin ................ herrin@dirtside.com bill@herrin.us
3005 Crane Dr. ...................... Web: <http://bill.herrin.us/>
Falls Church, VA 22042-3004
In message <CAP-guGXLpzai4LrxyJcNn06yQ1jAEu4QeRpVzGRah=+OGLy9Zw@mail.gmail.com> , William Herrin writes:
On Thu, Mar 1, 2012 at 4:07 PM, Owen DeLong <owen@delong.com> wrote:
I think that the modern set of getaddrinfo and connect is actually not that complicated:
Owen,
If took you 50 lines of code to do 'socket=connect("www.google.com",80,TCP);' and you still managed to produce a version which, due to the timeout on dead addresses, is worthless for any kind of interactive program like a web browser. And because that code isn't found in a system library, every single application programmer has to write it all over again.
And your 'socket=connect("www.google.com",80,TCP);' won't work for a web browser either unless you are using threads and are willing to have the thread stall. The existing connect() semantics actually work well for browsers but they need to be properly integrated into the system as a whole. Nameservers have similar connect() issues as web browsers with one advantage, most of the time we are connecting to a machine we have just connected to via UDP. That doesn't mean we don't do non-blocking connect however.
I'm a fan of Rube Goldberg machines but that was ridiculous.
Regards, Bill Herrin -- Mark Andrews, ISC 1 Seymour St., Dundas Valley, NSW 2117, Australia PHONE: +61 2 9871 4742 INTERNET: marka@isc.org
William,

I could have done it in a lot fewer lines of code, but it would have been much less readable. Not blocking on the connect() call is a little more complex, but not terribly so. It does, however, again make the code quite a bit less readable.

There are libraries available that abstract everything I did there and you are welcome to use them. Since C does not support overloading, they export different functions for the behavior you seek.

If you want, program in Python, where the libraries do provide the abstraction you seek. Of course, that means you have to cope with Python's other disgusting habits, like spaces being meaningful and variables being indistinguishable from code, but there's always a tradeoff.

You don't have to reinvent what I've done. Neither does every (or any) other application programmer. You are welcome to use any of the many connection abstraction libraries that are available in open source. I suggest you make a trip through Google Code.

Owen

On Mar 1, 2012, at 2:09 PM, William Herrin wrote:
On Thu, Mar 1, 2012 at 4:07 PM, Owen DeLong <owen@delong.com> wrote:
I think that the modern set of getaddrinfo and connect is actually not that complicated:
Owen,
If took you 50 lines of code to do 'socket=connect("www.google.com",80,TCP);' and you still managed to produce a version which, due to the timeout on dead addresses, is worthless for any kind of interactive program like a web browser. And because that code isn't found in a system library, every single application programmer has to write it all over again.
I'm a fan of Rube Goldberg machines but that was ridiculous.
Regards, Bill Herrin
-- William D. Herrin ................ herrin@dirtside.com bill@herrin.us 3005 Crane Dr. ...................... Web: <http://bill.herrin.us/> Falls Church, VA 22042-3004
On Thu, Mar 1, 2012 at 5:37 PM, Owen DeLong <owen@delong.com> wrote:
You don't have to reinvent what I've done. Neither does every or any other application programmer. You are welcome to use any of the many connection abstraction libraries that are available in open source. I suggest you make a trip through google code.
Which is what everybody basically does. And when it works during the decidedly non-rigorous testing, they move on to the next problem... with code that doesn't perform well in the corner cases. Such as when a host has just been renumbered or one of the host's addresses is unreachable. And because most everybody has made more or less the same errors, the DNS TTL fails to cause their applications to work as intended and loses its utility as a tool to facilitate renumbering.
If you want, program in Python where the libraries do provide the abstraction you seek. Of course, that means you have to cope with Python's other disgusting habits like spaces are meaningful and variables are indistinguishable from code, but, there's always a tradeoff.
::shudder:: I don't *want* to do anything in python. The occasional reality of a situation dictates that I do some work in python, but I most definitely don't *want* to. Regards, Bill Herrin -- William D. Herrin ................ herrin@dirtside.com bill@herrin.us 3005 Crane Dr. ...................... Web: <http://bill.herrin.us/> Falls Church, VA 22042-3004
On Thu, Mar 01, 2012 at 05:57:11PM -0500, William Herrin wrote:
Which is what everybody basically does. And when it works during the decidedly non-rigorous testing, they move on to the next problem... with code that doesn't perform well in the corner cases. Such as when a host has just been renumbered or one of the host's addresses is unreachable.
And because most everybody has made more or less the same errors, the DNS TTL fails to cause their applications to work as intended and loses its utility as a tool to facilitate renumbering.
Is there an RFC or BCP that describes how to correctly write such a library? Perhaps we need to work to get such a thing, and then push for RFC-compliance of the resolver libraries, or develop a set of libraries named after and fully compliant with the RFC and get software to use them.
On Mar 1, 2012, at 2:57 PM, William Herrin wrote:
On Thu, Mar 1, 2012 at 5:37 PM, Owen DeLong <owen@delong.com> wrote:
You don't have to reinvent what I've done. Neither does every or any other application programmer. You are welcome to use any of the many connection abstraction libraries that are available in open source. I suggest you make a trip through google code.
Which is what everybody basically does. And when it works during the decidedly non-rigorous testing, they move on to the next problem... with code that doesn't perform well in the corner cases. Such as when a host has just been renumbered or one of the host's addresses is unreachable.
Then push for better written abstraction libraries. There's no need to break the current functionality of the underlying system calls and libc functions which would be needed by any such library anyway.
And because most everybody has made more or less the same errors, the DNS TTL fails to cause their applications to work as intended and loses its utility as a tool to facilitate renumbering.
Since I don't write applications for a living, I will admit I haven't rigorously tested any of the libraries out there, but, I'm willing to bet that someone, somewhere has probably written a good one by now.
If you want, program in Python where the libraries do provide the abstraction you seek. Of course, that means you have to cope with Python's other disgusting habits like spaces are meaningful and variables are indistinguishable from code, but, there's always a tradeoff.
::shudder:: I don't *want* to do anything in python. The occasional reality of a situation dictates that I do some work in python, but I most definitely don't *want* to.
Believe me, I'm in the same boat on that one. However, it is the only language I know of that provides the kind of interface you are demanding. Perhaps this should tell you something about what you are asking for. ;-) Owen
On Thu, Mar 1, 2012 at 8:02 PM, Owen DeLong <owen@delong.com> wrote:
There's no need to break the current functionality of the underlying system calls and libc functions which would be needed by any such library anyway.
Owen, Point to one sentence written by anybody in this entire thread in which breaking current functionality was proposed.
And because most everybody has made more or less the same errors, the DNS TTL fails to cause their applications to work as intended and loses its utility as a tool to facilitate renumbering.
Since I don't write applications for a living, I will admit I haven't rigorously tested any of the libraries out there, but, I'm willing to bet that someone, somewhere has probably written a good one by now.
Yeah, and if you give me a few weeks I can probably find it amidst all the others which aren't so hot. Regards, Bill -- William D. Herrin ................ herrin@dirtside.com bill@herrin.us 3005 Crane Dr. ...................... Web: <http://bill.herrin.us/> Falls Church, VA 22042-3004
On Mar 1, 2012, at 5:15 PM, William Herrin wrote:
On Thu, Mar 1, 2012 at 8:02 PM, Owen DeLong <owen@delong.com> wrote:
There's no need to break the current functionality of the underlying system calls and libc functions which would be needed by any such library anyway.
Owen,
Point to one sentence written by anybody in this entire thread in which breaking current functionality was proposed.
When you said that:

connect(char *name, uint16_t port) should work

That can't work without breaking the existing functionality of the connect() system call.
And because most everybody has made more or less the same errors, the DNS TTL fails to cause their applications to work as intended and loses its utility as a tool to facilitate renumbering.
Since I don't write applications for a living, I will admit I haven't rigorously tested any of the libraries out there, but, I'm willing to bet that someone, somewhere has probably written a good one by now.
Yeah, and if you give me a few weeks I can probably find it amidst all the others which aren't so hot.
I doubt it would take weeks, but, in any case, it's probably faster than writing and debugging your own. Owen
On Thu, Mar 1, 2012 at 8:47 PM, Owen DeLong <owen@delong.com> wrote:
On Mar 1, 2012, at 5:15 PM, William Herrin wrote:
On Thu, Mar 1, 2012 at 8:02 PM, Owen DeLong <owen@delong.com> wrote:
There's no need to break the current functionality of the underlying system calls and libc functions which would be needed by any such library anyway.
Owen,
Point to one sentence written by anybody in this entire thread in which breaking current functionality was proposed.
When you said that:
connect(char *name, uint16_t port) should work
That can't work without breaking the existing functionality of the connect() system call.
You know, when I wrote 'socket=connect("www.google.com",80,TCP);' I stopped and thought to myself, "I wonder if I should change that to 'connectbyname' instead just to make it clear that I'm not replacing the existing connect() call?" But then I thought, "No, there's a thousand ways someone determined to misunderstand what I'm saying will find to misunderstand it. To someone who wants to understand my point, this is crystal clear." -Bill -- William D. Herrin ................ herrin@dirtside.com bill@herrin.us 3005 Crane Dr. ...................... Web: <http://bill.herrin.us/> Falls Church, VA 22042-3004
On Mar 1, 2012, at 9:34 PM, William Herrin wrote:
On Thu, Mar 1, 2012 at 8:47 PM, Owen DeLong <owen@delong.com> wrote:
On Mar 1, 2012, at 5:15 PM, William Herrin wrote:
On Thu, Mar 1, 2012 at 8:02 PM, Owen DeLong <owen@delong.com> wrote:
There's no need to break the current functionality of the underlying system calls and libc functions which would be needed by any such library anyway.
Owen,
Point to one sentence written by anybody in this entire thread in which breaking current functionality was proposed.
When you said that:
connect(char *name, uint16_t port) should work
That can't work without breaking the existing functionality of the connect() system call.
You know, when I wrote 'socket=connect("www.google.com",80,TCP);' I stopped and thought to myself, "I wonder if I should change that to 'connectbyname' instead just to make it clear that I'm not replacing the existing connect() call?" But then I thought, "No, there's a thousand ways someone determined to misunderstand what I'm saying will find to misunderstand it. To someone who wants to understand my point, this is crystal clear."
I'm all for additional library functionality built on top of what exists that does what you want. As I said, there are many such libraries out there to do that. If someone wants to add it to libc, more power to them. I'm not the libc maintainer. I just don't want connect() to stop working the way it does, or for getaddrinfo() to stop working the way it does.

Since you were hell-bent on calling the existing mechanisms broken rather than conceding the point that the current process is not broken but could stand some improvements in the library (I even say as much myself at http://owend.corp.he.net/ipv6), it was not entirely clear that you did not intend to replace connect() rather than augment the current capabilities with additional, more abstract functions with different names.

Owen
On Fri, Mar 2, 2012 at 1:03 AM, Owen DeLong <owen@delong.com> wrote:
On Mar 1, 2012, at 9:34 PM, William Herrin wrote:
You know, when I wrote 'socket=connect("www.google.com",80,TCP);' I stopped and thought to myself, "I wonder if I should change that to 'connectbyname' instead just to make it clear that I'm not replacing the existing connect() call?" But then I thought, "No, there's a thousand ways someone determined to misunderstand what I'm saying will find to misunderstand it. To someone who wants to understand my point, this is crystal clear."
"Hyperbole." If I had remembered the word, I could have skipped the long description.
I'm all for additional library functionality I just don't want conect() to stop working the way it does or for getaddrinfo() to stop working the way it does.
Good. Let's move on.

First question: who actually maintains the standard for the C sockets API these days? Is it a POSIX standard?

Next, we have a set of APIs with which, with sufficient caution and skill (which is rarely the case), it's possible to string together a reasonable process which starts with some kind of name in a text string and ends with established communication with a remote server, for any sort of name and any sort of protocol. These APIs are complete, but we repeatedly see certain kinds of errors committed while using them.

Is there a common set of activities an application programmer intends to perform 9 times out of 10 when using getaddrinfo+connect? I think there is, and it has the following functionality:

Create a [stream] to one of the hosts satisfying [name] + [service] within [timeout] and return a [socket].

Does anybody disagree? Here's my reasoning:

Better than 9 times out of 10 it's a stream, and usually a TCP stream at that. Connect also designates a receiver for a connectionless protocol like UDP, but its use for that has always been a little peculiar since the protocol doesn't actually connect. And indeed, sendto() can designate a different receiver for each packet sent through the socket.

Name + Service. If TCP, a hostname and a port. Sometimes you want to start multiple connection attempts in parallel, or have some not-quite-threaded process implement its own scheduler for dealing with multiple connections at once, but that's the exception. Usually the only reason for dealing with connect() in non-blocking mode is that you want to implement sensible error recovery with timeouts.

And the timeout - the direction that control should be returned to the caller no later than X. If it would take more than X to complete, then fail instead.

Next item: how would this work under the hood?

Well, you have two tasks: find a list of candidate endpoints from the name, and establish a connection to one of them.

Find the candidates: ask all available name services in parallel (hosts, NIS, DNS, etc). Finished when:

1. All services have responded negative (failure).
2. You have a positive answer and all services which have not yet answered are at a lower priority (e.g. hosts answers, so you don't need to wait for NIS and DNS).
3. You have a positive answer from at least one name service and 1/2 of the requested timeout has expired.
4. The full timeout has expired (failure).

Cache the knowledge somewhere along with TTLs (locally defined if the name service doesn't explicitly provide a TTL). This may well be the first of a series of connection requests for the same host. If cached and TTL-valid knowledge was known for this name for a particular service, don't ask that service again.

Also need to let the app tell us to deprioritize a particular result later on. Why? Let's say I get an HTTP connection to a host but then that connection times out. If the app is managing the address list, it can try again to another address for the same name. We're now hiding that detail from the app, so we need a callback for the app to tell us, "when I try again, avoid giving me this answer because it didn't turn out to work."

So, now we have a list of addresses with valid TTLs as of the start of our connection attempt. Next step: start the connection attempt.

Pick the "first" address (chosen by whatever the ordering rules are) and send the connection request packet and let the OS do its normal retry schedule. Wait one second (system or sysctl configurable) or until the previous connection request was either accepted or rejected, whichever is shorter. If not connected yet, background it, pick the next address and send a connection request. Repeat until a connection request has been issued to all possible destination addresses for the name. Finished when:

1. Any of the pending connection requests completes (others are aborted).
2. The timeout is reached (all pending requests aborted).

Once a connection is established, this should be cached alongside the address and its TTL so that next time around that address can be tried first.

Thoughts? The idea here, of course, is that any application which uses this function to make its connections should, at an operations level, do a good job handling both multiple addresses with one of them unreachable and host renumbering that relies on the DNS TTL.
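(For concreteness, the prototype for that operation might look something like the sketch below. The connectbyname() name and its parameters are purely illustrative; this is not an existing call, just the "name + service + timeout in, socket out" shape described above.)

/*
 * Hypothetical prototype only; no such function exists in libc today.
 * "Create a stream to one of the hosts satisfying name + service
 * within timeout and return a socket."
 */
int connectbyname(const char *name,      /* e.g. "www.google.com"          */
                  const char *service,   /* e.g. "http", resolved as well  */
                  int socktype,          /* SOCK_STREAM 9 times out of 10  */
                  int timeout_ms);       /* hard bound on the whole call   */
/* Returns a connected descriptor, or -1 with errno set on failure/timeout. */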
Since you were hell bent on calling the existing mechanisms broken rather than conceding the point that the current process is not broken, but, could stand some improvements in the library
I hold that if an architecture encourages a certain implementation mistake largely to the exclusion of correct implementations, then that architecture is in some way broken. That error may be in a particular component, but it could be that the components themselves are correct: there could be a missing component, or the components could be strung together in a way that doesn't work right. Regardless of the exact cause, there is an architecture-level mistake which is the root cause of the consistently broken implementations.

Regards, Bill Herrin

-- 
William D. Herrin ................ herrin@dirtside.com bill@herrin.us
3005 Crane Dr. ...................... Web: <http://bill.herrin.us/>
Falls Church, VA 22042-3004
On Mar 2, 2012, at 10:12 AM, William Herrin wrote:
On Fri, Mar 2, 2012 at 1:03 AM, Owen DeLong <owen@delong.com> wrote:
On Mar 1, 2012, at 9:34 PM, William Herrin wrote:
You know, when I wrote 'socket=connect("www.google.com",80,TCP);' I stopped and thought to myself, "I wonder if I should change that to 'connectbyname' instead just to make it clear that I'm not replacing the existing connect() call?" But then I thought, "No, there's a thousand ways someone determined to misunderstand what I'm saying will find to misunderstand it. To someone who wants to understand my point, this is crystal clear."
"Hyperbole." If I had remembered the word, I could have skipped the long description.
I'm all for additional library functionality I just don't want conect() to stop working the way it does or for getaddrinfo() to stop working the way it does.
Good. Let's move on.
First question: who actually maintains the standard for the C sockets API these days? Is it a POSIX standard?
Well, some of it seems to be documented in RFCs, but I think what you're wanting doesn't require additions to the sockets library per se. In fact, I think wanting to make it part of that is a mistake. As I said, this should be a higher-level library. For example, in Perl you have Socket (and Socket6), but you also have several other abstraction libraries such as Net::HTTP. While there's no hierarchical naming scheme for the functions in libc, if you look at the source for any of the open source libc libraries out there, you'll find definite hierarchy.

POSIX certainly controls one standard. The GNU libc maintainers control the standard for the libc that accompanies GCC, to the best of my knowledge. I would suggest that is probably the best place to start, since I think anything that gains acceptance there will probably filter to the others fairly quickly.
Next, we have a set of APIs which, with sufficient caution and skill (which is rarely the case) it's possible to string together a reasonable process which starts with a some kind of name in a text string and ends with established communication with a remote server for any sort of name and any sort of protocol. These APIs are complete but we repeatedly see certain kinds of error committed while using them.
Right... Since these are user-errors (at the developer level) I wouldn't try to fix them in the APIs. I would, instead, build more developer proof add-on APIs on top of them.
Is there a common set of activities an application programmer intends to perform 9 times out of 10 when using getaddrinfo+connect? I think there is, and it has the following functionality:
Create a [stream].to one of the hosts satisfying [name] + [service] within [timeout] and return a [socket].
Seems reasonable, but ignores UDP. If we're going to do this, I think we should target a more complete solution to include a broader range of probabilities than just the most common TCP connect scenario.
Does anybody disagree? Here's my reasoning:
Better than 9 times out of 10 a steam and usually a TCP stream at that. Connect also designates a receiver for a connectionless protocol like UDP, but its use for that has always been a little peculiar since the protocol doesn't actually connect. And indeed, sendto() can designate a different receiver for each packet sent through the socket.
Most applications using UDP that I have seen use sendto()/recvfrom() et al. Netflow data would suggest that it's less than 9 times out of 10 for TCP, but, yes, I would agree it is the most common scenario.
Name + Service. If TCP, a hostname and a port.
That would apply to UDP as well. Just the semantics of what you do once you have the filehandle are different. (and it's not really a stream, per se).
Sometimes you want to start multiple connection attempts in parallel or have some not-quire-threaded process implement its own scheduler for dealing with multiple connections at once, but that's the exception. Usually the only reason for dealing with the connect() in non-blocking mode is that you want to implement sensible error recover with timeouts.
Agreed.
And the timeout - the direction that control should be returned to the caller no later than X. If it would take more than X to complete, then fail instead.
Actually, this is one thing I would like to see added to connect() and that could be done without breaking the existing API.
Next item: how would this work under the hood?
Well, you have two tasks: find a list of candidate endpoints from the name, and establish a connection to one of them.
Find the candidates: ask all available name services in parallel (hosts, NIS, DNS, etc). Finished when:
1. All services have responded negative (failure)
2. You have a positive answer and all services which have not yet answered are at a lower priority (e.g. hosts answers, so you don't need to wait for NIS and DNS).
3. You have a positive answer from at least one name service and 1/2 of the requested time out has expired.
4. The full time out has expired (failure).
I think the existing getaddrinfo() does this pretty well already. I will note that the services you listed only apply to resolving the host name. Don't forget that you might also need to resolve the service to a port number. (An application should be looking up HTTP, not assuming it is 80, for example). Conveniently, getaddrinfo simultaneously handles both of these lookups.
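(As a concrete illustration of that dual lookup, something like the following resolves the host and the service in one getaddrinfo() call; the host name passed in is just a placeholder.)

#include <netdb.h>
#include <stdio.h>
#include <string.h>
#include <sys/socket.h>

/* Resolve both the host name and the service name in a single call;
 * "http" is looked up in the services database rather than hard-coding 80. */
int
lookup_stream(const char *host, struct addrinfo **res)
{
        struct addrinfo hints;
        int err;

        memset(&hints, 0, sizeof(hints));
        hints.ai_family = AF_UNSPEC;        /* IPv4 or IPv6 */
        hints.ai_socktype = SOCK_STREAM;
        hints.ai_flags = AI_ADDRCONFIG;

        err = getaddrinfo(host, "http", &hints, res);
        if (err != 0)
                fprintf(stderr, "getaddrinfo: %s\n", gai_strerror(err));
        return err;
}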
Cache the knowledge somewhere along with TTLs (locally defined if the name service doesn't explicitly provide a TTL). This may well be the first of a series of connection requests for the same host. If cached and TTL valid knowledge was known for this name for a particular service, don't ask that service again.
I recommend against doing this above the level of getaddrinfo(). Just call getaddrinfo() again each time you need something. If it has cached data, it will return quickly and is cheap. If it doesn't return quickly, it will still work just as quickly as anything else most likely. If getaddrinfo() on a particular system is not well behaved, we should seek to fix that implementation of getaddrinfo(), not write yet another replacement.
Also need to let the app tell us to deprioritize a particular result later on. Why? Let's say I get an HTTP connection to a host but then that connection times out. If the app is managing the address list, it can try again to another address for the same name. We're now hiding that detail from the app, so we need a callback for the app to tell us, "when I try again, avoid giving me this answer because it didn't turn out to work."
I would suggest that instead of making this opaque and then complicating it with these hints when we return, we use a mechanism where we return a pointer to a dynamically allocated result (similar to getaddrinfo), and if we get called again with a pointer to that structure, we know to delete the previously connected host from the list we try next time. When the application is done with the struct, it should free it by calling an appropriate free function exported by this new API.
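(A rough sketch of how that might be shaped, purely as an illustration; none of these names exist anywhere today.)

/* Hypothetical sketch of the mechanism described above. */
struct cbn_result;      /* opaque; allocated by the library, records the
                           address that was handed back last time */

/*
 * First call: *resp is NULL and the library allocates a result.
 * Retry: the caller passes the previous *resp back in and the library
 * skips the address that did not work out.
 * Returns a connected descriptor, or -1 on failure.
 */
int connectbyname_r(const char *name, const char *service,
                    int timeout_ms, struct cbn_result **resp);

/* Called by the application when it is finished with the result. */
void cbn_result_free(struct cbn_result *resp);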
So, now we have a list of addresses with valid TTLs as of the start of our connection attempt. Next step: start the connection attempt.
Pick the "first" address (chosen by whatever the ordering rules are) and send the connection request packet and let the OS do its normal retry schedule. Wait one second (system or sysctl configurable) or until the previous connection request was either accepted or rejected, whichever is shorter. If not connected yet, background it, pick the next address and send a connection request. Repeat until a one connection request has been issued to all possible destination addresses for the name.
Finished when:
1. Any of the pending connection requests completes (others are aborted).
2. The time out is reached (all pending request aborted).
Once a connection is established, this should be cached alongside the address and its TTL so that next time around that address can be tried first.
Seems mostly reasonable. I would consider some form of inverse exponential backoff on the initial connection attempts: maybe wait 5 seconds for the first attempt before launching the second, 2 seconds before the third, then 1 second, bottoming out somewhere around 500ms for the remainder.
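(That stagger schedule, if you wanted to write it down, could be as simple as the sketch below; the numbers are just the ones suggested above and would presumably be tunable.)

/* Sketch of the suggested inverse exponential stagger, in milliseconds. */
static int
stagger_ms(int attempt)
{
        static const int schedule[] = { 5000, 2000, 1000 };
        int n = (int)(sizeof(schedule) / sizeof(schedule[0]));

        return (attempt < n) ? schedule[attempt] : 500;
}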
Since you were hell bent on calling the existing mechanisms broken rather than conceding the point that the current process is not broken, but, could stand some improvements in the library
I hold that if an architecture encourages a certain implementation mistake largely to the exclusion of correct implementations then that architecture is in some way broken. That error may be in a particular
I don't believe that the architecture encourages the implementation mistake. Rather, I think human behavior and our tendency not to seek proper understanding of the theory of operation of various things prior to implementing things which depend on them is more at fault. I suppose that you can argue that the API should be built to avoid that, but, we'll have to agree to disagree on that point. I think that low-level APIs (and this is a low-level API) have to be able to rely on the engineers that use them making the effort to understand the theory of operation. I believe that the fault here is the lack of a standardized higher-level API in some languages.
component, but it could be that the components themselves are correct. There could be in a missing component or the components could strung together in a way that doesn't work right. Regardless of the exact cause, there is an architecture level mistake which is the root cause of the consistently broken implementations.
I suppose by your definition this constitutes a missing component. I don't see it that way. I see it as a complete and functional system for a low-level API. There are high-level APIs available. As you have noted, some better than others. A standardized well-written high-level API would, indeed, be useful. However, that does not make the low-level API broken just because it is common for poorly trained users to make improper use of it. It is common for people using hammers to hit their thumbs. This does not mean that hammers are architecturally broken or that they should be re-engineered to have elaborate thumb-protection mechanisms. The fact that you can electrocute yourself by sticking a fork into a toaster while it is operating is likewise, not an indication that toasters are architecturally broken. It is precisely this attitude that has significantly increased the overhead and unnecessary expense of many systems while making product liability lawyers quite wealthy. Owen
In a message written on Thu, Mar 01, 2012 at 05:02:30PM -0800, Owen DeLong wrote:
Then push for better written abstraction libraries. There's no need to break the current functionality of the underlying system calls and libc functions which would be needed by any such library anyway.
Agree in part and disagree in part.

I think where the Open Source community has fallen behind in the last decade is application-level libraries. Open source pioneered cross-platform libraries (libX11, libresolv, libm) in the early days, and the benefit was they worked darn near exactly the same on all platforms. It made programming and porting easier and led to growth in the ecosystem.

Today that mantle has been taken up by Apple and Microsoft. In Objective-C, for example, I can in one line of code say "retrieve this URL", and the libraries know about DNS, IPv4 vs. IPv6, happy eyeballs algorithms, multi-threading parts so that the user doesn't wait, and so on. Typical application programs on these platforms never make any of the system calls that have been discussed in this thread.

Unfortunately the open source world is without even basic enhancements. Library work in many areas has stagnated, and in the areas where it is progressing it's often done in a way that makes the same library (by name) perform differently on different operating systems! Plenty of people have done research finding rampant file copying and duplication of code, and that's a bad sign:

http://tagide.com/blog/2011/09/file-cloning-in-open-source-the-good-the-bad-...
http://www.solidsourceit.com/blog/?p=4
http://pages.cs.wisc.edu/~shanlu/paper/TSE-CPMiner.pdf

I can't find it now, but there was a paper a few years back that looked for a hash or CRC algorithm because they were easy to identify in source by the fixed, unique constant they used. In the Linux kernel alone there were something like 10 implementations; widen that to all software in the application repository and there were something like 10,000 instances of (nearly) the same code!

Now, where I disagree. Better libraries mean not just better ones at a high level (fetch me this URL), but better ones at a lower level. For instance, libresolv discussed here is old and creaky. It was designed for a different time. Many folks doing DNS work have moved on to libldns from Unbound because libresolv does not do what they need with respect to DNSSEC or IPv4/IPv6 issues.

I think the entire community needs to come together with a strong bit of emphasis on libraries, standardizing them, making them ship with the base OS so programmers can count on them, and rolling in new stuff that needs to be in them on a timely basis. Apple and Microsoft do it with their (mostly closed) platforms; open source can do it better.

-- 
Leo Bicknell - bicknell@ufp.org - CCIE 3440
PGP keys at http://www.ufp.org/~bicknell/
On Mar 1, 2012, at 17:10, William Herrin <bill@herrin.us> wrote:
If took you 50 lines of code to do 'socket=connect("www.google.com",80,TCP);' and you still managed to produce a version which, due to the timeout on dead addresses, is worthless for any kind of interactive program like a web browser. And because that code isn't found in a system library, every single application programmer has to write it all over again.
I'm a fan of Rube Goldberg machines but that was ridiculous.
I'm thinking for this to work it would have to be 2 separate calls:

Call 1 being to the resolver (using lwres, system resolver, or whatever you want to use) and returning an array of struct addrinfo - same as gai does currently. If applications need TTL/SRV/$NEWRR awareness it would be implemented here.

Call 2 would be a "happy eyeballs" connect syscall (mconnect? In the spirit of sendmmsg) which accepts an array of struct addrinfo and returns an fd. In the case of O_NONBLOCK it would return a dummy fd (as non-blocking connects do currently) then once one of the connections finishes handshake the kernel connects it to the FD and signals writable to trigger select/poll/epoll. This allows developers to keep using the same loops (and most of the APIs) they're already comfortable with, keeps DNS out of the kernel, but hopefully provides a better and easier to use connect() experience, for SOCK_STREAM at least.

It's not as neat as a single connect() accepting a name, but seems to be a happy medium and provides a standardized/predictable connect() experience without breaking existing APIs.

~Matt
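(If it helps to see it written out, the second call might be declared roughly as below. mconnect() is Matt's hypothetical name; no such syscall exists, and the exact signature here is just a guess at the shape being described.)

#include <netdb.h>

/*
 * Hypothetical "happy eyeballs" connect: takes the addrinfo array from
 * the resolver call, races connection attempts, and returns one fd.
 * With O_NONBLOCK in 'flags' the fd is returned immediately and becomes
 * writable (for select/poll/epoll) once one handshake completes.
 */
int mconnect(const struct addrinfo *list, int nitems, int flags);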
In message <596196444196086313@unknownmsgid>, Matt Addison writes:
On Mar 1, 2012, at 17:10, William Herrin <bill@herrin.us> wrote:
If took you 50 lines of code to do 'socket=connect("www.google.com",80,TCP);' and you still managed to produce a version which, due to the timeout on dead addresses, is worthless for any kind of interactive program like a web browser. And because that code isn't found in a system library, every single application programmer has to write it all over again.
I'm a fan of Rube Goldberg machines but that was ridiculous.
I'm thinking for this to work it would have to be 2 separate calls:
Call 1 being to the resolver (using lwres, system resolver, or whatever you want to use) and returning an array of struct addrinfo- same as gai does currently. If applications need TTL/SRV/$NEWRR awareness it would be implemented here.
Call 2 would be a "happy eyeballs" connect syscall (mconnect? In the spirit of sendmmsg) which accepts an array of struct addrinfo and returns an fd. In the case of O_NONBLOCK it would return a dummy fd (as non-blocking connects do currently) then once one of the connections finishes handshake the kernel connects it to the FD and signals writable to trigger select/poll/epoll. This allows developers to keep using the same loops (and most of the APIs) they're already comfortable with, keeps DNS out of the kernel, but hopefully provides a better and easier to use connect() experience, for SOCK_STREAM at least.
It's not as neat as a single connect() accepting a name, but seems to be a happy medium and provides a standardized/predictable connect() experience without breaking existing APIs.
~Matt
And you can do the same in userland with kqueue and similar.

int connectxx(struct addrinfo *res0, int *fd, int *timeout, void **state);

  0            *fd is a connected socket.
  EINPROGRESS  Wait on '*fd' with a timeout of 'timeout' nanoseconds.
  ETIMEDOUT    connect failed.

If timeout or state is NULL you block. You re-call with res0 set to NULL.

-- 
Mark Andrews, ISC
1 Seymour St., Dundas Valley, NSW 2117, Australia
PHONE: +61 2 9871 4742                 INTERNET: marka@isc.org
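(A rough guess at how a caller would drive that interface, based only on the semantics sketched above; connectxx() is not an existing function, and the nanosecond-to-millisecond conversion for poll() is an assumption about the units.)

#include <errno.h>
#include <netdb.h>
#include <poll.h>
#include <stddef.h>

/* Declaration of the proposed (non-existent) call, as given above. */
int connectxx(struct addrinfo *res0, int *fd, int *timeout, void **state);

/* Drive the hypothetical connectxx() until it yields a socket or fails. */
int
connect_with_connectxx(struct addrinfo *res0)
{
        int fd = -1, timeout = 0, ret;
        void *state = NULL;
        struct pollfd pfd;

        ret = connectxx(res0, &fd, &timeout, &state);
        while (ret == EINPROGRESS) {
                pfd.fd = fd;
                pfd.events = POLLOUT;
                (void)poll(&pfd, 1, timeout / 1000000);  /* ns -> ms */
                ret = connectxx(NULL, &fd, &timeout, &state);
        }
        return (ret == 0) ? fd : -1;
}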
On Thu, Mar 1, 2012 at 10:01 AM, Michael Thomas <mike@mtcc.com> wrote:
On 03/01/2012 06:26 AM, William Herrin wrote:
The even simpler approach: create an AF_NAME with a sockaddr struct that contains a hostname instead of an IPvX address. Then let connect() figure out the details of caching, TTLs, protocol and address selection, etc. Such a connect() could even support a revised TCP stack which is able to retry with the other addresses at the first subsecond timeout rather than camping on each address in sequence for the typical system default of two minutes.
The effect of what you're recommending is to move all of this into the kernel, and in the process greatly expand its scope.
Hi Michael, libc != kernel. I want to move the action into the standard libraries where it can be done once and done well. A little kernel action on top to parallelize connection attempts where there are multiple candidate addresses would be gravy, but not required.
even if you did this, you'd be saddled with the same problem because nothing existing would use an AF_NAME.
It won't instantly fix everything so we shouldn't do it at all?
what if I want, say, asynchronous name resolution? What if I want to use SRV records? What if a new DNS RR comes around
Then you do it the long way, same as you do now. But in the 99% of the time that you're initiating a connection the "normal" way, you don't have to (badly) reinvent the wheel.
As far as I could tell, standard distros don't have libraries with lower-level access to DNS (in my case, it needed to not block). Before positing a super-deluxe gethostbyxx that does address picking, etc., etc., it would be better to lobby all of the distros to settle on a decomposed resolver library from which that and more could be built.
(A) Revised standards are -how- multiple OSes from multiple vendors coordinate the deployment of an identical capability. (B) Application programmers generally DO want the abstraction from "DNS" to "Name resolution." If there's an /etc/hosts name or a NIS name or a Windows name available, you ordinarily want to use it. You don't want to build extra code to search each name service independently any more than you want to build extra code to cycle through candidate addresses. Regards, Bill Herrin -- William D. Herrin ................ herrin@dirtside.com bill@herrin.us 3005 Crane Dr. ...................... Web: <http://bill.herrin.us/> Falls Church, VA 22042-3004
On 03/01/2012 08:58 AM, William Herrin wrote:
On Thu, Mar 1, 2012 at 10:01 AM, Michael Thomas<mike@mtcc.com> wrote:
On 03/01/2012 06:26 AM, William Herrin wrote:
The even simpler approach: create an AF_NAME with a sockaddr struct that contains a hostname instead of an IPvX address. Then let connect() figure out the details of caching, TTLs, protocol and address selection, etc. Such a connect() could even support a revised TCP stack which is able to retry with the other addresses at the first subsecond timeout rather than camping on each address in sequence for the typical system default of two minutes.
The effect of what you're recommending is to move all of this into the kernel, and in the process greatly expand its scope. Hi Michael,
libc != kernel. I want to move the action into the standard libraries where it can be done once and done well. A little kernel action on top to parallelize connection attempts where there are multiple candidate addresses would be gravy, but not required.
connect(2) is a kernel level call just like open(2), etc. It may have a thin wrapper, but that's OS dependent, IIRC. man connect 2: "The connect() system call connects the socket referred to by the file descriptor..." Mike
On Thu, Mar 1, 2012 at 1:32 PM, Michael Thomas <mike@mtcc.com> wrote:
On 03/01/2012 08:58 AM, William Herrin wrote:
libc != kernel. I want to move the action into the standard libraries where [resolve and connect] can be done once and done well. A little kernel action on top to parallelize connection attempts where there are multiple candidate addresses would be gravy, but not required.
connect(2) is a kernel level call just like open(2), etc. It may have a thin wrapper, but that's OS dependent, IIRC.
man connect 2:
"The connect() system call connects the socket referred to by the file descriptor..."
Then name the new one something else and document it in man section 3. Next objection? -Bill -- William D. Herrin ................ herrin@dirtside.com bill@herrin.us 3005 Crane Dr. ...................... Web: <http://bill.herrin.us/> Falls Church, VA 22042-3004
On Mar 1, 2012, at 10:01 AM, Michael Thomas wrote:
The real issue is that gethostbyxxx has been inadequate for a very long time. Moving it across the kernel boundary solves nothing and most likely causes even more trouble: what if I want, say, asynchronous name resolution? What if I want to use SRV records? What if a new DNS RR comes around -- do i have do recompile the kernel? It's for these reasons and probably a whole lot more that connect just confuses the actual issues.
<software-developer-hat-on> My experience is that these calls are expensive and require a lot of work to get a true result. Some systems also have interim caching that happens as well (e.g. NSCD). When building software that did a lot of dns lookups at once, I had to build my own internal cache to maintain performance. Startup costs were expensive, but maintaining it started to space out a bit more and be less of an issue. I ended up caching these entries for 1 hour by default. </hat ?xml-fail> - jared
On Mar 1, 2012, at 6:26 AM, William Herrin wrote:
On Thu, Mar 1, 2012 at 7:20 AM, Owen DeLong <owen@delong.com> wrote:
The simpler approach and perfectly viable without mucking up what is already implemented and working:
Don't keep returns from GAI/GNI around longer than it takes to cycle through your connect() loop immediately after the GAI/GNI call.
The even simpler approach: create an AF_NAME with a sockaddr struct that contains a hostname instead of an IPvX address. Then let connect() figure out the details of caching, TTLs, protocol and address selection, etc. Such a connect() could even support a revised TCP stack which is able to retry with the other addresses at the first subsecond timeout rather than camping on each address in sequence for the typical system default of two minutes.
That's not simpler for the following reasons:

1. It takes away abilities to manage the connect() process that some applications want.
2. It requires a rewrite of a whole lot of software built on the current mechanisms.

Most systems provide a mechanism for overriding the timeout for connect(). Further, there are lots of classes, libraries, etc. that you can already use if you want to abstract the gai/gni + connect functionality.

What exists isn't broken at the API level. Please stop trying to fix what is not broken.

Owen
In message <CAP-guGVA4eHv0K=U=x2B-WPYDy2RQ7ZE1Di2AHc+dmA_huyGzA@mail.gmail.com>, William Herrin writes:
On Mon, Feb 27, 2012 at 3:43 PM, david raistrick <drais@icantclick.org> wrote:
On Mon, 27 Feb 2012, William Herrin wrote:
In some cases this is because of carelessness: The application does a gethostbyname once when it starts, grabs the first IP address in the list and retains it indefinitely. The gethostbyname function doesn't even pass the TTL to the application. Ntpd is/used to be one of the notable offenders, continuing to poll the dead address for years after the server moved.
While yes it often is carelessness - it's been reported by hardcore development sorts that I trust that there is no standardized API to obtain the TTL... What needs to get fixed is get[hostbyname,addrinfo,etc] so programmers have better tools.
Meh. What should be fixed is that connect() should receive a name instead of an IP address. Having an application deal directly with the IP address should be the exception rather than the rule. Then, deal with the TTL issues once in the standard libraries instead of repeatedly in every single application.
No. connect() should stay the way it is. Most developers cut and paste the connection code. It's just that the code they cut and paste is very old and is often IPv4 only.
In theory, that'd even make the app code protocol agnostic so that it doesn't have to be rewritten yet again for IPv12.
The getaddrinfo() man page has IP-version-agnostic code examples. It is, however, simplistic code which doesn't behave well when an address is unreachable. For examples of how to behave better for TCP see: https://www.isc.org/community/blog/201101/how-to-connect-to-a-multi-homed-se...

Mark

-- 
Mark Andrews, ISC
1 Seymour St., Dundas Valley, NSW 2117, Australia
PHONE: +61 2 9871 4742                 INTERNET: marka@isc.org
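(For reference, the man-page pattern being referred to is roughly the sequential loop below. It is a paraphrase of the common example rather than a quote of any particular man page: protocol agnostic, but it blocks inside connect() on each dead address in turn, which is exactly what the ISC examples at that URL are written to avoid.)

#include <netdb.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

/* Roughly the man-page pattern: try each address in order, blocking
 * in connect() for each one until something succeeds. */
int
simple_connect(const char *host, const char *service)
{
        struct addrinfo hints, *res, *res0;
        int fd = -1;

        memset(&hints, 0, sizeof(hints));
        hints.ai_family = AF_UNSPEC;
        hints.ai_socktype = SOCK_STREAM;
        if (getaddrinfo(host, service, &hints, &res0) != 0)
                return -1;
        for (res = res0; res != NULL; res = res->ai_next) {
                fd = socket(res->ai_family, res->ai_socktype,
                            res->ai_protocol);
                if (fd == -1)
                        continue;
                if (connect(fd, res->ai_addr, res->ai_addrlen) == 0)
                        break;          /* connected */
                close(fd);              /* may have blocked a long time */
                fd = -1;
        }
        freeaddrinfo(res0);
        return fd;
}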
On Mon, Feb 27, 2012 at 7:28 AM, William Herrin <bill@herrin.us> wrote:
On Sun, Feb 26, 2012 at 7:02 PM, Randy Carpenter <rcarpen@network1.net> wrote:
On Feb 26, 2012, at 4:56 PM, Randy Carpenter wrote:
1. Full redundancy with instant failover to other hypervisor hosts upon hardware failure (I thought this was a given!)
This is actually a much harder problem to solve than it sounds, and gets progressively harder depending on what you mean by "failover".
At the very least, having two physical hosts capable of running your VM requires that your VM be stored on some kind of SAN (usually iSCSI based) storage system. Otherwise, two hosts have no way of accessing your VM's data if one were to die. This makes things an order of magnitude or higher more expensive.
This does not have to be true at all. Even having a fully fault-tolerant SAN in addition to spare servers should not cost much more than having separate RAID arrays inside each of the server, when you are talking about 1,000s of server (which Rackspace certainly has)
Randy,
You're kidding, right?
SAN storage costs the better part of an order of magnitude more than server storage, which itself is several times more expensive than workstation storage. That's before you duplicate the SAN and set up the replication process so that cabinet and room level failures don't take you out.
This is clearly becoming a not-NANOG-ish thread, however...

Failing to have central shared storage (iSCSI, NAS, SAN, whatever you prefer) fails the smell test on a local enterprise-grade virtualization cluster, much less a shared cloud service.

Some people have done tricks with distributing the data using one of the research-ish shared filesystems, rather than separate shared storage. That can be made to work if the host OS model and its available shared filesystems work for you. Doesn't work for Vmware Vcenter / Vmotion-ish stuff as far as I know.

There are plenty of people doing non-enterprise-grade virtualization. There's no mandate that you have the ability to migrate a virtual to another node in realtime or restart it immediately on another node if the first node dies suddenly. But anyone saying "we have a cloud" and not providing that type of service is in marketing, not engineering. From a systems architecture point of view, you can't do that.
-- -george william herbert george.herbert@gmail.com
On Mon, Feb 27, 2012 at 11:19:27AM -0800, George Herbert wrote:
On Mon, Feb 27, 2012 at 7:28 AM, William Herrin <bill@herrin.us> wrote:
On Sun, Feb 26, 2012 at 7:02 PM, Randy Carpenter <rcarpen@network1.net> wrote:
On Feb 26, 2012, at 4:56 PM, Randy Carpenter wrote:
1. Full redundancy with instant failover to other hypervisor hosts upon hardware failure (I thought this was a given!)
This is actually a much harder problem to solve than it sounds, and gets progressively harder depending on what you mean by "failover".
At the very least, having two physical hosts capable of running your VM requires that your VM be stored on some kind of SAN (usually iSCSI based) storage system. Otherwise, two hosts have no way of accessing your VM's data if one were to die. This makes things an order of magnitude or higher more expensive.
This does not have to be true at all. Even having a fully fault-tolerant SAN in addition to spare servers should not cost much more than having separate RAID arrays inside each of the server, when you are talking about 1,000s of server (which Rackspace certainly has)
Randy,
You're kidding, right?
SAN storage costs the better part of an order of magnitude more than server storage, which itself is several times more expensive than workstation storage. That's before you duplicate the SAN and set up the replication process so that cabinet and room level failures don't take you out.
This is clearly becoming a not-NANOG-ish thread, however...
Failing to have central shared storage (iSCSI, NAS, SAN, whatever you prefer) fails the smell test on a local enterprise-grade virtualization cluster, much less a shared cloud service.
Some people have done tricks with distributing the data using one of the research-ish shared filesystems, rather than separate shared storage. That can be made to work if the host OS model and its available shared filesystems work for you. Doesn't work for Vmware Vcenter / Vmotion-ish stuff as far as I know.
There are plenty of people doing non-enterprise-grade virtualization. There's no mandate that you have the ability to migrate a virtual to another node in realtime or restart it immediately on another node if the first node dies suddenly. But anyone saying "we have a cloud" and not providing that type of service, is in marketing not engineering. From a systems architecture point of view, you can't do that.
"Cloud" is utterly meaningless drivel. Your idea of cloud is different from mine, which is different from my co-workers', my bosses', marketing's, etc. It's a vague, useless term that could mean everything from a bog-standard mail server through to full-on 'deploy your app' platforms like Heroku. It would be more accurate to focus on IaaS, PaaS, SaaS et al.

For what little it's probably worth mentioning, Amazon provides a shared storage platform in the form of EBS, Elastic Block Storage, which you can choose to use as your root device on your server if you so wish (I wouldn't advise it; latency is unpredictable), or you can have it mounted wherever is relevant for your data (the most common route). That's their storage provision that isn't tied to a physical server. If you pay extra it'll replicate, or even replicate between availability zones. You can also choose to have Amazon monitor and ensure sufficient numbers of your servers are running through autoscale.

Paul
On Mon, Feb 27, 2012 at 2:19 PM, George Herbert <george.herbert@gmail.com> wrote:
Failing to have central shared storage (iSCSI, NAS, SAN, whatever you prefer) fails the smell test on a local enterprise-grade virtualization cluster, much less a shared cloud service.
Hi George,

Why would you imagine that a $30/month virtual private server is built on an enterprise-grade virtualization cluster? You know what it costs to build fibre channel SANs and blade servers and DR. In what universe does $30/mo per customer recover that cost during the useful life of the equipment?

A VPS is 2012's version of 2002's web server + CGI and a unix shell. Quite useful, but don't expect magic from it.

Regards, Bill Herrin

-- 
William D. Herrin ................ herrin@dirtside.com bill@herrin.us
3005 Crane Dr. ...................... Web: <http://bill.herrin.us/>
Falls Church, VA 22042-3004
On Mon, Feb 27, 2012 at 3:45 PM, William Herrin <bill@herrin.us> wrote:
On Mon, Feb 27, 2012 at 2:19 PM, George Herbert <george.herbert@gmail.com> wrote:
Failing to have central shared storage (iSCSI, NAS, SAN, whatever you prefer) fails the smell test on a local enterprise-grade virtualization cluster, much less a shared cloud service.
Hi George,
Why would you imagine that a $30/month virtual private server is built on an enterprise-grade virtualization cluster? You know what it costs to builds fibre channel SANs and blade servers and DR. In what universe does $30/mo per customer recover that cost during the useful life of the equipment?
As I stated, one can either do it with SANs or with alternate storage. Amazon hit those price points with a custom distributed filesystem that's more akin to the research distributed filesystems than anything else. It's using node storage, but not single-node locked; if the physical dies it should not lose the data. Amazon wrote that filesystem, but one could approach the problem with an OTS research / labs distributed FS using blade or 1U internal disks and duplicate what they did. In the "enterprise" space, there's a lot more variety and flexibility too. I bought a 100 TB (raw) NAS / storage unit for well under $30k not that long ago. Even accounting for RAID6 and duplicate units on the network (network RAID1 across two units doing RAID6 internally), that would cover something like 250 "standard" AWS instances, or about $100/unit for the storage. At typical useful amortization (24 to 48 months) that's about $2 to $4/month/server. That's not an EMC, a Hitachi, a BlueArc, a NetApp, a Compellant, even a Nexsan. But one can walk up the curve relatively smoothly from that low end point to the bestest brightest highest-tier stuff depending on one's customers' needs.
A VPS is 2012's version of 2002's web server + CGI and a unix shell. Quite useful but don't expect magic from it.
There are plenty of services that know what they should do and do it reasonably well. AWS, above. There are also a lot of services that (without naming names) are floating out there in sketchy-land. One should both know better and expect better. It's possible to design reliable services - with geographical redundancy and the like between service providers, in case one corks - out of unreliable services. One should do some of that anyways, with clouds. But the quality of the underlying service varies a lot. If you're paying AWS prices for non-replicated storage, think carefully about what you're doing. If you're paying half of what AWS costs, and duplicating locations to handle outages, then you're probably ok. If you're paying more and getting better service, ok. -- -george william herbert george.herbert@gmail.com
On Mon, Feb 27, 2012 at 7:03 PM, George Herbert <george.herbert@gmail.com> wrote:
On Mon, Feb 27, 2012 at 3:45 PM, William Herrin <bill@herrin.us> wrote:
universe does $30/mo per customer recover that cost during the useful life of the equipment? As I stated, one can either do it with SANs or with alternate storage.
One should not assume Rackspace et al. provide any level of fault tolerance for extraordinary situations such as hardware failure beyond what they have promised or advertised. IaaS-provided redundancy is not always necessary, and may be unwanted in various situations due to its cost; single redundancy means a minimum of twice the cost of non-redundant (plus the overhead of failover coordination). For various computing applications it may make a great deal of sense to handle failure in software: should a node fail, the software can detect it, eject the node, and reassign its unfinished work units later.

Typical enterprise-level fault-tolerant SAN manufacturer prices seen are ~$12 to $15/GB of usable storage, for ~50 IOPS/TB, data mostly at rest, and SAN equipment has a useful lifetime of 5-6 years; a typical 200 GB server then exceeds $30/month in intrinsic FT SAN hardware cost (200 GB at $12/GB is about $2,400, or roughly $33/month amortized over 72 months). There IS a place for IaaS providers to sell such a product, probably at four or five times the $/month for a typical server. Just like there is a place for network service providers to sell transport and network access products that have redundancy built into the product, such as protected circuits, multiple ports, dual WAN, that can sustain any single router failure, etc. Those network products still can't reliably guarantee 100% uptime for the service.

There is also a place for IaaS providers to sell products where they do NOT promise a level of reliability/fault tolerance or performance that requires them to utilize an enterprise FC SAN or similar solution. Just like there is a place for NSPs to sell transport and network access products that will fail if a single router, card, or port fails, or if there is a single fiber break or erroneously unplugged cable. This way the end user can save on their network connectivity costs; a tradeoff based on the impact of the difference between their network with 1% downtime and their network with 0.001% downtime, versus the impact of the cost difference between those two options. End users may also prefer to implement their redundancy through dual-homing via multiple providers.

It is very important that the end user and the provider's sales/marketing know exactly which kind of product each offering is.
Amazon hit those price points with a custom distributed filesystem that's more akin to the research distributed filesystems than anything
Amazon is quite unique; developing a custom distributed filesystem is a rather extraordinary measure that provides them an advantage when selling certain services. But even EC2 instance storage is not guaranteed. The instance storage is scratch space: if your instance becomes degraded and you need to restart it, what you get is a "clean" instance matching the original image. EBS and S3 are another matter. The same provider that offers some 'unprotected' services may also offer more expensive storage services that have greater protections.

--
-JH
On Mon, 27 Feb 2012, William Herrin wrote:
Why would you imagine that a $30/month virtual private server is built on an enterprise-grade virtualization cluster?
A lot of the time "the cloud" is billed as just that. The reality is that it's more often a federated cluster of machines with some duct tape and glue. Sometimes that duct tape is name brand, sometimes it's not.

There are currently no viable OSS solutions that actually do HA in terms of storage or VMs. It's all basically storage + machine provisioning, with no health checking and no real auto-recovery. I've spent a fair amount of time digging into this for my business, and that's my state of the world.

That said, if you want HA or even "failover" with any provider, you basically need to look at an expensive VMware-based solution. There are projects out there to build truly HA OSS "clouds," but they're not ready yet, and they're not terribly cheap either.

-Tom
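For illustration, the kind of health-checking and auto-recovery glue described above ends up looking roughly like the sketch below. The hostnames, ports, and thresholds are placeholders; a real deployment would also repoint DNS records or floating IPs when a node is ejected.

    # Minimal health-check/eject loop. Hostnames, ports, and thresholds are
    # placeholders; this only tracks node state and prints ejections.
    import socket
    import time

    NODES = {"vps1.example.net": 53, "vps2.example.net": 53}  # host -> service port
    CHECK_INTERVAL = 30   # seconds between sweeps
    FAIL_THRESHOLD = 3    # consecutive failures before a node is ejected

    failures = {host: 0 for host in NODES}
    active = set(NODES)

    def tcp_alive(host, port, timeout=5.0):
        """Treat the node as healthy if the service port accepts a TCP connection."""
        try:
            with socket.create_connection((host, port), timeout=timeout):
                return True
        except OSError:
            return False

    while True:
        for host, port in NODES.items():
            if tcp_alive(host, port):
                failures[host] = 0
                active.add(host)          # a recovered node rejoins the pool
            else:
                failures[host] += 1
                if failures[host] >= FAIL_THRESHOLD and host in active:
                    active.discard(host)  # eject the node from service
                    print(f"ejecting {host} after {failures[host]} failed checks")
        time.sleep(CHECK_INTERVAL)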
We have found it effective, at least for things like DNS and backup MX, to simply swap some VPS and/or physical colo with another ISP outside our geographic area. Both protocols are designed for that kind of redundancy. It definitely has limitations, but it is also probably the cheapest solution.

- Mike

On Feb 26, 2012, at 2:56 PM, Randy Carpenter wrote:
Does anyone have any recommendation for a reliable cloud host?
We require 1 or 2 very small virtual hosts to host some remote services to serve as backup to our main datacenter. One of these services is a DNS server, so it is important that it is up all the time.
We have been using Rackspace Cloud Servers. We just realized that they have absolutely no redundancy or failover after experiencing a outage that lasted more than 6 hours yesterday. I am appalled that they would offer something called "cloud" without having any failover at all.
Basic requirements:
1. Full redundancy with instant failover to other hypervisor hosts upon hardware failure (I thought this was a given!) 2. Actual support (with a phone number I can call) 3. reasonable pricing (No, $800/month is not reasonable when I need a tiny 256MB RAM Server with <1GB/mo of data transfers)
thanks, -Randy
Hello,

On Sun, Feb 26, 2012 at 10:56 PM, Randy Carpenter <rcarpen@network1.net> wrote:
Does anyone have any recommendation for a reliable cloud host?
We require 1 or 2 very small virtual hosts to host some remote services to serve as backup to our main datacenter. One of these services is a DNS server, so it is important that it is up all the time.
We have been using Rackspace Cloud Servers. We just realized that they have absolutely no redundancy or failover after experiencing a outage that lasted more than 6 hours yesterday. I am appalled that they would offer something called "cloud" without having any failover at all.
Basic requirements:
1. Full redundancy with instant failover to other hypervisor hosts upon hardware failure (I thought this was a given!) 2. Actual support (with a phone number I can call) 3. reasonable pricing (No, $800/month is not reasonable when I need a tiny 256MB RAM Server with <1GB/mo of data transfers)
Well, as everyone has been saying, unfortunately with "infrastructure" clouds you have to engineer your setup to their standards to have failover. For example, Amazon (as mentioned in the thread) gives a 99.95% uptime SLA *if* you set up failover yourself across more than one "Availability Zone" within a region. Details are at http://docs.amazonwebservices.com/AWSEC2/latest/UserGuide/using-regions-avai... and http://blog.rightscale.com/2008/03/26/setting-up-a-fault-tolerant-site-using... (though clearer, this one is a bit of an advert). As mentioned, with Amazon you can use support if you pay for it; it's not included as standard. If you fancy some help, people like RightScale sound like exactly what you are after to make management much simpler for you (http://www.rightscale.com/products/why-rightscale.php), but pricing for services like that can be a little high for small setups, though they do have a free edition that may be suitable.

You can get the same kind of 99.95% SLA from other providers if you follow their deployment guidelines regarding their type of "zones". Microsoft will do it for not too much (http://www.windowsazure.com/en-us/support/sla/), includes online and telephone support in the price, and is in the process of making Red Hat Linux available.

But let's not forget that simply buying the software as a service is also an option, where fail-over becomes Someone Else's Problem. For DNS, EasyDNS (https://web.easydns.com/DNS_hosting.php) are rather good and not too expensive, and you can get a 100% up-time guarantee if you want. A review of them regarding availability is at http://www.theregister.co.uk/2012/01/31/why_i_use_easydns/

Do let us know who you end up picking and how it goes.

Alex
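To make the "more than one Availability Zone" point concrete, here is a minimal sketch that launches one small instance in each of two zones. It uses boto3 (a newer AWS SDK than existed when this thread was written) purely for illustration; the AMI ID and instance type are placeholders, and AWS credentials are assumed to be configured already.

    # Minimal sketch: launch one small instance in each of two Availability
    # Zones, which is the kind of layout the SLA language assumes.
    import boto3

    REGION = "us-east-1"
    AMI_ID = "ami-0123456789abcdef0"   # placeholder image ID
    INSTANCE_TYPE = "t3.micro"         # placeholder size

    ec2_client = boto3.client("ec2", region_name=REGION)
    ec2 = boto3.resource("ec2", region_name=REGION)

    # Pick the first two zones the region reports as available.
    zones = [z["ZoneName"]
             for z in ec2_client.describe_availability_zones()["AvailabilityZones"]
             if z["State"] == "available"][:2]

    for az in zones:
        instance = ec2.create_instances(
            ImageId=AMI_ID,
            InstanceType=INSTANCE_TYPE,
            MinCount=1,
            MaxCount=1,
            Placement={"AvailabilityZone": az},
        )[0]
        print(f"launched {instance.id} in {az}")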
[...] For DNS, EasyDNS (https://web.easydns.com/DNS_hosting.php) are rather good and not too expensive, and you can get a 100% up-time guarantee if you want. A review of them regarding availability is at http://www.theregister.co.uk/2012/01/31/why_i_use_easydns/
I have been a very satisfied EasyDNS customer for about a decade and concur with the article. Nothing is perfect, but the rapid response and support I've received have always been top-notch.
Do let us know who you end up picking and how it goes.
Indeed. "Cloud" outside of references to mists and objects in the sky is a completely meaningless term for operators. In fact, it has made it harder to differentiate between services (which I'm sure is the point). As an operator (knowing how things can be subject to accelerated roll-out when $business feels they are missing out), I wonder if a lot of these "cloud" service bumps-in-the-road aren't just a symptom of not being fully baked in. ~JasonG
On 2/27/2012 10:25 AM, Jason Gurtz wrote:
[...] For DNS, EasyDNS (https://web.easydns.com/DNS_hosting.php) are rather good and not too expensive, and you can get a 100% up-time guarantee if you want. A review of them regarding availability is at http://www.theregister.co.uk/2012/01/31/why_i_use_easydns/ I have been a very satisfied EasyDNS customer for about a decade and concur with the article. Nothing is perfect, but the rapid response and support I've received have always been top-notch.
I have been a satisfied DNS Made Easy customer for many years. Note: I am also an employee of DNS Made Easy. I was a customer for years before I became an employee.
Do let us know who you end up picking and how it goes. Indeed. "Cloud" outside of references to mists and objects in the sky is a completely meaningless term for operators. In fact, it has made it harder to differentiate between services (which I'm sure is the point).
As an operator (knowing how things can be subject to accelerated roll-out when $business feels they are missing out), I wonder if a lot of these "cloud" service bumps-in-the-road aren't just a symptom of not being fully baked in.
It depends on what you mean by "bumps-in-the-road"...

If you mean issues experienced by customers of cloud service providers, then the most common issues are a symptom of not implementing redundancy (anticipating failure) in their usage of the platform. There are a whole lot of folks who believe that they can buy an instance from Vendor =~ /.*cloud.*/ and all of their DR worries will magically be "taken care of" by the platform. That isn't the case.

Amazon is usually pretty good at providing RFOs after issues. Every RFO I have seen includes pointers to the Amazon redundancy configuration documents that the affected customers did not follow, which is why a platform issue turned into an outage for them.

DR when using cloud services is the same as DR has always been: look at all potential failures and then implement redundancy where the cost/benefit works out in favor of the redundancy. Document, test, rinse, lather, repeat. RightScale and other services like it provide tools to help.

-DMM
Linode.com is not cloud-based, but they offer IP failover between VPS instances at no additional charge. Their pricing is excellent, I have had no downtime issues with them in 3+ years across 3 different customers using them, and they have nice OOB and programmatic API access for controlling VPS instances as well.

Max

On 2/26/12, Randy Carpenter <rcarpen@network1.net> wrote:
Does anyone have any recommendation for a reliable cloud host?
We require 1 or 2 very small virtual hosts to host some remote services to serve as backup to our main datacenter. One of these services is a DNS server, so it is important that it is up all the time.
We have been using Rackspace Cloud Servers. We just realized that they have absolutely no redundancy or failover after experiencing a outage that lasted more than 6 hours yesterday. I am appalled that they would offer something called "cloud" without having any failover at all.
Basic requirements:
1. Full redundancy with instant failover to other hypervisor hosts upon hardware failure (I thought this was a given!) 2. Actual support (with a phone number I can call) 3. reasonable pricing (No, $800/month is not reasonable when I need a tiny 256MB RAM Server with <1GB/mo of data transfers)
thanks, -Randy
On Feb 26, 2012, at 5:56 PM, Randy Carpenter wrote:
We require 1 or 2 very small virtual hosts to host some remote services to serve as backup to our main datacenter. One of these services is a DNS server, so it is important that it is up all the time.
We have been using Rackspace Cloud Servers. We just realized that they have absolutely no redundancy or failover after experiencing a outage that lasted more than 6 hours yesterday. I am appalled that they would offer something called "cloud" without having any failover at all.
Pardon the weird question: Is the DNS service authoritative or recursive?

If auth, you can solve this a few ways, either by giving the DNS name people point to multiple AAAA (and A) records pointing at a diverse set of instances. DNS is designed to work around a host being down. Same goes for MX and several other services. While it may make the service slightly slower, it's certainly not the end of the world. Taking a mesh of services from Rackspace, EC2, The Planet, or any number of other hosting providers will allow you to roll your own.

The other solution is to go to a professional DNS service provider, e.g. Dyn, Verisign, EveryDNS or NeuStar. While you can run your own infrastructure, the barrier to operating it properly, and to doing it "right", gets a bit higher each year. I was recently shown an attack graph of a ~200Gb/s attack against a DNS server. *ouch*. Sometimes being professional is knowing when to say "I can't do this justice myself, perhaps it's better/easier/cheaper to pay someone to do it right".

- Jared

(Disclosure: I work for one of the above named companies, but not in a capacity related to anything in this email.)
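A rough sketch of what "multiple A and AAAA records behind one name" buys you: enumerate every address the name resolves to and probe each one, which is roughly what a well-behaved client does when one of the hosts is down. The hostname and port below are placeholders.

    # Enumerate every A/AAAA record behind one service name and probe each
    # address over TCP. Hostname and port are placeholders.
    import socket

    HOSTNAME = "ns.example.net"   # placeholder name with multiple A/AAAA records
    PORT = 53                     # probe DNS over TCP as a liveness check

    def reachable(info, timeout=5.0):
        family, socktype, proto, _canonname, sockaddr = info
        try:
            with socket.socket(family, socktype, proto) as s:
                s.settimeout(timeout)
                s.connect(sockaddr)
                return True
        except OSError:
            return False

    for info in socket.getaddrinfo(HOSTNAME, PORT, type=socket.SOCK_STREAM):
        ip = info[4][0]
        print(f"{ip}: {'up' if reachable(info) else 'down'}")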
Pardon the weird question:
Is the DNS service authoritative or recursive? If auth, you can solve this a few ways, either by giving the DNS name people point to multiple AAAA (and A) records pointing at a diverse set of instances.
Authoritative. But it is also not the only thing we are running that needs some geographic and route diversity.
DNS is designed to work around a host being down. Same goes for MX and several other services. While it may make the service slightly slower, it's certainly not the end of the world.
Oh, how I wish this were true in practice. If I had a dollar for every time we had serious issues because one of a few authoritative DNS servers was not responding... OK, I wouldn't be rich, but this happens all the time. Caching servers out on the net get a "non-answer" because the server they chose to ask was down, and they cache that. They shouldn't do that, but they do, and there's nothing that can be done about it.

-Randy
On Sun, Feb 26, 2012 at 4:56 PM, Randy Carpenter <rcarpen@network1.net> wrote:
We have been using Rackspace Cloud Servers. We just realized that they have absolutely no redundancy or failover after experiencing a outage that lasted more than 6 hours yesterday. I am appalled that they would offer something called "cloud" without having any failover at all.
Disclaimer: I work for Rackspace in a network architect capacity.

We have plenty of redundancy where it is needed. We have all sorts of solutions, for all sorts of intersections of problems, budgets, and customers. Sometimes finding the 'correct' solution is not as easy as it could or should be. The menu is simply getting crowded :)

I don't know the specifics of your issue, but if you contact me privately I can look into them. You can also use my work email address if you don't think I am legit (email me at this address to get it). I do find that the impact was quite extreme, and certainly an exception; many folks are probably still working on root causes and lessons learned. We take this stuff seriously.
1. Full redundancy with instant failover to other hypervisor hosts upon hardware failure (I thought this was a given!)
As others have mentioned, you will not be able to find some of these features for 1.5c/hr. They quickly spiral out of control for large-scale deployments. Every penny matters at 1.5c/hr. I would ask that you look at your own product portfolio and see if you have anything at that price for which you can answer a support phone call :) This is not meant to be antagonistic, just to have a clear understanding of the $$ we are talking about and how careful you have to be.

What these cloud price points allow you to do, though, is turn it from one type of problem into another type of problem that you can have more control over. As others have mentioned, spreading out across many different providers is one example. They (cloud, VMs, VPS, whatever you want to call them) are cheap, disposable computing resources - don't treat them as anything else! As with anything, you get what you pay for, and I am sure we have all had 'that customer' who claims $1,000,000 in losses for every hour of impact while having a single whitebox server deployed.
2. Actual support (with a phone number I can call)
This is where providers will typically start to differentiate themselves from each other. As a company, we pride ourselves on support. Full support has a price. I don't want to turn this into a sales-ish email, though.
3. reasonable pricing (No, $800/month is not reasonable when I need a tiny 256MB RAM Server with <1GB/mo of data transfers)
1.5c/hr is what our basic Linux image starts at, IIRC. Again, I am not in sales, so I don't really keep track of how that compares to some of the other folks out there; I would guess it is about the going rate. I have used Linode.com as well as EC2, and they both have some great feature sets and offers. Both also have areas that could use improvement.

I do agree that there are general misconceptions about what 'cloud' means. That is simply a byproduct of the number of folks involved in such trends and, yes, of the marketing folks getting involved as well. This is unavoidable in the world today.

If you have any other questions or concerns that I can help with, please let me know...

cheers,

--
jason
Randy Carpenter wrote:
Does anyone have any recommendation for a reliable cloud host?
Basic requirements:
1. Full redundancy with instant failover to other hypervisor hosts upon hardware failure (I thought this was a given!)
Assuming a simple setup, as you suggest. If what you want to do is a lot more complex, it would be worth your while to use your own hardware at a colo and, alternatively, set up your own VPSes.

I think your best bet is to design your systems with failover taken into account and not to depend on the VPS provider to provide it. Say you want SMTP in addition to DNS. You would set up a VPS in 2 different locations (or more) using 2 different VPS providers. You set up your favourite name server and email server on each server, configure your MX records to point to both, and tell your registrar to use both servers as nameservers for your domain(s). When a server goes offline, DNS queries and emails automagically go to the other server. No need to depend on one single VPS provider and their crappy infrastructure.
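A small sanity check for a setup like this, using dnspython (version 2.x assumed, where resolve() exists), verifies that the domain really does have at least two NS hosts and two MX hosts; the domain name is a placeholder.

    # Check that the domain has redundant NS and MX sets (at least two of each).
    # Uses dnspython; "example.com" is a placeholder.
    import dns.resolver

    DOMAIN = "example.com"

    def record_targets(domain, rtype):
        answers = dns.resolver.resolve(domain, rtype)
        if rtype == "MX":
            return {str(r.exchange).rstrip(".") for r in answers}
        return {str(r.target).rstrip(".") for r in answers}

    for rtype in ("NS", "MX"):
        targets = record_targets(DOMAIN, rtype)
        verdict = "ok" if len(targets) >= 2 else "NOT redundant"
        print(f"{rtype}: {len(targets)} host(s) ({', '.join(sorted(targets))}) -> {verdict}")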
3. reasonable pricing (No, $800/month is not reasonable when I need a tiny 256MB RAM Server with <1GB/mo of data transfers)
Lots of reasonably priced VPS providers out there. And once you have set up redundancy in your own design, it doesn't matter much how redundant they are. More important will be how spam/pollution-free the network neighbourhood is. Amazon would not be the best choice in that regard. I have had good luck with small local VPS providers, often ISPs.

Greetings,
Jeroen

--
Earthquake Magnitude: 3.2
Date: Thursday, March 1, 2012 16:31:08 UTC
Location: Central California
Latitude: 36.6378; Longitude: -121.2510
Depth: 5.50 km
participants (37):
- Alex Brooks
- Ben Carleton
- Bobby Mac
- Chuck Anderson
- David Conrad
- David Miller
- david raistrick
- George Herbert
- James M Keller
- Jared Mauch
- Jason Ackley
- Jason Gurtz
- Jeroen Massar
- Jeroen van Aart
- Jimmy Hess
- Joe Greco
- Jon Lewis
- Kevin Day
- Leigh Porter
- Leo Bicknell
- Mark Andrews
- Matt Addison
- Max
- Michael DeMan
- Michael Thomas
- Mike Lyon
- Owen DeLong
- Paul Graydon
- Randy Carpenter
- Robert Hajime Lanning
- Robert Suh
- Tei
- Tim Franklin
- Tom
- Tony Patti
- Valdis.Kletnieks@vt.edu
- William Herrin