On Oct 7, 2021, at 06:49 , Masataka Ohta <mohta@necom830.hpcl.titech.ac.jp> wrote:
William Herrin wrote:
This is quite common to tie an underlying service announcement to BGP announcements in an Anycast or similar environment.
Yes, that is a commonly seen mistake with anycast. You don't know what you're talking about.
I do but you don't.
If your anycast node stops receiving updated data and you can't reach any of the other nodes to check whether they're online, 99 times out of 100 this means a local failure of some sort.
Yes. In the case of DNS, if the expiration period of a zone passes without a successful check of the most current zone version, unicast or anycast name servers stop responding to requests for the zone.
But, it has nothing specifically to do with anycast. As there are other name servers with different IP addresses, there is no reason to withdraw routes. So?
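For reference, the zone-expiry behavior described above can be sketched roughly as follows. This is only an illustration, assuming the usual SOA EXPIRE semantics; the function name and parameters are made up for the example and are not taken from any particular name server implementation.

```python
from datetime import datetime, timedelta


def zone_is_expired(last_successful_check: datetime,
                    now: datetime,
                    soa_expire_seconds: int) -> bool:
    """Return True once the SOA EXPIRE interval has elapsed without a
    successful check against the primary.

    A server in this state stops answering for the zone, whether it is
    reached over a unicast or an anycast address.
    """
    return now - last_successful_check > timedelta(seconds=soa_expire_seconds)
```

With an EXPIRE value of 604800 seconds (one week), a server whose last successful check was eight days ago would report the zone as expired.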
WRONG. First, assuming that there are non-anycast name servers assumes facts not in evidence. Second, if you are a participant in an anycast name server network, there are good reasons to withdraw your announcement of that prefix in order to avoid users having to wait for timeouts (which in some cases might be even worse than serving stale data).
You withdraw the node's announcement so that you don't serve bad data to the end user.
That will only introduce new failure modes of mismatches between server availability and server reachability and is a bad idea.
No, if the server is available, it should announce the anycast prefix. If it is not available, it should withdraw it. That’s the best way to make anycast work and it’s what virtually every anycast DNS server network does. If the server is unavailable, but doesn’t withdraw, then you have the failure mode of the server being reachable, but unavailable, and it becomes a black hole for traffic that should otherwise flow to other available anycast nodes.
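The rule in the paragraph above (announce while available, withdraw when not) can be sketched as a small reconciliation step. This is a minimal sketch under assumptions: `announce` and `withdraw` are hypothetical hooks into whatever routing daemon the node runs, and how `server_available` is determined is left to the operator's health check.

```python
from typing import Callable


def reconcile_announcement(server_available: bool,
                           currently_announced: bool,
                           announce: Callable[[], None],
                           withdraw: Callable[[], None]) -> bool:
    """Make the anycast announcement track server availability.

    `announce` and `withdraw` are hypothetical callbacks, not a real
    routing daemon API. Returns the announced state after the call.
    """
    if server_available and not currently_announced:
        announce()   # server is back: attract traffic again
        return True
    if not server_available and currently_announced:
        withdraw()   # server is down: avoid black-holing traffic
        return False
    return currently_announced  # state already matches, nothing to do
```

Run periodically, this keeps reachability and availability in step, so traffic for an unavailable node falls through to the next-closest anycast node instead of timing out.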
That's what happened here -
Yes, facebook did the wrong thing by actively withdrawing routes.
No, facebook did the right thing for 99+% of the situations that would trigger this withdrawal. The problem was that they withdrew EVERY server when the failure wasn’t local, instead of having some way to recognize the failure for what it was (global in nature) and continue serving DNS.
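To illustrate what "some way to recognize the failure" might look like, here is one purely hypothetical heuristic. It is an assumption made for the sake of the example, not a description of what facebook runs or plans to run; the function name, its inputs, and the idea of probing targets outside the operator's own network are all invented here.

```python
from typing import Sequence


def should_keep_serving(data_source_reachable: bool,
                        peer_nodes_reachable: Sequence[bool],
                        outside_probes_ok: Sequence[bool]) -> bool:
    """Hypothetical heuristic for a node deciding whether to keep
    announcing its anycast prefix.

    - Data source reachable: the node can validate its data, so serve.
    - Data source unreachable but some peer node reachable: the
      problem looks local to this node, so withdraw.
    - Neither reachable, and outside probes also fail: this node has
      lost connectivity itself, so withdraw.
    - Neither reachable, but outside probes succeed: this node's own
      connectivity is fine, so the failure looks global; keep serving
      possibly stale data rather than going dark with everyone else.
    """
    if data_source_reachable:
        return True
    if any(peer_nodes_reachable):
        return False
    return any(outside_probes_ok)
```

The trade-off is the one argued over in this thread: the last branch deliberately serves data that cannot be verified, on the theory that stale answers beat no answers when every node would otherwise withdraw at once.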
Simply turning themselves off, instead of withdrawing the routes, would result in suboptimal performance.
This time, facebook is saying that they could not reach their name servers even though the servers were perfectly working.
Because their servers couldn’t verify that they were working and thus thought that they had stale data. Thus, the servers were “perfectly working” with stale data and the safe thing to do if you can’t confirm that your reason for believing you have stale data is erroneous, is to stop serving what you have. If you’re not going to serve what you have, then you shouldn’t announce the anycast prefix, either.
How much performance, do you think, facebook enjoyed? A lot less than "suboptimal", I'm afraid.
As noted, this was that 1% failure that isn’t anticipated. The behavior of the system was correct for 99% of failures and the number of years facebook has operated without a significant or noticeable DNS outage is testament to that fact.
And 99 times out of 100, not doing one or the other would cause rather than prevent an outage.
That is a commonly seen misconception wrongly assuming that server routes were withdrawn if and only if the server is unavailable.
The servers withdrew their routes because the servers had no ability to verify that they were serving valid data. If you can’t verify your data is valid, it’s better (in most cases) to not serve the data you have. If you’re not going to serve, the best thing to do is withdraw the anycast prefix that claims you are a server for the data.
But the reality is that it is impossible to correctly recognize that a server is unavailable, or to correctly withdraw routes only when a server is unavailable.
Yes… So you go with something that works 99% of the time and you get an event like this in that 1% of cases where the failure in question was not one of the failure modes that was previously anticipated. I’m betting that facebook is quickly figuring out changes that will mitigate this type of failure in the future and their DNS will likely stay up until the next 1 in 100 (or will it be 1 in 10,000 this time?) event pops up and surprises them again. That’s the nature of operations.

Owen