William Herrin wrote:
This is quite common to tie an underlying service announcement to BGP announcements in an Anycast or similar environment.
Yes, that is a commonly seen mistake with anycast.
You don't know what you're talking about.
I do but you don't.
If your anycast node stops receiving updated data and you can't reach any of the other nodes to check whether they're online, 99 times out of 100 this means a local failure of some sort.
Yes. In the case of DNS, if the expiration period of a zone passes without a successful check for the most current zone version, name servers, unicast or anycast, stop responding to requests for the zone. But that has nothing specifically to do with anycast. As there are other name servers with different IP addresses, there is no reason to withdraw routes. So?
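The expiration rule above can be sketched as follows. This is only an illustration, not any particular name server's code: the class `SecondaryZone`, its field names, and the choice to model "stops responding" as a SERVFAIL reply are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class SecondaryZone:
    """Hypothetical model of a zone copy held by a secondary name server."""
    name: str
    expire: int            # SOA EXPIRE value, in seconds
    last_confirmed: float  # time of the last successful check against the primary

    def is_expired(self, now: float) -> bool:
        # The zone expires once EXPIRE seconds pass with no successful check.
        return now - self.last_confirmed > self.expire

    def answer(self, now: float) -> str:
        # Assumption: an expired zone is refused with SERVFAIL.
        # Nothing here touches routing; the server simply stops answering
        # for this zone, whether it is unicast or anycast.
        return "SERVFAIL" if self.is_expired(now) else "ANSWER"


zone = SecondaryZone("example.com.", expire=3600, last_confirmed=1000.0)
print(zone.answer(now=2000.0))   # still within the expire period
print(zone.answer(now=5000.0))   # expire period has passed
```

Note that the decision is local to the server and per zone, and resolvers fall back to the other name servers listed for the zone.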
You withdraw the node's announcement so that you don't serve bad data to the end user.
That will only introduce new failure modes, namely mismatches between server availability and server reachability, and is a bad idea.
That's what happened here -
Yes, Facebook did the wrong thing by actively withdrawing routes.
Simply turning themselves off, instead of withdrawing the routes, would result in suboptimal performance.
This time, Facebook is saying that they could not reach their name servers even though the servers were working perfectly. How much performance do you think Facebook enjoyed? A lot less than "suboptimal", I'm afraid.
And 99 times out of 100, not doing one or the other would cause rather than prevent an outage.
That is a commonly seen misconception, which wrongly assumes that server routes are withdrawn if and only if the server is unavailable. But the reality is that it is impossible to correctly recognize that a server is unavailable, or to correctly withdraw routes only when the server is unavailable.

Masataka Ohta