On Wed, 6 Oct 2021, Michael Thomas wrote:
On 10/6/21 3:33 PM, Jon Lewis wrote:
On Wed, 6 Oct 2021, Michael Thomas wrote:
People have been anycasting DNS server IPs for years (decades?). So, no.
But it wasn't just their DNS subnets that were pulled, I thought. I'm obviously really confused. For an anycast DNS server, it makes sense that they'd pull its route if it couldn't contact the backend. But I thought that almost all of their routes to the backend were pulled? That is, the DFZ was emptied of FB routes.
Well, as someone else said, DNS wasn't the problem...it was just one of the more noticeable casualties. Whatever they did broke the network rather completely, and that took out all of their DNS, which broke lots of other things that depend on DNS.
Maybe the problem here is that two things happened and the article conflated the two: the DNS infrastructure pulled its routes from the anycast address and something else pulled all of the other routes but wasn't mentioned in the article.
From the engineering.fb.com article:
"This was the source of yesterday’s outage. During one of these routine maintenance jobs, a command was issued with the intention to assess the availability of global backbone capacity, which unintentionally took down all the connections in our backbone network, effectively disconnecting Facebook data centers globally." If you kill the backbone, and every site determines "my connectivity is hosed, suppress anycast propagation.", then you simultaneously have no network, and no anycast (which might otherwise propagate to transit/peers at each or at least some subset of your sites). All of your internal data and communication systems that rely on both network and working DNS suddenly don't work, so internal communications likely degraded to engineers calling or texting each other.
From one of the earlier articles, it sounds like they don't have true out-of-band access to their routers/switches, which makes it kind of hard to fix the network if it's no longer a network and you have no access to console or management ports.
----------------------------------------------------------------------
  Jon Lewis, MCP :)           |  I route
  StackPath, Sr. Neteng       |  therefore you are
_________ http://www.lewis.org/~jlewis/pgp for PGP public key_________