----- On Oct 4, 2021, at 10:07 AM, Anne P. Mitchell, Esq. amitchell@isipp.com wrote:

Hi Anne,
On a related note, what do you think the scene is like in FB HQ right now? (shaking head)
Very quiet, as their offices are still closed for all but essentials :) But from experience I can tell you how that works. I assume Facebook works in a similar manner to some of my previous employers, since quite a number of my former colleagues now work at Facebook in similar roles.

First there is the question of detecting the outage. Obviously, Facebook will have a monitoring/SRE team that continuously monitors thousands of metrics. They observe a number of metrics go down and start to investigate. Most likely they will have some sort of overall technical lead (let's call this the Technical Duty Officer, or TDO) who is responsible for the whole thing. Once the SRE team has figured out where the problem lies, they will alert the TDO. The TDO will then hit that big red button and send out alerts to the appropriate teams to jump on a bridge (let's call that the Technical Crisis Bridge) to fix the issue. If done right, whoever is on call for each team will take the lead and interface with adjoining teams, and with other team members who are available to help out.

Looking at how long this outage is lasting, either something is very broken, or they're having trouble rolling back a change that was expected to have no impact.

Once the issue is fixed, the TDO will write a report and submit it to the Problem Management group. This group will then contact the teams deemed responsible for the outage. Those teams will now have an opportunity to explain themselves during a post-mortem. Depending on the scale of the outage, the post-mortem can be a 10-minute call on a bridge with a Problem Management manager, or time in the hot seat during a 60-minute meeting with a bunch of execs.

I've been in that hot seat a few times. Not the most pleasurable experience.

Perhaps it's time for a new career :)

Thanks,
Sabri