massive facebook outage presently

Eric Kuhnke

4 Oct 2021 4 Oct '21

4:03 p.m.

https://downdetector.com/status/facebook/ Normally not worth mentioning random $service having an outage here, but this will undoubtedly generate a large volume of customer service calls. Appears to be failure in DNS resolution.

Attachments:

attachment.html (text/html — 387 bytes)

Show replies by date

chris

4 Oct 4 Oct

4:07 p.m.

yeah we are seeing same, appears that none of their NS are responding to dns queries.... affecting all their properties, whatsapp, facebook, instagram etc chris On Mon, Oct 4, 2021 at 12:06 PM Eric Kuhnke <eric.kuhnke@gmail.com> wrote:

...

https://downdetector.com/status/facebook/

Normally not worth mentioning random $service having an outage here, but this will undoubtedly generate a large volume of customer service calls.

Appears to be failure in DNS resolution.

Darwin Costa

4:07 p.m.

...

On 4 Oct 2021, at 18:03, Eric Kuhnke <eric.kuhnke@gmail.com> wrote:

https://downdetector.com/status/facebook/ <https://downdetector.com/status/facebook/>

Normally not worth mentioning random $service having an outage here, but this will undoubtedly generate a large volume of customer service calls. Most probably, if not already.

Appears to be failure in DNS resolution.

Thanks for this Eric.

...

George Metz

4:07 p.m.

Also impacting Instagram and, apparently, WhatsApp. On Mon, Oct 4, 2021 at 12:05 PM Eric Kuhnke <eric.kuhnke@gmail.com> wrote:

...

https://downdetector.com/status/facebook/

Normally not worth mentioning random $service having an outage here, but this will undoubtedly generate a large volume of customer service calls.

Appears to be failure in DNS resolution.

Hugo Slabbert

4:23 p.m.

Looks like their auth DNS dropped out of the DFZ: https://twitter.com/g_bonfiglio/status/1445056923309649926?s=20 https://twitter.com/g_bonfiglio/status/1445058771261313046?s=20 -- Hugo Slabbert On Mon, Oct 4, 2021 at 9:21 AM George Metz <george.metz@gmail.com> wrote:

...

Also impacting Instagram and, apparently, WhatsApp.

On Mon, Oct 4, 2021 at 12:05 PM Eric Kuhnke <eric.kuhnke@gmail.com> wrote:

...
https://downdetector.com/status/facebook/

Normally not worth mentioning random $service having an outage here, but

this will undoubtedly generate a large volume of customer service calls.

...
Appears to be failure in DNS resolution.

Tom Beecher

4:43 p.m.

I see the same. Still see those prefixes via direct peering, but DFZ doesn't have them. Their auths are not reachable even though I have routes for them. Some other weirdness I'm poking a bit on that doesn't seem like it could possibly be related to FB DNS though. On Mon, Oct 4, 2021 at 12:39 PM Hugo Slabbert <hugo@slabnet.com> wrote:

...

Looks like their auth DNS dropped out of the DFZ:

https://twitter.com/g_bonfiglio/status/1445056923309649926?s=20 https://twitter.com/g_bonfiglio/status/1445058771261313046?s=20

-- Hugo Slabbert

On Mon, Oct 4, 2021 at 9:21 AM George Metz <george.metz@gmail.com> wrote:

...
Also impacting Instagram and, apparently, WhatsApp.

On Mon, Oct 4, 2021 at 12:05 PM Eric Kuhnke <eric.kuhnke@gmail.com> wrote:

...
https://downdetector.com/status/facebook/

Normally not worth mentioning random $service having an outage here,

but this will undoubtedly generate a large volume of customer service calls.

...
Appears to be failure in DNS resolution.

tomocha

6:31 p.m.

Hi Some of the DNS addresses are no longer prefix from AS32934. On 2021/10/05 1:23, Hugo Slabbert wrote:

...

Looks like their auth DNS dropped out of the DFZ:

https://twitter.com/g_bonfiglio/status/1445056923309649926?s=20 https://twitter.com/g_bonfiglio/status/1445058771261313046?s=20

Jean St-Laurent

7:16 p.m.

Maybe the key to solve this issue is in an email sent to some_very_important_team@facebook.com -----Original Message----- From: NANOG <nanog-bounces+jean=ddostest.me@nanog.org> On Behalf Of tomocha Sent: October 4, 2021 2:32 PM To: nanog@nanog.org Subject: Re: massive facebook outage presently Hi Some of the DNS addresses are no longer prefix from AS32934. On 2021/10/05 1:23, Hugo Slabbert wrote:

...

Looks like their auth DNS dropped out of the DFZ:

https://twitter.com/g_bonfiglio/status/1445056923309649926?s=20 https://twitter.com/g_bonfiglio/status/1445058771261313046?s=20

Töma Gavrichenkov

7:41 p.m.

Peace, On Mon, Oct 4, 2021, 10:17 PM Jean St-Laurent via NANOG <nanog@nanog.org> wrote:

...

Maybe the key to solve this issue is in an email sent to some_very_important_team@facebook.com

Yeah except MX records on facebook dot com aren't working either -- Töma

Mel Beckman

4:08 p.m.

Here’s a screenshot: [cid:3E071EF9-BBC5-44BF-865D-2EDC36E05C71-L0-001] -mel beckman On Oct 4, 2021, at 9:06 AM, Eric Kuhnke <eric.kuhnke@gmail.com> wrote: https://downdetector.com/status/facebook/ Normally not worth mentioning random $service having an outage here, but this will undoubtedly generate a large volume of customer service calls. Appears to be failure in DNS resolution.

Jason Kuehl

4:25 p.m.

Looks like they run there own nameservers and I see the soa records are even missing. On Mon, Oct 4, 2021, 12:23 PM Mel Beckman <mel@beckman.org> wrote:

...

Here’s a screenshot:

-mel beckman

On Oct 4, 2021, at 9:06 AM, Eric Kuhnke <eric.kuhnke@gmail.com> wrote:

https://downdetector.com/status/facebook/

Normally not worth mentioning random $service having an outage here, but this will undoubtedly generate a large volume of customer service calls.

Appears to be failure in DNS resolution.

Matthew Petach

4:51 p.m.

On Mon, Oct 4, 2021 at 9:47 AM Jason Kuehl <jason.w.kuehl@gmail.com> wrote:

...

Looks like they run there own nameservers and I see the soa records are even missing.

On Mon, Oct 4, 2021, 12:23 PM Mel Beckman <mel@beckman.org> wrote:

...
Here’s a screenshot:

-mel beckman

On Oct 4, 2021, at 9:06 AM, Eric Kuhnke <eric.kuhnke@gmail.com> wrote:

https://downdetector.com/status/facebook/

Normally not worth mentioning random $service having an outage here, but this will undoubtedly generate a large volume of customer service calls.

Appears to be failure in DNS resolution.

If you check your BGP routing tables, you'll probably find that it's not so much the SOA records that are missing, as it is the prefixes to reach the DNS servers entirely.

I suspect the DNS entries on the servers themselves may look fine from inside facebook, leading to a slower diagnostic and repair, as it's only from the outside world the missing routing entries in the global table make the problem so painfully visible. Having the DNS team frantically checking their servers may slow the resolution down, if it is indeed a BGP failure rather than a DNS server failure situation, as it seems to appear at the moment. ^_^; Matt

Michael Spears

5:12 p.m.

Jonathan Kalbfeld

5:30 p.m.

With Facebook down, how are people doing their vaccine research? It’s got to be more than just DNS. Jonathan Kalbfeld office: +1 310 317 7933 <tel:%28310%29%20317-7933> fax: +1 310 317 7901 <tel:%28310%29%20317-7901> home: +1 310 317 7909 <tel:%28310%29%20317-7909> mobile: +1 310 227 1662 <tel:%28310%29%20227-1662> ThoughtWave Technologies, Inc. Studio City, CA 91604 https://thoughtwave.com <https://thoughtwave.com/> View our network at https://bgp.he.net/AS54380 <https://bgp.he.net/AS54380> +1 213 984 1000

...

On Oct 4, 2021, at 10:12 AM, Michael Spears <michael@spears.io> wrote:

https://www.reddit.com/r/sysadmin/comments/q181fv/looks_like_facebook_is_down/?utm_medium=android_app&utm_source=share

Sent from my Verizon, Samsung Galaxy smartphone Get Outlook for Android <https://aka.ms/AAb9ysg>

Mel Beckman

5:35 p.m.

Suspiciously, this comes the morning after Facebook whistleblower Frances Haugen disclosed on 60 Minutes that Facebook's own research shows that it chose to profit from misinformation and political unrest through deliberate escalation of conflicts. Occam’s razor says “When multiple causes are plausible, and CBS 60 Minutes is one of them, go with 60 Minutes.” :) -mel On Oct 4, 2021, at 10:30 AM, Michael Spears <michael@spears.io> wrote: https://www.reddit.com/r/sysadmin/comments/q181fv/looks_like_facebook_is_down/?utm_medium=android_app&utm_source=share Sent from my Verizon, Samsung Galaxy smartphone Get Outlook for Android<https://aka.ms/AAb9ysg>

Jay Hennigan

5:57 p.m.

On 10/4/21 10:35, Mel Beckman wrote:

...

Suspiciously, this comes the morning after Facebook whistleblower Frances Haugen disclosed on 60 Minutes that Facebook's own research shows that it chose to profit from misinformation and political unrest through deliberate escalation of conflicts. Occam’s razor says “When multiple causes are plausible, and CBS 60 Minutes is one of them, go with 60 Minutes.” :)

It could just be that after the 60 Minutes interview they've shut things down in order to divert all power to the shredders. -- Jay Hennigan - jay@west.net Network Engineering - CCIE #7880 503 897-8550 - WB6RDV

Martin List-Petersen

6:21 p.m.

Wishful thinking .. but hey, one is allowed to dream. /M On 04/10/2021 18:57, Jay Hennigan wrote:

...

On 10/4/21 10:35, Mel Beckman wrote:

...
Suspiciously, this comes the morning after Facebook whistleblower Frances Haugen disclosed on 60 Minutes that Facebook's own research shows that it chose to profit from misinformation and political unrest through deliberate escalation of conflicts. Occam’s razor says “When multiple causes are plausible, and CBS 60 Minutes is one of them, go with 60 Minutes.” :)

It could just be that after the 60 Minutes interview they've shut things down in order to divert all power to the shredders.

-- Airwire Ltd. - Ag Nascadh Pobail an Iarthair - http://www.airwire.ie - Phone: 091 395000 Registered Office: Moy, Kinvara, Co. Galway, 091-865 968 - Registered in Ireland No. 508961 -- This email has been checked for viruses by AVG. https://www.avg.com

bzs＠theworld.com

7:11 p.m.

Although I believe it's generally true that if a company appears prominently in the news it's liable to be attacked I assume because the miscreants sit around thinking "hmm, who shall we attack today oh look at that shiny headline!" I'd hate to ascribe any altruistic motivation w/o some evidence like even a credible twitter post (maybe they posted that on FB? :-) I wonder how often the US White House, congress, et al are attacked since they appear in almost any top headline list often centering on something someone out there really doesn't like? On October 4, 2021 at 17:35 mel@beckman.org (Mel Beckman) wrote:

...

Suspiciously, this comes the morning after Facebook whistleblower Frances Haugen disclosed on 60 Minutes that Facebook's own research shows that it chose to profit from misinformation and political unrest through deliberate escalation of conflicts. Occam’s razor says “When multiple causes are plausible, and CBS 60 Minutes is one of them, go with 60 Minutes.” :)

-mel

On Oct 4, 2021, at 10:30 AM, Michael Spears <michael@spears.io> wrote:

https://www.reddit.com/r/sysadmin/comments/q181fv/ looks_like_facebook_is_down/?utm_medium=android_app&utm_source=share

Sent from my Verizon, Samsung Galaxy smartphone Get Outlook for Android

Jay Hennigan

7:30 p.m.

On 10/4/21 12:11, bzs@theworld.com wrote:

...

Although I believe it's generally true that if a company appears prominently in the news it's liable to be attacked I assume because the miscreants sit around thinking "hmm, who shall we attack today oh look at that shiny headline!" I'd hate to ascribe any altruistic motivation w/o some evidence like even a credible twitter post (maybe they posted that on FB? :-)

I personally believe that the outage was caused by human error and not something malicious. Time will tell. However, if you missed the 60 Minutes piece, it was a former employee who spoke out with some rather powerful observations. I don't think that this type of worldwide outage was caused by an outside bad actor. It is certainly within the realm of possibility that it was an inside job. In other news: https://twitter.com/disclosetv/status/1445100931947892736?s=20 -- Jay Hennigan - jay@west.net Network Engineering - CCIE #7880 503 897-8550 - WB6RDV

Eric Kuhnke

8:33 p.m.

I am starting to see reports that in ISPs with very large numbers of residential users, customers are starting to press the factory-reset buttons on their home routers/modems/whatever, in an attempt to make Facebook work. This is resulting in much heavier than normal first tier support volumes. The longer it stays down the worse this is going to get. On Mon, Oct 4, 2021 at 3:30 PM Jay Hennigan <jay@west.net> wrote:

...

On 10/4/21 12:11, bzs@theworld.com wrote:

...
Although I believe it's generally true that if a company appears prominently in the news it's liable to be attacked I assume because the miscreants sit around thinking "hmm, who shall we attack today oh look at that shiny headline!" I'd hate to ascribe any altruistic motivation w/o some evidence like even a credible twitter post (maybe they posted that on FB? :-)

I personally believe that the outage was caused by human error and not something malicious. Time will tell.

However, if you missed the 60 Minutes piece, it was a former employee who spoke out with some rather powerful observations. I don't think that this type of worldwide outage was caused by an outside bad actor. It is certainly within the realm of possibility that it was an inside job.

In other news:

https://twitter.com/disclosetv/status/1445100931947892736?s=20

-- Jay Hennigan - jay@west.net Network Engineering - CCIE #7880 503 897-8550 - WB6RDV

Aaron C. de Bruyn

8:45 p.m.

It looks like it might take a while according to a news reporter's tweet: "Was just on phone with someone who works for FB who described employees unable to enter buildings this morning to begin to evaluate extent of outage because their badges weren’t working to access doors." https://twitter.com/sheeraf/status/1445099150316503057?s=20 -A On Mon, Oct 4, 2021 at 1:41 PM Eric Kuhnke <eric.kuhnke@gmail.com> wrote:

...

I am starting to see reports that in ISPs with very large numbers of residential users, customers are starting to press the factory-reset buttons on their home routers/modems/whatever, in an attempt to make Facebook work. This is resulting in much heavier than normal first tier support volumes. The longer it stays down the worse this is going to get.

On Mon, Oct 4, 2021 at 3:30 PM Jay Hennigan <jay@west.net> wrote:

...
On 10/4/21 12:11, bzs@theworld.com wrote:

...
Although I believe it's generally true that if a company appears prominently in the news it's liable to be attacked I assume because the miscreants sit around thinking "hmm, who shall we attack today oh look at that shiny headline!" I'd hate to ascribe any altruistic motivation w/o some evidence like even a credible twitter post (maybe they posted that on FB? :-)

I personally believe that the outage was caused by human error and not something malicious. Time will tell.

However, if you missed the 60 Minutes piece, it was a former employee who spoke out with some rather powerful observations. I don't think that this type of worldwide outage was caused by an outside bad actor. It is certainly within the realm of possibility that it was an inside job.

In other news:

https://twitter.com/disclosetv/status/1445100931947892736?s=20

-- Jay Hennigan - jay@west.net Network Engineering - CCIE #7880 503 897-8550 - WB6RDV

Jorge Amodio

8:53 p.m.

How come such a large operation does not have an out of bound access in case of emergencies ??? Somebody's getting fired ! -J On Mon, Oct 4, 2021 at 3:51 PM Aaron C. de Bruyn via NANOG <nanog@nanog.org> wrote:

...

It looks like it might take a while according to a news reporter's tweet:

"Was just on phone with someone who works for FB who described employees unable to enter buildings this morning to begin to evaluate extent of outage because their badges weren’t working to access doors."

https://twitter.com/sheeraf/status/1445099150316503057?s=20

-A

On Mon, Oct 4, 2021 at 1:41 PM Eric Kuhnke <eric.kuhnke@gmail.com> wrote:

...
I am starting to see reports that in ISPs with very large numbers of residential users, customers are starting to press the factory-reset buttons on their home routers/modems/whatever, in an attempt to make Facebook work. This is resulting in much heavier than normal first tier support volumes. The longer it stays down the worse this is going to get.

On Mon, Oct 4, 2021 at 3:30 PM Jay Hennigan <jay@west.net> wrote:

...
On 10/4/21 12:11, bzs@theworld.com wrote:

...
Although I believe it's generally true that if a company appears prominently in the news it's liable to be attacked I assume because the miscreants sit around thinking "hmm, who shall we attack today oh look at that shiny headline!" I'd hate to ascribe any altruistic motivation w/o some evidence like even a credible twitter post (maybe they posted that on FB? :-)

I personally believe that the outage was caused by human error and not something malicious. Time will tell.

However, if you missed the 60 Minutes piece, it was a former employee who spoke out with some rather powerful observations. I don't think that this type of worldwide outage was caused by an outside bad actor. It is certainly within the realm of possibility that it was an inside job.

In other news:

https://twitter.com/disclosetv/status/1445100931947892736?s=20

-- Jay Hennigan - jay@west.net Network Engineering - CCIE #7880 503 897-8550 - WB6RDV

Jared Mauch

5 Oct 5 Oct

12:50 p.m.

New subject: Disaster Recovery Process

...

On Oct 4, 2021, at 4:53 PM, Jorge Amodio <jmamodio@gmail.com> wrote:

How come such a large operation does not have an out of bound access in case of emergencies ???

I mentioned to someone yesterday that most OOB systems _are_ the internet. It doesn’t always seem like you need things like modems or dial-backup, or access to these services, except when you do it’s critical/essential. A few reminders for people: 1) Program your co-workers into your cell phone 2) Print out an emergency contact sheet 3) Have a backup conference bridge/system that you test - if zoom/webex/ms are down, where do you go? Slack? Google meet? Audio bridge? - No judgement, but do test the system! 4) Know how to access the office and who is closest. - What happens if they are in the hospital, sick or on vacation? 5) Complacency is dangerous - When the tools “just work” you never imagine the tools won’t work. I’m sure the lessons learned will be long internally. - I hope they share them externally so others can learn. 6) No really, test the backup process. * interlude * Back at my time at 2914 - one reason we all had T1’s at home was largely so we could get in to the network should something bad happen. My home IP space was in the router ACLs. Much changed since those early days as this network became more reliable. We’ve seen large outages in the past 2 years of platforms, carriers, etc.. (the Aug 30th 2020 issue is still firmly in my memory). Plan for the outages and make sure you understand your playbook. It may be from snow day to all hands on deck. Test it at least once, and ideally with someone who will challenge a few assumptions (eg: that the cell network will be up) - Jared

Karl Auer

2:05 p.m.

New subject: Disaster Recovery Process

On Tue, 2021-10-05 at 08:50 -0400, Jared Mauch wrote:

...

A few reminders for people: [excellent list snipped]

I'd add one "soft" list item: - in your emergency plan, have one or two people nominated who are VERY high up in the organisation. Their lines need to be open to the decisionmakers in the emergency team(s). Their job is to put the fear of a vengeful god into any idiot who tries to interfere with the recovery process by e.g. demanding status reports at ten-minute intervals. Regards, K. -- ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ Karl Auer (kauer@biplane.com.au) http://www.biplane.com.au/kauer GPG fingerprint: 61A0 99A9 8823 3A75 871E 5D90 BADB B237 260C 9C58 Old fingerprint: 2561 E9EC D868 E73C 8AF1 49CF EE50 4B1D CCA1 5170

Jared Mauch

2:13 p.m.

New subject: Disaster Recovery Process

...

On Oct 5, 2021, at 10:05 AM, Karl Auer <kauer@biplane.com.au> wrote:

On Tue, 2021-10-05 at 08:50 -0400, Jared Mauch wrote:

...
A few reminders for people: [excellent list snipped]

I'd add one "soft" list item:

- in your emergency plan, have one or two people nominated who are VERY high up in the organisation. Their lines need to be open to the decisionmakers in the emergency team(s). Their job is to put the fear of a vengeful god into any idiot who tries to interfere with the recovery process by e.g. demanding status reports at ten-minute intervals.

At $dayjob we split the technical updates on a different bridge from the business updates. There is a dedicated team to coordinate an entire thing, they can be low severity (risk) or high severity (whole business impacting). They provide the timeline to next update and communicate what tasks are being done. There’s even training on how to be a SME in the environment. Nothing is perfect but this runs very smooth at $dayjob - Jared

Sean Donelan

2:34 p.m.

New subject: Disaster Recovery Process

On Wed, 6 Oct 2021, Karl Auer wrote:

...

I'd add one "soft" list item:

- in your emergency plan, have one or two people nominated who are VERY high up in the organisation. Their lines need to be open to the decisionmakers in the emergency team(s). Their job is to put the fear of a vengeful god into any idiot who tries to interfere with the recovery process by e.g. demanding status reports at ten-minute intervals.

A good idea I learned was designate separate "executive" conference room and "incident command" conference room. Executives are only allowed in the executive conference room. Executives are NOT allowed in any NOC/SOC/operations areas. The executive conference room was well stocked with coffee, snacks, TVs, monitors, paper and easels. An executive was anyone with a CxO, General Counsel, EVP, VP, etc. title. You know who you are :-) One operations person (i.e. Director of Operations or designee for shift) would brief the executives when they wanted something, and take their suggestions back to the incident room. The Incident Commander was God as far as the incident, with a pre-approved emergency budget authorization. One compromise, we did allow one lawyer in the incident command conference room, but it was NOT the corporate General Counsel.

Wolfgang Tremmel

6 Oct 6 Oct

7:21 a.m.

New subject: Disaster Recovery Process

And a layer 8 item from me: - put a number (as in money) into the process up to that anything spend by anyone working on the recovery is covered. Has to be a number because if you write "all cost are covered" it makes the recovery person 2nd-guess if the airplane ticket or spare part he just bought is really covered. Additionally you should put an "approver" on the team who approves higher cost on very short notice. Wolfgang

...

On 5. Oct 2021, at 16:05, Karl Auer <kauer@biplane.com.au> wrote:

On Tue, 2021-10-05 at 08:50 -0400, Jared Mauch wrote:

...
A few reminders for people: [excellent list snipped]

I'd add one "soft" list item:

Jeff Shultz

5 Oct 5 Oct

5:04 p.m.

New subject: Disaster Recovery Process

7. Make sure any access controlled rooms have physical keys that are available at need - and aren't secured by the same access control that they are to circumvent. . 8. Don't make your access control dependent on internet access - always have something on the local network it can fall back to. That last thing, that apparently their access control failed, locking people out when either their outward facing DNS and/or BGP routes went goodbye, is perhaps the most astounding thing to me - making your access control into an IoT device without (apparently) a quick workaround for a failure in the "I" part. On Tue, Oct 5, 2021 at 6:01 AM Jared Mauch <jared@puck.nether.net> wrote:

...

...
On Oct 4, 2021, at 4:53 PM, Jorge Amodio <jmamodio@gmail.com> wrote:

How come such a large operation does not have an out of bound access in case of emergencies ???

I mentioned to someone yesterday that most OOB systems _are_ the internet. It doesn’t always seem like you need things like modems or dial-backup, or access to these services, except when you do it’s critical/essential.

A few reminders for people:

1) Program your co-workers into your cell phone 2) Print out an emergency contact sheet 3) Have a backup conference bridge/system that you test - if zoom/webex/ms are down, where do you go? Slack? Google meet? Audio bridge? - No judgement, but do test the system! 4) Know how to access the office and who is closest. - What happens if they are in the hospital, sick or on vacation? 5) Complacency is dangerous - When the tools “just work” you never imagine the tools won’t work. I’m sure the lessons learned will be long internally. - I hope they share them externally so others can learn. 6) No really, test the backup process.

* interlude *

Back at my time at 2914 - one reason we all had T1’s at home was largely so we could get in to the network should something bad happen. My home IP space was in the router ACLs. Much changed since those early days as this network became more reliable. We’ve seen large outages in the past 2 years of platforms, carriers, etc.. (the Aug 30th 2020 issue is still firmly in my memory).

Plan for the outages and make sure you understand your playbook. It may be from snow day to all hands on deck. Test it at least once, and ideally with someone who will challenge a few assumptions (eg: that the cell network will be up)

- Jared

-- Jeff Shultz -- Like us on Social Media for News, Promotions, and other information!! <https://www.facebook.com/SCTCWEB/> <https://www.instagram.com/sctc_sctc/> <https://www.yelp.com/biz/sctc-stayton-3> <https://www.youtube.com/c/sctcvideos> _**** This message contains confidential information and is intended only for the individual named. If you are not the named addressee you should not disseminate, distribute or copy this e-mail. Please notify the sender immediately by e-mail if you have received this e-mail by mistake and delete this e-mail from your system. E-mail transmission cannot be guaranteed to be secure or error-free as information could be intercepted, corrupted, lost, destroyed, arrive late or incomplete, or contain viruses. The sender therefore does not accept liability for any errors or omissions in the contents of this message, which arise as a result of e-mail transmission. ****_

jim deleskie

5:09 p.m.

New subject: Disaster Recovery Process

World broke. Crazy $$ per hour down time. Doors open with a fire axe. Glass breaks super easy too and much less expensive then adding 15 min to failure. -jim On Tue., Oct. 5, 2021, 7:05 p.m. Jeff Shultz, <jeffshultz@sctcweb.com> wrote:

...

7. Make sure any access controlled rooms have physical keys that are available at need - and aren't secured by the same access control that they are to circumvent. . 8. Don't make your access control dependent on internet access - always have something on the local network it can fall back to.

That last thing, that apparently their access control failed, locking people out when either their outward facing DNS and/or BGP routes went goodbye, is perhaps the most astounding thing to me - making your access control into an IoT device without (apparently) a quick workaround for a failure in the "I" part.

On Tue, Oct 5, 2021 at 6:01 AM Jared Mauch <jared@puck.nether.net> wrote:

...
...
On Oct 4, 2021, at 4:53 PM, Jorge Amodio <jmamodio@gmail.com> wrote:

How come such a large operation does not have an out of bound access in case of emergencies ???

I mentioned to someone yesterday that most OOB systems _are_ the internet. It doesn’t always seem like you need things like modems or dial-backup, or access to these services, except when you do it’s critical/essential.

A few reminders for people:

1) Program your co-workers into your cell phone 2) Print out an emergency contact sheet 3) Have a backup conference bridge/system that you test - if zoom/webex/ms are down, where do you go? Slack? Google meet? Audio bridge? - No judgement, but do test the system! 4) Know how to access the office and who is closest. - What happens if they are in the hospital, sick or on vacation? 5) Complacency is dangerous - When the tools “just work” you never imagine the tools won’t work. I’m sure the lessons learned will be long internally. - I hope they share them externally so others can learn. 6) No really, test the backup process.

* interlude *

Back at my time at 2914 - one reason we all had T1’s at home was largely so we could get in to the network should something bad happen. My home IP space was in the router ACLs. Much changed since those early days as this network became more reliable. We’ve seen large outages in the past 2 years of platforms, carriers, etc.. (the Aug 30th 2020 issue is still firmly in my memory).

Plan for the outages and make sure you understand your playbook. It may be from snow day to all hands on deck. Test it at least once, and ideally with someone who will challenge a few assumptions (eg: that the cell network will be up)

- Jared

-- Jeff Shultz

Like us on Social Media for News, Promotions, and other information!!

<https://www.facebook.com/SCTCWEB/> [image: https://www.instagram.com/sctc_sctc/] <https://www.instagram.com/sctc_sctc/> <https://www.yelp.com/biz/sctc-stayton-3> <https://www.youtube.com/c/sctcvideos>

**** This message contains confidential information and is intended only for the individual named. If you are not the named addressee you should not disseminate, distribute or copy this e-mail. Please notify the sender immediately by e-mail if you have received this e-mail by mistake and delete this e-mail from your system. E-mail transmission cannot be guaranteed to be secure or error-free as information could be intercepted, corrupted, lost, destroyed, arrive late or incomplete, or contain viruses. The sender therefore does not accept liability for any errors or omissions in the contents of this message, which arise as a result of e-mail transmission. ****

Jamie Dahl

5:14 p.m.

New subject: Disaster Recovery Process

The NIMS/ICS system works very well for issues like this. I utilize ICS regularly in my Search and Rescue world, and the last two companies I worked for utilize(d) it extensively during outages. It allows folks from various different disciplines, roles and backgrounds to come in, and provide a divide and conquer methodology to incidents and can be scaled up/scaled out as necessary. Phrases like "Incident Commander" and such have been around for a few decades and are concepts used regularly by FEMA, CalFire and other natural disaster style incidents. But those of you who may be EMComm folks probably already knew that ;-). this was pounded out on my iPhone and i have fat fingers plus two left thumbs :) We have to remember that what we observe is not nature herself, but nature exposed to our method of questioning.

...

On Oct 5, 2021, at 10:11, jim deleskie <deleskie@gmail.com> wrote:

World broke. Crazy $$ per hour down time. Doors open with a fire axe. Glass breaks super easy too and much less expensive then adding 15 min to failure.

-jim

...
On Tue., Oct. 5, 2021, 7:05 p.m. Jeff Shultz, <jeffshultz@sctcweb.com> wrote: 7. Make sure any access controlled rooms have physical keys that are available at need - and aren't secured by the same access control that they are to circumvent. . 8. Don't make your access control dependent on internet access - always have something on the local network it can fall back to.

That last thing, that apparently their access control failed, locking people out when either their outward facing DNS and/or BGP routes went goodbye, is perhaps the most astounding thing to me - making your access control into an IoT device without (apparently) a quick workaround for a failure in the "I" part.

...
On Tue, Oct 5, 2021 at 6:01 AM Jared Mauch <jared@puck.nether.net> wrote:

...
On Oct 4, 2021, at 4:53 PM, Jorge Amodio <jmamodio@gmail.com> wrote:

How come such a large operation does not have an out of bound access in case of emergencies ???

I mentioned to someone yesterday that most OOB systems _are_ the internet. It doesn’t always seem like you need things like modems or dial-backup, or access to these services, except when you do it’s critical/essential.

A few reminders for people:

1) Program your co-workers into your cell phone 2) Print out an emergency contact sheet 3) Have a backup conference bridge/system that you test - if zoom/webex/ms are down, where do you go? Slack? Google meet? Audio bridge? - No judgement, but do test the system! 4) Know how to access the office and who is closest. - What happens if they are in the hospital, sick or on vacation? 5) Complacency is dangerous - When the tools “just work” you never imagine the tools won’t work. I’m sure the lessons learned will be long internally. - I hope they share them externally so others can learn. 6) No really, test the backup process.

* interlude *

Back at my time at 2914 - one reason we all had T1’s at home was largely so we could get in to the network should something bad happen. My home IP space was in the router ACLs. Much changed since those early days as this network became more reliable. We’ve seen large outages in the past 2 years of platforms, carriers, etc.. (the Aug 30th 2020 issue is still firmly in my memory).

Plan for the outages and make sure you understand your playbook. It may be from snow day to all hands on deck. Test it at least once, and ideally with someone who will challenge a few assumptions (eg: that the cell network will be up)

- Jared

-- Jeff Shultz

Like us on Social Media for News, Promotions, and other information!!

*** This message contains confidential information and is intended only for the individual named. If you are not the named addressee you should not disseminate, distribute or copy this e-mail. Please notify the sender immediately by e-mail if you have received this e-mail by mistake and delete this e-mail from your system. E-mail transmission cannot be guaranteed to be secure or error-free as information could be intercepted, corrupted, lost, destroyed, arrive late or incomplete, or contain viruses. The sender therefore does not accept liability for any errors or omissions in the contents of this message, which arise as a result of e-mail transmission. ***

Niels Bakker

5:34 p.m.

New subject: Disaster Recovery Process

* deleskie@gmail.com (jim deleskie) [Tue 05 Oct 2021, 19:13 CEST]:

...

World broke. Crazy $$ per hour down time. Doors open with a fire axe.

Please stop spreading fake news. https://twitter.com/MikeIsaac/status/1445196576956162050 |need to issue a correction: the team dispatched to the Facebook site |had issues getting in because of physical security but did not need to |use a saw/ grinder. -- Niels.

jim deleskie

5:44 p.m.

New subject: Disaster Recovery Process

I don't see posting in a DR process thead about thinking to use alternative entry methods to locked doors and spreading false information. If do well. Mail filters are simple. -jim On Tue., Oct. 5, 2021, 7:35 p.m. Niels Bakker, <niels=nanog@bakker.net> wrote:

...

* deleskie@gmail.com (jim deleskie) [Tue 05 Oct 2021, 19:13 CEST]:

...
World broke. Crazy $$ per hour down time. Doors open with a fire axe.

Please stop spreading fake news.

https://twitter.com/MikeIsaac/status/1445196576956162050 |need to issue a correction: the team dispatched to the Facebook site |had issues getting in because of physical security but did not need to |use a saw/ grinder.

-- Niels.

Warren Kumari

5:25 p.m.

New subject: Disaster Recovery Process

On Tue, Oct 5, 2021 at 1:07 PM Jeff Shultz <jeffshultz@sctcweb.com> wrote:

...

7. Make sure any access controlled rooms have physical keys that are available at need - and aren't secured by the same access control that they are to circumvent. . 8. Don't make your access control dependent on internet access - always have something on the local network it can fall back to.

That last thing, that apparently their access control failed, locking people out when either their outward facing DNS and/or BGP routes went goodbye, is perhaps the most astounding thing to me - making your access control into an IoT device without (apparently) a quick workaround for a failure in the "I" part.

Keep in mind that the "some employees couldn't get into their offices" has been filtered through the public press and seems to have grown into "OMG! Lolz! No-one can fix the Facebook because no-one can reach the turn-it-off-and-on-again-button". Facebook has many office buildings, and needs to be able to add and revoke employee access as people are hired and quit, etc. Just because the press said that some random employees were unable to enter their office building doesn't actually mean that: 1: this was a datacenter and they really needed access or 2: no-one was able to enter or 3: this actually caused issues with recovery. Important buildings have security people who have controller-locked cards and / or physical keys, offices != datacenter, etc. I'm quite sure that this part of the story is a combination of some small tidbit of information that a non-technical reporter was able to understand, mixed with some "Hah. Look at those idiots, even I know to keep a spare key under the doormat" schadenfreude. W

...

On Tue, Oct 5, 2021 at 6:01 AM Jared Mauch <jared@puck.nether.net> wrote:

...
...
On Oct 4, 2021, at 4:53 PM, Jorge Amodio <jmamodio@gmail.com> wrote:

How come such a large operation does not have an out of bound access in case of emergencies ???

I mentioned to someone yesterday that most OOB systems _are_ the internet. It doesn’t always seem like you need things like modems or dial-backup, or access to these services, except when you do it’s critical/essential.

A few reminders for people:

1) Program your co-workers into your cell phone 2) Print out an emergency contact sheet 3) Have a backup conference bridge/system that you test - if zoom/webex/ms are down, where do you go? Slack? Google meet? Audio bridge? - No judgement, but do test the system! 4) Know how to access the office and who is closest. - What happens if they are in the hospital, sick or on vacation? 5) Complacency is dangerous - When the tools “just work” you never imagine the tools won’t work. I’m sure the lessons learned will be long internally. - I hope they share them externally so others can learn. 6) No really, test the backup process.

* interlude *

Back at my time at 2914 - one reason we all had T1’s at home was largely so we could get in to the network should something bad happen. My home IP space was in the router ACLs. Much changed since those early days as this network became more reliable. We’ve seen large outages in the past 2 years of platforms, carriers, etc.. (the Aug 30th 2020 issue is still firmly in my memory).

Plan for the outages and make sure you understand your playbook. It may be from snow day to all hands on deck. Test it at least once, and ideally with someone who will challenge a few assumptions (eg: that the cell network will be up)

- Jared

-- Jeff Shultz

Like us on Social Media for News, Promotions, and other information!!

<https://www.facebook.com/SCTCWEB/> [image: https://www.instagram.com/sctc_sctc/] <https://www.instagram.com/sctc_sctc/> <https://www.yelp.com/biz/sctc-stayton-3> <https://www.youtube.com/c/sctcvideos>

**** This message contains confidential information and is intended only for the individual named. If you are not the named addressee you should not disseminate, distribute or copy this e-mail. Please notify the sender immediately by e-mail if you have received this e-mail by mistake and delete this e-mail from your system. E-mail transmission cannot be guaranteed to be secure or error-free as information could be intercepted, corrupted, lost, destroyed, arrive late or incomplete, or contain viruses. The sender therefore does not accept liability for any errors or omissions in the contents of this message, which arise as a result of e-mail transmission. ****

-- The computing scientist’s main challenge is not to get confused by the complexities of his own making. -- E. W. Dijkstra

None None

4 Oct 4 Oct

8:57 p.m.

Not all employees are having this issue But it is curious why the records can be moved to the name servers that are still being advertised over the Internet still . On Mon, Oct 4, 2021 at 4:53 PM Aaron C. de Bruyn via NANOG <nanog@nanog.org> wrote:

...

It looks like it might take a while according to a news reporter's tweet:

"Was just on phone with someone who works for FB who described employees unable to enter buildings this morning to begin to evaluate extent of outage because their badges weren’t working to access doors."

https://twitter.com/sheeraf/status/1445099150316503057?s=20

-A

On Mon, Oct 4, 2021 at 1:41 PM Eric Kuhnke <eric.kuhnke@gmail.com> wrote:

...
I am starting to see reports that in ISPs with very large numbers of residential users, customers are starting to press the factory-reset buttons on their home routers/modems/whatever, in an attempt to make Facebook work. This is resulting in much heavier than normal first tier support volumes. The longer it stays down the worse this is going to get.

On Mon, Oct 4, 2021 at 3:30 PM Jay Hennigan <jay@west.net> wrote:

...
On 10/4/21 12:11, bzs@theworld.com wrote:

...
Although I believe it's generally true that if a company appears prominently in the news it's liable to be attacked I assume because the miscreants sit around thinking "hmm, who shall we attack today oh look at that shiny headline!" I'd hate to ascribe any altruistic motivation w/o some evidence like even a credible twitter post (maybe they posted that on FB? :-)

I personally believe that the outage was caused by human error and not something malicious. Time will tell.

However, if you missed the 60 Minutes piece, it was a former employee who spoke out with some rather powerful observations. I don't think that this type of worldwide outage was caused by an outside bad actor. It is certainly within the realm of possibility that it was an inside job.

In other news:

https://twitter.com/disclosetv/status/1445100931947892736?s=20

-- Jay Hennigan - jay@west.net Network Engineering - CCIE #7880 503 897-8550 - WB6RDV

Bill Woodcock

9:10 p.m.

New subject: facebook outage

They’re starting to pick themselves back up off the floor in the last two or three minutes. A few answers getting out. I imagine it’ll take a while before things stabilize, though. -Bill

Bill Woodcock

9:21 p.m.

New subject: facebook outage

...

On Oct 4, 2021, at 11:10 PM, Bill Woodcock <woody@pch.net> wrote:

They’re starting to pick themselves back up off the floor in the last two or three minutes. A few answers getting out. I imagine it’ll take a while before things stabilize, though.

aaaand we’re back: WoodyNet-2:.ssh woody$ dig www.facebook.com @9.9.9.9 ; <<>> DiG 9.10.6 <<>> www.facebook.com @9.9.9.9 ;; global options: +cmd ;; Got answer: ;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 32839 ;; flags: qr rd ra; QUERY: 1, ANSWER: 2, AUTHORITY: 0, ADDITIONAL: 1 ;; OPT PSEUDOSECTION: ; EDNS: version: 0, flags:; udp: 512 ;; QUESTION SECTION: ;www.facebook.com. IN A ;; ANSWER SECTION: www.facebook.com. 3420 IN CNAME star-mini.c10r.facebook.com. star-mini.c10r.facebook.com. 6 IN A 157.240.19.35 ;; Query time: 13 msec ;; SERVER: 9.9.9.9#53(9.9.9.9) ;; WHEN: Mon Oct 04 23:20:41 CEST 2021 ;; MSG SIZE rcvd: 90 -Bill

Jeff Shultz

9:29 p.m.

New subject: facebook outage

Now they just need to get the site itself back up. On Mon, Oct 4, 2021 at 2:25 PM Bill Woodcock <woody@pch.net> wrote:

...

...
On Oct 4, 2021, at 11:10 PM, Bill Woodcock <woody@pch.net> wrote:

They’re starting to pick themselves back up off the floor in the last two or three minutes. A few answers getting out. I imagine it’ll take a while before things stabilize, though.

aaaand we’re back:

WoodyNet-2:.ssh woody$ dig www.facebook.com @9.9.9.9

; <<>> DiG 9.10.6 <<>> www.facebook.com @9.9.9.9 ;; global options: +cmd ;; Got answer: ;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 32839 ;; flags: qr rd ra; QUERY: 1, ANSWER: 2, AUTHORITY: 0, ADDITIONAL: 1

;; OPT PSEUDOSECTION: ; EDNS: version: 0, flags:; udp: 512 ;; QUESTION SECTION: ;www.facebook.com. IN A

;; ANSWER SECTION: www.facebook.com. 3420 IN CNAME star-mini.c10r.facebook.com. star-mini.c10r.facebook.com. 6 IN A 157.240.19.35

;; Query time: 13 msec ;; SERVER: 9.9.9.9#53(9.9.9.9) ;; WHEN: Mon Oct 04 23:20:41 CEST 2021 ;; MSG SIZE rcvd: 90

-Bill

chris

9:57 p.m.

New subject: facebook outage

Hopefully this will show people they can enjoy life and survive without it and that it will be just like myspace some day :) On Mon, Oct 4, 2021 at 5:31 PM Jeff Shultz <jeffshultz@sctcweb.com> wrote:

...

Now they just need to get the site itself back up.

On Mon, Oct 4, 2021 at 2:25 PM Bill Woodcock <woody@pch.net> wrote:

...
...
On Oct 4, 2021, at 11:10 PM, Bill Woodcock <woody@pch.net> wrote:

They’re starting to pick themselves back up off the floor in the last two or three minutes. A few answers getting out. I imagine it’ll take a while before things stabilize, though.

aaaand we’re back:

WoodyNet-2:.ssh woody$ dig www.facebook.com @9.9.9.9

; <<>> DiG 9.10.6 <<>> www.facebook.com @9.9.9.9 ;; global options: +cmd ;; Got answer: ;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 32839 ;; flags: qr rd ra; QUERY: 1, ANSWER: 2, AUTHORITY: 0, ADDITIONAL: 1

;; OPT PSEUDOSECTION: ; EDNS: version: 0, flags:; udp: 512 ;; QUESTION SECTION: ;www.facebook.com. IN A

;; ANSWER SECTION: www.facebook.com. 3420 IN CNAME star-mini.c10r.facebook.com. star-mini.c10r.facebook.com. 6 IN A 157.240.19.35

;; Query time: 13 msec ;; SERVER: 9.9.9.9#53(9.9.9.9) ;; WHEN: Mon Oct 04 23:20:41 CEST 2021 ;; MSG SIZE rcvd: 90

-Bill

-- Jeff Shultz

Like us on Social Media for News, Promotions, and other information!!

<https://www.facebook.com/SCTCWEB/> [image: https://www.instagram.com/sctc_sctc/] <https://www.instagram.com/sctc_sctc/> <https://www.yelp.com/biz/sctc-stayton-3> <https://www.youtube.com/c/sctcvideos>

**** This message contains confidential information and is intended only for the individual named. If you are not the named addressee you should not disseminate, distribute or copy this e-mail. Please notify the sender immediately by e-mail if you have received this e-mail by mistake and delete this e-mail from your system. E-mail transmission cannot be guaranteed to be secure or error-free as information could be intercepted, corrupted, lost, destroyed, arrive late or incomplete, or contain viruses. The sender therefore does not accept liability for any errors or omissions in the contents of this message, which arise as a result of e-mail transmission. ****

Bill Woodcock

9:30 p.m.

New subject: facebook outage

...

On Oct 4, 2021, at 11:21 PM, Bill Woodcock <woody@pch.net> wrote:

...
On Oct 4, 2021, at 11:10 PM, Bill Woodcock <woody@pch.net> wrote:

They’re starting to pick themselves back up off the floor in the last two or three minutes. A few answers getting out. I imagine it’ll take a while before things stabilize, though.

aaaand we’re back:

WoodyNet-2:.ssh woody$ dig www.facebook.com @9.9.9.9

So that was, what… 15:50 UTC to 21:05 UTC, more or less… five hours and fifteen minutes. That’s a lot of hair burnt all the way to the scalp, and some third-degree burns beyond that. Maybe they’ll get one or two independent secondary authoritatives, so this doesn’t happen again. :-) -Bill

Mike Lyon

9:39 p.m.

New subject: facebook outage

Can't wait to see the RFO! -Mike On Mon, Oct 4, 2021 at 2:35 PM Bill Woodcock <woody@pch.net> wrote:

...

...
On Oct 4, 2021, at 11:21 PM, Bill Woodcock <woody@pch.net> wrote:

...
On Oct 4, 2021, at 11:10 PM, Bill Woodcock <woody@pch.net> wrote:

They’re starting to pick themselves back up off the floor in the last two or three minutes. A few answers getting out. I imagine it’ll take a while before things stabilize, though.

aaaand we’re back:

WoodyNet-2:.ssh woody$ dig www.facebook.com @9.9.9.9

So that was, what… 15:50 UTC to 21:05 UTC, more or less… five hours and fifteen minutes.

That’s a lot of hair burnt all the way to the scalp, and some third-degree burns beyond that.

Maybe they’ll get one or two independent secondary authoritatives, so this doesn’t happen again. :-)

-Bill

-- Mike Lyon mike.lyon@gmail.com http://www.linkedin.com/in/mlyon

Baldur Norddahl

9:41 p.m.

New subject: facebook outage

man. 4. okt. 2021 23.33 skrev Bill Woodcock <woody@pch.net>:

...

...
On Oct 4, 2021, at 11:21 PM, Bill Woodcock <woody@pch.net> wrote:

...
On Oct 4, 2021, at 11:10 PM, Bill Woodcock <woody@pch.net> wrote:

They’re starting to pick themselves back up off the floor in the last two or three minutes. A few answers getting out. I imagine it’ll take a while before things stabilize, though.

aaaand we’re back:

WoodyNet-2:.ssh woody$ dig www.facebook.com @9.9.9.9

So that was, what… 15:50 UTC to 21:05 UTC, more or less… five hours and fifteen minutes.

That’s a lot of hair burnt all the way to the scalp, and some third-degree burns beyond that.

Maybe they’ll get one or two independent secondary authoritatives, so this doesn’t happen again. :-)

We have had dns back for a while here but the site is still down. Not counting this as over yet.

Jeff Shultz

9:52 p.m.

New subject: facebook outage

On Mon, Oct 4, 2021 at 2:49 PM Baldur Norddahl <baldur.norddahl@gmail.com> wrote:

...

man. 4. okt. 2021 23.33 skrev Bill Woodcock <woody@pch.net>:

...
...
On Oct 4, 2021, at 11:21 PM, Bill Woodcock <woody@pch.net> wrote:

...
On Oct 4, 2021, at 11:10 PM, Bill Woodcock <woody@pch.net> wrote:

They’re starting to pick themselves back up off the floor in the last two or three minutes. A few answers getting out. I imagine it’ll take a while before things stabilize, though.

aaaand we’re back:

WoodyNet-2:.ssh woody$ dig www.facebook.com @9.9.9.9

So that was, what… 15:50 UTC to 21:05 UTC, more or less… five hours and fifteen minutes.

That’s a lot of hair burnt all the way to the scalp, and some third-degree burns beyond that.

Maybe they’ll get one or two independent secondary authoritatives, so this doesn’t happen again. :-)

We have had dns back for a while here but the site is still down. Not counting this as over yet.

I'm getting part of my news feed and notifications. Can't post yet, and clicking on something usually sends you back to the "Something is broken" message. So, no - it's not back up yet. But it's starting to twitch.... -- Jeff Shultz -- Like us on Social Media for News, Promotions, and other information!! <https://www.facebook.com/SCTCWEB/> <https://www.instagram.com/sctc_sctc/> <https://www.yelp.com/biz/sctc-stayton-3> <https://www.youtube.com/c/sctcvideos> _**** This message contains confidential information and is intended only for the individual named. If you are not the named addressee you should not disseminate, distribute or copy this e-mail. Please notify the sender immediately by e-mail if you have received this e-mail by mistake and delete this e-mail from your system. E-mail transmission cannot be guaranteed to be secure or error-free as information could be intercepted, corrupted, lost, destroyed, arrive late or incomplete, or contain viruses. The sender therefore does not accept liability for any errors or omissions in the contents of this message, which arise as a result of e-mail transmission. ****_

Michael Thomas

9:53 p.m.

New subject: facebook outage

On 10/4/21 2:41 PM, Baldur Norddahl wrote:

...

man. 4. okt. 2021 23.33 skrev Bill Woodcock <woody@pch.net <mailto:woody@pch.net>>:

> On Oct 4, 2021, at 11:21 PM, Bill Woodcock <woody@pch.net <mailto:woody@pch.net>> wrote: > > > >> On Oct 4, 2021, at 11:10 PM, Bill Woodcock <woody@pch.net <mailto:woody@pch.net>> wrote: >> >> They’re starting to pick themselves back up off the floor in the last two or three minutes. A few answers getting out. I imagine it’ll take a while before things stabilize, though. > > aaaand we’re back: > > WoodyNet-2:.ssh woody$ dig www.facebook.com <http://www.facebook.com> @9.9.9.9 <http://9.9.9.9>

So that was, what… 15:50 UTC to 21:05 UTC, more or less… five hours and fifteen minutes.

That’s a lot of hair burnt all the way to the scalp, and some third-degree burns beyond that.

Maybe they’ll get one or two independent secondary authoritatives, so this doesn’t happen again. :-)

We have had dns back for a while here but the site is still down. Not counting this as over yet.

I got a page to load. Probably trickling out. Mike

mike tancsa

9:58 p.m.

New subject: facebook outage (resolving)

Getting the odd message through, but DNS looks good via their Toronto, Canada pop % traceroute -q1 -I a.dns.facebook.com traceroute to star.c10r.facebook.com (31.13.80.8), 64 hops max, 48 byte packets 1 torix-core1-10G (67.43.129.248) 0.140 ms 2 facebook-a.ip4.torontointernetxchange.net (206.108.35.2) 0.880 ms 3 po103.psw02.yyz1.tfbnw.net (74.119.78.131) 0.365 ms 4 173.252.67.57 (173.252.67.57) 0.341 ms 5 edge-star-shv-01-yyz1.facebook.com (31.13.80.8) 0.263 ms % traceroute6 -q1 -I a.dns.facebook.com traceroute6 to star.c10r.facebook.com (2a03:2880:f00e:a:face:b00c:0:2) from 2607:f3e0:0:80::290, 64 hops max, 20 byte packets 1 toronto-torix-6 0.133 ms 2 facebook-a.ip6.torontointernetxchange.net 0.532 ms 3 po103.psw01.yyz1.tfbnw.net 0.322 ms 4 po1.msw1ah.01.yyz1.tfbnw.net 0.335 ms 5 edge-star6-shv-01-yyz1.facebook.com 0.267 ms % traceroute6 -q1 -I d.dns.facebook.com traceroute6 to star.c10r.facebook.com (2a03:2880:f00e:a:face:b00c:0:2) from 2607:f3e0:0:80::290, 64 hops max, 20 byte packets 1 toronto-torix-6 0.126 ms 2 facebook-a.ip6.torontointernetxchange.net 0.673 ms 3 po103.psw01.yyz1.tfbnw.net 0.349 ms 4 po1.msw1ah.01.yyz1.tfbnw.net 0.332 ms 5 edge-star6-shv-01-yyz1.facebook.com 0.264 ms % traceroute -q1 -I d.dns.facebook.com traceroute to star.c10r.facebook.com (31.13.80.8), 64 hops max, 48 byte packets 1 torix-core1-10G (67.43.129.248) 0.139 ms 2 facebook-a.ip4.torontointernetxchange.net (206.108.35.2) 41.394 ms 3 po103.psw02.yyz1.tfbnw.net (74.119.78.131) 0.285 ms 4 173.252.67.57 (173.252.67.57) 0.309 ms 5 edge-star-shv-01-yyz1.facebook.com (31.13.80.8) 0.211 ms % % host www.facebook.com a.ns.facebook.com Using domain server: Name: a.ns.facebook.com Address: 2a03:2880:f0fc:c:face:b00c:0:35#53 Aliases: www.facebook.com is an alias for star-mini.c10r.facebook.com. % host -4 www.facebook.com a.ns.facebook.com Using domain server: Name: a.ns.facebook.com Address: 129.134.30.12#53 Aliases: www.facebook.com is an alias for star-mini.c10r.facebook.com. % On 10/4/2021 5:41 PM, Baldur Norddahl wrote:

...

man. 4. okt. 2021 23.33 skrev Bill Woodcock <woody@pch.net <mailto:woody@pch.net>>:

> On Oct 4, 2021, at 11:21 PM, Bill Woodcock <woody@pch.net <mailto:woody@pch.net>> wrote: > > > >> On Oct 4, 2021, at 11:10 PM, Bill Woodcock <woody@pch.net <mailto:woody@pch.net>> wrote: >> >> They’re starting to pick themselves back up off the floor in the last two or three minutes. A few answers getting out. I imagine it’ll take a while before things stabilize, though. > > aaaand we’re back: > > WoodyNet-2:.ssh woody$ dig www.facebook.com <http://www.facebook.com> @9.9.9.9 <http://9.9.9.9>

So that was, what… 15:50 UTC to 21:05 UTC, more or less… five hours and fifteen minutes.

That’s a lot of hair burnt all the way to the scalp, and some third-degree burns beyond that.

Maybe they’ll get one or two independent secondary authoritatives, so this doesn’t happen again. :-)

We have had dns back for a while here but the site is still down. Not counting this as over yet.

Bill Woodcock

10:16 p.m.

New subject: facebook outage

...

On Oct 4, 2021, at 11:41 PM, Baldur Norddahl <baldur.norddahl@gmail.com> wrote:

man. 4. okt. 2021 23.33 skrev Bill Woodcock <woody@pch.net>:

...
On Oct 4, 2021, at 11:21 PM, Bill Woodcock <woody@pch.net> wrote:

...
On Oct 4, 2021, at 11:10 PM, Bill Woodcock <woody@pch.net> wrote:

They’re starting to pick themselves back up off the floor in the last two or three minutes. A few answers getting out. I imagine it’ll take a while before things stabilize, though.

aaaand we’re back:

WoodyNet-2:.ssh woody$ dig www.facebook.com @9.9.9.9

So that was, what… 15:50 UTC to 21:05 UTC, more or less… five hours and fifteen minutes.

That’s a lot of hair burnt all the way to the scalp, and some third-degree burns beyond that.

Maybe they’ll get one or two independent secondary authoritatives, so this doesn’t happen again. :-)

We have had dns back for a while here but the site is still down. Not counting this as over yet.

Yeah, fair enough. I went back and looked, and it looks like the BGP withdrawals were around 16:40 UTC? And as of 22:15 UTC, application-layer services still aren’t up. Which puts us at 6:35 thus far? -Bill

Bill Woodcock

10:28 p.m.

New subject: facebook outage

...

On Oct 5, 2021, at 12:16 AM, Bill Woodcock <woody@pch.net> wrote:

...
On Oct 4, 2021, at 11:41 PM, Baldur Norddahl <baldur.norddahl@gmail.com> wrote:

man. 4. okt. 2021 23.33 skrev Bill Woodcock <woody@pch.net>:

...
On Oct 4, 2021, at 11:21 PM, Bill Woodcock <woody@pch.net> wrote:

...
On Oct 4, 2021, at 11:10 PM, Bill Woodcock <woody@pch.net> wrote:

They’re starting to pick themselves back up off the floor in the last two or three minutes. A few answers getting out. I imagine it’ll take a while before things stabilize, though.

aaaand we’re back:

WoodyNet-2:.ssh woody$ dig www.facebook.com @9.9.9.9

So that was, what… 15:50 UTC to 21:05 UTC, more or less… five hours and fifteen minutes.

That’s a lot of hair burnt all the way to the scalp, and some third-degree burns beyond that.

Maybe they’ll get one or two independent secondary authoritatives, so this doesn’t happen again. :-)

We have had dns back for a while here but the site is still down. Not counting this as over yet.

Yeah, fair enough. I went back and looked, and it looks like the BGP withdrawals were around 16:40 UTC? And as of 22:15 UTC, application-layer services still aren’t up. Which puts us at 6:35 thus far?

Arrrr. It’s past midnight here, and my brain is failing to convert between three timezones accurately. My apologies. I’ll stop typing until I’ve had some sleep. Good night. -Bill

Ryan Brooks

9:50 p.m.

New subject: facebook outage

...

On Oct 4, 2021, at 4:30 PM, Bill Woodcock <woody@pch.net> wrote:

...
On Oct 4, 2021, at 11:21 PM, Bill Woodcock <woody@pch.net> wrote:

...
On Oct 4, 2021, at 11:10 PM, Bill Woodcock <woody@pch.net> wrote:

They’re starting to pick themselves back up off the floor in the last two or three minutes. A few answers getting out. I imagine it’ll take a while before things stabilize, though.

aaaand we’re back:

WoodyNet-2:.ssh woody$ dig www.facebook.com @9.9.9.9

So that was, what… 15:50 UTC to 21:05 UTC, more or less… five hours and fifteen minutes.

That’s a lot of hair burnt all the way to the scalp, and some third-degree burns beyond that.

Maybe they’ll get one or two independent secondary authoritatives, so this doesn’t happen again. :-)

DNS was a victim in this outage, not the cause.

...

-Bill

Bill Woodcock

9:54 p.m.

New subject: facebook outage

...

On Oct 4, 2021, at 11:50 PM, Ryan Brooks <ryan@hack.net> wrote: DNS was a victim in this outage, not the cause.

You are absolutely correct. However, people who don’t have this problem avoid having this problem by not putting all their DNS eggs in one basket. And then forgetting where they put the basket. -Bill

Patrick W. Gilmore

9:52 p.m.

New subject: facebook outage

On Oct 4, 2021, at 5:30 PM, Bill Woodcock <woody@pch.net> wrote:

...

On Oct 4, 2021, at 11:21 PM, Bill Woodcock <woody@pch.net> wrote:

...
On Oct 4, 2021, at 11:10 PM, Bill Woodcock <woody@pch.net> wrote:

...
They’re starting to pick themselves back up off the floor in the last two or three minutes. A few answers getting out. I imagine it’ll take a while before things stabilize, though.

aaaand we’re back:

WoodyNet-2:.ssh woody$ dig www.facebook.com @9.9.9.9

So that was, what… 15:50 UTC to 21:05 UTC, more or less… five hours and fifteen minutes.

That’s a lot of hair burnt all the way to the scalp, and some third-degree burns beyond that.

Maybe they’ll get one or two independent secondary authoritatives, so this doesn’t happen again. :-)

If by “independent” you mean “3rd party” (e.g. DynDNS), not sure what an external secondary would have done here. While their BGP was misbehaving, the app would not work even if you had a static DNS entry. And while using external / 3rd party secondaries is likely a good idea for many companies, almost none of the largest do this. These companies view it as a control issue. Giving someone outside your own employees the ability to change a DNS name is, frankly, giving another company the ability to take you down. Taking a sample of FB, cisco, Amazon, NF, Dell, Akamai, Google, MS, CF, only 2 use 3rd party resolvers. * NF uses only awsdns, so same problem, just moved to another company they do not control. * Amazon uses Ultra & Dyn. (Anyone else amused amazon.com has no authorities on Route 53? At least not from my vantage point.) That said, plenty of what people may call “big” companies do use 3rd parties, e.g. IBM, PayPal, Juniper. You want to use a 3rd party DNS, go for it. There are lots of reasons to do it. But it is not a panacea, and there are reasons not to. -- TTFN, patrick

John Lee

11:02 p.m.

New subject: facebook outage

I was seeing NXDOMAIN errors, so I wonder if they had a DNS outage of some sort?? On Mon, Oct 4, 2021 at 5:14 PM Bill Woodcock <woody@pch.net> wrote:

...

They’re starting to pick themselves back up off the floor in the last two or three minutes. A few answers getting out. I imagine it’ll take a while before things stabilize, though.

-Bill

Niels Bakker

5 Oct 5 Oct

9:17 a.m.

New subject: facebook outage

* jllee9753@gmail.com (John Lee) [Tue 05 Oct 2021, 01:06 CEST]:

...

I was seeing NXDOMAIN errors, so I wonder if they had a DNS outage of some sort??

Were you using host(1)? Please don't, and use dig(1) instead. There were as far as I know at no point NXDOMAINs being returned, but due to the SERVFAILs host(1) was silently appending your local search domain to your query which would lead to incorrect NXDOMAIN output. -- Niels.

Jay Hennigan

4 Oct 4 Oct

9:12 p.m.

Starting to see some prefixes return on our peering session with them in Los Angeles but DNS still not resolving. -- Jay Hennigan - jay@west.net Network Engineering - CCIE #7880 503 897-8550 - WB6RDV

Jean St-Laurent

8:47 p.m.

The glue records for the NS are set at 48 hours. dig @c.gtld-servers.net. facebook.com. NS ;; facebook.com. 172800 IN NS a.ns.facebook.com. facebook.com. 172800 IN NS b.ns.facebook.com. facebook.com. 172800 IN NS c.ns.facebook.com. facebook.com. 172800 IN NS d.ns.facebook.com. What happens if the NS aren’t back within 48 hours? Jean From: NANOG <nanog-bounces+jean=ddostest.me@nanog.org> On Behalf Of Eric Kuhnke Sent: October 4, 2021 4:33 PM To: Jay Hennigan <jay@west.net>; nanog@nanog.org list <nanog@nanog.org> Subject: Re: massive facebook outage presently I am starting to see reports that in ISPs with very large numbers of residential users, customers are starting to press the factory-reset buttons on their home routers/modems/whatever, in an attempt to make Facebook work. This is resulting in much heavier than normal first tier support volumes. The longer it stays down the worse this is going to get. On Mon, Oct 4, 2021 at 3:30 PM Jay Hennigan <jay@west.net <mailto:jay@west.net> > wrote: On 10/4/21 12:11, bzs@theworld.com <mailto:bzs@theworld.com> wrote:

...

Although I believe it's generally true that if a company appears prominently in the news it's liable to be attacked I assume because the miscreants sit around thinking "hmm, who shall we attack today oh look at that shiny headline!" I'd hate to ascribe any altruistic motivation w/o some evidence like even a credible twitter post (maybe they posted that on FB? :-)

I personally believe that the outage was caused by human error and not something malicious. Time will tell. However, if you missed the 60 Minutes piece, it was a former employee who spoke out with some rather powerful observations. I don't think that this type of worldwide outage was caused by an outside bad actor. It is certainly within the realm of possibility that it was an inside job. In other news: https://twitter.com/disclosetv/status/1445100931947892736?s=20 -- Jay Hennigan - jay@west.net <mailto:jay@west.net> Network Engineering - CCIE #7880 503 897-8550 - WB6RDV

Bryan Fields

9:16 p.m.

On 10/4/21 4:47 PM, Jean St-Laurent via NANOG wrote:

...

dig @c.gtld-servers.net. facebook.com. NS ;; facebook.com. 172800 IN NS a.ns.facebook.com. facebook.com. 172800 IN NS b.ns.facebook.com. facebook.com. 172800 IN NS c.ns.facebook.com. facebook.com. 172800 IN NS d.ns.facebook.com.

What happens if the NS aren’t back within 48 hours?

Facebook can just update it at the registrar. $ whois facebook.com Domain Name: FACEBOOK.COM Registry Domain ID: 2320948_DOMAIN_COM-VRSN Registrar WHOIS Server: whois.registrarsafe.com Registrar URL: http://www.registrarsafe.com Updated Date: 2021-09-22T19:33:41Z Creation Date: 1997-03-29T05:00:00Z Registry Expiry Date: 2030-03-30T04:00:00Z Registrar: RegistrarSafe, LLC $ dig @c.gtld-servers.net. registrarsafe.com. NS ;; registrarsafe.com. 172800 IN NS a.ns.facebook.com. registrarsafe.com. 172800 IN NS b.ns.facebook.com. registrarsafe.com. 172800 IN NS c.ns.facebook.com. registrarsafe.com. 172800 IN NS d.ns.facebook.com. crap.... Vertical integration is a hell of a thing. -- Bryan Fields 727-409-1194 - Voice http://bryanfields.net

Matt Hoppes

8:56 p.m.

Yes, We've seen that. On 10/4/21 4:33 PM, Eric Kuhnke wrote:

...

I am starting to see reports that in ISPs with very large numbers of residential users, customers are starting to press the factory-reset buttons on their home routers/modems/whatever, in an attempt to make Facebook work. This is resulting in much heavier than normal first tier support volumes. The longer it stays down the worse this is going to get.

On Mon, Oct 4, 2021 at 3:30 PM Jay Hennigan <jay@west.net <mailto:jay@west.net>> wrote:

On 10/4/21 12:11, bzs@theworld.com <mailto:bzs@theworld.com> wrote: > > Although I believe it's generally true that if a company appears > prominently in the news it's liable to be attacked I assume because > the miscreants sit around thinking "hmm, who shall we attack today oh > look at that shiny headline!" I'd hate to ascribe any altruistic > motivation w/o some evidence like even a credible twitter post (maybe > they posted that on FB? :-)

I personally believe that the outage was caused by human error and not something malicious. Time will tell.

However, if you missed the 60 Minutes piece, it was a former employee who spoke out with some rather powerful observations. I don't think that this type of worldwide outage was caused by an outside bad actor. It is certainly within the realm of possibility that it was an inside job.

In other news:

https://twitter.com/disclosetv/status/1445100931947892736?s=20

-- Jay Hennigan - jay@west.net <mailto:jay@west.net> Network Engineering - CCIE #7880 503 897-8550 - WB6RDV

Glenn Kelley

5 Oct 5 Oct

12:18 a.m.

This is why you should have Routers that are Firmware Defaulted to your own network. ALWAYS Be it Calix or even a Mikrotik which you have setup with Netboot - having these default to your own setup is REALLY a game changer. Without it - you are rolling trucks or at minimum taking heavy call volumes. Glenn Kelley Chief cook and Bottle Washing Watcher @ Connectivity.Engineer On 10/4/2021 3:56 PM, Matt Hoppes wrote:

...

Yes, We've seen that.

On 10/4/21 4:33 PM, Eric Kuhnke wrote:

...
I am starting to see reports that in ISPs with very large numbers of residential users, customers are starting to press the factory-reset buttons on their home routers/modems/whatever, in an attempt to make Facebook work. This is resulting in much heavier than normal first tier support volumes. The longer it stays down the worse this is going to get.

On Mon, Oct 4, 2021 at 3:30 PM Jay Hennigan <jay@west.net <mailto:jay@west.net>> wrote:

    On 10/4/21 12:11, bzs@theworld.com <mailto:bzs@theworld.com> wrote:      >      > Although I believe it's generally true that if a company appears      > prominently in the news it's liable to be attacked I assume because      > the miscreants sit around thinking "hmm, who shall we attack today oh      > look at that shiny headline!" I'd hate to ascribe any altruistic      > motivation w/o some evidence like even a credible twitter post (maybe      > they posted that on FB? :-)

    I personally believe that the outage was caused by human error and not     something malicious. Time will tell.

    However, if you missed the 60 Minutes piece, it was a former employee     who spoke out with some rather powerful observations. I don't think     that     this type of worldwide outage was caused by an outside bad actor. It is     certainly within the realm of possibility that it was an inside job.

    In other news:

https://twitter.com/disclosetv/status/1445100931947892736?s=20

    --     Jay Hennigan - jay@west.net <mailto:jay@west.net>     Network Engineering - CCIE #7880     503 897-8550 - WB6RDV

Jean St-Laurent

12:34 p.m.

I don't understand how this would have helped yesterday.

...

From what is public so far, they really paint themselves in a corner with no way out. A classic, but at epic scale.

They will learn and improve for sure, but I don't understand how "firmware default to your own network" would have help here. Can you elaborate a bit please? Jean -----Original Message----- From: NANOG <nanog-bounces+jean=ddostest.me@nanog.org> On Behalf Of Glenn Kelley Sent: October 4, 2021 8:18 PM To: nanog@nanog.org Subject: Re: massive facebook outage presently This is why you should have Routers that are Firmware Defaulted to your own network. ALWAYS Be it Calix or even a Mikrotik which you have setup with Netboot - having these default to your own setup is REALLY a game changer. Without it - you are rolling trucks or at minimum taking heavy call volumes. Glenn Kelley Chief cook and Bottle Washing Watcher @ Connectivity.Engineer On 10/4/2021 3:56 PM, Matt Hoppes wrote:

...

Yes, We've seen that.

On 10/4/21 4:33 PM, Eric Kuhnke wrote:

...
I am starting to see reports that in ISPs with very large numbers of residential users, customers are starting to press the factory-reset buttons on their home routers/modems/whatever, in an attempt to make Facebook work. This is resulting in much heavier than normal first tier support volumes. The longer it stays down the worse this is going to get.

On Mon, Oct 4, 2021 at 3:30 PM Jay Hennigan <jay@west.net <mailto:jay@west.net>> wrote:

On 10/4/21 12:11, bzs@theworld.com <mailto:bzs@theworld.com> wrote: > > Although I believe it's generally true that if a company appears > prominently in the news it's liable to be attacked I assume because > the miscreants sit around thinking "hmm, who shall we attack today oh > look at that shiny headline!" I'd hate to ascribe any altruistic > motivation w/o some evidence like even a credible twitter post (maybe > they posted that on FB? :-)

I personally believe that the outage was caused by human error and not something malicious. Time will tell.

However, if you missed the 60 Minutes piece, it was a former employee who spoke out with some rather powerful observations. I don't think that this type of worldwide outage was caused by an outside bad actor. It is certainly within the realm of possibility that it was an inside job.

In other news:

https://twitter.com/disclosetv/status/1445100931947892736?s=20

-- Jay Hennigan - jay@west.net <mailto:jay@west.net> Network Engineering - CCIE #7880 503 897-8550 - WB6RDV

Mark Tinka

4:29 a.m.

On 10/4/21 22:33, Eric Kuhnke wrote:

...

I am starting to see reports that in ISPs with very large numbers of residential users, customers are starting to press the factory-reset buttons on their home routers/modems/whatever, in an attempt to make Facebook work. This is resulting in much heavier than normal first tier support volumes. The longer it stays down the worse this is going to get.

A relative could not understand how she received an e-mail from me yesterday about a family matter (via GMail) when the Internet was down :-). Mark.

richey goldberg

4 Oct 4 Oct

8:41 p.m.

No evidence that it’s intentional but….. What’s going to be the big headline tonight? A Facebook whistleblower or a global outage that kept everyone from arguing all day long? -richey From: NANOG <nanog-bounces+richey.goldberg=gmail.com@nanog.org> on behalf of Jay Hennigan <jay@west.net> Date: Monday, October 4, 2021 at 3:30 PM To: nanog@nanog.org <nanog@nanog.org> Subject: Re: massive facebook outage presently On 10/4/21 12:11, bzs@theworld.com wrote:

...

Although I believe it's generally true that if a company appears prominently in the news it's liable to be attacked I assume because the miscreants sit around thinking "hmm, who shall we attack today oh look at that shiny headline!" I'd hate to ascribe any altruistic motivation w/o some evidence like even a credible twitter post (maybe they posted that on FB? :-)

richey goldberg

4:55 p.m.

In other news worker productivity is up 100% today. -richey From: NANOG <nanog-bounces+richey.goldberg=gmail.com@nanog.org> on behalf of Jason Kuehl <jason.w.kuehl@gmail.com> Date: Monday, October 4, 2021 at 12:45 PM To: Mel Beckman <mel@beckman.org> Cc: nanog@nanog.org list <nanog@nanog.org> Subject: Re: massive facebook outage presently Looks like they run there own nameservers and I see the soa records are even missing. On Mon, Oct 4, 2021, 12:23 PM Mel Beckman <mel@beckman.org<mailto:mel@beckman.org>> wrote: Here’s a screenshot: Error! Filename not specified. -mel beckman On Oct 4, 2021, at 9:06 AM, Eric Kuhnke <eric.kuhnke@gmail.com<mailto:eric.kuhnke@gmail.com>> wrote: https://downdetector.com/status/facebook/ Normally not worth mentioning random $service having an outage here, but this will undoubtedly generate a large volume of customer service calls. Appears to be failure in DNS resolution.

Casey Russell

5:18 p.m.

...

...
In other news worker productivity is up 100% today.

For everyone except IT workers. Although, I suppose if you're just counting the number of tickets they can quickly clear by sending out a "It's the internet, not us". You could count that as increased productivity. Sincerely, Casey Russell Network Engineer <http://www.kanren.net> 785-856-9809 2029 Becker Drive, Suite 282 Lawrence, Kansas 66047 XSEDE Campus Champion Certified Software Carpentry Instructor need support? <support@kanren.net> On Mon, Oct 4, 2021 at 12:14 PM richey goldberg <richey.goldberg@gmail.com> wrote:

...

In other news worker productivity is up 100% today.

-richey

*From: *NANOG <nanog-bounces+richey.goldberg=gmail.com@nanog.org> on behalf of Jason Kuehl <jason.w.kuehl@gmail.com> *Date: *Monday, October 4, 2021 at 12:45 PM *To: *Mel Beckman <mel@beckman.org> *Cc: *nanog@nanog.org list <nanog@nanog.org> *Subject: *Re: massive facebook outage presently

Looks like they run there own nameservers and I see the soa records are even missing.

On Mon, Oct 4, 2021, 12:23 PM Mel Beckman <mel@beckman.org> wrote:

Here’s a screenshot:

*Error! Filename not specified.*

-mel beckman

On Oct 4, 2021, at 9:06 AM, Eric Kuhnke <eric.kuhnke@gmail.com> wrote:

https://downdetector.com/status/facebook/

Normally not worth mentioning random $service having an outage here, but this will undoubtedly generate a large volume of customer service calls.

Appears to be failure in DNS resolution.

Billy Croan

6:30 p.m.

I know what this is..... They forgot to update the credit card on their godaddy account and the domain lapsed. I guess it will be facebook.info when they get it back online. The post mortem should be an interesting read. On Mon, Oct 4, 2021 at 11:46 AM Jason Kuehl <jason.w.kuehl@gmail.com> wrote:

...

Looks like they run there own nameservers and I see the soa records are even missing.

On Mon, Oct 4, 2021, 12:23 PM Mel Beckman <mel@beckman.org> wrote:

...
Here’s a screenshot:

-mel beckman

On Oct 4, 2021, at 9:06 AM, Eric Kuhnke <eric.kuhnke@gmail.com> wrote:

https://downdetector.com/status/facebook/

Normally not worth mentioning random $service having an outage here, but this will undoubtedly generate a large volume of customer service calls.

Appears to be failure in DNS resolution.

Baldur Norddahl

6:41 p.m.

I got a mail that Facebook was leaving NLIX. Maybe someone botched the script so they took down all BGP sessions instead of just NLIX and now they can't access the equipment to put it back... :-) man. 4. okt. 2021 20.31 skrev Billy Croan <BCroan@unrealservers.net>:

...

I know what this is..... They forgot to update the credit card on their godaddy account and the domain lapsed. I guess it will be facebook.info when they get it back online. The post mortem should be an interesting read.

On Mon, Oct 4, 2021 at 11:46 AM Jason Kuehl <jason.w.kuehl@gmail.com> wrote:

...
Looks like they run there own nameservers and I see the soa records are even missing.

On Mon, Oct 4, 2021, 12:23 PM Mel Beckman <mel@beckman.org> wrote:

...
Here’s a screenshot:

-mel beckman

On Oct 4, 2021, at 9:06 AM, Eric Kuhnke <eric.kuhnke@gmail.com> wrote:

https://downdetector.com/status/facebook/

Normally not worth mentioning random $service having an outage here, but this will undoubtedly generate a large volume of customer service calls.

Appears to be failure in DNS resolution.

Blake Dunlap

6:44 p.m.

You laugh but that kind of sounds like what happened so far as oops we isolated prod and are scrambling on DR. There was someone supposedly live tweeting from their incident response for a bit before their account panic deleted. On Mon, Oct 4, 2021, 13:42 Baldur Norddahl <baldur.norddahl@gmail.com> wrote:

...

I got a mail that Facebook was leaving NLIX. Maybe someone botched the script so they took down all BGP sessions instead of just NLIX and now they can't access the equipment to put it back... :-)

man. 4. okt. 2021 20.31 skrev Billy Croan <BCroan@unrealservers.net>:

...
I know what this is..... They forgot to update the credit card on their godaddy account and the domain lapsed. I guess it will be facebook.info when they get it back online. The post mortem should be an interesting read.

On Mon, Oct 4, 2021 at 11:46 AM Jason Kuehl <jason.w.kuehl@gmail.com> wrote:

...
Looks like they run there own nameservers and I see the soa records are even missing.

On Mon, Oct 4, 2021, 12:23 PM Mel Beckman <mel@beckman.org> wrote:

...
Here’s a screenshot:

-mel beckman

On Oct 4, 2021, at 9:06 AM, Eric Kuhnke <eric.kuhnke@gmail.com> wrote:

https://downdetector.com/status/facebook/

Normally not worth mentioning random $service having an outage here, but this will undoubtedly generate a large volume of customer service calls.

Appears to be failure in DNS resolution.

Jason Kuehl

6:52 p.m.

I mean, you're an idiot if you post that public on the internet about your own place of work. What do you think would happen? Nothing? He should never of said anything, but now the Facebook hitman got him. Facebook will have to send out a Reason For Outage with all the services it's effecting, like login On Mon, Oct 4, 2021 at 2:46 PM Blake Dunlap <ikiris@gmail.com> wrote:

...

You laugh but that kind of sounds like what happened so far as oops we isolated prod and are scrambling on DR. There was someone supposedly live tweeting from their incident response for a bit before their account panic deleted.

On Mon, Oct 4, 2021, 13:42 Baldur Norddahl <baldur.norddahl@gmail.com> wrote:

...
I got a mail that Facebook was leaving NLIX. Maybe someone botched the script so they took down all BGP sessions instead of just NLIX and now they can't access the equipment to put it back... :-)

man. 4. okt. 2021 20.31 skrev Billy Croan <BCroan@unrealservers.net>:

...
I know what this is..... They forgot to update the credit card on their godaddy account and the domain lapsed. I guess it will be facebook.info when they get it back online. The post mortem should be an interesting read.

On Mon, Oct 4, 2021 at 11:46 AM Jason Kuehl <jason.w.kuehl@gmail.com> wrote:

...
Looks like they run there own nameservers and I see the soa records are even missing.

On Mon, Oct 4, 2021, 12:23 PM Mel Beckman <mel@beckman.org> wrote:

...
Here’s a screenshot:

-mel beckman

On Oct 4, 2021, at 9:06 AM, Eric Kuhnke <eric.kuhnke@gmail.com> wrote:

https://downdetector.com/status/facebook/

Normally not worth mentioning random $service having an outage here, but this will undoubtedly generate a large volume of customer service calls.

Appears to be failure in DNS resolution.

-- Sincerely, Jason W Kuehl Cell 920-419-8983 jason.w.kuehl@gmail.com

Matthew Petach

8:26 p.m.

On Mon, Oct 4, 2021 at 11:59 AM Jason Kuehl <jason.w.kuehl@gmail.com> wrote:

...

I mean, you're an idiot if you post that public on the internet about your own place of work. What do you think would happen? Nothing? He should never of said anything, but now the Facebook hitman got him.

Some of us have done that, and survived[0]. But I would be the first to admit I've led a very charmed life in that regard. ^_^; Matt [0] https://www.computerworld.com/article/2529621/networking-glitch-knocks-yahoo...

Sabri Berisha

6:45 p.m.

----- On Oct 4, 2021, at 11:41 AM, Baldur Norddahl baldur.norddahl@gmail.com wrote: Hi,

...

I got a mail that Facebook was leaving NLIX. Maybe someone botched the script so they took down all BGP sessions instead of just NLIX and now they can't access the equipment to put it back... :-)

That's an interesting theory. Once upon a time I saw a billion dollar company suffer a significant outage after enabling EVPN on a remote site. Took down the entire backbone, including access to the site. Thanks, Sabri

Tony Wicks

6:56 p.m.

Didn't write that part of the automation script and that coder left...

...

I got a mail that Facebook was leaving NLIX. Maybe someone botched the script so they took down all BGP sessions instead of just NLIX and now they can't access the equipment to put it back... :-)

Jason Kuehl

7:05 p.m.

https://twitter.com/disclosetv/status/1445100931947892736?s=20 On Mon, Oct 4, 2021 at 3:01 PM Tony Wicks <tony@wicks.co.nz> wrote:

...

Didn't write that part of the automation script and that coder left...

...
I got a mail that Facebook was leaving NLIX. Maybe someone botched the script so they took down all BGP sessions instead of just NLIX and now they can't access the equipment to put it back... :-)

-- Sincerely, Jason W Kuehl Cell 920-419-8983 jason.w.kuehl@gmail.com

Hank Nussbacher

8:44 p.m.

On 04/10/2021 22:05, Jason Kuehl wrote: BGP related: https://twitter.com/SGgrc/status/1445116435731296256 as also related by FB CTO: https://twitter.com/atoonk/status/1445121351707070468 -Hank

...

https://twitter.com/disclosetv/status/1445100931947892736?s=20 <https://twitter.com/disclosetv/status/1445100931947892736?s=20>

On Mon, Oct 4, 2021 at 3:01 PM Tony Wicks <tony@wicks.co.nz <mailto:tony@wicks.co.nz>> wrote:

Didn't write that part of the automation script and that coder left...

> I got a mail that Facebook was leaving NLIX. Maybe someone botched the > script so they took down all BGP sessions instead of just NLIX and now > they can't access the equipment to put it back... :-)

-- Sincerely,

Jason W Kuehl Cell 920-419-8983 jason.w.kuehl@gmail.com <mailto:jason.w.kuehl@gmail.com>

David Andrzejewski

9:02 p.m.

I find it hilarious and ironic that their CTO had to use a competitor’s platform to confirm their outage. - dave

...

On Oct 4, 2021, at 16:45, Hank Nussbacher <hank@interall.co.il> wrote:

On 04/10/2021 22:05, Jason Kuehl wrote:

BGP related: https://twitter.com/SGgrc/status/1445116435731296256 as also related by FB CTO: https://twitter.com/atoonk/status/1445121351707070468

-Hank

...
https://twitter.com/disclosetv/status/1445100931947892736?s=20 <https://twitter.com/disclosetv/status/1445100931947892736?s=20> On Mon, Oct 4, 2021 at 3:01 PM Tony Wicks <tony@wicks.co.nz <mailto:tony@wicks.co.nz>> wrote: Didn't write that part of the automation script and that coder left... > I got a mail that Facebook was leaving NLIX. Maybe someone botched the > script so they took down all BGP sessions instead of just NLIX and now > they can't access the equipment to put it back... :-) -- Sincerely, Jason W Kuehl Cell 920-419-8983 jason.w.kuehl@gmail.com <mailto:jason.w.kuehl@gmail.com>

Luke Guillory

6:48 p.m.

From what I believe was a FB employee on Reddit, account now deleted it seems. As many of you know, DNS for FB services has been affected and this is likely a symptom of the actual issue, and that's that BGP peering with Facebook peering routers has gone down, very likely due to a configuration change that went into effect shortly before the outages happened (started roughly 1540 UTC). There are people now trying to gain access to the peering routers to implement fixes, but the people with physical access is separate from the people with knowledge of how to actually authenticate to the systems and people who know what to actually do, so there is now a logistical challenge with getting all that knowledge unified. Part of this is also due to lower staffing in data centers due to pandemic measures. I believe the original change was 'automatic' (as in configuration done via a web interface). However, now that connection to the outside world is down, remote access to those tools don't exist anymore, so the emergency procedure is to gain physical access to the peering routers and do all the configuration locally. https://twitter.com/jgrahamc/status/1445068309288951820 "About five minutes before Facebook's DNS stopped working we saw a large number of BGP changes (mostly route withdrawals) for Facebook's ASN." From: NANOG <nanog-bounces+lguillory=reservetele.com@nanog.org> On Behalf Of Baldur Norddahl Sent: Monday, October 4, 2021 1:41 PM To: NANOG <nanog@nanog.org> Subject: Re: massive facebook outage presently *External Email: Use Caution* I got a mail that Facebook was leaving NLIX. Maybe someone botched the script so they took down all BGP sessions instead of just NLIX and now they can't access the equipment to put it back... :-) man. 4. okt. 2021 20.31 skrev Billy Croan <BCroan@unrealservers.net<mailto:BCroan@unrealservers.net>>: I know what this is..... They forgot to update the credit card on their godaddy account and the domain lapsed. I guess it will be facebook.info<https://link.edgepilot.com/s/7bad5051/Di9CwLEB1E6iB_KlhyWtZA?u=http://facebook.info/> when they get it back online. The post mortem should be an interesting read. On Mon, Oct 4, 2021 at 11:46 AM Jason Kuehl <jason.w.kuehl@gmail.com<mailto:jason.w.kuehl@gmail.com>> wrote: Looks like they run there own nameservers and I see the soa records are even missing. On Mon, Oct 4, 2021, 12:23 PM Mel Beckman <mel@beckman.org<mailto:mel@beckman.org>> wrote: Here’s a screenshot: -mel beckman On Oct 4, 2021, at 9:06 AM, Eric Kuhnke <eric.kuhnke@gmail.com<mailto:eric.kuhnke@gmail.com>> wrote: https://link.edgepilot.com/s/3926b9ff/bTkszib6zUmYbE_rZxhltQ?u=https://downd... Normally not worth mentioning random $service having an outage here, but this will undoubtedly generate a large volume of customer service calls. Appears to be failure in DNS resolution. Links contained in this email have been replaced. If you click on a link in the email above, the link will be analyzed for known threats. If a known threat is found, you will not be able to proceed to the destination. If suspicious content is detected, you will see a warning.

Michael Thomas

7:55 p.m.

On 10/4/21 11:48 AM, Luke Guillory wrote:

...

I believe the original change was 'automatic' (as in configuration done via a web interface). However, now that connection to the outside world is down, remote access to those tools don't exist anymore, so the emergency procedure is to gain physical access to the peering routers and do all the configuration locally.

Assuming that this is what actually happened, what should fb have done different (beyond the obvious of not screwing up the immediate issue)? This seems like it's a single point of failure. Should all of the BGP speakers have been dual homed or something like that? Or should they not have been mixing ops and production networks? Sorry if this sounds dumb. Mike

Baldur Norddahl

8:12 p.m.

On Mon, 4 Oct 2021 at 21:58, Michael Thomas <mike@mtcc.com> wrote:

...

On 10/4/21 11:48 AM, Luke Guillory wrote:

I believe the original change was 'automatic' (as in configuration done via a web interface). However, now that connection to the outside world is down, remote access to those tools don't exist anymore, so the emergency procedure is to gain physical access to the peering routers and do all the configuration locally.

Assuming that this is what actually happened, what should fb have done different (beyond the obvious of not screwing up the immediate issue)? This seems like it's a single point of failure. Should all of the BGP speakers have been dual homed or something like that? Or should they not have been mixing ops and production networks? Sorry if this sounds dumb.

Facebook is a huge network. It is doubtful that what is going on is this simple. So I will make no guesses to what Facebook is or should be doing. However the traditional way for us small timers is to have a backdoor using someone else's network. Nowadays this could be a simple 4/5G router with a VPN, to a terminal server that allows the operator to configure the equipment through the monitor port even when the config is completely destroyed. Regards, Baldur

PJ Capelli

8:16 p.m.

Seems unlikely that FB internal controls would allow such a backdoor ... "Never to get lost, is not living" - Rebecca Solnit Sent with ProtonMail Secure Email. ‐‐‐‐‐‐‐ Original Message ‐‐‐‐‐‐‐ On Monday, October 4th, 2021 at 4:12 PM, Baldur Norddahl <baldur.norddahl@gmail.com> wrote:

...

On Mon, 4 Oct 2021 at 21:58, Michael Thomas <mike@mtcc.com> wrote:

...

...
On 10/4/21 11:48 AM, Luke Guillory wrote:

...

...
...
I believe the original change was 'automatic' (as in configuration done via a web interface). However, now that connection to the outside world is down, remote access to those tools don't exist anymore, so the emergency procedure is to gain physical access to the peering routers and do all the configuration locally.

...

...
Assuming that this is what actually happened, what should fb have done different (beyond the obvious of not screwing up the immediate issue)? This seems like it's a single point of failure. Should all of the BGP speakers have been dual homed or something like that? Or should they not have been mixing ops and production networks? Sorry if this sounds dumb.

...

Facebook is a huge network. It is doubtful that what is going on is this simple. So I will make no guesses to what Facebook is or should be doing.

...

However the traditional way for us small timers is to have a backdoor using someone else's network. Nowadays this could be a simple 4/5G router with a VPN, to a terminal server that allows the operator to configure the equipment through the monitor port even when the config is completely destroyed.

...

Regards,

...

Baldur

Baldur Norddahl

8:23 p.m.

Not in such a primitive fashion no. But they could definitely have a secondary network that will continue to work even if something goes wrong with the primary. On Mon, 4 Oct 2021 at 22:16, PJ Capelli <pjcapelli@pm.me> wrote:

...

Seems unlikely that FB internal controls would allow such a backdoor ...

"Never to get lost, is not living" - Rebecca Solnit

Sent with ProtonMail <https://protonmail.com/> Secure Email.

‐‐‐‐‐‐‐ Original Message ‐‐‐‐‐‐‐ On Monday, October 4th, 2021 at 4:12 PM, Baldur Norddahl < baldur.norddahl@gmail.com> wrote:

On Mon, 4 Oct 2021 at 21:58, Michael Thomas <mike@mtcc.com> wrote:

...
On 10/4/21 11:48 AM, Luke Guillory wrote:

I believe the original change was 'automatic' (as in configuration done via a web interface). However, now that connection to the outside world is down, remote access to those tools don't exist anymore, so the emergency procedure is to gain physical access to the peering routers and do all the configuration locally.

Assuming that this is what actually happened, what should fb have done different (beyond the obvious of not screwing up the immediate issue)? This seems like it's a single point of failure. Should all of the BGP speakers have been dual homed or something like that? Or should they not have been mixing ops and production networks? Sorry if this sounds dumb.

Facebook is a huge network. It is doubtful that what is going on is this simple. So I will make no guesses to what Facebook is or should be doing.

However the traditional way for us small timers is to have a backdoor using someone else's network. Nowadays this could be a simple 4/5G router with a VPN, to a terminal server that allows the operator to configure the equipment through the monitor port even when the config is completely destroyed.

Regards,

Baldur

Blake Dunlap

8:32 p.m.

If there isn't an undernetwork capable of being backdoored with the proper keys (I'd be shocked if there isn't - the big players have very good infra and DR people), I suspect there will be one soonish. It doesnt do much good to have DR plans and keys otherwise if you can't even get to the locks without getting on a plane. Regardless of how people may feel about the company, I just feel bad for their sres right now and am drinking one in their honor. I just want to know what an October meltdown gets called in the pm. On Mon, Oct 4, 2021, 15:24 Baldur Norddahl <baldur.norddahl@gmail.com> wrote:

...

Not in such a primitive fashion no. But they could definitely have a secondary network that will continue to work even if something goes wrong with the primary.

On Mon, 4 Oct 2021 at 22:16, PJ Capelli <pjcapelli@pm.me> wrote:

...
Seems unlikely that FB internal controls would allow such a backdoor ...

"Never to get lost, is not living" - Rebecca Solnit

Sent with ProtonMail <https://protonmail.com/> Secure Email.

‐‐‐‐‐‐‐ Original Message ‐‐‐‐‐‐‐ On Monday, October 4th, 2021 at 4:12 PM, Baldur Norddahl < baldur.norddahl@gmail.com> wrote:

On Mon, 4 Oct 2021 at 21:58, Michael Thomas <mike@mtcc.com> wrote:

...
On 10/4/21 11:48 AM, Luke Guillory wrote:

I believe the original change was 'automatic' (as in configuration done via a web interface). However, now that connection to the outside world is down, remote access to those tools don't exist anymore, so the emergency procedure is to gain physical access to the peering routers and do all the configuration locally.

Assuming that this is what actually happened, what should fb have done different (beyond the obvious of not screwing up the immediate issue)? This seems like it's a single point of failure. Should all of the BGP speakers have been dual homed or something like that? Or should they not have been mixing ops and production networks? Sorry if this sounds dumb.

Facebook is a huge network. It is doubtful that what is going on is this simple. So I will make no guesses to what Facebook is or should be doing.

However the traditional way for us small timers is to have a backdoor using someone else's network. Nowadays this could be a simple 4/5G router with a VPN, to a terminal server that allows the operator to configure the equipment through the monitor port even when the config is completely destroyed.

Regards,

Baldur

Mel Beckman

8:46 p.m.

I’m not the only one who finds this timing suspicious, Starting with the publishers of 60 Minutes themselves :-) CBS: The outage comes the morning after "60 Minutes" aired an interview with a whistleblower who said Facebook is aware of how it amplifies hate, misinformation and unrest but claimed the company hides what it knows. https://www.cbsnews.com/news/facebook-instagram-whatsapp-down-2021-10-04/ https://abcnews.go.com/Technology/facebook-instagram-users-us/story?id=80397... https://www.cnbc.com/2021/10/04/facebook-shares-drop-5percent-after-site-out... https://www.insidenova.com/headlines/facebook-instagram-down-after-60-minute... https://adage.com/article/digital-marketing-ad-tech-news/what-facebook-telli... -mel beckman On Oct 4, 2021, at 1:36 PM, Blake Dunlap <ikiris@gmail.com> wrote: If there isn't an undernetwork capable of being backdoored with the proper keys (I'd be shocked if there isn't - the big players have very good infra and DR people), I suspect there will be one soonish. It doesnt do much good to have DR plans and keys otherwise if you can't even get to the locks without getting on a plane. Regardless of how people may feel about the company, I just feel bad for their sres right now and am drinking one in their honor. I just want to know what an October meltdown gets called in the pm. On Mon, Oct 4, 2021, 15:24 Baldur Norddahl <baldur.norddahl@gmail.com<mailto:baldur.norddahl@gmail.com>> wrote: Not in such a primitive fashion no. But they could definitely have a secondary network that will continue to work even if something goes wrong with the primary. On Mon, 4 Oct 2021 at 22:16, PJ Capelli <pjcapelli@pm.me<mailto:pjcapelli@pm.me>> wrote: Seems unlikely that FB internal controls would allow such a backdoor ... "Never to get lost, is not living" - Rebecca Solnit Sent with ProtonMail<https://protonmail.com/> Secure Email. ‐‐‐‐‐‐‐ Original Message ‐‐‐‐‐‐‐ On Monday, October 4th, 2021 at 4:12 PM, Baldur Norddahl <baldur.norddahl@gmail.com<mailto:baldur.norddahl@gmail.com>> wrote: On Mon, 4 Oct 2021 at 21:58, Michael Thomas <mike@mtcc.com<mailto:mike@mtcc.com>> wrote: On 10/4/21 11:48 AM, Luke Guillory wrote: I believe the original change was 'automatic' (as in configuration done via a web interface). However, now that connection to the outside world is down, remote access to those tools don't exist anymore, so the emergency procedure is to gain physical access to the peering routers and do all the configuration locally. Assuming that this is what actually happened, what should fb have done different (beyond the obvious of not screwing up the immediate issue)? This seems like it's a single point of failure. Should all of the BGP speakers have been dual homed or something like that? Or should they not have been mixing ops and production networks? Sorry if this sounds dumb. Facebook is a huge network. It is doubtful that what is going on is this simple. So I will make no guesses to what Facebook is or should be doing. However the traditional way for us small timers is to have a backdoor using someone else's network. Nowadays this could be a simple 4/5G router with a VPN, to a terminal server that allows the operator to configure the equipment through the monitor port even when the config is completely destroyed. Regards, Baldur

Mark Tinka

5 Oct 5 Oct

4:26 a.m.

On 10/4/21 22:23, Baldur Norddahl wrote:

...

Not in such a primitive fashion no. But they could definitely have a secondary network that will continue to work even if something goes wrong with the primary.

On IPv6, no less :-). On a serious note, I can't even imagine what it takes to run a network of that scale, and plan for situations like this, if, indeed, inline access was lost. Mark.

bzs＠theworld.com

4 Oct 4 Oct

9:50 p.m.

One might think in over six hours they could point facebook.com's DNS somewhere else and put up a page with some info about the outage there, that this would be a practiced firedrill. Yeah yeah cache blah blah but it'd get around and at least would be coming from them. I'd imagine some mutual pact with google's or amazon's or microsoft's cloud server(s), for example, could handle it for a few hours, and vice-versa. I only mean a single, simple information page like the "sorry we're working on it" I saw just before they came back. Then again what else would one assume? 17:44 They seem to be back, more or less, slowly loading the usual crazy sauce. So out about 6 hours and some. -- -Barry Shein Software Tool & Die | bzs@TheWorld.com | http://www.TheWorld.com Purveyors to the Trade | Voice: +1 617-STD-WRLD | 800-THE-WRLD The World: Since 1989 | A Public Information Utility | *oo*

Niels Bakker

11:01 p.m.

* bzs@theworld.com (bzs@theworld.com) [Tue 05 Oct 2021, 00:01 CEST]:

...

I only mean a single, simple information page like the "sorry we're working on it" I saw just before they came back.

Which would subsequently receive the world's FB session cookies. -- Niels.

Doug McIntyre

11:09 p.m.

On Mon, Oct 04, 2021 at 05:50:07PM -0400, bzs@theworld.com wrote:

...

One might think in over six hours they could point facebook.com's DNS somewhere else and put up a page with some info about the outage there, that this would be a practiced firedrill.

Perhaps, if they didn't decide to be their own registrar as well and run it all on the same network as it seems. Maybe company divisions running different parts of infrastructure should be self-hosted 100% on their own, different AS, different networking points, etc. Or don't try to be everything top down all in the same company.

Łukasz Bromirski

8:27 p.m.

Dual homing won’t help you if your automation template will do „no router bgp X” and at this point session will terminate as suddenly advertisement will be withdrawn… It won’t you either if the change triggers some obscure bug in your BGP stack. I bet FB tested the change on smaller scale and everything was fine, and only then started to roll this over wider network and at that point „something” broke. Or some bug needed a moment to start cascading issues around the infra. -- ./

...

On 4 Oct 2021, at 22:00, Michael Thomas <mike@mtcc.com> wrote:

On 10/4/21 11:48 AM, Luke Guillory wrote:

...
I believe the original change was 'automatic' (as in configuration done via a web interface). However, now that connection to the outside world is down, remote access to those tools don't exist anymore, so the emergency procedure is to gain physical access to the peering routers and do all the configuration locally.

Assuming that this is what actually happened, what should fb have done different (beyond the obvious of not screwing up the immediate issue)? This seems like it's a single point of failure. Should all of the BGP speakers have been dual homed or something like that? Or should they not have been mixing ops and production networks? Sorry if this sounds dumb.

Mike

Mark Tinka

5 Oct 5 Oct

4:27 a.m.

On 10/4/21 22:27, Łukasz Bromirski wrote:

...

I bet FB tested the change on smaller scale and everything was fine, and only then started to roll this over wider network and at that point „something” broke. Or some bug needed a moment to start cascading issues around the infra.

This is the annoyance that is our world. In the lab, it's always good. Mark.

bzs＠theworld.com

4 Oct 4 Oct

9:39 p.m.

17:35EDT: I'm suddenly getting: Sorry, something went wrong. We're working on it and we'll get it fixed as soon as we can. Go Back Facebook © 2020 · Help Center and: % host facebook.com facebook.com has address 157.240.241.35 facebook.com has IPv6 address 2a03:2880:f112:182:face:b00c:0:25de facebook.com mail is handled by 10 smtpin.vvv.facebook.com. -- -Barry Shein Software Tool & Die | bzs@TheWorld.com | http://www.TheWorld.com Purveyors to the Trade | Voice: +1 617-STD-WRLD | 800-THE-WRLD The World: Since 1989 | A Public Information Utility | *oo*

Tony Wicks

9:47 p.m.

Back and working by the looks.

Mark Tinka

5 Oct 5 Oct

4:31 a.m.

On 10/4/21 20:48, Luke Guillory wrote:

...

I believe the original change was 'automatic' (as in configuration done via a web interface). However, now that connection to the outside world is down, remote access to those tools don't exist anymore, so the emergency procedure is to gain physical access to the peering routers and do all the configuration locally.

Q: What is automation? A: Breaking the network at scale. I suppose we shall know more in the coming days, but to me, it smells like a wide-scale automatic update roll-out that didn't go to plan. Isn't the first time a major content provider has suffered this; likely won't be the last. In the end, the best thing for all of us is that it is a teaching moment, and we move the needle forward. Mark.

Karl Auer

4:52 a.m.

On Tue, 2021-10-05 at 06:31 +0200, Mark Tinka wrote:

...

Q: What is automation? A: Breaking the network at scale.

P J Plauger (I think) once defined a computer as a mechanism allowing the deletion of vast quantities of irreplaceable data using simple mnemonic commands. He defined a network as a mechanism allowing the deletion of vast quantities of *other people's* irreplaceable data using simple mnemonic commands... Regards, K. -- ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ Karl Auer (kauer@biplane.com.au) http://www.biplane.com.au/kauer GPG fingerprint: 61A0 99A9 8823 3A75 871E 5D90 BADB B237 260C 9C58 Old fingerprint: 2561 E9EC D868 E73C 8AF1 49CF EE50 4B1D CCA1 5170

Niels Bakker

4 Oct 4 Oct

5:27 p.m.

* mel@beckman.org (Mel Beckman) [Mon 04 Oct 2021, 18:23 CEST]:

...

Here’s a screenshot:

[cid:3E071EF9-BBC5-44BF-865D-2EDC36E05C71-L0-001]

Please don't do this on the NANOG list. -- Niels.

Dmitry Sherman

4:09 p.m.

same problem in Israel From: NANOG [mailto:nanog-bounces+dmitry=interhost.net@nanog.org] On Behalf Of Eric Kuhnke Sent: Monday, 4 October 2021 19:03 To: nanog@nanog.org list <nanog@nanog.org> Subject: massive facebook outage presently https://downdetector.com/status/facebook/ Normally not worth mentioning random $service having an outage here, but this will undoubtedly generate a large volume of customer service calls. Appears to be failure in DNS resolution.

Tom Beecher

4:27 p.m.

Curious if there is a malicious angle after the 60 Minutes story last night. On Mon, Oct 4, 2021 at 12:26 PM Dmitry Sherman <dmitry@interhost.net> wrote:

...

same problem in Israel

*From:* NANOG [mailto:nanog-bounces+dmitry=interhost.net@nanog.org] *On Behalf Of *Eric Kuhnke *Sent:* Monday, 4 October 2021 19:03 *To:* nanog@nanog.org list <nanog@nanog.org> *Subject:* massive facebook outage presently

https://downdetector.com/status/facebook/

Normally not worth mentioning random $service having an outage here, but this will undoubtedly generate a large volume of customer service calls.

Appears to be failure in DNS resolution.

Jay Hennigan

4:58 p.m.

On 10/4/21 09:27, Tom Beecher wrote:

...

Curious if there is a malicious angle after the 60 Minutes story last night.

Hanlon's razor. "Never attribute to malice that which is adequately explained by stupidity" Corollary - "But don't rule out malice." -- Jay Hennigan - jay@west.net Network Engineering - CCIE #7880 503 897-8550 - WB6RDV

Miles Fidelman

8:43 p.m.

Jay Hennigan wrote:

...

On 10/4/21 09:27, Tom Beecher wrote:

...
Curious if there is a malicious angle after the 60 Minutes story last night.

Hanlon's razor.

"Never attribute to malice that which is adequately explained by stupidity"

Corollary - "But don't rule out malice."

"Once is happenstance. Twice is coincidence. The third*time it's*enemy action'.” -- In theory, there is no difference between theory and practice. In practice, there is. .... Yogi Berra Theory is when you know everything but nothing works. Practice is when everything works but no one knows why. In our lab, theory and practice are combined: nothing works and no one knows why. ... unknown

michael＠spears.io

4:28 p.m.

Looks bigger than DNS. Some people are saying they’ve disappeared off the DFZ. From: NANOG <nanog-bounces+michael=spears.io@nanog.org> On Behalf Of Dmitry Sherman Sent: Monday, October 4, 2021 12:10 PM To: Eric Kuhnke <eric.kuhnke@gmail.com>; nanog@nanog.org list <nanog@nanog.org> Subject: RE: massive facebook outage presently same problem in Israel From: NANOG [mailto:nanog-bounces+dmitry=interhost.net@nanog.org] On Behalf Of Eric Kuhnke Sent: Monday, 4 October 2021 19:03 To: nanog@nanog.org <mailto:nanog@nanog.org> list <nanog@nanog.org <mailto:nanog@nanog.org> > Subject: massive facebook outage presently https://downdetector.com/status/facebook/ Normally not worth mentioning random $service having an outage here, but this will undoubtedly generate a large volume of customer service calls. Appears to be failure in DNS resolution.

Tom Beecher

5:01 p.m.

They haven't completely dropped off, but the big subnets certainly have. I'm only seeing 20-odd /24s from them via the DFZ, but everything larger still directly. On Mon, Oct 4, 2021 at 12:55 PM <michael@spears.io> wrote:

...

Looks bigger than DNS. Some people are saying they’ve disappeared off the DFZ.

*From:* NANOG <nanog-bounces+michael=spears.io@nanog.org> *On Behalf Of *Dmitry Sherman *Sent:* Monday, October 4, 2021 12:10 PM *To:* Eric Kuhnke <eric.kuhnke@gmail.com>; nanog@nanog.org list < nanog@nanog.org> *Subject:* RE: massive facebook outage presently

same problem in Israel

*From:* NANOG [mailto:nanog-bounces+dmitry=interhost.net@nanog.org <nanog-bounces+dmitry=interhost.net@nanog.org>] *On Behalf Of *Eric Kuhnke *Sent:* Monday, 4 October 2021 19:03 *To:* nanog@nanog.org list <nanog@nanog.org> *Subject:* massive facebook outage presently

https://downdetector.com/status/facebook/

Normally not worth mentioning random $service having an outage here, but this will undoubtedly generate a large volume of customer service calls.

Appears to be failure in DNS resolution.

Jason Kuehl

4:29 p.m.

Yeah it looks like there dns servers are just dead. I can't get an response from them. On Mon, Oct 4, 2021, 12:26 PM Dmitry Sherman <dmitry@interhost.net> wrote:

...

same problem in Israel

*From:* NANOG [mailto:nanog-bounces+dmitry=interhost.net@nanog.org] *On Behalf Of *Eric Kuhnke *Sent:* Monday, 4 October 2021 19:03 *To:* nanog@nanog.org list <nanog@nanog.org> *Subject:* massive facebook outage presently

https://downdetector.com/status/facebook/

Normally not worth mentioning random $service having an outage here, but this will undoubtedly generate a large volume of customer service calls.

Appears to be failure in DNS resolution.

George Herbert

4:16 p.m.

And WhatsApp and Instagram. Twitter users nationwide agree anecdotally. What I’m getting is DNS failure. -George Sent from my iPhone

...

On Oct 4, 2021, at 9:07 AM, Eric Kuhnke <eric.kuhnke@gmail.com> wrote:

https://downdetector.com/status/facebook/

Normally not worth mentioning random $service having an outage here, but this will undoubtedly generate a large volume of customer service calls.

Appears to be failure in DNS resolution.

Eric Kuhnke

4:22 p.m.

Considering the massive impact of this it would be interesting to see some traffic graphs from ISPs that have PNIs with Facebook, or high volume peering sessions across an IX, showing traffic to FB falling off a cliff. On Mon, Oct 4, 2021 at 12:16 PM George Herbert <george.herbert@gmail.com> wrote:

...

And WhatsApp and Instagram. Twitter users nationwide agree anecdotally.

What I’m getting is DNS failure.

-George

Sent from my iPhone

On Oct 4, 2021, at 9:07 AM, Eric Kuhnke <eric.kuhnke@gmail.com> wrote:

https://downdetector.com/status/facebook/

Normally not worth mentioning random $service having an outage here, but this will undoubtedly generate a large volume of customer service calls.

Appears to be failure in DNS resolution.

aaron1＠gvtc.com

5:42 p.m.

Yes, embedded ISP CDN’s show a huge drop -Aaron From: NANOG <nanog-bounces+aaron1=gvtc.com@nanog.org> On Behalf Of Eric Kuhnke Sent: Monday, October 4, 2021 11:22 AM To: George Herbert <george.herbert@gmail.com>; nanog@nanog.org list <nanog@nanog.org> Subject: Re: massive facebook outage presently Considering the massive impact of this it would be interesting to see some traffic graphs from ISPs that have PNIs with Facebook, or high volume peering sessions across an IX, showing traffic to FB falling off a cliff. On Mon, Oct 4, 2021 at 12:16 PM George Herbert <george.herbert@gmail.com <mailto:george.herbert@gmail.com> > wrote: And WhatsApp and Instagram. Twitter users nationwide agree anecdotally. What I’m getting is DNS failure. -George Sent from my iPhone On Oct 4, 2021, at 9:07 AM, Eric Kuhnke <eric.kuhnke@gmail.com <mailto:eric.kuhnke@gmail.com> > wrote: https://downdetector.com/status/facebook/ Normally not worth mentioning random $service having an outage here, but this will undoubtedly generate a large volume of customer service calls. Appears to be failure in DNS resolution.

Mark Tinka

8:53 p.m.

On 10/4/21 18:22, Eric Kuhnke wrote:

...

Considering the massive impact of this it would be interesting to see some traffic graphs from ISPs that have PNIs with Facebook, or high volume peering sessions across an IX, showing traffic to FB falling off a cliff.

Our PNI's and FNA's in Africa have dipped. But PNI's in Europe have nearly doubled, even though we aren't delivering any Facebook services of note to our eyeballs. Strangest thing... Mark.

Matthew Petach

4:24 p.m.

Not so much a DNS failure as an "oops, where did the routes to our auth servers go?" moment. or, in this case, several moments... ^_^;; Matt On Mon, Oct 4, 2021 at 9:07 AM Eric Kuhnke <eric.kuhnke@gmail.com> wrote:

...

https://downdetector.com/status/facebook/

Normally not worth mentioning random $service having an outage here, but this will undoubtedly generate a large volume of customer service calls.

Appears to be failure in DNS resolution.

Adrian Minta

4:34 p.m.

https://www.youtube.com/watch?v=gmk673M5BoA On 10/4/21 7:03 PM, Eric Kuhnke wrote:

...

https://downdetector.com/status/facebook/ <https://downdetector.com/status/facebook/>

Normally not worth mentioning random $service having an outage here, but this will undoubtedly generate a large volume of customer service calls.

Appears to be failure in DNS resolution.

-- Best regards, Adrian Minta

Anne P. Mitchell, Esq.

5:07 p.m.

...

On Oct 4, 2021, at 10:34 AM, Adrian Minta <adrian.minta@gmail.com> wrote:

https://www.youtube.com/watch?v=gmk673M5BoA

Because it's helpful to know what a link is about before clicking on it, Adrian's link goes to "The Onion Movie - Internet Down" On a related note, what do you think the scene is like in FB HQ right now? (shaking head) Anne -- Outsource your email deliverability headaches to us, and get to the inbox! www.GetToTheInbox.com Anne P. Mitchell, Esq. CEO Get to the Inbox - We get you into the inbox! Author: The Email Deliverability Handbook Author: Section 6 of the Federal Email Marketing Law (CAN-SPAM) Email Marketing Deliverability and Best Practices Expert Board of Directors, Denver Internet Exchange Chair Emeritus, Asilomar Microcomputer Workshop Former Counsel: MAPS Anti-Spam Blacklist Location: Boulder, Colorado

Sabri Berisha

5:46 p.m.

----- On Oct 4, 2021, at 10:07 AM, Anne P. Mitchell, Esq. amitchell@isipp.com wrote: Hi Anne,

...

On a related note, what do you think the scene is like in FB HQ right now? (shaking head)

Very quiet, as their offices are still closed for all but essentials :) But, from experience I can tell you how that works. I assume Facebook works in a similar manner as some of my previous employers. This assumption comes from the fact that quite a number of my previous colleagues now work at Facebook in similar roles. First there is the question of detecting the outage. Obviously, Facebook will have a monitoring/SRE team that continuously monitors 1000s of metrics. They observe a number of metrics go down, and start to investigate. Most likely they will have some sort of overall technical lead (let's call this the Technical Duty Officer), that is responsible for the whole thing. Once the SRE team figured out where the problem lies, they will alert the TDO. TDO will then hit that big red button and send out alerts to the appropriate teams to jump on a bridge (let's call that the Technical Crisis Bridge), to fix the issue. If done right, whomever was on call for that team will take the lead and interface with adjoining teams, and other team members who are available to help out. Looking at how long this outage lasts, there must be either something very broken, or they're having trouble rolling back a change which was expected to not have impact. Once the issue is fixed, the TDO will write a report and submit it to the Problem Management group. This group will now contact the teams deemed responsible for the outage. This team will no have an opportunity to explain themselves during a post- mortem. Depending on the scale of the outage, the post-mortem can be a 10 minute call on a bridge with a Problem Management manager, or in the hot seat during a 60 minute meeting with a bunch of execs. I've been in that hot seat a few times. Not the most pleasurable experience. Perhaps it's time for a new career :) Thanks, Sabri

Sabri Berisha

5:49 p.m.

Hi, Oops, this was not supposed to go to the list, apologies for the clutter. Thanks, Sabri ----- On Oct 4, 2021, at 10:46 AM, Sabri Berisha sabri@cluecentral.net wrote:

...

----- On Oct 4, 2021, at 10:07 AM, Anne P. Mitchell, Esq. amitchell@isipp.com

1524

Age (days ago)

1526

Last active (days ago)

List overview

Download

104 comments

60 participants

participants (60)

Aaron C. de Bruyn
aaron1＠gvtc.com
Adrian Minta
Anne P. Mitchell, Esq.
Baldur Norddahl
Bill Woodcock
Billy Croan
Blake Dunlap
Bryan Fields
bzs＠theworld.com
Casey Russell
chris
Darwin Costa
David Andrzejewski
Dmitry Sherman
Doug McIntyre
Eric Kuhnke
George Herbert
George Metz
Glenn Kelley
Hank Nussbacher
Hugo Slabbert
Jamie Dahl
Jared Mauch
Jason Kuehl
Jay Hennigan
Jean St-Laurent
Jeff Shultz
jim deleskie
John Lee
Jonathan Kalbfeld
Jorge Amodio
Karl Auer
Luke Guillory
Mark Tinka
Martin List-Petersen
Matt Hoppes
Matthew Petach
Mel Beckman
Michael Spears
Michael Thomas
michael＠spears.io
Mike Lyon
mike tancsa
Miles Fidelman
Niels Bakker
None None
Patrick W. Gilmore
PJ Capelli
richey goldberg
Ryan Brooks
Sabri Berisha
Sean Donelan
Tom Beecher
tomocha
Tony Wicks
Töma Gavrichenkov
Warren Kumari
Wolfgang Tremmel
Łukasz Bromirski