Re: Facebook post-mortems...

5 Oct 2021

      On Tue, Oct 5, 2021 at 8:57 AM Kain, Becki (.) <bkain1@ford.com> wrote:
...
Why ever would have a card reader on your external facing network, if that
was really the case why they couldn't get in to fix it?

Let's hypothesize for a moment.

Let's suppose you've decided that certificate-based
authentication is the cat's meow, and so you've got
dot1x authentication on every network port in your
corporate environment, all your users are authenticated
via certificates, all properly signed all the way up the
chain to the root trust anchor.

Life is good.

But then you have a bad network day.  Suddenly,
you can't talk to upstream registries/registrars,
you can't reach the trust anchor for your certificates,
and you discover that all the laptops plugged into
your network switches are failing to validate their
authenticity; sure, you're on the network, but you're
in a guest vlan, with no access.  Your user credentials
can't be validated, so you're stuck with the
base level of access, which doesn't let you into the
OOB network.

Turns out your card readers were all counting on
dot1x authentication to get them into the right vlan
as well, and with the network buggered up, the
switches can't validate *their* certificates either,
so the door badge card readers just flash their
LEDs impotently when you wave your badge at
them.
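
To make that failure mode concrete, here's a toy sketch of the
port decision described above.  It is not any vendor's dot1x
implementation; the vlan names and the assign_vlan() and
can_reach_oob() helpers are made up for illustration.  The
point is just that "couldn't reach the validation backend"
gets treated the same as "not authenticated", for laptops
and badge readers alike.

```python
from enum import Enum


class AuthResult(Enum):
    ACCEPT = "accept"            # certificate validated
    REJECT = "reject"            # certificate explicitly refused
    UNREACHABLE = "unreachable"  # validation backend could not be contacted


# Hypothetical VLAN names, for illustration only.
CORP_VLAN = "corp"
GUEST_VLAN = "guest"


def assign_vlan(result: AuthResult) -> str:
    """Anything short of an explicit accept lands the port in the guest vlan."""
    return CORP_VLAN if result is AuthResult.ACCEPT else GUEST_VLAN


def can_reach_oob(vlan: str) -> bool:
    """In this sketch only the corporate vlan has a path to the OOB network."""
    return vlan == CORP_VLAN


# On a bad network day every device gets the same answer from the switch.
for device in ("engineer-laptop", "door-badge-reader"):
    vlan = assign_vlan(AuthResult.UNREACHABLE)
    print(f"{device}: vlan={vlan} oob_access={can_reach_oob(vlan)}")
```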

Remember, one attribute of certificates is that they are
designated as valid for a particular domain, or set of
subdomains with a wildcard; that is, an authenticator needs
to know where the certificate is being presented to know if
it is valid within that scope or not.   You can do that scope
validation through several different mechanisms,
such as through a chain of trust to a certificate authority,
or through DNSSEC with DANE--but fundamentally,
all certificates have a scope within which they are valid,
and a means to identify in which scope they are being
used.  And whether your certificate chain of trust is
being determined by certificate authorities or DANE,
both require that trust to be validated by something
other than the client and server alone--which generally
makes them dependent on some level of external
network connectivity being present in order to properly
function.   [yes, yes, we can have a side discussion about
having every authentication server self-sign certificates
as its own CA, and thus eliminate external network
connectivity dependencies--but that's an administrative
nightmare that I don't think any large organization would
sign up for.]
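
As a rough illustration of the "scope" idea only, here's a
simplified name check: an exact match, or a wildcard in the
leftmost label that covers exactly one label.  The function
name name_in_scope() is made up, and real validators do
considerably more (partial wildcards, public-suffix rules,
and of course the chain-of-trust checks that are the part
needing the network).

```python
def name_in_scope(hostname: str, pattern: str) -> bool:
    """Return True if hostname falls within the scope named by pattern.

    Simplified: a '*' is honored only as the entire leftmost label,
    and it stands in for exactly one label.
    """
    host_labels = hostname.lower().rstrip(".").split(".")
    pat_labels = pattern.lower().rstrip(".").split(".")
    if len(host_labels) != len(pat_labels):
        return False
    if pat_labels[0] == "*":
        # Wildcard: everything to the right of the first label must match.
        return host_labels[1:] == pat_labels[1:]
    return host_labels == pat_labels


print(name_in_scope("radius.corp.example.com", "*.corp.example.com"))
print(name_in_scope("a.b.corp.example.com", "*.corp.example.com"))
```

The second call falls outside the scope because the wildcard
only spans a single label in this sketch.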

So, all of the client certificates and authorization servers
we're talking about exist on your internal network, but they
all counted on reachability to your infrastructure
servers in order to properly authenticate and grant
access to devices and people.  If your BGP update
made your infrastructure servers, such as DNS servers,
become unreachable, then suddenly you might well
find yourself locked out both physically and logically
from your own network.

Again, this is purely hypothetical, but it's one scenario
in which a routing-level "oooooops" could end up causing
physical-entry denial, as well as logical network access
level denial, without actually having those authentication
systems on external facing networks.

Certificate-based authentication is scalable and cool, but
it's really important to think about even generally "that'll
never happen" failure scenarios when deploying it into
critical systems.  It's always good to have the "break glass
in case of emergency" network that doesn't rely on dot1x,
that works without DNS, without NTP, without RADIUS,
or any other external system, with a binder with printouts
of the IP addresses of all your really critical servers and
routers in it which gets updated a few times a year, so that
when the SHTF, a person sitting at a laptop plugged into
that network with the binder next to them can get into the
emergency-only local account on each router to fix things.
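
If you keep an electronic copy of that binder, one cheap
sanity check is making sure nothing in it quietly depends on
DNS.  A minimal sketch, assuming a made-up "name address"
one-entry-per-line format and a hypothetical load_binder()
helper; the addresses are documentation-range placeholders,
not real devices.

```python
import ipaddress


def load_binder(text: str) -> dict[str, str]:
    """Parse 'name address' lines, refusing anything that is not an IP literal."""
    entries = {}
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments and blank lines
        if not line:
            continue
        name, addr = line.split()
        # Raises ValueError for hostnames, so no entry can depend on DNS.
        ipaddress.ip_address(addr)
        entries[name] = addr
    return entries


# Hypothetical binder contents using documentation-range addresses.
BINDER = """
core-rtr-1   192.0.2.1
edge-rtr-1   2001:db8::1   # v6 loopback of the edge router
"""

for name, addr in load_binder(BINDER).items():
    print(f"{name}: {addr}")
```

Run it whenever the binder gets updated; an entry like
"dns-1 ns1.example.com" fails loudly instead of failing
at 3am.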

And yes, you want every command that local emergency-only
user types into a router to be logged, because someone
wanting to create mischief in your network is going to aim
for that account access if they can get it; so watch it like a
hawk, and the only time it had better be accessed and used
is when the big red panic button has already been hit, and
the executives are huddled around speakerphones wanting
to know just how fast you can get things working again.  ^_^;

I know nothing of the incident in question.  But sitting at home,
hypothesizing about ways in which things could go wrong, this
is one of the reasons why I still configure static emergency
accounts on network devices, even with centrally administered
account systems, and why there's always a set of "no dot1x"
ports that work to get into the OOB/management network even
when everything else has gone toes-up.   :)

So--that's one way in which an outage like this could have
locked people out of buildings.   ^_^;

Thanks!

Matt
[ready for the deluge of people pointing out I've overly simplified the
validation chain for certificates in order to keep the post short and
high-level.   ^_^; ]

Matthew Petach