Re: Do you care about "gray" failures? Can we (network academics) help? A 10-min survey

9 Jul 2021

On Thu, Jul 8, 2021 at 4:03 PM William Herrin <bill@herrin.us> wrote:
...
On Thu, Jul 8, 2021 at 5:31 AM Saku Ytti <saku@ytti.fi> wrote:
...
Network experiences gray failures all the time, and I almost never
care, unless a customer does.

I would suggest that your customer does care, but as there is no
simple test to demonstrate gray failures, your customer rarely makes
it past first tier support to bring the issue to your attention and
gives up trying. Indeed, name the networks with the worst reputations
around here and many of them have those reputations because of a
routine, uncorrected state of gray failure.

Networks originating or receiving the traffic tend to have more
incentive to resolve these issues, which might not be so rare.

If you have connection/application level health metrics (e.g. TLS
handshake failures, TCP retransmits), identifying that a problem exists
is not too difficult. Having health metrics associated with network
paths can greatly simplify repro. After that it's mostly
troubleshooting datapath issues on your favorite platform.
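
As a rough illustration of the TCP retransmit metric, here is a minimal
Python sketch that computes a retransmit ratio from text in the format
of Linux's /proc/net/snmp (the RetransSegs and OutSegs counters on the
"Tcp:" lines). The function name and the sample counter values are made
up for the example. Note these counters are host-wide and cumulative
since boot, so in practice you would diff two samples, and you would
need per-connection data to tie the number to a specific path.

```python
def tcp_retransmit_ratio(snmp_text: str) -> float:
    """Return RetransSegs / OutSegs from /proc/net/snmp-formatted text."""
    # The file has two "Tcp:" lines: a header of counter names, then values.
    tcp_lines = [line.split() for line in snmp_text.splitlines()
                 if line.startswith("Tcp:")]
    if len(tcp_lines) < 2:
        raise ValueError("expected a Tcp: header line and a Tcp: value line")
    names, values = tcp_lines[0][1:], tcp_lines[1][1:]
    counters = dict(zip(names, (int(v) for v in values)))
    out_segs = counters["OutSegs"]
    return counters["RetransSegs"] / out_segs if out_segs else 0.0


# Illustrative sample only; the numbers are invented.
SAMPLE = (
    "Tcp: RtoAlgorithm RtoMin RtoMax MaxConn ActiveOpens PassiveOpens "
    "AttemptFails EstabResets CurrEstab InSegs OutSegs RetransSegs "
    "InErrs OutRsts InCsumErrors\n"
    "Tcp: 1 200 120000 -1 100 50 2 1 10 9000 10000 50 0 3 0\n"
)

if __name__ == "__main__":
    print(f"retransmit ratio: {tcp_retransmit_ratio(SAMPLE):.2%}")
```

On a Linux host you could feed it open("/proc/net/snmp").read() instead
of the sample string.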

It takes quite some effort to figure out and collect the relevant
metrics and present them in a usable way. Something like "connections
from PoP A to destination ASN/prefix (via interface X) saw the TLS
handshake failure rate increase from 0.02% to 1%" is a good starting
point for troubleshooting (it may or may not be a network issue, but
the origin/receiver probably wants to fix it regardless).
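
To make the per-path idea concrete, here is a minimal Python sketch,
not a description of any particular production system. It groups
handshake records by (PoP, destination ASN, egress interface), computes
the failure rate per path for a baseline window and a current window,
and reports paths whose rate jumped. The record layout, the function
names, and the thresholds (min_samples, ratio, floor) are all
assumptions chosen for illustration.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, NamedTuple, Tuple


class Handshake(NamedTuple):
    pop: str          # PoP that originated the connection
    dest_asn: int     # destination ASN
    interface: str    # egress interface used
    ok: bool          # True if the TLS handshake completed


PathKey = Tuple[str, int, str]


def failure_rates(records: Iterable[Handshake]) -> Dict[PathKey, Tuple[float, int]]:
    """Map each path to (failure rate, sample count)."""
    totals: Dict[PathKey, int] = defaultdict(int)
    failures: Dict[PathKey, int] = defaultdict(int)
    for r in records:
        key = (r.pop, r.dest_asn, r.interface)
        totals[key] += 1
        if not r.ok:
            failures[key] += 1
    return {key: (failures[key] / n, n) for key, n in totals.items()}


def regressions(baseline: Iterable[Handshake], current: Iterable[Handshake],
                min_samples: int = 1000, ratio: float = 5.0,
                floor: float = 0.001) -> List[str]:
    """Describe paths whose failure rate rose by at least `ratio` times."""
    base = failure_rates(baseline)
    findings = []
    for key, (rate, n) in failure_rates(current).items():
        if key not in base or n < min_samples:
            continue  # no baseline, or too few samples to trust
        base_rate, base_n = base[key]
        if base_n < min_samples:
            continue
        # `floor` suppresses alerts on rates that are still negligible.
        if rate >= floor and rate >= ratio * base_rate:
            pop, asn, ifc = key
            findings.append(
                f"{pop} -> AS{asn} via {ifc}: TLS handshake failure rate "
                f"{base_rate:.2%} -> {rate:.2%}"
            )
    return findings
```

With 2 failures out of 10,000 handshakes in the baseline and 100 out of
10,000 in the current window for the same path, this reports the
0.02% -> 1.00% change from the example above. A fixed ratio threshold
is the simplest possible choice; anything statistical would need more
care about sample sizes.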

Things can get more complicated when traffic crosses network
boundaries and passes through things you don't have visibility into
(IX fabric, remote peering, another network's optical systems,
complicated setups like stateful firewalls / MC-LAG).

Yang Yu