On Thu, Jul 8, 2021 at 4:03 PM William Herrin <bill@herrin.us> wrote:
On Thu, Jul 8, 2021 at 5:31 AM Saku Ytti <saku@ytti.fi> wrote:
Network experiences gray failures all the time, and I almost never care, unless a customer does.
I would suggest that your customer does care, but as there is no simple test to demonstrate gray failures, your customer rarely makes it past first tier support to bring the issue to your attention and gives up trying. Indeed, name the networks with the worst reputations around here and many of them have those reputations because of a routine, uncorrected state of gray failure.
Networks originating or receiving the traffic tend to have more incentive to resolve these issues, which might not be so rare.

If you have connection- or application-level health metrics (e.g. TLS handshake failures, TCP retransmits), identifying that a problem exists is not too difficult. Having health metrics associated with network paths can greatly simplify repro. Then it's mostly troubleshooting datapath issues on your favorite platform.

It takes quite some effort to figure out and collect the relevant metrics and present them in a usable way. Something like "connections from PoP A to destination ASN/prefix (via interface X) had their TLS handshake failure rate increase from 0.02% to 1%" is a good starting point for troubleshooting (it may or may not be a network issue, but the origin/receiver probably wants to fix it regardless).

Things can get more complicated when traffic crosses network boundaries with things you don't have visibility into (IX fabric, remote peering, another network's optical systems, complicated setups like stateful firewall / MC-LAG).
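To make the per-path metric idea concrete, here is a rough sketch of how TLS handshake failure rates could be grouped by (PoP, destination ASN, egress interface) and compared against a baseline window. Everything in it is an assumption for illustration: the ConnSample record, the function names, and the thresholds (minimum sample count, 5x ratio, 0.1% floor) are hypothetical and not taken from any particular monitoring system.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, NamedTuple, Tuple


class ConnSample(NamedTuple):
    """One observed connection attempt (hypothetical record layout)."""
    pop: str          # PoP that originated the connection
    dst_asn: int      # destination ASN
    interface: str    # egress interface the traffic used
    tls_ok: bool      # True if the TLS handshake completed


PathKey = Tuple[str, int, str]
Rates = Dict[PathKey, Tuple[float, int]]  # path -> (failure rate, sample count)


def failure_rates(samples: Iterable[ConnSample]) -> Rates:
    """Aggregate the TLS handshake failure rate per network path."""
    totals: Dict[PathKey, int] = defaultdict(int)
    failures: Dict[PathKey, int] = defaultdict(int)
    for s in samples:
        key = (s.pop, s.dst_asn, s.interface)
        totals[key] += 1
        if not s.tls_ok:
            failures[key] += 1
    return {k: (failures[k] / n, n) for k, n in totals.items()}


def flag_regressions(
    baseline: Rates,
    current: Rates,
    min_samples: int = 1000,   # assumed: ignore paths with too little traffic
    ratio: float = 5.0,        # assumed: flag if rate grew more than 5x
    floor: float = 0.001,      # assumed: ignore rates below 0.1%
) -> List[Tuple[PathKey, float, float]]:
    """Return (path, baseline rate, current rate), worst current rate first."""
    flagged = []
    for key, (rate, n) in current.items():
        if key not in baseline or n < min_samples:
            continue
        base_rate, base_n = baseline[key]
        if base_n < min_samples:
            continue
        if rate >= floor and rate > ratio * base_rate:
            flagged.append((key, base_rate, rate))
    return sorted(flagged, key=lambda item: item[2], reverse=True)


if __name__ == "__main__":
    def window(pop: str, asn: int, ifname: str, total: int, failed: int):
        # Build a synthetic window of samples for one path.
        return [ConnSample(pop, asn, ifname, i >= failed) for i in range(total)]

    before = failure_rates(window("pop-a", 64500, "et-0/0/1", 10000, 2))
    after = failure_rates(window("pop-a", 64500, "et-0/0/1", 10000, 100))
    for path, old, new in flag_regressions(before, after):
        print(path, f"{old:.2%} -> {new:.2%}")
```

The minimum sample count and the rate floor are there only to keep low-traffic paths from generating noise; sensible values depend on your traffic volume. A flagged path is a starting point for troubleshooting, not proof that the network is at fault.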