On Thu, Jul 8, 2021 at 8:32 AM Saku Ytti <saku@ytti.fi> wrote:
On Thu, 8 Jul 2021 at 15:00, Vanbever Laurent <lvanbever@ethz.ch> wrote:
Detecting whole-link and node failures is relatively easy nowadays (e.g., using BFD). But what about detecting gray failures that only affect a *subset* of the traffic, e.g. a router randomly dropping 0.1% of the packets? Does your network often experience these gray failures? Are they problematic? Do you care? And can we (network researchers) do anything about it?
Networks experience gray failures all the time, and I almost never care, unless a customer does. If a network appears not to experience these, that is more likely due to a lack of visibility than to the issues not existing.
I think that some of it depends on the type of failure -- for example, some devices hash packets across an internal switch fabric, and so the failure manifests as a persistent issue for a specific 5-tuple (or between a pair of 5-tuples). If this affects one in a thousand flows, it is likely more annoying than one in a thousand random packets being dropped. But, yes, all networks drop some set of packets some percentage of the time (cue the "SEU caused by cosmic rays" response :-)) W
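To make the flow-versus-packet distinction concrete, here is a minimal sketch and not a model of any real device. The hash function, the link count, the choice of broken link and the flow tuples are all assumptions made up for illustration. It compares per-flow loss when one fabric link is bad against the same kind of loss spread randomly over packets.

```python
import hashlib
import random

def fabric_link(flow, num_links):
    """Map a 5-tuple to an internal fabric link (hypothetical hash, for illustration)."""
    key = "|".join(str(field) for field in flow).encode()
    return int.from_bytes(hashlib.sha256(key).digest()[:4], "big") % num_links

def per_flow_loss_hashed(flows, num_links, bad_link):
    """Assume one broken fabric link: every packet hashed onto it is lost."""
    return {f: 1.0 if fabric_link(f, num_links) == bad_link else 0.0 for f in flows}

def per_flow_loss_random(flows, packets_per_flow, drop_prob, rng):
    """Assume each packet is dropped independently with probability drop_prob."""
    loss = {}
    for f in flows:
        dropped = sum(rng.random() < drop_prob for _ in range(packets_per_flow))
        loss[f] = dropped / packets_per_flow
    return loss

def make_flows(count, rng):
    """Generate made-up (src, dst, sport, dport, proto) tuples."""
    return [
        (f"10.0.0.{rng.randrange(1, 255)}", f"192.0.2.{rng.randrange(1, 255)}",
         rng.randrange(1024, 65536), 443, "tcp")
        for _ in range(count)
    ]

if __name__ == "__main__":
    rng = random.Random(1)
    flows = make_flows(1000, rng)
    hashed = per_flow_loss_hashed(flows, num_links=16, bad_link=3)
    rand = per_flow_loss_random(flows, packets_per_flow=100, drop_prob=0.001, rng=rng)
    print("flows fully broken (hashed):", sum(v == 1.0 for v in hashed.values()))
    print("worst per-flow loss (random):", max(rand.values()))
```

In the hashed case a flow is either untouched or persistently broken, because the same 5-tuple always lands on the same link. In the random case many flows may see an occasional drop, but no single flow is singled out, which retransmission usually hides.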
Fixing these can take months of working with vendors, and attempts at a remedy will usually cause planned or unplanned outages. So it rarely makes sense to try to fix them, as they usually impact a trivial amount of traffic.
Networks also routinely mangle packets in memory, and that corruption is not visible to the FCS check.
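A rough sketch of why the FCS misses this, under simplifying assumptions: the FCS covers a frame on one link only, so a device that verifies it on ingress and computes a fresh one on egress will happily protect whatever is in its buffer at that point. The function names are hypothetical, and zlib.crc32 stands in for the Ethernet CRC-32 without modelling framing or header rewrites.

```python
import zlib

def fcs(frame: bytes) -> int:
    """CRC-32 over the frame bytes, standing in for the Ethernet FCS."""
    return zlib.crc32(frame) & 0xFFFFFFFF

def flip_bit(data: bytes, bit_index: int) -> bytes:
    """Return a copy of data with one bit flipped, simulating a memory error."""
    buf = bytearray(data)
    buf[bit_index // 8] ^= 1 << (bit_index % 8)
    return bytes(buf)

def forward(frame: bytes, ingress_fcs: int, corrupt_bit=None):
    """Hypothetical forwarding step: verify on ingress, buffer, regenerate FCS on egress."""
    if fcs(frame) != ingress_fcs:
        raise ValueError("FCS mismatch on ingress, frame discarded")
    buffered = frame
    if corrupt_bit is not None:
        # The corruption happens after the ingress check and before the egress FCS.
        buffered = flip_bit(buffered, corrupt_bit)
    return buffered, fcs(buffered)

if __name__ == "__main__":
    original = b"payload crossing a router"
    sent, egress_fcs = forward(original, fcs(original), corrupt_bit=5)
    print("payload changed:", sent != original)
    print("next hop FCS check passes:", fcs(sent) == egress_fcs)
```

The next hop sees a valid FCS on a damaged frame, so only an end-to-end check (a transport checksum or an application-level integrity check) has a chance of noticing.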
-- ++ytti
-- The computing scientist’s main challenge is not to get confused by the complexities of his own making. -- E. W. Dijkstra