On Fri, 9 Jul 2021 at 00:01, William Herrin <bill@herrin.us> wrote:
I would suggest that your customer does care, but as there is no
Most don't.

Somewhat recently we were dropping a non-trivial number of packets from a well-known book store due to DMAC failure. This was unexpected, considering it was an L3-to-L3 connection. It was an LACP bundle with a large number of interfaces, and the issue affected just one interface in the bundle.

We informed the customer about the problem while it was still occurring, but they could not observe it. They looked at their stats, and whatever was being dropped was drowned in the noise; it was not an actionable signal to them. The customer wasn't willing to remove the broken interface from the bundle, as they could not observe the problem.

We did migrate that port to a working port, and after 3 months we agreed with the vendor to stop troubleshooting it. The vendor can see that they had misprogrammed their hardware, but they were not able to figure out why, and therefore it is not fixed. A very large number of cycles were spent at the vendor and the operator, and a small amount of work (checking TCP resends etc.) at the customer, trying to solve it.

The reason we contacted the customer is that we were dropping quite a large number of packets. I can easily find 100 real, smaller problems we have in the network immediately.

The customer was /not/ wrong; the customer did exactly the right thing. There are a lot of problems, and you can go deep into the rabbit hole trying to fix problems which are real but don't affect enough packets to have a meaningful impact on product quality.

-- 
  ++ytti