Do you care about "gray" failures? Can we (network academics) help? A 10-min survey
Dear NANOG,

Detecting whole-link and node failures is relatively easy nowadays (e.g., using BFD). But what about detecting gray failures that only affect a *subset* of the traffic, e.g. a router randomly dropping 0.1% of the packets? Does your network often experience these gray failures? Are they problematic? Do you care? And can we (network researchers) do anything about it?

Please help us find out by answering our short (<10 minutes) anonymous survey.

Survey URL: https://forms.gle/v99mBNEPrLjcFCEu8

## Context:

When we think about network failures, we often think about a link or a network device going down. These failures are "obvious" in that *all* the traffic crossing the corresponding resource is dropped. But network failures can also be more subtle and only affect a *subset* of the traffic (e.g. 0.01% of the packets crossing a link/router). These failures are commonly referred to as "gray" failures.

Because they don't drop *all* the traffic, gray failures are much harder to detect. Many studies have revealed that cloud and datacenter networks routinely suffer from gray failures and, as such, many techniques exist to track them down in these environments (see e.g. this study from Microsoft Azure https://www.microsoft.com/en-us/research/wp-content/uploads/2017/06/paper-1....).

What is less known though is how much gray failures affect *other* types of networks such as Internet Service Providers (ISPs), Wide Area Networks (WANs), or enterprise networks. While the bug reports submitted to popular routing vendors (Cisco, Juniper, etc.) suggest that gray failures are pervasive and hard to catch for all networks, we would love to know more about first-hand experiences.

## About the survey:

The questionnaire is intended for network operators. It has a total of 15 questions and should take at most 10 minutes to complete. The survey and the collected data are totally anonymous (so please do not include information that may help to identify you or your organization). All questions are optional, so if you don't like a question or don't know the answer, just skip it.

Thank you so much in advance, and we look forward to reading your responses!

Laurent Vanbever, ETH Zurich

PS: Of course, we would be extremely grateful if you could forward this email to any operator you might know who may not read NANOG (assuming those even exist? :-))!
On Thu, 8 Jul 2021 at 15:00, Vanbever Laurent <lvanbever@ethz.ch> wrote:
Detecting whole-link and node failures is relatively easy nowadays (e.g., using BFD). But what about detecting gray failures that only affect a *subset* of the traffic, e.g. a router randomly dropping 0.1% of the packets? Does your network often experience these gray failures? Are they problematic? Do you care? And can we (network researchers) do anything about it?”
Network experiences gray failures all the time, and I almost never care, unless a customer does. If there is a network which does not experience these, then it's likely due to lack of visibility rather than issues not existing.

Fixing these can take months of working with vendors and attempts to remedy will usually cause planned or unplanned outages. So it rarely makes sense to try to fix as they usually impact a trivial amount of traffic.

Networks also routinely mangle packets in-memory which are not visible to FCS check.

--
++ytti
On 7/8/21 14:29, Saku Ytti wrote:
Network experiences gray failures all the time, and I almost never care, unless a customer does. If there is a network which does not experience these, then it's likely due to lack of visibility rather than issues not existing.
Fixing these can take months of working with vendors and attempts to remedy will usually cause planned or unplanned outages. So it rarely makes sense to try to fix as they usually impact a trivial amount of traffic.
Networks also routinely mangle packets in-memory which are not visible to FCS check.
I was going to say the exact same thing. +1.

It's all par for the course, which is why we get up every day :-).

I'm currently dealing with an issue that will forward a customer's traffic to/from one /24, but not the rest of their IPv4 space, including the larger allocation from which the /24 is born. It was a gray issue while the customer partially activated, and then caused us to care when they tried to fully swing over.

Mark.
On 7/8/21 14:29, Saku Ytti wrote:
Network experiences gray failures all the time, and I almost never care, unless a customer does. If there is a network which does not experience these, then it's likely due to lack of visibility rather than issues not existing.
Fixing these can take months of working with vendors and attempts to remedy will usually cause planned or unplanned outages. So it rarely makes sense to try to fix as they usually impact a trivial amount of traffic.
Networks also routinely mangle packets in-memory which are not visible to FCS check.
I was going to say the exact same thing. +1.

It's all par for the course, which is why we get up every day :-).

I'm currently dealing with an issue that will forward a customer's traffic to/from one /24, but not the rest of their IPv4 space, including the larger allocation from which the /24 is born. It was a gray issue while the customer partially activated, and then caused us to care when they tried to fully swing over.

We've had an issue that has lasted over a year but only manifested recently, where someone mistakenly wrote a static route pointing to an indirect next-hop. The router ended up resolving it and forwarding traffic, but in the process was spiking CPU in a manner that was not immediately evident from the NMS. Fixing the next-hop resolved the issue, as would improving service provisioning and troubleshooting manuals :-).

Like Saku says, there's always something, and attention to it will be granted depending on how much visible pain it causes.

Mark.
On 8 Jul 2021, at 14:59, Mark Tinka <mark@tinka.africa> wrote:
On 7/8/21 14:29, Saku Ytti wrote:
Network experiences gray failures all the time, and I almost never care, unless a customer does. If there is a network which does not experience these, then it's likely due to lack of visibility rather than issues not existing.
Fixing these can take months of working with vendors and attempts to remedy will usually cause planned or unplanned outages. So it rarely makes sense to try to fix as they usually impact a trivial amount of traffic.
Networks also routinely mangle packets in-memory which are not visible to FCS check.
I was going to say the exact same thing.
+1.
It's all par for the course, which is why we get up everyday :-).
:-)
I'm currently dealing with an issue that will forward a customer's traffic to/from one /24, but not the rest of their IPv4 space, including the larger allocation from which the /24 is born. It was a gray issue while the customer partially activated, and then caused us to care when they tried to fully swing over.
Did you folks manage to understand what was causing the gray issue in the first place?
We've had an issue that has lasted over a year but only manifested recently, where someone wrote a static route pointing to an indirect next-hop, mistakenly. The router ended up resolving it and forwarding traffic, but in the process, was spiking CPU in a manner that was not immediately evident from the NMS. Fixing the next-hop resolved the issue, as would improving service provisioning and troubleshooting manuals :-).
Interesting. I can see how hard this one is to debug, as even a relatively small amount of traffic pointing at the static route would be enough to cause the CPU spikes.
Like Saku says, there's always something, and attention to it will be granted depending on how much visible pain it causes.
Yep. Makes absolute sense.

Best,
Laurent
UUCP over TCP does work to overcome packet-size problems; it saw limited usage, but it did work in the past.

Col
We have a similar gray issue, where switches in a virtual-chassis configuration with a layer-3 configuration seem to lose transit ICMP messages like echo or echo-reply randomly. We once estimated it at around 0.00012% (give or take variance and measurement error).

We noticed this when we replaced Nagios with some more bursty, trigger-happy monitoring software a few years back. Since then, it's been reporting false positives from time to time, and this can become annoying.

Besides spending a lot of time debugging this, we never had a breakthrough in finding the root cause; we're just looking to replace things in the next year.

On 8 Jul 2021, at 15:28, Mark Tinka wrote:
On 7/8/21 15:22, Vanbever Laurent wrote:
Did you folks manage to understand what was causing the gray issue in the first place?
Nope, still chasing it. We suspect a FIB issue on a transit device, but currently building a test to confirm.
Mark.
Hi Jörg,

Thanks for sharing your gray failure! With a few years of lifespan, it might well be the oldest gray failure ever monitored continuously :-)

I'm pretty sure you guys have exhausted all options already but... did you check for micro-bursts that may cause sudden buffer overflows? Or perhaps your probing traffic is already high priority?

Best,
Laurent
On 8 Jul 2021, at 15:58, Jörg Kost <jk@ip-clear.de> wrote:
We have a similar gray issue, where switches in a virtual chassis configuration with layer3-configuration seem to lose transit ICMP messages like echo or echo-reply randomly. Once we estimated it around 0.00012% ( let alone variances, or errors in measuring ).
We noticed this when we replaced Nagios with some more bursting, trigger-happy monitoring software a few years back. Since then, it's reporting false positives from time to time, and this can become annoying.
Besides spending a lot of time debugging this, we never had a breakthrough in finding the root cause, just looking to replace things in the next year.
On 8 Jul 2021, at 15:28, Mark Tinka wrote:
On 7/8/21 15:22, Vanbever Laurent wrote:
Did you folks manage to understand what was causing the gray issue in the first place?
Nope, still chasing it. We suspect a FIB issue on a transit device, but currently building a test to confirm.
Mark.
We had a line card that would drop any IPv6 packet with bit #65 in the destination address set to 1. It turns out that only a few hosts have this bit set to 1 in their address, so nobody noticed until some Debian mirrors started to become unreachable. Also, web browsers are very good at switching to IPv4 in case of IPv6 timeouts, so nobody would notice web hosts with the problem.

And then we had to live with the problem for a while, because the device was out of warranty and marked to be replaced, but you do not just replace a router in a hurry unless you absolutely need to. You do not expect this kind of issue, and a lot of time was spent trying to find an alternate explanation for the problem.

Regards,

Baldur
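For anyone curious which of their own hosts would have been affected, a minimal sketch, assuming "bit #65" means bit index 64 counting from 0 at the most significant bit (i.e. the first bit of the interface identifier); the addresses used are documentation-prefix placeholders:

```python
import ipaddress

def bit65_set(addr: str) -> bool:
    """True if bit index 64 (0-based from the MSB) of the IPv6 address is 1."""
    value = int(ipaddress.IPv6Address(addr))
    # Bit index 64 from the MSB sits 63 positions above the least significant bit.
    return bool((value >> 63) & 1)

# First address: the interface identifier starts with a 1 bit, so it would be dropped.
print(bit65_set("2001:db8::8000:0:0:1"))   # True
print(bit65_set("2001:db8::1"))            # False
```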
On Thu, 8 Jul 2021 at 22:10, Baldur Norddahl <baldur.norddahl@gmail.com> wrote:
We had a line card that would drop any IPv6 packet with bit #65 in the destination address set to 1. Turns out that only a few hosts have this bit set to 1 in the address, so nobody noticed until some Debian mirrors started to become unreachable. Also, web browsers are very good at switching to IPv4 in case of IPv6 timeouts, so nobody would notice web hosts with the problem. And then we had to live with the problem for a while because the device was out of warranty and marked to be replaced, but you do not just replace a router in a hurry unless you absolutely need to.
Grey failures, ugh.

I heard of a colleague at a prior employer who troubleshot an issue with an extended line card, where packets to select IPv4 destination addresses would be dropped by the card. It took time, plus inserting middle-boxes from Vendor Y (packet capture for evidence), to confirm the issue and convince the vendor that their code had problems. 😫
On 8 Jul 2021, at 14:29, Saku Ytti <saku@ytti.fi> wrote:
On Thu, 8 Jul 2021 at 15:00, Vanbever Laurent <lvanbever@ethz.ch> wrote:
Detecting whole-link and node failures is relatively easy nowadays (e.g., using BFD). But what about detecting gray failures that only affect a *subset* of the traffic, e.g. a router randomly dropping 0.1% of the packets? Does your network often experience these gray failures? Are they problematic? Do you care? And can we (network researchers) do anything about it?”
Network experiences gray failures all the time, and I almost never care, unless a customer does. If there is a network which does not experience these, then it's likely due to lack of visibility rather than issues not existing.
Fixing these can take months of working with vendors and attempts to remedy will usually cause planned or unplanned outages. So it rarely makes sense to try to fix as they usually impact a trivial amount of traffic.
Thanks for chiming in. That's also my feeling: a *lot* of gray failures routinely happen, a small percentage of which end up being really damaging (the ones hitting customer traffic, as you pointed out).

For this small percentage though, I can imagine that being able to detect / locate them rapidly (i.e. before the customer submits a ticket) would be interesting? Even if fixing the root cause might take months (since it is up to the vendors), one could still hope to remediate the situation transiently by rerouting traffic, combined with the traditional rebooting of the affected resources?
Networks also routinely mangle packets in-memory which are not visible to FCS check.
Added to the list... Thanks! Best, Laurent
On Thu, 8 Jul 2021 at 16:13, Vanbever Laurent <lvanbever@ethz.ch> wrote:
Thanks for chiming in. That's also my feeling: a *lot* of gray failures routinely happen, a small percentage of which end up being really damaging (the ones hitting customer traffic, as you pointed out). For this small percentage though, I can imagine being able to detect / locate them rapidly (i.e. before the customer submit a ticket) would be interesting? Even if fixing the root cause might take up months (since it is up to the vendors), one could still hope to remediate to the situation transiently by rerouting traffic combined with the traditional rebooting of the affected resources?
One method is collecting lookup exceptions. We scrape these:

npu_triton_trapstats.py: command = "start shell sh command \"for fpc in $(cli -c 'show chassis fpc' | grep Online | awk '{print $1;}'); do echo FPC$fpc; vty -c 'show cda trapstats' fpc$fpc; done\""
ptx1k_trapstats.py: command = "start shell sh command \"for fpc in $(cli -c 'show chassis fpc' | grep Online | awk '{print $1;}'); do echo FPC$fpc; vty -c 'show pechip trapstats' fpc$fpc; done\""
asr9k_npu_counters.py: command = "show controllers np counters all"
junos_trio_exceptions.py: command = "show pfe statistics exceptions"

No need for ML or AI: trivial algorithms like 'which counter is incrementing here but not incrementing elsewhere' or 'which counter is not incrementing here but is incrementing elsewhere' surface a lot of real problems, and capturing those exceptions and reviewing them confirms it.

We do not use these to proactively find problems, as that would yield poorer overall availability. But we regularly use them to expedite time to resolution.

Very recently we had Tomahawk (EZchip) reset a whole linecard, and looking at the counters for one that was incrementing but likely should not be yielded the problem. A customer was sending us IP packets where the Ethernet header and the IP header up to the total-length field were missing on the wire; this accidentally fuzzed the NPU ucode, periodically triggering an NPU bug which causes a total LC reload when it happens often enough.
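A minimal sketch of the "which counter moves here but not elsewhere" comparison described above, assuming the trapstats have already been scraped and turned into per-FPC dictionaries of counter deltas (the scraping and parsing are platform-specific and not shown; all names below are made up):

```python
from typing import Dict, List, Tuple

def find_outliers(per_fpc: Dict[str, Dict[str, int]]) -> List[Tuple[str, List[str], str]]:
    """Flag counters that move on only a minority of cards, or sit idle on only a minority."""
    all_counters = set()
    for counters in per_fpc.values():
        all_counters.update(counters)

    outliers = []
    for name in sorted(all_counters):
        incrementing = {fpc for fpc, c in per_fpc.items() if c.get(name, 0) > 0}
        idle = set(per_fpc) - incrementing
        if incrementing and len(incrementing) < len(idle):
            outliers.append((name, sorted(incrementing), "incrementing only here"))
        elif idle and len(idle) < len(incrementing):
            outliers.append((name, sorted(idle), "not incrementing only here"))
    return outliers

# Toy deltas since the last poll; counter names are invented for illustration.
deltas = {
    "FPC0": {"discard:bad_route": 0, "exception:ttl_expired": 120},
    "FPC1": {"discard:bad_route": 4300, "exception:ttl_expired": 115},
    "FPC2": {"discard:bad_route": 0, "exception:ttl_expired": 131},
}
for name, fpcs, reason in find_outliers(deltas):
    print(f"{name}: {reason}: {fpcs}")
```

The point being that a counter moving on only one card, or idle on only one card, is usually worth a closer look.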
Networks also routinely mangle packets in-memory which are not visible to FCS check.
Added to the list... Thanks!
The only way I know how to try to find these memory corruptions is to look at the egress PE device's backbone-facing interface and see if there are IP checksum errors.

--
++ytti
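To illustrate what those interface counters are checking, a small sketch of IPv4 header checksum verification (the sample header bytes are made up, with the checksum computed to be valid):

```python
import struct

def ipv4_header_checksum_ok(header: bytes) -> bool:
    """Verify the one's-complement checksum over an IPv4 header (including options)."""
    if len(header) % 2:
        header += b"\x00"
    total = sum(struct.unpack(f"!{len(header) // 2}H", header))
    while total >> 16:                      # fold carries back into 16 bits
        total = (total & 0xFFFF) + (total >> 16)
    return total == 0xFFFF

hdr = bytes.fromhex("45000054a6f200004001519ec0a80001c0a800c7")
print(ipv4_header_checksum_ok(hdr))                    # True
print(ipv4_header_checksum_ok(b"\x00" + hdr[1:]))      # False: one corrupted byte
```

Corruption that happens inside a router after the sender computed this checksum will still fail the check downstream, even though each hop's freshly generated FCS looks fine.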
One method is collecting lookup exceptions. We scrape these:
npu_triton_trapstats.py: command = "start shell sh command \"for fpc in $(cli -c 'show chassis fpc' | grep Online | awk '{print $1;}'); do echo FPC$fpc; vty -c 'show cda trapstats' fpc$fpc; done\""
ptx1k_trapstats.py: command = "start shell sh command \"for fpc in $(cli -c 'show chassis fpc' | grep Online | awk '{print $1;}'); do echo FPC$fpc; vty -c 'show pechip trapstats' fpc$fpc; done\""
asr9k_npu_counters.py: command = "show controllers np counters all"
junos_trio_exceptions.py: command = "show pfe statistics exceptions"
No need for ML or AI, as trivial algorithms like 'what counter is incrementing which isn't incrementing elsewhere' or 'what counter is not incrementing is incrementing elsewhere' shows a lot of real problems, and capturing those exceptions and reviewing confirms them.
We do not use these to proactively find problems, as it would yield to poorer overall availability. But we regularly use them to expedite time to resolution.
Thanks for sharing! I guess this process working means the counters are "standard" / close enough across vendors to allow for comparisons?
Very recently we had Tomahawk (EZchip) reset the whole linecard and looking at counters identifying counter which is incrementing but likely should not yielded the problem. Customer was sending us IP packets, where ethernet header and IP header until total length was missing on the wire, this accidentally fuzzed the NPU ucode periodically triggering NPU bug, which causes total LC reload when it happens often enough.
Networks also routinely mangle packets in-memory which are not visible to FCS check.
Added to the list... Thanks!
The only way I know how to try to find these memory corruptions is to look at egress PE device backbone facing interface and see if there are IP checksum errors.
On Thu, 8 Jul 2021 at 17:59, Vanbever Laurent <lvanbever@ethz.ch> wrote:
Thanks for sharing! I guess this process working means the counters are "standard" / close enough across vendors to allow for comparisons?
Not at all, I'm afraid; these are not intended for user consumption, so they're generally not available via SNMP or streaming.

--
++ytti
If there is a network which does not experience these, then it's likely due to lack of visibility rather than issues not existing.
This. Full stop.

I believe there are very few, if any, production networks in existence which have a 0% rate of drops or 'weird shit'. Monitoring for said drops and weird shit is important, and knowing your traffic profiles is also important, so that when there is an intersection of 'stuff' and 'stuff that noticeably impacts traffic', you can get to the bottom of it quickly and figure out what to do.

On Thu, Jul 8, 2021 at 8:31 AM Saku Ytti <saku@ytti.fi> wrote:
On Thu, 8 Jul 2021 at 15:00, Vanbever Laurent <lvanbever@ethz.ch> wrote:
Detecting whole-link and node failures is relatively easy nowadays (e.g., using BFD). But what about detecting gray failures that only affect a *subset* of the traffic, e.g. a router randomly dropping 0.1% of the packets? Does your network often experience these gray failures? Are they problematic? Do you care? And can we (network researchers) do anything about it?”
Network experiences gray failures all the time, and I almost never care, unless a customer does. If there is a network which does not experience these, then it's likely due to lack of visibility rather than issues not existing.
Fixing these can take months of working with vendors and attempts to remedy will usually cause planned or unplanned outages. So it rarely makes sense to try to fix as they usually impact a trivial amount of traffic.
Networks also routinely mangle packets in-memory which are not visible to FCS check.
-- ++ytti
On Thu, Jul 8, 2021 at 8:32 AM Saku Ytti <saku@ytti.fi> wrote:
On Thu, 8 Jul 2021 at 15:00, Vanbever Laurent <lvanbever@ethz.ch> wrote:
Detecting whole-link and node failures is relatively easy nowadays (e.g., using BFD). But what about detecting gray failures that only affect a *subset* of the traffic, e.g. a router randomly dropping 0.1% of the packets? Does your network often experience these gray failures? Are they problematic? Do you care? And can we (network researchers) do anything about it?”
Network experiences gray failures all the time, and I almost never care, unless a customer does. If there is a network which does not experience these, then it's likely due to lack of visibility rather than issues not existing.
I think that some of it depends on the type of failure -- for example, some devices hash packets across an internal switch fabric, and so the failure manifests as persistent issues to a specific 5-tuple (or between a pair of 5-tuples). If this affects one in a thousand flows it is likely more annoying than one in a thousand random packets being dropped.

But, yes, all networks drop some set of packets some percentage of the time (cue the "SEU caused by cosmic rays" response :-))

W
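A toy illustration of that failure mode (this is not any vendor's actual hash): a deterministic function of the 5-tuple selects one of N internal paths, so one broken path hurts the same flows every time rather than a random slice of all packets.

```python
import hashlib

N_PATHS = 8          # pretend fabric with 8 internal paths
BROKEN_PATH = 3      # pretend one of them silently drops packets

def fabric_path(src_ip: str, dst_ip: str, proto: int, sport: int, dport: int) -> int:
    """Deterministically map a 5-tuple onto one of N_PATHS internal paths."""
    key = f"{src_ip}|{dst_ip}|{proto}|{sport}|{dport}".encode()
    return int.from_bytes(hashlib.sha256(key).digest()[:4], "big") % N_PATHS

flow = ("203.0.113.5", "198.51.100.7", 6, 51512, 443)
path = fabric_path(*flow)
print(f"flow {flow} always takes path {path}"
      + (" -> persistently broken" if path == BROKEN_PATH else " -> fine"))
```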
Fixing these can take months of working with vendors and attempts to remedy will usually cause planned or unplanned outages. So it rarely makes sense to try to fix as they usually impact a trivial amount of traffic.
Networks also routinely mangle packets in-memory which are not visible to FCS check.
-- ++ytti
-- The computing scientist’s main challenge is not to get confused by the complexities of his own making. -- E. W. Dijkstra
On Thu, Jul 8, 2021 at 5:31 AM Saku Ytti <saku@ytti.fi> wrote:
Network experiences gray failures all the time, and I almost never care, unless a customer does.
Greetings,

I would suggest that your customer does care, but as there is no simple test to demonstrate gray failures, your customer rarely makes it past first-tier support to bring the issue to your attention and gives up trying. Indeed, name the networks with the worst reputations around here and many of them have those reputations because of a routine, uncorrected state of gray failure.

To answer Laurent's question:

Yes, gray failures are a regular problem. Yes, most of us care. And for the most part we don't have particularly good ways to detect and isolate the problems, let alone fix them. When it's not a clean failure we really are driven by: customer says blank is broken, often followed by grueling manual effort just to duplicate the problem within our view.

Can network researchers do anything about it? Maybe. Because of the end-to-end principle, only the endpoints understand the state of the connection, and they don't know the difference between capacity and error. They mostly process that information locally, sharing only limited information with the other endpoint. Which means there's not much passing over the wire for the middle to examine and learn that there's a problem... and when there is, it often takes correlating multiple packets to understand that a problem exists which, in the stateless middle with asymmetric routing, is not usable. The middle can only look at its immediate link stats which, when there's a bug, are misleading.

What would you change to dig us out of this hole?

Regards,
Bill Herrin

--
William Herrin
bill@herrin.us
https://bill.herrin.us/
On Fri, 9 Jul 2021 at 00:01, William Herrin <bill@herrin.us> wrote:
I would suggest that your customer does care, but as there is no
Most don't.

Somewhat recently we were dropping a non-trivial number of packets from a well-known book store due to DMAC failure. This was unexpected, considering it was an L3-to-L3 connection. This was a LACP bundle with a large number of interfaces, and the issue affected just one interface in the bundle.

After we informed the customer about the problem, while it was still occurring, they could not observe it; they looked at their stats, and whatever it was dropping was drowned in the noise. It was not an actionable signal to them. The customer wasn't willing to remove the broken interface from the bundle, as they could not observe the problem. We did migrate that port to a working port, and after 3 months we agreed with the vendor to stop troubleshooting it. The vendor can see that they had misprogrammed their hardware, but they were not able to figure out why, and therefore it is not fixed. A very large number of cycles was spent at the vendor and the operator, and a small amount of work (checking TCP resends etc.) at the customer, trying to solve it.

The reason we contacted the customer is because there was quite a large number of packets we were dropping; I can easily find 100 real, smaller problems we have in the network immediately. The customer was /not/ wrong, the customer did the exact right thing. There are a lot of problems, and you can go deep into the rabbit hole trying to fix problems which are real but don't affect a sufficient amount of packets to have a meaningful impact on product quality.

--
++ytti
On Thu, Jul 8, 2021 at 5:04 PM William Herrin <bill@herrin.us> wrote:
On Thu, Jul 8, 2021 at 5:31 AM Saku Ytti <saku@ytti.fi> wrote:
Network experiences gray failures all the time, and I almost never care, unless a customer does.
Greetings,
I would suggest that your customer does care, but as there is no simple test to demonstrate gray failures, your customer rarely makes it past first tier support to bring the issue to your attention and gives up trying. Indeed, name the networks with the worst reputations around here and many of them have those reputations because of a routine, uncorrected state of gray failure.
To answer Laurent 's question:
Yes, gray failures are a regular problem. Yes, most of us care. And for the most part we don't have particularly good ways to detect and isolate the problems, let alone fix them.
Depending on the actual failure mode, and the architecture of the device itself, one technique is to run test traffic through the box/path/whatever while twiddling the source and destination ports, and sometimes the source IP as well. This sometimes helps find the issue if there is a bad interface in a LAG, or in a device which sprays packets/cells across an internal fabric, etc. If you are really lucky, you can convince the vendor to share how they spray/hash (or at least demonstrate a deterministic failure, and hopefully they can hash it and tell you which of the N fabric cards is broken).

Hopefully you noticed the number of weasel words in there...

W
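A rough sketch of that twiddling from a test host, assuming a reachable TCP test target (the address and port below are placeholders): if connections fail only from particular source ports, suspicion falls on whichever LAG member or fabric path those tuples hash onto.

```python
import socket

TARGET = ("192.0.2.10", 443)          # placeholder test target (host, port)
SOURCE_PORTS = range(33000, 33256)    # 256 different hash inputs, each used once

def probe(src_port: int) -> bool:
    """Return True if a TCP connection from this source port succeeds."""
    s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    s.settimeout(2.0)
    try:
        s.bind(("", src_port))
        s.connect(TARGET)
        return True
    except OSError:
        return False
    finally:
        s.close()

for sport in SOURCE_PORTS:
    if not probe(sport):
        print(f"possible gray path: src port {sport} -> {TARGET} failed")
```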
When it's not a clean failure we really are driven by: customer says blank is broken, often followed by grueling manual effort just to duplicate the problem within our view.
Can network researchers do anything about it? Maybe. Because of the end to end principle, only the endpoints understand the state of the connection and they don't know the difference capacity and error. They mostly process that information locally sharing only limited information with the other endpoint. Which means there's not much passing over the wire for the middle to examine and learn that there's a problem... and when there is it often takes correlating multiple packets to understand that a problem exists which, in the stateless middle with asymmetric routing, is not usable. The middle can only look at its immediate link stats which, when there's a bug, are misleading.
What would you change to dig us out of this hole?
Regards, Bill Herrin
-- William Herrin bill@herrin.us https://bill.herrin.us/
-- The computing scientist’s main challenge is not to get confused by the complexities of his own making. -- E. W. Dijkstra
On Thu, Jul 8, 2021 at 4:03 PM William Herrin <bill@herrin.us> wrote:
On Thu, Jul 8, 2021 at 5:31 AM Saku Ytti <saku@ytti.fi> wrote:
Network experiences gray failures all the time, and I almost never care, unless a customer does.
I would suggest that your customer does care, but as there is no simple test to demonstrate gray failures, your customer rarely makes it past first tier support to bring the issue to your attention and gives up trying. Indeed, name the networks with the worst reputations around here and many of them have those reputations because of a routine, uncorrected state of gray failure.
Networks originating/receiving the traffic tend to have more incentives to resolve these issues, which might not be so rare. If you have connection/application-level health metrics (e.g. TLS handshake failures, TCP retransmits), identifying that a problem exists is not too difficult. Having health metrics associated with network paths can greatly simplify repro. Then it's mostly troubleshooting datapath issues on your favorite platform.

It takes quite some effort to figure out/collect relevant metrics and present them in a usable way. Something like "connections from PoP A to destination ASN/prefix (via interface X) had their TLS handshake failure rate increase from 0.02% to 1%" is a good starting point for troubleshooting (it may or may not be a network issue, but the origin/receiver probably wants to fix it regardless).

Things can get more complicated when traffic crosses network boundaries with things you don't have visibility into (IX fabric, remote peering, another network's optical systems, complicated setups like stateful firewalls / MC-LAG).
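A minimal sketch of that kind of aggregation, assuming you already log per-connection records with the PoP, destination ASN, egress interface and handshake outcome (all field names, thresholds and data below are made up):

```python
from collections import defaultdict

BASELINE_FAILURE_RATE = 0.0002   # assumed long-term baseline of 0.02%
ALERT_FACTOR = 10                # flag paths that are 10x worse than baseline

def failure_rates(records):
    """records: iterable of dicts with keys pop, dst_asn, egress_if, tls_ok."""
    totals = defaultdict(lambda: [0, 0])          # key -> [attempts, failures]
    for r in records:
        key = (r["pop"], r["dst_asn"], r["egress_if"])
        totals[key][0] += 1
        totals[key][1] += 0 if r["tls_ok"] else 1
    return {k: fails / attempts for k, (attempts, fails) in totals.items() if attempts}

# Toy data: two connections from PoP A towards the same ASN via the same interface.
records = [
    {"pop": "A", "dst_asn": 64496, "egress_if": "et-0/0/1", "tls_ok": True},
    {"pop": "A", "dst_asn": 64496, "egress_if": "et-0/0/1", "tls_ok": False},
]
for key, rate in failure_rates(records).items():
    if rate > BASELINE_FAILURE_RATE * ALERT_FACTOR:
        print(f"{key}: handshake failure rate {rate:.2%} (baseline {BASELINE_FAILURE_RATE:.2%})")
```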
Hello,

There is a large eyeball ASN in Southern Europe, single-homed to a Tier 1 running under the same corporate umbrella, which for about a decade suffered from periodic blackholing of specific src/dst tuples. The issue occurred every 6 - 18 months, completely breaking specific production traffic *for multiple days* (think dead, mission-critical IPsec VPNs, for example). It was never acknowledged on the record; some say this was about stalled 100G cards. I believe at this point the HW was phased out, but this was one of the rather infuriating experiences ...

More generally speaking, single link overloads causing PL or even full blackholing affecting single links (and therefore, in a load-balanced environment, specific tuples) is something that is very frustrating to troubleshoot, and it happens quite a lot in the DFZ. It doesn't show on monitoring systems, and it is difficult to get past the first-level support in bigger networks, because load-balancing decisions and hashing are difficult concepts for the uninitiated and they will generally refuse to escalate issues they are unable to reproduce from their specific system (WORKSFORME).

At some point I had a router with an entire /24 configured on a loopback, just to ping destinations from the same device with different source IPs, to establish whether there is a load-balancing-induced issue with packet loss, latency, or full blackholing towards a particular destination.

Tooling (for troubleshooting), monitoring and education are lacking in this regard, unfortunately.

- lukas
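A host-side sketch of the same trick, assuming the source range below is actually configured on the box (both prefixes are documentation placeholders) and that iputils ping is available:

```python
import ipaddress
import re
import subprocess

DESTINATION = "198.51.100.1"                       # placeholder destination
SOURCES = ipaddress.ip_network("203.0.113.0/28")   # placeholder, must be configured locally
PINGS_PER_SOURCE = 5

for src in SOURCES.hosts():
    # iputils ping: -c count, -W per-reply timeout in seconds, -I source address
    result = subprocess.run(
        ["ping", "-c", str(PINGS_PER_SOURCE), "-W", "1", "-I", str(src), DESTINATION],
        capture_output=True, text=True,
    )
    # Pull the loss percentage out of the summary line; treat no output as total loss.
    match = re.search(r"(\d+(?:\.\d+)?)% packet loss", result.stdout)
    loss = float(match.group(1)) if match else 100.0
    if loss > 0:
        print(f"{src} -> {DESTINATION}: {loss}% loss")
```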
On Thu, 8 Jul 2021 at 19:25, Lukas Tribus <lukas@ltri.eu> wrote:
More generally speaking, single link overloads causing PL or even full blackholing affecting single links (and therefore in a load-balanced environment: specific tuples) is something that is very frustrating to troubleshoot and it happens quite a lot in the DFZ. It
Ask your vendor to implement RFC5837, so that in addition to the bundle interface having the L3 address, traceroute also returns the actual physical interface that received the packet. This would expedite troubleshooting issues where elephant flows congest specific links.

Juniper and Nokia support adaptive load balancing, dynamically adjusting the hash=>interface mapping table, to deal with elephant flows without congesting one link.

--
++ytti
participants (12)
- Baldur Norddahl
- Chriztoffer Hansen
- colin johnston
- Jörg Kost
- Lukas Tribus
- Mark Tinka
- Saku Ytti
- Tom Beecher
- Vanbever Laurent
- Warren Kumari
- William Herrin
- Yang Yu