RE: BGP and The zero window edge

21 Apr 2021

I'd like to get some data on what actually happened
in the real cases and analyze it.

If it's a Cisco router at fault, then we have a bug to fix.
Even if it's not a Cisco, there may be ways we can help
to avoid the situation.
However, before we start on solutions, I'd like to get
a good understanding of what actually happened.

A TCP zero window is one possible cause, but many other things could
leave routes stuck too.

Anyone?

Regards,
Jakob.

-----Original Message-----
From: Job Snijders <job@fastly.com> 
Sent: Wednesday, April 21, 2021 2:11 PM
To: Jakob Heitz (jheitz) <jheitz@cisco.com>
Cc: nanog@nanog.org
Subject: Re: BGP and The zero window edge

Dear Jakob, group,

On Wed, Apr 21, 2021 at 08:59:06PM +0000, Jakob Heitz (jheitz) via NANOG wrote:
...
Ben's blog details an experiment in which he advertises routes and then
withdraws them, but some of them remain stuck for days.
I'd like to get to the bottom of this problem.
I think there are *two* problems:

1) some BGP implementations (or multi-node BGP configurations) sometimes
   end up getting stuck in one way or another.

2) other BGP nodes are not able to disconnect/reconnect to systems
   suffering from instantiations of problem #1.

While on the one hand it is important to follow up on each and every
instantiation of problem #1, I personally think it also is worthwhile
exploring whether the BGP FSM itself can be redefined in a way that
encourages BGP protocol implementations to be more robust and rely less
on the remote peer behaving correctly.

Once Problem #2 is addressed, finding and isolating instances of Problem
#1 will become much easier.
...
Has anyone else seen this before or can provide data to analyze?
On or off list.
...
From the BGP Default-Free Zone perspective, it is hard to differentiate
between an entire (multi-vendor) Autonomous System being stuck and just
one router being stuck.
To test individual router implementations, this tool is useful:
https://github.com/benjojo/bgp-zerowindow-test - but please keep in mind
that the "TCP Recv Wind == 0" trick is just one way to easily get a BGP peer
to manifest the problematic behavior.
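
As a rough illustration of the effect (a minimal Python sketch, not
taken from the linked tool): a local peer accepts a TCP connection and
then never reads, so its receive window closes and the sender can no
longer make progress. The function name, buffer sizes and stall timeout
are assumptions chosen for the demo.

```python
import select
import socket


def fill_until_stalled(stall_after=0.5, chunk_size=65536,
                       max_bytes=64 * 1024 * 1024):
    """Send to a loopback peer that never reads.

    Returns the number of bytes the local stack accepted before the
    socket stayed unwritable for `stall_after` seconds.
    All names and defaults here are illustrative assumptions.
    """
    listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    # Ask for a small receive buffer so the advertised window closes soon.
    listener.setsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF, 4096)
    listener.bind(("127.0.0.1", 0))
    listener.listen(1)

    sender = socket.create_connection(listener.getsockname())
    stuck_peer, _ = listener.accept()  # accepted, but recv() is never called
    sender.setblocking(False)

    chunk = b"\x00" * chunk_size
    sent = 0
    try:
        while sent < max_bytes:
            # Wait until the kernel will take more data, or give up.
            _, writable, _ = select.select([], [sender], [], stall_after)
            if not writable:
                break  # no send progress for stall_after seconds
            try:
                sent += sender.send(chunk)
            except BlockingIOError:
                continue  # buffer filled between select() and send()
    finally:
        sender.close()
        stuck_peer.close()
        listener.close()
    return sent
```

Calling fill_until_stalled() returns once the buffers on both sides are
full; the exact byte count depends on the operating system's buffer
sizing, so it is not meaningful beyond being finite.
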
...
From a BGP protocol perspective, BGP nodes shouldn't inspect the TCP
receive window, but rather focus on whether all locally available
signals indicate that the remote peer is still progressing data.
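
One way to picture such a locally available signal, as a sketch only:
track whether the local TCP socket is still accepting queued BGP output,
and treat the peer as stuck when nothing has been accepted for too long.
The class name, its methods and the 90-second limit used below are
hypothetical, not taken from any implementation or specification.

```python
from dataclasses import dataclass


@dataclass
class SendProgressWatchdog:
    """Hypothetical helper: flags a peer when queued output stops draining."""

    stall_limit: float          # seconds without progress before giving up
    pending: int = 0            # bytes queued but not yet accepted by the socket
    last_progress: float = 0.0  # time of the last accepted write

    def queued(self, nbytes: int, now: float) -> None:
        """Record that the BGP process has nbytes more to send."""
        if self.pending == 0:
            # The clock only runs while there is something to send.
            self.last_progress = now
        self.pending += nbytes

    def sent(self, nbytes: int, now: float) -> None:
        """Record that the local socket accepted nbytes of queued output."""
        if nbytes > 0:
            self.pending = max(0, self.pending - nbytes)
            self.last_progress = now

    def peer_stuck(self, now: float) -> bool:
        """True if output is waiting and none was accepted for stall_limit."""
        return self.pending > 0 and (now - self.last_progress) >= self.stall_limit
```

A session loop would call queued() when it generates updates, sent()
after each successful write, and tear the session down when peer_stuck()
returns True. Note that nothing here looks at the TCP receive window
itself, only at whether local writes keep succeeding.
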

Kind regards,

Job