Hi Mark,
I know it's annoying that I won't mention specifics.
Unfortunately, the last time I mentioned $vendor-specific
information on NANOG, it was picked up by the press, and
turned into a multimillion dollar kerfuffle with me at the
center of the cross-hairs:
After that, I've learned it's best to not name specific
very-big-name vendors on NANOG posts.
What I *can* say is that this was one of the primary
vendors in the Internet backbone space, running mainstream
code.
The only reason it didn't affect more networks was a
function of the particular cluster of signalling communities
being applied to all inbound prefixes, and how they
interacted with the vendor's hash algorithm.
Corner cases, while valid, do not speak to the
majority. If this was a major issue, there would have been
more noise about it by now.
I prefer to look at it the other way: the reason you
didn't hear more noise about it is that we stubbed our toes
on it early, and had relatively fast, direct access to the
development engineers to get it fixed within two days. It's
precisely *because* people trip over corner cases and get
them fixed that they don't end up causing more widespread
pain across the rest of the Internet.
There has been quite some noise about lengthy AS_PATH
updates that bring some routers down, which has usually
been fixed with improved BGP code. But even those are not
too common, if one considers a 365-day period.
Oh, absolutely. Bugs in implementations that either
crash the router or reset the BGP session are much more
immediately visible than "that's odd, it's taking my routers
longer to converge than it should".
How many networks actually track their convergence time
in a time series database, and look at unusual trends, and
then diagnose why the convergence time is increasing, versus
how many networks just note an increasing number of "hey,
your network seems to be slowing down" and throw more
hardware at the problem, while grumbling about why their big
expensive routers seem to be less powerful than a *nix box
running gated?
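A minimal sketch of that kind of trend check, in Python, purely as an
illustration. It assumes convergence times (in seconds) have already been
pulled out of whatever time series database is in use; the function name,
window sizes, threshold ratio, and sample values below are all hypothetical
and not taken from any particular network or vendor.

```python
from statistics import median


def flag_convergence_drift(samples, baseline_n=5, recent_n=3, ratio=1.5):
    """Return True if recent convergence times have drifted upward.

    samples    -- convergence times in seconds, oldest first
    baseline_n -- how many of the oldest samples form the baseline
    recent_n   -- how many of the newest samples are compared against it
    ratio      -- how much slower "recent" must be before we flag it
    """
    if len(samples) < baseline_n + recent_n:
        return False  # not enough history to judge either way
    baseline = median(samples[:baseline_n])
    recent = median(samples[-recent_n:])
    return recent > ratio * baseline


# Hypothetical measurements, e.g. one full-table convergence per week.
history = [42.0, 40.5, 43.1, 41.8, 42.6, 44.0, 61.2, 66.9, 70.3]

if flag_convergence_drift(history):
    print("convergence time is trending up; worth diagnosing")
```

Medians are used rather than means so that a single slow convergence
event does not trigger the flag on its own; a real deployment would
likely want a sliding baseline rather than the fixed oldest-N window
used here.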
I suspect there are more of these types of "corner cases"
out there than you recognize.
It's just that most networks don't dig into routing
performance issues unless it actually breaks the router, or
kills BGP adjacencies.
If you *are* one of the few networks that tracks your
router's convergence time over time, and identifies and
resolves unexpected increases in convergence time, then yes,
you absolutely have standing to tell me to pipe down and go
back into my corner again. ;D