At a previous company we had a large number of Foundry Networks layer-3 switches. They participated in our OSPF network and had a *really* annoying bug. Every now and then one of them would get somewhat confused and would corrupt its OSPF database (there seemed to be some pointer that would end up off by one).

It would then cleverly realize that its LSDB was different from everyone else's and so would flood this corrupt database to all other OSPF speakers. Some vendors did a better job of sanity checking the LSAs and would ignore the bad ones; other vendors would install them... and now you have different link-state databases on different devices, and OSPF becomes unhappy.
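
The kind of sanity check that saves you here is not complicated: a valid network mask has to be a contiguous run of one-bits followed by zero-bits. A minimal sketch of that check in Python (an illustration of the idea, not any vendor's actual implementation):

import ipaddress

def is_contiguous_mask(mask: str) -> bool:
    """Return True if the dotted-quad mask is a run of 1-bits followed by 0-bits."""
    bits = int(ipaddress.IPv4Address(mask))
    inverted = bits ^ 0xFFFFFFFF
    # A contiguous mask's complement is 2**n - 1 for some n,
    # so ANDing it with its successor gives zero.
    return (inverted & (inverted + 1)) == 0

print(is_contiguous_mask("255.255.255.0"))  # True  -- a plausible LSA mask
print(is_contiguous_mask("10.160.8.0"))     # False -- garbage like the masks below

The routers that produced the log messages below were clearly applying some sanity check along these lines -- they flagged the bad mask and refused to install the route: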

Nov 24 22:23:53.633 EST: %OSPF-4-BADLSAMASK: Bad LSA mask: Type 5, LSID 0.9.32.5 
Mask 10.160.8.0 from 10.178.255.252 
NOTE: This route will not be installed in the routing table.
Nov 26 11:01:32.997 EST: %OSPF-4-BADLSAMASK: Bad LSA mask: Type 5, LSID 0.4.2.3
Mask 10.2.153.0 from 10.178.255.252 
NOTE: This route will not be installed in the routing table.
Nov 27 23:14:00.660 EST: %OSPF-4-BADLSAMASK: Bad LSA mask: Type 5, LSID 0.4.2.3
Mask 10.2.153.0 from 10.178.255.252 
NOTE: This route will not be installed in the routing table.


If you look at the output, you can see that there is some garbage in the LSID field, and the bits that should be there are now in the Mask field. I also saw more extreme versions of the same bug; in my favorite example the mask was 115.104.111.119, and further down there was 105.110.116.114 -- if you take these as decimal numbers and look up their ASCII values, you get "show" and "inte". I wrote a tool to scrape bits from these errors and ended up with a large amount of the CLI help text.
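
A minimal sketch of how that kind of scraping can work, for the curious: pull every dotted quad out of the log text and keep the ones whose octets are all printable ASCII. The sample line and helper below are illustrative, not the original tool:

import re

def scrape_ascii(log_text: str):
    """Yield (quad, decoded) for dotted quads whose octets are all printable ASCII."""
    for quad in re.findall(r"\b\d{1,3}(?:\.\d{1,3}){3}\b", log_text):
        octets = [int(o) for o in quad.split(".")]
        if all(32 <= o < 127 for o in octets):
            yield quad, "".join(chr(o) for o in octets)

sample = ("%OSPF-4-BADLSAMASK: Bad LSA mask: Type 5, "
          "LSID 0.9.32.5 Mask 115.104.111.119 from 10.178.255.252")

for quad, text in scrape_ascii(sample):
    print(quad, "->", text)   # 115.104.111.119 -> show

Legitimate addresses and masks mostly contain octets below 32 or above 126, so the filter throws them away and keeps the leaked text.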




Many years ago I worked for a small mom-and-pop ISP in New York state (I was the only network/technical person there) -- it was a very freewheeling place, and I built the network by doing whatever made sense at the time.

One of my "favorite" customers (Joe somebody) was somehow related to the owner of the ISP and was a gamer. This was back in the day when the gaming magazines would give you useful tips like "Type 'tracert $gameserver' and make sure that there are less than N hops". Joe would call up tech support, me, the owner, etc., and complain that there were N+3 hops and that most of them were in our network. I spent much time explaining packet loss, latency, etc., but couldn't shake his belief that hop count was the only metric that mattered.

Finally, one night he called me at home well after midnight (no, I didn't give him my home phone number; he looked me up in the phonebook!) to complain that his gaming was suffering because it was "too many hops to get out of your network". I finally snapped and built a static GRE tunnel from the RAS box he connected to -- it was a thing of beauty: it wound through almost every device we owned and took the most convoluted path I could come up with. "Yay!", I figured, "now I can demonstrate that latency is more important than hop count," and I went to bed.

The next morning I get a call from him. He is ecstatic and wildly impressed by how well the network is working for him now and how great his gaming performance is. "Oh well", I think, "at least he is happy and will leave me alone now". I don't document the purpose of this GRE anywhere and after some time forget about it.

A few months later I am doing some routine cleanup work and stumble across a weird-looking tunnel -- it's bizarre: it goes all over the place and is all kinds of crufty -- there are static routes, policy routing, and strange things being done on the RADIUS server to make sure some user always gets a certain IP... I look through my pile of notes and old configs, find nothing, and decide to just yank it out.

That night I get an enraged call (at home again) from Joe *screaming* that the network is all broken again because there are now way too many hops to get out of the network and people keep shooting him...

What I learnt from this:

1: Make sure you document everything (and no, the network itself isn't documentation).
2: Gamers are weird.
3: Making changes to your network in anger provides short-term pleasure but long-term pain.



On Fri, Feb 19, 2021 at 1:10 PM Andrew Gallo <akg1330@gmail.com> wrote:


On 2/16/2021 2:37 PM, John Kristoff wrote:
> Friends,
>
> I'd like to start a thread about the most famous and widespread Internet
> operational issues, outages or implementation incompatibilities you
> have seen.
>
> Which examples would make up your top three?


I don't believe I've seen this in any of the replies, but the AT&T
cascading switch crashes of 1990 is a good one.  This link even has some
pseudocode
https://users.csc.calpoly.edu/~jdalbey/SWE/Papers/att_collapse



--
The computing scientist’s main challenge is not to get confused by the
complexities of his own making.
  -- E. W. Dijkstra