Famous operational issues
Friends, I'd like to start a thread about the most famous and widespread Internet operational issues, outages or implementation incompatibilities you have seen. Which examples would make up your top three? To get things started, I'd suggest the AS 7007 event is perhaps the most notorious and likely to top many lists including mine. So if that is one for you, I'm asking for just two more. I'm particularly interested in this as the first step in developing a future NANOG session. I'd be particularly interested in any issues that also identify key individuals that might still be around and interested in participating in a retrospective. I already have someone who is willing to talk about AS 7007; it shouldn't be hard to guess who. Thanks in advance for your suggestions, John
On Tue, Feb 16, 2021 at 01:37:35PM -0600, John Kristoff wrote:
I'd like to start a thread about the most famous and widespread Internet operational issues, outages or implementation incompatibilities you have seen.
Which examples would make up your top three?
This was a fantastic outage, one could really feel the tremors into the far corners of the BGP default-free zone: https://labs.ripe.net/Members/erik/ripe-ncc-and-duke-university-bgp-experime... The experiment triggered a bug in some Cisco router models: affected Ciscos would corrupt this specific BGP announcement ** ON OUTBOUND **. Any peers of such Ciscos receiving this BGP update would (according to then-current RFCs) consider the BGP UPDATE corrupted, and would subsequently tear down the BGP sessions with the Ciscos. Because the corruption was not detected by the Ciscos themselves, whenever the sessions would come back online again they'd reannounce the corrupted update, causing a session tear down. Bounce ... Bounce ... Bounce ... at global scale in both IBGP and EBGP! :-) Luckily the industry took these, and many other, lessons to heart: in 2015 the IETF published RFC 7606 ("Revised Error Handling for BGP UPDATE Messages") which specifies far more robust behaviour for BGP speakers. Kind regards, Job
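To make the change concrete: before RFC 7606, a malformed UPDATE meant a NOTIFICATION and a full session reset; with RFC 7606's treat-as-withdraw, the session stays up and only the routes carried in the bad UPDATE are dropped (a reset is still required when the NLRI themselves cannot be parsed). A rough Python sketch of the two behaviours follows; it is illustrative only, not a real BGP implementation, and the session/update objects and their methods are made up.

# Illustrative sketch only; "session" and "update" are hypothetical objects.

def handle_update_pre_rfc7606(session, update):
    if update.is_malformed():
        # Old behaviour: send a NOTIFICATION and reset the session.
        # If the sender keeps re-announcing the corrupted UPDATE,
        # the session flaps forever (the "bounce" described above).
        session.send_notification("UPDATE Message Error")
        session.close()
        return
    session.rib_in.install(update.nlri, update.attributes)

def handle_update_rfc7606(session, update):
    if update.is_malformed() and update.nlri_parseable():
        # Treat-as-withdraw: keep the session up, withdraw the
        # prefixes carried in the malformed UPDATE, and log it.
        for prefix in update.nlri:
            session.rib_in.withdraw(prefix)
        session.log("treat-as-withdraw applied to malformed UPDATE")
        return
    if update.is_malformed():
        # NLRI could not be determined: a session reset is still the
        # only safe option, as before.
        session.send_notification("UPDATE Message Error")
        session.close()
        return
    session.rib_in.install(update.nlri, update.attributes)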
There was the outage in 2014 when we got to 512K routes. http://www.bgpmon.net/what-caused-todays-internet-hiccup/ On 2/16/21, 1:04 PM, "NANOG on behalf of Job Snijders via NANOG" <nanog-bounces+rich.compton=charter.com@nanog.org on behalf of nanog@nanog.org> wrote: [snip]
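The broader lesson from the 512K day was to watch hardware FIB headroom, not just RIB size. A minimal sketch of that kind of check follows; the 512,000 limit, the 90% threshold, and the route-count input are assumptions for illustration, not any vendor's actual defaults or API.

# Minimal FIB headroom check; limit, threshold and route count are made up.
FIB_LIMIT = 512_000      # e.g. a default hardware FIB/TCAM allocation
WARN_AT = 0.90           # start planning well before it overflows

def check_fib_headroom(current_routes, limit=FIB_LIMIT, warn_at=WARN_AT):
    utilization = current_routes / limit
    if utilization >= 1.0:
        return "CRITICAL: FIB full; excess routes punted to software or dropped"
    if utilization >= warn_at:
        return f"WARNING: FIB at {utilization:.0%}, raise the limit or filter"
    return f"OK: FIB at {utilization:.0%}"

print(check_fib_headroom(503_000))   # WARNING: FIB at 98%, raise the limit or filter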
On 16/02/2021 22:51, Compton, Rich A wrote:
There was the outage in 2014 when we got to 512K routes. http://www.bgpmon.net/what-caused-todays-internet-hiccup/
There was a similar issue in 1998/9 or so when we got to 64K routes, which broke the routing table index (which defaulted to a uint16_t) on any FreeBSD box doing BGP. Fortunately a quick kernel recompile with the type changed to uint32_t fixed that. Ray
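In case anyone has not hit that class of bug: the failure mode is plain unsigned integer wrap-around. A tiny Python sketch of the arithmetic (hypothetical illustration, not the actual FreeBSD kernel code):

# Emulate a fixed-width unsigned index: a 16-bit counter silently wraps
# once the table passes 65,535 entries, so new routes collide with
# earlier slots; a 32-bit index does not.
def add_routes(n_routes, index_bits):
    mask = (1 << index_bits) - 1        # 16 bits -> 0xFFFF, 32 bits -> 0xFFFFFFFF
    index = 0
    for _ in range(n_routes):
        index = (index + 1) & mask      # fixed-width unsigned arithmetic
    return index

print(add_routes(70_000, 16))   # 4464: wrapped around, index is now bogus
print(add_routes(70_000, 32))   # 70000: the uint32_t recompile fix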
Le mar. 16 févr. 2021 à 21:03, Job Snijders via NANOG <nanog@nanog.org> a écrit :
https://labs.ripe.net/Members/erik/ripe-ncc-and-duke-university-bgp-experime...
The experiment triggered a bug in some Cisco router models: affected Ciscos would corrupt this specific BGP announcement ** ON OUTBOUND **. Any peers of such Ciscos receiving this BGP update, would (according to then current RFCs) consider the BGP UPDATE corrupted, and would subsequently tear down the BGP sessions with the Ciscos. Because the corruption was not detected by the Ciscos themselves, whenever the sessions would come back online again they'd reannounce the corrupted update, causing a session tear down. Bounce ... Bounce ... Bounce ... at global scale in both IBGP and EBGP! :-)
In a similar fashion, a network I know had a massive outage when a failing linecard corrupted IS-IS LSPs, triggering a flood of purges and taking down the whole backbone. This was pre-RFC 6232, so you can guess that resolving the issue was a real PITA. This kind of outage fuels my netops nightmares.
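For anyone who has not chased one of these: RFC 6232 adds a Purge Originator Identification TLV to IS-IS purges, so a purge storm can be traced back to the box that generated it. A rough sketch of the idea follows; the data structures are hypothetical, not a real IS-IS implementation.

# A purge is an LSP flooded with remaining lifetime 0 and no body.
# With RFC 6232, the purge also carries a POI TLV naming the system that
# generated it (and, if relayed, who it was received from), so you can
# find the box emitting bogus purges instead of guessing.
def purge_lsp(lsp, my_system_id, received_from=None, rfc6232=True):
    lsp.remaining_lifetime = 0
    lsp.body = b""
    if rfc6232:
        lsp.tlvs["POI"] = {
            "purge_originator": my_system_id,
            "received_from": received_from,
        }
    return lsp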
On Tue, 16 Feb 2021, John Kristoff wrote:
Friends,
I'd like to start a thread about the most famous and widespread Internet operational issues, outages or implementation incompatibilities you have seen.
Which examples would make up your top three?
https://blogs.oracle.com/internetintelligence/longer-is-not-always-better -- Mikael Abrahamsson email: swmike@swm.pp.se
On Tue, 16 Feb 2021, John Kristoff wrote:
Friends,
I'd like to start a thread about the most famous and widespread Internet operational issues, outages or implementation incompatibilities you have seen.
When Boston University joined the internet proper ca 1984 I was in charge of that group. We accidentally* submitted an initial HOSTS.TXT file which included some internally used one-character host names (A, B, C) and one which began with a digit (3B, an AT&T 3B5), both illegal for HOSTS.TXT back then. This put the BSD Unix program which converted from HOSTS.TXT to Unix' /etc/hosts format into an infinite loop filling /tmp which in those days crashed Unix and it often couldn't reboot successfully without manual intervention. On many, many hosts across the internet. I hesitate to guess a number since scale has changed so much but some of the more heated email claimed it brought down at least half the internet by some count. It was worsened by the fact that many hosts pulled and processed a new HOSTS.TXT file via cron (time-based job scheduler) at midnight so no one was around to fix and reboot systems. The thread on the TCP-IP mailing list was: BU JOINS THE INTERNET! It was a little embarrassing. Today it probably would have landed me in Gitmo. * There were two versions, the one we used internally, and the one to be submitted which removed those host names. The wrong one got submitted. -- -Barry Shein Software Tool & Die | bzs@TheWorld.com | http://www.TheWorld.com Purveyors to the Trade | Voice: +1 617-STD-WRLD | 800-THE-WRLD The World: Since 1989 | A Public Information Utility | *oo*
Hi, I don't want to classify and rate it, but would name 9/11. You can read about the impacts on the list archives and there is also a presentation from NANOG '23 online. Regards Jörg On 16 Feb 2021, at 20:37, John Kristoff wrote:
Friends,
I'd like to start a thread about the most famous and widespread Internet operational issues, outages or implementation incompatibilities you have seen.
Which examples would make up your top three?
On Tue Feb 16, 2021 at 09:33:20PM +0100, Jörg Kost wrote:
I don't want to classify and rate it, but would name 9/11.
You can read about the impacts on the list archives and there is also a presentation from NANOG '23 online.
For an operational perspective, I was part of the team trying to keep the BBC website up and running through 9/11... http://www.slimey.org/bbc_ticket_10083.txt Simon
actually, the 129/8 incident was as damaging as 7007, but folk tend not to remember it; maybe because it was a bit embarrassing

and the baltimore tunnel is a gift that gave a few times

and the quake/mudslides off taiwan

the tohoku quake was also fun, in some sense of the word

but the list of really damaging wet glass cuts is long
https://en.wikipedia.org/wiki/SQL_Slammer was interesting in that it was an application-layer issue that affected the network layer. Damian On Tue, Feb 16, 2021 at 11:37 AM John Kristoff <jtk@dataplane.org> wrote:
Friends,
I'd like to start a thread about the most famous and widespread Internet operational issues, outages or implementation incompatibilities you have seen.
Which examples would make up your top three?
To get things started, I'd suggest the AS 7007 event is perhaps the most notorious and likely to top many lists including mine. So if that is one for you I'm asking for just two more.
I'm particularly interested in this as the first step in developing a future NANOG session. I'd be particularly interested in any issues that also identify key individuals that might still be around and interested in participating in a retrospective. I already have someone that is willing to talk about AS 7007, which shouldn't be hard to guess who.
Thanks in advance for your suggestions,
John
There are all the hilarious leaks and blocks. Pakistan blocks YouTube and the announcement leaks internet-wide. Türk Telekom (AS9121 IIRC) leaks a full table out one of their providers. So many routing-level incidents they're probably not even interesting any more, I suppose. The huge power outages in the US northeast in 2003 ( https://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.183.998&rep=rep1&type=pdf) were pretty decent. On Tue, Feb 16, 2021 at 4:02 PM Damian Menscher via NANOG <nanog@nanog.org> wrote:
https://en.wikipedia.org/wiki/SQL_Slammer was interesting in that it was an application-layer issue that affected the network layer.
Damian
On Tue, Feb 16, 2021 at 11:37 AM John Kristoff <jtk@dataplane.org> wrote:
Friends,
I'd like to start a thread about the most famous and widespread Internet operational issues, outages or implementation incompatibilities you have seen.
Which examples would make up your top three?
To get things started, I'd suggest the AS 7007 event is perhaps the most notorious and likely to top many lists including mine. So if that is one for you I'm asking for just two more.
I'm particularly interested in this as the first step in developing a future NANOG session. I'd be particularly interested in any issues that also identify key individuals that might still be around and interested in participating in a retrospective. I already have someone that is willing to talk about AS 7007, which shouldn't be hard to guess who.
Thanks in advance for your suggestions,
John
Since you said operational issues, instead of just outage... How about MCI Worldcom's 10-day operational disaster in 1999. http://www.cnn.com/TECH/computing/9908/23/network.nono.idg/ How not to handle a network outage [...] MCI WorldCom issued an alert to its sales force, which was given the option to deliver a notice to customers by e-mail, hand delivery or telephone – or not at all. After a deafening silence from company executives on the 10-day network outage, MCI WorldCom CEO Bernie Ebbers finally took the podium to discuss the situation. How did he explain the failure, and reassure customers that the network would not suffer such a failure in the future? He didn't. Instead, he blamed Lucent. [...]
Oh well, MCI in 1999 was all about… https://www.youtube.com/watch?v=7iM5nFNUG4U On 16 Feb 2021, at 22:28, Sean Donelan wrote:
Since you said operational issues, instead of just outage...
How about MCI Worldcom's 10-day operational disaster in 1999.
http://www.cnn.com/TECH/computing/9908/23/network.nono.idg/ How not to handle a network outage
[...] MCI WorldCom issued an alert to its sales force, which was given the option to deliver a notice to customers by e-mail, hand delivery or telephone – or not at all. After a deafening silence from company executives on the 10-day network outage, MCI WorldCom CEO Bernie Ebbers finally took the podium to discuss the situation. How did he explain the failure, and reassure customers that the network would not suffer such a failure in the future? He didn't. Instead, he blamed Lucent. [...]
That was the one with the most severe impact for my company. Seven Frame circuits (UUNET), and we all saw what an update can do. On 2/16/21 3:28 PM, Sean Donelan wrote:
Since you said operational issues, instead of just outage...
How about MCI Worldcom's 10-day operational disaster in 1999.
http://www.cnn.com/TECH/computing/9908/23/network.nono.idg/ How not to handle a network outage
[...] MCI WorldCom issued an alert to its sales force, which was given the option to deliver a notice to customers by e-mail, hand delivery or telephone – or not at all. After a deafening silence from company executives on the 10-day network outage, MCI WorldCom CEO Bernie Ebbers finally took the podium to discuss the situation. How did he explain the failure, and reassure customers that the network would not suffer such a failure in the future? He didn't. Instead, he blamed Lucent. [...]
Would this also extend to intentional actions that may have had unintended consequences, such as provider A intentionally de-peering provider B, or the monopoly telco for $country cutting itself off from the rest of the global Internet for various reasons (technical, political, or otherwise)? That said, I'd still have to stick with AS7007, the Baltimore tunnel fire, and 9/11 as the most prominent examples of widespread issues/outages and how those issues were addressed. Honorable mention: $vendor BGP bugs, either due to $vendor ignoring the relevant RFCs, implementing them incorrectly, or an outage exposed a design flaw that the RFCs didn't catch. Too many of those to list here :) jms On Tue, Feb 16, 2021 at 2:37 PM John Kristoff <jtk@dataplane.org> wrote:
Friends,
I'd like to start a thread about the most famous and widespread Internet operational issues, outages or implementation incompatibilities you have seen.
Which examples would make up your top three?
To get things started, I'd suggest the AS 7007 event is perhaps the most notorious and likely to top many lists including mine. So if that is one for you I'm asking for just two more.
I'm particularly interested in this as the first step in developing a future NANOG session. I'd be particularly interested in any issues that also identify key individuals that might still be around and interested in participating in a retrospective. I already have someone that is willing to talk about AS 7007, which shouldn't be hard to guess who.
Thanks in advance for your suggestions,
John
I was thinking about how we need a war stories nanog track. My favorite was being on call when the router was stolen. Sent from my TI-99/4a
On Feb 16, 2021, at 2:40 PM, John Kristoff <jtk@dataplane.org> wrote:
Friends,
I'd like to start a thread about the most famous and widespread Internet operational issues, outages or implementation incompatibilities you have seen.
Which examples would make up your top three?
To get things started, I'd suggest the AS 7007 event is perhaps the most notorious and likely to top many lists including mine. So if that is one for you I'm asking for just two more.
I'm particularly interested in this as the first step in developing a future NANOG session. I'd be particularly interested in any issues that also identify key individuals that might still be around and interested in participating in a retrospective. I already have someone that is willing to talk about AS 7007, which shouldn't be hard to guess who.
Thanks in advance for your suggestions,
John
----- On Feb 16, 2021, at 2:08 PM, Jared Mauch jared@puck.nether.net wrote: Hi,
I was thinking about how we need a war stories nanog track. My favorite was being on call when the router was stolen.
Wait... what? I would love to listen to that call between you and your manager. But, here is one for you then. I was once called to a POP where one of our main routers was down. Due to political reasons, my access had been revoked. My manager told me to do whatever I needed to do to fix the problem; he would cover my behind. I did, and I "gently" removed the door. My manager kept his word. Another interesting one: entering a POP to find it flooded. Luckily there were raised floors with only fiber underneath the floor panels. The NOC ignored the warnings because "it was impossible for water to enter the building as it was not raining". Yeah, but water pipes do burst from time to time. But my favorite was pressing an undocumented combination of keys on a fire alarm system which set off the Inergen protection without warning, immediately. The noise and pressure of all that air entering the datacenter space with me still in it is something I will never forget. Similar to the response of my manager who, instead of asking me if I was ok, decided to try and light a piece of paper. "Oh wow, it does work, I can't set anything on fire". All of this was, obviously, in the late 1990s and early 2000s. These days, things are -slightly- more professional. Thanks, Sabri
On Tue, 16 Feb 2021, Sabri Berisha wrote:
----- On Feb 16, 2021, at 2:08 PM, Jared Mauch jared@puck.nether.net wrote:
Hi,
I was thinking about how we need a war stories nanog track. My favorite was being on call when the router was stolen.
Wait... what? I would love to listen to that call between you and your manager.
But, here is one for you then. I was once called to a POP where one of our main routers was down. Due to political reasons, my access had been revoked. My manager told me to do whatever I needed to do to fix the problem, he would cover my behind. I did, and I "gently" removed the door. My manager held word.
This reminds me of one of the Sprint CO's we were colo'd in. Access to the CLEC colo area was via a back door through the Men's room! One weekend, I had to make the drive to that site to deal with an access server issue, and I found they'd locked the back door to the Men's room from the colo floor side, so no access. Using supplies I found inside the CO, I managed to open the locked door and get to our gear. That route, being our only access route, was probably some kind of violation. Not all of our techs were guys. While we never had a router stolen, we did have a flash card stolen from one of our routers in a WCOM colo facility (most customers in open relay racks). It was right after they'd upgraded the doors to the colo area from simplex locks to card access. I was pissed for quite some time that WCOM knew who was in there (due to the card access system), but refused to tell us. I figured it was probably one of their own people. ---------------------------------------------------------------------- Jon Lewis, MCP :) | I route StackPath, Sr. Neteng | therefore you are _________ http://www.lewis.org/~jlewis/pgp for PGP public key_________
jlewis> This reminds me of one of the Sprint CO's we were colo'd in. Ah, Sprint. Nothing like using your railroad to run phone lines... Our routers in San Jose colo were black from the soot of the trains. Fondly remember a major Sprint outage in the early 90s. All our data circuits in the southeast went down at once and there were major voice outages in the entire southeast. Turns out a storm caused a mudslide which in turn derailed a train carrying toxic waste, resulting in a wave of 6-10' of toxic mud taking out the Sprint voice POP for the whole southeast, because it was conveniently located right on said railroad tracks. We were a big enough customer that PLSC in Atlanta gave us the real story when we asked for an ETA on repair. They couldn't give us one immediately until the HAZMAT crew let them in. Turned out to be a total loss of all gear. They yanked every tech east of the Mississippi and a 7ESS was FedEx overnighted (stolen from some customer in the Middle East?) and they had to rebuild everything. Was down less than 10 days. Good times.
Ahh, war stories. I like the one where I got a wake-up call that our IRC server was on fire, together with the rest of the DC. Not that widespread, but we reached Slashdot. :) November 2002, University of Twente, The Netherlands. Some idiot wanted to be a hero. He deflated people's tires so he could help reinflate them. One morning he thought it would be a good idea to start a small fire and then extinguish it, so he would be the hero that stopped a fire. He failed and the building burned down. He got caught a few days later when he tried the same thing in a different building. Almost all of the IT was in that building, including the core network, uplinks to SURFnet (the Dutch educational network) and to the 2000 students living on the campus. Ironically a new DC was already being built, so that was ready for use a few weeks later. As we had quite a network for 2002 we hosted, for instance, security.debian.org. The students all had 100Mbit in their rooms, so some of them also hosted some popular websites. One I can remember was an image sharing site. Some students immediately created a backup network: DHCP server, DNS server with a catch-all, website explaining what was going on, IRC server, etc. A local ISP offered to sponsor 50Mbit for the residents, which was connected via a microwave relay, and a temporary fiber was run through a ditch to connect two parts of the campus residences. At the end of the day all 2000 students had their internet connection back, although all behind a single 50Mbit link. Syslog message from the local SURFnet router: lo0.ar5.enschede1.surf.net 3613: Nov 20 07:20:50.927 UTC: %ENV_MON-2-TEMP: Hotpoint temp sensor(slot 18) temperature has reached WARNING level at 61(C) (Disclaimer: Where I say we, I mean we as the University. I wasn't working for the university, but was part of the students working on the backup network. There are probably some other people on the list with more details and I've probably missed some, but this is the summary.) On 16-02-2021 23:08, Jared Mauch wrote:
I was thinking about how we need a war stories nanog track. My favorite was being on call when the router was stolen.
Sent from my TI-99/4a
On Feb 16, 2021, at 2:40 PM, John Kristoff <jtk@dataplane.org> wrote:
Friends,
I'd like to start a thread about the most famous and widespread Internet operational issues, outages or implementation incompatibilities you have seen.
Which examples would make up your top three?
To get things started, I'd suggest the AS 7007 event is perhaps the most notorious and likely to top many lists including mine. So if that is one for you I'm asking for just two more.
I'm particularly interested in this as the first step in developing a future NANOG session. I'd be particularly interested in any issues that also identify key individuals that might still be around and interested in participating in a retrospective. I already have someone that is willing to talk about AS 7007, which shouldn't be hard to guess who.
Thanks in advance for your suggestions,
John
Stolen isn’t nearly as exciting as what happens when your (used) 6509 arrives and gets installed and operational before anyone realizes that the conductive packing peanuts that it was packed in have managed to work their way into various midplane connectors. Several hours later someone notices that the box is quite literally smoldering in the colo and the resulting combination of panic, fire drill, and management antics that ensue. Owen
On Feb 16, 2021, at 2:08 PM, Jared Mauch <jared@puck.nether.net> wrote:
I was thinking about how we need a war stories nanog track. My favorite was being on call when the router was stolen.
Sent from my TI-99/4a
On Feb 16, 2021, at 2:40 PM, John Kristoff <jtk@dataplane.org> wrote:
Friends,
I'd like to start a thread about the most famous and widespread Internet operational issues, outages or implementation incompatibilities you have seen.
Which examples would make up your top three?
To get things started, I'd suggest the AS 7007 event is perhaps the most notorious and likely to top many lists including mine. So if that is one for you I'm asking for just two more.
I'm particularly interested in this as the first step in developing a future NANOG session. I'd be particularly interested in any issues that also identify key individuals that might still be around and interested in participating in a retrospective. I already have someone that is willing to talk about AS 7007, which shouldn't be hard to guess who.
Thanks in advance for your suggestions,
John
On that note, I'd be very interested in hearing stories of actual incidents that are the cause of why cardboard boxes are banned in many facilities, due to loose particulate matter getting into the air and setting off very sensitive fire detection systems. Or maybe it's more mundane and 99% of the reason is people unpack stuff and don't always clean up properly after themselves. On Wed, Feb 17, 2021, 6:21 PM Owen DeLong <owen@delong.com> wrote:
Stolen isn’t nearly as exciting as what happens when your (used) 6509 arrives and gets installed and operational before anyone realizes that the conductive packing peanuts that it was packed in have managed to work their way into various midplane connectors. Several hours later someone notices that the box is quite literally smoldering in the colo and the resulting combination of panic, fire drill, and management antics that ensue.
Owen
On Feb 16, 2021, at 2:08 PM, Jared Mauch <jared@puck.nether.net> wrote:
I was thinking about how we need a war stories nanog track. My favorite was being on call when the router was stolen.
Sent from my TI-99/4a
On Feb 16, 2021, at 2:40 PM, John Kristoff <jtk@dataplane.org> wrote:
Friends,
I'd like to start a thread about the most famous and widespread Internet operational issues, outages or implementation incompatibilities you have seen.
Which examples would make up your top three?
To get things started, I'd suggest the AS 7007 event is perhaps the most notorious and likely to top many lists including mine. So if that is one for you I'm asking for just two more.
I'm particularly interested in this as the first step in developing a future NANOG session. I'd be particularly interested in any issues that also identify key individuals that might still be around and interested in participating in a retrospective. I already have someone that is willing to talk about AS 7007, which shouldn't be hard to guess who.
Thanks in advance for your suggestions,
John
On Thu, Feb 18, 2021 at 01:07:01AM -0800, Eric Kuhnke wrote:
On that note, I'd be very interested in hearing stories of actual incidents that are the cause of why cardboard boxes are banned in many facilities, due to loose particulate matter getting into the air and setting off very sensitive fire detection systems.
Or maybe it's more mundane and 99% of the reason is people unpack stuff and don't always clean up properly after themselves.
We had a plastic bag sucked into the intake of a router in a datacenter once that caused it to overheat and take the site down. We had cameras in our cage and I remember seeing the photo from the site of the colo (I'll protect their name just because) taken as the tech was on the phone and pulled the bag out of the router. The time from the thermal warning syslog that it's getting warm to overheat and shutdown is short enough that you can't really get a tech to the cage in time to prevent it. I assume also the latter above, which is that people have varying definitions of clean. - Jared -- Jared Mauch | pgp key available via finger from jared@puck.nether.net clue++; | http://puck.nether.net/~jared/ My statements are only mine.
On Thu, Feb 18, 2021 at 8:31 AM Jared Mauch <jared@puck.nether.net> wrote:
On Thu, Feb 18, 2021 at 01:07:01AM -0800, Eric Kuhnke wrote:
On that note, I'd be very interested in hearing stories of actual incidents that are the cause of why cardboard boxes are banned in many facilities, due to loose particulate matter getting into the air and setting off very sensitive fire detection systems.
Or maybe it's more mundane and 99% of the reason is people unpack stuff and don't always clean up properly after themselves.
We had a plastic bag sucked into the intake of a router in a datacenter once that caused it to overheat and take the site down. We had cameras in our cage and I remember seeing the photo from the site of the colo (I'll protect their name just because) taken as the tech was on the phone and pulled the bag out of the router.
The time from the thermal warning syslog that it's getting warm to overheat and shutdown is short enough you can't really get a tech to the cage in time to prevent it.
1: A previous employer was a large customer of a (now defunct) L3 switch vendor. The AC power inputs were along the bottom of the power supply, and the big aluminium heatsinks in the power supplies were just above the AC socket. Anyway, the subcontractor who made the power supplies for the vendor realized that they could save a few cents by not installing the little metal clip that held the heatsink to the MOSFET, and instead relying on the thermal adhesive to hold it... This worked fine, until a certain number of hours had passed, at which point the goop would dry out and the heatsink would fall down, directly across the AC socket.... This would A: trip the circuit that this was on, but, more excitingly, set the aluminum on fire, which would then ignite the other heatsinks in the PSU, leading to much fire... 2: A somewhat similar thing would happen with the Ascend TNT Max, which had side-to-side airflow. These were dial termination boxes, and so people would install racks and racks of them. The first one would draw in cool air on the left, heat it up and ship it out the right. The next one over would draw in warm air on the left, heat it up further, and ship it out the right... Somewhere there is a fairly famous photo of a rack of TNT Maxes, with the final one literally on fire, and still passing packets. There is a related (and probably apocryphal) story regarding the launch of the TNT. It was being shipped for a major trade-show, but got stuck in customs. After many bizarre calls with the customs folk, someone goes to the customs office to try and sort it out, and gets greeted by customs agents with guns. They all walk into the warehouse, and discover that there is a large empty area around the crate, which is a wooden cube, with "TNT" stencilled in big red letters... 3: I used to work for a small ISP in Yonkers, NY. We had a customer in Florida, and on a Friday morning their site goes down. We (of course) have not paid for Cisco 4 hour support (or, honestly, any support) and they have a strict SLA, so we are a little stuck. We end up driving to JFK, and lugging a fully loaded Cisco 7507 to the check in counter. It was just before the last flight of the day, so we shrugged and said it was my checked bag. The excess baggage charges were eye-watering, but it rode the conveyor belt with the rest of the luggage onto the plane. It arrived with just a bent ejector handle, and the rest was fine. 4: Not too long after I started doing networking (and for the same small ISP in Yonkers), I'm flying off to install a new customer. I (of course) think that I'm hot stuff because I'm going to do the install, configure the router, whee, look at me! Anyway, I don't want to check a bag, and so I stuff the Cisco 2501 in a carryon bag, along with tools, etc (this was all pre-9/11!). I'm going through security and the TSA[0] person opens my bag and pulls the router out. "What's this?!" he asks. I politely tell him that it's a router. He says it's not. I'm still thinking that I'm the new hotness, and so I tell him in a somewhat condescending way that it is, and I know what I'm talking about. He tells me that it's not a router, and is starting to get annoyed. I explain using my "talking to a 5 year old" voice that it most certainly is a router. He tells me that lying to airport security is a federal offense, and starts looming at me. I adjust my attitude and start explaining that it's like a computer and makes the Internet work. He gruffly hands me back the router, I put it in my bag and scurry away.
As I do so, I hear him telling his colleague that it wasn't a router, and that he certainly knows what a router is, because he does woodwork... 5: Another one. In the early 2000s I was working for a dot-com boom company. We are building out our first datacenter, and I'm installing a pair of Cisco 7206s in 811 10th Ave. These will run basically the entire company: we have some transit, we have some peering to configure, we have an AS, etc. I'm going to be configuring all of this; clearly I'm a router-god... Anyway, while I'm getting things configured, this janitor comes past, wheeling a garbage bin. He stops outside the cage and says "Whatcha doin'?". I go into this long explanation of how these "routers" <point> will connect to "the Internet" <wave hands in a big circle> to allow my "servers" <gesture at big black boxes with blinking lights> to talk to other "computers" <typing motion> on "the Internet" <again with the waving of the hands>. He pauses for a second, and says "'K. So, you doing a full iBGP mesh, or confeds?". I really hadn't intended to be a condescending ass, but I think of that every time I realize I might be assuming something about someone based on their attire/job/etc. W [0]: Well, technically pre-TSA, but I cannot remember what we used to call airport security pre-TSA...
I assume also the latter above, which is people have varying definitons of clean.
- Jared
-- Jared Mauch | pgp key available via finger from jared@puck.nether.net clue++; | http://puck.nether.net/~jared/ My statements are only mine.
-- The computing scientist’s main challenge is not to get confused by the complexities of his own making. -- E. W. Dijkstra
On Thu, 2021-02-18 at 17:37 -0500, Warren Kumari wrote:
Anyway, the subcontractor who made the power supplies for the vendor realized that they could save a few cents by not installing the little metal clip that held the heatsink to the MOSFET
I think it was Machiavelli who said that one should not ascribe to malice anything adequately explained by incompetence...
3: I used to work for a small ISP in Yonkers, NY.
There is actually a place called "Yonkers"?!? I always thought it was a joke placename. We don't really need joke placenames in Oz, since we have real ones like Woolloomooloo, Burpengary and Humpty Doo. My favourite is Numbugga (closely followed by Wonglepong).
I cannot remember what we used to call airport security pre-TSA...
"Useful"? Regards, K. -- ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ Karl Auer (kauer@biplane.com.au) http://www.biplane.com.au/kauer GPG fingerprint: 2561 E9EC D868 E73C 8AF1 49CF EE50 4B1D CCA1 5170 Old fingerprint: 8D08 9CAA 649A AFEF E862 062A 2E97 42D4 A2A0 616D
On Feb 18, 2021, at 6:10 PM, Karl Auer <kauer@biplane.com.au> wrote:
I think it was Macchiavelli who said that one should not ascribe to malice anything adequately explained by incompetence…
https://en.wikipedia.org/wiki/Hanlon%27s_razor Never attribute to malice that which is adequately explained by stupidity. I personally prefer this version from Robert A. Heinlein: Never underestimate the power of human stupidity. And to put it on topic, cover your EPOs In 1994, there was a major earthquake near the city of Los Angeles. City hall had to be evacuated and it would take over a year to reinforce the building to make it habitable again. My company moved all the systems in the basement of city hall to a new datacenter a mile or so away. After the install, we spent more than a week coaxing their ancient (even for 1994) machines back online, such as a Prime Computer and an AS400 with tons of DASD. Well, tons of cabinets, certainly less storage than my watch has now. I was in the DC going over something with the lady in charge when someone walked in to ask her something. She said “just a second”. That person took one step to the side of the door and leaned against the wall - right on an EPO which had no cover. Have you ever heard an entire row of DASD spin down instantly? Or taken 40 minutes to IPL an AS400? In the middle of the business day? For the second most populous city in the country? Me: Maybe you should get a cover for that? Her: Good idea. Couple weeks later, in the same DC, going over final checklist. A fedex guy walks in. (To this day, no idea how he got in a supposedly locked DC.) She says “just a second”, and I get a very strong deja vu feeling. He takes one step to the side and leans against the wall. Me: Did you order that EPO cover? Her: Nope. -- TTFN, patrick
Northridge quake. I was #2 and on call at CRL. That One Guy on dialup in Atlanta playing MUDs 23x7 pages that things are down. I wander out to my computer to dial in and see what’s up, turned on TV walking past it, sat down and turned computer on, as it was booting on comes a live helicopter shot over Northridge showing the 1.5 remaining floors of the 3-story Cable and Wireless building our east coast connector went through. Took a second to listen and make sure I understood what was happening, changed channels to verify it wasn’t a stunt, logged on and pinged our router there to confirm nothing there, call & wake up Jim: “East coast’s down because earthquake in Northridge and the C&W center fell down.” “....oh.” And then there was the Sidekick outage... -George Sent from my iPhone
On Feb 18, 2021, at 4:37 PM, Patrick W. Gilmore <patrick@ianai.net> wrote:
On Feb 18, 2021, at 6:10 PM, Karl Auer <kauer@biplane.com.au> wrote:
I think it was Macchiavelli who said that one should not ascribe to malice anything adequately explained by incompetence…
https://en.wikipedia.org/wiki/Hanlon%27s_razor Never attribute to malice that which is adequately explained by stupidity.
I personally prefer this version from Robert A. Heinlein: Never underestimate the power of human stupidity.
And to put it on topic, cover your EPOs
In 1994, there was a major earthquake near the city of Los Angeles. City hall had to be evacuated and it would take over a year to reinforce the building to make it habitable again. My company moved all the systems in the basement of city hall to a new datacenter a mile or so away. After the install, we spent more than a week coaxing their ancient (even for 1994) machines back online, such as a Prime Computer and an AS400 with tons of DASD. Well, tons of cabinets, certainly less storage than my watch has now.
I was in the DC going over something with the lady in charge when someone walked in to ask her something. She said “just a second”. That person took one step to the side of the door and leaned against the wall - right on an EPO which had no cover.
Have you ever heard an entire row of DASD spin down instantly? Or taken 40 minutes to IPL an AS400? In the middle of the business day? For the second most populous city in the country?
Me: Maybe you should get a cover for that? Her: Good idea.
Couple weeks later, in the same DC, going over final checklist. A fedex guy walks in. (To this day, no idea how he got in a supposedly locked DC.) She says “just a second”, and I get a very strong deja vu feeling. He takes one step to the side and leans against the wall.
Me: Did you order that EPO cover? Her: Nope.
-- TTFN, patrick
On Thu, Feb 18, 2021 at 07:34:39AM -0500, Patrick W. Gilmore wrote:
In 1994, there was a major earthquake near the city of Los Angeles. City hall had to be evacuated and it would take over a year to reinforce the building to make it habitable again. My company moved all the systems in the basement of city hall to a new datacenter a mile or so away. After the install, we spent more than a week coaxing their ancient (even for 1994) machines back online, such as a Prime Computer and an AS400 with tons of DASD. Well, tons of cabinets, certainly less storage than my watch has now.
I was in the DC going over something with the lady in charge when someone walked in to ask her something. She said "just a second". That person took one step to the side of the door and leaned against the wall - right on an EPO which had no cover.
Have you ever heard an entire row of DASD spin down instantly? Or taken 40 minutes to IPL an AS400? In the middle of the business day? For the second most populous city in the country?
Me: Maybe you should get a cover for that? Her: Good idea.
Couple weeks later, in the same DC, going over final checklist. A fedex guy walks in. (To this day, no idea how he got in a supposedly locked DC.) She says "just a second", and I get a very strong deja vu feeling. He takes one step to the side and leans against the wall.
Me: Did you order that EPO cover? Her: Nope.
some of the ibm 4300 series mini-mainframes came with a console terminal that had a very large, raised (completely not flush), alternate power button on the upper panel of the keyboard, facing the operator. in later versions, the button was inset in a little open box with high sides. in earlier versions, there was just a pair of raised ribs on either side of the button. in the earliest version, if that panel needed to be replaced, the replacement part didn't even have those protective ribs, this huge button was just sitting there. on our 4341, someone had dropped the keyboard during installation and the damaged panel was replaced with the no-protection-whatsoever part. i had an operator who, working a double shift into the overnight run, fell asleep and managed to bang his head square on the button. the overnight jobs running were left in various states of ruin. third party manufacturers had an easy sell for lucite power/EPO button covers. -- Henry Yen Aegis Information Systems, Inc. Senior Systems Programmer Hicksville, New York
Well... During my younger days, that button was used a few times by the operator of a VM/370 to regain control from someone with a "curious mind" *cough* *cough*... ----- Alain Hebert ahebert@pubnix.net PubNIX Inc. 50 boul. St-Charles P.O. Box 26770 Beaconsfield, Quebec H9W 6G7 Tel: 514-990-5911 http://www.pubnix.net Fax: 514-990-9443 On 2/20/21 4:07 AM, Henry Yen wrote:
On Thu, Feb 18, 2021 at 07:34:39AM -0500, Patrick W. Gilmore wrote:
In 1994, there was a major earthquake near the city of Los Angeles. City hall had to be evacuated and it would take over a year to reinforce the building to make it habitable again. My company moved all the systems in the basement of city hall to a new datacenter a mile or so away. After the install, we spent more than a week coaxing their ancient (even for 1994) machines back online, such as a Prime Computer and an AS400 with tons of DASD. Well, tons of cabinets, certainly less storage than my watch has now.
I was in the DC going over something with the lady in charge when someone walked in to ask her something. She said "just a second". That person took one step to the side of the door and leaned against the wall - right on an EPO which had no cover.
Have you ever heard an entire row of DASD spin down instantly? Or taken 40 minutes to IPL an AS400? In the middle of the business day? For the second most populous city in the country?
Me: Maybe you should get a cover for that? Her: Good idea.
Couple weeks later, in the same DC, going over final checklist. A fedex guy walks in. (To this day, no idea how he got in a supposedly locked DC.) She says "just a second", and I get a very strong deja vu feeling. He takes one step to the side and leans against the wall.
Me: Did you order that EPO cover? Her: Nope.
some of the ibm 4300 series mini-mainframes came with a console terminal that had a very large, raised (completely not flush), alternate power button on the upper panel of the keyboard, facing the operator. in later versions, the button was inset in a little open box with high sides. in earlier versions, there was just a pair of raised ribs on either side of the button. in the earliest version, if that panel needed to be replaced, the replacement part didn't even have those protective ribs, this huge button was just sitting there. on our 4341, someone had dropped the keyboard during installation and the damaged panel was replaced with the no-protection-whatsoever part.
i had an operator who, working a double shift into the overnight run, fell asleep and managed to bang his head square on the button. the overnight jobs running were left in various states of ruin.
third party manufacturers had an easy sell for lucite power/EPO button covers.
-- Henry Yen Aegis Information Systems, Inc. Senior Systems Programmer Hicksville, New York
On 2/22/21 9:14 AM, Alain Hebert wrote:
*[External Email]*
Well...
During my younger days, that button was used a few time by the operator of a VM/370 to regain control from someone with a "curious mind" *cought* *cought*...
Two horror stories I remember from long ago when I was a console jockey for a federal space agency that will remain nameless :P 1. A coworker brought her daughter to work with her on a Saturday overtime shift because she couldn't get a babysitter. She parked the kid with a coloring book and a pile of crayons at the only table in the console room with some space, right next to the master console for our 3081. I asked her to make sure she was well away from the console, and as she reached over to scoot the girl and her coloring books further away she slipped, and reached out to steady herself. Yep, planted her finger right down on the IML button (plexi covers? We don' need no STEENKIN' plexi covers!). MVS and VM vanished, two dozen tape drives rewound and several hours' worth of data merge jobs went blooey. 2. The 3081 was water cooled via a heat exchanger. The building chilled water feed had a very old, very clogged filter that was bypassed until it could be replaced. One day a new maintenance foreman came through the building doing his "clipboard and harried expression" thing, and spotted the filter in bypass (NO, I don't know WHY it hadn't been red-tagged. Someone clearly dropped that ball.) He thought, "Well that's not right" and reset all the valves to put it back inline, which of course, pretty much killed the chilled water flow through the heat exchanger. First thing we knew about it in Operations was when the 3081 started throwing thermal alarms and MVS crashed hard. IBM had to replace several modules in the CPUs. -- -------------------------------------------- Bruce H. McIntosh Network Engineer II University of Florida Information Technology bhm@ufl.edu 352-273-1066
Many years ago I experienced a very similar thing. The DC/Integrator I worked for outsourced the co-location and operation of mainframe services for several banks and government organisations. One of these banks had a significant investment in AS/400's and they decided that it was so much hassle and expense using our datacentres that they would start putting those nice small AS/400's in computer rooms in their office buildings instead. One particular computer room contained large line printers that the developers would use to print out whatever it is such people print out. One Saturday morning I received a frantic call from the customer to say that all their primary production AS/400's had gone offline. After a short investigation I realised that all the offline devices were in this particular computer room. It turns out that one of the developers had brought his six-year-old son to work that Saturday, and upon retrieval of a printout said son had dutifully followed dad into the computer room and was unable to resist the big red button sitting exposed on the wall by the door. Shortly thereafter the embarrassed customer decided that perhaps it was worth relocating their AS/400's to our expensive datacentres.
During my younger days, that button was used a few time by the operator of a VM/370 to regain control from someone with a "curious mind" *cought* *cought*...
Two horror stories I remember from long ago when I was a console jockey for a federal space agency that will remain nameless :P 1. A coworker brought her daughter to work with her on a Saturday overtime shift because she couldn't get a babysitter. She parked the kid with a coloring book and a pile of crayons at the only table in the console room with some space, right next to the master console for our 3081. I asked her to make sure sh was well away from the console, and as she reached over to scoot the girl and her coloring books further away she slipped, and reached out to steady herself. Yep, planted her finger right down on the IML button (plexi covers? We don' need no STEENKIN' plexi covers!). MVS and VM vanished, two dozen tape drives rewound and several hours' worth of data merge jobs went blooey.
At Boston Univ we discovered the hard way that a security guard's walkie-talkie could cause a $5,000 (or $10K for the big machine room) Halon dump. Took a couple of times before we figured out the connection, tho once someone made it to the hold button before it actually dumped. Speaking of halon, one very hot day I'm goofing off drinking coffee at a nearby sub shop when the owner tells me someone from the computing center was on the phone; that never happened before. Some poor operator was holding the halon shot, it's a deadman's switch (well, button), and the building was doing its 110db thing; could I come help? The building is being evac'd. So my boss, who wasn't the sharpest knife in the drawer, follows me down as I enter and I'm sweating like a pig with a floor panel sucker trying to figure out which zone tripped. And he shouts at me over the alarms: WHY TF DOES IT DO THIS?! Angrily. I answered: well, maybe THERE'S A FIRE!!! At which point I notice the back of my shoulder is really bothering me, which I say to him, and he says hmmm there's a big bee on your back maybe he's stinging you? Fun day. -- -Barry Shein Software Tool & Die | bzs@TheWorld.com | http://www.TheWorld.com Purveyors to the Trade | Voice: +1 617-STD-WRLD | 800-THE-WRLD The World: Since 1989 | A Public Information Utility | *oo*
Let me tell you about my personal favorite. It's 2002 and I am working as an engineer for an electronic stock trading platform (ECN); this platform happened to be the biggest platform for trading stocks electronically, on some days bigger than NASDAQ itself. This platform also happened to be run on DOS, FoxPro and a Novell file share, on a cluster of roughly 1,000 computers, two of which were the "engine" that matched all of the trades. Well, FoxPro has this "feature" where the ESC key halts the running program. We had the ability to remote control these DOS/FoxPro machines via some program we had written. Someone asked me to check the status of the process running on the primary matching engine, and when I was done, out of habit, I hit ESC. Trade processing grinds to a halt (phone calls have to be made to the SEC). I immediately called the NOC and told them it was me. Next thing I know, someone from the NOC is at my desk with a screwdriver, prying the ESC key off my keyboard. I remained ESC-keyless for the next several years until I left the company. I was hazed pretty good over it, but was essentially given a one-time pass.
On Feb 22, 2021, at 7:30 PM, bzs@theworld.com wrote:
At Boston Univ we discovered the hard way that a security guard's walkie-talkie could cause a $5,000 (or $10K for the big machine room) Halon dump.
Took a couple of times before we figured out the connection tho once someone made it to the hold button before it actually dumped.
Speaking of halon one very hot day I'm goofing off drinking coffee at a nearby sub shop when the owner tells me someone from the computing center was on the phone, that never happened before.
Some poor operator was holding the halon shot, it's a deadman's switch (well, button) and the building was doing its 110db thing could I come help? The building is being evac'd.
So my boss who wasn't the sharpest knife in the drawer follows me down as I enter and I'm sweating like a pig with a floor panel sucker trying to figure out which zone tripped.
And he shouts at me over the alarms: WHY TF DOES IT DO THIS?! Angrily.
I answered: well, maybe THERE'S A FIRE!!!
At which point I notice the back of my shoulder is really bothering me, which I say to him, and he says hmmm there's a big bee on your back maybe he's stinging you?
Fun day.
-- -Barry Shein
Software Tool & Die | bzs@TheWorld.com | http://www.TheWorld.com Purveyors to the Trade | Voice: +1 617-STD-WRLD | 800-THE-WRLD The World: Since 1989 | A Public Information Utility | *oo*
On Mon, Feb 22, 2021 at 7:31 PM <bzs@theworld.com> wrote:
At Boston Univ we discovered the hard way that a security guard's walkie-talkie could cause a $5,000 (or $10K for the big machine room) Halon dump.
At one of the AOL datacenters there was some convoluted fire marshal reason why a specific door could not be locked "during business hours" (?!), and so there was a guard permanently stationed outside. The door was all the way around the back of the building, and so basically never used - and so the guard would fall asleep outside it with a piece of cardboard saying "Please wake me before entering". He was a nice guy (and it was less faff than the main entrance), and so we'd either sneak in and just not tell anyone, or talk loudly while going round the corner so he could pretend to have been awake the whole time... W
Took a couple of times before we figured out the connection tho once someone made it to the hold button before it actually dumped.
Speaking of halon one very hot day I'm goofing off drinking coffee at a nearby sub shop when the owner tells me someone from the computing center was on the phone, that never happened before.
Some poor operator was holding the halon shot, it's a deadman's switch (well, button) and the building was doing its 110db thing could I come help? The building is being evac'd.
So my boss who wasn't the sharpest knife in the drawer follows me down as I enter and I'm sweating like a pig with a floor panel sucker trying to figure out which zone tripped.
And he shouts at me over the alarms: WHY TF DOES IT DO THIS?! Angrily.
I answered: well, maybe THERE'S A FIRE!!!
At which point I notice the back of my shoulder is really bothering me, which I say to him, and he says hmmm there's a big bee on your back maybe he's stinging you?
Fun day.
-- -Barry Shein
Software Tool & Die | bzs@TheWorld.com | http://www.TheWorld.com Purveyors to the Trade | Voice: +1 617-STD-WRLD | 800-THE-WRLD The World: Since 1989 | A Public Information Utility | *oo*
-- The computing scientist’s main challenge is not to get confused by the complexities of his own making. -- E. W. Dijkstra
Patrick W. Gilmore <patrick@ianai.net> wrote:
Me: Did you order that EPO cover? Her: Nope.
There are apparently two kinds of EPO cover: - the kind that stops you from pressing the button by mistake; - and the kind that doesn't, and instead locks the button down to make sure it isn't un-pressed until everything is safe. We had a series of incidents similar to yours, so an EPO cover was belatedly installed. We learned about the second kind of EPO cover when a colleague proudly demonstrated that the EPO button should no longer be pressed by accident, or so he thought. Tony. -- f.anthony.n.finch <dot@dotat.at> http://dotat.at/ the quest for freedom and justice can never end
Long ago, in a galaxy far away I worked for a gov't contractor on site at a gov't site... We had our own cute little datacenter, and our 4 building complex had a central power distribution setup from utility -> buildings. It was really quite nice :) (the job, the buildings, the power and cute little datacenter) One fine Tues afternoon ~2pm local time, the building engineers decided they would make a copy of the key used to turn the main / utility power off... Of course they also needed to make sure their copy worked, so... they put the key in and turned it. Shockingly, the key worked! and no power was provided to the buildings :( It was very suddenly very dark and very quiet... (then the yelling started) Ok, fast forward 7 days... rerun the movie... Yes, the same building engineers made a new copy, and .. tested that new copy in the same manner. For neither of these events did someone tell the rest of us (and our customers): "Hey, we MAY interrupt power to the buildings... FYI, BTW, make sure your backups are current..." I recall we got the name of the engineer the 1st time around, but not the second. On Mon, Feb 22, 2021 at 12:26 PM Tony Finch <dot@dotat.at> wrote:
Patrick W. Gilmore <patrick@ianai.net> wrote:
Me: Did you order that EPO cover? Her: Nope.
There are apparently two kinds of EPO cover:
- the kind that stops you from pressing the button by mistake;
- and the kind that doesn't, and instead locks the button down to make sure it isn't un-pressed until everything is safe.
We had a series of incidents similar to yours, so an EPO cover was belatedly installed. We learned about the second kind of EPO cover when a colleague proudly demonstrated that the EPO button should no longer be pressed by accident, or so he thought.
Tony. -- f.anthony.n.finch <dot@dotat.at> http://dotat.at/ the quest for freedom and justice can never end
warren> 2: A somewhat similar thing would happen with the Ascend TNT warren> Max, which had side-to-side airflow. These were dial termination warren> boxes, and so people would install racks and racks of them. The warren> first one would draw in cool air on the left, heat it up and warren> ship it out the right. The next one over would draw in warm air warren> on the left, heat it up further, and ship it out the warren> right... Somewhere there is a fairly famous photo of a rack of warren> TNT Maxes, with the final one literally on fire, and still warren> passing packets. The Ascend MAX (TNT was the T3 version, the MAX took 2 T1s) was originally an ISDN device. We got the first v.34 Rockwell modem version for testing. An individual card had 4 daughter boards. They were burned in for 24 hours at Ascend, then shipped to us. We were doing stress testing in Fairfax VA. Turns out that the boards started to overheat at about 30 hours and caught fire a few hours after that... Completely melted the daughterboards. They did fix that issue and upped the burn-in test period to 48 hours. And yeah, they vented side to side. They were designed for enclosed racks where air flow was forced up. We were colocating at telco POPs so we had to use center-mount open relay racks. The air flow was as you describe. Good times. Had by all... Both we (UUNET, for MSN and Earthlink) and AOL were using these for dialup access. 80k ports before we switched to the TNTs, 3+ million ports on TNTs by the time I stopped paying attention.
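Warren's description of the cascading intakes is easy to put numbers on. A toy sketch in Python; the cold-aisle temperature and per-chassis temperature rise below are assumptions for illustration only, not Ascend specifications:

# Toy model of side-to-side airflow in a row of chassis: each unit's
# exhaust becomes the next unit's intake. All numbers are made up.
ROOM_INTAKE_C = 22.0       # assumed cold-aisle temperature
RISE_PER_CHASSIS_C = 8.0   # assumed temperature rise across one chassis
CHASSIS_IN_ROW = 6

intake = ROOM_INTAKE_C
for position in range(1, CHASSIS_IN_ROW + 1):
    exhaust = intake + RISE_PER_CHASSIS_C
    print(f"chassis {position}: intake {intake:.0f}C, exhaust {exhaust:.0f}C")
    intake = exhaust  # the next unit breathes this one's exhaust

With those assumed numbers the sixth chassis is already inhaling roughly 62C air, which is why the baffles (or the fire) show up at the end of the row.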
when employer had shipped 2xJ to london, had the circuits up, ... the local office sat on their hands. for weeks. i finally was pissed enough to throw my toolbag over my shoulder, get on a plane, and fly over. i walked into the fancy office and said "hi, i am randy, vp eng, here to help you turn up the routers." they managed to turn them up pretty quickly.
On Feb 18, 2021, at 4:37 PM, Warren Kumari <warren@kumari.net> wrote:
4: Not too long after I started doing networking (and for the same small ISP in Yonkers), I'm flying off to install a new customer. I (of course) think that I'm hot stuff because I'm going to do the install, configure the router, whee, look at me! Anyway, I don't want to check a bag, and so I stuff the Cisco 2501 in a carryon bag, along with tools, etc (this was all pre-9/11!). I'm going through security and the TSA[0] person opens my bag and pulls the router out. "What's this?!" he asks. I politely tell him that it's a router. He says it's not. I'm still thinking that I'm the new hotness, and so I tell him in a somewhat condescending way that it is, and I know what I'm talking about. He tells me that it's not a router, and is starting to get annoyed. I explain using my "talking to a 5 year old" voice that it most certainly is a router. He tells me that lying to airport security is a federal offense, and starts looming at me. I adjust my attitude and start explaining that it's like a computer and makes the Internet work. He gruffly hands me back the router, I put it in my bag and scurry away. As I do so, I hear him telling his colleague that it wasn't a router, and that he certainly knows what a router is, because he does woodwork…
Well, in his defense, he wasn’t wrong… :-) ---- Andy Ringsmuth 5609 Harding Drive Lincoln, NE 68521-5831 (402) 304-0083 andy@andyring.com “Better even die free, than to live slaves.” - Frederick Douglas, 1863
On Fri, 19 Feb 2021, Andy Ringsmuth wrote:
I explain using my "talking to a 5 year old" voice that it most certainly is a router. He tells me that lying to airport security is a federal offense, and starts looming at me. I adjust my attitude and start explaining that it's like a computer and makes the Internet work. He gruffly hands me back the router, I put it in my bag and scurry away. As I do so, I hear him telling his colleague that it wasn't a router, and that he certainly knows what a router is, because he does woodwork…
Well, in his defense, he wasn’t wrong… :-)
This is why, in the UK, we tend to pronounce "router" as "router", and "router" as "router", so there's no confusion. You're welcome. Jethro. . . . . . . . . . . . . . . . . . . . . . . . . . Jethro R Binks, Network Manager, Information Services Directorate, University Of Strathclyde, Glasgow, UK The University of Strathclyde is a charitable body, registered in Scotland, number SC015263.
On 2/19/21 00:37, Warren Kumari wrote:
5: Another one. In the early 2000s I was working for a dot-com boom company. We are building out our first datacenter, and I'm installing a pair of Cisco 7206s in 811 10th Ave. These will run basically the entire company, we have some transit, we have some peering to configure, we have an AS, etc. I'm going to be configuring all of this; clearly I'm a router-god... Anyway, while I'm getting things configured, this janitor comes past, wheeling a garbage bin. He stops outside the cage and says "Whatcha doin'?". I go into this long explanation of how these "routers" <point> will connect to "the Internet" <wave hands in a big circle> to allow my "servers" <gesture at big black boxes with blinking lights> to talk to other "computers" <typing motion> on "the Internet" <again with the waving of the hands>. He pauses for a second, and says "'K. So, you doing a full iBGP mesh, or confeds?". I really hadn't intended to be a condescending ass, but I think of that every time I realize I might be assuming something about someone based on their attire/job/etc.
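As an aside for readers who never got asked the janitor's question: the practical difference he was probing at is largely session count and configuration overhead. A rough back-of-the-envelope sketch in Python, with router counts and a sub-AS split that are invented purely for illustration:

# iBGP session counts: full mesh vs. a simple confederation layout.
def full_mesh_sessions(n_routers):
    """Every iBGP speaker peers with every other one: n*(n-1)/2 sessions."""
    return n_routers * (n_routers - 1) // 2

def confederation_sessions(sub_as_sizes):
    """Full mesh inside each sub-AS, plus one confed-eBGP session per
    adjacent pair of sub-ASes (modelled here as a simple chain)."""
    internal = sum(full_mesh_sessions(size) for size in sub_as_sizes)
    return internal + (len(sub_as_sizes) - 1)

print(full_mesh_sessions(40))                    # 780 sessions to maintain
print(confederation_sessions([10, 10, 10, 10]))  # 4 sub-ASes of 10: 183 sessions

Either way the routes get everywhere; confederations (like route reflection) just tame the n-squared growth of the full mesh.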
:-), cute. Mark.
Did you at least hire the janitor? From: NANOG <nanog-bounces+ops.lists=gmail.com@nanog.org> on behalf of Mark Tinka <mark@tinka.africa> Date: Friday, 19 February 2021 at 10:20 AM To: nanog@nanog.org <nanog@nanog.org> Subject: Re: Famous operational issues On 2/19/21 00:37, Warren Kumari wrote: 5: Another one. In the early 2000s I was working for a dot-com boom company. We are building out our first datacenter, and I'm installing a pair of Cisco 7206s in 811 10th Ave. These will run basically the entire company, we have some transit, we have some peering to configure, we have an AS, etc. I'm going to be configuring all of this; clearly I'm a router-god... Anyway, while I'm getting things configured, this janitor comes past, wheeling a garbage bin. He stops outside the cage and says "Whatcha doin'?". I go into this long explanation of how these "routers" <point> will connect to "the Internet" <wave hands in a big circle> to allow my "servers" <gesture at big black boxes with blinking lights> to talk to other "computers" <typing motion> on "the Internet" <again with the waving of the hands>. He pauses for a second, and says "'K. So, you doing a full iBGP mesh, or confeds?". I really hadn't intended to be a condescending ass, but I think of that every time I realize I might be assuming something about someone based on thier attire/job/etc. :-), cute. Mark.
On Feb 18, 2021, at 11:51 PM, Suresh Ramasubramanian <ops.lists@gmail.com> wrote:
On 2/19/21 00:37, Warren Kumari wrote:
and says "'K. So, you doing a full iBGP mesh, or confeds?". I really hadn't intended to be a condescending ass, but I think of that every time I realize I might be assuming something about someone based on thier attire/job/etc.
Did you at least hire the janitor?
Well, it's funny that you mention that because I worked at a place where the company ended up hiring a young lady who worked in the cafeteria. When she graduated she was offered a job in HR, and turned out to be absolutely awesome. At some point in my life, I was carrying 50lbs bags of potato starch. Now I have two graduate degrees and am working on a third. That janitor may be awesome, too! Thanks, Sabri
He is. He asked a perfectly relevant question based on what he saw of the physical setup in front of him. And he kept his cool when being talked down to. I’d hire him the next minute, personally speaking. From: Sabri Berisha <sabri@cluecentral.net> Date: Friday, 19 February 2021 at 2:02 PM To: Suresh Ramasubramanian <ops.lists@gmail.com> Cc: nanog <nanog@nanog.org> Subject: Re: Famous operational issues On Feb 18, 2021, at 11:51 PM, Suresh Ramasubramanian <ops.lists@gmail.com> wrote:
On 2/19/21 00:37, Warren Kumari wrote:
and says "'K. So, you doing a full iBGP mesh, or confeds?". I really hadn't intended to be a condescending ass, but I think of that every time I realize I might be assuming something about someone based on thier attire/job/etc.
Did you at least hire the janitor?
Well, it's funny that you mention that because I worked at a place where the company ended up hiring a young lady who worked in the cafeteria. When she graduated she was offered a job in HR, and turned out to be absolutely awesome. At some point in my life, I was carrying 50lbs bags of potato starch. Now I have two graduate degrees and am working on a third. That janitor may be awesome, too! Thanks, Sabri
On 2/19/21 10:40, Suresh Ramasubramanian wrote:
He is. He asked a perfectly relevant question based on what he saw of the physical setup in front of him.
And he kept his cool when being talked down to.
I’d hire him the next minute, personally speaking.
In the early 2000's, with that level of deduction, I'd have been surprised if he wasn't snatched up quickly. Unless, of course, it ultimately wasn't his passion. Mark.
On Fri, Feb 19, 2021 at 9:40 AM Warren Kumari <warren@kumari.net> wrote:
4: Not too long after I started doing networking (and for the same small ISP in Yonkers), I'm flying off to install a new customer. I (of course) think that I'm hot stuff because I'm going to do the install, configure the router, whee, look at me! Anyway, I don't want to check a bag, and so I stuff the Cisco 2501 in a carryon bag, along with tools, etc (this was all pre-9/11!). I'm going through security and the TSA[0] person opens my bag and pulls the router out. "What's this?!" he asks. I politely tell him that it's a router. He says it's not. I'm still thinking that I'm the new hotness, and so I tell him in a somewhat condescending way that it is, and I know what I'm talking about. He tells me that it's not a router, and is starting to get annoyed. I explain using my "talking to a 5 year old" voice that it most certainly is a router. He tells me that lying to airport security is a federal offense, and starts looming at me. I adjust my attitude and start explaining that it's like a computer and makes the Internet work. He gruffly hands me back the router, I put it in my bag and scurry away. As I do so, I hear him telling his colleague that it wasn't a router, and that he certainly knows what a router is, because he does woodwork...
OK, Warren, achievement unlocked. You've just made a network engineer google 'router'.... P.S. I guess I'm obliged to tell a story if I respond to this thread...so... "Servers and the ice cream factory". Late spring/early summer in Moscow. The temperature was above 30C (86°F). I worked for a local content provider. Aircons in our server room died, the technician ETA was 2 days (I guess we were not the only ones with aircon problems). So we drove to the nearby ice cream factory and got *a lot* of dry ice. Then we had a roster: every few hours one person took a deep breath, grabbed a box of dry ice, ran into the server room and emptied the box on top of the racks. The backup person was watching through the glass door - just in case, you know, ready to start the rescue operation. We (and the servers) survived till the technician arrived. And we had a lot of dry ice to cool the beer.. -- SY, Jen Linkova aka Furry
Jen Linkova писал 2021-02-19 00:04:
OK, Warren, achievement unlocked. You've just made a network engineer to google 'router'....
He meant the machine we call a "frezer"... (in our language ;) I heard a similar story from my colleague who was working at that time for Huawei as a DWDM engineer and had to fly frequently with testing devices. One time he tried to explain at airport security control what a DWDM spectrum analyser is for; the officer called a colleague over for help, who said something like this: "DWDM spectrum analyser? Pass it, usual thing..." -- Kind regards, Andrey Kostin
On Feb 18, 2021, at 9:04 PM, Jen Linkova <furry13@gmail.com> wrote:
On Fri, Feb 19, 2021 at 9:40 AM Warren Kumari <warren@kumari.net> wrote:
4: Not too long after I started doing networking (and for the same small ISP in Yonkers), I'm flying off to install a new customer. I (of course) think that I'm hot stuff because I'm going to do the install, configure the router, whee, look at me! Anyway, I don't want to check a bag, and so I stuff the Cisco 2501 in a carryon bag, along with tools, etc (this was all pre-9/11!). I'm going through security and the TSA[0] person opens my bag and pulls the router out. "What's this?!" he asks. I politely tell him that it's a router. He says it's not. I'm still thinking that I'm the new hotness, and so I tell him in a somewhat condescending way that it is, and I know what I'm talking about. He tells me that it's not a router, and is starting to get annoyed. I explain using my "talking to a 5 year old" voice that it most certainly is a router. He tells me that lying to airport security is a federal offense, and starts looming at me. I adjust my attitude and start explaining that it's like a computer and makes the Internet work. He gruffly hands me back the router, I put it in my bag and scurry away. As I do so, I hear him telling his colleague that it wasn't a router, and that he certainly knows what a router is, because he does woodwork...
OK, Warren, achievement unlocked. You've just made a network engineer to google 'router'....
P.S. I guess I'm obliged to tell a story if I respond to this thread...so... "Servers and the ice cream factory". Late spring/early summer in Moscow. The temperature above 30C (86°F). I worked for a local content provided. Aircons in our server room died, the technician ETA was 2 days ( I guess we were not the only ones with aircon problems). So we drove to the nearby ice cream factory and got *a lot* of dry ice. Then we have a roaster: every few hours one person took a deep breath, grabbed a box of dry ice, ran into the server room and emptied the box on top of the racks. The backup person was watching through the glass door - just in case, you know, ready to start the rescue operation. We (and the servers) survived till the technician arrived. And we had a lot of dry ice to cool the beer..
-- SY, Jen Linkova aka Furry
During a wood-working project for the Southern California Linux Expo (the tech team that (among other things) runs the network for the show was building new equipment carts), I came up with the following meme: [I don’t know if NANOG will pass the image despite its small size, so textual description: A bandaged hand with the index finger amputated at the second knuckle with overlaid red text stating “Careless Routing May Lead to Urgent Test of Self Healing Network”] Fortunately, we didn’t have any such issues with the router, though we did have one person suffer a crushed toe from a cabinet tip-over. Thankfully, the person made a full recovery. Owen
On Thursday, 18 February, 2021 22:37, "Warren Kumari" <warren@kumari.net> said:
4: Not too long after I started doing networking (and for the same small ISP in Yonkers), I'm flying off to install a new customer. I (of course) think that I'm hot stuff because I'm going to do the install, configure the router, whee, look at me! Anyway, I don't want to check a bag, and so I stuff the Cisco 2501 in a carryon bag, along with tools, etc (this was all pre-9/11!). I'm going through security and the TSA[0] person opens my bag and pulls the router out. "What's this?!" he asks. I politely tell him that it's a router. He says it's not. I'm still thinking that I'm the new hotness, and so I tell him in a somewhat condescending way that it is, and I know what I'm talking about. He tells me that it's not a router, and is starting to get annoyed. I explain using my "talking to a 5 year old" voice that it most certainly is a router. He tells me that lying to airport security is a federal offense, and starts looming at me. I adjust my attitude and start explaining that it's like a computer and makes the Internet work. He gruffly hands me back the router, I put it in my bag and scurry away. As I do so, I hear him telling his colleague that it wasn't a router, and that he certainly knows what a router is, because he does woodwork...
Here in the UK we avoid that issue by pronouncing the packet-shifter as "rooter", and only the wood-working tool as "rowter" :) Of course, it raises a different set of problems when talking to the Australians... Cheers, Tim.
On Mon, Feb 22, 2021 at 7:09 AM tim@pelican.org <tim@pelican.org> wrote:
On Thursday, 18 February, 2021 22:37, "Warren Kumari" <warren@kumari.net> said:
4: Not too long after I started doing networking (and for the same small ISP in Yonkers), I'm flying off to install a new customer. I (of course) think that I'm hot stuff because I'm going to do the install, configure the router, whee, look at me! Anyway, I don't want to check a bag, and so I stuff the Cisco 2501 in a carryon bag, along with tools, etc (this was all pre-9/11!). I'm going through security and the TSA[0] person opens my bag and pulls the router out. "What's this?!" he asks. I politely tell him that it's a router. He says it's not. I'm still thinking that I'm the new hotness, and so I tell him in a somewhat condescending way that it is, and I know what I'm talking about. He tells me that it's not a router, and is starting to get annoyed. I explain using my "talking to a 5 year old" voice that it most certainly is a router. He tells me that lying to airport security is a federal offense, and starts looming at me. I adjust my attitude and start explaining that it's like a computer and makes the Internet work. He gruffly hands me back the router, I put it in my bag and scurry away. As I do so, I hear him telling his colleague that it wasn't a router, and that he certainly knows what a router is, because he does woodwork...
Here in the UK we avoid that issue by pronouncing the packet-shifter as "rooter", and only the wood-working tool as "rowter" :)
Of course, it raises a different set of problems when talking to the Australians...
Yes. I discovered this while walking around Sydney wearing my "I have root @ Google" t-shirt.... got some odd looks/snickers... W
Cheers, Tim.
-- The computing scientist’s main challenge is not to get confused by the complexities of his own making. -- E. W. Dijkstra
On Feb 22, 2021, at 7:02 AM, tim@pelican.org wrote:
On Thursday, 18 February, 2021 22:37, "Warren Kumari" <warren@kumari.net> said:
4: Not too long after I started doing networking (and for the same small ISP in Yonkers), I'm flying off to install a new customer. I (of course) think that I'm hot stuff because I'm going to do the install, configure the router, whee, look at me! Anyway, I don't want to check a bag, and so I stuff the Cisco 2501 in a carryon bag, along with tools, etc (this was all pre-9/11!). I'm going through security and the TSA[0] person opens my bag and pulls the router out. "What's this?!" he asks. I politely tell him that it's a router. He says it's not. I'm still thinking that I'm the new hotness, and so I tell him in a somewhat condescending way that it is, and I know what I'm talking about. He tells me that it's not a router, and is starting to get annoyed. I explain using my "talking to a 5 year old" voice that it most certainly is a router. He tells me that lying to airport security is a federal offense, and starts looming at me. I adjust my attitude and start explaining that it's like a computer and makes the Internet work. He gruffly hands me back the router, I put it in my bag and scurry away. As I do so, I hear him telling his colleague that it wasn't a router, and that he certainly knows what a router is, because he does woodwork...
Here in the UK we avoid that issue by pronouncing the packet-shifter as "rooter", and only the wood-working tool as "rowter" :)
So wrong. A “root” server is part of the DNS. A “route” server is part of BGP.
Of course, it raises a different set of problems when talking to the Australians…
Everything is weird down under. But I still like them. :-) -- TTFN, patrick
On Thu, Feb 18, 2021 at 5:38 PM Warren Kumari <warren@kumari.net> wrote:
2: A somewhat similar thing would happen with the Ascend TNT Max, which had side-to-side airflow. These were dial termination boxes, and so people would install racks and racks of them. The first one would draw in cool air on the left, heat it up and ship it out the right. The next one over would draw in warm air on the left, heat it up further, and ship it out the right... Somewhere there is a fairly famous photo of a rack of TNT Maxes, with the final one literally on fire, and still passing packets.
We had several racks of TNTs at the peak of our dial POP phase, and I believe we ended up designing baffles for the sides of those racks to pull in cool air from the front of the rack to the left side of the chassis and exhaust it out the back from the right side. It wasn't perfect, but it did the job. The TNTs with channelized T3 interfaces were a great way to terminate lots of modems in a reasonable amount of rack space with minimal cabling. Thank you jms
On 2/18/21 1:07 AM, Eric Kuhnke wrote:
On that note, I'd be very interested in hearing stories of actual incidents that are the cause of why cardboard boxes are banned in many facilities, due to loose particulate matter getting into the air and setting off very sensitive fire detection systems.
I had a customer that tried to stack their servers - no rails except the bottom most one - using 2x4's between each server. Up until then I hadn't imagined anyone would want to fill their cabinet with wood, so I made a rule to ban wood and anything tangentially related (cardboard, paper, plastic, etc.). Easier to just ban all things. Fire reasons too but mainly I thought a cabinet full of wood was too stupid to allow. The "no wood" rule has become a fun story to tell everyone who asks how that ended up being a rule. The wood customer turned out to be a complete a-hole anyway, wood was just the tip of the iceberg.
Worked a chronic support call where their internet would bounce at noon every workday. The Cisco 1601 or 1700 router that had their T1 in ended up being on top of a microwave. Weeks of troubleshooting and shipping new routers on that one. Also had another one where the router was plugged in to an outlet that was controlled by a light switch, discovered after shipping them two new routers.

Customer had their building remodeled and the techs couldn't find the T1 smartjack for the building. The contractor who did the remodel job decided it would be a good idea to cut out the section of wall where the telco equipment was and mount it to the ceiling. Its new location was in the ladies' bathroom, above the drop ceiling, mounted to the building's rafters 10' in the air.

Customer needed a new router because the first one died. It was a machine shop and they had mounted the router to the wall next to a lathe or drill press that used oil to cool the bit while it was cutting. It looked like someone had dumped the router in a bucket of oil when we got it back.

Arriving at another large colo for a buildout, only to find that our ASR9K that had arrived 2 weeks earlier was stored outside on the loading dock, which had no roof or locked gate. I guess that's why Cisco puts the plastic bag over the chassis when they're shipped.

Colo techs at another larger colo decided to unpack our router, which was a fully loaded 1/2-rack chassis. Since they couldn't lift it, they tipped the router on its side and walked it back by shifting the weight from one corner of the chassis to another, bending the chassis. I could see the scrape marks in the floor from it.

We had colo space on the top floor of an ATT CO where we put a Cisco 7513 to terminate about a dozen CHDS3's. The roof was leaking and, instead of fixing the roof, the fix was to put a sheet of plastic over our cabinet. It was more like a tent over the cabinet. A pool of water formed in a divot at the top and it was 120+ degrees under the plastic tarp.

Our office was in a work loft of an older building and they had the AC units mounted to the ceiling with drip pans underneath them. Well, the AC on the 2nd floor had the pump for its drip pan die. Whoever installed the drip pan didn't secure it or center it under the AC unit. It filled up with water and, since it was not secured and was off center, the drip pan came crashing down with a few gallons of water. The water worked its way over to the wall and traveled down one story in the building. The floor below had all the telco equipment mounted to that same wall, and the water flowed right through a couple of ATT's Cienas mounted to the wall, shorting them out. I was at the Chicago NANOG hackathon on Sunday and was called out to work that one 😕

Was working in the back of a cabinet that had -48 VDC power for a Cisco router when a screw fell and shorted out the power. My co-worker standing in front of the rack wasn't happy, because the ADC PowerWorx fuse panel was about 6" from his face where he was working. It had those little black alarm fuses with the spring-loaded arm. When it tripped, a nice shower of sparks flew right at his face. Luckily he wore glasses.

I was 18 at my first IT job and it was a brand-new building. I was plugging in a 208VAC 30A APC UPS in the server room; the electrician had just energized and checked the circuit. I plugged in the APC UPS and gave it a good turn for the twist-lock plug to catch and KA BAMB!!! Sparks came shooting out of the outlet at me. I think I pooped myself that day.
Turns out the electricians decided that a single-gang electrical box was good enough for a 208 VAC 30A outlet that barely fit in the box, and they didn't put any tape around the wire terminals. When they energized the circuit there was enough of an air gap that the hot screw didn't ground out. When I gave it that good old twist while plugging in the APC, I grounded the hot screw to the side of the electrical box. ________________________________ From: NANOG <nanog-bounces+esundberg=nitelusa.com@nanog.org> on behalf of Seth Mattinen <sethm@rollernet.us> Sent: Thursday, February 18, 2021 10:23 AM To: nanog@nanog.org <nanog@nanog.org> Subject: Re: Famous operational issues On 2/18/21 1:07 AM, Eric Kuhnke wrote:
On that note, I'd be very interested in hearing stories of actual incidents that are the cause of why cardboard boxes are banned in many facilities, due to loose particulate matter getting into the air and setting off very sensitive fire detection systems.
I had a customer that tried to stack their servers - no rails except the bottom most one - using 2x4's between each server. Up until then I hadn't imagined anyone would want to fill their cabinet with wood, so I made a rule to ban wood and anything tangentially related (cardboard, paper, plastic, etc.). Easier to just ban all things. Fire reasons too but mainly I thought a cabinet full of wood was too stupid to allow. The "no wood" rule has become a fun story to tell everyone who asks how that ended up being a rule. The wood customer turned out to be a complete a-hole anyway, wood was just the tip of the iceberg.
On Thursday, 18 February, 2021 16:23, "Seth Mattinen" <sethm@rollernet.us> said:
I had a customer that tried to stack their servers - no rails except the bottom most one - using 2x4's between each server. Up until then I hadn't imagined anyone would want to fill their cabinet with wood, so I made a rule to ban wood and anything tangentially related (cardboard, paper, plastic, etc.). Easier to just ban all things. Fire reasons too but mainly I thought a cabinet full of wood was too stupid to allow.
On the "stupid racking" front, I give you most of a rack dedicated to a single server. Not all that high a server, maybe 2U or so, but *way* too deep for the rack, so it had been installed vertically. By looping some fairly hefty chain through the handles on either side of the front of the chassis, and then bolting the four chain ends to the four rack posts. I wish I'd kept pictures of that one. Not flammable, but a serious WTF moment. Cheers, Tim.
A few I remember:
- Some monitoring server's SCSI drive failed (we're talking State/Province-level govt)... Got a response back stating it would take a 6-month delay to get a replacement... Ended up choosing to use my own drive instead of leaving something that could have been deadly unmonitored.
- Metro interruption during rush hour (for a pop. of 4M) due to an overloaded power bar in a MMR (Meet Me Room) during an unplanned deployment;
- Cherry-red and very angry looking 520-600V bus bar =D;
- Fire fighters hitting the building generator's emergency STOP button because some neighbor reported smoke on top of the building during a blackout... (not their fault, local gov failure as usual)
- Some idiots poured gasoline into a large pipe under a bridge... ended up demonstrating the lack of diversity to the DCs on that urban island;
- Underground transformer blew up downtown Mtl and took out the entire fiber bundle, demonstrating to those customers that their diversity was actually real =D. (took them a year to get that fixed) and
- Obviously: Any rack cabling I do...
----- Alain Hebert ahebert@pubnix.net PubNIX Inc. 50 boul. St-Charles P.O. Box 26770 Beaconsfield, Quebec H9W 6G7 Tel: 514-990-5911 http://www.pubnix.net Fax: 514-990-9443 On 2/18/21 2:37 PM, tim@pelican.org wrote:
On Thursday, 18 February, 2021 16:23, "Seth Mattinen" <sethm@rollernet.us> said:
I had a customer that tried to stack their servers - no rails except the bottom most one - using 2x4's between each server. Up until then I hadn't imagined anyone would want to fill their cabinet with wood, so I made a rule to ban wood and anything tangentially related (cardboard, paper, plastic, etc.). Easier to just ban all things. Fire reasons too but mainly I thought a cabinet full of wood was too stupid to allow. On the "stupid racking" front, I give you most of a rack dedicated to a single server. Not all that high a server, maybe 2U or so, but *way* too deep for the rack, so it had been installed vertically. By looping some fairly hefty chain through the handles on either side of the front of the chassis, and then bolting the four chain ends to the four rack posts. I wish I'd kept pictures of that one. Not flammable, but a serious WTF moment.
Cheers, Tim.
From a datacenter ROI and economics, cooling, HVAC perspective that might just be the best colo customer ever. As long as they're paying full price for the cabinet and nothing is *dangerous* about how they've hung the 2U server vertically, using up all that space for just one thing has to be a lot better than a customer that makes full and efficient use of space and all the amperage allotted to them.
On Thu, Feb 18, 2021 at 11:38 AM tim@pelican.org <tim@pelican.org> wrote:
On Thursday, 18 February, 2021 16:23, "Seth Mattinen" <sethm@rollernet.us> said:
I had a customer that tried to stack their servers - no rails except the bottom most one - using 2x4's between each server. Up until then I hadn't imagined anyone would want to fill their cabinet with wood, so I made a rule to ban wood and anything tangentially related (cardboard, paper, plastic, etc.). Easier to just ban all things. Fire reasons too but mainly I thought a cabinet full of wood was too stupid to allow.
On the "stupid racking" front, I give you most of a rack dedicated to a single server. Not all that high a server, maybe 2U or so, but *way* too deep for the rack, so it had been installed vertically. By looping some fairly hefty chain through the handles on either side of the front of the chassis, and then bolting the four chain ends to the four rack posts. I wish I'd kept pictures of that one. Not flammable, but a serious WTF moment.
Cheers, Tim.
Oh, I actually wanted to keep this for my memoirs, but if we can name dangerous datacenter operational issues... sometime in the 2000s: Somebody ran their own datacenter, which
- once had an active ant colony living under the raised floor and in the climate system,
- for a while had several electrical grounding defects, leading to the work instruction of “don’t touch any metallic or conducting materials”,
- for a minute, had a “look what we have bought on Ebay” UPS system, until it started to roast after being turned on,
- from time to time had climate issues, with room temperature peaking around 68 centigrade, and yes, some equipment survived and even continued to work.
Decided not to go back there after “look what we have bought on Ebay, an argon fire extinguisher, we just need to mount it”. On 20 Feb 2021, at 10:15, Eric Kuhnke wrote:
From a datacenter ROI and economics, cooling, HVAC perspective that might just be the best colo customer ever. As long as they're paying full price for the cabinet and nothing is *dangerous* about how they've hung the 2U server vertically, using up all that space for just one thing has to be a lot better than a customer that makes full and efficient use of space and all the amperage allotted to them.
On Thu, Feb 18, 2021 at 01:07:01AM -0800, Eric Kuhnke wrote:
On that note, I'd be very interested in hearing stories of actual incidents that are the cause of why cardboard boxes are banned in many facilities,
the datacenter manager's daughter's cat. -- Henry Yen Aegis Information Systems, Inc. Senior Systems Programmer Hicksville, New York
One day I got called into the office supplies area because there was a smell of something burning. Uh-oh. To make a long story short there was a stainless steel bowl which was focusing the sun from a window such that it was igniting a cardboard box. Talk about SMH and random bad luck which could have been a lot worse, nothing really happened other than some smoke and char. On February 18, 2021 at 01:07 eric.kuhnke@gmail.com (Eric Kuhnke) wrote:
On that note, I'd be very interested in hearing stories of actual incidents that are the cause of why cardboard boxes are banned in many facilities, due to loose particulate matter getting into the air and setting off very sensitive fire detection systems.
Or maybe it's more mundane and 99% of the reason is people unpack stuff and don't always clean up properly after themselves.
On Wed, Feb 17, 2021, 6:21 PM Owen DeLong <owen@delong.com> wrote:
Stolen isn’t nearly as exciting as what happens when your (used) 6509 arrives and gets installed and operational before anyone realizes that the conductive packing peanuts that it was packed in have managed to work their way into various midplane connectors. Several hours later someone notices that the box is quite literally smoldering in the colo and the resulting combination of panic, fire drill, and management antics that ensue.
Owen
> On Feb 16, 2021, at 2:08 PM, Jared Mauch <jared@puck.nether.net> wrote: > > I was thinking about how we need a war stories nanog track. My favorite was being on call when the router was stolen. > > Sent from my TI-99/4a > >> On Feb 16, 2021, at 2:40 PM, John Kristoff <jtk@dataplane.org> wrote: >> >> Friends, >> >> I'd like to start a thread about the most famous and widespread Internet >> operational issues, outages or implementation incompatibilities you >> have seen. >> >> Which examples would make up your top three? >> >> To get things started, I'd suggest the AS 7007 event is perhaps the >> most notorious and likely to top many lists including mine. So if >> that is one for you I'm asking for just two more. >> >> I'm particularly interested in this as the first step in developing a >> future NANOG session. I'd be particularly interested in any issues >> that also identify key individuals that might still be around and >> interested in participating in a retrospective. I already have someone >> that is willing to talk about AS 7007, which shouldn't be hard to guess >> who. >> >> Thanks in advance for your suggestions, >> >> John
-- -Barry Shein Software Tool & Die | bzs@TheWorld.com | http://www.TheWorld.com Purveyors to the Trade | Voice: +1 617-STD-WRLD | 800-THE-WRLD The World: Since 1989 | A Public Information Utility | *oo*
In the case of Exodus when I was working there, it was literally dictated to us by the fire marshal of the city of Santa Clara (and enough other cities where we had datacenters to make a universal policy the only sensible choice). Owen
On Feb 18, 2021, at 1:07 AM, Eric Kuhnke <eric.kuhnke@gmail.com> wrote:
On that note, I'd be very interested in hearing stories of actual incidents that are the cause of why cardboard boxes are banned in many facilities, due to loose particulate matter getting into the air and setting off very sensitive fire detection systems.
Or maybe it's more mundane and 99% of the reason is people unpack stuff and don't always clean up properly after themselves.
On Wed, Feb 17, 2021, 6:21 PM Owen DeLong <owen@delong.com <mailto:owen@delong.com>> wrote: Stolen isn’t nearly as exciting as what happens when your (used) 6509 arrives and gets installed and operational before anyone realizes that the conductive packing peanuts that it was packed in have managed to work their way into various midplane connectors. Several hours later someone notices that the box is quite literally smoldering in the colo and the resulting combination of panic, fire drill, and management antics that ensue.
Owen
On Feb 16, 2021, at 2:08 PM, Jared Mauch <jared@puck.nether.net <mailto:jared@puck.nether.net>> wrote:
I was thinking about how we need a war stories nanog track. My favorite was being on call when the router was stolen.
Sent from my TI-99/4a
On Feb 16, 2021, at 2:40 PM, John Kristoff <jtk@dataplane.org <mailto:jtk@dataplane.org>> wrote:
Friends,
I'd like to start a thread about the most famous and widespread Internet operational issues, outages or implementation incompatibilities you have seen.
Which examples would make up your top three?
To get things started, I'd suggest the AS 7007 event is perhaps the most notorious and likely to top many lists including mine. So if that is one for you I'm asking for just two more.
I'm particularly interested in this as the first step in developing a future NANOG session. I'd be particularly interested in any issues that also identify key individuals that might still be around and interested in participating in a retrospective. I already have someone that is willing to talk about AS 7007, which shouldn't be hard to guess who.
Thanks in advance for your suggestions,
John
On 16/02/2021 22:08, Jared Mauch wrote:
I was thinking about how we need a war stories nanog track. My favorite was being on call when the router was stolen.
Enough time has (probably) elapsed since my escapades in a small data centre in Manchester. The RFO was ten pages long, and I don't want to spoil the ending, but ... I later discovered that Cumulus' then VP of Engineering had elevated me to a veritable 'Hall of Infamy' for the support ticket attached to that particular tale. One day I'll be able to buy the guy that handled it a *lot* of whisky. He deserved it. -- Tom
Biggest internet operational SUCCESSES
1. Secure Shell (SSH) replaced TELNET. Nearly eliminated an entire class of security problems on the Internet. But then HTTP took over everything, so good news/bad news.
2. Internet worms massively reduced by changed default configurations and default firewalls (Windows XP proved defaults could be changed). Still need to work on DDoS amplification.
3. Head-of-line blocking in IX switches (although I miss Stephen Stuart saying "I'm Sorry" at every NANOG for a decade). Was a huge problem, which is a non-problem now.
4. Classless Inter-Domain Routing and BGP4 changed how Internet routing worked across the entire backbone, and it worked! Vince Fuller et al rebuilt the aircraft in flight, without crashing. (See the short sketch after this list.)
5. Y2K was a huge success because a lot of people fixed things ahead of time, and almost nothing crashed (other than the National Security Agency's internal systems :-). I'll be retired before Y2038, so that's someone else's problem.
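On item 4, the sketch promised above: for anyone who has only ever known classless routing, the core idea of CIDR aggregation fits in a few lines using Python's standard-library ipaddress module. The prefixes are RFC 5737 documentation space, purely for illustration; this is my own toy example, not anything from the actual CIDR deployment.

# CIDR in miniature: contiguous more-specific prefixes collapse into
# a single aggregate announcement.
import ipaddress

more_specifics = [
    ipaddress.ip_network("192.0.2.0/25"),
    ipaddress.ip_network("192.0.2.128/25"),
    ipaddress.ip_network("198.51.100.0/24"),
]

# collapse_addresses() merges adjacent or contained networks where possible.
aggregates = list(ipaddress.collapse_addresses(more_specifics))
print(aggregates)
# [IPv4Network('192.0.2.0/24'), IPv4Network('198.51.100.0/24')]

Doing that at Internet scale, instead of carrying every classful network separately, is what kept the default-free zone's routing tables from collapsing under their own weight.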
On 17 Feb 2021, at 09:51, Sean Donelan <sean@donelan.com> wrote:
Biggest internet operational SUCCESS
1. Secure Shell (SSH) replaced TELNET. Nearly eliminated an entire class of security problems on the Internet. But then HTTP took over everything, so a good news/bad news.
2. Internet worms massively reduced by changed default configurations and default firewalls (Windows XP proved defaults could be changed). Still need to work on DDOS amplification.
3. Head of Line blocking in IX switches (although I miss Stephen Stuart saying "I'm Sorry" at every NANOG for a decade). Was a huge problem, which is a non-problem now.
4. Classless Inter-Domain Routing and BGP4 changed how Internet routing worked across the entire backbone, and it worked! Vince Fuller et al rebuilt the aircraft in flight, without crashing.
5. Y2K was a huge suggess because a lot of people fixed things ahead time, and almost nothing crashed (other than the National Security Agency's internal systems :-). I'll be retired before Y2038, so that's someone else's problem.
Let's hope you aren’t depending on a piece of medical equipment with a Y2038 issue to keep you alive. Y2038 is everybody's problem! Mark -- Mark Andrews, ISC 1 Seymour St., Dundas Valley, NSW 2117, Australia PHONE: +61 2 9871 4742 INTERNET: marka@isc.org
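For anyone who wants to see exactly where Mark's deadline lands, a quick standard-library illustration in Python:

# The Y2038 boundary: a signed 32-bit time_t runs out of seconds here.
from datetime import datetime, timezone

last_good_second = 2**31 - 1  # 2147483647 seconds since the 1970 epoch
print(datetime.fromtimestamp(last_good_second, tz=timezone.utc))
# 2038-01-19 03:14:07+00:00; one second later a signed 32-bit counter
# wraps negative and anything still storing time in 32 bits lands back
# in December 1901.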
On 2/16/2021 9:37 AM, John Kristoff wrote:
I'd suggest the AS 7007 event is perhaps the most notorious and likely to top many lists including mine. --------------------------------------------------------
AS7007 is how I found NANOG. We (Digital Island; first job out of college) were in 10-20 countries around the planet at the time. All of them went down while we were in cisco training. I kept interrupting the class and telling my manager "everything's down! We need to stop the training and get on it!" We didn't because I was new and no one believed that much could go down all at once. They assumed it was a monitoring glitch. So, the training continued for a while until very senior engineers got involved. One of the senior guys said something to the effect of "yeah, it's all over NANOG." I said what is NANOG? I signed up for the list and many of you have had to listen to me ever since... ;) scott
If we're just talking about outages historically, I recall the 1996 AOL email debacle, not really anything to do with network mishaps but more so DNS configuration. As well, I believe the Northeast 2003 blackout was a great DR test that no one was expecting. Of course we also have the big non-events too, such as Y2K.... Regards -Joe B. On Tue, Feb 16, 2021 at 1:38 PM John Kristoff <jtk@dataplane.org> wrote:
Friends,
I'd like to start a thread about the most famous and widespread Internet operational issues, outages or implementation incompatibilities you have seen.
Which examples would make up your top three?
To get things started, I'd suggest the AS 7007 event is perhaps the most notorious and likely to top many lists including mine. So if that is one for you I'm asking for just two more.
I'm particularly interested in this as the first step in developing a future NANOG session. I'd be particularly interested in any issues that also identify key individuals that might still be around and interested in participating in a retrospective. I already have someone that is willing to talk about AS 7007, which shouldn't be hard to guess who.
Thanks in advance for your suggestions,
John
On Tue, Feb 16, 2021 at 01:37:35PM -0600, John Kristoff wrote:
Which examples would make up your top three?
Morris worm, November 1988. Much confusion and eventually the realization that John Brunner had called it from 13 years out ("The Shockwave Rider", 1975). But sloppy coding meant it could be defeated with one line of /bin/sh. ---rsk
John Kristoff wrote:
Friends,
I'd like to start a thread about the most famous and widespread Internet operational issues, outages or implementation incompatibilities you have seen.
Well... pre-Internet, but the great Northeast fiber cut comes to mind (backhoe vs. fiber, backhoe won). Miles Fidelman -- In theory, there is no difference between theory and practice. In practice, there is. .... Yogi Berra Theory is when you know everything but nothing works. Practice is when everything works but no one knows why. In our lab, theory and practice are combined: nothing works and no one knows why. ... unknown
I remember when the big carriers de-peered with Cogent in the early 2000s. They underestimated the number of web sites being hosted by people using Cogent exclusively. Justin Wilson j2sw@j2sw.com — https://j2sw.com - All things jsw (AS209109) https://blog.j2sw.com - Podcast and Blog
On Feb 17, 2021, at 10:29 AM, Miles Fidelman <mfidelman@meetinghouse.net> wrote:
John Kristoff wrote:
Friends,
I'd like to start a thread about the most famous and widespread Internet operational issues, outages or implementation incompatibilities you have seen.
Well... pre-Internet, but the great Northeast fiber cut comes to mind (backhoe vs. fiber, backhoe won).
Miles Fidelman
-- In theory, there is no difference between theory and practice. In practice, there is. .... Yogi Berra
Theory is when you know everything but nothing works. Practice is when everything works but no one knows why. In our lab, theory and practice are combined: nothing works and no one knows why. ... unknown
Cogentco still did not peer with Google and HE over IPv6 I guess. ________________________________ From: NANOG <nanog-bounces+david=xtom.com@nanog.org> on behalf of Justin Wilson (Lists) <lists@mtin.net> Sent: Thursday, February 18, 2021 00:53 To: Miles Fidelman Cc: nanog@nanog.org Subject: Re: Famous operational issues I remember when the big carriers de-peered with Cogent in the early 2000s. The underestimated the amount of web-sites being hosted by people using cogent exclusively. Justin Wilson j2sw@j2sw.com — https://j2sw.com - All things jsw (AS209109) https://blog.j2sw.com - Podcast and Blog
On Feb 17, 2021, at 10:29 AM, Miles Fidelman <mfidelman@meetinghouse.net> wrote:
John Kristoff wrote:
Friends,
I'd like to start a thread about the most famous and widespread Internet operational issues, outages or implementation incompatibilities you have seen.
Well... pre-Internet, but the great Northeast fiber cut comes to mind (backhoe vs. fiber, backhoe won).
Miles Fidelman
-- In theory, there is no difference between theory and practice. In practice, there is. .... Yogi Berra
Theory is when you know everything but nothing works. Practice is when everything works but no one knows why. In our lab, theory and practice are combined: nothing works and no one knows why. ... unknown
The he.net side is interesting as you can see who their v4 transits are but they suppress their routes via v6, but (last I knew) lacked community support for their customers to do similar route suppression. I’m not a fan of it, but it makes the commercial discussions much easier each time those networks come by to shop services to me in a personal or professional capacity. “No, I need all the internet”. - Jared
On Feb 17, 2021, at 12:07 PM, David Guo via NANOG <nanog@nanog.org> wrote:
Cogentco still did not peer with Google and HE over IPv6 I guess.
From: NANOG <nanog-bounces+david=xtom.com@nanog.org> on behalf of Justin Wilson (Lists) <lists@mtin.net> Sent: Thursday, February 18, 2021 00:53 To: Miles Fidelman Cc: nanog@nanog.org Subject: Re: Famous operational issues
I remember when the big carriers de-peered with Cogent in the early 2000s. The underestimated the amount of web-sites being hosted by people using cogent exclusively.
Justin Wilson j2sw@j2sw.com
— https://j2sw.com - All things jsw (AS209109) https://blog.j2sw.com - Podcast and Blog
On Feb 17, 2021, at 10:29 AM, Miles Fidelman <mfidelman@meetinghouse.net> wrote:
John Kristoff wrote:
Friends,
I'd like to start a thread about the most famous and widespread Internet operational issues, outages or implementation incompatibilities you have seen.
Well... pre-Internet, but the great Northeast fiber cut comes to mind (backhoe vs. fiber, backhoe won).
Miles Fidelman
-- In theory, there is no difference between theory and practice. In practice, there is. .... Yogi Berra
Theory is when you know everything but nothing works. Practice is when everything works but no one knows why. In our lab, theory and practice are combined: nothing works and no one knows why. ... unknown
(resent - to list this time) On 16 Feb 2021, at 2:37 PM, John Kristoff <jtk@dataplane.org <mailto:jtk@dataplane.org>> wrote:
Friends,
I'd like to start a thread about the most famous and widespread Internet operational issues, outages or implementation incompatibilities you have seen.
Which examples would make up your top three?
John - I have no idea what outages were most memorable for others, but the Stanford transfer switch explosion in October 1996 resulted in much of the Internet in the Bay Area simply not being reachable for several days. At the time there were three main power grids feeding Stanford – two from PG&E and one from Stanford’s own CoGen plant – and somehow a rat crawling into one of the two 12KVA transfer switches resulted in the switch disappearing in an epic explosion that even took out a portion of the exterior wall of the building. The ensuing restoration involved lots of industry folks, GE power-on-wheel generating stations, anaconda-sized power cables, and all in all was quite the adventure. FYI, /John
Normally I reference this as an example of terrible government bureaucracy, but in this case it's also how said bureaucracy can delay operational changes. I was a contractor for one of the many branches of the DoD in charge of the network at a moderate-sized site. I'd been there about 4 months, and it was my first job with FedGov. I was sent a pair of Cisco 6509-E routers, with all supervisors and blades needed, along with a small mountain of SFPs, to replace the non-E 6509s we had installed that were still using GBICs for their downlinks. These were the distro switches for approximately half the site. Problem was, we needed 84 new SC-LC fiber jumpers to replace the SC-SC we had in place for the existing switch - GBICs to SFPs remember. We hadn't received any with the shipment. So I reached out to the project manager to ask about getting the fiber jumpers. "Oh, that should be coming from the server farm folks, since it's being installed in a server farm." Okay, that seems stupid to me, but $FedGov, who knows. I tell him we're stalled out until we get those cables - we have the routers configured and ready to go, just need the jumpers, can he get them from the server farm folks? He'll do that. It took FIFTEEN MONTHS to hash out who was going to pay for and order the fiber jumpers. Any number of times as the months dragged on, I seriously considered ordering them on Amazon Prime using my corporate card. We had them installed a week and a half after we got them. Why that long? Because we had to completely reconfigure them, and after 15 months, the urgency just wasn't there. By the way, the project ended up buying them, not the server farm team. On Tue, Feb 16, 2021 at 2:38 PM John Kristoff <jtk@dataplane.org> wrote:
Friends,
I'd like to start a thread about the most famous and widespread Internet operational issues, outages or implementation incompatibilities you have seen.
Which examples would make up your top three?
To get things started, I'd suggest the AS 7007 event is perhaps the most notorious and likely to top many lists including mine. So if that is one for you I'm asking for just two more.
I'm particularly interested in this as the first step in developing a future NANOG session. I'd be particularly interested in any issues that also identify key individuals that might still be around and interested in participating in a retrospective. I already have someone that is willing to talk about AS 7007, which shouldn't be hard to guess who.
Thanks in advance for your suggestions,
John
Do you remember the Cisco HDCI connectors? https://en.wikipedia.org/wiki/HDCI I once shipped a Cisco 4500 plus some cables to a remote data center and asked the local guys to cable them for me. With Cisco you could check the cable type and whether they were properly attached. They were not. I asked for a check and the local guy confirmed to me three times that the cables were properly plugged in. In the end I gave up and took the 3-hour drive to the datacenter to check myself. Problem was that, while the casing of the connector is asymmetrical, the pins inside are symmetrical. And the local guy was quite strong. Yes, he managed to plug in the cables 180° flipped, bending the case, but he got them in. He was quite embarrassed when I fixed the cabling problem in 10 seconds. That must have been 1995 or so.... Wolfgang
On 16. Feb 2021, at 20:37, John Kristoff <jtk@dataplane.org> wrote:
Which examples would make up your top three?
-- Wolfgang Tremmel Phone +49 69 1730902 0 | wolfgang.tremmel@de-cix.net Executive Directors: Harald A. Summa and Sebastian Seifert | Trade Registry: AG Cologne, HRB 51135 DE-CIX Management GmbH | Lindleystrasse 12 | 60314 Frankfurt am Main | Germany | www.de-cix.net
I’m embarrassed to say, I’ve done this. Ms. Lady Benjamin PD Cannon, ASCE 6x7 Networks & 6x7 Telecom, LLC CEO ben@6by7.net "The only fully end-to-end encrypted global telecommunications company in the world.” FCC License KJ6FJJ Sent from my iPhone via RFC1149.
On Feb 19, 2021, at 12:55 AM, Wolfgang Tremmel <wolfgang.tremmel@de-cix.net> wrote:
Do you remember the Cisco HDCI connectors? https://en.wikipedia.org/wiki/HDCI
I once shipped a Cisco 4500 plus some cables to a remote data center and asked the local guys to cable them for me. With Cisco you could check the cable type and if they were properly attached. They were not.
I asked for a check and the local guy confirmed me three times that the cables were properly plugged. At the end I gave up, and took the 3 hour drive to the datacenter to check myself.
Problem was that, while the casing of the connector is asymmetrical, the pins inside are symmetrical. And the local guy was quite strong.
Yes, he managed to plug in the cables 180° flipped, bending the case, but he got them in. He was quite embarrassed when I fixed the cabling problem in 10 seconds.
That must have been 1995 or so....
Wolfgang
On 16. Feb 2021, at 20:37, John Kristoff <jtk@dataplane.org> wrote:
Which examples would make up your top three?
-- Wolfgang Tremmel
Phone +49 69 1730902 0 | wolfgang.tremmel@de-cix.net Executive Directors: Harald A. Summa and Sebastian Seifert | Trade Registry: AG Cologne, HRB 51135 DE-CIX Management GmbH | Lindleystrasse 12 | 60314 Frankfurt am Main | Germany | www.de-cix.net
On 16 Feb 2021, at 20:37, John Kristoff wrote:
I'd like to start a thread about the most famous and widespread Internet operational issues, outages or implementation incompatibilities you have seen.
Which examples would make up your top three?
My absolute top one happened in 1995. Traffic engineering was not a widely used term then. A bright colleague who will remain un-named decided that he could make AS paths longer by repeating the same AS number more than once. Unfortunately the prevalent software on Cisco routers was not resilient to such trickery and reacted with a reboot. This caused an avalanche of yo-yo-ing routers. Think it through! It took some time before that offending path could be purged from the whole Internet; yes we all roughly knew the topology and the players of the BGP speaking parts of it at that time. Luckily this happened during the set-up for the Danvers IETF and co-ordination between major operators was quick because most of their routing geeks happened to be in the same room, the ‘terminal room’; remember those? Since at the time I personally had no responsibility for operations any more I went back to pulling cables and crimping RJ45s. Lessons: HW/SW mono-cultures are dangerous. Input testing is good practice at all levels of software. Operational co-ordination is key in times of crisis. Daniel
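What the colleague stumbled onto is what we now call AS-path prepending, and the reason anyone bothers (when the routers survive it) is that AS_PATH length is an early tie-breaker in best-path selection. A minimal Python sketch of just that comparison step, with arbitrary private-use ASNs; this is an illustration of the concept, not router code:

# Why prepending "works": among otherwise-equal candidates, BGP prefers
# the shortest AS_PATH, so repeating your own ASN de-prefers a path.
def shortest_as_path(paths):
    """Return the candidate path(s) with the fewest AS hops."""
    best_len = min(len(p) for p in paths)
    return [p for p in paths if len(p) == best_len]

via_a = [64500, 64510]                 # plain path via upstream A
via_b = [64500, 64520, 64520, 64520]   # via upstream B, origin prepended twice

print(shortest_as_path([via_a, via_b]))  # [[64500, 64510]]  (A wins)

The 1995 incident was the degenerate case: the prepending itself was harmless routing data, but one vendor's parser treated it as a reason to reboot.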
----- On Feb 19, 2021, at 3:07 AM, Daniel Karrenberg dfk@ripe.net wrote: Hi,
Lessons: HW/SW mono-cultures are dangerous. Input testing is good practice at all levels of software. Operational co-ordination is key in times of crisis.
Well... here is a very similar, fairly recent one. Albeit in this case the opposite is true: running one software train would have prevented an outage. Some members on this list (hi, Brian!) will recognize the story.

Group XX within $company decided to deploy EVPN. All of the backbone was running a single $vendor, but different software trains. It turns out that between an early draft, implemented in version X, and the RFC, implemented in version Y, a change was made to the NLRI format that was not backwards compatible. Version X was in use on virtually all DC egress boxes; version Y was in use on the route reflectors. The moment the first EVPN NLRI was advertised, the entire backbone melted down.

A department-wide alert was issued (at night), and people tried to log on to the VPN. Oh wait, the VPN requires a YubiKey, which requires the corp network to reach the interwebs, which is not accessible due to said issue. And, despite me complaining since the day of hire, no out-of-band network.

I didn't stay much longer after that.

Thanks,

Sabri
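For anyone who has not been bitten by this class of failure, the toy sketch below (invented field names and sizes, NOT the real EVPN wire format) shows why a non-backward-compatible NLRI change is so brutal: the sender packs one layout, the receiver expects another, and under pre-RFC 7606 error handling a malformed UPDATE means the whole session comes down.

    # Toy length-prefixed NLRI with made-up fields, just to show the failure mode.
    import struct

    def encode_old(mac: bytes) -> bytes:
        # "draft" layout: type (1 byte), length (1 byte), 6-byte MAC
        return struct.pack("!BB", 2, len(mac)) + mac

    def decode_new(blob: bytes):
        # "RFC" layout: the decoder now expects a 10-byte ESI before the MAC
        rtype, length = struct.unpack("!BB", blob[:2])
        body = blob[2:2 + length]
        if len(body) < 16:   # 10 (ESI) + 6 (MAC)
            # Pre-RFC 7606 behaviour: treat the UPDATE as fatal and reset the session
            raise ValueError("malformed NLRI -> NOTIFICATION -> session down")
        return body[:10], body[10:16]

    update = encode_old(bytes.fromhex("0050569a0001"))
    try:
        decode_new(update)
    except ValueError as err:
        print(err)           # and it fails like this on every route reflector at once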
All these stories remind me of two of my own from back in the late 90s. I worked for a regional ISP doing some network stuff (under the real engineer) and some software development.

Like a lot of ISPs in the 90s, this one started out in a rental house. Over the months and years, rooms were slowly converted to host more and more equipment as we expanded our customer base and presence in the region. If we needed a "rack", someone would go to the store and buy a 4-post metal shelf [1] or... in some cases, to the dump to see what they had. We had one that looked like an oversized filing cabinet with some sort of rails on the sides. I don't recall how the equipment was mounted, but I think it was by drilling holes into the front lip and tapping the screws in. This was the big super-important rack. It had the main router that connected lines between 5 POPs around the region, and also several connections to Portland, Oregon, about 60 miles away.

Since we were making tons of money, we decided we should update our image and install real racks in the "bedroom server room". It was decided we were going to do it with no downtime. I was on the 2-man team that stood behind and in front of the rack with 2x4s, dead-lifting the equipment as it was unscrewed and lowered onto the boards. I was on the back side of the rack. After all the equipment was unscrewed, someone came in with a sawzall and cut the filing-cabinet thing apart. The top half was removed and taken away, then we lifted up on the boards and the bottom half was slid out of the way. The new rack was brought in, bolted to the floor, and then one by one the equipment was taken off the pile we were holding up with 2x4s, brought through the back of the new rack, and mounted.

I was pleasantly surprised and very relieved when we finished moving the big router, several switches, a few servers, and a UPS unit over to the new rack with zero downtime. The entire team cheered and cracked beers. I stepped out from behind the rack...

...and snagged the power cable to the main router with my foot.

I don't recall the Cisco model number after all this time... but I do remember the excruciating 6-8 minutes it took for the damn thing to reboot, and the sight of the 7 PRI cards in our phone system almost immediately jumping from 5 channels in use to being 100% full. It's been 20 years, but I swear my arms are still sore from holding all that equipment up for ~20 minutes, and I always pick my feet up very slowly when I'm near a rack. ;)

The second story is a short one from the same time period. Our POPs consisted of the aforementioned 4-post metal shelves stacked with piles of US Robotics 56k modems [2], one on top of the other. They were wired back to some sort of serial box that was in turn connected to an ISA card stuck in a Windows NT 4 server that used RADIUS to authenticate sessions against an NT4 server back at the main office that had user accounts for all our customers. Every single modem had a wall-wart power brick, an RJ11 phone line, and a big old serial cable. It was an absolute rat's nest of cables. The small POP (which I think was a TuffShed in someone's yard about 50 feet from the telco building) was always 100 degrees -- even in the dead of winter.

One year we made the decision to switch to 3Com Total Control chassis with PRI cards. The cut-over was pretty seamless and immediately made shelves stacked full of hundreds of modems completely useless.
As we started disconnecting modems, with the intent of selling them for a few bucks to existing customers who wanted to upgrade or giving them to new customers to get them signed up, we found that a bunch of the stacks of modems had actually melted together due to the temps. That explained the handful of numbers in the hunt group that would just ring and ring with no answer. In the end we went from a completely packed 10x20 shed to two small 3Com TCH boxes packed with PRI cards and a handful of PRI cables, with much more normal temperatures.

I thoroughly enjoyed the "wild west" days of the internet. If Eric and Dan are reading this, thanks for everything you taught me about networking, business, hard work, and generally being a good person.

-A

[1] - https://www.amazon.com/dp/B01D54TICS/ref=redir_mobile_desktop?_encoding=UTF8&aaxitk=Pe4xuew1D1PkrRA9cq8Cdg&hsa_cr_id=5048111780901&pd_rd_plhdr=t&pd_rd_r=4d9e3b6b-3360-41e8-9901-d079ac063f03&pd_rd_w=uRxXq&pd_rd_wg=CDibq&ref_=sbx_be_s_sparkle_td_asin_0_img
[2] - https://www.usr.com/products/56k-dialup-modem/usr5686g/
On 2/16/2021 2:37 PM, John Kristoff wrote:
Friends,
I'd like to start a thread about the most famous and widespread Internet operational issues, outages or implementation incompatibilities you have seen.
Which examples would make up your top three?
I don't believe I've seen this in any of the replies, but the AT&T cascading switch crashes of 1990 are a good one. This link even has some pseudocode: https://users.csc.calpoly.edu/~jdalbey/SWE/Papers/att_collapse
At a previous company we had a large number of Foundry Networks layer-3 switches. They participated in our OSPF network and had a *really* annoying bug. Every now and then one of them would get somewhat confused and would corrupt its OSPF database (there seemed to be some pointer that would end up off by one). It would then cleverly realize that its LSDB was different to everyone else's and so would flood this corrupt database to all other OSPF speakers. Some vendors would do a better job of sanity-checking the LSAs and would ignore the bad ones, other vendors would install them... and now you have different link-state databases on different devices, and OSPF becomes unhappy.

Nov 24 22:23:53.633 EST: %OSPF-4-BADLSAMASK: Bad LSA mask: Type 5, LSID 0.9.32.5 Mask 10.160.8.0 from 10.178.255.252 NOTE: This route will not be installed in the routing table.
Nov 26 11:01:32.997 EST: %OSPF-4-BADLSAMASK: Bad LSA mask: Type 5, LSID 0.4.2.3 Mask 10.2.153.0 from 10.178.255.252 NOTE: This route will not be installed in the routing table.
Nov 27 23:14:00.660 EST: %OSPF-4-BADLSAMASK: Bad LSA mask: Type 5, LSID 0.4.2.3 Mask 10.2.153.0 from 10.178.255.252 NOTE: This route will not be installed in the routing table.

If you look at the output, you can see that there is some garbage in the LSID field and the bit that should be there is now in the Mask section. I also saw some more extreme versions of the same bug; in my favorite example the mask was 115.104.111.119 and further down there was 105.110.116.114 -- if you take these as decimal numbers and look up their ASCII values you get "show" and "inte" -- I wrote a tool to scrape bits from these errors and ended up with a large amount of the CLI help text.

Many years ago I worked for a small Mom-and-Pop type ISP in New York state (I was the only network / technical person there) -- it was a very free-wheeling place and I built the network by doing whatever made sense at the time. One of my "favorite" customers (Joe somebody) was somehow related to the owner of the ISP and was a gamer. This was back in the day when the gaming magazines would give you useful tips like "Type 'tracert $gameserver' and make sure that there are less than N hops". Joe would call up tech support, me, the owner, etc. and complain that there were N+3 hops and most of them were in our network. I spent much time explaining things about packet loss, latency, etc. but couldn't shake his belief that hop count was the only metric that mattered.

Finally, one night he called me at home well after midnight (no, I didn't give him my home phone number, he looked me up in the phonebook!) to complain that his gaming was suffering because it was "too many hops to get out of your network". I finally snapped and built a static GRE tunnel from the RAS box that he connected to all over the network -- it was a thing of beauty, it went through almost every device that we owned and took the most convoluted path I could come up with. "Yay!", I figured, "now I can demonstrate that latency is more important than hop count", and I went to bed.

The next morning I get a call from him. He is ecstatic and wildly impressed by how well the network is working for him now and how great his gaming performance is. "Oh well", I think, "at least he is happy and will leave me alone now". I don't document the purpose of this GRE anywhere and after some time forget about it.
A few months later I am doing some routine cleanup work and stumble across a weird-looking tunnel -- it's bizarre, it goes all over the place and is all kinds of crufty -- there are static routes and policy routing and bizarre things being done on the RADIUS server to make sure some user always gets a certain IP... I look in my pile of notes and old configs and then decide to just yank it out.

That night I get an enraged call (at home again) from Joe *screaming* that the network is all broken again because it is now way too many hops to get out of the network and that people keep shooting him...

*What I learnt from this:*
1: Make sure you document everything (and no, the network isn't documentation).
2: Gamers are weird.
3: Making changes to your network in anger provides short-term pleasure but long-term pain.
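The ASCII trick described above is easy to reproduce: treat each octet of the dotted-quad that landed in the corrupted field as a character code. A few lines of Python (a sketch, not the scraper actually written back then):

    # Decode the dotted-quads from the corrupted LSA fields as ASCII.
    def octets_to_ascii(dotted: str) -> str:
        return "".join(chr(int(octet)) for octet in dotted.split("."))

    for field in ("115.104.111.119", "105.110.116.114"):
        print(field, "->", octets_to_ascii(field))   # 'show', plus more CLI-looking text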
-- The computing scientist’s main challenge is not to get confused by the complexities of his own making. -- E. W. Dijkstra
Not a famous operational issue, but in 2000 we had a major outage of our dialup modem pool.

The owner of the building was re-skinning the outside using Styrofoam and stucco. A bunch of the Styrofoam had blocked the roof drains on the podium section of the building, immediately above our equipment room. A flash rainstorm filled the entire flat roof, and water came back in over the flashings and poured directly into our dialup modem pool through the hole in the concrete roof deck where the drain pipe protruded through. In retrospect, it was a monumentally stupid place to put our main modem pool, but we didn't realize what was above the drop ceiling - and that it was roof, not the other 11 floors of the building.

One bay of 6 shelves of USR TC 1000 HiperDSPs were now very wet and blinking funny patterns on their LEDs. Fortunately, our vendor in Toronto (a 4-hour drive away) had stock of equipment that another customer kept delaying shipment on. They got their staff in and started un-boxing and slotting cards. We spent a few hours tearing out the old gear and getting ready for replacements.

We left Windsor, Ontario at around 12:00am - the same time they left Toronto, heading towards us. We coordinated a meet at one of the rural exits along Highway 401, at a closed gas station, at around 2am. Everything was going so well until a cop pulled up and asked us what we were doing, as we were slinging modem chassis between the back of the vendor's SUV and our van...

We calmly explained what happened. He looked between us a couple of times, shook his head and said "well, good luck with that", got back in his car and drove away.

We had everything back online within 14 hours of the initial outage.
--
Clayton Zekelman
Managed Network Systems Inc. (MNSi)
3363 Tecumseh Rd. E
Windsor, Ontario
N8W 1H4
tel. 519-985-8410
fax. 519-985-8409
Beyond the widespread outages, I have so many personal war stories that it's hard to pick a favorite.

My first job out of college in the mid-late 90s was at an ISP in Pittsburgh that I joined pretty early in its existence, and everyone did a bit of everything. I was hired to do sysadmin stuff, networking, pretty much whatever was needed. About a year after I started, we brought up a new mail system with an external RAID enclosure for the mail store itself. One day, we saw indications that one of the disks in the RAID enclosure was starting to fail, so I scheduled a maintenance window to replace the disk and let the controller rebuild the data and integrate it back into the RAID set. No big worries, right?

It's Tuesday at about 2 AM.

Well, the kernel on the RAID controller itself decided that when I pulled the failing drive would be a fine time to panic, and more or less turned itself into a bit-blender, taking the whole mailstore down with it. After a few hours of watching fsck make no progress on anything, in terms of trying to un-fsck the mailstore, we made the decision in consultation with the CEO to pull the plug on trying to bring the old RAID enclosure back to life, and to focus on finding suitable replacement hardware and rebuilding from scratch. We also discovered that the most recent backups of the mailstore were over a month old :(

I think our CEO ended up driving several hours to procure a suitable enclosure. By the time we got the enclosure installed, filesystems built, whatever tape backups we had restored, and the integrity of the system tested, it was Thursday around 8 AM. Coincidentally, that was the same day the company hosted a big VIP gathering (the mayor was there, along with lots of investors and other bigwigs), so I had to come back and put on a suit to hobnob with the VIPs after getting a total of 6 hours of sleep in about the previous 3 days. I still don't know how I got home that night without wrapping my vehicle around a utility pole (due to being over-tired, not due to alcohol).

Many painful lessons were learned over that stretch of days, as is often the case as a company grows out of startup mode and builds more robust technology and business processes as a consequence of growth.

jms
That brings back memories... I had a similar experience. First month on the job, a large Sun RAID array storing ~5k mailboxes died in the middle of the afternoon. So I start troubleshooting and determine it's most likely a bad disk.

The CEO walked into the server room right about the time I had 20 disks laid out on a table. He had a fit and called the desktop support guy to come and 'show me how to fix a pc'. Never mind the fact that we had a 90%-ready-to-go replacement box sitting at another site, and just needed to either go get it or bring the disks to it...

So we sat there until the desktop guy, who was 30 minutes away, got there. He took one look at it, said 'never touched that thing before, looks like he knows what he's doing', and pointed to me. 4 hours later we were driving the new server to the data center, strapped down in the back of a pickup.

Fun times.
Oh, dear. RAID.... that triggered 2 stories.

1: I worked at a small ISP in Westchester, NY. One day I'm doing stuff and want to kill process 1742, so I type 'kill -9 1' ... and then, before pressing enter, I get distracted by our "Cisco AGS+ monitor" (a separate story). After I get back to my desk I unlock my terminal and call over a friend to show just how close I'd gotten to making something go Boom. He says "Nah, BSD is cleverer than that. I'm sure the kill command has some check in it to stop you killing init." I disagree. He disagrees. I disagree again. He calls me stupid. I bet him a soda. He proves his point by typing 'su; kill -9 1' in the window he's logged into -- and our primary NFS server (with all of the user sites) obediently kills off init, and all of the child processes....

We run over to the front of the box and hit the power switch, while desperately looking for a monitor and keyboard to watch it boot. It does the BIOS checks, and then stops on the RAID controller, complaining about the fact that there are *2* dead drives and that the array is now sad..... This makes no sense. I can understand one drive not recovering from a power outage, but 2 seems a bit unlikely, especially because the machine hadn't been beeping or anything like that.... We try turning it off and on again a few times, no change...

We pull the machine out of the rack and rip the cover off. Sure enough, there is a RAID card - but the piezo-buzzer on it is, for some reason, wrapped in a bunch of napkins, held in place with electrical tape. I pull that off, and there is also some paper towel jammed into the hole in the buzzer, and bits of a broken pencil.... After replacing the drives and starting an rsync restore from a backup server, we investigate more....

... It turns out that a few months ago(!) the machine had started beeping. The night crew naturally found this annoying, and so they'd gone investigating and discovered that it was this machine, and lifted the lid while it was still in the rack. They traced the annoying noise to this small black thingie, and poked it until it stopped, thus solving the problem once and for all.... yay!

2: I used to work at a company which was in one of the buildings next to the twin towers. For various clever reasons, they had their "datacenter" in a corner of the office space... Anyway, the planes hit, power goes out and the building is evacuated - luckily no one is injured, but the entire company/site is down. After a few weeks, my friend Joe is able to arrange with a fire marshal to get access to the building so he can go and grab the disks with all the data. The fire marshal and Joe trudge up the 15 flights of stairs....

When they reach the suite, Joe discovers that the windows where his desk was are blown in, there is debris everywhere, etc. He's somewhat shaken by all this, but goes over to the datacenter area, pulls the drives out of the Sun storage arrays, and puts them in his backpack. They then trudge down the 15 flights of stairs, and Joe takes them home. We've managed to scrounge up 3 identical (empty) arrays and some servers, and the plan is to temporarily run the service from his basement...

Anyway, I get a panicked call from Joe. He's got the empty RAID arrays. He's got the servers. He's got a pile of 42 drives (3 enclosures, 14 drives per enclosure). Unfortunately he completely didn't think to mark the order of the drives, and now we have *no* idea which drive goes in which array, nor in which slot in the array....
We spent some time trying to figure out how many ways you can arrange 42 things into 3 piles, and how long it would take to try all combinations.... I cannot remember the actual number, but it approached the lifetime of the universe.... After much time and poking, we eventually worked out that the RAID controller wrote a slot number at sector 0 on each physical drive, and it became a solvable problem, but... W
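The back-of-the-envelope on that is fun: 42 distinct drives into 42 specific slots (3 arrays x 14 slots) is 42! orderings, around 1.4 x 10^51. A quick sketch, where the one-try-per-second rate is a made-up (and very generous) assumption:

    import math

    orderings = math.factorial(42)                 # ~1.4e51
    tries_per_second = 1                           # assumption: one re-assembly attempt per second
    years = orderings / (tries_per_second * 365 * 24 * 3600)
    print(f"{orderings:.2e} orderings, about {years:.1e} years of guessing")
    # The universe is roughly 1.4e10 years old, so brute force was never an option;
    # finding the slot number the controller wrote at sector 0 was the only way out.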
-- The computing scientist’s main challenge is not to get confused by the complexities of his own making. -- E. W. Dijkstra
maybe late '60s or so, we had a few 2314 dasd monsters[0]. think maybe 4m x 2m with 9 drives with removable disk packs. a grave shift operator gets errors on a drive and wonders if maybe they swap it into another spindle. no luck, so swapped those two drives with two others. one more iteration, and they had wiped out the entire array. at that point they called me; so i missed the really creative part. [0] https://www.ibm.com/ibm/history/exhibits/storage/storage_2314.html randy --- randy@psg.com `gpg --locate-external-keys --auto-key-locate wkd randy@psg.com` signatures are back, thanks to dmarc header mangling
On Tue, 23 Feb 2021 20:46:38 -0800, Randy Bush said:
maybe late '60s or so, we had a few 2314 dasd monsters[0]. think maybe 4m x 2m with 9 drives with removable disk packs.
a grave shift operator gets errors on a drive and wonders if maybe they swap it into another spindle. no luck, so swapped those two drives with two others. one more iteration, and they had wiped out the entire array. at that point they called me; so i missed the really creative part.
I suspect every S/360 site that had 2314's had an operator who did that, as I was witness to the same thing. For at least a decade after that debacle, the Manager of Operations was awarding Gold, Silver, and Bronze Danny awards for operational screw-ups. (The 2314 event was the sole Platinum Danny :)

And yes, IBM 4341 consoles were all too easy to hit the EPO button on the keyboard; we got guards for the consoles after one of our operators nailed the button a second time in a month.

And to tie the S/360 and 4341 together - we were one of the last sites that was still running an S/360 Mod 65J. And plans came through for a new server room on the top floor of a new building. Architect comes through, measures the S/360 and all the peripherals for floorspace and power/cooling - and the CPU, plus *4* meg of memory, and 3 strings of 2314 drives chewed a lot of both. Construction starts. Meanwhile, IBM announces the 4341, and offers us a real sweetheart deal because even at the high maintenance charges we were paying, IBM was losing money. Something insane like the system and peripherals and first 3 years of maintenance, for less than the old system per-year maintenance. Oh, and the power requirements are like 10% of the 360s.

So we take delivery of the new system and it's looking pitiful, just one box and 2 small strings of disk in 10K square feet. Lots of empty space. Do all the migrations to the new system over the summer, and life is good. Until fall and winter arrive, and we discover there is zero heat in the room, and the ceiling is uninsulated, and it's below zero outside because this is way upstate NY. And if there was a 360 in the room, it would *still* be needing cooling rather than heating. But it's a 4341 that's shedding only 10% of the heat...

Finally, one February morning, the 4341 throws a thermal check. Air was too cold at the intakes. Our IBM CE did a double-take because he'd been doing IBM mainframes for 3 decades and had never seen a thermal check for too cold before.

Lots of legal action threatened against the architect, who simply said "If you had *told* me that the system was being replaced, I'd have put heat in the room". A settlement was reached, revised plans were drawn up, there was a whole mess of construction to get ductwork and insulation and other stuff into place, and life was good for the decade or so before I left for a better gig....
anyone else have the privilege of running 2321 data cells? had a bunch. unreliable as hell. there was a job running continuously recovering transactions off of log tapes. one night at 3am, head of apps program (i was systems) got a call that a tran tape was unmounted with a console message that recovery was complete. ops did not know what it meant or what to do. was the first time in over five years the data were stable. wife of same head of apps grew more and more tired of 2am calls. finally she answered one "david? he said he was going in to work." ops never called in the night again. randy --- randy@psg.com `gpg --locate-external-keys --auto-key-locate wkd randy@psg.com` signatures are back, thanks to dmarc header mangling
Hardly famous and not service-affecting in the end, but figured I'd share an incident from our side that occurred back in 2018.

While commissioning a new node in our Metro-E network, an IPv6 point-to-point address was mis-typed. Instead of ending in /126, it ended in /12. This happened in Johannesburg.

We actually came across this by chance while examining the IGP table of another router located in Slough, and found an entry for 2c00::/12 floating around. That definitely looked out of place, as we never carry parent blocks in our IGP.

Running the trace from Slough led us back to this one Metro-E device in Jo'burg.

It took everyone nearly an hour to figure out the typo, because for all the laser focus we had on the supposed link of the supposed box that was creating this problem, we all overlooked the fact that the /12 configured on the point-to-point link was actually supposed to have been a /126.

The reason this never caused a service problem was because we do not redistribute our IGP into BGP (not that anyone should). And even if we did, there are a ton of filters and BGP communities on all devices to ensure a route such as that would have never made it out of our AS.

Also, the IGP contains the most specific paths to every node in our network, so the presence of the 2c00::/12 was mostly cosmetic. It would have never been used for routing decisions.

Mark.
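To put the typo in perspective, Python's ipaddress module shows the difference nicely; the interface address below is made up, and only the prefix lengths mirror the incident:

    import ipaddress

    intended   = ipaddress.ip_interface("2c0f:feb0:1::1/126")
    fatfinger  = ipaddress.ip_interface("2c0f:feb0:1::1/12")

    print(intended.network, intended.network.num_addresses)    # a 4-address point-to-point
    print(fatfinger.network, fatfinger.network.num_addresses)  # 2c00::/12 -> 2**116 addresses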
I only just now found this thread, so I'm sorry I'm late to the party, but here, I put it on Medium. https://gushi.medium.com/the-worst-day-ever-at-my-day-job-beff7f4170aa
What a day.. hope you are better now :)
opening the link currently gives me an HTTP 500 error, very fitting :)
Anyone remember when DEC delivered a new VMS version (V5, I think) whose backups didn't work, couldn't be restored? BU did, the hard way, when the engineering dept's faculty and student disk failed. DEC actually paid thousands of dollars for typist services to come and re-enter whatever was on paper and could be re-entered.

I think that was the day I won the Unix vs VMS wars at BU anyhow.

--
-Barry Shein

Software Tool & Die | bzs@TheWorld.com | http://www.TheWorld.com
Purveyors to the Trade | Voice: +1 617-STD-WRLD | 800-THE-WRLD
The World: Since 1989 | A Public Information Utility | *oo*
An interesting sub-thread to this could be: have you ever unintentionally crashed a device by running a perfectly innocuous command?

1. Crashed a 6500/Sup2 by typing "show ip dhcp binding".
2. "clear interface XXX" on a Nexus 7K triggered a cascading/undocumented Sev1 bug that caused two linecards to crash and reload, and took down about two dozen buildings on campus at the .edu where I used to work.
3. For those that ever had the misfortune of using early versions of the "bcc" command shell* on Bay Networks routers, which was intended to make the CLI look and feel more like a Cisco router, you have my condolences. One would reasonably expect "delete ?" to respond with a list of valid arguments for that command. Instead, it deleted, well... everything, and prompted an on-site restore/reboot.

* BCC originally stood for "Bay Command Console", but we joked that it really stood for "Blatant Cisco Clone".
I would be more interested in seeing someone who HASN'T crashed a Cisco 6500/7600, particularly one with a long uptime, by typing in a supposedly harmless 'show' command.
On 2/23/2021 12:22 PM, Justin Streiner wrote:
An interesting sub-thread to this could be: Have you ever unintentionally crashed a device by running a perfectly innocuous command?
There was that time in the late 1990s when I took most of a global network down several times by typing "show ip bgp regexp <regex here>" on almost all of the core routers. It turned out to be a Cisco bug. I looked for a reference, but cannot find one.

Ahh, the earlier days of the commercial internet... gotta love 'em.

scott
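Since nobody seems to have a reference for the actual bug, purely as a generic illustration (not a claim about what IOS was doing): a backtracking regex engine plus an unlucky pattern is one way a harmless-looking 'show ... regexp' can eat a control plane. A small Python timing sketch:

    # A pathological pattern against an almost-matching string: the match fails,
    # but only after the engine has tried every way of splitting up the digits.
    import re
    import time

    pattern = re.compile(r"^([0-9]+)+$")      # nested quantifiers: the classic trap
    for n in (12, 14, 16, 18, 20):
        s = "1" * n + "x"                     # the trailing 'x' forces full backtracking
        start = time.perf_counter()
        pattern.match(s)
        print(n, round(time.perf_counter() - start, 4), "seconds")
    # Runtime roughly doubles per added character; stretch that across a few hundred
    # thousand AS paths and the CPU has better things it should be doing.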
I personally did "disable vlan Xyz" instead of "delete vlan Xyz" on an Extreme Networks box... which proceeded to disable all the ports where the VLAN was present... Good thing it was a (local) remote POP and not the core.

-----
Alain Hebert                ahebert@pubnix.net
PubNIX Inc.
50 boul. St-Charles
P.O. Box 26770 Beaconsfield, Quebec H9W 6G7
Tel: 514-990-5911  http://www.pubnix.net  Fax: 514-990-9443
My war story.

At one of our major POPs in DC we had a row of 7513's, and one of them had intermittent problems. I had replaced every piece of removable card/part in it over time, and it kept failing. Even the vendor flew a team in to the site to try to figure out what was wrong. It was finally decided to replace the whole router (about 200 lbs?). Being the local field tech, that was my job.

On the night of the maintenance, at 3am, the work started. I switched off the rack power, which included a 2511 terminal server that was connected to half the routers in the row, and started to remove the router. A few minutes later I got a text: "You're taking out the wrong router!" You can imagine the "Damn it, what have I done?" feeling that runs through your mind and the way your heart stops for a moment.

Okay, I wasn't taking out the wrong router. But unknown at the time, terminal servers, when turned off, had a nasty habit of sending a break to all the routers they were connected to, and all those routers effectively stopped. The remote engineer who was in charge saw the whole POP go red and assumed I was the cause. I was, but not because of anything I could have known about. I had to power cycle the downed routers to bring them back online, and then continue with the maintenance. A disaster to all involved, but the router got replaced.

I gave a very detailed account of my actions in the postmortem. It was clear they were convinced I had turned off the wrong rack/router and wasn't being honest about it. I was adamant that I had done exactly what I said, and even swore I would fess up if I had erred, and always would, even if it cost me the job. I rarely made mistakes, if any, so it was an easy thing for me to say. For the next two weeks everyone who was aware of the work gave me the side eye.

About a week after that, the same thing happened to another field tech in another state. That helped my case. They used my account to figure out it was the terminal server that caused the problem. A few of them that had questioned me harshly admitted to me that my account helped them figure out the cause.

And the worst part of this story? That router, completely replaced, still had the same intermittent problem as before. It was a DC-powered POP, so they were all wired with the same clean DC power. In the end they chalked it up to cosmic rays and gave up on it. I believe this break issue was unique to the DC-powered 2511's, and that we were the first to use them, but I might be wrong on that.
While we're talking about raid types...

A few acquisitions ago, between 2006 and 2010, I worked at a wireless ISP in northern Indiana. Our CEO decided to sell Internet service to school systems because the e-rate funding was too much to resist. He had the idea to install towers on the schools and sell service off of them while paying the schools for roof rights.

About two years into the endeavor, I wake up one morning and walk to my car. Two FBI agents get out of an unmarked towncar. About an hour later, they let me go to the office, where I found an entire barrage of FBI agents. It was a full raid, and not the kind you want to see. Hard drives were involved and being made redundant, but the redundant copies were labeled and placed into boxes that were carried out to SUVs as dark as the morning coffee these guys drank. There were a lot of drives; all of our servers were in our server room at the office, in roughly five or six racks with varying amounts of equipment in each. After some questioning and assisting them in their cataloging adventure, the agents left us with a ton of questions and just enough equipment to keep the customers connected.

The CEO became extremely paranoid at this point. He told us to prepare to move the servers to a different building. He went into a tailspin trying to figure out where he could hide the servers to keep things going without the bank or the FBI seizing the assets. He was extremely worried the bank would close the office down. We started moving all network routing around to avoid using the office as our primary DIA.

One morning I get into the office and we hear the words we've been dreading: "We're moving the servers". The plan was to move them to a tower site that had a decent-sized shack on site. Connectivity was decent; we had a licensed 11GHz microwave backhaul capable of about 155mbps. The site was part of the old MCI microwave long-distance network in the 80s and 90s. It had redundant air conditioners, a large propane tank, and a generator capable of keeping the site alive for about three days. We were told not to notify any customers, which became problematic because two customers had servers colocated in our building.

We consolidated the servers into three racks and managed to get things prepared with a decent UPS in each rack. The CEO decided to move the servers at nightfall to "avoid suspicion". Our office was in an unsavory part of town; moving anything at night was suspicious. So, under the cover of half-ass darkness, we loaded the racks onto a flatbed truck and drove them 20 minutes to the tower. While we unloaded the racks, an electrician we knew was wiring up the L5-20 outlets for the UPS in each rack. We got the racks plugged in, the servers powered up, and then the two customers with colocated equipment came out. They got their equipment powered up and all seemed OK.

Back at the office the next day, we were told to gather our workstations and start working from home. I've been working from home ever since and quite enjoy it, but that's beside the point.

Summer starts and I tell the CEO we need to repair the AC units because they are failing. He ignores it, claiming he doesn't want to lose money the bank could take at any minute. About a month later a nice hot summer day rolls in and the AC units both die. I stumble upon an old portable AC unit and put that at the site. Temperatures rise to 140F ambient. Server overheat alarms start going off, things start failing. Our colocation customers are extremely upset.
They pull their servers and drop service. The heat subsides, and the CEO finally pays to repair one of the AC units.

Eventually the company declares bankruptcy and goes into liquidation. Luckily another WISP catches wind of it, buys the customers and assets, and hires me. My happiest day that year was moving all the servers into a better-suited home, a real data center. I don't know what happened to the CEO, but I know that I'll never trust anything he has his hands in ever again.

Adam Kennedy
Systems Engineer
adamkennedy@watchcomm.net | 800-589-3837 x120
Watch Communications | www.watchcomm.net
3225 W Elm St, Suite A
Lima, OH 45805
participants (65)

- Aaron C. de Bruyn
- Adam Kennedy
- Alain Hebert
- Andrew Gallo
- Andrey Kostin
- Andy Ringsmuth
- Ben Cannon
- Bruce H McIntosh
- brutal8z
- bzs@theworld.com
- Christopher Morrow
- Clayton Zekelman
- Compton, Rich A
- Damian Menscher
- Dan Mahoney
- Daniel Karrenberg
- David Guo
- Eric Kuhnke
- Erik Sundberg
- George Herbert
- George Metz
- Giuseppe De Luca
- Henry Yen
- Jared Mauch
- Jen Linkova
- Jethro R Binks
- Job Snijders
- Joe
- John Curran
- John Kristoff
- Jon Lewis
- Justin Streiner
- Justin Wilson (Lists)
- Jörg Kost
- Karl Auer
- Mark Andrews
- Mark Tinka
- Mikael Abrahamsson
- Miles Fidelman
- Owen DeLong
- Patrick Schultz
- Patrick W. Gilmore
- Paul Ebersman
- Pierre Emeriaud
- Randy Bush
- Ray Bellis
- Rich Kulawiec
- Richard Golodner
- Rogier van Eeten
- Sabri Berisha
- scott
- Sean Donelan
- Seth Mattinen
- Shawn L
- Simon Lockhart
- sronan@ronan-online.com
- Suresh Ramasubramanian
- tim@pelican.org
- Todd Underwood
- Tom Hill
- Tony Finch
- Tony Wicks
- Valdis Klētnieks
- Warren Kumari
- Wolfgang Tremmel