On behalf of cisco-nsp and outages - we salute you.
-Hank

On 29/02/2024 17:21, Jared Mauch wrote:
On Feb 28, 2024, at 1:30 AM, Daniel Marks via NANOG <nanog@nanog.org> wrote:
We’re getting rocked by storms here in Michigan, could be related.
[Brief version of what happened, from what I can tell reconstructing things]
I was alerted ~4am US/E yesterday about the issue. This machine has been generously hosted by my previous employer for quite some time; funnily enough, it was 7 years ago almost to the day that I started my current employment.
The IPMI was not responsive, and the machine was located in 350 Cermak, on a floor that was not impacted by the heat/cold event.
I have been meaning to move things off and on, but never quite had the motivation to tackle the task. Yesterday forced my hand.
Once I confirmed that we could get the machine out of the colocation facility (thank you again, NTT), I drove from Michigan to Chicago, got lunch, picked up the machine, and headed back to the colocation space that I have in Michigan at the 123Net/DetroitIX site.
Once I had a console on it, I determined that this old machine had a few things that had been gradually updated and upgraded over time, and not all the filesystem options were set correctly. After some tune2fs options were set and fstab was updated to ensure everything was fully migrated from ext2 -> ext4, the system booted without issues.
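For anyone curious about the mechanics, the usual in-place upgrade sequence looks roughly like this; the device and mount point are placeholders, and the filesystem must be unmounted (or the work done from a rescue environment) first:

  # all names below are illustrative
  umount /mnt/data
  # turn on the ext4 on-disk features (ext2 also needs a journal added)
  tune2fs -O extent,uninit_bg,dir_index,has_journal /dev/sdX1
  # a forced fsck is mandatory after changing features
  e2fsck -fD /dev/sdX1
  # finally, change the fstab entry's filesystem type from ext2 to ext4

Note that existing files keep their old indirect-block mapping; only newly written files get extents.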
Afterwards I determined that there is still a hardware-related problem, so I am going to move it to new hardware later today, schedule permitting, as I want to go onsite and make sure that the I/O is performant.
Feb 28 22:09:05 kernel: Memory: 32816872K/33544380K available (20480K kernel code, 3276K rwdata, 14748K rodata, 4588K init, 4892K bss, 727248K reserved, 0K cma-reserved)
Feb 29 00:20:07 kernel: Memory: 16326408K/16767164K available (20480K kernel code, 3276K rwdata, 14748K rodata, 4588K init, 4892K bss, 440496K reserved, 0K cma-reserved)
Not a great thing when nobody is onsite, the machine requires a power cycle, and the amount of memory changes between boots.
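For what it's worth, a quick way to spot a dead or half-recognized DIMM without going onsite is to compare the firmware's inventory against what the kernel actually sees, e.g.:

  # per-slot sizes as reported by SMBIOS/DMI
  dmidecode -t memory | grep -E 'Size|Locator'
  # what the kernel ended up with
  grep MemTotal /proc/meminfo

A slot reporting "No Module Installed" where a DIMM should be, or a halved MemTotal, points at hardware rather than the OS.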
If you are seeing any other issues, do let me know. I did move the IPv4 space with me but have renumbered for v6, so if you use my free secondary DNS service with your own vanity name, you will need to update your AAAA records.
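If you're not sure whether your vanity name is stale, compare it against the current address (the vanity name below is an example only):

  # what puck currently answers on
  dig +short AAAA puck.nether.net
  # what your vanity name points at
  dig +short AAAA ns2.your-vanity-domain.example

If the two differ, update the AAAA record in your zone to match the first answer.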
If you are seeing any reachability issues, let me know; there should be ROAs and other objects in place for everything.
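If a path looks filtered, one quick sanity check is to ask the IRR for the covering route/route6 object (the prefix below is a placeholder, not the actual space):

  whois -h whois.radb.net 2001:db8::/32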
Sorry everyone got this email; I feel bad, it's like when Warren asked the list for some personal details :-)
- Jared
(Even more details: changing disk images from qcow -> qcow2, and other things like ext2 -> ext3/4 over all the years as the machine has gone from Linux -> FreeBSD -> Linux again, is always a fun way to keep bringing your legacy around with you. It's good overall.)
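For reference, the image conversion itself is a one-liner with qemu-img; filenames here are illustrative:

  # convert the legacy qcow image to qcow2 (-p shows progress)
  qemu-img convert -p -f qcow -O qcow2 old-disk.qcow new-disk.qcow2
  # sanity-check the result before pointing the VM at it
  qemu-img info new-disk.qcow2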
On Feb 29, 2024, at 10:56 AM, Jay Acuna <mysidia@gmail.com> wrote:
On Thu, Feb 29, 2024 at 9:22 AM Jared Mauch <jared@puck.nether.net> wrote:
Apparently some of the most important email lists, Outages, etc., are being kept online by one person's Unix/Linux server.
There are other people who have access, etc., but when it comes to hardware this is quite old; the last substantive refresh was in 2011, and it's served its purpose well.

Obligatory xkcd: https://xkcd.com/2347/
Yeah, that's cool. It reminds me of the good old internet from the '90s and early 2000s. Anyway, if that list is so important, maybe it's time to run it with 1+N redundancy (master-slave topology)? It's all MTAs, so it's pretty easy: all you need is to sync data from the master to the slaves via push (best, because it's nearly instant). Slave down? Nothing really happens. Master down? The next slave takes over and brings the master back online, or one of the slaves is nominated as the new master.
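A minimal sketch of that push idea, assuming a Mailman-style setup with its state under /var/lib/mailman and a warm standby host (all names here are illustrative): inbound SMTP failover comes for free from MX preferences, so only the list state needs to be pushed across.

  ; in the list domain's zone: the lower preference is tried first
  lists.example.net.  IN  MX  10  master.example.net.
  lists.example.net.  IN  MX  20  standby.example.net.

  # on the master, push list state to the standby after changes
  rsync -az --delete /var/lib/mailman/ standby.example.net:/var/lib/mailman/

The standby still needs the list software installed and configured but idle; promotion then amounts to letting it start processing the mail that the MX fallback is already delivering to it while the master is down.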
If it wasn’t for how clunky they are with email sites, I’d suggest moving to a cloud somewhere. But …

-George

Sent from my iPhone
On Sat, Mar 02, 2024 at 11:55:45AM +0100, Bjørn Mork wrote:
George Herbert <george.herbert@gmail.com> writes:
If it wasn’t for how clunky they are with email sites, I’d suggest moving to a cloud somewhere. But …
I believe statistics point in favour of the single puck.nether.net host....
BTW, for anyone else taking advantage of the excellent secondary service provided by puck: You might want to update your AXFR ACLs. It seems the IPv6 address has changed.
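On the primary side, that typically means refreshing the address in the transfer ACL; the addresses below are documentation placeholders, not puck's real ones:

  // BIND named.conf on the zone's primary
  acl "puck" { 192.0.2.5; 2001:db8::5; };
  zone "example.com" {
      type master;
      file "example.com.zone";
      allow-transfer { "puck"; };
      also-notify { 192.0.2.5; 2001:db8::5; };
  };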
I must admit that such transfer failures go unnoticed due to the large volume of unwanted requests, so I appreciate the extra effort of sending an email warning when a zone is disabled.
Yes, I'm notifying people now and have updated the FAQ/docs page. I also said there that I would notify people if the geography of the machine changed, and it has. I still need to get my upstreams to notify their upstreams to permit packets, as there's one provider that does uRPF in the mix, so I have blocked their routes for now.
Thanks for running all these high quality services!
It's the sustained community efforts that have allowed technology to improve to the point where auto-updates and many other things just work without trouble. Sadly, I had to do a bit of physical moving of things, but the machine should now have a ~10G uplink, and if I can find the right 100G device that I'm happy with, I'm in a better position to update/upgrade it now than I was a week ago.

- Jared

--
Jared Mauch | pgp key available via finger from jared@puck.nether.net
clue++; | http://puck.nether.net/~jared/ | My statements are only mine.
participants (7)

- Bjørn Mork
- borg@uu3.net
- Daniel Marks
- George Herbert
- Hank Nussbacher
- Jared Mauch
- Jay Acuna