Arista filesystem rewinding back 3 years

Hello, It must be my lucky day. I've run into quite a strange situation where what shows up in flash: isn't the same as what is actually in /mnt/flash on an Arista switch. I found out that this was an issue when I reloaded the switch and the filesystem looks like it rewound itself to 2022 in Aboot. I'm guessing that the SSD stopped taking writes sometime in 2022 and EOS never found out about it. Anyway, if anyone has any information on this feel free to send it to me offlist. I don't want to create a spamwave or anything. Thanks and have a nice day. -Drew

On Tue, Feb 18, 2025 at 12:16 PM Drew Weaver <drew.weaver@thenap.com> wrote:
I found out that this was an issue when I reloaded the switch and the filesystem looks like it rewound itself to 2022 in Aboot.
I've seen this before with MicroSD cards in a Raspberry Pi. The card stops accepting writes but continues to report write success to the OS. On the Pi, this eventually shows up as seeming filesystem corruption when blocks are flushed and then reloaded to the disk cache. Upon reboot, the Pi reverts to the state it was in when the writes actually stopped happening. I'm not really sure what the theory behind designing cards this way is. It does mean that the OS will boot even if the boot process must write to succeed, but it also means that the OS has no idea that the flash drive has failed and experiences odd random faults instead. Regards, Bill Herrin -- William Herrin bill@herrin.us https://bill.herrin.us/
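The failure described above (writes acknowledged, nothing persisted) cannot be caught by re-reading a file normally, because the read is served from the page cache. A minimal sketch of a canary check follows. It assumes a Linux host where `os.posix_fadvise` is available and where the advisory cache-drop hint is honored. The function name and the `.write-canary` file name are hypothetical and do not come from any vendor tool.

```python
import os
import secrets


def write_survives_cache_drop(directory: str) -> bool:
    """Write a random canary, flush it, evict it from the page cache, re-read it.

    Returns True if the bytes read back match what was written. Raises OSError
    if the filesystem refuses the write outright (for example, mounted ro).
    """
    token = secrets.token_bytes(4096)
    path = os.path.join(directory, ".write-canary")  # hypothetical file name
    fd = os.open(path, os.O_RDWR | os.O_CREAT | os.O_TRUNC, 0o600)
    try:
        os.write(fd, token)
        os.fsync(fd)  # ask the kernel to push the data to the device
        if hasattr(os, "posix_fadvise"):
            # Advisory hint only: ask the kernel to drop the cached pages so
            # the read below is more likely to come from the device itself.
            os.posix_fadvise(fd, 0, 0, os.POSIX_FADV_DONTNEED)
        os.lseek(fd, 0, os.SEEK_SET)
        return os.read(fd, len(token)) == token
    finally:
        os.close(fd)
        os.unlink(path)
```

Because the eviction is only a hint, a True result is weak evidence; a False result, or a canary that is missing after a reboot, is the more useful signal.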

On Tue, Feb 18, 2025 at 4:05 PM, William Herrin <bill@herrin.us> wrote:
On Tue, Feb 18, 2025 at 12:16 PM Drew Weaver <drew.weaver@thenap.com> wrote:
I found out that this was an issue when I reloaded the switch and the filesystem looks like it rewound itself to 2022 in Aboot.
I've seen this before with MicroSD cards in a Raspberry Pi. The card stops accepting writes but continues to report write success to the OS.
My favorite was a "Sony 480GB Flash Drive" which I purchased at an electronics market in Beijing in 2010, for around $5 USD. I knew that it couldn't be real, but I figured it would be an entertaining… It reported itself to the OS as having 480GB of capacity, but actually only had a 16Mb flash chip. Anything that you wrote past the end of the storage would wrap around to the start. It actually turned out to be remarkably useful - I mounted it on /var/log/syslog on a server, and magically had a circular buffer of logs which would never fill up / run out of space…
W
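The wrap-around behavior of the fake drive can be illustrated with a toy model. This is only a sketch: `WrappingDevice` and `probe_capacity` are hypothetical names, the "device" is an in-memory list, and a probe like this would be destructive on real media.

```python
class WrappingDevice:
    """Toy model: advertises `advertised` blocks but stores only `real` blocks."""

    def __init__(self, advertised: int, real: int) -> None:
        self.advertised = advertised
        self._store = [b""] * real

    def write_block(self, index: int, data: bytes) -> None:
        # Writes past the real capacity silently wrap around to the start.
        self._store[index % len(self._store)] = data

    def read_block(self, index: int) -> bytes:
        return self._store[index % len(self._store)]


def probe_capacity(dev: WrappingDevice) -> int:
    """Return the number of blocks the device really holds.

    Put a sentinel in block 0, then write each later block in turn. The first
    write that clobbers the sentinel marks the point where the device wrapped.
    """
    sentinel = b"sentinel"
    dev.write_block(0, sentinel)
    for i in range(1, dev.advertised):
        dev.write_block(i, i.to_bytes(8, "big"))
        if dev.read_block(0) != sentinel:
            return i
    return dev.advertised
```

In this model, a device that advertises 480 blocks but holds 16 is reported as 16, and an honest device is reported at its advertised size.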

William Herrin wrote:
I've seen this before with MicroSD cards in a Raspberry Pi. The card stops accepting writes but continues to report write success to the OS. On the Pi, this eventually shows up as seeming filesystem corruption when blocks are flushed and then reloaded to the disk cache. Upon reboot, the Pi reverts to the state it was in when the writes actually stopped happening.
Well, I am glad it didn't boot up using the startup-config from 2022; that would've been an actual catastrophe. Chatting with TAC about it. Yikes. -Drew

On Tuesday, February 18th, 2025 at 21:16, Drew Weaver <drew.weaver@thenap.com> wrote:
...I’ve run into quite a strange situation where what shows up in flash: isn’t the same as what is actually in /mnt/flash on an Arista switch.
I found out that this was an issue when I reloaded the switch and the filesystem looks like it rewound itself to 2022 in Aboot.
I’m guessing that the SSD stopped taking writes sometime in 2022 and EOS never found out about it.
Anyway, if anyone has any information on this feel free to send it to me offlist. I don’t want to create a spamwave or anything.
We've had several SSD failures in Arista devices. I can only assume a bad batch of SSDs, because they were all in a batch of routers ordered and delivered together. For us the SSDs dropped into RO mode. When this happens there are syslog messages to let you know, and if you drop into a BASH shell you can see the issue, but none of the EOS show commands show the problem:

show file systems -> shows all FS as rw
show system health storage -> shows "OK"

Cheers, James.
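For the read-only case, the state that is visible from a BASH shell can also be read from the kernel's mount table. The sketch below assumes a Linux-style `/proc/mounts` with whitespace-separated fields (device, mount point, filesystem type, options). The function names are hypothetical. It would not catch a device that stays mounted rw while silently discarding writes.

```python
from typing import List, Optional


def mount_options(mount_point: str, mounts_text: str) -> Optional[List[str]]:
    """Return the option list for mount_point from /proc/mounts-style text.

    Returns None if the mount point is not listed. If it appears more than
    once, the last entry wins, since later mounts shadow earlier ones.
    """
    options = None
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) >= 4 and fields[1] == mount_point:
            options = fields[3].split(",")
    return options


def is_read_only(mount_point: str, mounts_text: str) -> bool:
    """True if the kernel currently lists mount_point with the 'ro' option."""
    options = mount_options(mount_point, mounts_text)
    return options is not None and "ro" in options
```

A typical call would read the live table, for example `is_read_only("/mnt/flash", open("/proc/mounts").read())`.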

James wrote:
We've had several SSD failures in Arista devices, I can only assume a bad batch of SSDs because they were all in a batch of routers ordered and delivered together. For us the SSDs dropped into RO mode. When this happens there are syslog messages to let you know, and if you drop into a BASH shell you can see the issue, but none of the EOS show commands show the problem:
I just went through 6 years of a syslog file and nothing mentioned /dev/sda, /mnt/flash, or anything else indicating that something was wrong with the disk or the filesystem. RANCID was even showing EOS image files that weren't actually there when it backed up the configs, and the startup-config date showed 2/28/2025 until the system was reloaded; after that it said startup-config was last written on 03/11/2022. RANCID shows that there have been 57 configuration updates since 3/11/2022, and every time 'write memory' was run it said: “Copy completed successfully.” Anyway, thanks for replying. -Drew
participants (4)
- Drew Weaver
- James Bensley
- Warren Kumari
- William Herrin