
On Sun, Apr 11, 2010 at 7:07 AM, Robert E. Seastrom <rs@seastrom.com> wrote:
We've seen great increases in CPU and memory speeds as well as disk densities since the last maximum (March 2000). Speccing ECC memory is a reasonable start, but this sort of thing has been a problem in the past (anyone remember the Sun UltraSPARC CPUs that had problems last time around?) and will no doubt bite us again.
Sun's problem had an easy solution - and it's exactly the one you've mentioned - ECC. The issue with the UltraSPARC IIs was that they had enough redundancy to detect a problem (parity), but not enough to correct it (ECC). They also (initially) handled such errors very abruptly - they would basically panic and restart. From the UltraSPARC III onwards they fixed this by sticking with parity in the L1 cache (which is write-through, so on a parity error you can just dump the line and re-read it from memory or a higher-level cache), but using ECC on the L2 and higher (write-back) caches. The memory and all datapaths were already protected with ECC in everything but the low-end systems.

It does raise a very interesting question though - how many systems are you running that don't use ECC _everywhere_ (CPU, memory and datapath)? Unlike many years ago, parity memory is basically non-existent today, which means if you're not using ECC then you're probably suffering relatively regular single-bit errors without knowing it. In network devices that's less of an issue, as you can normally rely on higher-level protocols to detect/correct the errors, but if you're not using ECC in your servers then you're asking for (silent) trouble...

Scott.
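
P.S. For anyone who wants the textbook version of the detect-vs-correct difference: a parity bit over a word tells you that some bit flipped, but not which one, whereas a SECDED-style ECC code can point at the offending bit and flip it back. Here's a rough Python sketch using a Hamming(7,4) code - purely illustrative, my own toy names and layout, nothing to do with Sun's actual cache logic:

# Toy sketch: parity can only detect a single-bit flip;
# a Hamming(7,4) ECC code can locate and correct it.

def parity_check(bits):
    """Even parity: True if the word still looks consistent.
    A single-bit flip is detected, but its position is unknown."""
    return sum(bits) % 2 == 0

def hamming74_encode(d):
    """d is 4 data bits; returns a 7-bit codeword [p1, p2, d1, p3, d2, d3, d4]."""
    d1, d2, d3, d4 = d
    p1 = d1 ^ d2 ^ d4          # covers codeword positions 1, 3, 5, 7
    p2 = d1 ^ d3 ^ d4          # covers codeword positions 2, 3, 6, 7
    p3 = d2 ^ d3 ^ d4          # covers codeword positions 4, 5, 6, 7
    return [p1, p2, d1, p3, d2, d3, d4]

def hamming74_correct(cw):
    """Returns (corrected codeword, 1-based position of the flipped bit, or 0)."""
    c = list(cw)
    s1 = c[0] ^ c[2] ^ c[4] ^ c[6]     # re-check p1's coverage
    s2 = c[1] ^ c[2] ^ c[5] ^ c[6]     # re-check p2's coverage
    s3 = c[3] ^ c[4] ^ c[5] ^ c[6]     # re-check p3's coverage
    syndrome = s1 * 1 + s2 * 2 + s3 * 4  # nonzero syndrome = position of bad bit
    if syndrome:
        c[syndrome - 1] ^= 1             # flip it back
    return c, syndrome

if __name__ == "__main__":
    data = [1, 0, 1, 1]

    # Parity-protected word: data plus one even-parity bit, then one bit flips.
    word = data + [sum(data) % 2]
    word[2] ^= 1
    print("parity still consistent?", parity_check(word))   # False: detected, not locatable

    # ECC-protected word: same single-bit flip, but the syndrome pinpoints and repairs it.
    cw = hamming74_encode(data)
    cw[2] ^= 1
    fixed, pos = hamming74_correct(cw)
    print("ECC corrected bit at position", pos, "->", fixed == hamming74_encode(data))

The write-through L1 design works precisely because parity only needs to detect: once you know the line is bad, you throw it away and refetch it from the ECC-protected level below.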