
On Sun, Apr 11, 2010 at 7:07 AM, Robert E. Seastrom <rs@seastrom.com> wrote:
We've seen great increases in CPU and memory speeds as well as disk densities since the last maximum (March 2000). Speccing ECC memory is a reasonable start, but this sort of thing has been a problem in the past (anyone remember the Sun UltraSPARC CPUs that had problems last time around?) and will no doubt bite us again.
Sun's problem had an easy solution - and it's exactly the one you've mentioned - ECC. The issue with the UltraSPARC IIs was that they had enough redundancy to detect a problem (parity), but not enough to correct it (ECC). They also (initially) handled such errors very abruptly - they would basically panic and restart. From the UltraSPARC III onwards they fixed this by sticking with parity in the L1 cache (which is write-through, so on a parity error you can just dump the line and re-read it from memory or a higher-level cache), but using ECC on the L2 and higher (write-back) caches. The memory and all datapaths were already protected with ECC in everything but the low-end systems.

It does raise a very interesting question though - how many systems are you running that don't use ECC _everywhere_ (CPU, memory and datapath)? Unlike many years ago, parity memory is basically non-existent today, which means if you're not using ECC then you're probably suffering relatively regular single-bit errors without knowing it. In network devices that's less of an issue, as you can normally rely on higher-level protocols to detect/correct the errors, but if you're not using ECC in your servers then you're asking for (silent) trouble...

Scott.
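
P.S. For anyone who wants the textbook version of the detect-vs-correct difference: a parity bit over a word tells you that some bit flipped, but not which one, whereas a SECDED-style ECC code can point at the offending bit and flip it back. Here's a rough Python sketch using a Hamming(7,4) code - purely illustrative, my own toy names and layout, nothing to do with Sun's actual cache logic:

# Toy sketch: parity can only detect a single-bit flip;
# a Hamming(7,4) ECC code can locate and correct it.

def parity_check(bits):
    """Even parity: True if the word still looks consistent.
    A single-bit flip is detected, but its position is unknown."""
    return sum(bits) % 2 == 0

def hamming74_encode(d):
    """d is 4 data bits; returns a 7-bit codeword [p1, p2, d1, p3, d2, d3, d4]."""
    d1, d2, d3, d4 = d
    p1 = d1 ^ d2 ^ d4          # covers codeword positions 1, 3, 5, 7
    p2 = d1 ^ d3 ^ d4          # covers codeword positions 2, 3, 6, 7
    p3 = d2 ^ d3 ^ d4          # covers codeword positions 4, 5, 6, 7
    return [p1, p2, d1, p3, d2, d3, d4]

def hamming74_correct(cw):
    """Returns (corrected codeword, 1-based position of the flipped bit, or 0)."""
    c = list(cw)
    s1 = c[0] ^ c[2] ^ c[4] ^ c[6]     # re-check p1's coverage
    s2 = c[1] ^ c[2] ^ c[5] ^ c[6]     # re-check p2's coverage
    s3 = c[3] ^ c[4] ^ c[5] ^ c[6]     # re-check p3's coverage
    syndrome = s1 * 1 + s2 * 2 + s3 * 4  # nonzero syndrome = position of bad bit
    if syndrome:
        c[syndrome - 1] ^= 1             # flip it back
    return c, syndrome

if __name__ == "__main__":
    data = [1, 0, 1, 1]

    # Parity-protected word: data plus one even-parity bit, then one bit flips.
    word = data + [sum(data) % 2]
    word[2] ^= 1
    print("parity still consistent?", parity_check(word))   # False: detected, not locatable

    # ECC-protected word: same single-bit flip, but the syndrome pinpoints and repairs it.
    cw = hamming74_encode(data)
    cw[2] ^= 1
    fixed, pos = hamming74_correct(cw)
    print("ECC corrected bit at position", pos, "->", fixed == hamming74_encode(data))

The write-through L1 design works precisely because parity only needs to detect: once you know the line is bad, you throw it away and refetch it from the ECC-protected level below.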