On Sun, Apr 15, 2012 at 10:52:51AM -0500, Jimmy Hess wrote:
> Consider that the probability that 16GB of SDRAM experiences at least one single-bit error at sea level in a given 6 hour period is about 66%: 1 - (1 - 1.3e-12 * 6)^(16 * 2^30 * 8). In any given 24 hour period, the probability of at least one single-bit error exceeds 98%.
> Assuming the memory is good and functioning correctly, you can expect to see about four 1-bit errors per day on average. More are frequently seen.
> Now if most of this 16GB of memory is unused, you will never notice that over 30 days, 120 or so bits have been flipped from their proper value.
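
For what it's worth, the quoted arithmetic can be checked with a few lines of Python. This is only a sketch: it assumes 1.3e-12 is an error rate per bit per hour, that bit errors are independent, and that "16GB" means 16 * 2^30 bytes, as the quoted formula implies. The names below are mine, not from the original post.

```python
import math

RATE_PER_BIT_HOUR = 1.3e-12  # assumed: errors per bit per hour, from the quoted formula
BITS = 16 * 2**30 * 8        # 16 GiB of memory, expressed in bits


def p_at_least_one(hours, rate=RATE_PER_BIT_HOUR, bits=BITS):
    """Probability of at least one bit error in `hours`, assuming independent bits."""
    # 1 - (1 - rate*hours)**bits, computed via log1p/expm1 so that the tiny
    # per-bit probability is not lost to floating-point rounding.
    return -math.expm1(bits * math.log1p(-rate * hours))


def expected_errors(hours, rate=RATE_PER_BIT_HOUR, bits=BITS):
    """Expected number of bit errors in `hours` under the same assumptions."""
    return rate * hours * bits


for label, hours in (("6 hours", 6), ("24 hours", 24), ("30 days", 30 * 24)):
    print(f"{label:>8}: P(>=1 error) = {p_at_least_one(hours):.4f}, "
          f"expected errors = {expected_errors(hours):.1f}")
```

Under those assumptions this gives roughly 65.8% for 6 hours, 98.6% for 24 hours, and an expected 4.3 errors per day, or about 129 over 30 days. So the quoted figures follow from the assumed rate; the question is whether the rate itself is realistic.
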
I think that is an overestimate, at least if single-bit (corrected) ECC errors are as common as flipped bits on non-ECC ram.

First, count me in the "ECC is a must, full stop." crowd. I insist on ECC even for my customers' dedicated servers, even though most of the customers don't care that much. "It's not for you, it's for me."

With ECC, if you have EDAC/bluesmoke set up correctly on a supported motherboard, you get console spew whenever you have a single-bit error. This means I can do a very simple grep on the box that conserver logs to, and find all the failing ram modules I am responsible for. Without ECC, I have no real way of telling the difference between broken software and broken ram.

That said, I still think the 120 bits a month estimate is large; I believe that ECC ram should report correctable errors (assuming a correctly configured EDAC/bluesmoke module and supported chipset) about as often as non-ECC ram would get a bit flip.

In a past role, I did spend the time grepping through such a properly configured cluster, with tens of thousands of nodes, looking for failing hardware. I should have done a proper paper with statistics, but I did not. The vast majority of servers had zero correctable ECC errors, while a few had a lot, which is consistent with the theory that ECC errors are more often caused by bad ram than by random bit flips in good ram. (Of course, all these servers were in proper cases in a proper data center, which probably gives you a fair bit of shielding.)

On my current fleet (well under 100 servers) single-bit errors are so rare that if I get one, I schedule that machine for removal from production.
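
The "very simple grep" over console logs mentioned above could look something like the following Python sketch. Everything specific here is an assumption on my part: that each host's console log is a separate `*.log` file named after the host, and that correctable errors appear as kernel lines containing `EDAC MC<n>:` followed by the token `CE`. The exact wording varies by kernel version and EDAC driver, so the pattern will need adjusting.

```python
import re
from collections import Counter
from pathlib import Path

# Assumption: a correctable error is logged as a line containing
# "EDAC MC<n>:" with the standalone token "CE" somewhere after it.
# Check your own logs and adjust; driver output differs.
CE_PATTERN = re.compile(r"EDAC MC\d+:.*\bCE\b")


def count_correctable(lines):
    """Count lines that look like EDAC correctable-error reports."""
    return sum(1 for line in lines if CE_PATTERN.search(line))


def scan_console_logs(log_dir):
    """Map each log file's stem (assumed to be the host name) to its CE count.

    Hosts with zero matches are left out, so the result lists only suspects.
    """
    counts = Counter()
    for path in sorted(Path(log_dir).glob("*.log")):
        with path.open(errors="replace") as handle:
            found = count_correctable(handle)
        if found:
            counts[path.stem] = found
    return counts
```

Calling `scan_console_logs()` on the log directory and printing `.most_common()` lists the worst hosts first, which is handy if the distribution is as lopsided as I describe: mostly zeros, with a handful of machines reporting a lot.
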