If a CPU dies, the machine is unlikely to come back up until the bad CPU is removed, especially if the CPU has become unreliable rather than dying completely. Even if CPU 0 is good and the BIOS has no problems booting the OS, the SMP-aware OS will quite probably hit problems with the bad CPU.
Not necessarily. There have been a number of innovations in recent years in the area of integrated fault tolerance, including BIOS-level controls over component monitoring and management. Some of the more upscale Compaq G3 servers, for instance, can remove a processor from operation if it exceeds a threshold of critical errors (this is also true for memory). Alphas can boot even if the bootstrap processor fails at system start, and simply select the next available processor. They also have hot-swap processor capabilities (again, upscale only for the time being). Add to this features like hot-swap 'RAID memory' and PCI, redundant power supplies, fans, and drives, and systems can be made to withstand many common component failures with little or no interruption in service. With the advent of technologies like hyperthreading, manufacturers are being driven by market demands to create more reliable SMP drivers, and I think it is likely that simultaneous multi-threading will eventually become the standard.
A duallie will keep the system up when a faulty process hogs 100% CPU, because the second CPU is still available. That also increases the availability ratio.
Well, it depends. The real differentiator is whether the system is truly 'symmetric', that is, dual processors, dual I/O and dual memory buses. If both processors share the same resources, competition between them for regions of memory and for locks on the PCI bus severely constrains the resources available to each processor. So if a process runs amok on a single-bus architecture, the second processor will not have the resources it needs to run effectively.
application is not going to take down the machine on any modern OS[2], and in any case it can be dealt with using resource limits, SMP or not, provided your OS supports them.
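For what it's worth, here is a minimal sketch of the resource-limit approach mentioned above, assuming a Unix-like OS and Python's standard `resource` module. The helper names `run_with_cpu_limit` and `killed_by_cpu_limit` are made up for illustration. It is not a recommendation for any particular server setup.

```python
import resource
import signal
import subprocess
import sys


def run_with_cpu_limit(argv, cpu_seconds):
    """Run argv as a child process with a soft CPU-time limit (hypothetical helper).

    Returns the child's return code. On POSIX a negative value means the
    child was terminated by that signal number.
    """
    def apply_limit():
        # Runs in the child between fork() and exec(), so only the child
        # is limited, not the calling process.
        _, hard = resource.getrlimit(resource.RLIMIT_CPU)
        soft = cpu_seconds
        if hard != resource.RLIM_INFINITY:
            # The soft limit may not exceed the existing hard limit.
            soft = min(soft, hard)
        resource.setrlimit(resource.RLIMIT_CPU, (soft, hard))

    return subprocess.run(argv, preexec_fn=apply_limit).returncode


def killed_by_cpu_limit(returncode):
    """True if the child was terminated by SIGXCPU (CPU limit exceeded)."""
    return returncode == -signal.SIGXCPU


# Example use (assumption: a busy loop stands in for the runaway process):
#   rc = run_with_cpu_limit([sys.executable, "-c", "while True: pass"], 1)
#   killed_by_cpu_limit(rc)
```

The kernel sends SIGXCPU once the child has used up its soft CPU allowance, and the default action for that signal terminates the process. This works the same with one CPU or several.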
The real problem with SMP is kernel complexity. Drivers that are rock solid on a single processor can have bugs that are only triggered under SMP. Threaded applications can also become unreliable on SMP systems.
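To illustrate the kind of latent bug meant here, a rough sketch of an unprotected read-modify-write on shared state, next to the same update guarded by a lock. The function name `count_up` is hypothetical. Note the assumption: CPython's global interpreter lock means this only shows the pattern and does not reproduce true SMP behaviour. In C with real parallel threads the unlocked variant would fail far more readily.

```python
import threading


def count_up(num_threads, increments, use_lock):
    """Increment a shared counter from several threads (hypothetical helper).

    With use_lock=True every update is serialized and the result is exact.
    With use_lock=False the read and the write are separate steps, so
    updates from other threads can be lost in between.
    """
    counter = 0
    lock = threading.Lock()

    def worker():
        nonlocal counter
        for _ in range(increments):
            if use_lock:
                with lock:
                    counter += 1
            else:
                # Unprotected read-modify-write: another thread may run
                # between these two statements and its update is overwritten.
                current = counter
                counter = current + 1

    threads = [threading.Thread(target=worker) for _ in range(num_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return counter
```

With the lock, `count_up(4, 10000, True)` always returns 40000. Without it, the result may come up short, and whether it does depends on thread scheduling, which is exactly why such bugs can stay hidden until the code runs on more than one processor.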
The extra power of an SMP system might be a bonus, but trying to argue their benefits on the basis of reliability is misguided.
Michel.
1. Now, they may still be very reliable, and more than reliable enough for your needs, but they are still not as reliable as the exact same machine with terminators in all CPU sockets/slots bar one ;) The fault-tolerant systems are outrageously expensive.
2. Unless you're running MacOS 9 or Windows 3.11 on your server (I don't think either supports SMP, though ;).
regards,
--
Paul Jakma paul@clubi.ie paul@jakma.org Key ID: 64A2FF6A
Fortune: A Linux machine! because a 486 is a terrible thing to waste! (By jjs@wintermute.ucr.edu, Joe Sloan)