On Sun, 1 Aug 2004, John Underhill wrote:
> Not necessarily. There have been a number of innovations in recent years in the area of integrated fault tolerance, including BIOS-level controls over component monitoring / management. Some of the more upscale Compaq G3 servers, for instance, can remove a processor from operation if it exceeds a threshold of critical errors (this is also true for memory).
Interesting to know. Those are usually due to ECC errors in CPU caches, often caused by overheating. The CPU is still functional to a degree though: a marginal failure as opposed to a catastrophic one. But what of electrical failures? Even P4-class machines still share a host bus amongst CPUs, no? Anyway, CPUs (if kept sufficiently cool) tend to be one of the more reliable components in a system, if they are good to begin with.
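The threshold behaviour described in the quoted text can be modelled roughly as follows. This is only a toy sketch: `CpuErrorMonitor`, its methods, and the threshold value are hypothetical names made up for illustration, and nothing here reflects how the Compaq firmware actually works.

```python
from collections import Counter


class CpuErrorMonitor:
    """Toy model: take a CPU offline once its error count exceeds a threshold."""

    def __init__(self, cpu_ids, threshold):
        self.threshold = threshold      # hypothetical policy knob
        self.online = set(cpu_ids)      # CPUs still in operation
        self.errors = Counter()         # errors seen per CPU

    def report_error(self, cpu_id):
        """Record one error; return True if this report took the CPU offline."""
        if cpu_id not in self.online:
            return False
        self.errors[cpu_id] += 1
        # Assumption: never offline the last CPU, since a marginal CPU
        # is still better than none at all.
        if self.errors[cpu_id] > self.threshold and len(self.online) > 1:
            self.online.discard(cpu_id)
            return True
        return False


monitor = CpuErrorMonitor(cpu_ids=[0, 1], threshold=3)
for _ in range(4):
    took_offline = monitor.report_error(1)
```

The guard against removing the last CPU is my own assumption; it matches the point that such a CPU is marginal rather than dead.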
> Alphas can boot even if the bootstrap processor fails at system start; the system simply selects the next available processor.
Alphas are quite nice; they have support for lockstep operation too. Tandem were supposed to have been moving to Alpha for their Himalaya F-T servers when DEC bought them. Also, the 21164 and up (not sure about 21064) AXPs used a point-to-point bus for SMP[1], so they were all electrically isolated from each other; at least, a failure of one CPU couldn't affect the other CPUs.
> So that if a process runs amok on a single-bus architecture, the second processor will not have the resources it needs to run effectively.
Processes running amok still only have access to the resources granted to them. Processes generally do not have access to bare I/O. What the OS giveth, it can take away (or constrain).

1. Still alive and well in a sense, but now developed into a general-purpose PtP local CPU/IO interconnect: AMD's HyperTransport, as used in K8.

regards,
-- 
Paul Jakma paul@clubi.ie paul@jakma.org Key ID: 64A2FF6A
Fortune:
Don't get stuck in a closet -- wear yourself out.
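
As an illustration of the point above that the OS can constrain what a runaway process gets, here is a minimal sketch using the POSIX resource limits exposed by Python's standard `resource` module. It assumes a Unix-like system; the helper names `limit_cpu_seconds` and `run_runaway` are made up for this example, and the one-second cap is arbitrary.

```python
import resource
import signal
import subprocess
import sys


def limit_cpu_seconds(seconds):
    """Return a function that caps CPU time for the process that calls it."""
    def apply_limit():
        # Lower only the soft limit; leave the hard limit as inherited.
        _, hard = resource.getrlimit(resource.RLIMIT_CPU)
        resource.setrlimit(resource.RLIMIT_CPU, (seconds, hard))
    return apply_limit


def run_runaway(seconds=1):
    """Start a child that spins forever, with its CPU time capped by the kernel."""
    child = subprocess.run(
        [sys.executable, "-c", "while True: pass"],
        preexec_fn=limit_cpu_seconds(seconds),  # runs in the child before exec
        timeout=60,                             # safety net for this demo only
    )
    return child.returncode


returncode = run_runaway()
```

On POSIX systems a negative return code from `subprocess.run` means the child was terminated by a signal; here the kernel stops the busy loop once the CPU budget is used up, without any cooperation from the process itself.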