Erik Bais wrote:
You mean, they'll be back?
:-D Only once, and this model is obsolete and can't be upgraded to presidentor.
True; this would be like RAID-0 arrays: the more disks, the greater the chance of failure. However, MTBF is not the name of the game here; availability ratio is, and that depends on repair time as much as on failure rate. In other words, I don't really care if the second processor reduces the MTBF from 200k hours to 60k hours, but I do care if the second processor reduces the time to restore service from 24 hours to 20 minutes (7.5 minutes for SNMP to fail the query twice, 1.5 minutes for the tech to find out that either it's frozen or there's a BSOD, 6 minutes to have someone go there and reset it, 5 minutes to reboot). The dead processor still has to be replaced, but that is scheduled maintenance, not an outage. A little extra ammo when you have to hunt five or six nines.
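To make the arithmetic concrete, here is a minimal Python sketch using the numbers above (and the standard steady-state formula availability = MTBF / (MTBF + MTTR), which the post implies rather than states):

    import math

    def availability(mtbf_hours, mttr_hours):
        # Steady-state availability: uptime over total time.
        return mtbf_hours / (mtbf_hours + mttr_hours)

    def nines(a):
        # "Number of nines" is -log10 of the unavailability.
        return -math.log10(1.0 - a)

    # Single CPU: long MTBF, but a hang costs 24 hours of downtime.
    print(round(nines(availability(200_000, 24)), 1))      # ~3.9 nines
    # Dual CPU: worse MTBF, but service is restored in 20 minutes.
    print(round(nines(availability(60_000, 20 / 60)), 1))  # ~5.3 nines

The box with the three-times-worse MTBF still comes out more than a full nine ahead, which is exactly the point being made.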
Then there's the fact that you're far more likely to hit bugs in the OS with SMP than uniprocessor.
Insignificant in my experience, and it does not outweigh what Alexei mentioned yesterday: a duallie will keep the system up when a faulty process hogs 100% of a CPU, because the second one is still available. That also increases the availability ratio. Michel.
On Sun, Aug 01, 2004 at 09:44:13AM -0700, Michel Py wrote:
With the right form factor (a nice, easy-to-open rackmount unit) it will take just as little time to swap in an on-site cold spare. That way you get the nice MTBF and the short restore time. Also, if you have multiple similar machines, you drastically reduce your spares inventory.
These days you can achieve the same using hyper-threading, for example, and keep the long MTBF :)
-- 
Colm MacCárthaigh
Public Key: colm+pgp@stdlib.net
No need. Remove the disk, insert the disk into the spare, start the spare server, and let the techs analyze the broken server the next day. One minute. But in reality, 2-CPU servers are redundant against most CPU failures (I have had a few cases). Anyway, CPU failure is not a major cause of server failures (and never was).
On Sun, 1 Aug 2004, Michel Py wrote:
True; this would be like RAID-0 arrays: the more disks, the greater the chance of failure.
This holds true for most RAID-x levels.
If a CPU dies, the machine is unlikely to come back up without removing the bad CPU, especially if the CPU has become unreliable rather than dying completely. Even if CPU 0 is good and the BIOS has no problem booting the OS, the SMP-aware OS will quite probably hit problems with the bad CPU. If you really want to guard against CPU failures, you need a machine designed for fault tolerance, not a "cheap" SMP box; those are just *less* reliable.[1]
Just tape a spare CPU to the inside of the box if time-to-repair is important. Even better, just have a second system on standby.
Insignificant in my experience, and it does not outweigh what Alexei mentioned yesterday:
Alexei is talking about something else.
This is a resource problem, not an availability problem. A spinning application is not going to take down the machine on any modern OS[2], and it can be dealt with via resource limits anyway, SMP or not, presuming your OS supports them. The real problem with SMP is kernel complexity. Drivers that are rock solid on a single processor can have bugs that are only triggered under SMP. Threaded applications can also become unreliable on SMP systems. The extra power of an SMP system might be a bonus, but trying to argue their benefits on the basis of reliability is misguided.
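As an aside, the resource-limit point takes only a few lines; a minimal sketch in Python on a POSIX system (the 60/90-second numbers are arbitrary, chosen purely for illustration):

    import resource

    # Cap this process (and any children forked afterwards) at 60
    # CPU-seconds: the kernel sends SIGXCPU at the soft limit and
    # SIGKILL at the hard limit, on SMP and uniprocessor alike.
    resource.setrlimit(resource.RLIMIT_CPU, (60, 90))

A spinning process so limited gets killed long before it matters whether a second CPU was there to absorb it.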
1. Now, they may still be very reliable, and more than reliable enough for your needs, but they are still not as reliable as the exact same machine with terminators in all CPU sockets/slots bar one ;) The fault-tolerant systems are outrageously expensive.
2. Unless you're running MacOS 9 or Windows 3.11 on your server.. don't think either supports SMP though ;).
regards,
-- 
Paul Jakma
paul@clubi.ie paul@jakma.org
Key ID: 64A2FF6A
Fortune: A Linux machine! because a 486 is a terrible thing to waste! (By jjs@wintermute.ucr.edu, Joe Sloan)
Not necessarily. There have been a number of innovations in recent years in the area of integrated fault tolerance, including BIOS-level controls over component monitoring and management. Some of the more upscale Compaq G3 servers, for instance, can remove a processor from operation if it exceeds a threshold of critical errors (this is also true for memory). Alphas can boot even if the bootstrap processor fails at system start, simply selecting the next available processor; they also have hot-swap processor capabilities (again, for the time being, upscale). Add to this features like hot-swap 'RAID memory' and PCI, redundant power supplies, fans, and drives, and systems can be made to withstand many common component failures with little or no interruption in service. With the advent of technologies like hyper-threading, manufacturers are being driven by market demand to create more reliable SMP drivers, and I think it is likely that simultaneous multi-threading will eventually become the standard.
Well, it depends. The real differentiation is whether the system is truly 'symmetric', that is, with dual processor, I/O, and memory buses. If both processors share the same resources, competition between processors for regions of memory and for locks on the PCI bus severely constrains the resources available to each processor. So if a process runs amok on a single-bus architecture, the second processor will not have the resources it needs to run effectively.
On Sun, 1 Aug 2004, John Underhill wrote:
Interesting to know. Those are usually due to ECC errors in CPU caches, often caused by overheating. The CPU is still functional to a degree, though: a marginal failure as opposed to a catastrophic one. But what of electrical failures? Even P4-class machines still share a host bus amongst CPUs, no? Anyway, CPUs (if kept sufficiently cool) tend to be among the more reliable components in a system, if they are good to begin with.
Alphas can boot even if the bootstrap processor fails at system start, simply selecting the next available processor..
Alphas are quite nice; they have support for lockstep operation too. Tandem were supposed to have been moving to Alpha for their Himalaya F-T servers when Compaq bought them. Also, the 21164 and up (not sure about the 21064) AXPs used a point-to-point bus for SMP[1]; the CPUs were all electrically isolated from each other, so at least a failure of one CPU couldn't affect the others.
Processes running amok still only have access to those resources granted to them. Processes generally do not have access to bare I/O. What the OS giveth, it can take away (or constrain).
1. Still alive and well in a sense, but now developed into a general-purpose point-to-point local CPU/IO interconnect: AMD's HyperTransport, as used in the K8.
regards,
-- 
Paul Jakma
paul@clubi.ie paul@jakma.org
Key ID: 64A2FF6A
Fortune: Don't get stuck in a closet -- wear yourself out.
In theory, yes. In practice, 2 CPUs improve behavior dramatically. 4 CPUs make the system too complex (as you wrote below). The new P4 with multi-threading may be a good choice: it behaves as a 2-CPU system but is not as complicated as SMP.
The real problem with SMP is kernel complexity. Drivers that are rock
s/is/was/ (5 years ago). Now most kernels are SMP. I agree that SMP kernels are much more complicated, but we _already_ paid that price. In reality, applications are less reliable on 2-CPU systems (if they have the kinds of bugs that only manifest on SMP), so I agree with you in some cases.
On Tue, 3 Aug 2004, Alexei Roudnev wrote:
In theory, yes. In practice, 2 CPUs improve behavior dramatically.
That is not about reliability; that's to do with software performance. I was purely picking an admittedly pedantic nit with the notion that SMP == more reliable. I'm not trying to argue that SMP does not have other benefits (e.g. performance).
4 CPUs make the system too complex (as you wrote below).
Nah, the big jump in complexity appears to be from no-concurrency to concurrency. After that initial hurdle, going from 2 to 4 to 8 CPUs isn't as big a deal (making it scale is, though).
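For what it's worth, that initial hurdle fits in a toy counter; a minimal Python sketch (a hypothetical example, not anything from the thread), where a read-modify-write that is perfectly fine single-threaded needs a lock the moment there is any concurrency at all:

    import threading

    counter = 0
    lock = threading.Lock()

    def bump(n):
        global counter
        for _ in range(n):
            # "counter += 1" is a read-modify-write that two threads
            # can interleave; the lock makes the whole update atomic.
            with lock:
                counter += 1

    threads = [threading.Thread(target=bump, args=(100_000,)) for _ in range(2)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    print(counter)  # 200000 every time; drop the lock and it can come up short

Going from 2 threads to 8 changes nothing in that code; the hard part was needing the lock at all.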
The new P4 with multi-threading may be a good choice: it behaves as a 2-CPU system but is not as complicated as SMP.
Right..
regards,
-- 
Paul Jakma
paul@clubi.ie paul@jakma.org
Key ID: 64A2FF6A
Fortune: Rubber bands have snappy endings!
"Michel Py" <michel@arneill-py.sacramento.ca.us> writes:
MTTR on a single box is irrelevant when you are off playing Ponce de Leon, hunting the Fountain of Five or Six Nines. Even when your architecture doesn't depend on any one particular machine (or even whole big sets of machines) being available, you don't get to "five or six nines"... just ask Google, Akamai, or Microsoft - there are other things beyond your control that spoil the picnic first.

As has been observed time and time again, the tried and true way to get five or six nines of reliability out of a system of more than trivial complexity is to take a lesson from the telcos (the progenitors of the "five nines" lie) and build a framework and evaluation methodology that excludes broad classes of unavailability-causing events, or prorates them in such a way as to make them non-reportable. Add to that list incrementally until the remaining time shows your target number of nines. Presto, five nines.

---Rob
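Tongue only partly in cheek, that methodology is mechanical enough to script; a minimal Python sketch with made-up downtime figures (nothing here is real data):

    import math

    HOURS_PER_YEAR = 8766
    outages = {            # hours of downtime per year, by cause
        "backhoe": 6.0,
        "power": 2.0,
        "software": 0.8,
        "hardware": 0.05,
    }

    def nines(downtime_hours):
        return -math.log10(downtime_hours / HOURS_PER_YEAR)

    remaining = dict(outages)
    while nines(sum(remaining.values())) < 5.0:
        # Declare the worst remaining category non-reportable.
        worst = max(remaining, key=remaining.get)
        print(f"excluding {worst}")
        del remaining[worst]

    print(f"reportable availability: {nines(sum(remaining.values())):.1f} nines")  # presto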