
Alexei Roudnev wrote: We had a 6509 which failed because the backplane failed (it cannot happen -:) but it happened) - of course, no 'dual CPU, dual power' setup could prevent it... Imagine a broken line card - it can crash the whole box no matter how many 'dual' things you have. The same goes for software errors (I crashed one of our 6509s just by running 'snmpwalk' on it).
I lost a 7507 with dual power and dual RSP earlier this year: one of the cards died, something in the power circuitry. It put the entire router in short-circuit; both power supplies decided to go south and would not power back up until the faulty card was physically removed. After the card was removed it worked fine again. It does not happen often, but it does happen. Redundancy is not a slam dunk with IOS though; same as dCEF, don't expect RPR-compatible images to run every config you'll bump into, YMMV. There are an annoying number of things that do not work on RPR images or fall back to route cache instead of distributed cache.
So, I always prefer to have 2 boxes and application-level reliability instead of playing with 'dual everything' solutions (last example - 2 days ago one of our dual-power Intel servers failed because of a single power supply failure - the supply did not break cleanly, but it did something wrong and the system crashed...).
Actually, what I try to do for routers is have a "dual everything" for production and an "el-cheapo eBay special" sitting in the same rack for backup. The reason I still do dual power and dual CPU is that over the last 20 years I have seen very few failures of redundant systems (although I have seen some), whereas a dual-something has saved my bottom several times. That part of my body is priceless :-D For PCs I install dual Xeons on every production machine, for example, even though the CPU power needed for some is a 486; Intel processors do die like anything else; a processor dying will typically lead to a system crash, but the box does reboot in single-processor mode when the graveyard dude pushes the reset button. I also try to have RAID-10 arrays span two RAID cards; same as CPUs, a RAID card that dies will likely crash the system, but it will reboot in degraded mode. Michel.

--On Saturday, July 31, 2004 20:51 -0700 Michel Py <michel@arneill-py.sacramento.ca.us> wrote:
For PCs I install dual Xeons on every production machine, for example, even though the CPU power needed for some is a 486; Intel processors do die like anything else; a processor dying will typically lead to a system crash, but the box does reboot in single-processor mode when the graveyard dude pushes the reset button. I also try to have RAID-10 arrays span two RAID cards; same as CPUs, a RAID card that dies will likely crash the system, but it will reboot in degraded mode.
Eh, really? Whenever I've lost a CPU (primary or secondary) the machine was a brick until the failed CPU was gutted and, for PIII slotted systems, a terminator board was installed in the secondary slot. What motherboard(s) are you using that hold up to failures like this? My experience has shown PSU and motherboard failures are faaaaar more common than CPU failures. -- Undocumented Features quote of the moment... "It's not the one bullet with your name on it that you have to worry about; it's the twenty thousand-odd rounds labeled `occupant.'" --Murphy's Laws of Combat

2 CPUs are not for redundancy, but they protect the system from a crazy process eating 100% of one CPU (the system still has 50% of its capacity).

We are doing the same, including spare-staging hardware from eBay and 2-CPU servers for everything (though we still like old and cheap 2x1GHz servers, able to do 99% of all tasks). PS. I like eBay; last example - our colleagues spent long negotiations settling on a price for Cisco switches with a good discount; a 10-second eBay search revealed exactly the same systems in original boxes (unopened), 10% cheaper -:)

On Sat, 31 Jul 2004, Michel Py wrote:
For PCs I install dual Xeons on every production machine for example, even though the CPU power needed for some is a 486;
What a mad idea. Intel don't do fault-tolerant SMP. Running SMP will lower your MTBF just by the fact that you now have 2 CPUs. Then there's the fact that you're far more likely to hit bugs in the OS with SMP than with a uniprocessor. SMP for reliability? Unless it's a multi-million-dollar zSeries, what a mad idea...
regards, -- Paul Jakma paul@clubi.ie paul@jakma.org Key ID: 64A2FF6A Fortune: Q: What's the difference between USL and the Titanic? A: The Titanic had a band.

It is not a mad idea - 2-CPU servers are not significantly more expensive than 1-CPU ones (and notice, we count a Hyper-Threading P-IV as 2 CPUs), but they increase the system's resilience to runaway processes. Of course, it is not hardware redundancy, but it REALLY works.

On Tue, 3 Aug 2004, Alexei Roudnev wrote:
It is not a mad idea - 2-CPU servers are not significantly more expensive than 1-CPU ones (and notice, we count a Hyper-Threading P-IV as 2 CPUs)
Well, you have to compare like for like, so a system with multiple CPUs versus the exact same system without. No difference in cost, other than for the CPUs. And if you want reliability, you're not going to be buying your machines from the nearest Lidl (unless your application is engineered to take advantage of dozens of cheap throwaway PCs).
but they increase the system's resilience to runaway processes. Of course, it is not hardware redundancy, but it REALLY works.
Not really... this is a resource exhaustion problem, and you cannot cure it, given buggy apps, by throwing more CPUs at it. Let's say you have some multi-process or multi-threaded application which regularly spawns/forks new processes/threads, but is buggy and prone to having individual processes/threads spin. So one spins, but you still have plenty of CPU time left because you have two CPUs. Another spins, and the machine starts to crawl. So you solve this problem by upgrading to a quad-SMP machine. And guess what happens? :) Sure, there are some application bugs you can mask a wee bit with SMP, but it's not much cop, it's not a solution, and you'd need an infinite-SMP machine to guarantee that a bad application can never hog all CPU time. What you really want is a good OS with:
- a good scheduler (to prevent spinning tasks from starving other tasks)
- the ability to set resource limits, i.e. per-task and/or per-user (if your apps run under dedicated user accounts) limits on CPU time, resident memory, etc.
Both of these allow you to constrain the impact bad tasks can have on the system, whether your machine has 1, 2, ... or n CPUs. The real solution, though, is to fix the buggy application. regards, -- Paul Jakma paul@clubi.ie paul@jakma.org Key ID: 64A2FF6A Fortune: The life which is unexamined is not worth living. -- Plato
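The per-task CPU-time limit Paul describes is the classic Unix `setrlimit(RLIMIT_CPU, ...)` mechanism. A minimal sketch on Linux - the `run_limited` helper name and the one-second cap are our own illustration, not anything from the thread:

```python
import os
import resource


def run_limited(cpu_seconds, fn):
    """Fork a child, cap its CPU time, and report whether it was killed.

    With soft == hard, exceeding RLIMIT_CPU delivers SIGXCPU, whose
    default action terminates the process - so a spinning task dies on
    its own, no matter how many CPUs the box has.
    """
    pid = os.fork()
    if pid == 0:
        resource.setrlimit(resource.RLIMIT_CPU, (cpu_seconds, cpu_seconds))
        fn()
        os._exit(0)
    _, status = os.waitpid(pid, 0)
    return os.WIFSIGNALED(status)


def spin():
    while True:  # a deliberately runaway task
        pass


print(run_limited(1, spin))  # True: the kernel killed the spinning child
```

The same cap is available per login session via the shell's `ulimit -t`, which is how one would apply it to daemons started from dedicated user accounts.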

----- Original Message ----- From: "Paul Jakma" <paul@clubi.ie> To: "Alexei Roudnev" <alex@relcom.net> Cc: "Michel Py" <michel@arneill-py.sacramento.ca.us>; "Nanog" <nanog@nanog.org> Sent: Wednesday, August 04, 2004 2:39 AM Subject: Re: Quick question. --- snip ---
Not really.. this is a resource exhaustion problem, and you can not cure this, given buggy apps, by throwing more CPUs at it.
Let's say you have some multi-process or multi-threaded application which regularly spawns/forks new processes/threads, but it is buggy and prone to having individual processes/threads spin.
So one spins, but you still have plenty of CPU time left cause you have two CPUs. Another spins, and the machine starts to crawl. So you solve this problem by upgrading to a quad-SMP machine. And guess what happens? :)
the second CPU buys you time - it is unlikely you're going to be able to react in time on a busy single-CPU box with a runaway process (it launches into a death spiral almost immediately), but you would usually have 10-15 mins on a dual-CPU box at a minimum, or maybe forever if you enforce CPU affinity for apps that tend to misbehave. paul
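The CPU-affinity tactic Paul mentions can be applied from the shell with `taskset`, or programmatically. A Linux-only sketch using `os.sched_setaffinity` - the choice of core 0 is arbitrary, and PID 0 means "the current process":

```python
import os

# Confine this process to CPU 0 so that, even if it spins, the other
# core(s) stay free for logins and for killing it. Linux-only API.
os.sched_setaffinity(0, {0})
print(os.sched_getaffinity(0))  # the affinity set is now just {0}
```

A service prone to misbehaving would call this at startup (or be launched under `taskset -c 0`), trading some throughput for the guarantee that one spin can never consume the whole machine.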

On Wed, 4 Aug 2004, Paul G wrote:
the second cpu buys you time - it is unlikely you're going to be able to react in time on a busy single cpu box with a runaway process (it launches into a death sprial almost immediately), but you would usually have 10-15 mins on a dual cpu box at a minimum or maybe infinity if you enforce cpu affinity for apps that tend to misbehave.
Why do you have 10-15 mins? If the application is multi-threaded and has a reasonable workload, there are plenty of types of bugs that will result in one spinning thread after another - you'd need far more than just 2 CPUs! Or maybe your application vendor has "at least 10 minutes between hitting bugs!" on its feature list? ;) Really, what you need to do (in the face of such buggy apps) is set per-task CPU-time resource limits appropriate to how much CPU time a task needs and how much you can afford - be it a 1-, 2- or n-CPU system.
paul
regards, -- Paul Jakma paul@clubi.ie paul@jakma.org Key ID: 64A2FF6A Fortune: I came to MIT to get an education for myself and a diploma for my mother.

----- Original Message ----- From: "Paul Jakma" <paul@clubi.ie> To: "Paul G" <paul@rusko.us> Cc: <nanog@merit.edu> Sent: Wednesday, August 04, 2004 3:09 AM Subject: Re: Quick question.
On Wed, 4 Aug 2004, Paul G wrote:
the second CPU buys you time - it is unlikely you're going to be able to react in time on a busy single-CPU box with a runaway process (it launches into a death spiral almost immediately), but you would usually have 10-15 mins on a dual-CPU box at a minimum, or maybe forever if you enforce CPU affinity for apps that tend to misbehave.
Why do you have 10-15 mins? If the application is multi-threaded and has a reasonable workload, there are plenty of types of bugs that will result in one spinning thread after another - you'd need far more than just 2 CPUs! Or maybe your application vendor has "at least 10 minutes between hitting bugs!" on its feature list? ;)
these are observations pertaining to software products we use a lot - apache, mysql, apache/suexec, various MTAs etc. your point is well taken in general, but at least When Done Here(tm), dual CPU helps significantly, empirically speaking.
Really, what you need to do (in the face of such buggy apps) is set per-task CPU-time resource limits appropriate to how much CPU time a task needs and how much you can afford - be it a 1-, 2- or n-CPU system.
agreed. however, this degrades performance in certain situations, is not practical in others, and introduces additional complexity (always a bad thing). the tradeoff is significantly in favor of reactive measures (be they automatic or human intervention), at least in most of our installations. paul
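An "automatic reactive measure" of the kind Paul G alludes to can be as small as a periodic poll of `ps`. A hedged sketch - the `runaway_candidates` name and the 90% default threshold are our own, and whether to renice or kill what it finds is left to the operator:

```python
import subprocess


def runaway_candidates(cpu_threshold=90.0):
    """Return (pid, %cpu, command) for processes above the threshold.

    Built on the portable `ps -eo pid,pcpu,comm` output; a cron job or
    monitor loop would alert on (or act against) any hits.
    """
    out = subprocess.run(
        ["ps", "-eo", "pid,pcpu,comm"],
        capture_output=True, text=True, check=True,
    ).stdout
    hits = []
    for line in out.splitlines()[1:]:   # skip the "PID %CPU COMMAND" header
        pid, pcpu, comm = line.split(None, 2)
        if float(pcpu) > cpu_threshold:
            hits.append((int(pid), float(pcpu), comm))
    return hits


print(runaway_candidates())
```

This is exactly the judgment call debated in the thread: a watchdog like this is simple and reactive, while hard resource limits are preventative but add configuration complexity.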

I am sorry, but I am not making a theory - I am just reporting practical results. 2-CPU systems are much more stable than 1-CPU systems, in my experience. You are free to find an explanation, if you want -:).

On Wed, 4 Aug 2004, Alexei Roudnev wrote:
I am sorry, but I am not making a theory - I am just reporting practical results. 2-CPU systems are much more stable than 1-CPU systems, in my experience. You are free to find an explanation, if you want -:).
The theory suggests your experience is unusual, or that you're over-emphasising one positive contributor towards system reliability against the negative impacts of complexity. Again, I'm not arguing that the more complex system (e.g. SMP) must always be more unreliable; a well-engineered complex system will be more reliable than a simple but badly-engineered one. I know of an SMP PC server that hit at least 4 years of uptime (it was never rebooted while I was in the employ of that company, anyway ;) ), however it would have been just as reliable with one CPU. And for a large sample of such machines, identical except for single versus dual CPU, the set of single-CPU machines will be statistically more reliable. Further, for a diverse sample of hardware of varying quality, you will see far more problems with SMP systems - primarily due to software (e.g. drivers with subtle locking bugs). Nor am I arguing that trading reliability for better performance is unwise, particularly since in this case (SMP systems) CPU failures tend to be rare (unless secondary to some other failure, e.g. cooling). Anyway, I'm repeating myself, so I'll stop before Susan LARTs me, and let the list get back to its favoured topic of discussing analogies. ;) regards, -- Paul Jakma paul@clubi.ie paul@jakma.org Key ID: 64A2FF6A Fortune: You're working under a slight handicap. You happen to be human.

On Aug 4, 2004, at 10:53 PM, Paul Jakma wrote:
On Wed, 4 Aug 2004, Alexei Roudnev wrote:
I am sorry, but I am not making a theory - I am just reporting practical results. 2-CPU systems are much more stable than 1-CPU systems, in my experience. You are free to find an explanation, if you want -:).
The theory suggests your experience is unusual,
Practice suggests that there may well be a good reason for this. Mainboards that are set up for 2 CPUs are likely to be engineered to a much higher standard than your normal chop-shop cheapie special. An interesting experiment would be to run a 1-CPU system based on the exact same 2-CPU mainboard and see whether the level of reliability would be significantly different. Tony

I said - it WORKS. One spin - a warning - someone opens the system and kills the runaway process... I never saw two spins (because the first one was killed before a second one appeared). Btw, such (2-CPU) systems are even more stable in the case of runaway device drivers. I saw:
- a runaway tomcat server
- a runaway CA agent (!@#$)
- a runaway ssh daemon
- a runaway sendmail
All recurring, at some periods of time. And all handled without any system degradation, because of the extra CPU. The same runaways on 1-CPU systems caused visible degradation. It is all a matter of trade-offs - if I must select a 1-thread or 2-thread P-IV, I'll select the 2-thread one; if I must choose between a $900 1-CPU and a $1100 2-CPU server, I select the 2-CPU one.

On Wed, Aug 04, 2004, Alexei Roudnev wrote:
I said - it WORKS. One spin - a warning - someone opens the system and kills the runaway process... I never saw two spins (because the first one was killed before a second one appeared). Btw, such (2-CPU) systems are even more stable in the case of runaway device drivers.
I call crapola. Modern - _modern_ - systems may have _some_ of the device drivers running on separate CPUs, but they're still running in kernel mode. A runaway device driver means you're toast. Now, a very, very busy device, that's a separate story. Having one CPU handle all of your disk/network IO and the second CPU handle all of your processes may alleviate some of the pain. May. There's more to it than just offloading stuff. If your processes all _depend_ on IO occurring, then you may end up with random crappy starvation situations. This has nothing to do with NANOG. Let's talk about dCEF bugs or something. Adrian -- Adrian Chadd I'm only a fanboy if <adrian@creative.net.au> I emailed Wesley Crusher.

Again - I am not trying to explain, I am just reporting observations -:).
participants (7)
- Adrian Chadd
- Alexei Roudnev
- Michael Loftis
- Michel Py
- Paul G
- Paul Jakma
- Tony Li