Yesterday, 04/16/07, between 3:00 and 3:45 PM, we had sporadic Internet problems. Our ISPs are Sprint and Qwest. Thanks, Audie Onibala 703-292-5316
Audie Onibala wrote:
Yesterday on 04/16/07 between 3:00 - 3:45 PM we had sporadic Internet problem. Our ISP's are Sprint and Qwest.
Around that time there was quite a bit of sunspot activity, and the moon had an unusual position too. The NOC contacts at your ISPs may be of more specific help. But make sure to ask them for their network's SPF (sunspot protection factor). That's an important metric for qualifying their network reliability. -- Andre
Andre Oppermann wrote:
Audie Onibala wrote:
Yesterday on 04/16/07 between 3:00 - 3:45 PM we had sporadic Internet problem. Our ISP's are Sprint and Qwest.
Around that time there was quite a bit sunspot activity and the moon had an unusual position too. The NOC contacts of your ISP's probably may be of more specific help. But make sure to ask them for their networks SPF (sunspot protection factor). That's an important metric to qualify their network reliability.
Are you sure it was sunspots? My NOC contacts were seeing substantial memory corruption due to cosmic rays. -- Jay Hennigan - CCIE #7880 - Network Engineering - jay@impulse.net Impulse Internet Service - http://www.impulse.net/ Your local telephone and internet company - 805 884-6323 - WB6RDV
Somebody from a certain large network vendor actually blamed problems with their kit on cosmic rays causing memory corruption... -- Leigh Porter Jay Hennigan wrote:
Andre Oppermann wrote:
Audie Onibala wrote:
Yesterday on 04/16/07 between 3:00 - 3:45 PM we had sporadic Internet problem. Our ISP's are Sprint and Qwest.
Around that time there was quite a bit sunspot activity and the moon had an unusual position too. The NOC contacts of your ISP's probably may be of more specific help. But make sure to ask them for their networks SPF (sunspot protection factor). That's an important metric to qualify their network reliability.
Are you sure it was sunspots? My NOC contacts were seeing substantial memory corruption due to cosmic rays.
-- Jay Hennigan - CCIE #7880 - Network Engineering - jay@impulse.net Impulse Internet Service - http://www.impulse.net/ Your local telephone and internet company - 805 884-6323 - WB6RDV
On 19/04/07, Leigh Porter <leigh.porter@ukbroadband.com> wrote:
Somebody from a certain large network vendor actually blamed problems with their kit on cosmic rays causing memory corruption...
Remember that cosmic rays are very selective, they always seem to pick boxes from this specific vendor. /Tony
With certain susceptible Sun CPUs which were popular during the last sunspot maxima, this was actually demonstrably true (and acknowledged by Sun), so don't laugh too hard. ---rob Leigh Porter <leigh.porter@ukbroadband.com> writes:
Somebody from a certain large network vendor actually blamed problems with their kit on cosmic rays causing memory corruption...
-- Leigh Porter
Jay Hennigan wrote:
Andre Oppermann wrote:
Audie Onibala wrote:
Yesterday on 04/16/07 between 3:00 - 3:45 PM we had sporadic Internet problem. Our ISP's are Sprint and Qwest.
Around that time there was quite a bit sunspot activity and the moon had an unusual position too. The NOC contacts of your ISP's probably may be of more specific help. But make sure to ask them for their networks SPF (sunspot protection factor). That's an important metric to qualify their network reliability.
Are you sure it was sunspots? My NOC contacts were seeing substantial memory corruption due to cosmic rays.
-- Jay Hennigan - CCIE #7880 - Network Engineering - jay@impulse.net Impulse Internet Service - http://www.impulse.net/ Your local telephone and internet company - 805 884-6323 - WB6RDV
Shields Up? http://www.theregister.co.uk/2007/04/19/star_trek_shield/ Robert E. Seastrom wrote:
With certain susceptible Sun CPUs which were popular during the last sunspot maxima, this was actually demonstrably true (and acknowledged by Sun), so don't laugh too hard.
---rob
Leigh Porter <leigh.porter@ukbroadband.com> writes:
Somebody from a certain large network vendor actually blamed problems with their kit on cosmic rays causing memory corruption...
-- Leigh Porter
Jay Hennigan wrote:
Andre Oppermann wrote:
Audie Onibala wrote:
Yesterday on 04/16/07 between 3:00 - 3:45 PM we had sporadic Internet problem. Our ISP's are Sprint and Qwest.
Around that time there was quite a bit sunspot activity and the moon had an unusual position too. The NOC contacts of your ISP's probably may be of more specific help. But make sure to ask them for their networks SPF (sunspot protection factor). That's an important metric to qualify their network reliability.
Are you sure it was sunspots? My NOC contacts were seeing substantial memory corruption due to cosmic rays.
-- Jay Hennigan - CCIE #7880 - Network Engineering - jay@impulse.net Impulse Internet Service - http://www.impulse.net/ Your local telephone and internet company - 805 884-6323 - WB6RDV
On Apr 19, 2007, at 10:17 AM, Robert E. Seastrom wrote:
With certain susceptible Sun CPUs which were popular during the last sunspot maxima, this was actually demonstrably true (and acknowledged by Sun), so don't laugh too hard.
Yup, Sandia National Labs made a radiation hardened Pentium and, as far as I remember, was working on a hardened SPARC -- there was also some work done (AFAIR on PPC) whereby 3 processors would run the same instructions and vote on the output...
---rob
Leigh Porter <leigh.porter@ukbroadband.com> writes:
Somebody from a certain large network vendor actually blamed problems with their kit on cosmic rays causing memory corruption...
Oh, not just "somebody" -- a certain large vendor has many, many references to it -- and I have received it as an explanation for random reloads -- believe me, trying to tell an irate customer / PHB that the reason that his "mission critical" circuit bounced was because of cosmic rays is No Fun(tm).
Hmmm.. Isn't this the same vendor that now has a router sitting on a satellite?! ;-)
There was also an issue where one of the large manufacturers of (binary) CAMs received a batch of polyimide that was contaminated with an alpha emitter (for some reason thorium oxide springs to mind) and their quality control didn't catch it... As far as I know the problem was identified before any products with the CAMs were shipped, but I had an order held up while the vendor tried to source alternate parts...
-- Leigh Porter
Jay Hennigan wrote:
Andre Oppermann wrote:
Audie Onibala wrote:
Yesterday on 04/16/07 between 3:00 - 3:45 PM we had sporadic Internet problem. Our ISP's are Sprint and Qwest.
Around that time there was quite a bit sunspot activity and the moon had an unusual position too. The NOC contacts of your ISP's probably may be of more specific help. But make sure to ask them for their networks SPF (sunspot protection factor). That's an important metric to qualify their network reliability.
Are you sure it was sunspots? My NOC contacts were seeing substantial memory corruption due to cosmic rays.
-- Jay Hennigan - CCIE #7880 - Network Engineering - jay@impulse.net Impulse Internet Service - http://www.impulse.net/ Your local telephone and internet company - 805 884-6323 - WB6RDV
-- After you'd known Christine for any length of time, you found yourself fighting a desire to look into her ear to see if you could spot daylight coming the other way. -- (Terry Pratchett, Maskerade)
-----Original Message----- From: owner-nanog@merit.edu [mailto:owner-nanog@merit.edu] On Behalf Of Warren Kumari Sent: Thursday, April 19, 2007 12:01 PM To: Robert E. Seastrom Cc: Leigh Porter; Jay Hennigan; Andre Oppermann; nanog@merit.edu Subject: Re: BGP Problem on 04/16/2007
On Apr 19, 2007, at 10:17 AM, Robert E. Seastrom wrote:
With certain susceptible Sun CPUs which were popular during the last sunspot maxima, this was actually demonstrably true (and acknowledged by Sun), so don't laugh too hard.
Yup, Sandia National Labs made a radiation hardened Pentium and, as far as I remember, was working on a hardened SPARC -- there was also some work done (AFAIR on PPC) whereby 3 processors would run the same instructions and vote on the output...
Thinking of perhaps Resilience? http://www.resilience.com/ God, those things were horrid before they realized that the business model of assuming "The app will always be OK, the issue will be the hardware" was completely misguided. I forget what the product was named at the time, but I'll never forget what a piece of crap it was.
On Apr 19, 2007, at 12:52 PM, David Temkin wrote:
-----Original Message----- From: owner-nanog@merit.edu [mailto:owner-nanog@merit.edu] On Behalf Of Warren Kumari Sent: Thursday, April 19, 2007 12:01 PM To: Robert E. Seastrom Cc: Leigh Porter; Jay Hennigan; Andre Oppermann; nanog@merit.edu Subject: Re: BGP Problem on 04/16/2007
On Apr 19, 2007, at 10:17 AM, Robert E. Seastrom wrote:
With certain susceptible Sun CPUs which were popular during the last sunspot maxima, this was actually demonstrably true (and acknowledged by Sun), so don't laugh too hard.
Yup, Sandia National Labs made a radiation hardened Pentium and, as far as I remember, was working on a hardened SPARC -- there was also some work done (AFAIR on PPC) whereby 3 processors would run the same instructions and vote on the output...
There is a radiation-hardened PowerPC - http://www.klabs.org/DEI/Processor/PowerPC/index.htm You need this for space-flight-qualified hardware. Up there, cosmic ray bit flips and stuck bits are a common occurrence. Regards Marshall
Thinking of perhaps Resilience? http://www.resilience.com/
God, those things were horrid before they realized that the business model of assuming "The app will always be OK, the issue will be the hardware" was completely misguided. I forget what the product was named at the time, but I'll never forget what a piece of crap it was.
"David Temkin" <dave@rightmedia.com> writes:
From: owner-nanog@merit.edu [mailto:owner-nanog@merit.edu] On Behalf Of Warren Kumari Yup, Sandia National Labs made a radiation hardened Pentium and, as far as I remember, was working on a hardened SPARC -- there was also some work done (AFAIR on PPC) whereby 3 processors would run the same instructions and vote on the output...
Thinking of perhaps Resilience? http://www.resilience.com/
God, those things were horrid before they realized that the business model of assuming "The app will always be OK, the issue will be the hardware" was completely misguided. I forget what the product was named at the time, but I'll never forget what a piece of crap it was.
Eh, they're not the only folks to have had voting-multi-cpu-lockstep-execution hardware platforms. Stratus did it for years; the Tandem Integrity S2 (to which I ported Emacs 18.55 many moons ago) was similar. ---Rob
On Apr 19, 2007, at 10:03 AM, Robert E. Seastrom wrote:
"David Temkin" <dave@rightmedia.com> writes:
From: owner-nanog@merit.edu [mailto:owner-nanog@merit.edu] On Behalf Of Warren Kumari Yup, Sandia National Labs made a radiation hardened Pentium and, as far as I remember, was working on a hardened SPARC -- there was also some work done (AFAIR on PPC) whereby 3 processors would run the same instructions and vote on the output...
Thinking of perhaps Resilience? http://www.resilience.com/
God, those things were horrid before they realized that the business model of assuming "The app will always be OK, the issue will be the hardware" was completely misguided. I forget what the product was named at the time, but I'll never forget what a piece of crap it was.
Eh, they're not the only folks to have had voting-multi-cpu-lockstep-execution hardware platforms. Stratus did it for years; the Tandem Integrity S2 (to which I ported Emacs 18.55 many moons ago) was similar.
I helped develop a digital communication system for the Navy at Hughes back in the early 80s. We could only use fusible ROMs and rad-hard 8080s. (No breakpoints.) Crystals were nudged into lock for three-way synchronous voting on defective systems/hardware. Mechanical inputs were also redundant, and of course a bear to resync. This led to a snafu during war games with an aircraft carrier, where the air controller panel's gray-code rotor switches were erroneously flagged as defective during peak use. Luckily everyone lived. -Doug
On Apr 19, 2007, at 12:52 PM, David Temkin wrote:
-----Original Message----- From: owner-nanog@merit.edu [mailto:owner-nanog@merit.edu] On Behalf Of Warren Kumari Sent: Thursday, April 19, 2007 12:01 PM To: Robert E. Seastrom Cc: Leigh Porter; Jay Hennigan; Andre Oppermann; nanog@merit.edu Subject: Re: BGP Problem on 04/16/2007
On Apr 19, 2007, at 10:17 AM, Robert E. Seastrom wrote:
With certain susceptible Sun CPUs which were popular during the last sunspot maxima, this was actually demonstrably true (and acknowledged by Sun), so don't laugh too hard.
Yup, Sandia National Labs made a radiation hardened Pentium and, as far as I remember, was working on a hardened SPARC -- there was also some work done (AFAIR on PPC) whereby 3 processors would run the same instructions and vote on the output...
Thinking of perhaps Resilience? http://www.resilience.com/
God, those things were horrid before they realized that the business model of assuming "The app will always be OK, the issue will be the hardware" was completely misguided. I forget what the product was named at the time, but I'll never forget what a piece of crap it was.
Nah, I wasn't thinking of them -- post-traumatic memory loss allowed me to forget them...
There was someone else whose name I have managed to forget who tried to do the same thing through 4 parallel SCSI connectors and fancy OS software -- it was horrendous.. There were 2 motherboards in a case (driven by the same non-redundant, non-swappable PSU!) and each motherboard had 2 dual-channel SCSI cards with cables stretched between the cards. Fancy drivers exposed each board's RAM to the other machine -- there was also a 10Base-2 cable (I'm dating myself here) between the motherboards for coordination and communication. Every now and then your application was supposed to make a system call that would cause the machines to grind to a halt and compare their memory -- if there was a difference, the syscall would return non-zero and leave you to figure out what to do about it -- unfortunately, because there were only 2 machines voting, there was no way to know who was right and who was wrong -- the vendor's suggestion was to a: reboot or b: "just choose one and hope you guessed right". Wildly broken system...
I cannot find any of my docs on the system that I was originally talking about, but it was 3 PPC cores in a single package -- there was built-in hardware to keep them synchronized and voting. AFAIR, it was a drop-in replacement for the "normal" version of the same device, modulo the power draw.
Maxwell Technologies makes a triple modular redundant cPCI board with SOI processor and rad-tolerant FPGAs that is really nice -- somewhere I think I still have a stash of them...
NB: The above mentions 10BASE-2 and cPCI (which will fit in certain vendors' hardware) which *just* managed to keep this on-topic -- hopefully :-)
W
-- If the bad guys have copies of your MD5 passwords, then you have way bigger problems than the bad guys having copies of your MD5 passwords. -- Richard A Steenbergen
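As a rough illustration of the voting point in the message above (two machines can only notice that they disagree, while three can outvote a single bad result), here is a minimal Python sketch. The function name `vote` and the sample values are invented for illustration; this does not model any of the products mentioned in the thread, which did their voting in hardware.

```python
from collections import Counter
from typing import Optional, Sequence


def vote(outputs: Sequence[int]) -> Optional[int]:
    """Return the value produced by a strict majority of replicas.

    Returns None when no value has a strict majority, i.e. the
    disagreement is detected but cannot be resolved.
    """
    if not outputs:
        return None
    value, count = Counter(outputs).most_common(1)[0]
    return value if count * 2 > len(outputs) else None


good = 0b1011_0010          # hypothetical correct result
flipped = good ^ (1 << 5)   # same result with a single bit flipped

# Three replicas: the two healthy ones outvote the corrupted one.
print(vote([good, good, flipped]))

# Two replicas: they disagree, and there is no way to tell who is right.
print(vote([good, flipped]))
```

With three replicas the call returns the uncorrupted value; with two it returns None, which is the "just choose one and hope you guessed right" situation described above.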
On Thu, 19 Apr 2007 12:00:53 -0400 Warren Kumari <warren@kumari.net> wrote:
There was also an issue where one of the large manufacturers of (binary) CAMs received a batch of polyimide that was contaminated with an alpha emitter (for some reason thorium oxide springs to mind) and their quality control didn't catch it... As far as I know the problem was identified before any products with the CAMs were shipped, but I had an order held up while the vendor tried to source alternate parts...
Contamination by alpha emitters was a major problem some years ago. Manufacturers had to change their formulations to avoid the problem. --Steve Bellovin, http://www.cs.columbia.edu/~smb
I don't have the reference to hand, but with Cisco the crash reason hinted at something very odd, which was either a hardware failure or a cosmic ray - I think it was a parity error or something similar. I remember this because I had such a reload and it was during a period of heavy cosmic activity.. as the hardware had always been reliable and was reliable afterwards, this was believed to be the cause
Steve
On Thu, Apr 19, 2007 at 10:17:49AM -0400, Robert E. Seastrom wrote:
With certain susceptible Sun CPUs which were popular during the last sunspot maxima, this was actually demonstrably true (and acknowledged by Sun), so don't laugh too hard.
---rob
Leigh Porter <leigh.porter@ukbroadband.com> writes:
Somebody from a certain large network vendor actually blamed problems with their kit on cosmic rays causing memory corruption...
-- Leigh Porter
Jay Hennigan wrote:
Andre Oppermann wrote:
Audie Onibala wrote:
Yesterday on 04/16/07 between 3:00 - 3:45 PM we had sporadic Internet problem. Our ISP's are Sprint and Qwest.
Around that time there was quite a bit sunspot activity and the moon had an unusual position too. The NOC contacts of your ISP's probably may be of more specific help. But make sure to ask them for their networks SPF (sunspot protection factor). That's an important metric to qualify their network reliability.
Are you sure it was sunspots? My NOC contacts were seeing substantial memory corruption due to cosmic rays.
-- Jay Hennigan - CCIE #7880 - Network Engineering - jay@impulse.net Impulse Internet Service - http://www.impulse.net/ Your local telephone and internet company - 805 884-6323 - WB6RDV
Hi Steve, steve@telecomplete.co.uk (Stephen Wilcox) wrote:
I remember this because I had such a reload and it was during a period of heavy cosmic activity.. as the hardware had always been reliable and was reliable afterwards, this was believed to be the cause
We have also started to use this as the standard excuse. Up to now, people believe us... Cheers, Elmi. -- "Hinken ist kein Mangel eines Vergleichs, sondern sollte als wesentliche Eigenschaft von Vergleichen angesehen werden." (Marius Fränzel in desd) --------------------------------------------------------------[ ELMI-RIPE ]---
I remember this because I had such a reload and it was during a period of heavy cosmic activity.. as the hardware had always been reliable and was reliable afterwards, this was believed to be the cause
We have also started to use this as the standard excuse. Up to now, people believe us...
Well, there is some documentation on Cisco containing references to cosmic rays and parity errors:
http://www.cisco.com/en/US/products/hw/routers/ps341/products_tech_note09186...
Cisco 7200 Parity Error Fault Tree
"As with all computer and networking devices, the NPE is susceptible to the rare occurrence of parity errors in processor memory. Parity errors may cause the system to reset and can be a transient Single Event Upset (SEU or soft error) or can occur multiple times (often referred to as hard errors) due to damaged hardware. SEUs or soft errors are caused by "noise" most frequently due to high-energy neutrons generated in the atmosphere by cosmic rays. For more information on SEUs, refer to the Increasing Network Availability page.
[...]
Even if systems use Error Code Correction (ECC), it is still possible to see an occasional parity error when more than a single error has occurred in the 64 bits of data due to cosmic rays affecting more than one memory cell, or a hard error in the cache."
Regards, Daniele.
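To make the parity idea in the quoted note a little more concrete, here is a minimal Python sketch of a single even-parity bit over a 64-bit word. This is an assumption-laden toy: it is plain parity, deliberately simpler than the ECC the note mentions, and the names `parity64` and `parity_ok` as well as the sample word are hypothetical. It only shows why one flipped bit is caught while a second flip in the same word can go unnoticed by a scheme this simple.

```python
MASK64 = (1 << 64) - 1


def parity64(word: int) -> int:
    """Even-parity bit for a 64-bit word: 1 if the count of set bits is odd."""
    return bin(word & MASK64).count("1") & 1


def parity_ok(word: int, stored_parity: int) -> bool:
    """True if the word still matches the parity bit stored alongside it."""
    return parity64(word) == stored_parity


word = 0x0123456789ABCDEF          # hypothetical memory contents
stored = parity64(word)            # parity bit written with the word

one_flip = word ^ (1 << 17)                 # single event upset
two_flips = word ^ (1 << 17) ^ (1 << 42)    # two cells hit in one word

print(parity_ok(word, stored))       # unchanged word passes the check
print(parity_ok(one_flip, stored))   # single flip changes the parity
print(parity_ok(two_flips, stored))  # double flip restores the parity
```

A single parity bit can only detect an odd number of flipped bits, and cannot say which bit flipped. Real ECC memory uses more check bits so that single-bit errors can be corrected and double-bit errors at least reported, which fits the note's remark that multi-bit upsets can still surface as parity errors.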
On Fri, Apr 20, 2007 at 04:52:04PM +0200, Daniele Arena wrote:
I remember this because I had such a reload and it was during a period of heavy cosmic activity.. as the hardware had always been reliable and was reliable afterwards, this was believed to be the cause
We have also started to use this as the standard excuse. Up to now, people believe us...
Well, there is some documentation on Cisco containing references to cosmic rays and parity errors:
http://www.cisco.com/en/US/products/hw/routers/ps341/products_tech_note09186...
Cisco 7200 Parity Error Fault Tree
"As with all computer and networking devices, the NPE is susceptible to the rare occurrence of parity errors in processor memory. Parity errors may cause the system to reset and can be a transient Single Event Upset (SEU or soft error) or can occur multiple times (often referred to as hard errors) due to damaged hardware. SEUs or soft errors are caused by "noise" most frequently due to high-energy neutrons generated in the atmosphere by cosmic rays. For more information on SEUs, refer to the Increasing Network Availability page.
Yup, that's the reference I was referring to.. we indeed had a single event upset on an NPE :) Steve
On Thu, 2007-04-19 at 10:58 +0100, Leigh Porter wrote:
Somebody from a certain large network vendor actually blamed problems with their kit on cosmic rays causing memory corruption...
Right. I get that answer quite often. We've made a little spinner that has "Upgrade software", "Random radiation", and "We've never supported that feature". It's proven to be fairly accurate when opening cases with this vendor's tech-support organization. -- Daniel J McDonald, CCIE # 2495, CISSP # 78281, CNX Austin Energy http://www.austinenergy.com
On Thu, 19 Apr 2007, Leigh Porter wrote:
Somebody from a certain large network vendor actually blamed problems with their kit on cosmic rays causing memory corruption...
In point of fact, it seems like it's the fall-through for their technical assistance center's answer tree if all else fails. Quite funny.
participants (16)
- Andre Oppermann
- Audie Onibala
- Chris L. Morrow
- Daniel J McDonald
- Daniele Arena
- David Temkin
- Douglas Otis
- Elmar K. Bins
- Jay Hennigan
- Leigh Porter
- Marshall Eubanks
- Robert E. Seastrom
- Stephen Wilcox
- Steven M. Bellovin
- tony sarendal
- Warren Kumari