On Sep 11, 2012, at 4:53 PM, Rubens Kuhl <rubensk@gmail.com> wrote:
> That doesn't mean that their description of the internal error fits what happened.
Any time I've seen a real RFO, it has taken more than 24 hours to collect data. Sometimes you actually don't know what happened. There's a reason for this comic: http://www.dilbert.com/strips/comic/1999-08-04/ (the reboot cleared the problem).

I've seen many odd behaviors of devices that nobody could explain, including the vendors. Sometimes it takes a few years to understand what happened. I recall a case where, 2-3 years after a major outage, someone made a minor comment about their architecture and a light came on.

I welcome more information about mistakes and errors that we can all learn from. Sharing that information can be hard or uncomfortable at times, but it can help others learn and avoid making the same mistakes.

I took the recommendation of others and have started to read "Normal Accidents". Amazon link: http://tinyurl.com/9dc6x98

The whole multiple-failures problem really makes me concerned about cascading system failures when things go wrong.

- Jared