Re: Teaching/developing troubleshooting skills
It's also important that one avoid:
* The faulty assumption there is but one problem
* Incorrectly-formed causal relationships (NANOG-L has some examples of these)
* Making too many changes in one iteration
* Attempting to tackle a system with more unknowns than are absolutely necessary.

These words should be hanging on a wall in every IT department. You wouldn't believe how many times I've had to gently correct someone because of these mistakes, particularly the first two.

John
It's also important that one avoid:
* The faulty assumption there is but one problem
Here's an interesting example that I came across several years ago. It was in an office with lots of PCs plugged into RJ45 10baseT ports near each desk. One PC had lost connectivity.

I came and checked that the software was installed and running. Probably did something like ping 127.0.0.1 to satisfy myself that it wasn't a problem on the PC itself. Then I unplugged the cable from the RJ45 port in the wall and tried another port. It still did not work. I swapped in a new cable and it worked fine.

Most people would stop right there, but I followed up and tested the existing cable in the lab. It worked just fine. Why did it not work before? There must be some problem with the switch or the wall wiring and somehow two RJ45 ports did not work. After a bit of poking and discussions with the employee at that desk, it turned out that the cable lay in a bad spot and often got caught on her foot as she rushed off somewhere. It turns out that the little metal pins inside the RJ45 socket had been bent. It was just sheer luck that swapping the cable caused contact to be made again. And the second socket was also bent. When that one ceased to work the employee had swapped cables themselves.

The real solution was to replace both sockets and install a longer patch cable that could be placed where feet would not get caught up in it.

Troubleshooting is made easier by methodically doing the work and following through. If I had not had the lab handy I probably would have swapped the "bad" cable back in to verify that "trouble" accompanied the cable. But it is also easier to troubleshoot when you have a stock of interesting war stories in your memory to encourage you to "think outside the box". It's the blend of creativity and methodical work practices that makes a good troubleshooter, technical or otherwise.

--Michael Dillon
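The swap tests in this story can be written down as a small consistency check. What follows is only a sketch: the component names (`old_cable`, `socket_a`, `socket_b`, `new_cable`, `lab_port`) and the helper functions are made up for illustration, and the model assumes a deliberately naive rule, namely that a link fails if and only if it includes a permanently faulty part. When no set of faulty parts fits every observation, that is a hint the causal model itself is wrong (here, bent pins making intermittent contact).

```python
from itertools import combinations

# Each observation: (parts in the path, did the link work?).
# Names are hypothetical labels for the parts in the story.
observations = [
    ({"old_cable", "socket_a"}, False),  # original failure at the desk
    ({"old_cable", "socket_b"}, False),  # tried another wall port
    ({"new_cable", "socket_a"}, True),   # swapped in a new cable
    ({"old_cable", "lab_port"}, True),   # follow-up test in the lab
]

def consistent(faulty, obs):
    """Naive model: a path works iff it contains no faulty part."""
    return all((not (parts & faulty)) == worked for parts, worked in obs)

def explanations(obs, max_faults):
    """All sets of up to max_faults parts that fit every observation."""
    parts = sorted(set().union(*(p for p, _ in obs)))
    return [
        set(combo)
        for n in range(1, max_faults + 1)
        for combo in combinations(parts, n)
        if consistent(set(combo), obs)
    ]

# Stopping after the cable swap, "the cable is bad" looks like the answer.
early = explanations(observations[:3], max_faults=1)

# With the lab test included, no combination of hard faults fits at all.
full = explanations(observations, max_faults=2)
```

Here `early` contains only `{"old_cable"}`, while `full` is empty: the follow-up test is what exposes that the tidy single-fault explanation does not hold.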
Michael.Dillon@radianz.com wrote:
| It's the blend of creativity and methodical work practices that makes
| a good troubleshooter, technical or otherwise.

You've described Closed Loop Corrective Action to a tee. It's not enough to know what the problem is; you also need to know how to correct it and what to do to prevent it in the future.

--
bep
* The faulty assumption there is but one problem
* Incorrectly-formed causal relationships

Mythology. Some may recall the adventures of the CTO who ran a sweep of net 10.* in a rather modest machine room somewhere in Maine, resulting in memory exhaustion (ARP table) in the router, which in turn resulted in RFC 1918 leakage into public address space. The operational mythology of the ever-so-security-minded-yolks was that the initial and very poorly understood presenting problem was an external act of malice, rather than a self-inflicted DoS by the security-yolk itself.

I've seen many people struggle to fit what little they know into a predefined mythos of what could be happening, rather than starting like Sgt. Schultz, who "knew nothing", at least until he really _knew_ it.

Eric
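For a rough sense of scale, the arithmetic behind a sweep of net 10.* can be sketched as below. The table capacity is a made-up figure for illustration, not the limit of any particular router, and the sketch assumes the worst case in which every probed address leaves an entry (complete or not) in the router's ARP table.

```python
import ipaddress

# Hypothetical ARP table capacity; real limits vary by platform.
ARP_TABLE_CAPACITY = 8192

def sweep_size(cidr):
    """Number of addresses a full sweep of the given network would touch."""
    return ipaddress.ip_network(cidr).num_addresses

def oversubscription(cidr, capacity=ARP_TABLE_CAPACITY):
    """How many times over the sweep could fill a table of that capacity."""
    return sweep_size(cidr) / capacity

targets = sweep_size("10.0.0.0/8")        # 2**24 addresses
ratio = oversubscription("10.0.0.0/8")    # sweep size relative to the table
```

Under those assumptions the sweep covers 16,777,216 addresses, about 2048 times the assumed table size, which is ample for a self-inflicted outage with no outside attacker involved.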
participants (4)
- Bruce Pinsky
- Eric Brunner-Williams
- John Neiberger
- Michael.Dillon@radianz.com