Re: Teaching/developing troubleshooting skills
Pete Kruckenberg <pete@kruckenberg.com> 6/24/04 5:09:19 PM >>>
It's been so long since I learned network troubleshooting techniques I can't remember how I learned them or even how I used to do it (so poorly).
Does anyone have experience with developing a skills-improvement program on this topic?
I find that it's helpful to teach troubleshooting in two stages:

1) Define the problem.
2) Isolate the problem.

For stage one, teach them the basic skillset needed to define the issue in a general way based on available information. Is a circuit obviously down? Are certain destinations unreachable? Are *all* destinations unreachable? Is network access slow? You get the picture.

Once the nature of the problem is determined, I find that a layered approach to troubleshooting is helpful and that is what I teach to others. The exact order of steps might change based on information learned in stage one, but generally I work my way up the OSI model.

If the problem could possibly be caused by a physical layer issue, try to confirm or rule that out. Check the circuits for errors, bouncing links, indications of mismatched clocking configurations, faulty CSU/DSUs, faulty router interfaces, or bad cabling.

If all of that appears to be okay then I consider the datalink layer. Could the problem defined in stage one be caused by a datalink layer issue? Was the encapsulation changed on a router interface? If frame relay, is the router seeing LMI from the frame relay switch? Is there evidence of dropped frames completely within the cloud? (Granted, that's not necessarily datalink layer, but it is a separate 'administrative' layer if it's out of your control.) I'm sure you can think of a number of other examples.

Could the problem defined in stage one be the result of a network layer issue? Is there evidence of a routing loop? Do the devices involved have routing tables that appear to be correct? What do traceroutes and pings show? Teach them to go hop-by-hop and verify that everything appears as it should, starting with the device closest to the problem if it's possible to narrow it down that far.

If routing is determined to be correct, could this be a transport layer issue? Is it possible that an access list or firewall somewhere is blocking only certain types of traffic? Does the problem only involve HTTP? SMTP? Is there policy routing involved that might be redirecting certain types of traffic to the wrong destination?

Were there *any* recent configuration changes? If so, what were they? Find out, because they might be the cause of the problem.

This is the general framework I use for troubleshooting and that's how I've taught the people that work with me. It's constantly evolving and, of course, the specific steps taken depend on the nature of the issue, but I find that it helps to have a good foundational troubleshooting framework.

John
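For anyone who wants to turn the layered approach above into a teaching aid, here is a rough sketch of it in Python. It is only an illustration: the Check class, the isolate function, and the observation keys (interface_errors, lmi_seen, routing_loop, acl_blocks_traffic) are all made-up names, and the checks are a small sample of the questions in the message, not a complete list.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional


@dataclass
class Check:
    layer: int                                   # OSI layer number (1 = physical)
    question: str                                # what the engineer is asking
    looks_ok: Callable[[Dict[str, bool]], bool]  # True if nothing suspicious was found


# Hypothetical observation keys. In practice the answers would come from
# interface counters, routing tables, traceroutes, and so on.
CHECKS: List[Check] = [
    Check(1, "Circuit errors, bouncing links, or bad cabling?",
          lambda obs: not obs.get("interface_errors", False)),
    Check(2, "Encapsulation changed, or LMI missing on frame relay?",
          lambda obs: obs.get("lmi_seen", True)),
    Check(3, "Routing loop or incorrect routing table?",
          lambda obs: not obs.get("routing_loop", False)),
    Check(4, "Access list or firewall blocking only some traffic?",
          lambda obs: not obs.get("acl_blocks_traffic", False)),
]


def isolate(observations: Dict[str, bool],
            checks: List[Check] = CHECKS) -> Optional[Check]:
    """Walk the checks from the lowest layer up and return the first that fails."""
    for check in sorted(checks, key=lambda c: c.layer):
        if not check.looks_ok(observations):
            return check
    return None  # nothing suspicious at any layer that was checked


if __name__ == "__main__":
    # Two things look wrong here; the lower layer is reported first.
    suspect = isolate({"lmi_seen": False, "acl_blocks_traffic": True})
    if suspect is not None:
        print(f"Start at layer {suspect.layer}: {suspect.question}")
```

The point of sorting by layer is the same as in the message: a problem at a lower layer is investigated before anything above it, even when a higher-layer symptom is also present.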
participants (1)
-
John Neiberger