Here is a summary of a recent query I posted to the list.

Reportable Metrics:

Original Query: to identify a list of network metrics to compile and report to management on a monthly basis. (Emphasis is on metrics and not the tools used to gather them.)

My original list:
1. Uptime per WAN or Internet circuit
2. Number and average length of outages
3. Bandwidth utilization per WAN/Internet circuit and "important" VLANs
4. Overall network latency: RTT measured from various parts of the network (Cisco IPM) to various other parts
5. Top talkers per WAN circuit
6. Top destinations per WAN circuit
7. Top 10 most utilized WAN circuits (% burst above CIR, etc.)
8. Protocol distribution per WAN circuit
9. Syslog/Sniffer alarms by severity
10. Application response time for key apps (e.g., SAP, HTTP)
11. Security incidents
12. TACACS reports on number of logins, changes, etc.
13. Bandwidth/latency trending

Vince Mulhollan's additions:
1. Acceptable use policy violations
2. Number/severity of externally reported abuse complaints
3. IOS deployments: upgrades, schedules, risks/rewards of the IOS versions in use
4. Installations completed: hardware and circuit changes. Time to implement each.
5. RFO: reasons for outages and remedies employed
6. Employees' % of time spent on upgrades, installs, security incidents, etc. Is headcount in line with workload?
7. Any hardware-related trends, i.e., particular devices burning out frequently, etc. Establish a loose figure for the likelihood of failure per type of device.

Joe St Sauver's recommendations:
1. Don't overwhelm management with a large quantity of data.
2. Implement "management by exception" by tracking/reporting "material statistical deviations from expected values wherever possible."
3. "The other key concept is to give management gauges that will help them drive the plane, rather than historical data that will tell them when/where/how badly they crashed (last month). E.G., make the data timely and operationally relevant."
   a. What's broken?
   b. Where am I vulnerable?
   c. Where am I running out of capacity?
   d. Where do I have performance problems?
   e. What are we doing really well?
   f. Where can I increase my return on already deployed assets? (e.g., where do I have underutilized capacity?)
   Look longitudinally (over time), geographically (spatially), and at snapshots (cross sectionally).
4. Focus on downtime as opposed to uptime, and only report outages that exceed some acceptable threshold. Focus on the causes of outages, the responses to those outages, and whether there are ongoing problems in solving the issues.
5. Tie all measurements to the realities of the business: stats that bear out billing expenses, those that might help marketing, those that help in planning, etc.
6. Dial-ins
7. A list of URLs for more ideas
   a. Compare and contrast:
      http://hydra.uits.iu.edu/~abilene/traffic/
      http://monon.uits.iupui.edu/abilene/dnvr.html
      http://monon.uits.iupui.edu/abilene/dnvr/index.html
      http://monon.uits.iupui.edu/abilene/dnvr/uoreg-bits.html
      http://www.itec.oar.net/abilene-netflow/
   b. Latency, packet loss, route changes:
      i. http://amp.nlanr.net/active/amp-uoregon/HPC/body.html
      ii. http://www.advanced.org/surveyor/
      iii. http://www.caida.org/cgi-bin/skitter_summary/main.pl
      iv. http://www.ncne.nlanr.net/nimi/
   c. Possible breakdown on top WAN talkers:
      i. Single flow? Aggregate traffic? By protocol? By port? Per dotted quad? Per network block? Per ASN? Measured by octets? Flow count? Flow duration? From flow data? Passive monitoring with OCxMON type tools? Privacy issues?
      ii. Sample report: http://www.canet3.net/stats/reports.html
8. :-) "Keep it brief."

Iljitsch van Beijnum:
1. Interface stats
   a. CRC errors (helps identify lower-layer problems)
   b. Collisions, if you use any non-switched Ethernet
2. Router CPU load

Joe Provo:
1. Errors: CRCs, queue drops/depth (e.g., RED vs. tail-drop queues)
2. Routing protocol transitions where relevant, e.g., BGP route table size
3. "You should look at SNIPS [formerly NOCOL] for examples of good stuff to monitor [& therefore trend]."

David Newman:
1. For delay-sensitive traffic, include:
   a. jitter (latency variation)
   b. histograms (latency distribution)

Thanks to everyone who responded! If you have more suggestions for this list, please email me directly.

-BM
Murphy, Brennan
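
For anyone who wants to try the first two items of the original list (uptime per circuit, number and average length of outages) together with the "report only what exceeds a threshold" suggestion, here is a minimal sketch in Python. It is not from any of the respondents. The circuit names, the outage figures, the 30-day period, and the 60-minute threshold are all made up for illustration, and the sketch assumes outage durations have already been collected by whatever tool you use.

```python
from dataclasses import dataclass


@dataclass
class Outage:
    circuit: str    # hypothetical circuit name
    minutes: float  # outage duration in minutes


def monthly_summary(outages, circuits, period_minutes):
    """Per circuit: total downtime, outage count, average outage length, uptime %."""
    summary = {}
    for circuit in circuits:
        durations = [o.minutes for o in outages if o.circuit == circuit]
        downtime = sum(durations)
        count = len(durations)
        summary[circuit] = {
            "downtime_minutes": downtime,
            "outages": count,
            "avg_outage_minutes": downtime / count if count else 0.0,
            "uptime_pct": 100.0 * (period_minutes - downtime) / period_minutes,
        }
    return summary


def exceptions(summary, max_downtime_minutes):
    """Keep only the circuits whose total downtime exceeds the threshold."""
    return {
        circuit: stats
        for circuit, stats in summary.items()
        if stats["downtime_minutes"] > max_downtime_minutes
    }


# Illustrative data only: a 30-day month and three made-up circuits.
PERIOD_MINUTES = 30 * 24 * 60
outage_log = [
    Outage("wan-nyc", 30),
    Outage("wan-nyc", 60),
    Outage("wan-lax", 5),
]
summary = monthly_summary(outage_log, ["wan-nyc", "wan-lax", "wan-chi"], PERIOD_MINUTES)
report = exceptions(summary, max_downtime_minutes=60)

for circuit, stats in report.items():
    print(circuit, stats)
```

Only the circuits in `report` would go to management; the full `summary` stays available for trending. The threshold is a policy decision, not something the code can choose for you.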