I promised to summarize responses to my query:
So who actually measures their network performance and how?
As most responses were private, I have removed attribution. Thanks to all constructive respondents. I have proposed a survey panel for the next NANOG if we do not exhaust the subject beforehand.

randy

- - - - - - - - - - - - - - c u t h e r e - - - - - - - - - - - - - -

We do SNMP polling every 15 minutes at SESQUINET on every line over which we have administrative control and over every peering point. We produce daily reports on errors and usage. We are getting ready to switch to Vulture or NetScarf (or some combo) to give us more interactive information.

- - - - - - - - - - - - - - c u t h e r e - - - - - - - - - - - - - -

We perform measurement of certain basic network parameters, such as usage (bandwidth used / total bandwidth) and line error rates on all of our non-customer links. We perform CPU usage, memory usage, and environmental monitoring of all our routers. We also measure line usage and error rate on all customer lines. We monitor all of our customers' routers unless they say otherwise, and notify them of any problems. Finally, we monitor select points throughout the Internet (root name servers, etc.) four times an hour using pings.

We accomplish this monitoring using an in-house built package that uses SNMP, traceroute, and ping to provide graphs and tabular statistical information. We use Cabletron's Spectrum for a quick network overview.

- - - - - - - - - - - - - - c u t h e r e - - - - - - - - - - - - - -

We do. SNMP MIB-II stuff, plus the cflowd stuff and something we call 'mxd' (measures round trip times, packet loss and potential reason, etc. from a whole bunch of different points in our network to a bunch of other points in our network... we use it to create delay matrices, packet loss reports and other reports). There are some other things, but these are the biggies.

I was hoping to get our mxd developer to present at NANOG, but she was unable to attend and is sort of the shy type (too bad, she's one of our better people). Maybe I can throw together some information on what we measure, and why, for a bullet or two at ISMA. If there's any interest, that is.

The mxd thing was originally just sort of a toy for neat reports, but in the last year it's become a critical tool for measuring delay variance for one of our VPDN customers that does real-time video stuff (and is to some extent helping us figure out where we've got delay jitter and why; on the other hand it's also raising more questions :-)).

- - - - - - - - - - - - - - c u t h e r e - - - - - - - - - - - - - -

Since most of my professional career has been in the enterprise world, I can offer you what we used to measure availability to our mail servers, web servers, DNS servers, etc., at one of my previous employers. We employed several application tests, along with network performance tests.

Our primary link was via UUnet, a burstable T1. We purchased an ISDN account from another local provider, who wasn't directly connected to UUnet. Probably a good example of a joe-average-user out there. Every 5 minutes, we measured round-trip response times to each of the servers and the gateway router (via ping) and recorded them. We also had application tests, such as DNS lookups on our servers, timing sendmail test mails to a /dev/null account, and time to retrieve the whole home page. We trended the results into graphs and used them.

This wasn't meant to be a really great performance monitoring system; it was actually meant to 1) check how our availability looked from a "joe user" perspective on the net (granted, reachability/availability wasn't perfect because it was only one point in the net) and 2) look at response time trends / application trends to see if our hardware/software was cutting it.

- - - - - - - - - - - - - - c u t h e r e - - - - - - - - - - - - - -

We use a traffic flow monitoring system from Kaspia Systems (www.kaspia.com). The Kaspia product collects all sorts of data from router ports and RMON probes, stores the data and performs various trend analysis. We collect traffic flow, router CPU usage and router memory information plus various errors. There is a data reduction process which runs once a day, and a very nifty web interface. The product isn't cheap, but the system definitely fills a void here.

Maybe I should organize a talk on what we're doing with it for an upcoming NANOG? As an old instrumentation engineer, I think the basis of our use of the tool is pretty solid. Plus, I actually developed a means for calibration of the accuracy of the flow data. Haven't had time yet to work out a validation for the trends, but I'll get to it one of these decades.

Also, the Kaspia people will give you a thirty day trial on their product at no charge.

- - - - - - - - - - - - - - c u t h e r e - - - - - - - - - - - - - -

It's brain-dead simple, and probably not of real interest to you, but we keep a few basic stats, going back about two years...

For non-intrusive stuff, we keep a log of all interface status changes on our routers, and we pull five-minute byte-counts inbound and outbound on each interface, which we graph against port speed. Watching the graphs for any sort of clipping of peaks gives a pretty good indication of problems, and watching for shifts of traffic between ports on parallel paths likewise.

As far as intrusive testing, we do a three-packet min-length ping to the LAN-side port of each of our customers' routers once each five minutes, and follow that up with additional attempts if those three are lost. We log latency, and if we have to follow up with a burst, we log loss rate from the burst.

Pinging through to the LAN port obviously lets us know when CPE routers konk out, as occasionally we see hung routers that still have operational WAN ports talking to us; likewise, simply watching VC-state isn't a reliable enough indicator of the status of the remote router. Plus it tells you if the customer has kicked the Ethernet transceiver off their equipment, for instance. Wouldn't matter to you, probably, but our demarc is all the way out at their WAN port, since we own and operate our customers' CPE.

I think a bit about what more we could be doing; flows-analysis and whatnot... It's nice to think about, and eventually we'll get around to it, but programmer-time is relatively precious, and other things have higher priority, since the current system works and tends to tell us most of what we seem to need to know to provide decent service.

- - - - - - - - - - - - - - c u t h e r e - - - - - - - - - - - - - -

we do. in fact, we place quite a bit of emphasis on network stats. currently we have about 3 years of stats online, and are working on converting our inhouse engine to an rdbms so we can more easily perform trend analysis.

besides kaspia, other commercial packages include trendsnmp (www.desktalk.com) and concord's packages (www.concord.com). our inhouse stuff is located at http://netop.cc.buffalo.edu/ if you are curious about what we do.

- - - - - - - - - - - - - - c u t h e r e - - - - - - - - - - - - - -

We're neanderthals right now - we use a hacked rcisco to feed data to nocol. We watch bandwidth (separately as well) on key links - and also watch input errors and interface transitions (for nocol) - all done with perl and expect-like routines, parsing 'sho int's every few minutes. Emergency stuff goes through nocol; bandwidth summaries are mailed to interested parties overnight.

- - - - - - - - - - - - - - c u t h e r e - - - - - - - - - - - - - -

We are now running the MRTG package, which generates some fancy graphics, but in my opinion these graphics are useless, and looking in detail at some of the reports, they are not accurate. Several of our clients request the raw data, but this package only maintains the little raw data needed to generate the graphs, so it is useless for that as well.

In the past we also used to have a kind of ASCII report (Vikas wrote some of the scripts and programs) generated from information obtained using the old SNMP tool set developed by NYSERNet, but I guess that nobody maintained the config files, and I believe that the SNMP library routines used aren't working properly.

So I need to invest some time to provide a fast solution to this. I'd appreciate your help identifying a useful package, or directions on how to generate some good-looking and consistent reports.

- - - - - - - - - - - - - - c u t h e r e - - - - - - - - - - - - - -

We have been using the MRTG package which is basically a special SNMP agent that queries the routers for stats and then does some nice graphing of the data on the web.

- - - - - - - - - - - - - - c u t h e r e - - - - - - - - - - - - - -

SNMP queries with a heavily-modified version of MRTG from the nice guy in Germany. Works very nicely. We have recently installed NetScarf 2.0, and are contemplating merging NetScarf 3.0 with the MRTG front end.

- - - - - - - - - - - - - - c u t h e r e - - - - - - - - - - - - - -

I'm researching whether I can rewrite Steve Corbato's fastpoll program using the fastsnmp library from the NetScarf people. I think this will allow fastpoll to scale better. I've successfully written a quick C program that uses the library to collect the required data for a router -- now I've just got to make it so we can manage it easily (i.e. auto-generated config files from our databases).

My goal is to be able to collect 1-2 minute period data on all links that are greater than 10 Mbps -- 15 minute data for everything else. The 2 minute collection period will allow us to scale up to 280Mbps before experiencing two counter roll-overs within a polling interval. Hopefully that will hold us over until the interface counters are available as Counter64 objects via SNMPv2 (if that ever happens).

BTW -- what fastpoll collects now is ifInOctets, ifOutOctets, ifInUcastPkts, ifOutUcastPkts, ifInErrors and ifOutDiscards. Rather than storing the raw counters, it calculates the rate by taking the delta and dividing by the period. Getting the accurate period is actually the hard part -- I'm having SNMP send me the uptime of the router in each query and using that to calculate the interval between polls and to detect counter resets due to reboots.

The other trick to handle is the fact that, while IOS updates the SNMP counters for process-switched packets as they are routed, it looks like the counter for SSE switched packets on C70X0 routers only gets updated once every 10 seconds.

I'll let you know how my tests come out.

-30-
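
For reference, the rate calculation described in the last response can be sketched roughly as follows. This is only an illustration in Python, not fastpoll's actual code: the names Poll, octet_rate_bps and max_unambiguous_bps are made up here, and it assumes a 32-bit ifInOctets counter plus a sysUpTime value (hundredths of a second) fetched in the same query.

```python
from dataclasses import dataclass
from typing import Optional

COUNTER32_MODULUS = 2 ** 32   # a Counter32 wraps to 0 after 2^32 - 1
TICKS_PER_SECOND = 100        # sysUpTime counts hundredths of a second


@dataclass
class Poll:
    """One sample (hypothetical structure, not from fastpoll)."""
    sys_uptime_ticks: int     # router uptime returned with the counters
    if_in_octets: int         # raw ifInOctets value


def octet_rate_bps(prev: Poll, curr: Poll) -> Optional[float]:
    """Bits per second between two polls, or None if uptime did not advance.

    Uptime going backwards is treated as a reboot (counters reset), so
    the sample is discarded. A reboot where the new uptime already
    exceeds the previous one would not be caught by this simple check.
    """
    elapsed_ticks = curr.sys_uptime_ticks - prev.sys_uptime_ticks
    if elapsed_ticks <= 0:
        return None
    # Modular subtraction corrects a single counter wrap; it is only
    # right if fewer than 2^32 octets passed during the interval.
    delta = (curr.if_in_octets - prev.if_in_octets) % COUNTER32_MODULUS
    return delta * 8 / (elapsed_ticks / TICKS_PER_SECOND)


def max_unambiguous_bps(period_seconds: float) -> float:
    """Highest sustained rate at which the counter wraps at most once."""
    return COUNTER32_MODULUS * 8 / period_seconds


if __name__ == "__main__":
    before = Poll(sys_uptime_ticks=1000, if_in_octets=4294967000)
    after = Poll(sys_uptime_ticks=13000, if_in_octets=704)  # wrapped once
    print(octet_rate_bps(before, after))
    print(max_unambiguous_bps(120) / 1e6, "Mbps")
```

With a 120 second period the limit works out to 2^32 * 8 / 120, or about 286 Mbps, which is in line with the roughly 280Mbps figure quoted above.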
participants (1)
-
randy@psg.com