Re: measurement

9 Mar 1997

      I promised to summarize responses to my query
...
So who actually measures their network performance and how?
As most responses were private, I have removed attribution.  Thanks to all
constructive respondents.

I have proposed a survey panel for the next NANOG if we do not exhaust the
subject beforehand.

randy

- - - - - - - - - - - - - -   c u t   h e r e   - - - - - - - - - - - - - -

We do SNMP polling every 15 minutes at SESQUINET on every line over which we
have administrative control and on every peering point.  We produce daily
reports on errors and usage.

We are getting ready to switch to Vulture or NetScarf (or some combo) to
give us more interactive information.

- - - - - - - - - - - - - -   c u t   h e r e   - - - - - - - - - - - - - -

We perform measurement of certain basic network parameters, such as
usage (bandwidth used / total bandwidth) and line error rates on all of our
non-customer links.  We monitor CPU usage, memory usage, and environmental
conditions on all our routers.  We also measure line usage and
error rate on all customer lines.  We monitor all of our customers' routers
unless they say otherwise, and notify them of any problems.  Finally, we
monitor select points throughout the Internet (root name servers, etc.)
four times an hour using pings.

We accomplish this monitoring using an in-house package that uses SNMP,
traceroute, and ping to provide graphs and tabular statistical information.
We use Cabletron's Spectrum for a quick network overview.

- - - - - - - - - - - - - -   c u t   h e r e   - - - - - - - - - - - - - -

We do.  SNMP MIB-II stuff, plus the cflowd stuff and something we call
'mxd' (measures round trip times, packet loss and potential reason,
etc. from a whole bunch of different points in our network to a bunch of
other points in our network... we use it to create delay matrices,
packet loss reports and other reports).  There are some other things,
but these are the biggies.
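
As a rough illustration of the delay-matrix and packet-loss reports described
above (this is not the mxd code, which was not shared), a minimal Python
sketch might summarize probe results per source/destination pair.  The probe
point names, the sample RTTs, and the choice of the median as the summary
statistic are all assumptions made for the example.

```python
from collections import defaultdict
from statistics import median


def delay_matrix(samples):
    """Summarize probe results per (source, destination) pair.

    samples: iterable of (src, dst, rtt_ms); rtt_ms is None for a lost probe.
    Returns {(src, dst): {"median_ms": float or None, "loss": float}}.
    """
    by_pair = defaultdict(list)
    for src, dst, rtt_ms in samples:
        by_pair[(src, dst)].append(rtt_ms)

    matrix = {}
    for pair, rtts in by_pair.items():
        answered = [r for r in rtts if r is not None]
        matrix[pair] = {
            # Median RTT of the probes that came back, if any did.
            "median_ms": median(answered) if answered else None,
            # Fraction of probes that were lost.
            "loss": 1 - len(answered) / len(rtts),
        }
    return matrix


# Hypothetical measurements between two made-up probe points.
samples = [
    ("pop-a", "pop-b", 12.0),
    ("pop-a", "pop-b", 14.0),
    ("pop-a", "pop-b", None),
    ("pop-a", "pop-b", 13.0),
    ("pop-b", "pop-a", 11.0),
]

for pair, stats in sorted(delay_matrix(samples).items()):
    print(pair, stats)
```

Tracking delay variance, as mentioned for the real-time video customer, would
need a spread statistic per pair in addition to the median.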

I was hoping to get our mxd developer to present at NANOG, but she was
unable to attend and is sort of the shy type (too bad, she's one of our
better people).  Maybe I can throw together some information on what we
measure for a bullet or two at ISMA, and why.  If there's any interest,
that is.  The mxd thing was originally just sort of a toy for neat
reports, but in the last year it's become a critical tool for measuring
delay variance for one of our VPDN customers that does real-time video
stuff (and is to some extent helping us figure out where we've got delay
jitter and why; on the other hand it's also raising more questions :-)).

- - - - - - - - - - - - - -   c u t   h e r e   - - - - - - - - - - - - - -

Since most of my professional career has been in the enterprise world, I
can offer you what we used to measure availability to our mail servers,
web servers, DNS servers, etc., at one of my previous employers.

We employed several application tests, along with network performance
tests.  Our primary link was via UUnet, a burstable T1.  We purchased an
ISDN account from another local provider, who wasn't directly connected to
UUnet.  Probably a good example of a joe-average-user out there.

Every 5 minutes, we measured round-trip response times to each of the
servers and gateway router (via ping) and recorded it.  We also had
application tests, such as DNS lookups on our servers, timing sendmail
test mails to a /dev/null account, and time to retrieve the whole home
page.  We trended the results into graphs and used it 

This wasn't meant to be a really great performance monitoring system; it
was actually meant to 1) check how our availability looked from a "joe
user" perspective on the net (granted, reachability/availability wasn't
perfect because it was only one point in the net) and 2) look at response
time trends / application trends to see if our hardware/software was
cutting it.

- - - - - - - - - - - - - -   c u t   h e r e   - - - - - - - - - - - - - -

We use a traffic flow monitoring system from Kaspia Systems.
(www.kaspia.com) The Kaspia product collects all sorts of data from router
ports and RMON probes, stores the data and performs various trend analysis.
We collect traffic flow, router CPU usage and router memory information plus
various errors.  There is a data reduction process which runs once a day,
and a very nifty web interface.  The product isn't cheap, but the system
definitely fills a void here.

Maybe I should organize a talk on what we're doing with it for an upcoming
NANOG?  As an old instrumentation engineer, I think the basis of our use of
the tool is pretty solid.  Plus, I actually developed a means for
calibration of the accuracy of the flow data.  Haven't had time yet to work
out a validation for the trends, but I'll get to it one of these decades.

Also, the Kaspia people will give you a thirty day trial on their product at
no charge.

- - - - - - - - - - - - - -   c u t   h e r e   - - - - - - - - - - - - - -

It's brain-dead simple, and probably not of real interest to you, but we
keep a few basic stats, going back about two years...

For non-intrusive stuff, we keep a log of all interface status changes on
our routers, and we pull five-minute byte-counts inbound and outbound on
each interface, which we graph against port speed.  Watching the graphs for
any sort of clipping of peaks gives a pretty good indication of problems,
and watching for shifts of traffic between ports on parallel paths likewise.
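
The "clipping of peaks" check above is done by eye on the graphs.  A hedged
Python sketch of how the same idea could be automated follows; the 95%
threshold and the run length of three samples are assumptions, not values
from the respondent.

```python
def utilization(octets_delta, seconds, port_bps):
    """Fraction of port speed used over one polling interval."""
    return (octets_delta * 8) / (seconds * port_bps)


def looks_clipped(utilizations, threshold=0.95, min_run=3):
    """True if at least min_run consecutive samples are at or above threshold.

    A flat top at port speed over several five-minute samples suggests that
    demand exceeded the link, rather than a brief burst.
    """
    run = 0
    for u in utilizations:
        run = run + 1 if u >= threshold else 0
        if run >= min_run:
            return True
    return False


# Example: a T1 (1,544,000 bps) polled every 300 seconds.
T1_BPS = 1_544_000
print(utilization(57_900_000, 300, T1_BPS))
print(looks_clipped([0.40, 0.97, 0.99, 0.96, 0.50]))
```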

As far as intrusive testing, we do a three-packet min-length ping to the
LAN-side port of each of our customers' routers once each five minutes, and
follow that up with additional attempts if those three are lost.  We log
latency, and if we have to follow up with a burst, we log loss rate from the
burst.  Pinging through to the LAN port obviously lets us know when CPE
routers konk out, as occasionally we see hung routers that still have
operational WAN ports talking to us.  For the same reason, simply watching
VC-state isn't a reliable enough indicator of the status of the remote
router.  Plus
it tells you if the customer has kicked the Ethernet transceiver off their
equipment, for instance.  Wouldn't matter to you, probably, but our demarc
is all the way out at their WAN port, since we own and operate our
customers' CPE.
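
The ping-then-burst logic described above might be sketched in Python as
follows.  This is an illustration under stated assumptions: the follow-up
burst is triggered only when all three initial pings are lost, the burst size
of 10 is invented, latency is logged as the minimum RTT, and `send_ping` is a
hypothetical callable standing in for whatever actually sends the packet.

```python
def probe(send_ping, host, first=3, burst=10):
    """Run one polling cycle against a customer router's LAN-side port.

    send_ping(host) must return an RTT in ms, or None if the packet was lost.
    Returns the logged latency and, if a burst was needed, its loss rate.
    """
    rtts = [send_ping(host) for _ in range(first)]
    answered = [r for r in rtts if r is not None]
    if answered:
        return {"latency_ms": min(answered), "burst_loss": None}

    # All initial pings were lost: follow up with a burst and log its loss.
    burst_rtts = [send_ping(host) for _ in range(burst)]
    burst_ok = [r for r in burst_rtts if r is not None]
    return {
        "latency_ms": min(burst_ok) if burst_ok else None,
        "burst_loss": 1 - len(burst_ok) / burst,
    }


# Usage with a fake ping that always answers in 7.5 ms (192.0.2.1 is a
# documentation address, not a real router).
print(probe(lambda host: 7.5, "192.0.2.1"))
```

Passing the ping function in as a parameter keeps the decision logic testable
without sending real packets.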

I think a bit about what more we could be doing; flows-analysis and
whatnot...  It's nice to think about, and eventually we'll get around to it,
but programmer-time is relatively precious, and other things have higher
priority, since the current system works and tends to tell us most of what
we seem to need to know to provide decent service.

- - - - - - - - - - - - - -   c u t   h e r e   - - - - - - - - - - - - - -

we do. in fact, we place quite a bit of emphasis on network stats. currently
we have about 3 years of stats online, and are working on converting our
inhouse engine to an rdbms so we can more easily perform trend
analysis. besides kaspia, other commercial packages include trendsnmp
(www.desktalk.com) and concord's packages (www.concord.com). our inhouse
stuff is located at http://netop.cc.buffalo.edu/ if you are curious about
what we do.

- - - - - - - - - - - - - -   c u t   h e r e   - - - - - - - - - - - - - -

We're neanderthals right now - we use a hacked rcisco to feed data to nocol.
We watch bandwidth (separately as well) on key links - and also watch input
errors and interface transitions (for nocol) - all done with perl and
expect-like routines, parsing 'sho int's every few minutes.
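
The respondent's scripts are in perl and were not shared.  Purely to show the
kind of parsing involved, here is a small Python sketch.  The sample text is
an abbreviated, made-up excerpt modeled on typical IOS 'show interfaces'
output; real output has many more lines and varies between releases, so the
patterns are assumptions to adapt.

```python
import re

# Abbreviated, invented sample in the style of IOS 'show interfaces'.
SAMPLE = """\
Serial0 is up, line protocol is up
  5 minute input rate 1000 bits/sec, 2 packets/sec
  5 minute output rate 2000 bits/sec, 3 packets/sec
     12 input errors, 3 CRC, 0 frame, 0 overrun, 0 ignored, 0 abort
Ethernet0 is up, line protocol is down
  5 minute input rate 0 bits/sec, 0 packets/sec
  5 minute output rate 0 bits/sec, 0 packets/sec
     0 input errors, 0 CRC, 0 frame, 0 overrun, 0 ignored, 0 abort
"""

HEADER = re.compile(r"^(\S+) is (.+?), line protocol is (\S+)")
RATE = re.compile(r"minute (input|output) rate (\d+) bits/sec")
ERRORS = re.compile(r"(\d+) input errors")


def parse_show_interfaces(text):
    """Return {interface: {status, protocol, input_bps, output_bps, input_errors}}."""
    interfaces = {}
    current = None
    for line in text.splitlines():
        m = HEADER.match(line)
        if m:
            # An unindented header line starts a new interface section.
            current = {"status": m.group(2), "protocol": m.group(3)}
            interfaces[m.group(1)] = current
            continue
        if current is None:
            continue
        m = RATE.search(line)
        if m:
            current[m.group(1) + "_bps"] = int(m.group(2))
            continue
        m = ERRORS.search(line)
        if m:
            current["input_errors"] = int(m.group(1))
    return interfaces


for name, stats in parse_show_interfaces(SAMPLE).items():
    print(name, stats)
```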

Emergency stuff goes through nocol; bandwidth summaries are mailed to
interested parties overnight.

- - - - - - - - - - - - - -   c u t   h e r e   - - - - - - - - - - - - - -

We are now running the MRTG package, which generates some fancy graphics,
but in my opinion these graphics are useless, and looking in detail at some
of the reports, they are not accurate.  Several of our clients request the
raw data, but this package keeps only enough raw data to generate the
graphs, which makes it useless for that as well.

In the past we also used to have a kind of ASCII report (Vikas wrote some of
the scripts and programs) generated from information obtained using the old
SNMP tool set developed by NYSERNet, but I guess that nobody maintained the
config files, and I believe that the SNMP library routines used aren't
working properly.

So I need to invest some time to provide a fast solution to this.  I'd
appreciate your help in identifying some useful package, or directions on
how to generate some good-looking and consistent reports.

- - - - - - - - - - - - - -   c u t   h e r e   - - - - - - - - - - - - - -

We have been using the MRTG package, which is basically a special SNMP
poller that queries the routers for stats and then does some nice graphing
of the data on the web.

- - - - - - - - - - - - - -   c u t   h e r e   - - - - - - - - - - - - - -

SNMP queries with a heavily-modified version of MRTG from the nice guy in
Germany.  Works very nicely.  We have recently installed NetScarf 2.0, and
are contemplating merging NetScarf 3.0 with the MRTG front end.

- - - - - - - - - - - - - -   c u t   h e r e   - - - - - - - - - - - - - -

I'm researching whether I can rewrite Steve Corbato's fastpoll program
using the fastsnmp library from the NetScarf people.  I think this will
allow fastpoll to scale better.  I've successfully written a quick C
program that uses the library to collect the required data for a router --
now I've just got to make it so we can manage it easily (i.e.
auto-generated config files from our databases). 

My goal is to be able to collect 1-2 minute period data on all links that
are greater than 10 Mbps -- 15 minute data for everything else.  The 2
minute collection period will allow us to scale up to 280 Mbps before
experiencing two counter roll-overs within a polling interval.  Hopefully
that will hold us over until the interface counters are available as
Counter64 objects via SNMPv2 (if that ever happens).
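
The arithmetic behind that limit can be checked with a short Python sketch.
It assumes the MIB-II octet counters are 32 bits wide, so a single wrap stays
detectable only while the counter advances by less than 2^32 octets between
polls; the function name is made up for the example.

```python
COUNTER32_MOD = 2 ** 32  # MIB-II ifInOctets/ifOutOctets are 32-bit counters


def max_unambiguous_bps(poll_seconds):
    """Highest sustained rate at which a 32-bit octet counter advances by
    less than one full cycle between polls (so at most one wrap occurs)."""
    return COUNTER32_MOD * 8 / poll_seconds


for period in (60, 120, 900):
    mbps = max_unambiguous_bps(period) / 1e6
    print(f"{period:4d} s poll: {mbps:7.1f} Mbps")
```

For a 120-second poll this comes to roughly 286 Mbps, in line with the figure
quoted above, and a 15-minute poll tops out near 38 Mbps.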

BTW -- what fastpoll collects now is ifInOctets, ifOutOctets, ifInUcastPkts,
ifOutUcastPkts, ifInErrors and ifOutDiscards.  Rather than storing the raw
counters, it calculates the rate by taking the delta and dividing by the
period.  Getting the accurate period is actually the hard part -- I'm having
SNMP send me the uptime of the router in each query and using that to
calculate the interval between polls and to detect counter resets due to
reboots.  The other trick to handle is the fact that, while IOS updates the
SNMP counters for process-switched packets as they are routed, it looks like
the counters for SSE-switched packets on C70X0 routers only get updated once
every 10 seconds.
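
The delta-over-period calculation described above might look something like
the following Python sketch.  It is not fastpoll's code: the dict layout and
function name are invented, and it assumes sysUpTime is reported in
hundredths of a second and that a decreasing uptime means a reboot (it could
also be the uptime value itself wrapping, which this sketch does not
distinguish).

```python
COUNTER32_MOD = 2 ** 32  # width of the MIB-II octet counters


def octet_rate_bps(prev, curr):
    """Compute bits per second between two polls of one interface.

    prev, curr: dicts with 'uptime' (sysUpTime, 1/100 s ticks) and
    'octets' (e.g. ifInOctets).  Returns None when the interval is unusable.
    """
    if curr["uptime"] < prev["uptime"]:
        # Uptime went backwards: assume a reboot reset the counters.
        return None
    seconds = (curr["uptime"] - prev["uptime"]) / 100.0
    if seconds <= 0:
        return None
    # Modular subtraction absorbs a single counter wrap within the interval.
    delta = (curr["octets"] - prev["octets"]) % COUNTER32_MOD
    return delta * 8 / seconds


# Two polls 120 s apart (by the router's own clock) with one counter wrap.
prev = {"uptime": 1_000_000, "octets": 4_294_967_000}
curr = {"uptime": 1_012_000, "octets": 1_499_704}
print(octet_rate_bps(prev, curr))
```

Using the router's uptime rather than the poller's clock means network delay
in the query does not distort the measured interval.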

I'll let you know how my tests come out.

-30-

randy＠psg.com
