Hi all -
Since the deployment of the RSes at the interconnection points the RA team has been collecting network statistics including packet loss and latency across the media to our peers, BGP message counts, RS memory utilization, various interface counts on the RS, and so forth.
We have installed a sample web page containing data collected at the MAE-East facility. Over the next month we will finish up the automated portion of the data collection and report generation for the rest of the interconnection points.
Our MAE-East data is located at http://www.ra.net/~ra/statistics. Feedback and suggestions welcome. We will try and incorporate your ideas into the layout for next month. Thanks to Dun Liu and Susan Horvath for putting all of this together.
Bill
-------------------------------------------------------------------------
William B. Norton            Merit Network Inc.
e-mail: wbn@merit.edu        phone: (313) 936-2656
WWW: http://home.merit.edu/~wbn
In message <Pine.SUN.3.91.951017160511.7418G-100000@home.merit.edu>, "William B. Norton" writes:
Bill,
You have a lot of statistics that are simply listed as "ANS" with some incredibly high loss and delay numbers. You don't mention that these are all for the ANS interface 192.41.177.253 en-0.ENSS136.t3.ANS.NET, which was the ethernet on ENSS136, and you fail to mention that for most or all of this time period there was also a FDDI on ENSS136 and we later cut over to the FDDI on ENSS147. If you fail to mention these little facts, people might think that ANS has a really shoddy connection to Mae-East, not knowing that the problem was due to media saturation at the Mae East ethernet and technical problems with the ethernet bridging technology used by MFS, and that ANS also had the FDDI and preferred the FDDI except for those only reachable by ethernet.
Would you please clear that up on your web pages and maybe add a readme on the ftp server? If you like, I can get the E136 FDDI installation dates and the cutover date for E147, which now only has a FDDI and doesn't have an ethernet at all. I think we have now shut down the College Park ethernet as well.
Curtis
Thanks for the feedback Curtis. You are correct that one might get the wrong impression. (Our intent here was really to show the layout of the web page and get some feedback, and this ANS router just happened to be first.) Dun just uploaded all the rest of the MAE-East Peer delay/packet loss graphs. Thanks for the suggestion.
Web Pointer to MAE-East graphs: http://www.ra.net/~ra/statistics
Bill & Dun
On Tue, 17 Oct 1995, Curtis Villamizar wrote:
Curtis, Yes, we would like to have the dates of the changes at MAE-East to help us better annotate the graphs. Thanks for your response. --Elise
In message <Pine.SUN.3.91.951020142813.25539B-100000@home.merit.edu>, "William B. Norton" writes:
Bill,
Could you tell us a little more about your packet loss sampling? Like how many ping packets are you using per collection point?
If those packet loss statistics at the Mae are correct, we have some serious trouble there. I just fired up a ping to .181 on the ring (MCI) and in 100 packets lost 2 (close together; I saw the sequence numbers that were missing). I tried again with 1000 and got 0 loss. Maybe it's just an off time. Still, I can't see how you could be approaching anything near 10-20% on all the major providers. You've got some nasty peaks there.
My first inclination was to wonder if you overflowed the space for UDP packets by kicking off too much data collection on the RS at once. We used to lose SNMP replies for that reason when we kicked off too many GETs at 15 minute intervals. We'll be looking at this too, to try to confirm the loss you are reporting.
Curtis
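Curtis's question about how many probes go into each sample matters because a loss rate estimated from a handful of pings is very coarse: with the five probes per peer that Bill later describes, a single missing reply already reads as 20% loss. As a hedged illustration, the sketch below computes a Wilson score interval (a standard binomial confidence interval, chosen here for illustration; nothing in the thread says the RA team used it) for one loss in five probes and for Curtis's own 2-of-100 and 0-of-1000 runs. The function name is hypothetical.

```python
from __future__ import annotations

import math


def wilson_interval(lost: int, sent: int, z: float = 1.96) -> tuple[float, float]:
    """Approximate 95% Wilson score interval for a packet-loss rate.

    lost: probes that got no reply; sent: probes transmitted.
    """
    if sent <= 0:
        raise ValueError("sent must be positive")
    p_hat = lost / sent
    z2 = z * z
    denom = 1.0 + z2 / sent
    center = (p_hat + z2 / (2 * sent)) / denom
    margin = z * math.sqrt(p_hat * (1 - p_hat) / sent + z2 / (4 * sent * sent)) / denom
    # Clamp to [0, 1] to absorb floating-point rounding at the edges.
    return max(0.0, center - margin), min(1.0, center + margin)


if __name__ == "__main__":
    for lost, sent in [(1, 5), (2, 100), (0, 1000)]:
        lo, hi = wilson_interval(lost, sent)
        print(f"{lost}/{sent} lost: observed {lost / sent:.1%}, 95% CI {lo:.1%} .. {hi:.1%}")
```

For one loss in five probes the interval runs from roughly 4% to 62%, so a single five-ping sample says little on its own; many samples have to be pooled before a 10-20% figure can be told apart from a few percent.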
remember that using pings to sample connectivity to a very busy cisco router is not a very reliable probe for several reasons. Returning pings is a low-priority task in the first place, and they are rate-limited, so if the processor is busy processing lots of BGP updates and several folks are fribbling with it using ping or SNMP, it is less than clear what they will see. -mo
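mo's point can be made concrete with a toy model. Assume, hypothetically, that a router answers echo requests through a token bucket (the rate and burst size below are invented; they are not Cisco's actual parameters or algorithm) and that several pollers fire their five-probe runs at the same moments. Every refused request looks like packet loss to the pollers, even though the model drops no transit traffic at all.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class TokenBucket:
    """Hypothetical echo-reply limiter: `rate` replies/second, bursts up to `capacity`."""
    rate: float
    capacity: float
    tokens: float = 0.0
    last: float = 0.0

    def __post_init__(self) -> None:
        self.tokens = self.capacity  # start with a full bucket

    def allow(self, now: float) -> bool:
        # Refill for the time elapsed since the previous request, capped at capacity.
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False


def simulate(pollers: int, probes: int, interval: float, bucket: TokenBucket) -> float:
    """Fraction of echo requests left unanswered when `pollers` hosts each send
    `probes` requests `interval` seconds apart, all starting at t=0."""
    arrivals = sorted(i * interval for _ in range(pollers) for i in range(probes))
    answered = sum(bucket.allow(t) for t in arrivals)
    return 1.0 - answered / len(arrivals)


if __name__ == "__main__":
    for pollers in (1, 3, 6):
        loss = simulate(pollers, probes=5, interval=1.0,
                        bucket=TokenBucket(rate=2.0, capacity=4.0))
        print(f"{pollers} pollers: {loss:.0%} of pings unanswered")
```

With these made-up numbers a lone poller sees no loss, three simultaneous pollers see 20% and six see 60%. That is the ambiguity mo describes: unanswered pings to a busy router's control plane do not by themselves show forwarding loss.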
On Sat, 21 Oct 1995, Mike O'Dell wrote:
OK! Agreed. So then, what would you use?
----------------------------------------------------------------------------
Michael A. Nasto                    Customer Support Manager
NYSERNet, Inc.                      Phone: 315-453-2912 x 256
200 Elwood Davis Road               Fax: 315-453-3052
Suite 103                           Email: mnasto@franklin.nysernet.org
Liverpool, NY 13088-6147                   mnasto@transit.nyser.net
+++++++++++++++++++++++++++ Carpe Diem +++++++++++++++++++++++++++++++++++++
Have you ever been in a classroom and had a student raise his hand, answer every question, ask intelligent questions, etc. just to prove to the class how smart he or she is? This is the premise of the 'Two Mike's Interchange' above. One says, HEY! I know ping packets are a lower priority than everything else in a *CISCO* router LOOK AT ME (wave wave). Then another kid in the class quips... if PING is not what you would use, give us a better utility.
In fact... EVERYONE (okay 99.73 percent :-) uses PING. After all, router LOAD is router LOAD.... and if a few ICMP packets can't get back in a subjectively reasonable time.. then DUH.... "da network is busy......" BGP updates take bandwidth just like any other packet.
Of course the 0.27 percent, zen routing gods of the universe just feel the load and the harmonic BGP update patterns and PING between the BGP updates.... for a better answer.
Sorry, I could not resist.... and apologize for the satire. PING!! PING!! PING!!
Tim
--
+--------------------------------------------------------------------------+
| Tim Bass                           | #include<campfire.h>                |
| Principal Network Systems Engineer | for(beer=100;beer>1;beer++){        |
| The Silk Road Group, Ltd.          |     take_one_down();                |
|                                    |     pass_it_around();               |
| http://www.silkroad.com/           | }                                   |
|                                    | back_to_work(); /*never reached */  |
+--------------------------------------------------------------------------+
On Sun, 22 Oct 1995, Tim Bass wrote:
Yes, Tim. I always hated the "geeky" kid who always raised his hand or had an answer for everything, too. But I was serious with this question. I thought it was rather obvious, too, that the router may be busy, so what is the point? If you don't use 'ping' or 'traceroute', what else do you use? I thought he had an answer. Obviously not.
sorry if this seems so trivially obvious, but depending on what you are trying to measure, sources of significant signal noise *are* one thing to consider when doing an experimental design. using Pings to cisco routers produces a signal with a number of different components, some real signal, some noise, depending on what you are trying to measure.
so exactly what ARE we trying to measure??? I might have a suggestion if the problem were well-formed.
for instance, if you want to measure "router busy-ness" then CPU load and SSE misses are probably good candidates. if you are trying to find out WHY the routers are busy, then you have to take more data (variables). If you are trying to establish how effectively (successfully?) the routers are forwarding packets across MAE-EAST, then I believe you need to do something more sophisticated than just ping routers. you could inject "dye" packets which get forwarded and then monitor how many get across what paths and when.
is this a lot more complex than just pinging routers? yes; knowing facts is often a lot harder than handwaving. but the issue here *is* scientific experimental design. You have to define WHAT you want to measure before you can design an experiment to measure it. then you have to analyze the experiment to verify whether it actually measures what you want. then you run the experiment several times and analyze the output.
and most importantly, if you do a good enough job of defining WHAT you want to measure, maybe some other enterprising folks will devise a different method to measure the same phenomenology and run independent experiments. this way you learn whether you are really measuring what you thought and can compare results. and if you do a good enough job of all this, you get to call it "science." otherwise, you are just collecting data and beating on graphing programs.
And while I very much appreciate the intuitive bent which gets us all through the day, I haven't heard anything that looks like a definition of what we are trying to measure.
-mo
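mo's "dye packet" suggestion can be sketched as tagging each injected probe with a run identifier, a sequence number and a send timestamp, then tallying what shows up at the far side of the exchange. The payload layout and function names below are hypothetical, the delay figure assumes the sender's and receiver's clocks are synchronized, and the example capture is simulated rather than measured.

```python
from __future__ import annotations

import struct
from dataclasses import dataclass
from statistics import median

# Hypothetical probe layout: run id (uint32), sequence number (uint32) and
# sender timestamp in seconds (float64), all in network byte order.
PROBE = struct.Struct("!IId")


def make_probe(run_id: int, seq: int, sent_at: float) -> bytes:
    """Build one 'dye' probe payload."""
    return PROBE.pack(run_id, seq, sent_at)


@dataclass
class Summary:
    sent: int
    received: int
    lost: list[int]
    duplicates: int
    reordered: int
    median_delay: float | None  # seconds; meaningful only with synchronized clocks


def summarize(run_id: int, sent: int, arrivals: list[tuple[bytes, float]]) -> Summary:
    """Tally the probes from one run as captured at the far side.

    arrivals holds (payload, receive_time) pairs in capture order. Probes from
    other runs are ignored, repeated sequence numbers count as duplicates, and a
    probe arriving after a higher-numbered one counts as reordered.
    """
    seen: set[int] = set()
    delays: list[float] = []
    duplicates = reordered = 0
    highest = -1
    for payload, received_at in arrivals:
        rid, seq, sent_at = PROBE.unpack(payload)
        if rid != run_id:
            continue  # somebody else's dye
        if seq in seen:
            duplicates += 1
            continue
        seen.add(seq)
        if seq < highest:
            reordered += 1
        highest = max(highest, seq)
        delays.append(received_at - sent_at)
    lost = [s for s in range(sent) if s not in seen]
    return Summary(sent, len(seen), lost, duplicates, reordered,
                   median(delays) if delays else None)


if __name__ == "__main__":
    t0 = 1000.0
    probes = [make_probe(7, s, t0 + s) for s in range(5)]
    # Simulated capture: probe 2 never arrives, probe 1 is duplicated,
    # and probe 4 overtakes probe 3.
    capture = [(probes[0], t0 + 0.02), (probes[1], t0 + 1.03), (probes[1], t0 + 1.04),
               (probes[4], t0 + 4.02), (probes[3], t0 + 4.05)]
    print(summarize(7, 5, capture))
```

Because the probes are ordinary forwarded traffic, this measures delivery across the fabric rather than a router's willingness to answer pings.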
One last set of comments. Mike asks what the measurement is for. When BGP sessions break, it is sometimes because of partial or complete packet loss. I think some goals here include:
1) we want to be able to correlate BGP session failures with packet loss, and
2) we would like to understand how the NAP performance contributes to routing stability.
Some interesting correlations result as a side effect: specifically, a strong correlation between packet loss and gigaswitch utilization, and between packet delay and gigaswitch utilization. Not a perfect correlation, but a somewhat strong one.
PS - Dun & I did some further data analysis and calculated that 1% of the packet loss can be attributed to the last of the 5 packets transmitted (20% of the packets) having a response time greater than one second, which occurs 5%-10% of the time. That is, the last packet's response doesn't come back within the 1-second timeout period, and the stats show this as a loss. The first four packets can have delays > 1 second, but the last cannot. We can attribute approx. 1%-2% of the packet loss to this phenomenon.
Bill
On Sun, 22 Oct 1995, Mike O'Dell wrote:
sorry if this seems so trivially obvious, but depending on what you are trying to measure, sources of significant signal noise *is* one thing to consider when doing an experimental design. using Pings to cisco routers produces a signal with a number of different components, some real signal, some noise, depending on what you you are trying to measure.
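The arithmetic in Bill's PS can be checked with a small model. It assumes, as his explanation implies, that ping sends one probe per second and stops listening one second after the last probe, so a slow reply to any of the first four probes is still counted while a slow reply to the fifth is recorded as lost. The function names are illustrative.

```python
from __future__ import annotations


def counted_replies(rtts: list[float | None], interval: float = 1.0,
                    timeout: float = 1.0) -> int:
    """Replies a ping run would report, assuming probe i is sent at i * interval
    and the program stops listening `timeout` seconds after the last probe.

    rtts[i] is the true round-trip time of probe i, or None if it was really lost.
    """
    deadline = (len(rtts) - 1) * interval + timeout
    return sum(1 for i, rtt in enumerate(rtts)
               if rtt is not None and i * interval + rtt <= deadline)


def spurious_loss(probes: int, p_last_slow: float) -> float:
    """Expected apparent loss when only the final probe can miss the deadline:
    it is 1/probes of all probes, and it is late with probability p_last_slow."""
    return p_last_slow / probes


if __name__ == "__main__":
    # No real loss, but probe 1 takes 1.5 s and probe 4 takes 1.2 s:
    # the first is still counted, the second is reported as lost.
    print(counted_replies([0.05, 1.5, 0.04, 0.06, 1.2]))  # 4
    for p in (0.05, 0.10):
        print(f"last reply slow {p:.0%} of the time -> {spurious_loss(5, p):.0%} apparent loss")
```

The fifth probe is 20% of all probes; if its reply is late 5%-10% of the time, 0.20 × 0.05 = 1% and 0.20 × 0.10 = 2% of probes are reported lost without ever having been dropped, which matches the 1%-2% figure.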
by all means, ping all you like. please do realize that routers have other things to do besides answer pings when they're busy. just because a router doesn't respond to 15% of its pings doesn't mean that it's dropping 15% of its packets or that there is a 15% loss on a line. router load is not router load: process-switched packets are not the same as card-level switched or cache-switched ones.
there aren't any easy answers. perhaps we should all just implement "ping" servers in POPs (a 386 running linux or bsd might suffice).
Jeff Young
young@mci.net
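Jeff's "ping server" idea amounts to a dedicated host whose only job is to answer probes promptly, so its replies do not compete with routing work. A minimal sketch follows. It substitutes UDP echo for ICMP (raw ICMP sockets usually require root privileges), and the function names, 4-byte sequence-number payload and 1-second timeout are assumptions for illustration. The demo runs both ends on one machine.

```python
from __future__ import annotations

import socket
import threading
import time


def run_responder(sock: socket.socket, stop: threading.Event) -> None:
    """Echo each datagram back to its sender until `stop` is set."""
    sock.settimeout(0.2)  # wake up periodically to check `stop`
    while not stop.is_set():
        try:
            data, addr = sock.recvfrom(2048)
        except socket.timeout:
            continue
        sock.sendto(data, addr)


def probe(addr: tuple[str, int], count: int = 5, timeout: float = 1.0) -> list[float | None]:
    """Send `count` numbered UDP probes one after another; return each RTT in
    seconds, or None if no matching reply arrived within `timeout`."""
    results: list[float | None] = []
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as s:
        s.settimeout(timeout)
        for seq in range(count):
            sent = time.monotonic()
            s.sendto(seq.to_bytes(4, "big"), addr)
            rtt = None
            try:
                while rtt is None:
                    data, _ = s.recvfrom(2048)
                    if int.from_bytes(data[:4], "big") == seq:
                        rtt = time.monotonic() - sent
                    # otherwise it is a late reply to an earlier probe; keep waiting
            except socket.timeout:
                pass
            results.append(rtt)
    return results


if __name__ == "__main__":
    server = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    server.bind(("127.0.0.1", 0))  # any free port on this machine
    stop = threading.Event()
    worker = threading.Thread(target=run_responder, args=(server, stop), daemon=True)
    worker.start()
    print(probe(server.getsockname()))
    stop.set()
    worker.join()
    server.close()
```

On a real deployment the responder would run on the dedicated host in the POP and the prober elsewhere; the point of the design is that nothing else on the responder competes with answering probes.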
Thanks for all the spam..... these postings are like a kindergarten of IP users with little answers like:
" please keep in mind that pings add traffic to the net "
Blah, blah, blah. I'm sure I'm not the only one on the net who considers these little 'oh, tie your shoe before walking' so-called help messages, and the 'make sure to put all the hair in your pants before zipping your zipper' notes, SPAMtastic.
Honestly, how some people who just learned how to PING or do BGP now believe that everyone else on the net needs their advice is a mystery. Everybody PINGs; everybody will always ping to check out connectivity problems, etc.
Based on the NOISE on the list now, we need an AUP entry on *how many packets you can PING a day*, or other foolish KID talk like ;-)
" let's all charge .001 cent per ICMP packet, that'll control them....." (typical com-priv spam...)
And please, let's not get into a week's SON-OF-SPAM spawns SON-OF-SPAM discussion on the little zine ad that happened to cyberplop this weekend. My hard disk still has traces of magnetic SON-OF-SPAM electrons from the last zine ad that ravaged our mailboxes.
Surely there are some intellectually stimulating neurons out there about to spark some *new interesting* ideas :-)
Tim
On Sun, 22 Oct 1995, Jeff Young wrote:
there aren't any easy answers. perhaps we should all just implement "ping" servers in pops ( a 386 running linux or bsd might suffice).
So then perhaps, since multiple hosts (last I recall) already exist on MAE-East, we could use them for an experiment of sorts. One could take two hosts and set up a program akin to a loopback on each, operating on TCP ports. A standard data set could then be transmitted back and forth at certain time intervals and compared. The result could then, in theory, show that there was such heavy load on the MAE that packets were dropped, even with retransmission. In any case, it would be more substantive than pinging Ciscos. We could then correlate any detected loss with the utilization statistics presently being generated (this would be easier if a 5-minute increment were chosen).
                        \|/ _____ \|/     Jonathan Heiliger
                         @~/ . . \~@      MFS Global Network Services, Inc.
________________________/_( \___/ )_\______________________________________
                           \__U__/        Direct: [408].975.2259   Email: loco@MFSDatanet.COM
                                          CSC: [800].637.7170      NCC: [800].637.4872
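Jonathan's proposal can be sketched as a TCP echo service on one host and a client on the other that sends a fixed data set, reads it back and checks that it arrived intact. Since TCP retransmits lost segments, loss on the path would show up mainly as longer transfer times rather than corrupted data, so the elapsed time is the figure to log at each interval and correlate with utilization. The 64 KB data set, the function names and the one-connection server are assumptions; the demo runs over the loopback interface.

```python
from __future__ import annotations

import socket
import threading
import time

# Hypothetical "standard data set": 64 KB of repeatable bytes.
DATASET = bytes(range(256)) * 256


def serve_echo(listener: socket.socket) -> None:
    """Accept one connection and echo everything back until the peer closes."""
    conn, _ = listener.accept()
    with conn:
        while chunk := conn.recv(65536):
            conn.sendall(chunk)


def _send_all_then_close(sock: socket.socket, data: bytes) -> None:
    sock.sendall(data)
    sock.shutdown(socket.SHUT_WR)  # tell the echo side we are done sending


def round_trip(addr: tuple[str, int], payload: bytes = DATASET) -> tuple[float, bool]:
    """Send `payload` to an echo server over TCP and read it back.

    Returns (elapsed seconds, payload came back intact). Retransmissions on a
    lossy path show up as a longer elapsed time, not as missing data.
    """
    start = time.monotonic()
    with socket.create_connection(addr, timeout=30) as s:
        # Send from a helper thread so neither side blocks with full buffers.
        sender = threading.Thread(target=_send_all_then_close, args=(s, payload))
        sender.start()
        received = bytearray()
        while chunk := s.recv(65536):
            received += chunk
        sender.join()
    return time.monotonic() - start, bytes(received) == payload


if __name__ == "__main__":
    listener = socket.create_server(("127.0.0.1", 0))
    threading.Thread(target=serve_echo, args=(listener,), daemon=True).start()
    elapsed, intact = round_trip(listener.getsockname())
    print(f"{len(DATASET)} bytes echoed in {elapsed * 1000:.1f} ms, intact={intact}")
    listener.close()
```

Run from a scheduler every five minutes, the elapsed times form a series that can be lined up against the utilization graphs.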
On Sun, 22 Oct 1995, Mike Nasto wrote:
OK! Agreed. So then, what would you use?
Ping: yes, it has a low priority, but we are looking at routers, "busy" routers. If they were not busy, the pings would not be so bad.
Nathan Stratton                CEO, NetRail, Inc.          Your Gateway to the World!
---------------------------------------------------------------------------
Phone  (703)524-4800           NetRail, Inc.
Fax    (703)534-5033           2007 N. 15 St. Suite 5
Email  sales@netrail.net       Arlington, Va. 22201
WWW    http://www.netrail.net/ Access: (703) 524-4802 guest
---------------------------------------------------------------------------
On Fri, 20 Oct 1995, Curtis Villamizar wrote:
Could you tell us a little more about your packet loss sampling? Like how many ping packets are you using per collection point?
Sure. The program "globalping" takes a list of hosts from the rover hostfile at each RS which contains a list of discovered RS BGP Peers. For each of these peers, the command
    /usr/rovers/bin/ping -s <peeripaddr> 100 -c 5
is issued and the output parsed and stored in $HOME/delay/delaymatrix.YYMMDD. For example, a sample of today's data at MAE-East in delaymatrix.951023:
    Mon Oct 23 00:00:01 1995
    192.41.177.166      68     1     1     1
    192.41.177.6        83     6     7     5     4
    192.41.177.140       2     1     1     1   130
    192.41.177.145       4  1188   345     7     4
    192.41.177.150       3     3     4     3     4
    192.41.177.160       2     2   135     2     4
    192.41.177.170       3     3     2     2     2
    192.41.177.181       2     2     1    73     2
    192.41.177.190     108     2     3     2     1
    192.41.177.210       2     2     2     2     2
    192.41.177.220       3     3     4     3     3
    192.41.177.241       2   166     2     2     6
    192.41.177.249       4   119     2     1     3
    192.41.177.115       2     2     5     2     2
    192.41.177.110       1     1     1     1     1
    192.41.177.85        4     3     3     3    91
    192.41.177.163       2     1    73     1     1
    192.41.177.90        2     1     1     1     2
    35.1.1.48           33    32    31    31    31
    192.41.177.251       4    94     8     7     5
    192.41.177.252       7   109     6     4
    192.41.177.169       2     1     1     1     2
    198.32.130.130      28    27    27  1044   204
    198.32.130.131      28    26    35  1044   198
    192.157.69.251      11    11    11    11    12
    192.157.69.250      18    11    34    12    11
    198.32.128.130      85    86    86
    198.32.128.131      85    85    86    85    85
    198.108.0.10        55    32    31    32    31
    Mon Oct 23 00:15:01 1995
    : : :
I believe for the graphs, Dun removes outliers. We are using the stock Sun ping:
    rs1.mae-east.ra.net : /usr/users/wbn/delay > ls -l /usr/rovers/bin/ping
    -rwsr-xr-x 1 root 16446 Oct 9 09:50 /usr/rovers/bin/ping*
If you are really interested, I installed the globalping source code on home.merit.edu:~ftp/pub/users/norton/globalping.tar, but it is really nothing more than a forker and output parser.
Bill
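The delaymatrix records are easy to post-process. The sketch below parses records in the layout of Bill's delaymatrix.951023 excerpt (a timestamp line followed by one line per peer with up to five RTT values) and reports per-peer loss and median RTT. It assumes one record per line and reads a line with fewer than five values as lost replies, which is an inference from the excerpt rather than a documented format. The median is used only as an illustrative robust summary; the thread does not say how Dun removes outliers. All names are hypothetical.

```python
from __future__ import annotations

import re
from dataclasses import dataclass
from datetime import datetime
from statistics import median

PROBES_PER_PEER = 5  # five pings per peer per run, as described in the thread
TIMESTAMP_FORMAT = "%a %b %d %H:%M:%S %Y"  # e.g. "Mon Oct 23 00:00:01 1995"
IPV4 = re.compile(r"\d{1,3}(?:\.\d{1,3}){3}")


@dataclass
class Sample:
    when: datetime
    peer: str
    rtts_ms: list[int]

    @property
    def loss(self) -> float:
        """Fraction of the probes that produced no RTT value."""
        return 1 - len(self.rtts_ms) / PROBES_PER_PEER

    @property
    def median_ms(self) -> float | None:
        return median(self.rtts_ms) if self.rtts_ms else None


def parse_delaymatrix(lines):
    """Yield a Sample for every peer line, tagged with the preceding timestamp.

    Assumes one record per line; ': : :' continuation markers and blank lines
    are skipped.
    """
    when = None
    for raw in lines:
        fields = raw.split()
        if not fields or fields[0] == ":":
            continue
        if IPV4.fullmatch(fields[0]):
            if when is None:
                raise ValueError("peer line before any timestamp")
            yield Sample(when, fields[0], [int(f) for f in fields[1:]])
        else:
            when = datetime.strptime(" ".join(fields), TIMESTAMP_FORMAT)


if __name__ == "__main__":
    excerpt = """Mon Oct 23 00:00:01 1995
192.41.177.166 68 1 1 1
192.41.177.145 4 1188 345 7 4
198.32.128.130 85 86 86""".splitlines()
    for s in parse_delaymatrix(excerpt):
        print(f"{s.peer:<16} loss={s.loss:.0%} median={s.median_ms} ms")
```

The median is barely moved by single spikes such as the 1188 entry, which is one simple way to keep one slow reply from dominating a graph.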
Another source of data - use the TCP data associated with the BGP session that your router already has in place to each of its peers. You should be able to get RTT & retransmits and all sorts of useful data points. --asp@uunet.uu.net (Andrew Partan)
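Andrew's suggestion relies on the RTT and retransmission figures TCP already maintains for each connection, including each BGP session. For reference, the sketch below implements the smoothed-RTT and retransmission-timeout calculation standardized in RFC 6298 (Jacobson's algorithm), to show what such an RTT figure represents. It does not read any counters from a router, how a given router or host exposes them is not assumed here, and the 1 ms clock granularity is an assumption.

```python
from __future__ import annotations

from dataclasses import dataclass

ALPHA = 1 / 8  # SRTT gain (RFC 6298)
BETA = 1 / 4   # RTTVAR gain (RFC 6298)
K = 4


@dataclass
class RttEstimator:
    """Per-connection smoothed RTT and retransmission timeout, in seconds."""
    granularity: float = 0.001  # clock granularity G (assumed 1 ms)
    min_rto: float = 1.0        # RFC 6298 recommends rounding RTO up to 1 s
    srtt: float | None = None
    rttvar: float = 0.0
    rto: float = 1.0            # initial RTO before any measurement

    def update(self, r: float) -> None:
        """Fold in one RTT measurement r."""
        if self.srtt is None:
            self.srtt = r
            self.rttvar = r / 2
        else:
            # RTTVAR is updated with the previous SRTT, before SRTT changes.
            self.rttvar = (1 - BETA) * self.rttvar + BETA * abs(self.srtt - r)
            self.srtt = (1 - ALPHA) * self.srtt + ALPHA * r
        self.rto = max(self.min_rto, self.srtt + max(self.granularity, K * self.rttvar))


if __name__ == "__main__":
    est = RttEstimator()
    for sample in (0.030, 0.032, 0.900, 0.031):  # one slow sample in the middle
        est.update(sample)
        print(f"r={sample:.3f}  srtt={est.srtt:.4f}  rttvar={est.rttvar:.4f}  rto={est.rto:.3f}")
```

Because the estimate is smoothed, a single slow sample nudges SRTT rather than replacing it, which makes it a steadier signal than individual ping times.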
participants (10)
- asp@uunet.uu.net
- Curtis Villamizar
- Elise Gerich
- Jeff Young
- Jonathan Heiliger
- Mike Nasto
- Mike O'Dell
- Nathan Stratton
- Tim Bass
- William B. Norton