level3.net in Chicago - high packet loss?!?
Does anybody have any idea why there is such high packet loss on Level3's network in Chicago?

Stef:~ scm$ mtr -r www.yahoo.com
...
tbr1-p010802.cgcil.ip.att.net         0%   16   16   15.12   24.21   49.26
ggr2-p310.cgcil.ip.att.net            0%   16   16   13.18   42.66  118.99
so-1-1-0.edge1.chicago1.level3.net    0%   16   16   14.48   35.84  126.48
so-2-1-0.bbr1.chicago1.level3.net    63%    6   16   14.44   43.74   79.97
                                    ^^^^^^^
as-1-0.bbr2.sanjose1.level3.net       0%   16   16   61.95   80.64  176.01
ge-10-2.ipcolo3.sanjose1.level3.net   0%   16   16   63.37   95.61  148.46
unknown.level3.net                    0%   16   16   63.34   86.46  168.62
unknown-66-218-82-217.yahoo.com       0%   16   16   62.09   88.91  127.58
p4.www.scd.yahoo.com                  0%   16   16   64.51   89.96  183.79

TIA,
Stef
Network Fortius, LLC
Network Fortius <netfortius@gmail.com> writes:
Does anybody have any idea why there is such high packet loss on Level3's network in Chicago?
End-user misinterpreting output from MTR. This network does not appear to have any packet loss end-to-end. ---Rob
And how exactly would you interpret the number returned by net_loss (int), in a column called "LOSS", in reference to reachability of a "hop" between two end points:

    int net_loss(int at)
    {
      if ((host[at].xmit - host[at].transit) == 0) return 0;
      /* times extra 1000 */
      return 1000*(100 - (100.0 * host[at].returned / (host[at].xmit - host[at].transit)) );
    }

?

Thanks,
Stef
Network Fortius, LLC

On Sep 6, 2005, at 7:45 AM, Robert E.Seastrom wrote:
Network Fortius <netfortius@gmail.com> writes:
Does anybody have any idea why there is such high packet loss on Level3's network in Chicago?
End-user misinterpreting output from MTR. This network does not appear to have any packet loss end-to-end.
---Rob
If the hop(s) following the one you see loss for show no loss, then disregard the loss for that hop. Obviously, whatever it is, it does not affect transit, which is what you really want to know. Is that correct?

Network Fortius wrote:
And how exactly would you interpret the number returned by net_loss (int), in a column called "LOSS", in reference to reachability of a "hop" between two end points:
    int net_loss(int at)
    {
      if ((host[at].xmit - host[at].transit) == 0) return 0;
      /* times extra 1000 */
      return 1000*(100 - (100.0 * host[at].returned / (host[at].xmit - host[at].transit)) );
    }

?
Thanks, Stef Network Fortius, LLC
On 9/6/05, Joe Maimon <jmaimon@ttec.com> wrote:
If the hop(s) following the one you see loss for show no loss, then disregard the loss for that hop. Obviously, whatever it is, it does not affect transit, which is what you really want to know.
Is that correct?
This is one of the most misunderstood concepts in properly reading output from a traceroute (mtr, visualtraceroute, whatever). Basically you are seeing loss of packets destined directly *TO* that router, not THRU it. Most often this is caused by 1) the router having rate limits applied to these packets so as not to bog down the CPU while it's trying to perform its main function (forwarding packets), or 2) the router already being busy and placing a low priority on responding to those packets so as to leave CPU available for forwarding packets.

You can see from the trace that the hops after that one don't show any loss. If that router were actually causing loss, you would see the loss continue thru the rest of the trace. Since you don't, you can assume that the router is experiencing one of the cases above. Of course there are always exceptions, but 99.9% of the time this is the case.

This same concept applies to latency as well. If you see only a single hop with a high response time and everything afterwards is normal, it's the same situation, except that the router is taking a longer time to respond to you rather than ignoring you. You can test this by simply pinging the end destination: do you see the same loss and/or high latency? If not, you can disregard it.

And while we're on the subject of reading this output, remember that traces only show you the forward path, not the reverse. Thanks to the wonders of asymmetric routing, at times it could be the return path that actually has the loss on it. The loss in the forward path only gives you an idea of where to begin troubleshooting.

--chip

--
Just my $.02, your mileage may vary, batteries not included, etc....
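A minimal sketch of the rule of thumb above, assuming the per-hop loss percentages have already been read out of an mtr-style report into an array. The names `hop_verdict` and `classify_hop` are made up for illustration and do not come from mtr. Loss at one hop is treated as local to that router's replies unless it is still visible at every later hop, including the last one.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical verdicts for one hop of an mtr-style report. */
enum hop_verdict {
    HOP_CLEAN,         /* no loss reported at this hop */
    HOP_LOCAL_ONLY,    /* loss here, but some later hop shows none */
    HOP_MAYBE_FORWARD  /* loss here and at every later hop (or this is the last hop) */
};

/* loss_pct[i] is the reported loss percentage of hop i (0-based). */
enum hop_verdict classify_hop(const double *loss_pct, size_t n_hops, size_t hop)
{
    assert(loss_pct != NULL && hop < n_hops);

    if (loss_pct[hop] <= 0.0)
        return HOP_CLEAN;

    /* Loss caused by forwarding should show up at every later hop too. */
    for (size_t i = hop + 1; i < n_hops; i++) {
        if (loss_pct[i] <= 0.0)
            return HOP_LOCAL_ONLY;
    }
    return HOP_MAYBE_FORWARD;
}
```

For the nine hops shown in the original report (0, 0, 0, 63, 0, 0, 0, 0, 0), the fourth hop classifies as HOP_LOCAL_ONLY. This only looks at the forward path; as noted above, it says nothing about the return path.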
On Tue, 6 Sep 2005, chip wrote:
This is one of the most misunderstood concepts in properly reading output from a traceroute (mtr, visualtraceroute, whatever). Basically you are seeing loss of packets destined directly *TO* that router, not THRU it. Most
no... not destined TO the router, destined THROUGH the router, that happen to hit TTL=0 ON that router. which is also misunderstood by just about everyone :( but anyway... 'not affecting transit' for reasons cited by yourself and min and adam already, yes.
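The distinction can be shown with a small model of TTL handling, offered as a sketch rather than as traceroute's or mtr's actual code. The function name `expiring_hop` and the path model (routers at hops 1 to path_len - 1, the destination at hop path_len) are assumptions for illustration. The probe is addressed to the final destination the whole time; only its TTL decides which router ends up answering.

```c
#include <assert.h>

/* Returns the 1-based hop at which a probe expires, or 0 if it is
 * delivered to the destination. path_len counts the routers plus the
 * destination host as the final hop. */
int expiring_hop(int initial_ttl, int path_len)
{
    assert(initial_ttl >= 1 && path_len >= 1);

    int ttl = initial_ttl;
    for (int hop = 1; hop < path_len; hop++) {
        ttl--;              /* each router decrements the TTL before forwarding */
        if (ttl == 0)
            return hop;     /* this router discards the probe and may send
                               an ICMP time-exceeded message back */
    }
    return 0;               /* the probe reached the destination */
}
```

On a nine-hop path, a probe sent with TTL 4 expires at hop 4, while a probe sent with TTL 9 is delivered. Whether the router at hop 4 actually generates the time-exceeded reply is a separate matter from whether it forwards the TTL 9 probe.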
On Tue, 6 Sep 2005, Christopher L. Morrow wrote:
no... not destined TO the router, destined THROUGH the router that happen to TTL=0 ON that router.
Very true. Most backbone kit on a tier 1 network is designed to switch packets in a distributed fashion, shifting packets between ports/cards over a backplane of some sort. On such kit, generating things such as a TTL-exceeded packet is usually punted to a central processor (whose primary task is to build route tables to hand off to the cards), which deals with the task in a much slower and much lower priority way than packets which transit the routing device. You also don't want your central processor to have to deal with too much of this sort of thing, which is (at least one of the reasons) why it's often rate limited.
which is also misunderstood by just about everyone :( but anyway... 'not affecting transit' for reasons cited by yourself and min and adam already, yes.
Agreed.

SB
--
Stewart Bamford (Posting as an individual)
Level3 Snr IP Engineer
*** Views expressed are my own and not necessarily those of Level3 ***
Primary email stewart@whoever.com
Secondary email me@stewartb.com
Personal website http://www.stewartb.com/
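As a rough illustration of why rate-limited reply generation looks like loss to a probing tool, here is a token-bucket sketch. Everything in it is hypothetical: the struct, the function names, and the numbers are illustrative and are not taken from any vendor's implementation or defaults.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical token-bucket policer for replies generated by a router's
 * central processor. */
struct policer {
    long tokens_milli;  /* current tokens, in thousandths of a reply */
    long burst_milli;   /* bucket capacity, same unit */
    long rate_per_sec;  /* replies allowed per second */
    long last_ms;       /* time of the previous update, in milliseconds */
};

/* Returns 1 if a reply may be generated at now_ms, 0 if it is suppressed. */
int policer_allow(struct policer *p, long now_ms)
{
    assert(p != NULL && now_ms >= p->last_ms);

    /* rate_per_sec replies per second is rate_per_sec milli-tokens per ms. */
    p->tokens_milli += (now_ms - p->last_ms) * p->rate_per_sec;
    if (p->tokens_milli > p->burst_milli)
        p->tokens_milli = p->burst_milli;
    p->last_ms = now_ms;

    if (p->tokens_milli >= 1000) {
        p->tokens_milli -= 1000;
        return 1;
    }
    return 0;
}

/* Counts how many of n expired probes, spaced interval_ms apart, get a reply. */
int count_replies(struct policer *p, int n, long interval_ms)
{
    int replies = 0;
    for (int i = 0; i < n; i++)
        replies += policer_allow(p, p->last_ms + (i == 0 ? 0 : interval_ms));
    return replies;
}
```

With a budget of one reply per second and a burst of one, 16 probes arriving 100 ms apart get only 2 replies in this model, which a tool would report as 87.5% loss at that hop even though no forwarded packet was dropped. In practice such a budget would be shared by everyone probing the router, so the observed figure depends on other people's traffic too.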
On Tue, 6 Sep 2005 sdb@stewartb.com wrote:
Very true. Most backbone kit on a tier 1 network is designed to switch
I was really just pointing out that 'traceroute' or 'mtr' send packets with increasing TTL to show 'loss' or 'delay' from place to place. I wasn't trying to debate the ever-changing reasons why backbone equipment might or might not answer 'ttl-expired' or 'unreachable' (or any 'exception traffic' really) in a 'timely' fashion. That issue changes with the wind/os/hardware/model.... :)

nice to see L3 sending in the answer police though :) Thanks!

-Chris
On Wed, 7 Sep 2005, Christopher L. Morrow wrote:
I was really just pointing out that 'traceroute' or 'mtr' send packets with increasing TTL to show 'loss' or 'delay' from place to place. I wasn't trying to debate the ever-changing reasons why backbone equipment might or might not answer 'ttl-expired' or 'unreachable' (or any 'exception traffic' really) in a 'timely' fashion. That issue changes with the wind/os/hardware/model.... :)
Yeah, it was a sweeping generalisation, hence the excessive use of words such as "usually" and "most" :) I was trying to put the point across as to why things are like this, for those that might be wondering why. The main point was actually that the ability of a device (router, web server etc) to deal with stuff _like_ ICMP message generation does not reflect its ability to perform its main task.
nice to see L3 sending in the answer police though :) Thanks!
Thanks :)

SB
In message <38D1AE7F-52FA-4BCF-A32F-CF40FA958DC6@gmail.com>, Network Fortius writes:
And how exactly would you interpret the number returned by net_loss (int), in a column called "LOSS", in reference to reachability of a "hop" between two end points:
The hop not sending out the ICMP message for those packets, either because its CPU is overloaded or because of a configuration option to rate-limit them. --Steven M. Bellovin, http://www.cs.columbia.edu/~smb
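To make the arithmetic concrete, here is a standalone sketch that applies the same formula as the quoted net_loss() to plain counters. The struct `hop_counters` and the function name `loss_milli_percent` are hypothetical, and the field comments are a reading of the field names rather than something confirmed against the mtr source.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for one entry of the host[] table used by the
 * quoted net_loss(); only the three counters it reads are modelled. */
struct hop_counters {
    int xmit;      /* probes sent for this hop */
    int transit;   /* probes assumed to be still awaiting a reply */
    int returned;  /* replies received from this hop */
};

/* Same arithmetic as the quoted net_loss(): loss percentage times 1000. */
int loss_milli_percent(const struct hop_counters *h)
{
    assert(h != NULL);

    int settled = h->xmit - h->transit;
    if (settled == 0)
        return 0;   /* nothing to measure yet */
    return (int)(1000 * (100 - (100.0 * h->returned / settled)));
}
```

If the "6 16" on the bbr1.chicago1 line is read as 6 replies received out of 16 probes sent, with none outstanding, the function returns 62500, that is 62.5%, which is consistent with the 63% shown in the report. The number counts missing replies from that hop; by itself it does not say whether packets forwarded through the hop were dropped.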
On Sep 6, 2005, at 9:36 AM, Steven M. Bellovin wrote:
The hop not sending out the ICMP message for those packets, either because its CPU is overloaded or because of a configuration option to rate-limit them.
--Steven M. Bellovin, http://www.cs.columbia.edu/~smb
Thank you. The former seems close to what may have happened, with a possible impact beyond ICMP, as once having moved my client over to their Broadwing connection, their processing from Yahoo's site seems to have come back to where it was a few days ago. Thanks again, Stef Network Fortius, LLC
On Tue, 6 September 2005 10:09:17 -0500, Network Fortius wrote: [..]
Thank you. The former seems close to what may have happened, with a possible impact beyond ICMP, as once having moved my client over to their Broadwing connection, their processing from Yahoo's site seems to have come back to where it was a few days ago.
Oh, yes. Is it just me, or did this read as: "Bah! But I was right anyway, all you suckers!" ? Go on, amuse the crowd. Alexander
On Sep 6, 2005, at 10:16 AM, Alexander Koch wrote:
Oh, yes. Is it just me, or did this read as: "Bah! But I was right anyway, all you suckers!"
?
Go on, amuse the crowd.
Alexander
I am sorry if you interpret it this way. I do not have much choice when it comes to servicing people asking for immediate resolution: it is either trying to determine (via the wrong tool, in this case?!?) whether something is going on during Labor Day, when the client still accepts "toying around", or the knee-jerk reaction of moving them first thing Tuesday morning and then philosophizing about the problem.

Again, please accept my apologies. It must have been just a coincidence that my failure to interpret the tool's output properly, combined with something actually having happened on the client's side, led to the wrong assumption that things were wrong in a place the tool's output should not have been taken to indicate.

Stef
Network Fortius, LLC
Apparently it's just you. Nice job on taking an on-topic discussion and then flaming the original poster who (based on my personal subjective read of the responses) was genuinely asking for assistance and appears to have gleaned some value from this thread and got her client's service restored. Consider the value of your post prior to sending it.

For those asking me to do the same, the value is to bring more attention to the lack of moderating influence on nanog. While recognizing it is nobody's full-time job to baby-sit this forum, I find that posts from the steering committee are generally heeded. Further, attacking those looking for help in no way aids the network community as a whole, and you present yourself as elitist. People may be misguided and may be off topic, but helping them get where they are trying to go is our job on many levels.

My apologies for providing troll food.

- Scott

On 9/6/05, Alexander Koch <koch@tiscali.net> wrote:
Oh, yes. Is it just me, or did this read as: "Bah! But I was right anyway, all you suckers!"
?
Go on, amuse the crowd.
Alexander
On 2005-09-06-10:25:28, Network Fortius <netfortius@gmail.com> wrote:
And how exactly would you interpret the number returned by net_loss (int), in a column called "LOSS", in reference to reachability of a "hop" between two end points [...]
I'd interpret it to mean you're hitting a control plane policer or somesuch, with no actual bearing on end-to-end performance, judging from the diagnostic output you've graciously provided us with. I find myself giving this lecture several times a week to random "gamer" customers upset that intermediary routers don't reply to their pings at full line rate; I'd expect slightly better critical thinking skills from the posters on this list, but I've been wrong before. :) -a
On Tue, Sep 06, 2005 at 10:39:12AM -0400, Adam Rothschild wrote:
I'd interpret it to mean you're hitting a control plane policer or somesuch, with no actual bearing on end-to-end performance, judging from the diagnostic output you've graciously provided us with.
I find myself giving this lecture several times a week to random "gamer" customers upset that intermediary routers don't reply to their pings at full line rate; I'd expect slightly better critical thinking skills from the posters on this list, but I've been wrong before. :)
And yet, his client had a problem, with that link, and did not have a problem with some other link, which, presumably, did *not* show that indication. Correlation does not imply causation, given, but it's certainly a datapoint.

Best Practices of wide-area diagnosis, anyone?

Cheers,
-- jra
--
Jay R. Ashworth                                        jra@baylink.com
Designer            +-Internetworking------+----------+       RFC 2100
Ashworth & Associates  |  Best Practices Wiki  |       |       '87 e24
St Petersburg FL USA  http://bestpractices.wikicities.com  +1 727 647 1274

If you can read this... thank a system administrator. Or two. --me
owner-nanog@merit.edu wrote:
Best Practices of wide-area diagnosis, anyone?
I'd be interested in a discussion of this as well. To answer a slightly different question, I usually point the "ping and traceroute" geeks to Karl's wonderful treatise on the subject: http://www.iwl.com/Resources/Papers/icmp-echo_print.html. Andrew Cruse
On Tue, Sep 06, 2005 at 01:16:59PM -0400, andrew2@one.net wrote:
I'd be interested in a discussion of this as well. To answer a slightly different question, I usually point the "ping and traceroute" geeks to Karl's wonderful treatise on the subject: http://www.iwl.com/Resources/Papers/icmp-echo_print.html.
I've found it useful to use a simple UDP probe tool to test networks in the past. You can test end-to-end loss and get something reasonable. The following expects you to know:

1) GCC/Makefiles
2) how to ensure you link in your resolver and socket/nsl functions
3) how to tweak your CPU compile options for your host

but.. ftp://puck.nether.net/pub/jared/rtt-0.12.tar.gz

If your clocks are accurately synced, you can even get unidirectional delay. I usually run it like this:

    ./rtt -v <host>

You will need to run ./rtt_resp on the far-end host.

You can also use iperf or similar tools to help customers diagnose network problems, but an easy/lightweight daemon on a few hosts is always fairly easy to play with in a quick-and-dirty way...

- jared

--
Jared Mauch | pgp key available via finger from jared@puck.nether.net
clue++; | http://puck.nether.net/~jared/ My statements are only mine.
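The kind of end-to-end summary such a probe tool can give might be computed as in the sketch below. This is not how rtt itself works (its internals are not described in this thread); the struct `probe_stats`, the function `summarize_probes`, and the convention of marking a lost probe with a negative round-trip time are all assumptions made for the example.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical end-to-end summary of a run of probes to one destination. */
struct probe_stats {
    size_t sent;
    size_t received;
    double loss_pct;
    double best_ms, avg_ms, worst_ms;  /* meaningful only when received > 0 */
};

/* rtt_ms[i] is the round-trip time of probe i in milliseconds,
 * or a negative value if no reply arrived. */
struct probe_stats summarize_probes(const double *rtt_ms, size_t sent)
{
    assert(rtt_ms != NULL && sent > 0);

    struct probe_stats s = { sent, 0, 0.0, 0.0, 0.0, 0.0 };
    double sum = 0.0;

    for (size_t i = 0; i < sent; i++) {
        if (rtt_ms[i] < 0.0)
            continue;                       /* lost probe */
        if (s.received == 0 || rtt_ms[i] < s.best_ms)
            s.best_ms = rtt_ms[i];
        if (s.received == 0 || rtt_ms[i] > s.worst_ms)
            s.worst_ms = rtt_ms[i];
        sum += rtt_ms[i];
        s.received++;
    }
    if (s.received > 0)
        s.avg_ms = sum / (double)s.received;
    s.loss_pct = 100.0 * (double)(sent - s.received) / (double)sent;
    return s;
}
```

Because the replies come from the far-end host rather than from routers along the way, a loss figure computed like this reflects the path as a whole, which is the number that matters to the customer.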
participants (13)
- Adam Rothschild
- Alexander Koch
- andrew2@one.net
- chip
- Christopher L. Morrow
- Jared Mauch
- Jay R. Ashworth
- Joe Maimon
- Network Fortius
- Robert E.Seastrom
- Scott Altman
- sdb@stewartb.com
- Steven M. Bellovin