My general question is "what meaning do I give to lossy traceroutes, even when pings show no problem."

Can I expect that backbone routers should never give me timeouts on a traceroute through them, so, lots of asterisks from these systems indicate a packet loss problem that needs to be fixed?

Or, are these traceroute asterisks essentially meaningless, and should be expected on any busy link?

More specifically, is anyone else getting lots of *s for NYC1.gblx.net for traceroutes through them? If I do three traceroutes through there, at least one will show losses at or beyond the NYC1 hops (and, the *s beyond NYC1 might be getting lost in NYC1, rather than indicating a different error). But, Global Crossing's on-line tools don't show any loss.

I am at simons-rock.edu, in Western Mass, and we connect via Boston. A few days ago, our users of a database that's hosted at our parent campus, bard.edu, started complaining of many frequent (but intermittent) delays. Bard is in the Hudson Valley, and connects via Poughkeepsie. Both of our local providers connect to Global Crossing.

Once before, we saw similar database symptoms, and that time, Bard had a problem dropping packets at their gateway. So I think these symptoms mean packet loss is happening somewhere. However, this time, pings from Simon's Rock to Bard, and vice-versa, show essentially no errors, typically 1000 pings will get through 100%. Still, despite the good pings, traceroutes from either end show lots of asterisks at or after Global Crossing's NYC1.gblx.net links.

I have opened a ticket with our provider, who has opened one with Global Crossing; and Bard has done the same with their end, but no significant response so far. (Bard's Graduate campus, located in New York City, is having similar poor database performance, so I'm pretty sure it is not just my end. Staff at the main Bard campus have no troubles, so it seems a network problem, not a server problem.)
As I understand it, an asterisk in traceroute means that the sending machine did not get any reply to a given packet. Since the traceroute packets have small TTL values, it expects to get a reply when the TTL is decremented to zero. But, I don't know if big routers are just lazy about sending such responses, or if these asterisks really indicate packets getting lost. (As far as I remember in the past, when things work well, I never see *s at the central links, but, I have not really done any baseline testing of the link from here to Bard when the database was working.)

So, another question is why pings work so well when traceroutes work so poorly. (By experiment, I believe our database application performs more like traceroute than like ping.) Is it packet size? Different handling for different sorts of traffic? Magic?

Here are some sample traceroutes each way:

Simon's Rock to Bard:

2h189:bin skbohrer$ traceroute -q5 -S bip.bard.edu
traceroute to bip.bard.edu (192.246.228.16), 64 hops max, 40 byte packets
 1  10.30.2.1 (10.30.2.1)  1.514 ms  1.791 ms  0.684 ms  0.761 ms  0.712 ms (0% loss)
 2  michael.simons-rock.edu (208.81.88.1)  2.509 ms  1.882 ms  0.899 ms  1.345 ms  2.057 ms (0% loss)
 3  64.213.79.249 (64.213.79.249)  104.294 ms  10.605 ms  17.106 ms  18.987 ms  38.740 ms (0% loss)
 4  pos2-0-155M.cr2.BOS1.gblx.net (67.17.70.166)  21.962 ms  20.411 ms  8.394 ms  23.308 ms  10.192 ms (0% loss)
 5  so1-2-0-2488M.scr2.NYC1.gblx.net (67.17.94.158)  15.738 ms  14.582 ms  17.306 ms  24.444 ms  15.466 ms (0% loss)
 6  ae3-30g.scr3.NYC1.gblx.net (67.17.104.189)  15.586 ms  13.358 ms
    ae0-30G.scr4.NYC1.gblx.net (67.16.139.2)  13.875 ms  13.495 ms  12.780 ms (0% loss)
 7  e5-1-30G.ar9.NYC1.gblx.net (67.16.142.54)  75.184 ms
    lag1.ar9.NYC1.gblx.net (67.16.142.50)  15.766 ms  11.947 ms *
    e5-1-30G.ar9.NYC1.gblx.net (67.16.142.54)  25.916 ms (20% loss)
 8  * * wbs-connect.gigabitethernet1-0-2.asr1.jfk1.gblx.net (64.211.195.6)  55.909 ms  73.803 ms * (60% loss)
 9  * pghknyshj42-xe-0-3-0.lightower.net (72.22.160.150)  16.521 ms  21.817 ms  23.715 ms  17.236 ms (20% loss)
10  pghknyshj91-ae0-66.lightower.net (72.22.160.165)  76.257 ms  27.712 ms  20.372 ms  18.923 ms  55.355 ms (0% loss)
11  kgtnnykgj91-ae3.66.lightower.net (72.22.160.107)  18.088 ms  51.631 ms  19.052 ms  20.876 ms  22.942 ms (0% loss)
12  BardCollege-cust.customer.hvdata.net (64.72.66.234)  51.243 ms  47.800 ms  32.835 ms  19.040 ms  55.661 ms (0% loss)
13  *^C

Bard to SR (their version of traceroute doesn't have the handy -S option):

SRDB/users/usrsr/finrep: traceroute mail.simons-rock.edu
trying to get source for mail.simons-rock.edu
source should be 10.20.11.23
traceroute to hedwig.simons-rock.edu (208.81.88.14) from 10.20.11.23 (10.20.11.23), 30 hops max
outgoing MTU = 1500
 1  hcrcgw (10.20.11.1)  1 ms  0 ms  0 ms
 2  hyphen (192.246.235.1)  1 ms  1 ms  1 ms
 3  BardCollege-hvdn.customer.hvdata.net (64.72.66.233)  1 ms  1 ms  1 ms
 4  pghknyshj91-xe-5-2-0.lightower.net (72.22.160.106)  2 ms  2 ms  2 ms
 5  pghknyshj42-ae0-66.lightower.net (72.22.160.159)  27 ms  2 ms  2 ms
 6  nycmnyzrj42-xe-0-3-0.lightower.net (72.22.160.151)  4 ms  4 ms  4 ms
 7  ve463.ar9.NYC1.gblx.net (64.211.195.5)  4 ms  4 ms  4 ms
 8  * ae0-40G.scr1.NYC1.gblx.net (67.16.138.253)  4 ms  4 ms
 9  pos5-0-2488M.cr1.BOS1.gblx.net (67.17.94.57)  9 ms
    pos9-0-2488M.cr2.BOS1.gblx.net (67.17.94.157)  9 ms  11 ms
10  pos1-0-0-155M.ar1.BOS1.gblx.net (67.17.70.165)  14 ms  10 ms  9 ms
11  64.213.79.250 (64.213.79.250)  15 ms  15 ms  18 ms
^C

For more automated testing, I used -m10 to set the max hops so that the traces stop within the backbone network, as this avoids any issue of the boxes at the ends not really responding to traceroutes. That way, I could assume any * was a real timeout. I also used -q4 for 4 queries to each host.

With a few hundred traceroutes each direction, more than 75% from SR to Bard, and more than 94% from Bard to SR, showed an asterisk at or past the NYC1 hops. There were zero asterisks on the links before NYC1 from either side.

Thanks for any insights.
Steve Bohrer
Network Administrator
ITS, Bard College at Simon's Rock
413-528-7645
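
For anyone wanting to repeat the automated test described above, here is a rough sketch in Python. It is not the script that was actually used, which the message does not show. The function names (parse_hops, lossy_hops, run_traceroute) are made up for illustration. It assumes a traceroute that accepts -m and -q and prints each reply time followed by "ms" and each timeout as a lone "*". The sample text is hops 5, 7 and 8 from the Simon's Rock to Bard trace.

```python
import re
import subprocess

# A hop line starts with the hop number followed by whitespace.
HOP_LINE = re.compile(r"^\s*(\d+)\s+(.*)$")


def parse_hops(output):
    """Return {hop_number: (replies, timeouts)} for one traceroute run."""
    hops = {}
    current = None
    for line in output.splitlines():
        m = HOP_LINE.match(line)
        if m:
            current = int(m.group(1))
            hops[current] = [0, 0]
            tokens = m.group(2).split()
        elif current is not None:
            # Continuation line: another router answering for the same hop.
            tokens = line.split()
        else:
            continue  # header text before the first hop
        hops[current][0] += tokens.count("ms")
        hops[current][1] += tokens.count("*")
    return {hop: tuple(counts) for hop, counts in hops.items()}


def lossy_hops(hops):
    """Return the sorted hop numbers that had at least one timeout."""
    return sorted(h for h, (_, timeouts) in hops.items() if timeouts > 0)


def run_traceroute(host, max_hops=10, queries=4):
    """Run the system traceroute (assumed to be on PATH) and return its text."""
    result = subprocess.run(
        ["traceroute", "-m", str(max_hops), "-q", str(queries), host],
        capture_output=True, text=True, check=False,
    )
    return result.stdout


SAMPLE = """
 5  so1-2-0-2488M.scr2.NYC1.gblx.net (67.17.94.158)  15.738 ms  14.582 ms  17.306 ms  24.444 ms  15.466 ms (0% loss)
 7  e5-1-30G.ar9.NYC1.gblx.net (67.16.142.54)  75.184 ms
    lag1.ar9.NYC1.gblx.net (67.16.142.50)  15.766 ms  11.947 ms *
    e5-1-30G.ar9.NYC1.gblx.net (67.16.142.54)  25.916 ms (20% loss)
 8  * * wbs-connect.gigabitethernet1-0-2.asr1.jfk1.gblx.net (64.211.195.6)  55.909 ms  73.803 ms * (60% loss)
"""

print(lossy_hops(parse_hops(SAMPLE)))
```

Calling parse_hops(run_traceroute(host)) in a loop and tallying lossy_hops per run gives the "what fraction of traces show an asterisk at or past hop N" figure. Note that this only counts missing replies; by itself it cannot tell a dropped probe from a router that declined to answer.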
On Sep 16, 2011, at 2:42 PM, Steve Bohrer wrote:
My general question is "what meaning do I give to lossy traceroutes, even when pings show no problem."
Can I expect that backbone routers should never give me timeouts on a traceroute through them, so, lots of asterisks from these systems indicate a packet loss problem that needs to be fixed?
Or, are these traceroute asterisks essentially meaningless, and should be expected on any busy link?
Basically, you should expect timeouts in the middle of the path from any large network from time to time. It could be for any number of reasons: DDoS, control-plane "busyness", etc. There was even a provider that once limited the ability for people to see inside their network.

The true tests are always end-to-end tests. I recommend having a host at each end that you can run pings or iperf from. This will aid you greatly in diagnosing the trouble.

Traceroute (or, more specifically, TTL expiry handling by a multipurpose device) is often lower on the priority list than things to keep the element "alive and operational".

Your outputs showed good results on each end of the path, meaning that the device was perhaps rate-limiting the TTL expiry traffic or had something else going on.

- Jared
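
As one possible form of the end-to-end test recommended above, here is a minimal Python sketch that times repeated TCP connections to the service itself, which exercises the forwarding path the application actually uses. It is an illustration only: tcp_probe, summarize and run are hypothetical names, and the thread never says which port the database listens on, so the port in the usage comment is a placeholder.

```python
import socket
import time


def tcp_probe(host, port, timeout=2.0):
    """Return the TCP connect time in seconds, or None if it failed or timed out."""
    start = time.monotonic()
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return time.monotonic() - start
    except OSError:  # includes timeouts and refused connections
        return None


def summarize(samples):
    """Summarize a list of connect times (None marks a failed attempt)."""
    ok = sorted(s for s in samples if s is not None)
    failed = len(samples) - len(ok)
    return {
        "sent": len(samples),
        "failed": failed,
        "loss_pct": 100.0 * failed / len(samples) if samples else 0.0,
        # Upper median: the middle element of the sorted successes.
        "median_ms": 1000 * ok[len(ok) // 2] if ok else None,
        "max_ms": 1000 * ok[-1] if ok else None,
    }


def run(host, port, count=100, interval=0.2):
    """Probe host:port `count` times, pausing `interval` seconds between tries."""
    samples = []
    for _ in range(count):
        samples.append(tcp_probe(host, port))
        time.sleep(interval)
    return summarize(samples)


# Example use (placeholder port; substitute the real database port):
#   print(run("bip.bard.edu", 5432))
```

A high max_ms next to a low median_ms, or a non-zero failed count, would point at intermittent loss or delay on the path (a lost SYN shows up as a long connect time), which is closer to what the database users experience than missing TTL-expired replies from a middle hop.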
On Fri, Sep 16, 2011 at 2:42 PM, Steve Bohrer <skbohrer@simons-rock.edu> wrote:
Can I expect that backbone routers should never give me timeouts on a traceroute through them, so, lots of asterisks from these systems indicate a packet loss problem that needs to be fixed?
something inside the router has to make the icmp ttl-expired (time-exceeded) message, right?

perhaps that thing is rate-limited (in hardware/software) so that a line-rate flood of ttl=1 packets won't induce an outbound dos attack effect?

perhaps that is a shared resource among all of the ports on the pic/card/chassis?

perhaps the function that does this does more than just make ttl-expired? (other error codes or other ancillary functions)
Or, are these traceroute asterisks essentially meaningless, and should be expected on any busy link?
think router, not link, but... it's probably less important that you don't see ttl-expired messages than that you see no packet loss or ill effects with the protocols you care about (ping? http? smtp?)

it's also possible that the destination has requested gblx to filter udp toward it (depending on what sort of a day they are having and how much fun gblx wants to incur)

-chris
On 9/16/11 11:42 , Steve Bohrer wrote:
My general question is "what meaning do I give to lossy traceroutes, even when pings show no problem."
Can I expect that backbone routers should never give me timeouts on a traceroute through them, so, lots of asterisks from these systems indicate a packet loss problem that needs to be fixed?
generation of icmp exceptions, e.g. ttl exceeded, can be (and is) deliberately rate limited; such traffic is sent up to the control plane. if a router is passing traffic loss-free through the forwarding plane, then you can conclude that it's loss-free through the forwarding plane.
Or, are these traceroute asterisks essentially meaningless, and should be expected on any busy link?
more like it's not trivial to conclude what they mean.
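
To make the rate-limiting point concrete, here is a toy Python simulation of a token-bucket limiter on TTL-exceeded generation. It is only a sketch of the general idea: real routers implement this in vendor-specific ways that the thread does not describe, and the rate and burst values below are invented for the example.

```python
def simulate_icmp_rate_limit(arrival_times, rate_per_sec, burst):
    """Return one boolean per expiring packet: True if a TTL-exceeded
    reply is generated, False if the limiter suppresses it.

    arrival_times must be in non-decreasing order (seconds).
    """
    tokens = float(burst)  # bucket starts full
    last = None
    answered = []
    for t in arrival_times:
        if last is not None:
            # Refill for the elapsed time, never beyond the burst size.
            tokens = min(float(burst), tokens + (t - last) * rate_per_sec)
        last = t
        if tokens >= 1.0:
            tokens -= 1.0
            answered.append(True)
        else:
            answered.append(False)  # shows up as "*" in traceroute
    return answered


# Ten TTL-expiring packets (from any mix of senders) arrive at once,
# then one more two seconds later. Values are made up for illustration.
arrivals = [0.0] * 10 + [2.0]
replies = simulate_icmp_rate_limit(arrivals, rate_per_sec=1.0, burst=4)
print(replies.count(False), "of", len(replies), "probes got no reply")
```

The limiter here is shared by everyone whose packets expire at that router, so other people's traceroutes can use up the budget before yours arrives. Packets that are merely forwarded never touch it, which is consistent with clean pings through a hop that shows asterisks.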
participants (4)
- Christopher Morrow
- Jared Mauch
- Joel jaeggli
- Steve Bohrer