Phantom packet loss shown when using pathping with asynchronous routing, although there is no real loss
Hello colleagues,
Maybe one of you can help me understand the phenomenon of packet loss when using asynchronous routing?
I have customers who are complaining about packet loss. They are providing me with MTRs and pathpings (pathping is a sort of traceroute that pings every hop it sees several times; it comes with Windows XP) that show the loss starting at my routers and ending at their server (the last hop). All of these users come from a dial-up network, and the path from them to our servers goes via a different carrier than the one we use to route the traffic back to the dial-up users. The interesting thing is that there is no loss at all when the users run a plain ping instead of pathping/mtr, or when I run a ping or even an mtr from my server towards the dial-up customer.
The nasty thing is that there is de facto NO LOSS on the line, but the users are seeing some sort of phantom loss.
The problem immediately disappears when I switch the return path to the same carrier as the forward path, so that we have synchronous routing again.
My assumption is that pathping and mtr somehow get confused by the ICMP messages due to wrong timing or something like that. Any ideas?
Thanks,
Gunther
On 6-Jun-2006, at 08:19, Gunther Stammwitz wrote:
I have customers who are complaining about packet loss and they are providing me with MTRs and pathpings (that's some sort of traceroute that pings every hop it sees several times; comes with Windows XP)
(If it comes with Windows XP, then that sounds interesting yet surprising; it's more usually found at <http://www.bitwizard.nl/mtr/>.)
[...]
The nasty thing is that there is de facto NO LOSS on the line, but the users are seeing some sort of phantom loss.
The starting point for any investigation like this is to compare the traceroute that apparently shows loss or other problems with traceroutes from strategic points in the path back towards the source. If congestion is the cause of the concern, comparing traceroutes in both directions will usually help find it. If there is no congestion problem, or the apparent problem is unusual latency or loss in the numbers mtr displays for particular routers in the path, then mtr's ICMP echo requests towards the control planes of those routers are probably being deliberately rate-limited by the routers' operators.
Joe
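The per-hop reading Joe describes can be sketched as a small triage function. This is a minimal sketch, assuming the loss percentages have already been copied out of an mtr or pathping report, one value per hop in path order; the function name `classify_hops` and its labels are hypothetical. The heuristic: if a hop shows loss but some later hop answers every probe, that router is evidently forwarding traffic and only its own ICMP replies are going missing, which points to deliberate rate limiting. Loss that carries on to the final hop, as in Gunther's reports, is not explained this way.

```python
from typing import List


def classify_hops(loss_pct: List[float], threshold: float = 0.0) -> List[str]:
    """Label each hop 'ok', 'likely-rate-limited' or 'persistent'.

    Heuristic only (an assumption, not something mtr itself reports):
    loss that disappears again further along the path is most likely the
    router declining to answer ICMP aimed at itself, not forwarding loss.
    """
    labels = []
    for i, loss in enumerate(loss_pct):
        if loss <= threshold:
            labels.append("ok")
        elif any(later <= threshold for later in loss_pct[i + 1:]):
            # A later hop answers (nearly) every probe, so packets pass
            # through this router; only its own replies go missing.
            labels.append("likely-rate-limited")
        else:
            # Loss never clears up again before the end of the path.
            labels.append("persistent")
    return labels


if __name__ == "__main__":
    print(classify_hops([0, 40, 0, 0]))
    # ['ok', 'likely-rate-limited', 'ok', 'ok']
```

A report like `[0, 0, 20, 25, 30]` comes back as persistent from the third hop onwards, which is exactly the case where the further checks suggested later in the thread are needed.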
"Gunther Stammwitz" <gstammw@gmx.net> wrote:
[...]
Try varying the mtr interval, such as "-i .1" (you must be root for intervals below 1 second). Does the packet loss significantly increase with this faster mtr? Try a slower "-i 10". Does the packet loss significantly decrease or go away? If the answer to both questions is yes, then I would suspect ICMP rate limiting.
You could also try varying the speed of ping. Windows is pretty limited, but on Unix you can do things like 0.1-second intervals ("-i .1" as root). Does a faster ping trigger this apparent loss? If so, suspect ICMP rate limiting.
The only part that I don't get is that you can mtr to him without packet loss. Although the path in between may be different, the final-hop packet loss should exactly equal what he sees when running mtr to you. A round trip is a round trip, and results should be identical regardless of who originates. I can't think of any way this would be different unless echo and echo-reply were being rate-limited independently.
My home ISP (apartment Ethernet "T1" service, which is actually multiple T3s) has a Packeteer or something along those lines. If I use ping, everything is fine since it goes so slowly. If I use MTR, it works fine for the first few seconds, then sees >90% packet loss on all hops from then on, once the rate limiter's burst bucket runs dry. Of course, TCP still sees no packet loss even when mtr is seeing this heavy rate-limited loss...
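Matt's "burst bucket" description corresponds to a token-bucket limiter. The simulation below is a minimal sketch, assuming a hypothetical router that answers on average at most 1 ICMP echo per second with a burst of 10 (both numbers invented for illustration, not taken from any real device) and that every probe is either answered or silently dropped. It shows why a slow ping sees no loss while a fast mtr starts clean and then loses most of its probes.

```python
class TokenBucket:
    """Token-bucket limiter of the kind a router may apply to ICMP it answers.

    rate:  tokens added per second (sustained replies allowed per second)
    burst: bucket size (replies allowed back to back before limiting starts)
    """

    def __init__(self, rate: float, burst: float) -> None:
        self.rate = rate
        self.burst = burst
        self.tokens = burst  # start with a full bucket
        self.last = 0.0      # time of the previous probe, in seconds

    def allow(self, now: float) -> bool:
        # Refill in proportion to the time elapsed since the previous probe,
        # never beyond the bucket size.
        self.tokens = min(self.burst, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True   # reply sent
        return False      # probe silently dropped: "loss" as seen by mtr


def simulate(interval: float, count: int, rate: float = 1.0, burst: float = 10.0) -> list:
    """Send `count` probes `interval` seconds apart; return True/False per probe."""
    bucket = TokenBucket(rate, burst)
    return [bucket.allow(i * interval) for i in range(count)]


def loss_fraction(answered: list) -> float:
    """Fraction of probes that got no reply."""
    return 1.0 - sum(answered) / len(answered)


if __name__ == "__main__":
    slow = simulate(interval=1.0, count=60)   # roughly a default ping
    fast = simulate(interval=0.1, count=600)  # roughly mtr with "-i .1"
    print(f"1 s interval:   {loss_fraction(slow):.0%} loss")
    print(f"0.1 s interval: {loss_fraction(fast):.0%} loss, "
          f"first 10 answered: {all(fast[:10])}")
```

Rerunning `simulate` with an interval of 10.0 (the equivalent of "-i 10") also shows no loss, which is the pattern Matt suggests looking for. The same model hints at his open question: a limiter that sits on only one of two asymmetric paths would affect probes originated from one end and not the other.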
The only part that I don't get is that you can mtr to him without packet loss. Although the path in between may be different, the final-hop packet loss should exactly equal what he sees when running mtr to you. A round trip is a round trip, and results should be identical regardless of who originates. I can't think of any way this would be different unless echo and echo-reply were being rate-limited independently.
If the tests were run at different times, then the packet loss would be different. Perhaps the customer runs the tests during his busy period, when he is concerned about making sure there is no delay. Then, later in the day, after his busy period is over, he takes the time to contact his ISP. The ISP then runs some tests which show no packet loss at all. To be sure this is not happening, synchronize the tests and run them simultaneously.
Try tcptraceroute, because it more accurately reflects the traffic that is flowing: http://michael.toren.net/code/tcptraceroute/
http://tracetcp.sourceforge.net/ is a similar Windows tool.
The open source tool LFT can be built to run on Windows under Cygwin (http://pwhois.org/lft/), but they have this warning on their page:
"Many people have complained about various problems on the Windows platform. Both LFT and the WhoB client compile and run well under Cygwin environments on Windows. Unfortunately, Microsoft's changes to the Windows IP stack (as of XP Service Pack 2) reduced their raw socket functionality significantly as part of their security bolstering process. These changes have effectively stopped LFT from working properly while using TCP. LFT's UDP tracing and other advanced features still work properly. For more information on Windows raw sockets, consult www.microsoft.com/technet/prodtechnol/winxppro/maintain/sp2netwk.mspx#EIAA"
This may have nothing to do with your MTR issue, but it does make one wonder whether a Windows machine is safe to use for performance testing. In any case, the LFT people think that their non-TCP features still work properly on Windows, and this is a tool that you can also run on your end. Worth a try?
--Michael Dillon
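The idea behind tcptraceroute, probing with the same kind of packets the application uses instead of ICMP echo, can be approximated end to end with plain TCP handshakes. This is a minimal sketch, not tcptraceroute itself: it does not vary the TTL and so reports nothing per hop (that needs raw sockets), only the connect time to the far end or a failure. The function name `tcp_connect_probe` and its defaults are hypothetical. Running it at the same moment as the customer's mtr or pathping also follows Michael's advice to synchronize the tests.

```python
import socket
import time
from typing import List, Optional


def tcp_connect_probe(host: str, port: int, count: int = 5,
                      timeout: float = 2.0,
                      interval: float = 1.0) -> List[Optional[float]]:
    """Time `count` TCP handshakes to host:port.

    Returns one entry per attempt: the connect time in milliseconds, or
    None if the attempt failed. Note that None lumps together timeouts
    (possible loss) and refusals (the port is simply closed).
    """
    results: List[Optional[float]] = []
    for i in range(count):
        start = time.perf_counter()
        try:
            # A completed three-way handshake means SYN and SYN-ACK both
            # crossed the forward and return paths.
            with socket.create_connection((host, port), timeout=timeout):
                results.append((time.perf_counter() - start) * 1000.0)
        except OSError:  # includes socket.timeout and ConnectionRefusedError
            results.append(None)
        if i < count - 1:
            time.sleep(interval)
    return results


if __name__ == "__main__":
    # "server.example.net" is a placeholder; use the customer's real target.
    print(tcp_connect_probe("server.example.net", 80, count=3))
```

If the handshakes keep completing while mtr reports heavy loss towards the same server at the same time, the loss is more likely confined to how ICMP is handled than to the forwarding path itself.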
On Tue, Jun 06, 2006 at 05:19:33PM +0200, Gunther Stammwitz wrote:
[...]
I can't tell you what is going on. But I can ask: (a) why are you doing asymmetrical routing in the first place? And (b) is it possible that the Microsoft versions of these tools are reporting errors BECAUSE of the asynchronous routing?
-- 
Joe Yao
-----------------------------------------------------------------------
This message is not an official statement of OSIS Center policies.
On 7-Jun-2006, at 12:35, Joseph S D Yao wrote:
I can't tell you what is going on. But I can ask, (a) why are you doing asymmetrical routing in the first place?
For any non-trivial path, it seems to me that asymmetry in forward and return paths is normal. Symmetrical paths are the exception.
From another angle, how can anybody hope to ensure that all forward and return paths are identical when the only exit under their control is the one on the outbound path, at their own border?
Joe
On Wed, Jun 07, 2006 at 12:49:04PM -0700, Joe Abley wrote:
On 7-Jun-2006, at 12:35, Joseph S D Yao wrote:
I can't tell you what is going on. But I can ask, (a) why are you doing asymmetrical routing in the first place?
For any non-trivial path, it seems to me that asymmetry in forward and return paths is normal. Symmetrical paths are the exception.
From another angle, how can anybody hope to ensure that all forward and return paths are identical when the only exit under their control is the one on the outbound path, at their own border?
Joe
If this is for their customers, it wasn't clear that the path went outside their zone of control. I did wonder.
-- 
Joe Yao
participants (5)
- Gunther Stammwitz
- Joe Abley
- Joseph S D Yao
- Matt Buford
- Michael.Dillon@btradianz.com