packet reordering at exchange points
packet reordering at MAE East was extremely common a few years ago. Does anyone have information whether this is still happening?
more to the point, does anybody still care about packet reordering at exchange points? we (paix) go through significant effort to prevent it, and interswitch trunking with round robin would be a lot easier. are we chasing an urban legend here, or would reordering still cause pain?
On Mon, 8 Apr 2002, Paul Vixie wrote:
packet reordering at MAE East was extremely common a few years ago. Does anyone have information whether this is still happening?
more to the point, does anybody still care about packet reordering at exchange points? we (paix) go through significant effort to prevent it, and interswitch trunking with round robin would be a lot easier. are we chasing an urban legend here, or would reordering still cause pain?
Packet re-ordering would still cause pain if it started re-appearing at high levels.
On Mon, Apr 08, 2002 at 02:18:52PM -0700, Paul Vixie wrote:
packet reordering at MAE East was extremely common a few years ago. Does anyone have information whether this is still happening?
more to the point, does anybody still care about packet reordering at exchange points? we (paix) go through significant effort to prevent it, and interswitch trunking with round robin would be a lot easier. are we chasing an urban legend here, or would reordering still cause pain?
Set up a FreeBSD system with a dummynet pipe, do a probability match on 50% of the packets, and send them through a pipe with a few more bytes of queueing and 1 ms more delay than the rest. Then test the performance of TCP across that link. There is a good paper on the subject that was published by ACM in January: http://citeseer.nj.nec.com/450712.html

So just how common is packet reordering today? Well, I took a quick peek at a few machines which I don't have any reason to believe are out of the ordinary, and they all pretty much come out about the same:

32896155 packets received
9961197 acks (for 2309956346 bytes)
96322 duplicate acks
0 acks for unsent data
17328137 packets (2667939981 bytes) received in-sequence
10755 completely duplicate packets (1803069 bytes)
19 old duplicate packets
375 packets with some dup. data (38297 bytes duped)
53862 out-of-order packets (75435307 bytes)

0.3% of non-ACK packets by packet count were received out of order, or 2.8% by bytes.

--
Richard A Steenbergen <ras@e-gerbil.net>  http://www.e-gerbil.net/ras
PGP Key ID: 0x138EA177 (67 29 D7 BC E8 18 3E DA B2 46 B3 D8 14 36 FE B6)
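[Editor's note: as a quick sketch, the percentages Richard quotes can be recomputed from the counters above; the ratios here are taken against the in-sequence counts, which reproduces his figures.]

```python
# Recompute the out-of-order ratios from the netstat TCP counters above.
in_seq_pkts = 17328137       # packets received in-sequence
in_seq_bytes = 2667939981    # bytes received in-sequence
ooo_pkts = 53862             # out-of-order packets
ooo_bytes = 75435307         # out-of-order bytes

pct_pkts = 100.0 * ooo_pkts / in_seq_pkts
pct_bytes = 100.0 * ooo_bytes / in_seq_bytes
print("%.1f%% out of order by packets, %.1f%% by bytes" % (pct_pkts, pct_bytes))
# prints: 0.3% out of order by packets, 2.8% by bytes
```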
### On Mon, 08 Apr 2002 14:18:52 -0700, Paul Vixie <paul@vix.com> casually
### decided to expound upon nanog@merit.edu the following thoughts about
### "packet reordering at exchange points":

PV> > packet reordering at MAE East was extremely common a few years ago. Does
PV> > anyone have information whether this is still happening?
PV>
PV> more to the point, does anybody still care about packet reordering at
PV> exchange points? we (paix) go through significant effort to prevent it,
PV> and interswitch trunking with round robin would be a lot easier. are
PV> we chasing an urban legend here, or would reordering still cause pain?

I'd imagine that anyone passing realtime streams, Mbone or VOIP (anyone out there routing their VOIP traffic across an IXP?), would start having issues with the resulting jitter.

--
Jake Khuon <khuon@NEEBU.Net>
Packet Plumber, Network Engineers for Effective Bandwidth Utilisation
N E T W O R K S
On Mon, 8 Apr 2002, Paul Vixie wrote:
packet reordering at MAE East was extremely common a few years ago. Does anyone have information whether this is still happening?
more to the point, does anybody still care about packet reordering at exchange points? we (paix) go through significant effort to prevent it, and interswitch trunking with round robin would be a lot easier. are we chasing an urban legend here, or would reordering still cause pain?
Obviously some applications care. In addition to the examples mentioned earlier: out of order packets aren't really good for TCP header compression, so they will slow down data transfers over slow links. But how is packet reordering on two parallel gigabit interfaces ever going to translate into reordered packets for individual streams? Packets for streams that are subject to header compression or for voice over IP or even Mbone are nearly always transmitted at relatively large intervals, so they can't travel down parallel paths simultaneously.
Date: Tue, 9 Apr 2002 00:32:50 +0200 (CEST) From: Iljitsch van Beijnum <iljitsch@muada.com>
Obviously some applications care. In addition to the examples mentioned earlier: out of order packets aren't really good for TCP header compression, so they will slow down data transfers over slow links.
How about ACK? I think that's the point that Richard was making... even with SACK, out-of-order packets can be an issue.
But how is packet reordering on two parallel gigabit interfaces ever going to translate into reordered packets for individual streams? Packets
Queue depths. Varying paths. IIRC, 802.3ad DOES NOT allow round robin distribution; it uses hashes. Sure, hashed distribution isn't perfect. But it's better than "perfect" distribution with added latency and/or retransmits out the wazoo.
for streams that are subject to header compression or for voice over IP or even Mbone are nearly always transmitted at relatively large intervals, so they can't travel down parallel paths simultaneously.
What MTU? Compare to jitter multiplied by line rate.

--
Eddy
Brotsman & Dreger, Inc. - EverQuick Internet Division
Phone: +1 (316) 794-8922 Wichita/(Inter)national
Phone: +1 (785) 865-5885 Lawrence
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Date: Mon, 21 May 2001 11:23:58 +0000 (GMT)
From: A Trap <blacklist@brics.com>
To: blacklist@brics.com
Subject: Please ignore this portion of my mail signature.

These last few lines are a trap for address-harvesting spambots. Do NOT send mail to <blacklist@brics.com>, or you are likely to be blocked.
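[Editor's note: a minimal sketch of the hash-based distribution Eddy describes. This is illustrative only; real 802.3ad hardware uses vendor-specific hash functions, and CRC32 over the MAC pair stands in for them here.]

```python
# Illustrative hash-based link selection for a 4-link trunk.
# A flow's src/dst pair always maps to the same member link, so that
# flow's packets cannot be reordered across links -- but a few heavy
# pairs can also all land on the same link (the IX problem below).
import zlib

NUM_LINKS = 4

def pick_link(src_mac, dst_mac):
    return zlib.crc32((src_mac + dst_mac).encode()) % NUM_LINKS

# Same pair, same link, every time:
choices = {pick_link("00:11:22:33:44:55", "66:77:88:99:aa:bb") for _ in range(1000)}
print(len(choices))  # prints 1
```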
On Mon, Apr 08, 2002 at 11:19:56PM +0000, E.B. Dreger wrote:
But how is packet reordering on two parallel gigabit interfaces ever going to translate into reordered packets for individual streams? Packets
Queue depths. Varying paths. IIRC, 802.3ad DOES NOT allow round robin distribution; it uses hashes. Sure, hashed distribution isn't perfect. But it's better than "perfect" distribution with added latency and/or retransmits out the wazoo.
You don't even need varying paths to create a desynch; all you need is varying packet sizes.

--
Richard A Steenbergen <ras@e-gerbil.net>  http://www.e-gerbil.net/ras
Date: Mon, 8 Apr 2002 19:45:16 -0400 From: Richard A Steenbergen <ras@e-gerbil.net>
Queue depths. Varying paths. IIRC, 802.3ad DOES NOT allow round robin distribution; it uses hashes. Sure, hashed distribution isn't perfect. But it's better than "perfect" distribution with added latency and/or retransmits out the wazoo.
You don't even need varying paths to create a desynch, all you need is varying size packets.
Quite true. My list wasn't meant to be all-inclusive... bad wording on my part.

--
Eddy
On Mon, Apr 08, 2002 at 11:19:56PM +0000, E.B. Dreger wrote:
Date: Tue, 9 Apr 2002 00:32:50 +0200 (CEST) From: Iljitsch van Beijnum <iljitsch@muada.com>
But how is packet reordering on two parallel gigabit interfaces ever going to translate into reordered packets for individual streams? Packets
Queue depths. Varying paths.
We're talking parallel GigE links between switches which are located close to each other. And we're talking real-life applications, which perhaps send 100 pps in one stream, which means that you need ~10 ms difference in transmission delay between the individual links before the risk of out-of-order packets for a given stream arises.
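[Editor's note: the arithmetic behind the ~10 ms figure, as a quick sketch; the 1500-byte frame size used for the GigE serialization-time comparison is an assumption, not from the post.]

```python
# A 100 pps stream has 10 ms between packets; consecutive packets of that
# stream can only be reordered if one parallel link lags the other by more
# than that gap.
pps = 100
gap_ms = 1000.0 / pps                   # 10 ms between packets

# For scale: serializing one 1500-byte frame on GigE takes ~12 us, so
# the links would need a skew of roughly 800+ queued full-size frames.
frame_time_us = 1500 * 8 / 1e9 * 1e6    # 12 us
frames_of_skew = gap_ms * 1000 / frame_time_us
print(gap_ms, frame_time_us, round(frames_of_skew))  # prints 10.0 12.0 833
```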
IIRC, 802.3ad DOES NOT allow round robin distribution;
That is not what we're talking about; we're talking about the impact of doing it.
it uses hashes. Sure, hashed distribution isn't perfect.
It's broken in an IX environment where you have few src/dst pairs, and where a single src/dst pair can easily use several hundred Mbps; if you have a few of those going over the same link due to the hashing algorithm, you will have problems.

A large IX in Europe has this exact problem on its Foundry switches, which don't support round robin, and is currently being forced to move to 10 GigE due to this very fact.

/Jesper

--
Jesper Skriver, jesper(at)skriver(dot)dk - CCIE #5456
Work:    Network manager @ AS3292 (Tele Danmark DataNetworks)
Private: FreeBSD committer @ AS2109 (A much smaller network ;-)

One Unix to rule them all, One Resolver to find them,
One IP to bring them all and in the zone to bind them.
Date: Tue, 9 Apr 2002 07:13:38 +0200 From: Jesper Skriver <jesper@skriver.dk>
We're talking parallel GigE links between switches which are located close to each other.
And we're talking real life applications, which perhaps sends 100 pps in one stream, which means that you need to have ~ 10 ms different transmission delay on the individual links, before the risk of out of order packets for a given stream arise.
Hmmmm. You're right. I lost sight of the original thread... GigE inter-switch trunking at PAIX. In that case, congestion _should_ be low, and there shouldn't be much queue depth. But this _does_ bank on current "real world" behavior. If endpoints ever approach GigE speeds (of course requiring "low enough" latency and "big enough" windows)... Then again, last mile is so slow that we're probably a ways away from that happening.
IIRC, 802.3ad DOES NOT allow round robin distribution;
That is not what we're talking about; we're talking about the impact of doing it.
Yes, I was incomplete in that part. My intended point was that the IEEE at least [seemingly] found round robin inappropriate for the general case.
it uses hashes. Sure, hashed distribution isn't perfect.
It's broken in an IX environment where you have few src/dst pairs, and where a single src/dst pair can easily use several hundred Mbps; if you have a few of those going over the same link due to the hashing algorithm, you will have problems.
In the [extreme] degenerate case, yes, one goes from N links to 1 effective link.
A large IX in Europe has this exact problem on its Foundry switches, which don't support round robin, and is currently being forced to move to
Can you state how many participants? With N x GigE, what sort of [im]balance is there over the N lines? Of course, I'd hope that individual heavy pairs would establish private interconnects instead of using public switch fabric, but I know that's not always { an option | done | ... }.
10 GigE due to this very fact.
I'm going to have to play with ISL RR...
/Jesper
--
Eddy
On Tue, Apr 09, 2002 at 06:00:31PM +0000, E.B. Dreger wrote:
A large IX in Europe has this exact problem on its Foundry switches, which don't support round robin, and is currently being forced to move to
Can you state how many participants?
100+
With N x GigE, what sort of [im]balance is there over the N lines?
A few links are overloaded, while others carry practically no traffic.
Of course, I'd hope that individual heavy pairs would establish private interconnects instead of using public switch fabric, but I know that's not always { an option | done | ... }.
If A and B exchange say 200 Mbps of traffic, moving to a PNI is for sure an option, but if both have GigE connections to the shared infrastructure with spare capacity, both can expect the IX to handle that traffic.

/Jesper
Thus spake "Iljitsch van Beijnum" <iljitsch@muada.com>
But how is packet reordering on two parallel gigabit interfaces ever going to translate into reordered packets for individual streams?
Think of a large FTP between two well-connected machines. Such flows tend to generate periodic clumps of packets; split one of these clumps across two pipes and the clump will arrive out of order at the other end. The resulting mess will create a clump of retransmissions, then another bigger clump of new data, ...
Packets for streams that are subject to header compression or for voice over IP or even Mbone are nearly always transmitted at relatively large intervals, so they can't travel down parallel paths simultaneously.
RTP reordering isn't a problem in my experience, probably since RTP has an inherent resequencing mechanism. The problem with RTP is that if the packets don't follow a deterministic path, the header compression scheme is severely trashed. Also, non-deterministic paths tend to increase jitter, requiring more buffering at endpoints.

S
On Mon, 8 Apr 2002, Stephen Sprunk wrote:
Thus spake "Iljitsch van Beijnum" <iljitsch@muada.com>
But how is packet reordering on two parallel gigabit interfaces ever going to translate into reordered packets for individual streams?
Think of a large FTP between two well-connected machines. Such flows tend to generate periodic clumps of packets; split one of these clumps across two pipes and the clump will arrive out of order at the other end. The resulting mess will create a clump of retransmissions, then another bigger clump of new data, ...
I don't think it will be this bad, even if hosts are connected at GigE and the trunk is 2 x GigE. In this case, a (delayed) ACK will usually acknowledge 2 segments, so it will trigger transmission of two new segments. These will arrive back to back at the router/switch doing the load balancing. Since there is obviously need for more than 1 Gbit worth of bandwidth, it is likely the average queue size is at least close to 1 (= ~65% line use) or even higher. If this is the case, there is a _chance_ the second packet gains a full packet time over the first and arrives first at the destination. However, this is NOT especially likely if both packets are the same size: the _average_ queue sizes will be the same, so in half the cases the first packet gains an even bigger advance over the second, and only in a fraction of half the cases does the second packet gain enough over the first to pass it.

And then, the destination host still only sees a single packet coming in out of order, which isn't enough to trigger fast retransmit. You need to load balance over more than two connections to trigger unnecessary fast retransmit (over two lines, packet #3 isn't going to pass packet #1), AND you need to send more than two packets back to back. Also, you need to be at the same speed as the load balanced lines, otherwise your packet train gets split up by traffic from other interfaces or idle time on the line.

And _then_, if all of this happens, all the retransmitted data is left of window. I'm not even sure if those packets generate an ACK, and if they do, whether the sender takes any action on this ACK. If this triggers another round of fast retransmit, the FR implementation should be considered broken, IMO.
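[Editor's note: a toy simulation of the two-link case described above. This is an editorial construction, not from the thread; the exponential queueing delays with a mean of one packet time are an arbitrary modelling assumption.]

```python
# Two equal-size packets sent back to back, round-robined over two links
# with independent random queueing delays. Packet 2 departs one packet
# time after packet 1, so it only passes packet 1 when its queue happens
# to be more than a full packet time shorter.
import random

random.seed(1)
PKT_TIME = 12e-6                 # 1500 bytes at GigE, ~12 us
trials = 100000
reordered = 0
for _ in range(trials):
    q1 = random.expovariate(1.0 / PKT_TIME)  # queueing delay, link 1
    q2 = random.expovariate(1.0 / PKT_TIME)  # queueing delay, link 2
    if PKT_TIME + q2 < q1:                   # packet 2 arrives first
        reordered += 1
print("%.1f%% of pairs reordered" % (100.0 * reordered / trials))
```

Under these assumptions only a minority of pairs reorder, and, as argued above, a single out-of-order packet is not enough to trigger fast retransmit.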
Packets for streams that are subject to header compression or for voice over IP or even Mbone are nearly always transmitted at relatively large intervals, so they can't travel down parallel paths simultaneously.
RTP reordering isn't a problem in my experience, probably since RTP has an inherent resequencing mechanism.
My point is that real-time protocols will not see reordering unless they are using up nearly the full line speed or there is congestion, because these protocols don't send out packets back to back like TCP sometimes does. How big are VoIP packets? Even with an 80-byte payload you get 100 packets per second = 10 ms between packets, which is more than 80 packet times for GigE = congestion. And if there is congestion, all performance bets are off.

It seems to me that spending (CPU) time and money to do more complex load balancing than per-packet round robin in order to avoid reordering only helps some people with GigE-connected hosts some of the time. Using this time or money to overcome congestion is probably a better investment.

PS. For everyone looking at their netstat -p tcp output: packet loss also counts towards the out-of-order packets, so it is hard to get the real out-of-order figures.

PS2. Isn't it annoying to have to think about layer 4 to build layer 2 stuff?
On Tue, 09 Apr 2002 00:32:50 +0200, Iljitsch van Beijnum said:
Obviously some applications care. In addition to the examples mentioned earlier: out of order packets aren't really good for TCP header compression, so they will slow down data transfers over slow links.
On the other hand, wouldn't this sort of slow link tend to close down the TCP window and thus tend to minimize the effect? A quick back-of-envelope calculation gives me a 56K modem line only opening the window up to 10K or so - so there should only be 5-6 1500 byte packets in flight at a given time, so the chances of *that flow* getting out-of-order at a core router that's flipping 200K packets/sec are fairly low.

Not saying it doesn't happen, or that it isn't a problem when it does - but I'm going to wait till somebody posts a 'netstat' output showing that it is in fact an issue for some environments...

--
Valdis Kletnieks
Computer Systems Senior Engineer
Virginia Tech
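[Editor's note: one way to reconstruct the back-of-envelope above is as a bandwidth-delay product. The ~1.4 s RTT here is an assumption chosen to match the quoted 10K figure, not something stated in the post.]

```python
# In-flight TCP window on a 56k link, estimated as bandwidth * RTT.
link_bps = 56000
rtt_s = 1.4                         # assumed high-latency modem RTT
bdp_bytes = link_bps / 8.0 * rtt_s  # bytes the link can keep in flight
packets_in_flight = bdp_bytes / 1500.0
print("%.0f bytes, %.1f packets in flight" % (bdp_bytes, packets_in_flight))
# roughly 10K and ~6 full-size packets, in line with the estimate above
```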
On Mon, Apr 08, 2002 at 02:18:52PM -0700, Paul Vixie wrote:
packet reordering at MAE East was extremely common a few years ago. Does anyone have information whether this is still happening?
more to the point, does anybody still care about packet reordering at exchange points? we (paix) go through significant effort to prevent it, and interswitch trunking with round robin would be a lot easier. are we chasing an urban legend here, or would reordering still cause pain?
LINX uses Extreme switches with round robin load-sharing among 4 x GigE and 8 x GigE trunks; no problems have been noted.

/Jesper
participants (9)
- E.B. Dreger
- Iljitsch van Beijnum
- Jake Khuon
- Jesper Skriver
- Paul Vixie
- Richard A Steenbergen
- Sean Donelan
- Stephen Sprunk
- Valdis.Kletnieks@vt.edu