Re: packet reordering at exchange points
Hmmmm. You're right. I lost sight of the original thread... GigE inter-switch trunking at PAIX. In that case, congestion _should_ be low, and there shouldn't be much queue depth.
indeed, this is the case. we keep a lot of headroom on those trunks.
But this _does_ bank on current "real world" behavior. If endpoints ever approach GigE speeds (of course requiring "low enough" latency and "big enough" windows)...
Then again, last mile is so slow that we're probably a ways away from that happening.
my expectation is that when the last mile goes to 622Mb/s or 1000Mb/s, exchange points will all be operating at 10Gb/s, and interswitch trunks at exchange points will be multiples of 10Gb/s.
Of course, I'd hope that individual heavy pairs would establish private interconnects instead of using public switch fabric, but I know that's not always { an option | done | ... }.
individual heavy pairs do this, but as a long term response to growth, not as a short term response to congestion. in the short term, the exchange point switch can't present congestion. it's just not on the table at all.
Date: Tue, 09 Apr 2002 11:16:24 -0700
From: Paul Vixie <paul@vix.com>
my expectation is that when the last mile goes to 622Mb/s or 1000Mb/s, exchange points will all be operating at 10Gb/s, and interswitch trunks at exchange points will be multiples of 10Gb/s.
I guess Moore's Law comes into play again. One will need some pretty hefty TCP buffers for a single stream to hit those rates, unless latency _really_ drops. (Distributed CDNs, anyone? Speed of light ain't getting faster any time soon...) Of course, IMHO I expect DCDNs to become increasingly common... but that topic would warrant a thread fork.

Looks like RR ISLs are feasible between GigE+ core switches...

--
Eddy
Brotsman & Dreger, Inc. - EverQuick Internet Division
On Tue, Apr 09, 2002 at 07:18:35PM +0000, E.B. Dreger wrote:
Date: Tue, 09 Apr 2002 11:16:24 -0700
From: Paul Vixie <paul@vix.com>
my expectation is that when the last mile goes to 622Mb/s or 1000Mb/s, exchange points will all be operating at 10Gb/s, and interswitch trunks at exchange points will be multiples of 10Gb/s.
I guess Moore's Law comes into play again. One will need some pretty hefty TCP buffers for a single stream to hit those rates, unless latency _really_ drops. (Distributed CDNs, anyone? Speed of light ain't getting faster any time soon...)
To transfer 1Gb/s across 100ms I need to be prepared to buffer at least 25MB of data. According to pricewatch, I can pick up a high density 512MB PC133 DIMM for $70, and use $3.50 of it to catch that TCP stream. Throw in $36 for a GigE NIC, and we're ready to go for under $40. Yeah, I know that's the cheapest garbage you can get, but this is just to prove a point. :) I might only be able to get 800Mbit across a 32-bit/33MHz PCI bus, but whatever.

The problem isn't the lack of hardware, it's a lack of good software (both on the receiving side and, probably more importantly, the sending side); a lot of bad standards coming back to bite us (1500-byte packets are about as far from efficient as you can get); a lack of people with enough know-how to actually build a network that can transport it all (heck, they can't even build decent networks to deliver 10Mbit/s -- @Home was the closest); and just a general lack of things for end users to do with that much bandwidth even if they got it.

--
Richard A Steenbergen <ras@e-gerbil.net>
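A minimal sketch of the arithmetic behind that 25MB figure, assuming the 1 Gb/s and 100 ms values from the message above and the common rule of thumb of sizing socket buffers at roughly twice the bandwidth-delay product:

    /* Bandwidth-delay product for one TCP stream; illustrative only. */
    #include <stdio.h>

    int main(void)
    {
        double rate_bps = 1e9;    /* 1 Gb/s line rate       */
        double rtt_s    = 0.100;  /* 100 ms round-trip time */

        double bdp = rate_bps * rtt_s / 8.0;          /* bytes in flight */
        printf("BDP:    %.1f MB\n", bdp / 1e6);       /* 12.5 MB */
        printf("2x BDP: %.1f MB\n", 2.0 * bdp / 1e6); /* 25.0 MB */
        return 0;
    }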
Date: Tue, 9 Apr 2002 16:03:53 -0400
From: Richard A Steenbergen <ras@e-gerbil.net>
To transfer 1Gb/s across 100ms I need to be prepared to buffer at least 25MB of data. According to pricewatch, I can pick up a high density 512MB
[ snip ]
The problem isn't the lack of hardware, it's a lack of good software (both
[ snip ]

But how many simultaneous connections? Until TCP stacks start using window autotuning (of which I know you're well aware), we must either use suboptimal windows or chew up ridiculous amounts of memory. Yes, bad software, but still a limit...

It would be nice to allocate a 32MB chunk of RAM for buffers, then dynamically split it between streams. Fragmentation makes that pretty much impossible. OTOH... perhaps that's a reasonable start:

1. Alloc a buffer of size X.
2. Let it be used for Y streams.
3. When we have Y streams, split each stream's "sub-buffer" into Y parts, giving capacity for Y^2 streams.

Aggregate transmission can't exceed line rate. So instead of fixed-size buffers for each stream, perhaps our TOTAL buffer size should remain constant. Use PSC-style autotuning to eke out more capacity/performance, instead of using a fixed value of "Y" or splitting each and every last buffer. (Actually, I need to reread/reexamine the PSC code in case it actually _does_ use a fixed total buffer size.)

This shouldn't be terribly hard to hack into an IP stack...

--
Eddy
Brotsman & Dreger, Inc. - EverQuick Internet Division
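A minimal sketch of that fixed-total-pool idea, assuming an even split and Y = 4 (names and numbers are illustrative, not from any real stack):

    /* Fixed 32MB pool; per-stream windows shrink as streams are added. */
    #include <stdio.h>

    #define POOL_BYTES (32u * 1024u * 1024u)  /* total stays constant */

    static unsigned window_for(unsigned nstreams)
    {
        return nstreams ? POOL_BYTES / nstreams : POOL_BYTES;
    }

    int main(void)
    {
        /* Y = 4: one stream, then Y streams, then Y^2 streams. */
        for (unsigned n = 1; n <= 16; n *= 4)
            printf("%2u streams -> %8u bytes each\n", n, window_for(n));
        return 0;
    }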
On Tue, Apr 09, 2002 at 10:51:27PM +0000, E.B. Dreger wrote:
But how many simultaneous connections? Until TCP stacks start using window autotuning (of which I know you're well aware), we must either use suboptimal windows or chew up ridiculous amounts of memory. Yes, bad software, but still a limit...
That's precisely what I meant by bad software, as well as the server code that pushes the data out in the first place. And for that matter, the receiver side is just as important.
It would be nice to allocate a 32MB chunk of RAM for buffers, then dynamically split it between streams. Fragmentation makes that pretty much impossible.
OTOH... perhaps that's a reasonable start:
1. Alloc a buffer of size X.
2. Let it be used for Y streams.
3. When we have Y streams, split each stream's "sub-buffer" into Y parts, giving capacity for Y^2 streams.
You don't actually allocate the buffers until you have something to put in them, you're just fixing a limit on the maximum you're willing to allocate. The problem comes from the fact that you're fixing the limits on a "per-socket" basis, not on a "total system" basis.
Aggregate transmission can't exceed line rate. So instead of fixed-size buffers for each stream, perhaps our TOTAL buffer size should remain constant.
Use PSC-style autotuning to eke out more capacity/performance, instead of using a fixed value of "Y" or splitting each and every last buffer. (Actually, I need to reread/reexamine the PSC code in case it actually _does_ use a fixed total buffer size.)
This shouldn't be terribly hard to hack into an IP stack...
Actually, here's an even simpler one. Define a global limit for this; something like 32MB would be more than reasonable. Then instead of advertising the space "remaining" in individual socket buffers, advertise the total space remaining in this virtual memory pool. If you overrun your buffer, you might have the other side send you a few unnecessary bytes that you just have to drop, but the situation should correct itself very quickly. I don't think this would be "unfair" to any particular flow, since you've eliminated the concept of one flow "hogging" the socket buffer and leave it to TCP to work out the sharing of the link. Second opinions?

--
Richard A Steenbergen <ras@e-gerbil.net>
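A minimal sketch of that shared-pool idea (the pool accounting and function names are hypothetical, not from any real stack; the DoS guards discussed below would still be needed):

    /* Every socket advertises what's left in one global pool,
     * not its own fixed buffer. Illustrative only. */
    #include <stdio.h>
    #include <stdint.h>

    #define POOL_LIMIT (32u * 1024u * 1024u)  /* global cap */

    static uint32_t pool_used;  /* bytes queued across ALL sockets */

    /* Window to put in the next ACK, identical for every socket. */
    static uint32_t advertised_window(void)
    {
        return POOL_LIMIT - pool_used;
    }

    int main(void)
    {
        pool_used = 0;
        printf("idle system advertises: %u\n", (unsigned)advertised_window());
        pool_used = 30u * 1024u * 1024u;  /* other flows hold 30 MB */
        printf("busy system advertises: %u\n", (unsigned)advertised_window());
        return 0;
    }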
Date: Tue, 9 Apr 2002 19:17:44 -0400
From: Richard A Steenbergen <ras@e-gerbil.net>
[ snip beginning ]
Actually, here's an even simpler one. Define a global limit for this; something like 32MB would be more than reasonable. Then instead of advertising the space "remaining" in individual
My static buffer presumed that one would regularly see line rate; that's probably an invalid assumption.
socket buffers, advertise the total space remaining in this
Why bother advertising space remaining? Simply take the total space -- which is tuned to line rate -- and divide equitably. Equal division is the primitive way. Monitoring actual buffer use, a la PSC window-tuning code, is more efficient.

To respect memory, sure, you could impose a global limit and alloc as needed. But on a "busy enough" server/client, how much would that save? Perhaps one could allocate 8MB chunks at a time... but fragmentation could prevent the ability to have a contiguous 32MB in the future. (Yes, I'm assuming high memory usage and simplistic paging. But I think that's plausible.)

Honestly... memory is so plentiful these days that I'd gladly devote "line rate"-sized buffers to the cause on each and every server that I run.
virtual memory pool. If you overrun your buffer, you might have the other side send you a few unnecessary bytes that you just have to drop, but the situation should correct itself very
By allocating 32MB, one stream could achieve line rate with no wasted space (assuming latency is exactly what we predict, which we all know won't happen). When another stream or two are opened, we split the buffer into four. Maybe we drop, like you suggest, in a RED-like manner. Maybe we flush the queue if it's not "too full". Now we have up to four streams, each with an 8MB queue.

Need more streams? Fine, split { one | some | all } of the 8MB windows into 2MB segments. Simple enough, until we hit variable bw*delay products... then we should use intelligence when splitting, probably via mechanisms similar to the PSC stack.

Granularity of 4 is for example only. I know that would be non-ideal. One could split 32 MB into 6.0 MB + 7.0 MB + 8.5 MB + 10.5 MB, which would then be halved as needed. Long-running sessions could be moved between buffer clumps as needed. (I.e., if 1.5 MB is too small and 2.0 MB is too large, 1.75 MB fits nicely into the 7.0 MB area.)
quickly. I don't think this would be "unfair" to any particular flow, since you've eliminated the concept of one flow "hogging" the socket buffer and leave it to TCP to work out the sharing of the link. Second opinions?
Smells to me like ALTQ's TBR (token bucket regulator). Perhaps also have a dynamically-allocated "tuning" buffer: imagine 2000 dialups and 10 DSL connections transferring over a DS3... use a single "big enough" buffer (or a few buffers?) to sniff out each stream's capability, to determine which stream can use how much more space.

--
Eddy
Brotsman & Dreger, Inc. - EverQuick Internet Division
On Wed, Apr 10, 2002 at 12:22:57AM +0000, E.B. Dreger wrote:
My static buffer presumed that one would regularly see line rate; that's probably an invalid assumption.
Indeed. But thats why it's not an actual allocation.
Why bother advertising space remaining? Simply take the total space -- which is tuned to line rate -- and divide equitably. Equal division is the primitive way. Monitoring actual buffer use, a la PSC window-tuning code, is more efficient.
Because then you haven't accomplished your goal. If you have 32MB of buffer memory available, and you open 32 connections and share it equally at 1MB each, you could have one connection that is doing no bandwidth and one connection that wants to scale to more than 1MB of packets in flight. Then you have to start scanning all your connections on a periodic basis, adjusting the socket buffers to reflect the actual congestion window, a la PSC. My suggestion was to cut out all that nonsense by simply removing the receive window limits altogether. Actually, you could accomplish this goal by just advertising the maximum possible window size and relying on packet drops to shrink the congestion window on the sending side as necessary, but this would be slightly less efficient in the case of a sender overrunning the receiver.

But alas, we're both forgetting the sender side, which controls how quickly data moves from userland into the kernel. This part must be set by looking at the sending congestion window.

And I thought of another problem as well. If you had a receiver which made a connection, requested as much data as possible, and then never did a read() on the socket buffer, all the data would pile up in the kernel and consume the total buffer space for the entire system.
To respect memory, sure, you could impose a global limit and alloc as needed. But on a "busy enough" server/client, how much would that save? Perhaps one could allocate 8MB chunks at a time... but fragmentation could prevent the ability to have a contiguous 32MB in the future. (Yes, I'm assuming high memory usage and simplistic paging. But I think that's plausible.)
You're missing the point: you don't allocate ANYTHING until you have a packet to fill that buffer, and then when you're done buffering it, it is freed. The limits are just there to prevent you from running away with a socket buffer.

--
Richard A Steenbergen <ras@e-gerbil.net>
Date: Tue, 9 Apr 2002 20:39:34 -0400
From: Richard A Steenbergen <ras@e-gerbil.net>
My suggestion was to cut out all that nonsense by simply removing the receive window limits altogether. Actually, you could accomplish this goal by just advertising the maximum possible window size and relying on packet drops to shrink the congestion window on the sending side as necessary, but this would be slightly less efficient in the case of a sender overrunning the receiver.
But alas, we're both forgetting the sender side, which controls how quickly data moves from userland into the kernel. This part must be set by looking at the sending congestion window. And I thought of another problem as
Actually, I was thinking more in terms of sending than receiving. Yes, your approach sounds quite slick for the RECV side, and I see your point. But WND info will be negotiated for sending... so why not base it on "splitting the total pie" instead of "arbitrary maximum"?
well. If you had a receiver which made a connection, requested as much data as possible, and then never did a read() on the socket buffer, all the data would pile up in the kernel and consume the total buffer space for the entire system.
Unless, again, there's some sort of limit. 32 MB total, 512 connections, each socket gets 64 kB until it proves its worth. Sockets don't get to play the RED-ish game until they _prove_ that they're serious about sucking down data. Once a socket proves its intentions (and periodically after that), it gets to use a BIG buffer, so we find out just how fast the connection can go.
You're missing the point: you don't allocate ANYTHING until you have a packet to fill that buffer, and then when you're done buffering it, it is freed. The limits are just there to prevent you from running away with a socket buffer.
No, I understand your point perfectly, and that's how it's currently done. But why even bother with constant malloc(9)/free(9) when the overall buffer size remains reasonably constant? I.e., the kernel's allocation to the IP stack changes slowly, if at all; the IP stack's allocation to individual streams changes regularly.

--
Eddy
Brotsman & Dreger, Inc. - EverQuick Internet Division
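A minimal sketch of that allocate-once, repartition-often idea, assuming an even split for brevity (a real stack would split unevenly, as discussed earlier; everything here is hypothetical):

    /* One up-front region; per-stream slices move, and nothing is
     * malloc()'d or free()'d per adjustment. Illustrative only. */
    #include <stdio.h>
    #include <stddef.h>

    #define POOL_BYTES  (32u * 1024u * 1024u)
    #define MAX_STREAMS 64u

    static unsigned char pool[POOL_BYTES];  /* kernel-to-stack, once */

    static unsigned char *base[MAX_STREAMS];
    static size_t         len[MAX_STREAMS];

    /* Stack-to-stream assignment; changes regularly, costs no malloc. */
    static void repartition(unsigned nstreams)
    {
        size_t slice = POOL_BYTES / nstreams;
        for (unsigned i = 0; i < nstreams; i++) {
            base[i] = pool + i * slice;
            len[i]  = slice;
        }
    }

    int main(void)
    {
        repartition(4);
        printf(" 4 streams: %zu bytes each\n", len[0]);
        repartition(16);
        printf("16 streams: %zu bytes each\n", len[0]);
        return 0;
    }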
On Wed, Apr 10, 2002 at 12:57:19AM +0000, E.B. Dreger wrote:
Unless, again, there's some sort of limit. 32 MB total, 512 connections, each socket gets 64 kB until it proves its worth. Sockets don't get to play the RED-ish game until they _prove_ that they're serious about sucking down data.
Once a socket proves its intentions (and periodically after that), it gets to use a BIG buffer, so we find out just how fast the connection can go.
That doesn't prevent an intentional local DoS though.

--
Richard A Steenbergen <ras@e-gerbil.net>
Date: Tue, 9 Apr 2002 21:12:30 -0400
From: Richard A Steenbergen <ras@e-gerbil.net>
That doesn't prevent an intentional local DoS though.
And the current stacks do? (Note that my 64 kB figure was an example, for an example system that had 512 current connections.)

Okay, how about this: new sockets split "excess" buffer space, subject to certain minimum size restrictions. New sockets do not impact established streams, unless we have way too many sockets or too little buffer space. If there are way too many sockets, it's just like current stacks, although hopefully ulimit would prevent this scenario. If we're out of buffer space, then we're going to have even more problems when the sockets are actually passing data.

Yes, I'm still thinking about carving up a 32 MB chunk of RAM, shrinking window sizes when we need more buffers. Of course, we probably should consider AIO, too... if we can have buffers in userspace instead of copying from kernel to user via read(), that makes memory issues a bit more pleasant.

--
Eddy
Brotsman & Dreger, Inc. - EverQuick Internet Division
Rough attempt at processing rules:

1. If there's "enough" buffer space, let each stream have its fill. DONE.
2. There's not "enough" buffer space, so we must invoke limits.
3. If it's a new connection, impose a low limit until the socket proves its intentions... much like not allocating an entire socket struct until the TCP handshake is complete, or TCP slow start. DONE.
4. It's an existing connection.
5. Does it act like it could use a smaller window? If so, shrink the window. DONE.
6. The stream might be able to use a larger window.
7. Is it "tuning time" for this stream, according to round robin or random robin? If so, use a BIG buffer for a few packets, measuring the stream's desires.
8. Does the stream want more buffer space? If not, DONE.
9. Is it fair to other streams to adjust the window? If not, DONE.
10. Adjust appropriately.

I guess this shoots my "split into friendly fractions" approach out of the water... and we're back to "standard" autotuning (for sending) once we enforce a minimum buffer size. Major differences:

+ We're saying to approach memory usage macroscopically instead of microscopically, i.e., per system instead of per stream.
+ We're removing upper bounds when bandwidth is plentiful.
+ Receive like you suggested, save for the "low memory" start phase.

--
Eddy
Brotsman & Dreger, Inc. - EverQuick Internet Division
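A minimal sketch of those rules collapsed into one decision function, with the measurements stubbed out as flags (all names and predicates are hypothetical):

    /* The ten rules above, reduced to one verdict per stream. */
    #include <stdio.h>

    struct stream {            /* measurements a real stack would gather */
        int is_new;            /* handshake just completed?              */
        int underuses_window;  /* never fills what it has?               */
        int tuning_turn;       /* round/random robin says "now"?         */
        int wants_more;        /* filled the BIG probe buffer?           */
        int fair_to_grow;      /* growth won't starve other streams?     */
    };

    enum verdict { KEEP, SHRINK, GROW, PROBATION };

    static enum verdict tune(int pool_has_room, const struct stream *s)
    {
        if (pool_has_room)       return KEEP;       /* rules 1-2 */
        if (s->is_new)           return PROBATION;  /* rule  3   */
        if (s->underuses_window) return SHRINK;     /* rules 4-5 */
        if (!s->tuning_turn)     return KEEP;       /* rules 6-7 */
        if (!s->wants_more)      return KEEP;       /* rule  8   */
        if (!s->fair_to_grow)    return KEEP;       /* rule  9   */
        return GROW;                                /* rule 10   */
    }

    int main(void)
    {
        struct stream s = { 0, 0, 1, 1, 1 };  /* busy, fair, its turn */
        printf("verdict: %d (2 == GROW)\n", tune(0, &s));
        return 0;
    }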
To transfer 1Gb/s across 100ms I need to be prepared to buffer at least 25MB of data. According to pricewatch, I can pick up a high density 512MB
Why?

I am still waiting (after many years) for anyone to explain to me the issue of buffering. It appears to be completely unnecessary in a router.

Everyone seems to answer me with 'bandwidth x delay product' and similar, but think about IP routeing. The intermediate points are not doing any form of per-packet ack etc. and so do not need to have large windows of data etc.

I can understand the need in end-points and networks (like X.25) that do per-hop clever things...

Will someone please point me to references that actually demonstrate why an IP router needs big buffers (as opposed to lots of 'downstream' ports)?

Peter
Peter,

For basic Internet-style routeing you are probably correct [and possibly even more true for MPLS-style switching/routeing], but these days customers demand different classes of service and managed data and bandwidth services over IP. These require lots of packet hacking, and for that you need buffers. If you need any type of filtering done that's remotely intelligent, you need somewhere to process those packets, especially when you are running 10G interfaces.

Regards,
Neil.

--
Neil J. McRae - Alive and Kicking
neil@DOMINO.ORG
On Wed, 10 Apr 2002 15:48:36 +0100 "Peter Galbavy" <peter.galbavy@knowtion.net> wrote:
I am still waiting (after many years) for anyone to explain to me the issue of buffering. It appears to be completely unnecessary in a router.
OK, what am I missing? Unless I'm misunderstanding your question, this seems relatively simple, and the need for buffers in routers is actually quite obvious.

Imagine a router with more than 2 interfaces, each interface being of the same speed. Packets arrive on 2 or more interfaces and each needs to be forwarded onto the same outbound interface. Imagine the packets arrive at exactly or roughly the same time. Since the bits are going out serially, you're gonna need to buffer packets one behind the others on the egress interface. Similar scenarios occur when the egress interface capacity is less than the rate, or aggregate rate, wanting to exit via that interface.

John
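A minimal sketch of that arithmetic, with invented numbers: two line-rate ingress ports converging on one equal-speed egress port for a 1 ms burst:

    /* Backlog a router must buffer (or drop) when ingress exceeds
     * egress. Numbers are illustrative. */
    #include <stdio.h>

    int main(void)
    {
        double port_bps = 1e9;   /* every port 1 Gb/s            */
        int    ingress  = 2;     /* two ports feeding one egress */
        double burst_s  = 0.001; /* simultaneous 1 ms burst      */

        double backlog_bits = (ingress * port_bps - port_bps) * burst_s;
        printf("backlog: %.0f bits (%.1f kB) -- buffer it or drop it\n",
               backlog_bits, backlog_bits / 8.0 / 1e3);
        return 0;
    }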
On Wed, Apr 10, 2002 at 03:48:36PM +0100, Peter Galbavy wrote:
To transfer 1Gb/s across 100ms I need to be prepared to buffer at least 25MB of data. According to pricewatch, I can pick up a high density 512MB
Why?

I am still waiting (after many years) for anyone to explain to me the issue of buffering. It appears to be completely unnecessary in a router.
Note that the previous example was about end-to-end systems achieving line rate across a continent; nothing about routers was mentioned.

--
Richard A Steenbergen <ras@e-gerbil.net>
At 03:48 PM 4/10/2002 +0100, Peter Galbavy wrote:
Why?

I am still waiting (after many years) for anyone to explain to me the issue of buffering. It appears to be completely unnecessary in a router.
Well, that's some challenge, but I'll have a go :-/

As far as I can tell, the use of buffering has to do with traffic shaping vs. rate limiting. If you have a buffer on the interface, you are doing traffic shaping -- whether or not your vendor calls it that. That's because when the rate at which traffic arrives at the queue exceeds the rate at which it leaves the queue, the packets get buffered for transmission some time later. In effect, the queue buffers traffic bursts and then spreads transmission of the buffered packets over time.

If you have no queue, or a very small queue (relative to the rate x average packet size), and the arrival rate exceeds the transmission rate, you can't buffer the packet to transmit later, and so you simply drop it. This is rate limiting.

That's my theory, but what's the effect? I have seen the difference in effect on a real network running IP over ATM. The ATM core at this large European service provider was running equipment from "Vendor N". N's ATM access switches have very small cell buffers -- practically none, in fact. When we connected routers from "Vendor C" -- which didn't have much buffering on their ATM interfaces -- to this core, users saw very poor e-mail and HTTP throughput. We discovered that this was happening because during bursts of traffic there were long trains of sequential packet loss -- including many TCP ACKs. This caused the TCP senders to rapidly back off their transmit windows. That, and the packet loss itself, was the major cause of poor throughput. Although we didn't figure this out until much later, a side effect of the sequential packet loss (i.e., no drop policy) was to synchronize all of the TCP senders -- i.e., the "burstiness" of the traffic got worse, because now all of the TCP senders were trying to increase their send windows at the same time.

To fix the problem, we replaced the ATM interface cards on the routers -- it turns out Vendor C has an ATM interface with lots of buffering, a configurable drop policy (we used WRED), and a cell-level traffic shaper, presumably to address this very issue. The users saw much improved e-mail and web performance and everyone was happy, except for the owner of the routers, who wanted to know why they had to buy the more expensive ATM card (i.e., why the ATM core people couldn't put more buffering on their ATM access ports).

Hope this helps,

Mathew
Everyone seems to answer me with 'bandwidth x delay product' and similar, but think about IP routeing. The intermediate points are not doing any form of per-packet ack etc. and so do not need to have large windows of data etc.
I can understand the need in end-points and networks (like X.25) that do per-hop clever things...
Will someone please point me to references that actually demonstrate why an IP router needs big buffers (as opposed to lots of 'downstream' ports)?
Peter
| Mathew Lodge                 | mathew@cplane.com     |
| Director, Product Management | Ph: +1 408 789 4068   |
| CPLANE, Inc.                 | http://www.cplane.com |
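A minimal toy model of the shaping-vs-policing distinction above: identical bursty arrivals and the same line rate, with queue depth the only difference (all numbers invented):

    /* Deep queue = shaper (delay the burst); tiny queue = policer
     * (drop the burst). One packet departs per tick. Illustrative. */
    #include <stdio.h>

    #define QLEN_SHAPER  64  /* deep buffer */
    #define QLEN_POLICER  1  /* ~no buffer  */

    static int run(int qlen)
    {
        int queued = 0, dropped = 0;
        for (int t = 0; t < 100; t++) {
            int arrivals = (t % 10 == 0) ? 8 : 0;   /* bursty source */
            for (int i = 0; i < arrivals; i++) {
                if (queued < qlen) queued++;        /* buffer it...  */
                else               dropped++;       /* ...or drop it */
            }
            if (queued > 0) queued--;               /* one departure/tick */
        }
        return dropped;
    }

    int main(void)
    {
        printf("shaper (deep queue) drops:  %d\n", run(QLEN_SHAPER));
        printf("policer (tiny queue) drops: %d\n", run(QLEN_POLICER));
        return 0;
    }

The deep queue delivers everything, just later; the shallow queue delivers on time or not at all -- the long trains of loss described above.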
Thus spake "Mathew Lodge" <mathew@cplane.com>
At 03:48 PM 4/10/2002 +0100, Peter Galbavy wrote:
Why?

I am still waiting (after many years) for anyone to explain to me the issue of buffering. It appears to be completely unnecessary in a router.
Well, that's some challenge but I'll have a go :-/
As far as I can tell, the use of buffering has to do with traffic shaping vs. rate limiting. If you have a buffer on the interface, you are doing traffic shaping -- whether or not your vendor calls it that. ... If you have no queue or a very small queue ... This is rate limiting.
Well, that's implicit shaping/policing if you wish to call it that. It's only common to use those terms with explicit shaping/policing, i.e. when you need to shape/police at something other than line rate.
except for the owner of the routers, who wanted to know why they had to buy the more expensive ATM card (i.e., why the ATM core people couldn't put more buffering on their ATM access ports).
The answer here lies in ATM switches being designed primarily for carriers (and by people with a carrier mindset). Carriers, by and large, do not want to carry unfunded traffic across their networks and then be forced to buffer it; it's much easier (and cheaper) to police at ingress and buffer nothing. It would have been nice to see a parallel line of switches (or cards) with more buffers. However, anyone wise enough to buy those was wise enough to ditch ATM altogether :) S
Thus spake "Peter Galbavy" <peter.galbavy@knowtion.net>
Why?

I am still waiting (after many years) for anyone to explain to me the issue of buffering. It appears to be completely unnecessary in a router.
Routers are not non-blocking devices. When an output port is blocked, packets going to that port must be either buffered or dropped. While it's obviously possible to drop them, like ATM/FR carriers do, ISPs have found they have much happier customers when they do a reasonable amount of buffering. S
To transfer 1Gb/s across 100ms I need to be prepared to buffer at least 25MB of data. According to pricewatch, I can pick up a high density 512MB
Why?

I am still waiting (after many years) for anyone to explain to me the issue of buffering. It appears to be completely unnecessary in a router.

Everyone seems to answer me with 'bandwidth x delay product' and similar, but think about IP routeing. The intermediate points are not doing any form of per-packet ack etc. and so do not need to have large windows of data etc.
I can understand the need in end-points and networks (like X.25) that do per-hop clever things...
Will someone please point me to references that actually demonstrate why an IP router needs big buffers (as opposed to lots of 'downstream' ports)?
Sure, see the original Van Jacobson-Mike Karels paper "Congestion Avoidance and Control", at http://www-nrg.ee.lbl.gov/papers/congavoid.pdf. Briefly, TCP end systems start pumping packets into the path until they've gotten about RTT*BW worth of packets "in the pipe". Ideally these packets are somewhat evenly spaced out, but in practice, in various circumstances, they can get clumped together at a bottleneck link. If the bottleneck link router can't handle the burst then some get dumped.

-- Jim
In message <00ae01c1e125$ba6b5380$dc9247ab@amer.cisco.com>, "Jim Forster" writes:
Sure, see the original Van Jacobson-Mike Karels paper "Congestion Avoidance and Control", at http://www-nrg.ee.lbl.gov/papers/congavoid.pdf. Briefly, TCP end systems start pumping packets into the path until they've gotten about RTT*BW worth of packets "in the pipe". Ideally these packets are somewhat evenly spaced out, but in practice in various circumtances they can get clumped together at a bottleneck link. If the bottleneck link router can't handle the burst then some get dumped.
Actually, it is even stronger than that -- in a perfect world (without jitter, etc.), the packets *will* get clumped together at the bottleneck link. The reason is that for every ack, TCP is pumping out two back-to-back packets -- but the acks are coming back at approximately the spacing at which full-sized data packets get through the bottleneck link... So you're sending two segments (or 1.5 if you ack every other segment) in the time the bottleneck can only handle one.

[Side note: this works because during slow start you're not sending during the entire RTT -- you're sending bursts at the start of the RTT, and each round of slow start fills more of the RTT.]

Craig
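A minimal sketch of that ack clock: each ack releases two segments while the bottleneck forwards only one per ack interval, so the bottleneck queue grows by one segment per ack (illustrative only):

    /* Slow-start clumping at the bottleneck, per the description above. */
    #include <stdio.h>

    int main(void)
    {
        int queue = 0;
        for (int ack = 1; ack <= 8; ack++) {
            queue += 2;  /* sender: two back-to-back segments per ack */
            queue -= 1;  /* bottleneck: one segment per ack interval  */
            printf("ack %d: %d segment(s) queued\n", ack, queue);
        }
        return 0;
    }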
participants (10)

- Craig Partridge
- E.B. Dreger
- Jim Forster
- John Kristoff
- Mathew Lodge
- neil@DOMINO.ORG
- Paul Vixie
- Peter Galbavy
- Richard A Steenbergen
- Stephen Sprunk