"Does TCP Need an Overhaul?" (internetevolution, via slashdot)
in <http://www.internetevolution.com/author.asp?section_id=499&doc_id=150113> larry roberts says:

    ... last year a new alternative to using output queues, called "flow
    management" was introduced. This concept finally solves the TCP unfairness
    problem and leads to my answer: Fix the network, not TCP. ... What is
    really necessary is to detect just the flows that need to slow down, and
    selectively discard just one packet at the right time, but not more, per
    TCP cycle. Discarding too many will cause a flow to stall -- we see this
    when Web access takes forever. Flow management requires keeping
    information on each active flow, which currently is inexpensive and
    allows us to build an intelligent process that can precisely control the
    rate of every flow as needed to insure no overloads. Thus, there are now
    two options for network equipment:

    o Random discards from output queues -- creates much TCP unfairness
    o Intelligent rate control of every flow -- eliminates most TCP unfairness
    ...

i wouldn't want to get in an argument with somebody who was smart and savvy enough to invent packet switching during the year i entered kindergarden, but, somebody told me once that keeping information on every flow was *not* "inexpensive." should somebody tell dr. roberts?

(i'd hate to think that everybody would have to buy roberts' (anagran's) Fast Flow Technology at every node of their network to make this work. that doesn't sound "inexpensive" to me.
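To make the "per-flow state" dispute concrete, here is a minimal Python sketch of the kind of bookkeeping the article describes: track each active flow, and when the outgoing link is overloaded, deliberately drop a single packet from the heaviest flow at most once per RTT. Every name and threshold here is hypothetical; this is not Anagran's design, just an illustration of what has to be stored per flow.

    import time
    from collections import defaultdict

    class FlowManager:
        """Toy per-flow accounting with at most one deliberate drop per flow per RTT."""

        def __init__(self, link_capacity_bps, rtt_estimate_s=0.1):
            self.link_capacity_bps = link_capacity_bps
            self.rtt = rtt_estimate_s           # crude stand-in for a per-flow RTT estimate
            self.bytes_seen = defaultdict(int)  # flow id -> bytes observed recently
            self.last_drop = {}                 # flow id -> time of last deliberate drop

        def admit(self, flow_id, packet_len, link_load_bps):
            """Return True to forward the packet, False to discard it."""
            self.bytes_seen[flow_id] += packet_len
            if link_load_bps <= self.link_capacity_bps:
                return True                     # no overload: never discard
            heaviest = max(self.bytes_seen, key=self.bytes_seen.get)
            now = time.monotonic()
            dropped_recently = now - self.last_drop.get(flow_id, 0.0) < self.rtt
            if flow_id == heaviest and not dropped_recently:
                self.last_drop[flow_id] = now   # "just one packet ... per TCP cycle"
                return False
            return True

The contested part of the thread that follows is not this drop logic but the two per-flow tables: at a few million concurrent flows per device, at line rate, that state has to live somewhere fast.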
On Fri, Apr 4, 2008 at 9:51 PM, Paul Vixie <paul@vix.com> wrote:
(i'd hate to think that everybody would have to buy roberts' (anagran's) Fast Flow Technology at every node of their network to make this work. that doesn't sound "inexpensive" to me.
I suppose he could try to sell it... and people with larger networks could see if keeping state on a few million active flows per device is 'expensive' or 'inexpensive'. Perhaps it's less expensive than it seems it would be.

Oh, will this be in linecard RAM? main-cpu-RAM? calculated on ASIC/port or overall for the whole box/system? How about deconflicting overlapping ip-space (darn that mpls!!!) what about asymmetric flows?

I had thought the flow-routing thing was a dead end subject long ago?

-Chris
On Sat, 5 Apr 2008 01:02:24 -0400 "Christopher Morrow" <morrowc.lists@gmail.com> wrote:
On Fri, Apr 4, 2008 at 9:51 PM, Paul Vixie <paul@vix.com> wrote:
(i'd hate to think that everybody would have to buy roberts' (anagran's) Fast Flow Technology at every node of their network to make this work. that doesn't sound "inexpensive" to me.
I suppose he could try to sell it... and people with larger networks could see if keeping state on a few million active flows per device is 'expensive' or 'inexpensive'. Perhaps it's less expensive than it seems it would be.
Oh, will this be in linecard RAM? main-cpu-RAM? calculated on ASIC/port or overall for the whole box/system? How about deconflicting overlapping ip-space (darn that mpls!!!) what about asymmetric flows?
I had thought the flow-routing thing was a dead end subject long ago?
And you can't get high speeds with Ethernet; you get too many collisions. Besides, it doesn't have any fairness properties. Clearly, you need token ring. Oh yeah, they fixed those.

I have no idea if it's economically feasible or not -- technology and costs change, and just because something wasn't possible 5 years ago doesn't mean it isn't possible today. It does strike me that any such scheme would be implemented on access routers, not backbone routers, for lots of good and sufficient reasons. That alone makes it more feasible. I also note that many people are using NetFlow, which shares some of the same properties as this scheme.

As for the need -- well, it does deal with the BitTorrent problem, assuming that that is indeed a problem.

Bottom line: I have no idea if it makes any economic sense, but I'm not willing to rule it out without analysis.

--Steve Bellovin, http://www.cs.columbia.edu/~smb
Paul Vixie wrote: [..]
i wouldn't want to get in an argument with somebody who was smart and savvy enough to invent packet switching during the year i entered kindergarden, but, somebody told me once that keeping information on every flow was *not* "inexpensive." should somebody tell dr. roberts?
Isn't the reason that "NetFlow" (or v10, which is the IETF/Cisco-named IPFIX) exists a side-effect of having routers doing "flow based routing", aka "keeping an entry per IP flow, thus using that entry for every next packet to quickly select the outgoing interface instead of having to go through all the prefixes"?

The flows are in those boxes, but only for stats purposes exported with NetFlow/IPFIX/sFlow/etc. Apparently it was not as fast as they liked it to be and there were other issues.

Thus what exactly is new here in his boxes that has not been tried and failed before?

Greets,
Jeroen
On Apr 5, 2008, at 5:16 PM, Jeroen Massar wrote:
The flows are in those boxes, but only for stats purposes exported with NetFlow/IPFIX/sFlow/etc. Apparently it was not as fast as they liked it to be
This is essentially correct. NetFlow was originally intended as a switching mechanism, but then it became apparent that the information in the cache and the ability to export it as telemetry were of more value, as there were other, more efficient methods of moving the packets around.

-----------------------------------------------------------------------
Roland Dobbins <rdobbins@cisco.com> // +66.83.266.6344 mobile

History is a great teacher, but it also lies with impunity. -- John Robb
i wouldn't want to get in an argument with somebody who was smart and savvy enough to invent packet switching during the year i entered kindergarden, but, somebody told me once that keeping information on every flow was *not* "inexpensive." should somebody tell dr. roberts?
Isn't the reason that "NetFlow" (or v10, which is the IETF/Cisco-named IPFIX) exists a side-effect of having routers doing "flow based routing", aka "keeping an entry per IP flow, thus using that entry for every next packet to quickly select the outgoing interface instead of having to go through all the prefixes"?
flow-cache based forwarding can work perfectly fine provided:

- your flow table didn't overflow
- you didn't need to invalidate large amounts of your table at once (e.g. next-hop change)

this was the primary reason why Cisco went from 'fast-switch' to 'CEF' which uses a FIB. the problem back then was that when you had large amounts of invalidated flow-cache entries due to a next-hop change, typically that next-hop change was caused by something in the routing table - and then you had a problem because you wanted to use all your router CPU to recalculate the next-best paths yet you couldn't take advantage of any shortcut information, so you were dropping to a 'slow path' of forwarding.

for a long long time now, Cisco platforms with netflow have primarily had netflow as an _accounting_ mechanism and generally not as the primary forwarding path. some platforms (e.g. cat6k) have retained a flow-cache that CAN be used to influence forwarding decisions, and that has been the basis for how things like NAT can be done in hardware (where per-flow state is necessary), but the primary forwarding mechanism even on that platform has been CEF in hardware since Supervisor-2 came out.

no comment on the merits of the approach by Larry, anything i'd say would be through rose-coloured glasses anyway.

cheers,

lincoln.
(work:ltd@cisco.com)
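For those who never ran fast-switching, a rough Python sketch of the distinction described above: a flow cache memoises the forwarding decision per 5-tuple and has to be flushed when next-hops change, while a CEF-style FIB does a longest-prefix match for every packet and only needs the affected prefixes updated. Purely illustrative; it bears no resemblance to actual IOS internals, and all names here are made up.

    import ipaddress

    class ToyRouter:
        def __init__(self, fib):
            # fib: list of (prefix, next_hop), e.g. [("10.0.0.0/8", "ge-0/0/1"), ...]
            self.fib = [(ipaddress.ip_network(p), nh) for p, nh in fib]
            self.flow_cache = {}  # (src, dst, proto, sport, dport) -> next_hop

        def fib_lookup(self, dst):
            """CEF-style path: longest-prefix match against the FIB for every packet."""
            addr = ipaddress.ip_address(dst)
            matches = [(net, nh) for net, nh in self.fib if addr in net]
            return max(matches, key=lambda m: m[0].prefixlen)[1] if matches else None

        def fast_switch(self, five_tuple):
            """Flow-cache path: the first packet of a flow pays for the lookup, later packets hit the cache."""
            if five_tuple not in self.flow_cache:
                self.flow_cache[five_tuple] = self.fib_lookup(five_tuple[1])
            return self.flow_cache[five_tuple]

        def on_next_hop_change(self):
            # The failure mode described above: a routing change invalidates the whole
            # cache exactly when the CPU is busy recomputing best paths.
            self.flow_cache.clear()

    r = ToyRouter([("10.0.0.0/8", "ge-0/0/1"), ("10.1.0.0/16", "ge-0/0/2"), ("0.0.0.0/0", "upstream")])
    print(r.fast_switch(("192.0.2.1", "10.1.2.3", 6, 12345, 80)))  # "ge-0/0/2" via longest match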
The flows are in those boxes, but only for stats purposes exported with NetFlow/IPFIX/sFlow/etc. Apparently it was not as fast as they liked it to be and there were other issues. Thus what exactly is new here in his boxes that has not been tried and failed before?
Roberts is selling a product to put in at the edge of your WAN to solve packet loss problems in the core network. Since most ISPs don't have packet loss problems in the core, but most enterprise networks *DO* have problems in the core, I think that Roberts is selling a box with less magic, and more science behind what it does.

People seem to be assuming that Roberts is trying to put these boxes in the biggest ISP networks which does not appear to be the case. I expect that he is smart enough to realize that there are too many flows in such networks.

On the other hand, Enterprise WANs are small enough to feasibly implement flow discards, yet big enough to be pushing the envelope of the skill levels of enterprise networking people. In addition Enterprise WANs can live with the curse of QoS, i.e. that you have to punish some network users when you implement QoS. In the Enterprise, this is acceptable because not all users have the same cost/benefit. If flow switching lets them sweat their network assets harder, then they will be happy.

--Michael Dillon
On Apr 4, 2008, at 8:51 PM, Paul Vixie wrote:
What is really necessary is to detect just the flows that need to slow down, and selectively discard just one packet at the right time, but not more, per TCP cycle. Discarding too many will cause a flow to stall -- we see this when Web access takes forever. ...
i wouldn't want to get in an argument with somebody who was smart and savvy enough to invent packet switching during the year i entered kindergarden, but, somebody told me once that keeping information on every flow was *not* "inexpensive." should somebody tell dr. roberts?
I don't claim to understand this area more than Dr. Roberts either, but to paraphrase George Santayana: "Those who do not understand SACK are condemned to re-implement it." (or fight it)

Years ago I worked on a project that was supposed to allow significant over-subscription of bandwidth to unmodified clients/servers by tracking the full state of all TCP traffic going through it. When the output interface started filling up, we'd do tricks like delaying the packets and selectively dropping packets "fairly" so that any one flow wasn't penalized too harshly unless it was monopolizing the bandwidth.

To sum up many months of work that went nowhere: As long as you didn't drop more packets than SACK could handle (generally 2 packets in-flight) dropping packets is pretty ineffective at causing TCP to slow down. As long as the packets are making it there quickly, and SACK retransmits happen fast enough to keep the window full... You aren't slowing TCP down much. Of course if you're intercepting TCP packets you could disable SACK, but that strikes me as worse off than doing nothing.

If you are dropping enough packets to stall SACK, you're dropping a lot of packets. With a somewhat standard 32K window and 1500 byte packets, to lose 3 non-contiguous packets inside a 32K window you're talking about 13% packet loss within one window. I would be very annoyed if I were on a connection that did this regularly. You get very little granularity when it comes to influencing a SACK flow - too little loss and SACK handles it without skipping a beat. Too much loss and you're severely affecting the connection.

You've also got fast retransmit, New Reno, BIC/CUBIC, as well as host parameter caching to limit the effect of packet loss on recovery time. I don't doubt that someone else could do a better job than I did in this field, but I'd be really curious to know how much of an effect an intermediary router can have on a TCP flow with SACK that doesn't cause more packet loss than anyone would put up with for interactive sessions.

The biggest thing we learned was that end user perceived speed is something completely different from flow speed. Prioritizing UDP/53 and TCP setup packets had a bigger impact than anything else we did, from a user's perspective. If either of those got delayed/dropped, pages would appear to stall while loading, and the delay between a click and visible results could greatly increase. This annoyed users far more than a slow download.

Mark UDP/53 and tcpflags(syn) packets as high priority. If you wanted to get really fancy and can track TCP state, prioritize the first 2k of client->server and server->client of HTTP to allow the request and reply headers to pass uninterrupted. Those made our client happier than anything else we did, at far far less cost.

-- Kevin
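The ~13% figure above is just segment arithmetic: a 32K window holds about 22 full-sized segments, so three non-contiguous losses inside one window is roughly a 13-14% loss rate. A quick back-of-the-envelope check in Python, using the numbers from the message above:

    WINDOW_BYTES = 32 * 1024  # the "somewhat standard 32K window" from the example
    MSS_BYTES = 1460          # payload of a 1500-byte packet after IP and TCP headers
    LOSSES = 3                # non-contiguous losses assumed needed to stall SACK recovery here

    segments_per_window = WINDOW_BYTES // MSS_BYTES     # 22 segments
    loss_within_window = LOSSES / segments_per_window   # ~0.136
    print(f"{segments_per_window} segments per window -> {loss_within_window:.1%} loss in one window")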
You've also got fast retransmit, New Reno, BIC/CUBIC, as well as host parameter caching to limit the effect of packet loss on recovery time. I don't doubt that someone else could do a better job than I did in this field, but I'd be really curious to know how much of an effect an intermediary router can have on a TCP flow with SACK that doesn't cause more packet loss than anyone would put up with for interactive sessions.
my takeaway from the web site was that one of the ways p2p is bad is that it tends to start several parallel tcp sessions from the same client (i guess think of bittorrent where you're getting parts of the file from several folks at once). since each one has its own state machine, each will try to sense the end to end bandwidth-delay product. thus, on headroom-free links, each will get 1/Nth of that link's bandwidth, which could be (M>1)/Nth aggregate, and apparently this is unfair to the other users depending on that link.

i guess i can see the point, if i squint just right. nobody wants to get blown off the channel because someone else gamed the fairness mechanisms. (on the other hand some tcp stacks are deliberately overaggressive in ways that don't require M>1 connections to get (M>1)/Nth of a link's bandwidth. on the internet, generally speaking, if someone else says fairness be damned, then fairness will be damned.

however, i'm not sure that all TCP sessions having one endpoint in common or even all those having both endpoints in common ought to share fate. one of those endpoints might be a NAT box with M>1 users behind it, for example.

in answer to your question about SACK, it looks like they simulate a slower link speed for all TCP sessions that they guess are in the same flow-bundle. thus, all sessions in that flow-bundle see a single shared contributed bandwidth-delay product from any link served by one of their boxes.
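The unfairness described above is easy to quantify under the idealised assumption that TCP splits a bottleneck evenly per flow: a host (or a NAT box fronting many users) that opens M connections simply collects M of the N equal shares. A toy calculation in Python, with made-up numbers:

    def host_share(my_flows, other_flows, link_bps):
        """Bandwidth one host captures if every flow converges on an equal 1/N share."""
        return link_bps * my_flows / (my_flows + other_flows)

    # One client running 20 parallel transfers vs. nine single-flow users on a 10 Mb/s link:
    print(host_share(20, 9, 10_000_000))  # ~6.9 Mb/s for the multi-flow host
    print(host_share(1, 28, 10_000_000))  # ~0.34 Mb/s for each of the others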
On Apr 5, 2008, at 7:49 AM, Paul Vixie wrote:
You've also got fast retransmit, New Reno, BIC/CUBIC, as well as host parameter caching to limit the effect of packet loss on recovery time. I don't doubt that someone else could do a better job than I did in this field, but I'd be really curious to know how much of an effect an intermediary router can have on a TCP flow with SACK that doesn't cause more packet loss than anyone would put up with for interactive sessions.
my takeaway from the web site was that one of the ways p2p is bad is that it tends to start several parallel tcp sessions from the same client (i guess think of bittorrent where you're getting parts of the file from several folks at once). since each one has its own state machine, each will try to sense the end to end bandwidth-delay product. thus, on headroom-free links, each will get 1/Nth of that link's bandwidth, which could be (M>1)/Nth aggregate, and apparently this is unfair to the other users depending on that link.
This is true. But it's not just bittorrent that does this. IE8 opens up to 6 parallel TCP sockets to a single server, Firefox can be tweaked to open an arbitrary number (and a lot of "Download Optimizers" do exactly that), etc. Unless you're keeping a lot of state on the history of what each client is doing, it's going to be hard to tell the difference between 6 IE sockets downloading cnn.com rapidly and bittorrent masquerading as HTTP.
i guess i can see the point, if i squint just right. nobody wants to get blown off the channel because someone else gamed the fairness mechanisms. (on the other hand some tcp stacks are deliberately overaggressive in ways that don't require M>1 connections to get (M>1)/Nth of a link's bandwidth. on the internet, generally speaking, if someone else says fairness be damned, then fairness will be damned.
Exactly. I'm nervously waiting for the first bittorrent client to have their own TCP engine built into it that plays even more unfairly. I seem to remember a paper that described where one client was sending ACKs faster than it was actually receiving the data from several well connected servers, and ended up bringing enough traffic in to completely swamp their university's pipes.

As soon as P2P authors realize they can get around caps by not playing by the rules, you'll be back to putting hard limits on each subscriber - which is where we are now. I'm not saying some fancier magic couldn't be put over top of that, but that's all depending on everyone to play by the rules to begin with.
however, i'm not sure that all TCP sessions having one endpoint in common or even all those having both endpoints in common ought to share fate. one of those endpoints might be a NAT box with M>1 users behind it, for example.
in answer to your question about SACK, it looks like they simulate a slower link speed for all TCP sessions that they guess are in the same flow-bundle. thus, all sessions in that flow-bundle see a single shared contributed bandwidth-delay product from any link served by one of their boxes.
Yeah, I guess the point I was trying to make is that once you throw SACK into the equation you lose the assumption that if you drop TCP packets, TCP slows down. Before New Reno, fast-retransmit and SACK this was true and very easy to model. Now you can drop a considerable number of packets and TCP doesn't slow down very much, if at all.

If you're worried about data that your clients are downloading you're either throwing away data from the server (which is wasting bandwidth getting all the way to you) or throwing away your clients' ACKs. Lost ACKs do almost nothing to slow down TCP unless you've thrown them *all* away.

I'm not saying all of this is completely useless, but it's relying a lot on the fact that the people you're trying to rate limit are going to be playing by the same rules you intended. This makes me really wish that something like ECN had taken off - any router between the two end-points can say "slow this connection down" and (if both ends are playing by the rules) they do so without wasting time on retransmits.

-- Kevin
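For reference, the ECN mechanism mentioned above signals congestion by marking instead of dropping: a congested router sets the Congestion Experienced codepoint on packets whose senders negotiated ECN, the receiver echoes it back in TCP (the ECE flag), and the sender backs off as it would for a loss, without a retransmission. A minimal sketch of the router-side decision (illustrative only, not a real forwarding path):

    NOT_ECT, ECT_1, ECT_0, CE = 0b00, 0b01, 0b10, 0b11  # IP-header ECN codepoints (RFC 3168)

    def signal_congestion(ecn_field):
        """What a router can do when its queue decides this packet should carry a congestion signal."""
        if ecn_field in (ECT_0, ECT_1):
            return CE, "forward"      # ECN-capable flow: mark it and keep the packet
        return ecn_field, "drop"      # legacy flow: dropping is the only signal available

The catch is the one raised above: both endpoints and the bottleneck have to cooperate, and a sender that ignores the echoed mark gains just as much as one that ignores loss.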
On Apr 5, 2008, at 9:40 AM, Kevin Day wrote:
in answer to your question about SACK, it looks like they simulate a slower link speed for all TCP sessions that they guess are in the same flow-bundle. thus, all sessions in that flow-bundle see a single shared contributed bandwidth-delay product from any link served by one of their boxes.
Yeah, I guess the point I was trying to make is that once you throw SACK into the equation you lose the assumption that if you drop TCP packets, TCP slows down. Before New Reno, fast-retransmit and SACK this was true and very easy to model. Now you can drop a considerable number of packets and TCP doesn't slow down very much, if at all. If you're worried about data that your clients
That's only partially correct: TCP doesn't _time out_, but it still cuts its sending window in half (ergo, it cuts the rate at which it sends in half). The TCP sending rate computations are unchanged by either NewReno or SACK; the difference is that NR and SACK are much more efficient at getting back on their feet after the loss and:

a) Are less likely to retransmit packets they've already sent
b) Are less likely to go into a huge timeout and therefore back to slow-start

You can force TCP into basically whatever sending rate you want by dropping the right packets.
are downloading you're either throwing away data from the server (which is wasting bandwidth getting all the way to you) or throwing away your clients' ACKs. Lost ACKs do almost nothing to slow down TCP unless you've thrown them *all* away.
You're definitely tossing useful data. One can argue that you're going to do that anyway at the bottleneck link, but I'm not sure I've had enough espresso to make that argument yet. :)
I'm not saying all of this is completely useless, but it's relying a lot on the fact that the people you're trying to rate limit are going to be playing by the same rules you intended. This makes me really wish that something like ECN had taken off - any router between the two end-points can say "slow this connection down" and (if both ends are playing by the rules) they do so without wasting time on retransmits.
Yup. -Dave
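A tiny sketch of the window arithmetic behind the point above: the additive-increase/multiplicative-decrease rule is unchanged by SACK, so each detected loss still halves the congestion window, and whoever controls which packets get dropped controls where the window settles. Slow start, timeouts and SACK's recovery details are deliberately left out of this sketch:

    def cwnd_trace(num_rtts, loss_rtts, initial_cwnd=10.0):
        """Congestion window (in segments) after each RTT under plain AIMD."""
        cwnd, trace = initial_cwnd, []
        for rtt in range(num_rtts):
            if rtt in loss_rtts:
                cwnd = max(cwnd / 2, 2)  # multiplicative decrease: halve on each detected loss
            else:
                cwnd += 1                # additive increase: roughly one segment per RTT
            trace.append(cwnd)
        return trace

    # Dropping "the right packets" once every few RTTs caps the window wherever you like:
    print(cwnd_trace(12, loss_rtts={3, 7, 11}))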
Kevin Day wrote:
Yeah, I guess the point I was trying to make is that once you throw SACK into the equation you lose the assumption that if you drop TCP packets, TCP slows down. Before New Reno, fast-retransmit and SACK this was true and very easy to model. Now you can drop a considerable number of packets and TCP doesn't slow down very much, if at all. If you're worried about data that your clients are downloading you're either throwing away data from the server (which is wasting bandwidth getting all the way to you) or throwing away your clients' ACKs. Lost ACKs do almost nothing to slow down TCP unless you've thrown them *all* away.

If this was true, surely it would mean that drop models such as WRED/RED are becoming useless?
Sam
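For context on the question above, RED (and WRED, which is the same machinery with per-class thresholds) is the "random discards from output queues" option from the original article: it drops, or with ECN marks, packets with a probability that ramps up as the averaged queue depth crosses a threshold. A compact sketch of the textbook algorithm, with illustrative rather than tuned parameters (the count-since-last-drop correction is omitted):

    import random

    class Red:
        """Textbook RED: probabilistic early drop driven by an EWMA of queue depth."""

        def __init__(self, min_th=5, max_th=15, max_p=0.1, weight=0.002):
            self.min_th, self.max_th, self.max_p, self.weight = min_th, max_th, max_p, weight
            self.avg = 0.0

        def should_drop(self, queue_len):
            # Exponentially weighted moving average of the instantaneous queue depth.
            self.avg = (1 - self.weight) * self.avg + self.weight * queue_len
            if self.avg < self.min_th:
                return False                             # light load: never drop
            if self.avg >= self.max_th:
                return True                              # sustained overload: drop everything
            ramp = (self.avg - self.min_th) / (self.max_th - self.min_th)
            return random.random() < ramp * self.max_p   # linear ramp between the thresholds

Whether the resulting drops actually slow anyone down is then up to the senders, which is exactly where the SACK discussion above comes in.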
On Sat, 5 Apr 2008, Kevin Day wrote:
On Apr 4, 2008, at 8:51 PM, Paul Vixie wrote:
What is really necessary is to detect just the flows that need to slow down, and selectively discard just one packet at the right time, but not more, per TCP cycle. Discarding too many will cause a flow to stall -- we see this when Web access takes forever. ...
i wouldn't want to get in an argument with somebody who was smart and savvy enough to invent packet switching during the year i entered kindergarden, but, somebody told me once that keeping information on every flow was *not* "inexpensive." should somebody tell dr. roberts?
I suggest reading the excellent page: "High-Speed TCP Variants": http://kb.pert.geant2.net/PERTKB/TcpHighSpeedVariants Enough material there to keep NANOG readers busy all weekend long. -Hank
On Sat, 5 Apr 2008, Kevin Day wrote:
On Apr 4, 2008, at 8:51 PM, Paul Vixie wrote:
What is really necessary is to detect just the flows that need to slow down, and selectively discard just one packet at the right time, but not more, per TCP cycle. Discarding too many will cause a flow to stall -- we see this when Web access takes forever. ...
I suggest reading the excellent page: "High-Speed TCP Variants": http://kb.pert.geant2.net/PERTKB/TcpHighSpeedVariants

[Charles N Wyble] Hank, This is in fact an excellent resource. Thanks for sharing it!
On 5 apr 2008, at 12:34, Kevin Day wrote:
As long as you didn't drop more packets than SACK could handle (generally 2 packets in-flight) dropping packets is pretty ineffective at causing TCP to slow down.
It shouldn't be. TCP hovers around the maximum bandwidth that a path will allow (if the underlying buffers are large enough). It increases its congestion window in congestion avoidance until a packet is dropped, then the congestion window shrinks but it also starts growing again.

If you read "The macroscopic behavior of the TCP Congestion Avoidance algorithm" by Mathis et al you'll see that TCP performance conforms to:

    bandwidth = MSS / RTT * C / sqrt(p)

Where MSS is the maximum segment size, RTT the round trip time, C a constant close to 1 and p the packet loss probability.

Since the overshooting of the congestion window causes congestion = packet loss, you end up at some equilibrium of bandwidth and packet loss. Or, for a given link: number of flows, bandwidth and packet loss.

I'm sure this behavior isn't any different in the presence of SACK.

However, the caveat is that the congestion window never shrinks below two maximum segment sizes. If packet loss is such that you reach that size, then more packet loss will not slow down sessions. Note that for short RTTs you can still move a fair amount of data in this state, but any lost packet means a retransmission timeout, which stalls the session.
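Plugging numbers into that formula shows how the pieces trade off; a direct transcription into Python, taking C as 1 and using assumed example values for MSS, RTT and p:

    from math import sqrt

    def mathis_throughput_bps(mss_bytes, rtt_s, loss_prob, c=1.0):
        """Steady-state TCP throughput estimate: bandwidth = MSS / RTT * C / sqrt(p)."""
        return (mss_bytes * 8 / rtt_s) * (c / sqrt(loss_prob))

    # 1460-byte MSS at a 100 ms RTT: halving the RTT doubles the estimate, while a
    # 100x higher loss probability only cuts it by a factor of 10 (sqrt dependence).
    print(mathis_throughput_bps(1460, 0.100, 1e-4) / 1e6)  # ~11.7 Mb/s
    print(mathis_throughput_bps(1460, 0.100, 1e-2) / 1e6)  # ~1.2 Mb/s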
You've also got fast retransmit, New Reno, BIC/CUBIC, as well as host parameter caching to limit the effect of packet loss on recovery time.
The really interesting one is TCP Vegas, which doesn't need packet loss to slow down. But Vegas is a bit less aggressive than Reno (which is what's widely deployed) or New Reno (which is also deployed but not so widely). This is a disincentive for users to deploy it, but it would be good for service providers. Additional benefit is that you don't need to keep huge numbers of buffers in your routers and switches because Vegas flows tend to not overshoot the maximum available bandwidth of the path.
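A sketch of the Vegas idea, for contrast: instead of pushing until a loss, Vegas compares the throughput it would expect at the base (uncongested) RTT with what it actually measured, and backs off as soon as the difference says segments are piling up in a queue. Alpha and beta are the usual small thresholds, in segments; this shows the congestion-avoidance step only:

    def vegas_update(cwnd, base_rtt_s, measured_rtt_s, alpha=2, beta=4):
        """One TCP Vegas congestion-avoidance step; window sizes are in segments."""
        expected = cwnd / base_rtt_s               # rate if nothing were queued along the path
        actual = cwnd / measured_rtt_s             # rate actually achieved over the last RTT
        queued = (expected - actual) * base_rtt_s  # estimated segments sitting in queues
        if queued < alpha:
            return cwnd + 1                        # path looks empty: probe for more bandwidth
        if queued > beta:
            return cwnd - 1                        # queue building: back off before any loss
        return cwnd

Because it reacts to rising delay rather than to loss, a Vegas flow sharing a bottleneck with Reno flows tends to yield, which is the deployment disincentive mentioned above.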
On Apr 7, 2008, at 7:17 AM, Iljitsch van Beijnum wrote:
On 5 apr 2008, at 12:34, Kevin Day wrote:
As long as you didn't drop more packets than SACK could handle (generally 2 packets in-flight) dropping packets is pretty ineffective at causing TCP to slow down.
It shouldn't be. TCP hovers around the maximum bandwidth that a path will allow (if the underlying buffers are large enough). It increases its congestion window in congestion avoidance until a packet is dropped, then the congestion window shrinks but it also starts growing again.
I'm sure this behavior isn't any different in the presence of SACK.
At least in FreeBSD, packet loss handled by SACK recovery changes the congestion window behavior. During a SACK recovery, the congestion window is clamped down to allow no more than 2 additional segments in flight, but that only lasts until the recovery is complete and quickly recovers. (That's significantly glossing over a lot of details that probably only matter to those who already know them - don't shoot me for that not being 100% accurate :) ) I don't believe that Linux or Windows are quite that aggressive with SACK recovery though, but I'm less familiar there.

As a quick example on two FreeBSD 7.0 boxes attached directly over GigE, with New Reno, fast retransmit/recovery, and 256K window sizes, with an intermediary router simulating packet loss. A single HTTP TCP session going from a server to client.

SACK enabled,  0% packet loss:     780Mbps
SACK disabled, 0% packet loss:     780Mbps
SACK enabled,  0.005% packet loss: 734Mbps
SACK disabled, 0.005% packet loss: 144Mbps (19.6% the speed of having SACK enabled)
SACK enabled,  0.01% packet loss:  664Mbps
SACK disabled, 0.01% packet loss:  88Mbps (13.3%)

However, this falls apart pretty fast when the packet loss is high enough that SACK doesn't spend enough time outside the recovery phase. It's still much better than without SACK though:

SACK enabled,  0.1% packet loss:   48Mbps
SACK disabled, 0.1% packet loss:   36Mbps (75%)
However, the caveat is that the congestion window never shrinks below two maximum segment sizes. If packet loss is such that you reach that size, then more packet loss will not slow down sessions. Note that for short RTTs you can still move a fair amount of data in this state, but any lost packet means a retransmission timeout, which stalls the session.
True, a longer RTT changes this effect. Same test, but instead of back-to-back GigE, this is going over a real-world trans-atlantic link:

SACK enabled,  0% packet loss:     2.22Mbps
SACK disabled, 0% packet loss:     2.23Mbps
SACK enabled,  0.005% packet loss: 2.03Mbps
SACK disabled, 0.005% packet loss: 1.95Mbps (96%)
SACK enabled,  0.01% packet loss:  2.01Mbps
SACK disabled, 0.01% packet loss:  1.94Mbps (96%)
SACK enabled,  0.1% packet loss:   1.93Mbps
SACK disabled, 0.1% packet loss:   0.85Mbps (44%)

(No, this wasn't a scientifically valid test there, but the best I can do for an early Monday morning)
You've also got fast retransmit, New Reno, BIC/CUBIC, as well as host parameter caching to limit the effect of packet loss on recovery time.
The really interesting one is TCP Vegas, which doesn't need packet loss to slow down. But Vegas is a bit less aggressive than Reno (which is what's widely deployed) or New Reno (which is also deployed but not so widely). This is a disincentive for users to deploy it, but it would be good for service providers. Additional benefit is that you don't need to keep huge numbers of buffers in your routers and switches because Vegas flows tend to not overshoot the maximum available bandwidth of the path.
It would be very nice if more network-friendly protocols were in use, but with "download optimizers" for Windows that cranks the TCP window sizes way up, the general move to solving latency by opening more sockets, and P2P doing whatever it can to evade ISP detection - it's probably a bit late. -- Kevin
On 7 apr 2008, at 16:20, Kevin Day wrote:
As a quick example on two FreeBSD 7.0 boxes attached directly over GigE, with New Reno, fast retransmit/recovery, and 256K window sizes, with an intermediary router simulating packet loss. A single HTTP TCP session going from a server to client.
Ok, assuming a 1460 MSS that leaves the RTT as the unknown.
SACK enabled, 0% packet loss: 780Mbps SACK disabled, 0% packet loss: 780Mbps
Is that all? Try with jumboframes.
SACK enabled, 0.005% packet loss: 734Mbps SACK disabled, 0.005% packet loss: 144Mbps (19.6% the speed of having SACK enabled)
144 Mbps and 0.00005 packet loss probability would result in a ~ 110 ms RTT so obviously something isn't right with that case. 734 would be an RTT of around 2 ms, which sounds fairly reasonable. I'd be interested to see what's really going on here, I suspect that the packet loss isn't sufficiently random so multiple segments are lost from a single window. Or maybe disabling SACK also disables fast retransmit? I'll be happy to look at a tcpdump for the 144 Mbps case.
It would be very nice if more network-friendly protocols were in use, but with "download optimizers" for Windows that cranks the TCP window sizes way up, the general move to solving latency by opening more sockets, and P2P doing whatever it can to evade ISP detection - it's probably a bit late.
Don't forget that the user is only partially in control, the data also has to come from somewhere. Service operators have little incentive to break the network. And users would probably actually like it if their p2p was less aggressive, that way you can keep it running when you do other stuff without jumping through traffic limiting hoops.
On Mon, Apr 7, 2008 at 8:43 AM, Iljitsch van Beijnum <iljitsch@muada.com> wrote:
On 7 apr 2008, at 16:20, Kevin Day wrote:
As a quick example on two FreeBSD 7.0 boxes attached directly over GigE, with New Reno, fast retransmit/recovery, and 256K window sizes, with an intermediary router simulating packet loss. A single HTTP TCP session going from a server to client.
Ok, assuming a 1460 MSS that leaves the RTT as the unknown.
SACK enabled, 0% packet loss: 780Mbps SACK disabled, 0% packet loss: 780Mbps
Is that all? Try with jumboframes.
SACK enabled, 0.005% packet loss: 734Mbps SACK disabled, 0.005% packet loss: 144Mbps (19.6% the speed of having SACK enabled)
144 Mbps and 0.00005 packet loss probability would result in a ~ 110 ms RTT so obviously something isn't right with that case.
734 would be an RTT of around 2 ms, which sounds fairly reasonable.
I'd be interested to see what's really going on here, I suspect that the packet loss isn't sufficiently random so multiple segments are lost from a single window. Or maybe disabling SACK also disables fast retransmit? I'll be happy to look at a tcpdump for the 144 Mbps case.
It would be very nice if more network-friendly protocols were in use, but with "download optimizers" for Windows that cranks the TCP window sizes way up, the general move to solving latency by opening more sockets, and P2P doing whatever it can to evade ISP detection - it's probably a bit late.
Don't forget that the user is only partially in control, the data also has to come from somewhere. Service operators have little incentive to break the network. And users would probably actually like it if their p2p was less aggressive, that way you can keep it running when you do other stuff without jumping through traffic limiting hoops.
This might have been mentioned earlier in the thread, but has anyone read the paper by Bob Briscoe titled "Flow Rate Fairness: Dismantling a Religion"?

http://www.cs.ucl.ac.uk/staff/bbriscoe/projects/2020comms/refb/draft-briscoe...

The paper essentially describes the fault in TCP congestion avoidance and how P2P applications leverage that flaw to consume as much bandwidth as possible. He also proposes that we redefine the mechanism we use to determine "fair" resource consumption. His example is individual flow rate fairness (traditional TCP congestion avoidance) vs cost fairness (a combination of congestion "cost" and flow rate associated with a specific entity). He also compares his cost fairness methodology to existing proposed TCP variants which Hank previously mentioned, i.e. XCP, WFQ, ...

Any thoughts regarding this?

-Mike Gonnason
Mike Gonnason wrote:
This might have been mentioned earlier in the thread, but has anyone read the paper by Bob Briscoe titled "Flow Rate Fairness:Dismantling a Religion"? http://www.cs.ucl.ac.uk/staff/bbriscoe/projects/2020comms/refb/draft-briscoe... The paper essentially describes the fault in TCP congestion avoidance and how P2P applications leverage that flaw to consume as much bandwidth as possible. He also proposes that we redefine the mechanism we use to determine "fair" resource consumption.
Any thoughts regarding this?
The problem is that fairness was probably never a design goal of TCP, even with Van Jacobson's congestion avoidance patch.

Bob Briscoe is a member of the IETF Transport Working Group (TSVWG). This subject got some publicity and politics involved, but please see some real discussion on the TSVWG list, with my favorite answer highlighted:

http://thread.gmane.org/gmane.ietf.tsvwg/5184/focus=5199

I recommend some neighboring threads as well:

http://thread.gmane.org/gmane.ietf.tsvwg/5197/focus=5214
http://thread.gmane.org/gmane.ietf.tsvwg/5205

-- << Marcin Cieslak // saper@saper.info >>
On Wed, Apr 09, 2008 at 01:10:53AM +0200, Marcin Cieslak wrote:
The problem is that fairness was probably never a design goal of TCP, even with Van Jacobson's congestion avoidance patch.
Bob Briscoe is a member of the IETF Transport Working Group (TSVWG).
This subject got some publicity and politics involved, but please see some real discussion on the TSVWG list, with my favorite answer highlighted:
This issue also got some publicity and politics on the IRTF end2end list. For example, start at http://www.postel.org/pipermail/end2end-interest/2007-August/006925.html . --gregbo
On Tue, Apr 8, 2008 at 3:33 PM, Greg Skinner <gds@gds.best.vwh.net> wrote:
On Wed, Apr 09, 2008 at 01:10:53AM +0200, Marcin Cieslak wrote:
The problem is that fairness was probably never a design goal of TCP, even with Van Jacobson's congestion avoidance patch.
Bob Briscoe is a member of the IETF Transport Working Group (TSVWG).
This subject got some publicity and politics involved, but please see some real discussion on the TSVWG list, with my favorite answer highlighted:
This issue also got some publicity and politics on the IRTF end2end list.
For example, start at http://www.postel.org/pipermail/end2end-interest/2007-August/006925.html .
--gregbo
Thank you both (Marcin and Greg) for the links, they have made for some great reading tonight.

It would appear that I have introduced a slight tangent from the goal of this topic. The main discussion here was regarding an overhaul of TCP, whereas my Briscoe suggestion is more of an architectural overhaul which leads to many other changes (infrastructural, economical and protocol) in the network. Briscoe's somewhat undiplomatic introduction of his idea seems to have elicited an initially negative response, however I am happy to see that others have found his ideas to have merit and are testing his reasoning.

I surmise that the big question now is, do we go with a small step (enhanced congestion detection) or a jump (total reworking of network policing architecture)? I am glad to say that whatever is decided, it will most likely be implemented far faster than IPv6, as we will not have a specific feature to buy us time from congestion, unlike what NAT did for IPv4. :)

-Mike Gonnason
I'm not quite sure about with this Fast Flow technology is exactly what it's really doing that isn't being done already by the DPI vendors (eg. Cisco SCE2000 on the Cisco webpage claims to be able to track 2million unidirectional flows). Am I missing something? MMC Paul Vixie wrote: ...
i wouldn't want to get in an argument with somebody who was smart and savvy enough to invent packet switching during the year i entered kindergarden, but, somebody told me once that keeping information on every flow was *not* "inexpensive." should somebody tell dr. roberts?
(i'd hate to think that everybody would have to buy roberts' (anagran's) Fast Flow Technology at every node of their network to make this work. that doesn't sound "inexpensive" to me.
Reworded: What I'm not quite sure about is what "Fast Flow" is really doing that isn't being done already by the DPI vendors (eg. Cisco SCE2000, which on the Cisco webpage claims to be able to track 2 million unidirectional flows).
Am I missing something?
Aside from my horribly broken attempt at English? :-) MMC
MMC
Paul Vixie wrote:
...
i wouldn't want to get in an argument with somebody who was smart and savvy enough to invent packet switching during the year i entered kindergarden, but, somebody told me once that keeping information on every flow was *not* "inexpensive." should somebody tell dr. roberts?
(i'd hate to think that everybody would have to buy roberts' (anagran's) Fast Flow Technology at every node of their network to make this work. that doesn't sound "inexpensive" to me.
participants (18)
- Charles N Wyble
- Christopher Morrow
- David Andersen
- Greg Skinner
- Hank Nussbacher
- Iljitsch van Beijnum
- Jeroen Massar
- Jorge Amodio
- Kevin Day
- Lincoln Dale
- Marcin Cieslak
- Matthew Moyle-Croft
- michael.dillon@bt.com
- Mike Gonnason
- Paul Vixie
- Roland Dobbins
- Sam Stickland
- Steven M. Bellovin