Hey,

In the past few years there's been a lot of talk about reducing buffer depths, and many seem to think vendors are throwing memory on the chips just for the fun of it.

Let's look at a particularly pathological case. Assume the sender is a CDN with a 40GE-connected server and the receiver is 10GE-connected, with 300ms of latency between them. 10Gbps * 300ms = 375MB is the window size the client needs to advertise to be able to fill its pipe from the 40GE sender. However, TCP does not normally pace packets inside the window, so the 40GE server will flood the window as fast as it can instead of limiting itself to 10Gbps; optimally it'll send at line rate. The receiver can only serialise the packets out at 10GE, so the majority of that 375MB ends up sitting in switch/router buffers on the sender side. If we can't buffer that, the receiver cannot receive at 10Gbps, as the window size will shrink.
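Just to make the arithmetic explicit, here is the bandwidth-delay product worked out in a few lines of Python (nothing beyond the numbers above):

    def bdp_bytes(bottleneck_bps, rtt_seconds):
        # Bandwidth-delay product: how many bytes must be in flight
        # (i.e. the window size) to keep the bottleneck link full.
        return bottleneck_bps * rtt_seconds / 8

    # 10Gbps receiver behind 300ms of latency:
    print(bdp_bytes(10e9, 0.300) / 1e6)   # => 375.0 MB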
Is this a problem? What rate should you be able to expect, and at what latency? Contracts with customers usually won't state any limits on the bandwidth achievable at a given latency, and writing such limits down might make you appear inferior to your competitor.

Perhaps this is an unrealistic case, but if you run the numbers on much less pathological cases, you'll still end up with much larger buffer needs than a large number of the switch chips out there have. Some new ones, like the JNPR QFX10k and Broadcom Jericho, come with much larger buffers than their predecessors and will be able to deal with what I hope are the most practical cases.

Linux these days actually does have a bandwidth estimator for TCP sessions, but it's not used for anything by default; it's just there for consumption by other layers, so they can do something about it. And I believe with 'tc' you can use it to get packet pacing inside the window.
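For example, a minimal sketch of the two knobs I believe are involved: the fq qdisc, which paces TCP according to the kernel's per-socket rate estimate, and the SO_MAX_PACING_RATE socket option (Linux 3.13+) to cap a single socket. The numeric fallback for the option value is from asm-generic/socket.h, in case your Python socket module doesn't export the constant:

    import socket

    # Pacing is enforced by the fq qdisc, so the egress interface
    # needs it installed first, e.g.:
    #   tc qdisc add dev eth0 root fq
    #
    # Fall back to the value from asm-generic/socket.h if the socket
    # module doesn't export SO_MAX_PACING_RATE.
    SO_MAX_PACING_RATE = getattr(socket, "SO_MAX_PACING_RATE", 47)

    s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    # Cap this socket's send rate at 10Gbps, expressed in bytes/sec;
    # the stack then spreads packets over the RTT instead of bursting
    # the whole window out at line rate.
    s.setsockopt(socket.SOL_SOCKET, SO_MAX_PACING_RATE, 10 * 10**9 // 8)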
QUIC and MinimaLT, AFAIK, do bandwidth estimation and packet pacing by default.

In a perfect world, we'd be done now: the receiver-side switch could do with very small buffers, a few packets should suffice. However, if the network itself is congested, the bandwidth estimate keeps sinking, and these well-behaved streams lose to the aggressive TCP streams, so you'll end up with 0bps estimates. So perhaps the bandwidth estimator should be application-aware, and never report a lower estimate than what is practical for the given application, so that it can compete fairly with the aggressive streams up to the required rate.
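To be concrete about what I mean, a purely hypothetical sketch; the class, the names and the EWMA smoothing are mine for illustration, not any real stack's API:

    class AppAwareEstimator:
        """Bandwidth estimator with an application-supplied floor."""

        def __init__(self, floor_bps, alpha=0.2):
            self.floor_bps = floor_bps   # e.g. the bitrate the app needs
            self.alpha = alpha           # EWMA smoothing factor
            self.estimate_bps = floor_bps

        def sample(self, measured_bps):
            # Smooth the raw delivery-rate samples...
            self.estimate_bps = (self.alpha * measured_bps
                                 + (1 - self.alpha) * self.estimate_bps)
            # ...but never report below what the application requires,
            # so the stream keeps competing with aggressive flows up to
            # that rate instead of being starved down to 0bps.
            return max(self.estimate_bps, self.floor_bps)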
Information I'd love to have: how large do TCP window sizes peak at in a real network? Some CDN must be collecting these stats, and I'd love to see rough numbers. <1% go over 100MB? 2% between 50MB and 100MB? ... just a few large brackets of the distribution of window sizes from some CDN offering content download (GGC and OpenConnect are not interesting, as they won't send large files). Also, are some CDNs already implementing packet pacing inside the window? If so, how? Do they have a lower limit on it?

Some related URLs:

https://lwn.net/Articles/645115/
https://lwn.net/Articles/564978/
http://www.ietf.org/proceedings/88/slides/slides-88-iccrg-6.pdf
http://www.ietf.org/proceedings/84/slides/slides-84-iccrg-2.pdf

--
  ++ytti