Hey,

In the past few years there has been a lot of talk about reducing buffer depths, and many seem to think vendors are throwing memory onto the chips just for the fun of it.

Let's look at a particularly pathological case. Assume the sender is a CDN with a 40GE-connected server and the receiver is 10GE-connected, with 300ms of latency between them. 10Gbps * 300ms = 375MB, which is the window size the client needs to fill its pipe from the 40GE sender. However, TCP does not normally pace packets inside the window, so the 40GE server will flood the window as fast as it can instead of limiting itself to 10Gbps; optimally it'll send at line rate. The receiver can only serialise them out at 10GE, so the majority of that 375MB ends up in the sender-side switch/router buffers. If we can't buffer that, then the receiver cannot receive at 10Gbps, as the window size will shrink.

Is this a problem? What rate should you be able to expect, and at what latency? Usually contracts to customers won't put any limits on the bandwidth achievable at a given latency, and writing such limits down might make you appear inferior to your competitor.

Perhaps this is an unrealistic case, but if you run the numbers on much less pathological cases, you'll still end up with much larger buffer needs than a large number of switch chips out there have. Some new ones, like the JNPR QFX10k and Broadcom Jericho, come with much larger buffers than their predecessors and will be able to deal with what I hope are most practical cases.

Linux these days actually does have a bandwidth estimator for TCP sessions, but it's not used by default for anything; it's just there for consumption by other layers so they can do something about it. And I believe in 'tc' you can use it to cause packet pacing inside a window. QUIC and MinimaLT, AFAIK, do bandwidth estimation and packet pacing by default.

In a perfect world, we'd be done now: the receiver-side switch could do with very small buffers, a few packets should suffice. However, if the network itself is congested, the bandwidth estimate keeps sinking, these well-behaved streams lose to the aggressive TCP streams, and you end up with 0bps estimates. So perhaps the bandwidth estimator should be application-aware and never report a lower estimate than what is practical for the given application, so that it can compete fairly with aggressive streams, up to the required rate.

Information I'd love to have: how large do TCP window sizes peak at in real networks? Some CDN must be collecting these stats, and I'd love to see rough numbers. <1% go over 100MB? 2% between 50MB-100MB? ... a few large brackets of the distribution of window sizes from some CDN offering content download (GGC, OpenConnect are not interesting, as they won't send large files).

Also, are some CDNs already implementing packet pacing inside the window? If so, how? Do they have a lower limit to it?

Some related URLs:

https://lwn.net/Articles/645115/
https://lwn.net/Articles/564978/
http://www.ietf.org/proceedings/88/slides/slides-88-iccrg-6.pdf
http://www.ietf.org/proceedings/84/slides/slides-84-iccrg-2.pdf

-- ++ytti
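To make the arithmetic concrete, here is a minimal Python sketch of the numbers above; the 40GE/10GE rates and 300ms RTT are just the assumptions from this example, not measurements from any real network.

    # Back-of-the-envelope numbers for the 40GE -> 10GE example above.
    SENDER_RATE   = 40e9      # bit/s: the 40GE-connected CDN server
    RECEIVER_RATE = 10e9      # bit/s: the 10GE-connected client
    RTT           = 0.300     # seconds of path latency

    # Window needed to keep the 10GE pipe full over 300 ms (bandwidth-delay product):
    bdp_bytes = RECEIVER_RATE * RTT / 8
    print(f"required window: {bdp_bytes / 1e6:.0f} MB")        # -> 375 MB

    # If the sender bursts that whole window at 40 Gbps instead of pacing to
    # 10 Gbps, the bottleneck drains at only 10 Gbps while the burst lasts,
    # so most of the window has to sit in a buffer somewhere along the path.
    burst_time = bdp_bytes * 8 / SENDER_RATE                   # ~75 ms to emit the window
    drained    = RECEIVER_RATE * burst_time / 8                # ~94 MB serialised out meanwhile
    print(f"queued during the burst: {(bdp_bytes - drained) / 1e6:.0f} MB")   # -> ~281 MB

Pacing the sender down to roughly the receiver's 10Gbps, which is broadly what the fq/pacing work covered in the LWN links above is about, is what keeps that queue from forming in the first place.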
On 03/09/2015 11:56, Saku Ytti wrote:
the 40GE server will flood the window as fast as it can instead of limiting itself to 10Gbps; optimally it'll send at line rate.
Optimally, but TCP slow start will generally stop this from happening on well-behaved sending-side stacks, so you end up ramping up quickly to path rate rather than egress line rate from the sender side.

Also, regardless of an individual flow's buffering requirements, the intermediate path will be catering for large numbers of flows, so while it's interesting to talk about 375MB of intermediate path buffers, this is shared buffer space and any attempt on the part of an individual sender to (ab)use the entire path buffer will end up causing RED/WRED for everyone else.

Otherwise, this would be a fascinating talk if people had real world data.

Nick
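To put rough numbers on the slow-start point, a minimal sketch follows; the 10-segment initial window, 1460-byte MSS and loss-free doubling per RTT are textbook simplifications, not figures taken from the thread.

    MSS       = 1460          # bytes (assumed)
    INIT_CWND = 10 * MSS      # bytes; 10 segments is a common Linux initcwnd default
    TARGET    = 375e6         # bytes, the window from the example above
    RTT       = 0.300         # seconds

    cwnd, rtts = INIT_CWND, 0
    while cwnd < TARGET:
        cwnd *= 2             # textbook slow start: cwnd roughly doubles per RTT
        rtts += 1
    print(f"~{rtts} RTTs (~{rtts * RTT:.1f} s) of loss-free slow start to reach 375 MB")

In practice the first loss or ECN mark ends the exponential phase long before the window gets anywhere near 375MB, which is the behaviour being described here: a well-behaved stack ramps to path rate rather than bursting at egress line rate.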
On 3 September 2015 at 15:04, Nick Hilliard <nick@foobar.org> wrote:
Optimally, but TCP slow start will generally stop this from happening on well-behaved sending-side stacks, so you end up ramping up quickly to path rate rather than egress line rate from the sender side.
This assumes the network is congested and unable to reach its potential rate. If it can reach its potential rate, eventually the window will scale to 375MB and the pathological flooding will occur. Mostly the network is congested, and the pathological case cannot happen, as the egress cannot ingest the floods, which does not allow the window to grow to the needed size; this also means the potential rate will not be reached, and the rate will be something less than 10Gbps. Essentially we threw the baby out with the bath water, kind of like protecting from DoS by killing the victim. -- ++ytti
On Thu, Sep 03, 2015 at 01:04:34PM +0100, Nick Hilliard wrote:
On 03/09/2015 11:56, Saku Ytti wrote:
the 40GE server will flood the window as fast as it can instead of limiting itself to 10Gbps; optimally it'll send at line rate.
Optimally, but TCP slow start will generally stop this from happening on well-behaved sending-side stacks, so you end up ramping up quickly to path rate rather than egress line rate from the sender side. Also, regardless of an individual flow's buffering requirements, the intermediate path will be catering for large numbers of flows, so while it's interesting to talk about 375MB of intermediate path buffers, this is shared buffer space and any attempt on the part of an individual sender to (ab)use the entire path buffer will end up causing RED/WRED for everyone else.
Otherwise, this would be a fascinating talk if people had real world data.
The original analysis is flawed because it assumes latency is constant. Any analysis has to include the fact that buffering changes latency.

If you start with a 300ms path (from propagation delay, switching latency, etc.) and 375MB of buffers on a 10G port, then, when the buffers fill, you end up with a 600ms path[1]. And a 375MB window is no longer sufficient to keep the pipe full. Instead, you need a 750MB buffer. But now the latency is 900ms. And so on. This doesn't converge. Every byte of filled buffer is another byte you need in the window if you're going to fill the pipe. Not accounting for this is part of the reason the original analysis is flawed. The end result is that you always run out of window or run out of buffer (causing packet loss).

Here's a paper that shows you don't need buffers equal to bandwidth*delay to get near capacity: http://www.cs.bu.edu/~matta/Papers/hstcp-globecom04.pdf (I'm not endorsing it. Just pointing it out as a datapoint.)

-- Brett

[1] 0.300 s + 375E6 bytes * 8 / 10E9 bit/s = 0.600 s = 600ms
Hey Brett,
Here's a paper that shows you don't need buffers equal to bandwidth*delay to get near capacity: http://www.cs.bu.edu/~matta/Papers/hstcp-globecom04.pdf (I'm not endorsing it. Just pointing it out as a datapoint.)
A quick glance makes me believe the S and D nodes are equal bandwidth, but only the R1-R2 bandwidth is explicitly stated; S1, D1, Sn, Dn are only ever mentioned in the topology. If the Sender is the same or lower rate than the Destination, then we shouldn't need almost any buffering. The issue should only arise when the Sender is at a significantly higher rate than the Destination and the network is not limiting them. -- ++ytti
On Thu, Sep 03, 2015 at 05:48:00PM +0300, Saku Ytti wrote:
Hey Brett,
Here's a paper that shows you don't need buffers equal to bandwidth*delay to get near capacity: http://www.cs.bu.edu/~matta/Papers/hstcp-globecom04.pdf (I'm not endorsing it. Just pointing it out as a datapoint.)
A quick glance makes me believe the S and D nodes are equal bandwidth, but only the R1-R2 bandwidth is explicitly stated; S1, D1, Sn, Dn are only ever mentioned in the topology. If the Sender is the same or lower rate than the Destination, then we shouldn't need almost any buffering.
Unless the Sender is at a higher rate than R1-R2.
The issue should only arise when the Sender is at a significantly higher rate than the Destination and the network is not limiting them.
I didn't read it in detail either, but at first glance it appears to me that the model is infinite bandwidth and zero latency between S and R1, and between D and R2, with queueing happening in R1. That's not going to give materially different results than having S-R1 be 4 times R1-R2, and R2-D be the same as R1-R2. So it fits well with the original discussion here of 40G into 10G. -- Brett
Can anyone provide references on this topic so I can educate myself?
About every edition of Packet Pushers Podcast for the last 18 months would be a good start probably. That'll keep you busy. Jethro. On Fri, 4 Sep 2015, Rod Beck wrote:
Can anyone provide references on this top so I can educate myself?
Jethro R Binks, Network Manager, Information Services Directorate, University Of Strathclyde, Glasgow, UK. The University of Strathclyde is a charitable body, registered in Scotland, number SC015263.
On 4 Sep 2015, at 15:40, Rod Beck <Rod.Beck@hibernianetworks.com> wrote:
Can anyone provide references on this top so I can educate myself?
This might be of help: http://packetpushers.net/sdn-network-virtualization-hypervisors/ Niraj ---------------------- Niraj Kacha Network Security Loughborough University
There's also a quite comprehensive survey from an academic angle: http://arxiv.org/abs/1406.0440
On Fri, 4 Sep 2015 14:40:31 +0000 Rod Beck <Rod.Beck@hibernianetworks.com> wrote:
Can anyone provide references on this top so I can educate myself?
A bit more effort will be required on your part to get the most out of it, but one potentially in-depth resource would be Nick Feamster's Software Defined Networking course, currently available through Coursera: <https://www.coursera.org/course/sdn1> John
On Sep 4, 2015, at 07:40, Rod Beck <Rod.Beck@hibernianetworks.com> wrote:
Can anyone provide references on this top so I can educate myself?
What do you mean when you say “software defined networking”? Do you have a particular problem or use case you are approaching? Cheers, -j
I thought it was purely for vendors to sell you more crap? ----- Mike Hammett Intelligent Computing Solutions http://www.ics-il.com ----- Original Message ----- From: "Nick Hilliard" <nick@foobar.org> To: nanog@nanog.org Sent: Monday, September 7, 2015 4:14:38 PM Subject: Re: Software Defined Networking On 07/09/2015 21:23, James Downs wrote:
What do you mean when you say “software defined networking”? Do you have a particular problem or use case you are approaching?
since when was that a requirement for SDN? Nick
On 8/09/2015 7:14 am, "Nick Hilliard" <nick@foobar.org> wrote:
On 07/09/2015 21:23, James Downs wrote:
What do you mean when you say “software defined networking”? Do you have a particular problem or use case you are approaching?
since when was that a requirement for SDN?
Nick
Yes. Usually Automation/Orchestration and allowing the customer to manage their own network requirements in real-time through a portal/iPhone etc? Cheers [b]
On 09/09/15 11:51, Bevan Slattery wrote:
Yes. Usually Automation/Orchestration and allowing the customer to manage their own network requirements in real-time through a portal/iPhone etc?
"But, where does this OpenFlow stuff fit into that?" Ad infinitum. -- Tom
participants (12)

- Bevan Slattery
- Brett Frankenberger
- James Downs
- Jethro R Binks
- John Kristoff
- Mike Hammett
- Narseo Vallina Rodriguez
- Nick Hilliard
- Niraj Kacha
- Rod Beck
- Saku Ytti
- Tom Hill