10GE TOR port buffers (was Re: 10G switch recommendation)
Hi,

Is there a reason vendors' 1U TOR 10GE aggregation switches are all cut-through, with no models offering deep buffers? I've been looking at every vendor I can think of and they all have the same split: TOR switches are cut-through with small buffers, and chassis-based boxes have deep buffers.

TOR:
Juniper EX4500: 208KB/10GE (4MB shared per PFE)
Cisco 4900M: 728KB/10GE (17.5MB shared)
Cisco Nexus 3064: 140KB/10GE (9MB shared)
Cisco Nexus 5000: 680KB/10GE
Force10 S2410: I can't find it anymore, but it wasn't much
Arista 7148SX: 123KB/10GE (80KB per port plus 5MB dynamic)
Arista 7050S: 173KB/10GE (9MB shared)
Brocade VDX 6730-32: 170KB/10GE
Brocade TurboIron 24X: 85KB/10GE
HP 6600-24XG: 4500KB/10GE
HP 5820-24XG-SFP+: 87KB/10GE
Extreme Summit X650: 375KB/10GE

Chassis:
Juniper EX8200-8XS: 512MB/10GE
Cisco WS-X6708-10GE: 32MB/10GE (or 24MB)
Cisco N7K-M132XP-12: 36MB/10GE
Arista DCS-7548S-LC: 48MB/10GE
Brocade BR-MLX-10Gx8-X: 128MB/10GE (not sure)

1GE aggregation:
Force10 S60: 1250MB shared
HP 5830: 3000MB shared

I am at a loss as to why there are no 10GE TOR switches with deep buffers. Apparently there is a need for deep buffers, since the vendors make them available in chassis linecards, and there are also deep-buffer 1GE aggregation switches. Is there some (technical) reason for this? I can imagine some vendors would say you need to scale up to a chassis if you need deep buffers, but at least one vendor should be able to win quite a few customers with a 10G deep-buffer TOR switch.

I understand that flow-control should prevent loss with microbursts, but my customers see adverse effects, with strongly negative performance, if they let flow-control do its thing. Any pointers on why this is, or a solution for microburst loss, would be greatly appreciated.

Thanks,

Bas
On (2012-01-27 17:35 +0100), bas wrote:
Chassis:
Juniper EX8200-8XS: 512MB/10GE
Cisco WS-X6708-10GE: 32MB/10GE (or 24MB)
Cisco N7K-M132XP-12: 36MB/10GE
Arista DCS-7548S-LC: 48MB/10GE
Brocade BR-MLX-10Gx8-X: 128MB/10GE (not sure)
1GE aggregation:
Force10 S60: 1250MB shared
HP 5830: 3000MB shared
I'd take some of these with a grain of salt. Take the EX8200-8XS; the PDF does indeed agree:

---
Total buffer size is 512 MB on each EX8200-8XS 10-Gigabit Ethernet port or each EX8200-40XS port group, and 42 MB on each EX8200-48T and EX8200-48F Gigabit Ethernet port, providing 50-100 ms of bandwidth delay buffering
---

However, 512MB is about 400ms of buffering at 10GE, while 512Mb is 50ms. So I think the JNPR PDF is just wrong, and a similar error may exist for some of the other quoted numbers.

But generally a nice list; the 10GE fixed-config numbers especially looked realistic. Sometimes I wish we had a 'dpreview'-style page for routers and switches, especially now with a dozen or more vendors selling the 'same' Trident+ switch; differentiating them is hard.

--
++ytti
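The sanity check here is just unit conversion: a buffer's worst-case drain time is its size in bits divided by the line rate. A minimal Python sketch, using the EX8200-8XS figures quoted above:

    def buffer_ms(buffer_bytes, line_rate_bps):
        # Worst-case drain time of a full buffer, in milliseconds.
        return buffer_bytes * 8 / line_rate_bps * 1000

    # Juniper's 512 figure read as megabytes vs. megabits, at 10GE:
    print(buffer_ms(512e6, 10e9))      # 512 MB -> ~410 ms
    print(buffer_ms(512e6 / 8, 10e9))  # 512 Mb -> ~51 ms

Only the megabit reading is consistent with the PDF's own "50-100 ms" claim, which supports the suspicion above.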
On Fri, Jan 27, 2012 at 5:55 PM, Saku Ytti <saku@ytti.fi> wrote:
On (2012-01-27 17:35 +0100), bas wrote:

But generally a nice list; the 10GE fixed-config numbers especially looked realistic. Sometimes I wish we had a 'dpreview'-style page for routers and switches, especially now with a dozen or more vendors selling the 'same' Trident+ switch; differentiating them is hard.
But do you generally agree that "the market" has a requirement for a deep-buffer TOR switch? Or am I crazy for thinking that my customers need such a solution? Bas
In a message written on Fri, Jan 27, 2012 at 10:40:03PM +0100, bas wrote:
But do you generally agree that "the market" has a requirement for a deep-buffer TOR switch?
Or am I crazy for thinking that my customers need such a solution?
You're crazy. :)

You need to google "bufferbloat". While the aim there has been more at (SOHO) routers that have absurd (multi-second) buffers, the concepts at play apply here as well.

Let's say you have a VOIP application with 250ms of jitter tolerance, and you're going 80ms across country. You then add in a switch on one end that has 300ms of buffer. Ooops, you go way over, but only from time to time when the switch buffer is full, getting 300+80ms of latency for a few packets.

Dropped packets are a _GOOD_ thing. If your ethernet switch can't get the packet out another port in ~1-2ms it should drop it. The output port is congested, and congestion is what tells the sender to back off. If you buffer the packets instead you get congestion collapse, which is far worse for throughput in the end, and in particular has severely detrimental effects on the others on the LAN, not just the box filling the buffers. A network dropping packets is healthy; it tells the upstream boxes to throttle to the appropriate speeds with packet loss, which is how TCP operates.

I can't tell you how many times I've seen network engineers tell me "no matter how big I make the buffers, performance gets worse and worse". Well, duh: you're just introducing more and more latency into your network and making TCP backoff fail, rather than work properly. I go in and slash their 50-100 packet buffers down to 5 and magically the network performs great, even when full.

Now, how much buffer do you need? One packet is the minimum. If you can't buffer one packet it becomes hard to reach 100% utilization on a link. Anyone who's tried with a pure cut-through switch can tell you it tops out around 90% (with multiple senders to a single egress). Amazingly, one packet of buffer almost entirely fixes the problem. When I can manually set the buffers, I generally go for 1ms of buffer on high-speed (e.g. 10GE) links, and might increase that to as much as 15ms on extremely low-speed links, like sub-T1.

Remember, your RTT will vary (jitter) +- the sum of all buffers on all hops along the path. A 10-hop path with 15ms per hop could see 150ms of jitter if all links swing between full and not full!

Buffers in most network gear are bad; don't do it.

--
Leo Bicknell - bicknell@ufp.org - CCIE 3440 PGP keys at http://www.ufp.org/~bicknell/
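Leo's worst-case jitter figure is just the sum of per-hop buffer drain times; a minimal sketch using his example values (assumed, not measured):

    # Each hop contributes between 0 (empty buffer) and its full
    # drain time, so worst-case jitter is the sum of buffer depths.
    hops = 10
    buffer_ms_per_hop = 15  # Leo's upper bound for low-speed links
    print(hops * buffer_ms_per_hop)  # 150 ms on top of the base RTT

The VOIP example works the same way: an 80ms path plus a single 300ms buffer occasionally yields 380ms, blowing a 250ms jitter budget.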
Hi, On Fri, Jan 27, 2012 at 10:52 PM, Leo Bicknell <bicknell@ufp.org> wrote:
In a message written on Fri, Jan 27, 2012 at 10:40:03PM +0100, bas wrote:
But do you generally agree that "the market" has a requirement for a deep-buffer TOR switch?
Or am I crazy for thinking that my customers need such a solution?
You're crazy. :)
You need to google "bufferbloat". While the aim there has been more at (SOHO) routers that have absurd (multi-second) buffers, the concepts at play apply here as well.
While your reasoning holds true, it does not explain why the expensive chassis solution (good) makes my customers happy, while the cheaper TOR solution makes my customers unhappy.....

Bufferbloat does not matter to them, as jitter and latency do not matter. As long as the TCP window is not reset, their total bits/sec keeps increasing.

If deep buffers were bad, I would expect high-end chassis solutions not to offer them either. But the market offers expensive deep-buffer chassis solutions and cheap (per 10GE) TOR solutions. IMHO there is no clear reasoning why the expensive solution is not offered in a 1U box.

My customers want to funnel 10 to 24 x 10GE into 1 or 2 10GE uplinks; to do this they need some buffers....

Bas
In a message written on Fri, Jan 27, 2012 at 11:30:14PM +0100, bas wrote:
While your reasoning holds true, it does not explain why the expensive chassis solution (good) makes my customers happy, while the cheaper TOR solution makes my customers unhappy.....
Bufferbloat does not matter to them, as jitter and latency do not matter. As long as the TCP window is not reset, their total bits/sec keeps increasing.
I obviously don't know your application. The bufferbloat problem exists for 99.99% of the standard applications in the world. There are, however, a few corner cases. For instance, if you want to move a _single_ TCP stream at more than 1Gbps you need deep buffers. Dropping a single packet slows throughput too much due to a slow-start event. For most of the world with hundreds or thousands of TCP streams across a single port, such problems never occur.
If deep buffers were bad, I would expect high-end chassis solutions not to offer them either. But the market offers expensive deep-buffer chassis solutions and cheap (per 10GE) TOR solutions.
The margin on a top-of-rack switch is very low. 48-port gige boxes with 10GE uplinks are basically commodities, with plenty of competition. Saving $100 on the bill of materials by cutting out some buffer makes the box more competitive when it's at a $2k price point.

In contrast, large modular chassis have a much higher margin. They are designed with great flexibility, to take things like firewall modules and SSL accelerator cards. There are configs where you want some (not much) buffer due to these active appliances in the chassis, plus it is easier to hide an extra $100 of RAM in a $100k box.

Also, as was pointed out to me privately, it is important to look at adaptive queue management features. The most famous is WRED, but there are other choices. Having a queue management solution on your routers and switches that works in concert with the congestion control mechanism used by the end stations always results in better goodput. Many of the low-end switches have limited or no AQM choices, while the higher-end switches with fancier ASICs can default to something like WRED. Be sure it is the deeper buffers that are making the difference, and not simply some queue management.

--
Leo Bicknell - bicknell@ufp.org - CCIE 3440 PGP keys at http://www.ufp.org/~bicknell/
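Leo's single-stream exception above comes down to the bandwidth-delay product: to hold one classic TCP flow at line rate through a loss/recovery cycle, the bottleneck needs roughly rate x RTT of buffer. A sketch with assumed example values:

    def bdp_bytes(rate_bps, rtt_s):
        # Bandwidth-delay product: the buffer one long-lived TCP
        # flow needs at the bottleneck to stay at full rate.
        return rate_bps * rtt_s / 8

    # A single 1 Gbps stream across an assumed 80 ms path:
    print(bdp_bytes(1e9, 0.080))  # 10,000,000 bytes = 10 MB

That 10MB dwarfs the ~100-200KB per port of the TOR switches listed at the top of the thread, which is why such flows suffer on them.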
Hi,
The margin on a top-of-rack switch is very low. 48-port gige boxes with 10GE uplinks are basically commodities, with plenty of competition. Saving $100 on the bill of materials by cutting out some buffer makes the box more competitive when it's at a $2k price point.
The 10GE TOR switches I listed earlier run from $20K to $100K list, so actual purchase cost for us would be $10K to $30K. $500 for some (S)(Q)(bla)RAM shouldn't hold a vendor back from releasing a bitchin' switch....

Again, this argument does not explain why there are 1GE aggregation switches with deep buffers..
Also, as was pointed out to me privately, it is important to look at adaptive queue management features. The most famous is WRED, but there are other choices. Having a queue management solution on your routers and switches that works in concert with the congestion control mechanism used by the end stations always results in better goodput. Many of the low-end switches have limited or no AQM choices, while the higher-end switches with fancier ASICs can default to something like WRED. Be sure it is the deeper buffers that are making the difference, and not simply some queue management.
All true... Still no reason not to offer a deep-buffer TOR...
Buffers in most network gear are bad; don't do it.
+1

I'm amazed at how many will spend money on switches with more buffering but won't take steps to ease the congestion. Part of the problem is trying to convince non-technical people that packet loss in and of itself doesn't have to be a bad thing, that it allows applications to adapt to network conditions. They can use tools to see packet loss, and that gives them something to complain about. They don't know how to interpret jitter or understand what impact it has on their applications. They just know that they can run some packet blaster, see a packet dropped, and want that to go away, so we end up in "every packet is precious" mode.

They would rather have a download that starts and stops over and over than one that progresses smoothly from start to finish, and trying to explain that performance is "bursty" because nobody will allow a packet to be dropped sails right over their heads. They'll accept crappy performance with no packet loss before they will accept better overall performance with an occasional lost packet.

If an application is truly intolerant of packet loss, then you need to address the congestion, not get bigger buffers.
While I agree _again_!!!!!

It does not explain why TOR boxes have little buffers and chassis boxes have many.....

On Fri, Jan 27, 2012 at 11:36 PM, George Bonser <gbonser@seven.com> wrote:
Buffers in most network gear are bad; don't do it.
+1
I'm amazed at how many will spend money on switches with more buffering but won't take steps to ease the congestion. Part of the problem is trying to convince non-technical people that packet loss in and of itself doesn't have to be a bad thing, that it allows applications to adapt to network conditions. They can use tools to see packet loss, and that gives them something to complain about. They don't know how to interpret jitter or understand what impact it has on their applications. They just know that they can run some packet blaster, see a packet dropped, and want that to go away, so we end up in "every packet is precious" mode.
They would rather have a download that starts and stops over and over than one that progresses smoothly from start to finish, and trying to explain that performance is "bursty" because nobody will allow a packet to be dropped sails right over their heads.
They'll accept crappy performance with no packet loss before they will accept better overall performance with an occasional lost packet.
If an application is truly intolerant of packet loss, then you need to address the congestion, not get bigger buffers.
-----Original Message----- From: bas Sent: Friday, January 27, 2012 2:54 PM To: George Bonser Subject: Re: 10GE TOR port buffers (was Re: 10G switch recommendation)
While I agree _again_!!!!!
It does not explain why TOR boxes have little buffers and chassis boxes have many.....
Because that is what customers think they want so that is what they sell. Customers don't realize that the added buffers are killing performance. I have had network sales reps tell me "you want this switch over here, it has bigger buffers" when that is exactly the opposite of what I want unless I am sending a bunch of UDP through very brief microbursts. If you are sending TCP streams, what you want is less buffering. Spend the extra money on more bandwidth to relieve the congestion.

Going to 4 aggregated 10G uplinks instead of 2 might get you a much better performance boost than increasing buffers. But it really depends on the end-to-end application.
On Sat, Jan 28, 2012 at 12:01 AM, George Bonser <gbonser@seven.com> wrote:
Going to 4 aggregated 10G uplinks instead of 2 might get you a much better performance boost than increasing buffers. But it really depends on the end-to-end application.
Also, these TOR boxes connect to my (more expensive ASR9K and MX) boxes, so from a CAPEX standpoint I simply do not want to give them more ports than required.
On 1/27/12 15:01 , George Bonser wrote:
-----Original Message----- From: bas Sent: Friday, January 27, 2012 2:54 PM To: George Bonser Subject: Re: 10GE TOR port buffers (was Re: 10G switch recommendation)
While I agree _again_!!!!!
It does not explain why TOR boxes have little buffers and chassis boxes have many.....
Because that is what customers think they want so that is what they sell. Customers don't realize that the added buffers are killing performance.
It is possible, trivial in fact, to buy a switch that has a buffer too small to provide stable performance at some high fraction of its uplink utilization. You can differentiate between the enterprise/SOHO 1-gig switch you bought to support your IP phones and wireless APs and the datacenter-spec 1U TOR along these lines. It is also possible, and in fact easy, to have enough buffer to accumulate latency in places where you should be discarding packets earlier. I'd rather not be in either situation, but in the latter I can police my way out of it.
I have had network sales reps tell me "you want this switch over here, it has bigger buffers" when that is exactly the opposite of what I want unless I am sending a bunch of UDP through very brief microbursts. If you are sending TCP streams, what you want is less buffering. Spend the extra money on more bandwidth to relieve the congestion.
Going to 4 aggregated 10G uplinks instead of 2 might get you a much better performance boost than increasing buffers. But it really depends on the end-to-end application.
It is also possible, and in fact easy, to have enough buffer to accumulate latency in places where you should be discarding packets earlier.
I'd rather not be in either situation, but in the latter I can police my way out of it.
That is why I added the "it depends on the end-to-end application" caveat.
On 1/27/12 14:53 , bas wrote:
While I agree _again_!!!!!
It does not explain why TOR boxes have little buffers and chassis boxes have many.....
You need proportionally more buffer when you need to drain 16 x 10G into 4 x 10G than when you're trying to drain 10Gb/s into 2 x 1Gb/s.

There's a big incentive, BOM-wise, not to use off-chip DRAM buffer in a merchant-silicon single-chip switch versus something more complex.
Hi All, On Sat, Jan 28, 2012 at 12:32 AM, Joel jaeggli <joelja@bogus.com> wrote:
On 1/27/12 14:53 , bas wrote:
While I agree _again_!!!!!
It does not explain why TOR boxes have little buffers and chassis boxes have many.....
You need proportionally more buffer when you need to drain 16 x 10G into 4 x 10G than when you're trying to drain 10Gb/s into 2 x 1Gb/s.
There's a big incentive, BOM-wise, not to use off-chip DRAM buffer in a merchant-silicon single-chip switch versus something more complex.
I'm almost ready to throw in the towel and declare myself a loony.. I can imagine at least one vendor ignoring the extra BOM capex and simply trying to please #$%^#@! like me.

cisco-nsp has been full of threads about the appalling microburst performance of the 6500 for years.. One would think a vendor would jump at a competitive edge like this...
On 1/27/12 15:40 , bas wrote:
I'm almost ready to throw in the towel and declare myself a loony.. I can imagine at least one vendor ignoring the extra BOM capex and simply trying to please #$%^#@! like me.
cisco-nsp has been full of threads about the appalling microburst performance of the 6500 for years..
And people who care have been using something other than a c6500 for years. It's a 15-year-old architecture, and it's had a pretty good run, but it's 2012. An EX8200 has 512MB per port on non-oversubscribed 10-gig ports and 42MB per port on 1-gig ports; that's a lot of RAM.

To take this back to actual TORs: a Broadcom 56840-based switch has something in the neighborhood of 9MB available for packet buffer on chip; if you need more, then more DRAMs are in order. While the TOR can cut-through switch, the chassis can't. The TOR is also probably not built with off-chip CAM (though there are examples of off-chip CAM as well) for much the same reason.
One would think a vendor would jump at a competitive edge like this...
for those who say bufferbloat is a problem, do you have wred enabled on backbone or customer links? randy
In a message written on Sat, Jan 28, 2012 at 10:06:20AM +0900, Randy Bush wrote:
for those who say bufferbloat is a problem, do you have wred enabled on backbone or customer links?
For *most backbone networks* it is a no-op on the backbone. To be more precise, if the backbone is at least 10x, and preferably more like 50x, faster than the largest single TCP flow from any customer, it will be nearly impossible to measure the performance difference between a short FIFO queue and a WRED queue.

To the customer, absolutely, whenever possible, which generally means whenever the hardware supports it. Ideally with the queue length tuned to match the link speed of the customer port. The slower the customer port, the more critical the tuning.

--
Leo Bicknell - bicknell@ufp.org - CCIE 3440 PGP keys at http://www.ufp.org/~bicknell/
for those who say bufferbloat is a problem, do you have wred enabled on backbone or customer links?
For *most backbone networks* it is a no-op on the backbone. To be more precise, if the backbone is at least 10x, and preferably more like 50x, faster than the largest single TCP flow from any customer, it will be nearly impossible to measure the performance difference between a short FIFO queue and a WRED queue.
when a line card is designed to buffer the b*d of a trans-pac 40g, the oddities on an intra-pop link have been observed to spike to multiple seconds.
To the customer, absolutely, whenever possible, which generally means whenever the hardware supports it. Ideally with the queue length tuned to match the link speed of the customer port. The slower the customer port, the more critical the tuning.
so, do you have wred enabled anywhere? who actually has it enabled? (embarrassed to say, but to set an honest example, i do not believe iij does) randy
In a message written on Sat, Jan 28, 2012 at 10:31:20AM +0900, Randy Bush wrote:
when a line card is designed to buffer the b*d of a trans-pac 40g, the oddities on an intra-pop link have been observed to spike to multiple seconds.
Please turn that buffer down. It's bad enough to take a 100ms hop across the Pacific; it's far worse when there is an extra 0-100ms of buffer on top. :( Unless that 40G has something like 4 x 10Gbps TCP flows on it, you don't need b*d of buffer. I bet many of your other problems go away. 10ms of buffer would be a good number.
so, do you have wred enabled anywhere? who actually has it enabled?
(embarrassed to say, but to set an honest example, i do not believe iij does)
My current employment offers few places where it is appropriate. However, cribbing from a previous job where I rolled it out network-wide:

policy-map atm-queueing-out
 class class-default
  fair-queue
  random-detect
  random-detect precedence 0 10 40 10
  random-detect precedence 1 13 40 10
  random-detect precedence 2 16 40 10
  random-detect precedence 3 19 40 10
  random-detect precedence 4 22 40 10
  random-detect precedence 5 25 40 10
  random-detect precedence 6 28 40 10
  random-detect precedence 7 31 40 10

int atm1/0.1
 pvc 1/105
  vbr-nrt 6000 5000 600
  tx-ring-limit 4
  service-policy output atm-queueing-out

Those packet thresholds were computed as the best balance for 6-20Mbps PVCs on an ATM interface. Also notice that the hardware tx-ring-limit had to be reduced in order to make it effective; there is a hardware buffer below the software WRED that is way too big on the platforms in question (7206VXRs).

Here's one to wrap your head around. You have an ATM OC-3 with 40 PVCs on it. Each PVC has a WRED config allowing up to 40 packets to be buffered. Some genius in security fires off a network scanning tool across all 40 sites. Yes, you now have 40*40, or 1600, packets of buffer on your single physical port. :( If you work with Frame or ATM, or even dot1q VLANs, you have to be careful of per-subinterface buffering. It can quickly get absurd.

--
Leo Bicknell - bicknell@ufp.org - CCIE 3440 PGP keys at http://www.ufp.org/~bicknell/
when a line card is designed to buffer the b*d of a trans-pac 40g, the oddities on an intra-pop link have been observed to spike to multiple seconds.

Please turn that buffer down.
not my router. research probes seeing fun anomalies around the global network.
cribbing from a previous job where I rolled it out network-wide:
policy-map atm-queueing-out
 class class-default
  fair-queue
  random-detect
  random-detect precedence 0 10 40 10
  random-detect precedence 1 13 40 10
  random-detect precedence 2 16 40 10
  random-detect precedence 3 19 40 10
  random-detect precedence 4 22 40 10
  random-detect precedence 5 25 40 10
  random-detect precedence 6 28 40 10
  random-detect precedence 7 31 40 10
int atm1/0.1
 pvc 1/105
  vbr-nrt 6000 5000 600
  tx-ring-limit 4
  service-policy output atm-queueing-out
Those packet thresholds were computed as the best balance for 6-20Mbps PVCs on an ATM interface.
while i hope few folk other than telephants still have atm <g>, thanks for posting running code. one problem is that we do not have good tools to look at a link and suggest parms. how did you derive those? randy
In a message written on Sat, Jan 28, 2012 at 11:02:14AM +0900, Randy Bush wrote:
one problem is that we do not have good tools to look at a link and suggest parms. how did you derive those?
It's actually simple math; it just can get moderately complex. Let's say you have a 10Mbps ethernet interface and you want to set the queue size (in packets). 10Mbps is ~1,250,000 bytes/sec. Now I pick an arbitrary value; this is where experience comes in. For this example I'll say I want no more than 5ms of queuing latency:

(5 ms / 1000 ms/sec) * 1,250,000 bytes/sec = 6,250 bytes

I then look at my MTU; we'll go with 1500 here. 6250 / 1500 = 4.16 packets. So queueing around 4 full-sized packets generates 0-5ms of jitter on a 10Mbps ethernet, worst case.

How many ms is good? Well, that depends, a lot. However, I suspect most people here have seen enough pings that they have some idea what is good and what isn't. From there you have to look at whether there is a hardware ring buffer under the software QoS (not on most interfaces, but yes on some), and then whether the buffer is per-VC (subinterface, whatever) on an interface with multiple subinterfaces. This is as much art as science. My rules of thumb:

- High-speed backbone interfaces: 1-3ms of buffer.
- Medium- to high-speed links inside of a single pop/site: 2-5ms of buffer.
- Low-speed access/edge: 5-20ms of buffer.

I have rarely seen an application benefit from more than 20ms of buffer. Remember, this is per hop. If you take a 20-hop traceroute and each hop has 20ms of buffer, you could be waiting 400ms if all the buffers were full! That's even if all 20 hops are in the same building, so the RTT is 1ms. Imagine how crappy a 1ms RTT would be with random 4/10ths-of-a-second pauses!

Now, here's where it gets non-intuitive. Reducing the buffers will _increase_ packet drops, which will make your customers _happier_. It will also generally smooth out sawtooth patterns (caused by congestion-collapse synchronization: everyone fills the buffer at the same time, backs off at the same time, etc.). So your links may go from spiky between 90-100% to flatlined at 100%, but your customers will be happier.

Run the math the other way to see how many ms your current buffer size allows the router to hold.

--
Leo Bicknell - bicknell@ufp.org - CCIE 3440 PGP keys at http://www.ufp.org/~bicknell/
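Leo's arithmetic generalizes directly; a small Python sketch reproducing his 10Mbps example (1500-byte MTU assumed):

    def queue_packets(rate_bps, target_delay_ms, mtu_bytes=1500):
        # Queue depth, in full-sized packets, that bounds worst-case
        # queueing delay at target_delay_ms on a link of rate_bps.
        bytes_per_sec = rate_bps / 8
        queue_bytes = bytes_per_sec * target_delay_ms / 1000
        return queue_bytes / mtu_bytes

    print(queue_packets(10e6, 5))  # ~4.2 packets: 10 Mbps at 5 ms
    print(queue_packets(10e9, 1))  # ~833 packets: 10GE at the 1 ms rule

Run the other way, a 100-packet FIFO on a 10Mbps port holds 100 * 1500 * 8 / 10e6 = 120ms, well past the 20ms ceiling above.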
In a message written on Sat, Jan 28, 2012 at 10:31:20AM +0900, Randy Bush wrote:
(embarrassed to say, but to set an honest example, i do not believe iij does)
I also want to take this opportunity to say there are some cool new features (that I have not had a chance to deploy myself) that may have been missed if queueing wasn't your day job for the last few years.

"QoS: Time-Based Thresholds for WRED and Queue Limit for the Cisco 12000 Series Router" http://www.cisco.com/en/US/docs/ios/12_0s/feature/guide/12stbwr.html

Don't want to do math on how big the queue should be? Configure by ms:

Router> enable
Router# configure terminal
Router(config)# policy-map policy1
Router(config-pmap)# class class-default
Router(config-pmap-c)# bandwidth percent 80
Router(config-pmap-c)# random-detect
Router(config-pmap-c)# random-detect precedence 2 4 ms 8 ms
Router(config-pmap-c)# exit
Router(config-pmap)# exit
Router(config)# interface serial8/0/0:0.1000
Router(config-subif)# service-policy output policy1
Router(config-subif)# end

That's a 4ms to 8ms buffer! Handy, nice!

Another concept brought from Frame Relay to IP is ECN, explicit congestion notification. http://www.cisco.com/en/US/docs/ios-xml/ios/qos_conavd/configuration/15-0m/q...

Router(config)# policy-map pol1
Router(config-pmap)# class class-default
Router(config-pmap-c)# bandwidth per 70
Router(config-pmap-c)# random-detect
Router(config-pmap-c)# random-detect ecn

Requires other bits in the network to be ECN aware, but if they are, good stuff.

--
Leo Bicknell - bicknell@ufp.org - CCIE 3440 PGP keys at http://www.ufp.org/~bicknell/
Router(config)# policy-map pol1
Router(config-pmap)# class class-default
Router(config-pmap-c)# bandwidth per 70
Router(config-pmap-c)# random-detect
Router(config-pmap-c)# random-detect ecn
Requires other bits in the network to be ECN aware, but if they are, good stuff.
+1 There is no excuse these days for stuff not to be ECN aware. That GREATLY mitigates things as it makes hosts aware pretty much immediately that there is congestion and they don't have to wait for a lost packet to time out. I brought it up to a Brocade engineer once asking for the option to set ECN rather than drop the packet and he said "nobody uses it". I told him nobody uses it because you don't have the feature available. How can anyone use it if you don't have the feature?
See below.

Jared Mauch

On Jan 27, 2012, at 9:13 PM, George Bonser <gbonser@seven.com> wrote:
Router(config)# policy-map pol1
Router(config-pmap)# class class-default
Router(config-pmap-c)# bandwidth per 70
Router(config-pmap-c)# random-detect
Router(config-pmap-c)# random-detect ecn
Requires other bits in the network to be ECN aware, but if they are, good stuff.
+1
There is no excuse these days for stuff not to be ECN aware. That GREATLY mitigates things as it makes hosts aware pretty much immediately that there is congestion and they don't have to wait for a lost packet to time out. I brought it up to a Brocade engineer once asking for the option to set ECN rather than drop the packet and he said "nobody uses it". I told him nobody uses it because you don't have the feature available. How can anyone use it if you don't have the feature?
This sounds a lot like most people's IPv6 rationale as well.

I'm still feeling some scars from the last time ECN was enabled on my hosts. Many firewalls would eat packets with ECN enabled.
This sounds a lot like most people's IPv6 rationale as well.
I'm still feeling some scars from the last time ECN was enabled on my hosts. Many firewalls would eat packets with ECN enabled.
That was, I believe, nearly 10 years ago, was it not? There has been considerable testing of ECN by the bufferbloat folks, and I have done some myself, and I haven't noticed anyone blocking ECN lately. There might still be a few corner cases out there, but none that I have noticed. What you will find, according to what I have read from others doing testing, is that some networks will clobber the ECN bits (reset them) but pass the traffic. These days, at worst you would not be able to negotiate ECN, but the traffic wouldn't be blocked. Anyone clearing the entire DSCP byte on traffic entering their network, for example, would clobber ECN but not block the traffic. The key thing here is for people NOT to clear ECN bits on traffic flowing through their network, to allow it to be negotiated end to end by the hosts involved in the transaction.
Additionally, ECN is just between hosts, end to end. If a flow is not ECN-enabled (neither of the ECN bits set), then the routing gear does what it always has done: drop a packet. Only if one of the ECN bits is already set (meaning the flow is ECN-aware, end to end) does the router set the other bit to signal congestion. So enabling this on routing gear would have no impact on user traffic except to allow a better experience for ECN-aware flows. In other words, allowing this option in the network gear would have no impact on non-ECN flows and would only help flows that negotiated ECN end-to-end at connection setup. Those flows would already be known to be trouble-free for ECN, else they wouldn't have been able to negotiate it.
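George's description of the router-side behaviour matches RFC 3168; a minimal sketch of the marking decision (the Packet type here is hypothetical, for illustration only):

    from dataclasses import dataclass

    @dataclass
    class Packet:
        ecn: str  # "Not-ECT", "ECT(0)", "ECT(1)", or "CE"

    def on_congestion(pkt):
        # AQM decision per RFC 3168: mark instead of drop when the
        # flow negotiated ECN end to end (an ECT codepoint is set).
        if pkt.ecn in ("ECT(0)", "ECT(1)"):
            pkt.ecn = "CE"  # set Congestion Experienced; deliver it
            return pkt
        return None  # Not-ECT flow: drop, as routers always have

Non-ECT traffic takes exactly the old path, which is why enabling the feature is safe for flows that never negotiated ECN.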
On 01/27/2012 08:31 PM, Randy Bush wrote:
for those who say bufferbloat is a problem, do you have wred enabled on backbone or customer links?

For *most backbone networks* it is a no-op on the backbone. To be more precise, if the backbone is at least 10x, and preferably more like 50x, faster than the largest single TCP flow from any customer, it will be nearly impossible to measure the performance difference between a short FIFO queue and a WRED queue.

when a line card is designed to buffer the b*d of a trans-pac 40g, the oddities on an intra-pop link have been observed to spike to multiple seconds.
See the CACM article "Bufferbloat: Dark Buffers in the Internet" in the January CACM, by Kathy Nichols and myself, or online at: http://cacm.acm.org/magazines/2012/1/144810-bufferbloat/fulltext

The section entitled "Revisiting the Bandwidth Delay Product" is germane to the discussion here. Fundamentally, the b*d "rule" isn't really very useful under most circumstances, though it helps to understand what it tells you, and it may be a useful upper bound under some circumstances, though very seldom for a network operator such as found on the NANOG list. The fundamental problem is most people don't know either the bandwidth or the delay. The BDP is what you need for a single long-lived TCP flow; as soon as you have multiple flows, it over-estimates the buffering needed, even if you know the bandwidth and the delay...

And this work in particular for routers (or potentially switches) is important:

Appenzeller, G., Keslassy, I., McKeown, N. 2004. Sizing router buffers. ACM SIGCOMM, Portland, OR, (August). http://yuba.stanford.edu/~nickm/papers/sigcomm2004-extended.pdf

Ultimately, we need an AQM algorithm that works so well and requires no configuration that we can just always have it on and forget about it; (W)RED and friends aren't it, but they're the best we've got for the moment that you can actually use. It's hopeless to try to use it in the home, where we have very highly variable bandwidth.

In backbone networks, the biggest reason I can see for enabling (W)RED may be for robustness' sake: if you have a link and it congests, you can quickly be in a world of hurt. I wonder what happened in Japan after the earthquake.... And it should always be on on congested links you know about, of course.

There is hope on this front, but it's early days yet.

- Jim
On 01/28/2012 10:28 AM, Jim Gettys wrote:
Also in particular, see section 3.1 in the Sizing Router Buffers paper: if you don't have AQM enabled *and* your router/switch is a bottleneck link, multiple long-lived TCP flows will synchronise and you again need a full BDP-sized buffer; this is why random drop in AQM algorithms is important. So your mileage will vary. BDP really isn't useful most of the time, other than for thinking about the problem in the first place.

- Jim
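The headline result of the Appenzeller et al. paper is compact enough to state in code: with N long-lived, desynchronized flows, a buffer of roughly BDP/sqrt(N) suffices. A sketch with assumed example values:

    from math import sqrt

    def router_buffer_bytes(rate_bps, rtt_s, n_flows):
        # Sizing Router Buffers (SIGCOMM 2004): with many
        # desynchronized flows, BDP/sqrt(N) is enough.
        bdp = rate_bps * rtt_s / 8
        return bdp / sqrt(n_flows)

    # Assumed: 10 Gbps link, 100 ms average RTT, 10,000 flows:
    print(router_buffer_bytes(10e9, 0.1, 10_000))  # ~1.25 MB vs. 125 MB BDP

This cuts both ways, as Jim notes: with many flows the full BDP over-estimates by 100x here, yet with synchronized flows (no AQM) you are back to needing all of it.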
In a message written on Fri, Jan 27, 2012 at 04:00:36PM -0800, Joel jaeggli wrote:
And people who care have been using something other than a c6500 for years. It's a 15-year-old architecture, and it's had a pretty good run, but it's 2012.
One of the frustrating things, which the c6500 embodies best, is that the chassis has had many generations of linecards. It came out in 1999, running CatOS, with a 32Gbps shared bus. It exists now as an IOS box with a 720Gbps fabric, running distributed switching. While you can call both a 6500, they share little more than some sheet metal, fans, and copper traces on the backplane.

Wisdom learned running CatOS on first-generation cards flat out does not apply to current-generation cards. And woe be the admin who mixes and matches generations of cards; there are a million different configurations and pitfalls. Cisco is not the only vendor, and the 6500 is not the only product with this problem. It makes conversation extremely difficult, though: you can't say "a 6500 has xyz property" without detailing a lot more about the config of the box.

--
Leo Bicknell - bicknell@ufp.org - CCIE 3440 PGP keys at http://www.ufp.org/~bicknell/
On 2012-01-28 01:00, Joel jaeggli wrote:
cisco-nsp has been full of threads about the appalling microburst performance of the 6500 for years..

And people who care have been using something other than a c6500 for years. It's a 15-year-old architecture, and it's had a pretty good run, but it's 2012. An EX8200 has 512MB per port on non-oversubscribed 10-gig ports and 42MB per port on 1-gig ports; that's a lot of RAM.
The 6500 has up to 256MB for non-oversubscribed 10GE ports. People complaining about microbursts tend to use the cheapest 6704 linecard, and 'microbursts' are a problem seen across most of the products that don't even try to have 1/12th of the 6500's history. Everyone has their own problems, and as people already said, not understanding how properly sized buffers influence the way TCP traffic behaves can do more harm than good.

--
"There's no sense in being precise when | Łukasz Bromirski
 you don't know what you're talking     | jid:lbromirski@jabber.org
 about." John von Neumann               | http://lukasz.bromirski.net
On (2012-01-27 22:40 +0100), bas wrote:
But do you generally agree that "the market" has a requirement for a deep-buffer TOR switch?
Or am I crazy for thinking that my customers need such a solution?
No, you're not crazy. If your core runs at a higher rate than your customer ports, then you need, at minimum, buffering equal to the serialization delay difference. If the core is 10G and access is 100M, you need a buffer of at least about 100 packets to handle a single 10G burst coming in, without any extra buffering on top. Now if you add QoS on top of this, you probably need 100 per each class you are going to support.

And if the switch does support QoS but the operator configures only BE and does not limit the BE queue size, the operator will see bufferbloat and think it's a clueless vendor dropping expensive memory in there for the lulz, while it's just a misconfigured box.

When it comes to these Trident+ 64x10GE/48x10GE+4x40G boxes, the serialization delay difference between interfaces is minimal, and so is the buffering demand.

--
++ytti
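A rough sketch of Saku's rate-mismatch arithmetic (1500-byte MTU assumed; the ~100-packet figure follows from the 10G:100M rate ratio):

    def rate_mismatch_buffer(ingress_bps, egress_bps, burst_pkts, mtu=1500):
        # Bytes queued at the egress port when a back-to-back burst
        # arrives at the ingress rate and drains at the egress rate.
        ratio = ingress_bps / egress_bps
        # Only 1/ratio of each arriving packet drains during its
        # arrival time, so nearly the whole burst sits in the queue.
        return int(burst_pkts * mtu * (1 - 1 / ratio))

    # A 100-packet burst from a 10G core into a 100M access port:
    print(rate_mismatch_buffer(10e9, 100e6, 100))  # 148500 -> ~150 KB

That ~150KB is the figure Masataka Ohta works from below; ten such classes gives his 1.5MB.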
Saku Ytti wrote:
No, you're not crazy. If your core runs at a higher rate than your customer ports, then you need, at minimum, buffering equal to the serialization delay difference. If the core is 10G and access is 100M, you need a buffer of at least about 100 packets to handle a single 10G burst coming in, without any extra buffering on top.
The required amount of memory is merely 150KB.
Now if you add QoS on top of this, you probably need 100 per each class you are going to support.
If you have 10 classes, it is still 1.5MB.
And if the switch does support QoS but the operator configures only BE and does not limit the BE queue size, the operator will see bufferbloat,
1.5MB @ 10Gbps is only 1.2ms, which is not buffer bloat. Masataka Ohta
On (2012-01-28 21:06 +0900), Masataka Ohta wrote:
The required amount of memory is merely 150KB.
Assuming we don't support jumbo frames and the switch cannot queue sub-packet sizes. (Normally they can't, but the VXR at least has a 512B cell concept, so its tx-ring is packet-size agnostic; but that's just the PA-A3.)
If you have 10 classes, it is still 1.5MB.
Yup, that's not bad at all for a 100M port; in fact, 10 classes would be quite a lot.
And if the switch does support QoS but the operator configures only BE and does not limit the BE queue size, the operator will see bufferbloat,
1.5MB @ 10Gbps is only 1.2ms, which is not buffer bloat.
You can't buffer these at ingress or you risk head-of-line blocking; you must buffer them at the egress 100M port and drop at ingress if the egress buffer is full.

But I fully agree, it's not bufferbloat. A switch that supports very different traffic rates at ingress and egress (ingress could even be LACP, which further mandates larger buffers on egress), and that also needs to support QoS towards the customer, quickly reaches the amount of buffer some of these vendors are shipping. It becomes bufferbloat when an inexperienced operator allows all of that buffer to be used by a single class between matched ingress/egress rates.

--
++ytti
Saku Ytti wrote:
And if the switch does support QoS but the operator configures only BE and does not limit the BE queue size, the operator will see bufferbloat,
1.5MB @ 10Gbps is only 1.2ms, which is not buffer bloat.
You can't buffer these at ingress or you risk head-of-line blocking; you must buffer them at the egress 100M port and drop at ingress if the egress buffer is full.
1.5MB @ 100Mbps is 120ms, which is prohibitively lengthy even for BE.

The solution is to have a smaller number of classes. For QoS assurance, you only need two classes for infinitely many flows with different QoS requirements, if flows in the higher-priority class are policed against their reserved bandwidths.

Masataka Ohta
On (2012-01-28 21:53 +0900), Masataka Ohta wrote:
1.5MB @ 100Mbps is 120ms, which is prohibitively lengthy even for BE.
The solution is to have a smaller number of classes.
The solution is to define a max queue size per class, so a user with fewer queues configured will not consume all available buffer in the remaining queues. A JNPR MX is happy to buffer >4s on 10GE on QX interfaces. Some posts in this thread seem to imply the vendors don't know what they are doing, but in this case there is a good reason why there is potentially a lot of buffer space, and it's simply an operator mistake not to limit it if the application is just a single class on a single VLAN/untagged 10G interface.

--
++ytti
The HP 6600 is store-and-forward, not cut-through. The HP reps that I have dealt with seem to be pretty open to sharing architecture drawings of their stuff, so I bet you could probably get your hands on the same one that I have. Their NDA is a mutual disclosure, though, so that might make things tough depending on your organization's policies.

Tom

-----Original Message----- From: bas [mailto:kilobit@gmail.com] Sent: Friday, January 27, 2012 9:35 AM To: nanog Subject: 10GE TOR port buffers (was Re: 10G switch recommendation)

[original post quoted in full; snipped]
participants (11): bas, George Bonser, Jared Mauch, Jim Gettys, Joel jaeggli, Leo Bicknell, Masataka Ohta, Randy Bush, Saku Ytti, Tom Ammon, Łukasz Bromirski