I'm curious if the community would be willing to share their best-practices and/or recommendations and thoughts on how they handle situations where a customer buys X amount of bandwidth, but the physical link is capable of Y, where Y > X. (Yes, I speak of policy-maps, tx/rx-queues, etc.)

For example, it might be arguably common to aggregate customer links Layer 2, and then push them upstream to where they anchor Layer 3. That Layer 2 <-> Layer 3 span could be a couple of meters or several kilometers.

So, as I see it, my options are:

* Rate-limit at the Layer 2 switch for both customer ingress/egress,
* Rate-limit at the Layer 3 router upstream, i/e, or
* Some combination thereof? E.g.: Rate-limit my traffic towards the customer closer to the core, and rate-limit ingress closer to the edge?

I've done all three on some level in my travels, but in the past it's also been oftentimes vendor-centric which hindered a scalable or "templateable" solution. (Some things police in only one direction, or only well in one direction, etc.)

In case someone is interested in a tangible example, imagine an Arista switch and an ASR9k router. :)

Thoughts?
On Thu, 21 May 2020 at 22:11, Bryan Holloway <bryan@shout.net> wrote:
I've done all three on some level in my travels, but in the past it's also been oftentimes vendor-centric which hindered a scalable or "templateable" solution. (Some things police in only one direction, or only well in one direction, etc.)
I may misunderstand something here, but are you looking for vendor-agnostic solutions to replace your vendor-specific one? That is an unrealistic goal, and it's not clear to me why it would be important what spell incantation is needed at any given moment.

If you just need to 'rate-limit', and you don't need to discriminate traffic in any way, it's ezpz🍋sqz. The only thing you need to consider is your stepdown: is the ingress rate higher than the egress rate? If you have a speed stepdown, then you need to consider at which RTT you guarantee full rate on a single TCP session. If you police, you need to allow a burst which can take in the TCP window growth; if you use a shaper, you need to configure buffers which can ingest it.

Consider the sender is 10Gbps, the receiver is 1Gbps, and the RTT is 100ms. The window the sender may burst is 9Gbps*100ms/2 = 56.25MB. If the sender is more than 100ms away, you contractually don't guarantee the customer full rate on a single TCP session; you point the ticket to the product definition and close it.
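As a rough sketch of the same rule of thumb, parameterised (back-of-the-envelope arithmetic only, not a vendor formula):

# Burst a policer must allow (or a shaper must buffer) so a single TCP
# session can still reach the receiver's full rate at a given RTT.
# Rule of thumb from above: (sender_rate - receiver_rate) * RTT / 2, in bytes.

def stepdown_burst_bytes(sender_bps: float, receiver_bps: float, rtt_s: float) -> float:
    excess_bps = sender_bps - receiver_bps
    return excess_bps * rtt_s / 2 / 8   # bits -> bytes

if __name__ == "__main__":
    # 10Gbps sender, 1Gbps receiver, 100ms RTT -> 56.25 MB, as above.
    print(stepdown_burst_bytes(10e9, 1e9, 0.100) / 1e6, "MB")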
Now if you need to discriminate, things can get very complex. Particularly if you do QoS at the L3 aggregation at the L2 access physical rate, you need to understand very well what the policer is counting: the L1, L2, or L3 rate? And how to get it to count the L1 rate. Ideally you'd do all QoS at the congestion point, at ANET, and not on the subinterfaces. But often the access device is too dumb to do the QoS you need.

Let's take the complex example: you need to discriminate and do it all in CSCO, and ANET does just BE and is QoS-unaware. We have a 10GE interface in CSCO and a customer 1GE-connected at ANET. The configuration would be something like this:

interface TenGigE0/1/2/3/4.100
 service-policy output CUST:XYZ:PARENT account user-defined 28
 encapsulation dot1q 100
!
policy-map CUST:XYZ:PARENT
 class class-default
  service-policy CUST:XYZ:CHILD
  shape average 1 gbps
 !
 end-policy-map
!
RP/0/RSP1/CPU0:r15.labxtx01.us.bb#show run policy-map CUST:XYZ:CHILD
policy-map CUST:XYZ:CHILD
 class NC
  bandwidth percent 1
 !
 class AF
  bandwidth percent 20
 !
 class BE
  bandwidth percent 78
 !
 class LE
  bandwidth percent 1
 !
 class class-default
 !
 end-policy-map
!

Homework:
- what RED curve to use
- how much you should buffer in a given class at worst (in some cases the burst cannot be configured small enough, and you need to offer a lower-than-bought rate to be able to honor the QoS contract)
- what the right shaper burst is
- what to map to each class and how (tip: classify exclusively on the ingress interface by 'set qos-group', and on egress match classes exclusively on the qos-group; this methodology translates across platforms)

If your 'account user-defined' is wrong even by a byte, your in-contract traffic will be dropped by ANET, because CSCO is admitting
more than 1Gbps to the customer, who is physically limited to 1Gbps, forcing more than 1Gbps of ethL1 rate onto the access port; and ANET, being QoS-unaware, will drop LE with just the same probability as AF.
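To illustrate why the accounting matters, a quick sketch (assumed small-frame traffic; the 20 bytes of preamble/SFD plus inter-frame gap per frame is standard Ethernet, but what a given platform's shaper counts by default varies, hence the 'account' knob):

# Sketch: L2-counted rate vs. ethL1 rate on the wire.
L1_OVERHEAD = 8 + 12   # bytes per frame: preamble/SFD + inter-frame gap

def l1_rate_bps(frame_bytes: int, pps: float) -> float:
    # L2 frame (assumed to include FCS) plus per-frame L1 overhead
    return (frame_bytes + L1_OVERHEAD) * 8 * pps

def l2_rate_bps(frame_bytes: int, pps: float) -> float:
    return frame_bytes * 8 * pps

if __name__ == "__main__":
    frame = 100                      # assumed 100-byte frames
    pps = 1e9 / (frame * 8)          # packet rate that fills 1Gbps counted at L2
    print("L2 rate:", l2_rate_bps(frame, pps) / 1e9, "Gbps")   # 1.0
    print("L1 rate:", l1_rate_bps(frame, pps) / 1e9, "Gbps")   # 1.2
    # A shaper counting only L2 bytes admits ~1.2Gbps of ethL1 toward a
    # port that can physically carry 1Gbps; the QoS-unaware access
    # device then drops the excess blindly.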
Further complication: let's assume you are all-Tomahawk on the ASR9k. Let's assume TenGigE0/1/2/3/4 as a whole is pushing 6Gbps of traffic across all VLANs, everything is in-contract, and nothing is being dropped for any VLAN in any class. Now VLAN 200 gets a 20Gbps DDoS attack coming from a single backbone interface, i.e. we are offering that TenGig interface 26Gbps of traffic. What will happen is that all VLANs start dropping packets QoS-unaware: 12.5Gbps is being dropped by the ingress NPU, which is not aware of which VLAN the traffic is going to, nor of the QoS policy on the egress VLAN. So VLAN100 starts to see NC, AF, BE and LE drops, even though the offered rate in VLAN100 remains in-contract in all classes.

To mitigate this to a degree, on the backbone side of the ASR9k you need to set VoQ priority; you have 3 priorities. You could choose, for example, BE P2, NC+AF P1 and LE Pdefault. Then, if the attack traffic to VLAN200 is recognised and classified as LE, we will only see VLAN100 LE dropping (as well as every other VLAN's LE) instead of all the classes.

To wish that this would be vendor-agnostic is just not realistic, as there are very specific platform decisions which impact your QoS design.

To stress how critical the accounting is if you do QoS in the 'wrong' place: https://ytti.fi/after.png

In this picture BE is out of contract; AFnb, AFb and EF are all in-contract. However the customer sees loss in all classes, because the L3 is shaping at the L2 rate, not the L1 rate, so it's forcing the QoS-unaware access device to drop. The only thing fixed at 130 is the accounting parameter, which causes the L3 to reduce the rate it can send and to start dropping itself; and as it is QoS-aware, it can honor the contract, so all drops move to the out-of-contract class, BE.

-- ++ytti
From: Saku Ytti
Sent: Friday, May 22, 2020 7:52 AM
On Thu, 21 May 2020 at 22:11, Bryan Holloway <bryan@shout.net> wrote:
I've done all three on some level in my travels, but in the past it's also been oftentimes vendor-centric which hindered a scalable or "templateable" solution. (Some things police in only one direction, or only well in one direction, etc.)
Further complication: let's assume you are all-Tomahawk on the ASR9k. Let's assume TenGigE0/1/2/3/4 as a whole is pushing 6Gbps of traffic across all VLANs, everything is in-contract, and nothing is being dropped for any VLAN in any class. Now VLAN 200 gets a 20Gbps DDoS attack coming from a single backbone interface, i.e. we are offering that TenGig interface 26Gbps of traffic. What will happen is that all VLANs start dropping packets QoS-unaware: 12.5Gbps is being dropped by the ingress NPU, which is not aware of which VLAN the traffic is going to, nor of the QoS policy on the egress VLAN.
Hmm, is that so? Shouldn't the egress FIA(NPU) be issuing fabric grants (via central Arbiters) to ingress FIA(NPU) for any of the VOQs all the way up till the egress NPU's processing capacity, i.e. till the egress NPU can still cope with the overall pps rate (i.e. pps rate from fabric & pps rate from "edge" interfaces), subject to ingress NPU fairness of course?

Or in other words, shouldn't all or most of the 26Gbps end up on the egress NPU, since it most likely has all the necessary pps processing capacity to deal with the packets at the rate they are arriving, and decide for each, based on local classification and queuing policy, whether to enqueue the packet or drop it?

Looking at my notes (from discussions with Xander Thuijs and Aleksandar Vidakovic):

- Each 10G entity is represented by one VQI = 4 VOQs (one VOQ for each priority level).
- The trigger for the back-pressure is the utilisation level of the RFD buffers. RFD buffers are from-fabric feeder queues, holding packets while the NP microcode is processing them; the more feature processing a packet goes through (see BRKSPG-2904), the longer it stays in the RFD buffers.
- Fabric-side backpressure kicks in if the RFD queues are more than 60% full.

So according to the above, should the egress NPU be powerful enough to deal with 26Gbps of traffic coming from the fabric in addition to whatever business-as-usual duties it's performing (i.e. RFD queue utilisation stays below 60%), then no drops should occur on the ingress NPU (originating the DDoS traffic).
So VLAN100 starts to see NC, AF, BE and LE drops, even though the offered rate in VLAN100 remains in-contract in all classes. To mitigate this to a degree, on the backbone side of the ASR9k you need to set VoQ priority; you have 3 priorities. You could choose, for example, BE P2, NC+AF P1 and LE Pdefault. Then, if the attack traffic to VLAN200 is recognised and classified as LE, we will only see VLAN100 LE dropping (as well as every other VLAN's LE) instead of all the classes.
On Sun, 31 May 2020 at 17:37, <adamv0025@netconsultings.com> wrote:
Shouldn’t the egress FIA(NPU) be issuing fabric grants (via central Arbiters) to ingress FIA(NPU) for any of the VOQs all the way up till egress NPU's processing capacity, i.e. till the egress NPU can still cope with the overall pps rate (i.e. pps rate from fabric & pps rate from "edge" interfaces), subject to ingress NPU fairness of course?
This is how it works in, say, MX. But in the ASR9k the VoQs are artificially policed, no questions asked. And as the policers are port-level, if you subdivide a port via satellite or VLAN you'll have collateral damage. Technically the policer is programmable, and there is CLI for it, but the CLI config is a binary choice between two low values, not arbitrary.
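To make the collateral damage concrete, a toy illustration (made-up traffic numbers, not ASR9k VoQ arithmetic): when the drop point is shared and QoS-unaware, every VLAN and class on the port loses roughly the same fraction.

# Toy model: a shared, QoS-unaware drop point cuts every VLAN/class
# on the port by the same ratio, regardless of contract.

offered_gbps = {
    ("VLAN100", "NC"): 0.05, ("VLAN100", "AF"): 1.0,
    ("VLAN100", "BE"): 2.0,  ("VLAN100", "LE"): 0.2,
    ("VLAN200", "BE"): 2.75, ("VLAN200", "DDoS"): 20.0,
}
port_capacity_gbps = 10.0   # what actually fits toward the egress port

survive = port_capacity_gbps / sum(offered_gbps.values())
for (vlan, cls), gbps in offered_gbps.items():
    print(f"{vlan} {cls}: offered {gbps:.2f} -> delivered {gbps * survive:.2f} Gbps")
# Every class, in-contract or not, is cut to the same ~38%. If the attack
# can be classified into LE and LE mapped to the lowest VoQ priority, the
# loss concentrates there instead.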
Or in other words, shouldn't all or most of the 26Gbps end up on egress NPU, since it most likely has all the necessary pps processing capacity to deal with the packets at the rate they are arriving, and decide for each based on local classification and queuing policy whether to enqueue the packet or drop it?
No, as per the explanation given. Basically, don't subdivide ports, or don't get attacked.

-- ++ytti
On 21/May/20 21:08, Bryan Holloway wrote:
* Rate-limit at the Layer 2 switch for both customer ingress/egress,
In the past, we did this on the routers, as most switches only supported ingress policing and egress shaping, often with very tiny buffers. More recently, some switches do support both ingress and egress policing. Being able to do this as close to the customer as possible is always most effective, especially if you run LAGs between a switch and the upstream router.
* Rate-limit at the Layer 3 router upstream, i/e, or
This is how we used to do it, but it became problematic when you ran LAGs between switches and routers. However, between switches now supporting ingress/egress policing and our move away from switch-router LAGs to native 100Gbps trunks, you can now police on the router or the switch without much concern. The choice of either is determined by the number of services customers buy on a single switch port.
* Some combination thereof? E.g.: Rate-limit my traffic towards the customer closer to the core, and rate-limit ingress closer to the edge?
Where we run LAGs between routers and switches, we police on the switch. Where we run native 100Gbps trunks between switches and routers, we police on the router, depending on the type of service, e.g., a Q-in-Q setup for a customer where different services delivered on the same switch port have different policing requirements.
I've done all three on some level in my travels, but in the past it's also been oftentimes vendor-centric which hindered a scalable or "templateable" solution. (Some things police in only one direction, or only well in one direction, etc.)
Yes, we've oscillated between different methods depending, particularly, on what (switch) vendor we used.
In case someone is interested in a tangible example, imagine an Arista switch and an ASR9k router. :)
Arista do support ingress/egress policing (tested on the 7280R). The previous Juniper EX4550s we ran only shaped on egress, and that was problematic due to their small buffers. You should have a lot more flexibility on the ASR9000 router, except in cases where you need to police services delivered on a LAG.

Mark.
hey,
Being able to do this as close to the customer as possible is always most effective, especially if you run LAG's between a switch and upstream router.
DDoS can be a problem in this scenario. Assuming the PEs have plenty of capacity available and you can afford the DDoS to reach the PE, then you would shape to the customer's contract speed, drop the DDoS traffic, and not congest your access device uplink.

-- tarko
On Sun, 24 May 2020 at 16:58, Tarko Tikan <tarko@lanparty.ee> wrote:
DDoS can be a problem in this scenario. Assuming the PEs have plenty of capacity available and you can afford DDoS to reach PE, then you would shape to customer contract speed, drop the DDoS traffic and would not congest your access device uplink.
Provided you are using a strictly egress queueing platform, which OP's ASR9k is not, its ingress NPU will drop packets, causing all customers sharing the physical interface to suffer. -- ++ytti
hey,
Provided you are using a strictly egress queueing platform, which OP's ASR9k is not, its ingress NPU will drop packets, causing all customers sharing the physical interface to suffer.
Correct; QoS is a tricky thing that needs to be planned correctly. I was just pointing out additional benefits (or drawbacks, depending on where you look from).

-- tarko
On 24/May/20 15:55, Tarko Tikan wrote:
DDoS can be a problem in this scenario. Assuming the PEs have plenty of capacity available and you can afford DDoS to reach PE, then you would shape to customer contract speed, drop the DDoS traffic and would not congest your access device uplink.
That is one advantage of policing at the switch port, yes. But that would be to manage traffic coming in from the customer. If the attack traffic is coming from the Internet (toward the customer), then policing on the router saves the router-switch trunk. Either way, over-sizing router-switch trunks is always best. Mark.
participants (5)
- adamv0025@netconsultings.com
- Bryan Holloway
- Mark Tinka
- Saku Ytti
- Tarko Tikan