Bandwidth distribution per ip
The national operator here asks customers to distribute bandwidth equally across all of their IPs. For example, I have a /22, and inside it sits a CDN node from one of the big content providers; that CDN uses only 3 IPs for ingress traffic, so the bandwidth distribution across IPs is not equal and I am not able to use all of my bandwidth. To me this sounds like a faulty aggregation + shaping setup. For example, I heard once that if you configure policing on an aggregated interface on some Cisco switch models, a bundle with 4 member interfaces gets a 25% policer installed on each member, and if hashing is done by destination IP only you will hit exactly this issue; but as I recall that was an old and cheap model. Has anybody in the world faced such a requirement? Can such a requirement be considered legitimate?
That seems completely unworkable to me. I would think most environments are going to have heavy-hitting devices like firewalls and servers that create traffic aggregation points in the network. If they shape on their customer's uplink port, I don't see why the individual addresses matter at all. I've never heard of that one. As far as policing on an aggregated interface goes, it would seem better to police at a point where all of the traffic for a given customer can be policed together, regardless of the physical port it is received on.

Steven Naslund
Chicago IL
On 20 December 2017 at 16:55, Denys Fedoryshchenko <denys@visp.net.lb> wrote:
[...]
One such old and cheap model is the ASR9k, whether Trident, Typhoon or Tomahawk. It's actually a pretty demanding problem: two linecards, or even just two ports sitting on two different NPUs, might as well be different routers, and they have no good way to communicate with each other about bandwidth use. So an N-rate policer being installed as N/member_count per member link is very typical.

ECMP is a fact of life, and even though few if any providers document that they have per-flow limits lower than the nominal rate of the connection you purchase, such limits exist almost universally. The people most likely to see them are those who tunnel everything, so that everything from their, say, 10Gbps looks like a single flow from the point of view of the network. In the IPv6 world the tunnel encap end could at least write a hash into the IPv6 flow label, potentially allowing the core to balance tunneled traffic, unless the tunnel itself guarantees ordering.

I don't think it's fair for an operator to demand equal bandwidth per IP, but you will expose yourself to more problems if you do not have sufficient entropy. We are slowly getting solutions to this: Juniper Trio and Broadcom Tomahawk 3 can detect elephant flows and dynamically map hash results unequally to physical ports to alleviate the problem.

-- 
++ytti
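To illustrate the interaction Saku describes between an N/member_count policer split and destination-IP-only hashing, here is a minimal Python sketch. The member count, aggregate rate, CDN ingress IPs and the modulo "hash" are all invented illustrative values, not any vendor's actual algorithm:

    import ipaddress

    MEMBERS = 4
    AGGREGATE_POLICER_MBPS = 1000
    PER_MEMBER_MBPS = AGGREGATE_POLICER_MBPS / MEMBERS  # the N/member_count split

    cdn_ingress_ips = ["203.0.113.10", "203.0.113.11", "203.0.113.12"]  # 3 ingress IPs

    def member_for_dst(ip):
        """Pick a bundle member from the destination IP only (stand-in hash)."""
        return int(ipaddress.ip_address(ip)) % MEMBERS

    used_members = {member_for_dst(ip) for ip in cdn_ingress_ips}
    usable = len(used_members) * PER_MEMBER_MBPS

    print("members carrying traffic:", sorted(used_members))
    print("usable bandwidth: %.0f of %d Mbps (%.0f%%)"
          % (usable, AGGREGATE_POLICER_MBPS, 100.0 * usable / AGGREGATE_POLICER_MBPS))

Even in the best case the three ingress IPs land on three different members, so the CDN traffic is capped at 75% of the contracted rate; applied to the ~$150k/month figure mentioned later in the thread, that is roughly the ~$37.5k of unusable bandwidth Denys estimates.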
On 2017-12-20 17:52, Saku Ytti wrote: [...]

As a person who is in love with embedded systems development, I just watched today a beautiful machine from the 199x era, tens of meters long, where multi-kW VFDs drive huge motors (not steppers), dragging paper synchronously and printing on it at crazy speed, and all they have is a long ~9600 baud link between a bunch of encoders and a dinosaur PLC managing all this beauty. If any of them applies even a slightly wrong torque, the stretched paper rips apart. In fact there is nothing complex there, and the technology is ancient these days. Engineers who cannot synchronize and update the policing ratio of a few virtual "subinstances" based on feedback, in one tiny, expensive box, with a reasonable update rate and modern technology in their hands, are maybe incompetent?

The national operator doesn't provide IPv6; that's one of the problems. In most cases there are no tunnels, but the imbalance still exists. When an ISP pays ~$150k/month (bandwidth is very expensive here), and the CDN has 3 units and 3 ingress IPs while the carrier has bonding over 4 links somewhere, it means roughly ~$37.5k is lost by a rough estimate, and no sane person will accept that. Sometimes one CDN unit is in maintenance and the remaining 2 could perfectly serve the demand, but because of this "balancing" issue it just goes down, as half of the capacity is missing.

Tunnels are indeed the cause in some rare cases, but what can we do when they don't offer the reasonable DDoS protection tools the rest of the world has (not even blackholing)? Many DDoS protection operators charge extra for more tunnel endpoints with balancing, and that balancing is not so equal either (same src+dst IP at best). And when I did round robin in my own solution, I noticed that besides this "bandwidth distribution" issue, the latency to each IP is unequal, so round robin created out-of-order delivery for me.

Another problem: the most popular services in the region (in terms of bandwidth) are Facebook, WhatsApp and YouTube. Most of that traffic is big fat flows running over a few IPs, and I doubt I can convince them to balance over more.
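To make the out-of-order observation above concrete, here is a small hedged Python sketch; the two tunnel endpoints and their latencies are invented for illustration, and plain per-packet round robin is assumed:

    # Round-robin spraying of one flow over two tunnel endpoints with unequal
    # latency causes out-of-order arrival at the far end. The latencies and
    # packet spacing below are invented illustrative values.
    PATH_LATENCY_MS = {"endpoint-A": 20.0, "endpoint-B": 45.0}
    paths = list(PATH_LATENCY_MS)

    SEND_INTERVAL_MS = 1.0  # packets of a single flow sent 1 ms apart
    arrivals = []
    for seq in range(10):
        path = paths[seq % len(paths)]  # round-robin endpoint choice
        arrivals.append((seq * SEND_INTERVAL_MS + PATH_LATENCY_MS[path], seq))

    received = [seq for _, seq in sorted(arrivals)]
    out_of_order, max_seen = 0, -1
    for seq in received:
        if seq < max_seen:
            out_of_order += 1
        max_seen = max(max_seen, seq)

    print("arrival order:", received)
    print("packets arriving behind a later-sent packet:", out_of_order)

Persistent reordering like this looks to TCP much like loss (duplicate ACKs), which is one reason flow-based hashing deliberately pins each flow to a single path.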
On 20 December 2017 at 19:04, Denys Fedoryshchenko <denys@visp.net.lb> wrote:
[...]
As appealing as it is to say that everyone, present company excluded, is incompetent, I think the explanation is more complex than that. A solution has to be economic and marketable. I think elephant-flow detection with unequal mapping of hash results to physical interfaces is an economic and marketable solution, but it needs that extra level of abstraction, i.e. you cannot just backport it in software if the hardware is missing that sort of capability.

-- 
++ytti
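As a rough sketch of the unequal hash-result-to-port mapping Saku mentions, here is some hedged Python; the bucket count, per-bucket loads and the greedy rebalancing rule are invented, and real implementations in Trio or Tomahawk 3 hardware certainly differ:

    from collections import defaultdict

    MEMBERS = 4
    # Measured offered load per hash bucket (Mbps); one elephant flow in bucket 3.
    bucket_load = {b: 10.0 for b in range(16)}
    bucket_load[3] = 600.0

    def static_member(bucket):
        return bucket % MEMBERS  # the usual fixed bucket-to-member mapping

    def adaptive_mapping(load):
        """Assign buckets to the currently least-loaded member, heaviest first."""
        member_load = defaultdict(float)
        mapping = {}
        for bucket in sorted(load, key=load.get, reverse=True):
            target = min(range(MEMBERS), key=lambda m: member_load[m])
            mapping[bucket] = target
            member_load[target] += load[bucket]
        return mapping

    def per_member_load(load, choose):
        totals = defaultdict(float)
        for bucket, mbps in load.items():
            totals[choose(bucket)] += mbps
        return dict(sorted(totals.items()))

    remap = adaptive_mapping(bucket_load)
    print("static  :", per_member_load(bucket_load, static_member))
    print("adaptive:", per_member_load(bucket_load, remap.get))

This only keeps the mice flows away from the member carrying the elephant; a single flow bigger than one member link still cannot be split without risking reordering.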
On 2017-12-20 19:12, Saku Ytti wrote: [...]

Even a person as highly incompetent in such matters as me knows that one of the modern architectural challenges is that an NPU consists of a large number of "processing cores", each with its own counters, and on top of that multiple NPUs may be handling the same customer's traffic. Under such conditions, updating _precise_ counters (for bitrate measurement, for example) is no longer trivial, since the sum = a(1) + ... + a(n) requires synchronization, access to shared resources and so on. But it is still solvable in most cases; even the dead-wrong approach of running a script that changes the policer value on each "unit" once per second mostly solves the problem. And if some NPU architecturally cannot do this job, it means it is flawed and should be avoided for specific tasks, just like some BCM-chipset switches that claim 32k MACs but choke at 20k because of an 8192-entry TCAM and probably an imperfect hash with linear probing on collisions. Such a switch is simply not suitable for aggregation and termination.

Still, I run some dedicated servers in colo in the EU/US, some over 10G (bonding), with a _single_ IP on the server, and I have never faced such balancing issues. That is why I am asking whether anyone has had a carrier that requires balancing bandwidth across many IPs with quite high precision, so as not to lose expensive bandwidth.
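As a back-of-the-envelope version of the once-per-second feedback script described above, here is a hedged Python sketch; the rates and member count are invented, and a real agent would have to poll and reprogram the counters and policers of the specific platform rather than print numbers:

    AGGREGATE_MBPS = 1000.0
    MEMBERS = 4
    MIN_SHARE_MBPS = 10.0  # never starve a member completely

    def resplit(observed_mbps):
        """Per-member policer rates proportional to recently observed load."""
        total = sum(observed_mbps)
        if total == 0:
            return [AGGREGATE_MBPS / MEMBERS] * MEMBERS  # fall back to equal split
        shares = [max(MIN_SHARE_MBPS, AGGREGATE_MBPS * x / total) for x in observed_mbps]
        scale = AGGREGATE_MBPS / sum(shares)  # keep the sum at the contract rate
        return [round(s * scale, 1) for s in shares]

    # Example: dst-IP-only hashing has pushed almost everything onto two members.
    observed = [480.0, 430.0, 20.0, 5.0]
    print("static split  :", [AGGREGATE_MBPS / MEMBERS] * MEMBERS)
    print("adaptive split:", resplit(observed))

The point is only the control loop: measure per-member load, recompute a split that still sums to the contracted rate, push it back to the box, repeat.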
Denys Fedoryshchenko wrote on 12/20/2017 12:07 PM:
[...]
I have not heard of this. We typically purchase a connection and bring our own IP space. The capacity of the connection and the number of IP addresses (if any) are unrelated to each other. That said, I would find a bonded ethernet connection acceptable in some applications. Bonded ethernet generally works well to increase aggregate bandwidth when used by multiple hosts whose individual bandwidth is well below the speed of each link in the bond. With few hosts, bonded ethernet connections are not likely to work well to increase bandwidth and would generally be unsuitable for providing connectivity to a single IP/host.
This is based on feedback from a colleague who spent several years in Lebanon and did a fair amount of research into the AS-adjacency paths in and out of the country, and into the OSI layer 1 paths (submarine fiber to Cyprus, etc). It sounds to me like your upstream carrier does not actually have any such limitation in place and is making a nonsensical excuse, intended for consumption by less technically savvy persons, as to why they're running their international transit connections too close to full. International connectivity in and out of Beirut is quite costly on a $/Mbps basis. If you had the ability to see a traffic chart on one of their upstream connections, I would not be surprised to see that they're running a 10GbE to, for example, Telecom Italia/Sparkle at 87% utilization.

On Wed, Dec 20, 2017 at 11:23 AM, Blake Hudson <blake@ispn.net> wrote:
[...]
Hi,

It sounds like you are hosting the origin for the CDN, which causes the issues. Does the CDN care where it pulls the data from? Could you place a cheaper origin somewhere else, like AWS, Italy, Qatar or Amsterdam? For $150k/month you can get a lot of bandwidth/storage/rack space elsewhere. Another option could be something like origin storage, where the content is already stored on a CDN provider server.

Other than that, you could check the hashing with your upstream provider and get them to use layer 4 information as well. If they refuse, you might be able to free up some IPs by shrinking point-to-point links to /31s, or with some ugly NAT tricks where ports point to different services (mail ports go to the mail server and HTTP to the CDN unit). For ~$37.5k you can also buy some more prefixes to announce.

Karsten

2017-12-20 18:04 GMT+01:00 Denys Fedoryshchenko <denys@visp.net.lb>:
[...]
On 20 December 2017 at 15:52, Saku Ytti <saku@ytti.fi> wrote:
[...]
Hi,

In the case of the ASR9K, IOS-XR 6.0.1 added the following command: "hw-module all qos-mode bundle-qos-aggregate-mode". This splits the bandwidth over the member links and takes the link bandwidth into account; with a bundle bandwidth of 50G (10G+40G members) the ratios become 5/1 and 5/4 respectively (it supports unbalanced member link speeds).

Also, the NPUs don't need to talk to each other on the ASR9K; the central fabric arbiter has a view of the bandwidth per VoQ and can control the bandwidth across the LAG member interfaces even when they are spread over multiple line cards and NPUs.

Cheers,
James
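For a rough sense of what a bandwidth-proportional split looks like, here is a small hedged Python sketch; it only models the arithmetic of dividing an aggregate rate in proportion to member speed, and is not a description of how the ASR9K actually programs bundle-qos-aggregate-mode:

    def proportional_split(aggregate_mbps, member_speeds_mbps):
        """Split an aggregate rate across members in proportion to link speed."""
        total_speed = float(sum(member_speeds_mbps))
        return [aggregate_mbps * speed / total_speed for speed in member_speeds_mbps]

    members = [10_000.0, 40_000.0]   # a 10G and a 40G member, 50G bundle
    aggregate = 50_000.0             # aggregate rate applied to the bundle

    for speed, share in zip(members, proportional_split(aggregate, members)):
        print("%2.0fG member gets %2.0fG (%.0f%% of the aggregate)"
              % (speed / 1000, share / 1000, 100 * share / aggregate))

Under this arithmetic the 10G member carries one fifth of the aggregate and the 40G member four fifths, instead of an equal 1/N share per member.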
Denys Fedoryshchenko wrote on 12/20/2017 8:55 AM:
[...]
Not being able to use all of your bandwidth is a common issue if you are provided a bonded connection (aka Link Aggregation Group). For example, you are provided a 4Gbps service over 4x1Gbps ethernet links. Ethernet traffic is not typically balanced across links per frame, because this could lead to out of order delivery or jitter, especially in cases where the links have different physical characteristics. Instead, a hashing algorithm is typically used to distribute traffic based on flows. This results in each flow having consistent packet order and latency characteristics, but does force a flow over a single link, resulting in the flow being limited to the performance of that link. In this context, flows can be based on src/dst MAC address, IP address, or TCP/UDP port information, depending on the traffic type (some IP traffic is not TCP/UDP and won't have a port) and equipment type (layer 3 devices typically hash by layer 3 or 4 info).

Your operator may be able to choose an alternative hashing algorithm that could work better for you (hashing based on layer 4 information instead of layer 3 or 2, for example). This is highly dependent on your provider's equipment and configuration - it may be a global option on the equipment or may not be an option at all. Bottom line, if you expected 4Gbps performance for each host on your network, you're unlikely to get it on service delivered through 4x 1Gbps links. 10Gbps+ links between you and your ISP's peers would better serve those needs (any 1Gbps bonds in the path between you and your provider's edge are likely to exhibit the same characteristics).

--Blake
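To illustrate the difference the hash-field choice makes, here is a hedged Python sketch comparing destination-IP-only hashing with 5-tuple hashing over a 4-member bundle; the flows and the CRC32 stand-in hash are invented, not any vendor's real algorithm:

    from collections import Counter
    import zlib

    MEMBERS = 4

    # (src_ip, dst_ip, proto, src_port, dst_port, rate_mbps): 40 flows of 50 Mbps
    # toward 3 CDN ingress IPs. Addresses, ports and rates are invented values.
    flows = [("198.51.100.7", "203.0.113.%d" % (10 + i % 3), "tcp", 40000 + i, 443, 50)
             for i in range(40)]

    def pick_member(fields):
        # CRC32 over the chosen header fields as a stand-in hash function.
        return zlib.crc32("|".join(map(str, fields)).encode()) % MEMBERS

    def load_per_member(key_fields):
        load = Counter()
        for src, dst, proto, sport, dport, rate in flows:
            load[pick_member(key_fields((src, dst, proto, sport, dport)))] += rate
        return dict(sorted(load.items()))

    print("dst-IP-only hashing:", load_per_member(lambda h: (h[1],)))
    print("5-tuple hashing    :", load_per_member(lambda h: h))

With only three distinct destination IPs, dst-IP-only hashing can never use more than three of the four members no matter how many flows the CDN opens, which matches what Denys reports observing below.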
On 2017-12-20 19:16, Blake Hudson wrote:
[...]
There is no bonding toward me; it is usually a dedicated 1G/10G/etc. link. I also simulated this traffic for "hashability", and any layer-4-aware hashing on Cisco/Juniper gave a perfectly balanced bandwidth distribution. From my tests I can see that they are clearly balancing by destination IP only.
participants (7)
- Blake Hudson
- Denys Fedoryshchenko
- Eric Kuhnke
- James Bensley
- Karsten Elfenbein
- Naslund, Steve
- Saku Ytti