Re: Lossy cogent p2p experiences?
I wouldn't call 50 megabit/s an elephant flow

-----
Mike Hammett
Intelligent Computing Solutions
http://www.ics-il.com
Midwest-IX
http://www.midwest-ix.com

----- Original Message -----
From: "Mark Tinka" <mark@tinka.africa>
To: "Mike Hammett" <nanog@ics-il.net>, "Saku Ytti" <saku@ytti.fi>
Cc: nanog@nanog.org
Sent: Friday, September 1, 2023 8:56:03 AM
Subject: Re: Lossy cogent p2p experiences?

On 9/1/23 15:44, Mike Hammett wrote:

and I would say the OP wasn't even about elephant flows, just about a network that can't deliver anything acceptable.

Unless Cogent are not trying to accept (and by extension, may not be able to guarantee) large Ethernet flows because they can't balance them across their various core links, end-to-end...

Pure conjecture...

Mark.
Mark Tinka wrote:
On 9/1/23 15:59, Mike Hammett wrote:
I wouldn't call 50 megabit/s an elephant flow
Fair point.
Both of you are totally wrong, because the proper thing to do here is to police, if *any* policing is done at all, based on total traffic without detecting any flow. A hundred 50 Mbps flows are as harmful as one 5 Gbps flow. Moreover, as David Hubbard wrote:
I’ve got a non-rate-limited 10gig circuit
there is no point in policing.

Detection of elephant flows was wrongly considered useful in flow-driven architectures to automatically bypass L3 processing for the flows, back when L3 processing capability was wrongly considered limited. Then the topology-driven architecture of MPLS appeared, even though topology-driven is flow-driven (you can't put on the inner labels of MPLS without knowing detailed routing information at the destinations, which is hidden from the source through route aggregation, except on demand after detecting flows).

Masataka Ohta
On 9/2/23 17:04, Masataka Ohta wrote:
Both of you are totally wrong, because the proper thing to do here is to police, if *any* policing is done at all, based on total traffic without detecting any flow.
I don't think it's as much an issue of flow detection as it is the core's ability to balance the Layer 2 payload across multiple links effectively.

At our shop, we understand the limitations of trying to carry large EoMPLS flows across an IP/MPLS network that is, primarily, built to carry IP traffic. While some vendors have implemented adaptive load balancing algorithms on decent (if not custom) silicon that can balance EoMPLS flows as well as they can IP flows, it is hit & miss depending on the code, hardware, vendor, etc.

In our case, our ability to load balance EoMPLS flows as well as we do IP flows has improved since we moved to the PTX1000/10001 for our core routers. But even then, we will not sell anything above 40Gbps as an EoMPLS service. Once it gets there, it's time for EoDWDM. At least, until 800Gbps or 1Tbps Ethernet ports become both technically viable and commercially feasible. For as long as core links are based on 100Gbps and 400Gbps ports, optical carriage for 40Gbps and above is more sensible than EoMPLS.

Mark.
Mark Tinka wrote:
it is the core's ability to balance the Layer 2 payload across multiple links effectively.
Wrong. It can be performed only at the edges by policing total incoming traffic without detecting flows.
While some vendors have implemented adaptive load balancing algorithms
There are no such algorithms because, as I wrote:

: A hundred 50 Mbps flows are as harmful as one 5 Gbps flow.

Masataka Ohta
On 9/2/23 17:38, Masataka Ohta wrote:
Wrong. It can be performed only at the edges by policing total incoming traffic without detecting flows.
I am not talking about policing in the core, I am talking about detection in the core.

Policing at the edge is pretty standard. You can police a 50Gbps EoMPLS flow coming in from a customer port in the edge. If you've got N x 10Gbps links in the core and the core is unable to detect that flow in depth to hash it across all those 10Gbps links, you can end up putting all or a good chunk of that 50Gbps of EoMPLS traffic into a single 10Gbps link in the core, despite all other 10Gbps links having ample capacity available.
There are no such algorithms because, as I wrote:
: A hundred 50 Mbps flows are as harmful as one 5 Gbps flow.
Do you operate a large scale IP/MPLS network? Because I do, and I know what I see with the equipment we deploy. You are welcome to deny it all you want, however. Not much I can do about that. Mark.
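A minimal sketch of the failure mode Mark describes above, assuming the common hash-of-headers-modulo-members scheme (the hash and field names here are illustrative Python, not any vendor's actual implementation):

    import zlib

    MEMBERS = 4  # e.g. N x 10Gbps core links

    def member_for(src, dst, proto, sport, dport):
        # Stand-in for a hardware hash; real ASICs differ, but all are
        # deterministic per flow so that a single flow never reorders.
        key = f"{src},{dst},{proto},{sport},{dport}".encode()
        return zlib.crc32(key) % MEMBERS

    # An EoMPLS pseudowire looks like ONE flow to a core that cannot
    # parse the encapsulated payload: same outer headers on every packet.
    print(member_for("pe1", "pe2", "eompls", 0, 0))  # same index every time

Whatever index that prints, it prints it for every packet of the pseudowire, which is how one member link saturates while its siblings sit idle.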
Mark Tinka wrote:
Wrong. It can be performed only at the edges by policing total incoming traffic without detecting flows.
I am not talking about policing in the core, I am talking about detection in the core.
I'm not talking about detection at all.
Policing at the edge is pretty standard. You can police a 50Gbps EoMPLS flow coming in from a customer port in the edge. If you've got N x 10Gbps links in the core and the core is unable to detect that flow in depth to hash it across all those 10Gbps links, you can end up putting all or a good chunk of that 50Gbps of EoMPLS traffic into a single 10Gbps link in the core, despite all other 10Gbps links having ample capacity available.
Relying on a hash is a poor way to offer wide bandwidth.

If you have multiple parallel links over which many slow TCP connections are running, which should be your assumption, the proper thing to do is to use the links in round-robin fashion without hashing. Without buffer bloat, the packet reordering probability within each TCP connection is negligible.

Faster TCP may suffer from packet reordering during slight congestion, but the effect is like that of RED.

Anyway, in this case, the situation is:

: Moreover, as David Hubbard wrote:
:> I've got a non-rate-limited 10gig circuit

So, if you internally have 10 parallel 1G circuits expecting perfect hashing over them, it is not "non-rate-limited 10gig".

Masataka Ohta
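A minimal sketch of the round-robin dispatch being proposed, in illustrative Python (the forward() stub and the link count are assumptions for the example, not a router implementation):

    from itertools import cycle

    links = cycle(range(4))  # 4 equal-speed parallel point-to-point links

    def forward(packet, link):
        print(f"packet {packet} -> link {link}")  # stub for transmission

    def send(packet):
        forward(packet, next(links))  # ignore flows entirely; just rotate

    for p in range(8):
        send(p)  # packets 0..7 land on links 0,1,2,3,0,1,2,3

With equal link speeds and near-equal buffer occupancy on each member, consecutive packets of one connection arrive nearly in order; much of the rest of the thread is a dispute over how often that near-equal buffer occupancy assumption holds in practice.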
On 9/3/23 09:59, Masataka Ohta wrote:
If you have multiple parallel links over which many slow TCP connections are running, which should be your assumption, the proper thing to do is to use the links in round-robin fashion without hashing. Without buffer bloat, the packet reordering probability within each TCP connection is negligible.
So you mean, what... per-packet load balancing, in lieu of per-flow load balancing?
So, if you internally have 10 parallel 1G circuits expecting perfect hashing over them, it is not "non-rate-limited 10gig".
It is understood in the operator space that "rate limiting" generally refers to policing at the edge/access. The core is always abstracted, and that is just capacity planning and management by the operator. Mark.
Mark Tinka wrote:
So you mean, what... per-packet load balancing, in lieu of per-flow load balancing?
Why do you think you can rely on the existence of flows?
So, if you internally have 10 parallel 1G circuits expecting perfect hashing over them, it is not "non-rate-limited 10gig".
It is understood in the operator space that "rate limiting" generally refers to policing at the edge/access.
And nothing beyond, of course.
The core is always abstracted, and that is just capacity planning and management by the operator.
ECMP, surely, is too abstract a concept to properly manage/operate simple situations with equal-speed multiple parallel point-to-point links.

Masataka Ohta
On 9/3/23 15:01, Masataka Ohta wrote:
Why do you think you can rely on the existence of flows?
You have not quite answered my question - but I will assume you are in favour of per-packet load balancing.

I have deployed per-packet load balancing before, ironically, trying to deal with large EoMPLS flows in a LAG more than a decade ago. I won't be doing that again... OoO packets are nasty at scale.
And nothing beyond, of course.
No serious operator polices in the core.
ECMP, surely, is too abstract a concept to properly manage/operate simple situations with equal-speed multiple parallel point-to-point links.
I must have been doing something wrong for the last 25 years. Mark.
Mark Tinka wrote:
ECMP, surely, is too abstract a concept to properly manage/operate simple situations with equal-speed multiple parallel point-to-point links.
I must have been doing something wrong for the last 25 years.
Are you saying you thought a 100G Ethernet link actually consisting of 4 parallel 25G links, which is an example of "equal-speed multiple parallel point-to-point links", was relying on hashing?

Masataka Ohta
Masataka Ohta wrote on 04/09/2023 12:04:
Are you saying you thought a 100G Ethernet link actually consisting of 4 parallel 25G links, which is an example of "equal-speed multiple parallel point-to-point links", was relying on hashing?
this is an excellent example of what we're not talking about in this thread.

A 100G serdes is an unbuffered mechanism which includes a PLL, and this allows the style of clock/signal synchronisation required for the deserialised 4x25G lanes to be reserialised at the far end. This is one of the mechanisms used for packet / cell / bit spray, and it works really well.

This thread is talking about buffered transmission links on routers / switches on systems which provide no clocking synchronisation and not even a guarantee that the bearer circuits have comparable latencies.

ECMP / hash-based load balancing is a crock, no doubt about it; it's just less crocked than other approaches where there are no guarantees about device and bearer circuit behaviour.

Nick
Nick Hilliard wrote:
Are you saying you thought a 100G Ethernet link actually consisting of 4 parallel 25G links, which is an example of "equal-speed multiple parallel point-to-point links", was relying on hashing?
this is an excellent example of what we're not talking about in this thread.
Not "we", but "you".
A 100G serdes is an unbuffered mechanism which includes a PLL, and this allows the style of clock/signal synchronisation required for the deserialised 4x25G lanes to be reserialised at the far end. This is one of the mechanisms used for packet / cell / bit spray, and it works really well.
That's why I mentioned round robin, instead of a fully shared buffer, as the proper solution for this case.
This thread is talking about buffered transmission links on routers / switches on systems which provide no clocking synchronisation and not even a guarantee that the bearer circuits have comparable latencies. ECMP / hash based load balancing is a crock, no doubt about it;
See the first three lines of this mail to find that I explicitly mentioned "equal-speed multiple parallel point-to-point links" as the context for round robin.

As I already told you:

: In theory, you can always fabricate unrealistic counter examples
: against theories by ignoring essential assumptions of the theories.

you keep ignoring essential assumptions for no good purpose.

Masataka Ohta
On 9/4/23 13:27, Nick Hilliard wrote:
this is an excellent example of what we're not talking about in this thread.
It is amusing how he tried to pivot the discussion. Nobody was talking about how lane transport in optical modules works. Mark.
On 9/4/23 13:04, Masataka Ohta wrote:
Are you saying you thought a 100G Ethernet link actually consisting of 4 parallel 25G links, which is an example of "equal-speed multiple parallel point-to-point links", was relying on hashing?
No... you are saying that. Mark.
Mark Tinka wrote:
Are you saying you thought a 100G Ethernet link actually consisting of 4 parallel 25G links, which is an example of "equal-speed multiple parallel point-to-point links", was relying on hashing?
No...
So, though you wrote:
If you have multiple parallel links over which many slow TCP connections are running, which should be your assumption, the proper thing to do is to use the links in round-robin fashion without hashing. Without buffer bloat, the packet reordering probability within each TCP connection is negligible.
So you mean, what... per-packet load balancing, in lieu of per-flow load balancing?
you now recognize that per-flow load balancing is not a very good idea. Good.
you are saying that.
See above to find my statement of "without hashing". Masataka Ohta
On 9/6/23 09:12, Masataka Ohta wrote:
you now recognize that per-flow load balancing is not a very good idea.
You keep moving the goal posts. Stay on-topic.

I was asking you to clarify your post as to whether you were speaking of per-flow or per-packet load balancing. You did not do that, but I did not return to that question because your subsequent posts implied that you were talking about per-packet load balancing.

And just because I said per-flow load balancing has been the gold standard for the last 25 years does not mean it is the best solution. It just means it is the gold standard.

I recognize what happens in the real world, not in the lab or text books.

Mark.
Mark Tinka <mark@tinka.africa> writes:
And just because I said per-flow load balancing has been the gold standard for the last 25 years, does not mean it is the best solution. It just means it is the gold standard.
TCP looks quite different in 2023 than it did in 1998. It should handle packet reordering quite gracefully; in the best case the NIC will reassemble the out-of-order TCP packets into a 64k packet and the OS will never even know they were reordered. Unfortunately current equipment does not seem to offer per-packet load balancing, so we cannot test how well it works.

It is possible that per-packet load balancing will work a lot better today than it did in 1998, especially if the equipment does buffering before load balancing and the links happen to be fairly short and not very diverse.

Switching back to per-packet would solve quite a lot of problems, including elephant flows and bad hashing. I would love to hear about recent studies.

/Benny
On Wed, 6 Sept 2023 at 17:10, Benny Lyne Amorsen <benny+usenet@amorsen.dk> wrote:
TCP looks quite different in 2023 than it did in 1998. It should handle packet reordering quite gracefully; in the best case the NIC will
I think the opposite is true; TCP was designed to be order-agnostic. But everyone uses cubic, and for cubic reordering is the same as packet loss. This is a good trade-off. You need to decide if you want to recover fast from occasional packet loss, or if you want to be tolerant of reordering. The moment cubic receives a segment beyond the one it expects, it ACKs the previous segment again, signalling packet loss and causing an unnecessary resend and window-size reduction.
will never even know they were reordered. Unfortunately current equipment does not seem to offer per-packet load balancing, so we cannot test how well it works.
For example Juniper offers true per-packet, I think mostly used in high performance computing. -- ++ytti
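To make the dup-ACK mechanics concrete, here is a toy model in Python of a receiver's cumulative ACKs under reordering (a sketch of standard TCP receiver behaviour, not any particular stack):

    def acks_for(arrivals):
        # Cumulative ACK: always acknowledge the next segment still needed.
        expected, buffered, acks = 0, set(), []
        for seg in arrivals:
            buffered.add(seg)
            while expected in buffered:
                expected += 1
            acks.append(expected)
        return acks

    print(acks_for([0, 1, 2, 3, 4]))  # in order:  [1, 2, 3, 4, 5]
    print(acks_for([0, 2, 3, 4, 1]))  # reordered: [1, 1, 1, 1, 5]

The reordered run emits three duplicate ACKs for segment 1, and three dup ACKs is exactly the fast-retransmit threshold (RFC 2001, later RFC 5681), so the sender resends and shrinks its window as if the segment had been lost.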
For example Juniper offers true per-packet, I think mostly used in high performance computing.
At least on MX, what Juniper calls 'per-packet' is really 'per-flow'.
On 9/6/23 17:27, Tom Beecher wrote:
At least on MX, what Juniper calls 'per-packet' is really 'per-flow'.
Unless you specifically configure true "per-packet" on your LAG:

set interfaces ae2 aggregated-ether-options load-balance per-packet

I ran per-packet on a Juniper LAG 10 years ago. It produced 100% perfect traffic distribution. But the reordering was insane, and the applications could not tolerate it.

If your applications can tolerate reordering, per-packet is fine. In the public Internet space, it seems we aren't there yet.

Mark.
If your applications can tolerate reordering, per-packet is fine. In the public Internet space, it seems we aren't there yet.
Yeah, this.

During lockdown here in Italy, one day we started getting calls about performance issues: performance degradation, VPNs dropping or becoming unusable, and general "this isn't working like it used to" randomness. All the lines checked out, no bandwidth contention, etc. The only strange thing we found was that all affected sessions had a lot of out-of-order packets to and from a particular network in Italy. With them, we traced it down to traffic flowing through one IXP, and found they had added capacity between two switches and it had been configured with per-packet balancing. It was changed to flow-based balancing and everything went back to normal.

Brian
Unless you specifically configure true "per-packet" on your LAG:
Well, not exactly the same thing. (But it's my mistake, I was referring to L3 balancing, not L2 interface stuff.)

load-balance per-packet will cause massive reordering, because it's random spray, caring about nothing except equal loading of the members. It's a last-resort option that will cause tons of reordering. (And they call that out quite clearly in the docs.) If you don't care about reordering, it's great.

load-balance adaptive generally did a decent enough job last time I used it much.

stateful was hit or miss; sometimes it tested amazing, other times not so much. But it wasn't a primary requirement, so I never dove into why.
On 9/6/23 18:52, Tom Beecher wrote:
Well, not exactly the same thing. (But it's my mistake, I was referring to L3 balancing, not L2 interface stuff.)
Fair enough.
load-balance per-packet will cause massive reordering, because it's random spray, caring about nothing except equal loading of the members. It's a last-resort option that will cause tons of reordering. (And they call that out quite clearly in the docs.) If you don't care about reordering, it's great.
load-balance adaptive generally did a decent enough job last time I used it much.
Yep, pretty much my experience too.
stateful was hit or miss ; sometimes it tested amazing, other times not so much. But it wasn't a primary requirement so I never dove into why
Never tried stateful.

Moving 802.1Q trunks from N x 10Gbps LAGs to native 100Gbps links resolved this load-balancing conundrum for us. Of course, it works well because we spread these router<=>switch links across several 100Gbps ports, so no single trunk is ever that busy, even for customers buying N x 10Gbps services.

Mark.
Per-packet LB is one of those ideas that are great at a conceptual level, but in practice are obviously out of touch with reality. Kind of like the EIGRP protocol from Cisco and using the load, reliability, and MTU metrics.
On Thu, 7 Sept 2023 at 00:00, David Bass <davidbass570@gmail.com> wrote:
Per-packet LB is one of those ideas that are great at a conceptual level, but in practice are obviously out of touch with reality. Kind of like the EIGRP protocol from Cisco and using the load, reliability, and MTU metrics.
Those multi-metrics are in ISIS as well (if you don't use wide). And I agree those are not for common cases, but I wouldn't be shocked if someone has a legitimate MTR use-case where different metric-type topologies are very useful.

But as long as we keep the context as the Internet, true, 100%: reordering does not work for the Internet, not without changing all end hosts. And by changing those, it's not immediately obvious how we'd end up in a better place; if we wait a bit longer to signal packet loss, we likely end up in a worse place, as reordering is just so dang rare today, because congestion control choices have made sure no one reorders, or customers will yell at you, yet packet loss remains common.

Perhaps if congestion control used latency or FEC instead of loss, we could tolerate reordering while not underperforming under loss, but I'm sure in the decades following that decision we'd learn new ways how we don't understand any of this.

But for non-Internet applications, where you control the hosts, per-packet is used and needed. I think HPC applications, GPU farms, etc. are the users who asked JNPR to implement this.

--
++ytti
On 9/7/23 09:51, Saku Ytti wrote:
Perhaps if congestion control used latency or FEC instead of loss, we could tolerate reordering while not underperforming under loss, but I'm sure in decades following that decision we'd learn new ways how we don't understand any of this.
Isn't this partly what ECN was meant for? It's so old I barely remember what it was meant to solve :-). Mark.
It was intended to detect congestion. The obvious response was in some way to pace the sender(s) so that it was alleviated. Sent using a machine that autocorrects in interesting ways...
Tom Beecher wrote:
Well, not exactly the same thing. (But it's my mistake, I was referring to L3 balancing, not L2 interface stuff.)
That should be the correct thing to refer to.
load-balance per-packet will cause massive reordering,
If the buffering delay of ECMP paths cannot be controlled, yes.
because it's random spray, caring about nothing except equal loading of the members.
Equal loading on point-to-point links between two routers by (weighted) round robin means mostly the same buffering delay, which won't cause massive reordering.

Masataka Ohta
Mark Tinka <mark@tinka.africa> writes:
set interfaces ae2 aggregated-ether-options load-balance per-packet
I ran per-packet on a Juniper LAG 10 years ago. It produced 100% perfect traffic distribution. But the reordering was insane, and the applications could not tolerate it.
Unfortunately that is not strict round-robin load balancing. I do not know about any equipment that offers actual round-robin load-balancing. Juniper's solution will cause way too much packet reordering for TCP to handle. I am arguing that strict round-robin load balancing will function better than hash-based in a lot of real-world scenarios.
On Thu, 7 Sept 2023 at 15:45, Benny Lyne Amorsen <benny+usenet@amorsen.dk> wrote:
Juniper's solution will cause way too much packet reordering for TCP to handle. I am arguing that strict round-robin load balancing will function better than hash-based in a lot of real-world scenarios.
And you will be wrong. A packet arriving out of order will be treated by the host as loss of the previous packet, and the host will signal the need for a resend.

--
++ytti
Saku Ytti wrote:
And you will be wrong. A packet arriving out of order will be treated by the host as loss of the previous packet, and the host will signal the need for a resend.
As I already quoted from the very old and fundamental paper on the E2E argument:

End-To-End Arguments in System Design
https://groups.csail.mit.edu/ana/Publications/PubPDFs/End-to-End%20Arguments...

: 3.4 Guaranteeing FIFO Message Delivery

and as is described in rfc2001:

   Since TCP does not know whether a duplicate ACK is caused by a lost
   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
   segment or just a reordering of segments, it waits for a small number
   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
   of duplicate ACKs to be received.  It is assumed that if there is
   just a reordering of the segments, there will be only one or two
   duplicate ACKs before the reordered segment is processed, which will
   then generate a new ACK.  If three or more duplicate ACKs are
                             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
   received in a row, it is a strong indication that a segment has been
   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
   lost.

In networking, it is well known that "Guaranteeing FIFO Message Delivery" by the network is impossible, because packets arriving out of order without packet losses is inevitable and is not uncommon. As such, slight reordering is *NOT* interpreted as a previous packet loss.

The allowed amount of reordering depends on TCP implementations and can be controlled by upgrading TCP.

Masataka Ohta
On 9/7/23 09:31, Benny Lyne Amorsen wrote:
Unfortunately that is not strict round-robin load balancing.
Oh? What is it then, if it's not spraying successive packets across member links?
I do not know about any equipment that offers actual round-robin load-balancing.
Cisco had both per-destination and per-packet. Is that not it in the networking world?
Juniper's solution will cause way too much packet reordering for TCP to handle. I am arguing that strict round-robin load balancing will function better than hash-based in a lot of real-world scenarios.
Ummh, no, it won't. If it did, it would have been widespread. But it's not. Mark.
On Fri, 8 Sept 2023 at 09:17, Mark Tinka <mark@tinka.africa> wrote:
Unfortunately that is not strict round-robin load balancing.
Oh? What is it then, if it's not spraying successive packets across member links?
I believe the suggestion is that round-robin out-performs random spray. Random spray is what the HPC world is asking for, not round-robin.

Now I've not operated a network where per-packet is useful, so I'm not sure why you'd want round-robin over random spray, but I can easily see why you'd want either a) random traffic or b) random spray. If neither is true, if you have strict round-robin and you have non-random traffic, say every other packet is a big data delivery and every other packet is a small ACK, you can easily synchronise one link to 100% utilisation and another to near 0% if you do true round-robin, but not if you do random spray. I don't see a downside random spray would have over round-robin, but I wouldn't be shocked if there is one.

I see this thread is mostly starting to loop around two debates:

1) Reordering is not a problem
   - if you control the application, you can make it zero problem
   - if you use the TCP shipping in Android, iOS, macOS, Windows, Linux, BSD, reordering is in practice as bad as packet loss
   - the people on this list who know this don't know it because they read it; they know it because they got caught with their pants down and learned it, because they had reordering and TCP performance was destroyed, even at very low reorder rates
   - we could design a TCP congestion control that is very tolerant to reordering, but I cannot say if it would be an overall win or loss

2) Reordering won't happen with per-packet, if there is no congestion and latencies are equal
   - the receiving distributed routers (~all of them) have no global synchronisation; they make no guarantee that ingress order is honoured on egress when ingress is >1 interface. The amount of reordering this alone causes will destroy customer expectations of TCP performance
   - we could quite easily guarantee order as long as interfaces are in the same hardware complex, but it would be very difficult to guarantee between hardware complexes

--
++ytti
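The round-robin synchronisation effect described above is easy to reproduce in a toy model. Here is an illustrative Python sketch (the link count and packet sizes are assumptions for the example) comparing strict rotation with random spray on an alternating data/ACK pattern:

    import random

    def round_robin(sizes, n):
        totals = [0] * n
        for i, s in enumerate(sizes):
            totals[i % n] += s  # packet i always goes to link i mod n
        return totals

    def random_spray(sizes, n):
        totals = [0] * n
        for s in sizes:
            totals[random.randrange(n)] += s  # uniformly random member
        return totals

    traffic = [1500, 64] * 1000  # big data packet, small ACK, repeated
    print(round_robin(traffic, 2))   # [1500000, 64000]: phase-locked, ~96/4 split
    print(random_spray(traffic, 2))  # roughly even byte counts each run

Strict rotation locks the phase of a periodic traffic pattern to the member count; random spray has no phase to lock, which is presumably why the HPC folks ask for spray rather than round-robin.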
Mark Tinka <mark@tinka.africa> writes:
Oh? What is it then, if it's not spraying successive packets across member links?
It sprays the packets more or less randomly across links, and each link then does individual buffering. It introduces an unnecessary random delay to each packet, when it could just place them successively on the next link.
Ummh, no, it won't.
If it did, it would have been widespread. But it's not.
It seems optimistic to argue that we have reached perfection in networking. The Linux TCP stack does not immediately start backing off when it encounters packet reordering. In the server world, packet-based round-robin is a fairly common interface bonding strategy, with the accompanying reordering, and generally it performs great.
i am going to be foolish and comment, as i have not seen this raised

if i am running a lag, i can not resist adding a bit of resilience by having it spread across line cards.

surprise! line cards from vendor <any> do not have uniform hashing or rotating algorithms.

randy
On 9/9/23 20:44, Randy Bush wrote:
i am going to be foolish and comment, as i have not seen this raised
if i am running a lag, i can not resist adding a bit of resilience by having it spread across line cards.
surprise! line cards from vendor <any> do not have uniform hashing or rotating algorithms.
We spread all our LAGs across multiple line cards wherever possible (wherever possible = chassis-based hardware).

I am not intimately aware of any hashing concerns for LAGs that traverse multiple line cards in the same chassis.

Mark.
At a previous $dayjob at a Tier 1, we would only support LAG for a customer L2/3 service if the ports were on the same card. The response we gave if customers pushed back was "we don't consider LAG a form of circuit protection, so we're not going to consider physical resiliency in the design", which was true, because we didn't, but it was beside the point.

The real reason was that getting our switching/routing platform to actually run traffic symmetrically across a LAG, which most end users considered expected behavior in a LAG, required a reconfiguration of the default hash, which effectively meant that [switching/routing vendor]'s TAC wouldn't help when something invariably went wrong. So it wasn't that it wouldn't work (my recollection at least is that everything ran fine in lab environments), but we didn't trust the hardware vendor support.
--
Dave Cohen
craetdave@gmail.com
@dCoSays
www.venicesunlight.com
On 9/9/23 22:29, Dave Cohen wrote:
At a previous $dayjob at a Tier 1, we would only support LAG for a customer L2/3 service if the ports were on the same card. The response we gave if customers pushed back was "we don't consider LAG a form of circuit protection, so we're not going to consider physical resiliency in the design", which was true, because we didn't, but it was beside the point. The real reason was that getting our switching/routing platform to actually run traffic symmetrically across a LAG, which most end users considered expected behavior in a LAG, required a reconfiguration of the default hash, which effectively meant that [switching/routing vendor]'s TAC wouldn't help when something invariably went wrong. So it wasn't that it wouldn't work (my recollection at least is that everything ran fine in lab environments) but we didn't trust the hardware vendor support.
We've had the odd bug here and there with LAGs for things like VRRP, BFD, etc. But we have not run into that specific issue before on the ASR1000, ASR9000, CRS-X and MX. 98% of our network is Juniper nowadays, but even when we ran Cisco and had LAGs across multiple line cards, we didn't see this problem.

The only hashing issue we had with LAGs was when we tried to carry Layer 2 traffic across them in the core. But this was just a limitation of the CRS-X, and it happened also on member links of a LAG that shared the same line card.

Mark.
On Sat, 9 Sept 2023 at 21:36, Benny Lyne Amorsen <benny+usenet@amorsen.dk> wrote:
The Linux TCP stack does not immediately start backing off when it encounters packet reordering. In the server world, packet-based round-robin is a fairly common interface bonding strategy, with the accompanying reordering, and generally it performs great.
If you have

  Linux - 1RU cat-or-such - Router - Internet

then round-robin between the Linux box and the 1RU is mostly gonna work, because it satisfies the a) non-congested, b) equal-RTT, c) non-distributed (single-pipeline ASIC switch, honouring ingress order on egress) requirements. But it is quite a special case, and of course there is only round-robin on one link in one direction.

Between 3.6 and 4.4, all multipath in Linux was broken, and I still to this day help people with problems on multipath complaining it doesn't perform (in a LAN!).

3.6 introduced the FIB to replace the flow cache, and made multipath essentially random.
4.4 replaced random with hash.

When I ask them 'do you see reordering', people mostly reply 'no', because they look at a PCAP and it doesn't look important to a human observer; it is such an insignificant amount. Invariably the problem goes away with hashing. (netstat -s is better than intuition on a PCAP.)

--
++ytti
On 9/6/23 16:14, Saku Ytti wrote:
For example Juniper offers true per-packet, I think mostly used in high performance computing.
Cisco did it too with CEF supporting "ip load-sharing per-packet" at the interface level. I am not sure this is still supported on modern code/boxes. Mark.
On 9/6/23 11:20, Benny Lyne Amorsen wrote:
TCP looks quite different in 2023 than it did in 1998. It should handle packet reordering quite gracefully; in the best case the NIC will reassemble the out-of-order TCP packets into a 64k packet and the OS will never even know they were reordered. Unfortunately current equipment does not seem to offer per-packet load balancing, so we cannot test how well it works.
I ran per-packet load balancing on a Juniper LAG between 2015 and 2016. Let's just say I won't be doing that again. It balanced beautifully, but OoO packets made customers' lives impossible. So we went back to adaptive load balancing.
It is possible that per-packet load balancing will work a lot better today than it did in 1998, especially if the equipment does buffering before load balancing and the links happen to be fairly short and not very diverse.
Switching back to per-packet would solve quite a lot of problems, including elephant flows and bad hashing.
I would love to hear about recent studies.
2016 is not 1998, and certainly not 2023... but I've not heard about any improvements in Internet-based applications being better at handling OoO packets. Open to new info.

100Gbps ports have given us some breathing room, as have larger buffers on Arista switches, which let us move bandwidth management down to the user-facing port and not the upstream router. Clever Trio + Express chips have also enabled reasonably even traffic distribution with per-flow load balancing.

We shall revisit the per-flow vs. per-packet problem when 100Gbps starts to become as rampant as 10Gbps did.

Mark.
Benny Lyne Amorsen wrote:
TCP looks quite different in 2023 than it did in 1998. It should handle packet reordering quite gracefully;
Maybe, and even if it isn't, TCP may be modified. But that is not my primary point.

ECMP, in general, means paths consisting of multiple routers and links. The links have various bandwidths, and other traffic may be merged at multi-access links or at routers. Then it is hopeless for the load balancing points to control the buffers of the routers in the paths and the delays caused by those buffers, which makes per-packet load balancing hopeless.

However, as I wrote to Mark Tinka:

: If you have multiple parallel links over which many slow
: TCP connections are running, which should be your assumption,

with "multiple parallel links", which are single-hop paths, it is possible for the load balancing point to keep the buffer occupancy of the links, and the delays caused by those buffers, almost the same, which should eliminate packet reordering within a flow, especially when "many slow TCP connections are running". And simple round robin should be good enough for most of the cases (no lab testing at all, yet).

A little more aggressive approach is to fully share a single buffer among all the parallel links. But as that is not compatible with router architecture today, I did not propose that approach.

Masataka Ohta
On Wed, 6 Sept 2023 at 10:27, Mark Tinka <mark@tinka.africa> wrote:
I recognize what happens in the real world, not in the lab or text books.
Fun fact about the real world: devices do not internally guarantee order. That is, even if you have identical-latency links and zero congestion, order is not guaranteed between packet1 coming from interface I1 and packet2 coming from interface I2; which packet first goes to interface E1 is unspecified.

This is because packets inside the device can be sprayed to multiple lookup engines, and order is lost even for packets coming from interface1 exclusively; however, after the lookup the order is restored for a _flow_. It is not restored between flows, so packets coming from interface1 with random ports won't be in the same order going out of interface2.

So order is only restored inside a single lookup complex (interfaces are not guaranteed to be in the same complex) and only for actual flows. It is designed this way because no one runs networks which rely on order outside these parameters, and no one even knows their kit works like this, because they don't have to.

--
++ytti
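A toy model of that behaviour, in illustrative Python: two flows fan out to parallel lookup engines with non-deterministic completion order, and a reorder stage restores order per flow only (a sketch of the described property, not any vendor's pipeline):

    import random

    packets = [(flow, seq) for seq in range(5) for flow in ("A", "B")]
    completed = packets[:]
    random.shuffle(completed)  # lookups finish in non-deterministic order

    next_seq = {"A": 0, "B": 0}          # next sequence number owed per flow
    pending = {"A": set(), "B": set()}   # lookups done but not yet released
    egress = []
    for flow, seq in completed:
        pending[flow].add(seq)
        while next_seq[flow] in pending[flow]:
            egress.append((flow, next_seq[flow]))  # release in per-flow order
            next_seq[flow] += 1

    print(egress)  # A in order, B in order; A/B interleaving varies per run

Each flow exits in sequence, but the A/B interleaving differs from run to run, which is the "order restored for a flow, not between flows" property described above.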
Saku Ytti wrote:
Fun fact about the real world, devices do not internally guarantee order. That is, even if you have identical latency links, 0 congestion, order is not guaranteed between packet1 coming from interfaceI1 and packet2 coming from interfaceI2, which packet first goes to interfaceE1 is unspecified.
So, you lack fundamental knowledge of the E2E argument, which is fully applicable to situations in the real-world Internet.

In the very basic paper on the E2E argument published in 1984:

End-To-End Arguments in System Design
https://groups.csail.mit.edu/ana/Publications/PubPDFs/End-to-End%20Arguments...

reordering is recognized, in both the real and the theoretical world, as:

   3.4 Guaranteeing FIFO Message Delivery

   Ensuring that messages arrive at the receiver in the same order in
   which they are sent is another function usually assigned to the
   communication subsystem.

which means, according to the paper, that this "function" of guaranteeing ordering by the network cannot be complete or correct, and, unlike you, I'm fully aware of it.
This is because packets inside lookup engine can be sprayed to multiple lookup engines, and order is lost even for packets coming from interface1 exclusively, however after the lookup the order is restored for _flow_, it is not restored between flows, so packets coming from interface1 with random ports won't be same order going out from interface2.
That is a broken argument for how identification of flows by intelligent intermediate entities could work, against the E2E argument and against the reality that initiated this thread.

In the real world, according to the E2E argument, attempts to identify flows by intelligent intermediate entities are just harmful from the beginning, which is why flow-driven architecture, including that of MPLS, is broken and hopeless.

I really hope you understand the meaning of "intelligent intermediate entities" in the context of the E2E argument.

Masataka Ohta
On 9/6/23 12:01, Saku Ytti wrote:
Fun fact about the real world, devices do not internally guarantee order. That is, even if you have identical latency links, 0 congestion, order is not guaranteed between packet1 coming from interfaceI1 and packet2 coming from interfaceI2, which packet first goes to interfaceE1 is unspecified. This is because packets inside lookup engine can be sprayed to multiple lookup engines, and order is lost even for packets coming from interface1 exclusively, however after the lookup the order is restored for _flow_, it is not restored between flows, so packets coming from interface1 with random ports won't be same order going out from interface2.
So order is only restored inside a single lookup complex (interfaces are not guaranteed to be in the same complex) and only for actual flows.
Yes, this has been my understanding of, specifically, Juniper's forwarding complex.

Packets are chopped into near-same-size cells, sprayed across all available fabric links by the PFE logic, given a sequence number, and protocol engines ensure oversubscription is managed by a request-grant mechanism between PFEs.

I'm not sure what mechanisms other vendors implement, but certainly OoO cells in the Juniper forwarding complex are not a concern within the same internal system itself.

Mark.
On Wed, 6 Sept 2023 at 19:28, Mark Tinka <mark@tinka.africa> wrote:
Yes, this has been my understanding of, specifically, Juniper's forwarding complex.
Correct. A packet is sprayed to some PPE, and PPEs do not run in deterministic time; after the PPEs there is a reorder block that restores flow order, if it has to. EZchip is the same with its TOPs.
Packets are chopped into near-same-size cells, sprayed across all available fabric links by the PFE logic, given a sequence number, and protocol engines ensure oversubscription is managed by a request-grant mechanism between PFE's.
This isn't the mechanism that causes reordering; it's in the ingress and egress lookup, where the packet or packet head is sprayed to some PPE, that it can occur. You can find some patents on it:

https://www.freepatentsonline.com/8799909.html

   When a PPE 315 has finished processing a header, it notifies a
   Reorder Block 321. The Reorder Block 321 is responsible for
   maintaining order for headers belonging to the same flow, and pulls
   a header from a PPE 315 when that header is at the front of the
   queue for its reorder flow.

Note this reorder happens even when you have exactly 1 ingress interface and exactly 1 egress interface; as long as you have enough PPS, you will reorder outside flows, even without the fabric being involved.

--
++ytti
On Wed, Sep 6, 2023 at 12:23 AM Mark Tinka <mark@tinka.africa> wrote:
I recognize what happens in the real world, not in the lab or text books.
What's the difference between theory and practice? In theory, there is no difference. -- William Herrin bill@herrin.us https://bill.herrin.us/
William Herrin wrote:
I recognize what happens in the real world, not in the lab or text books.
What's the difference between theory and practice?
W.r.t. the fact that there are so many wrong theories and wrong practices, there is no difference.
In theory, there is no difference.
Especially because the real world includes labs and text books and, as such, all the theories including all the wrong ones exist in the real world. Masataka Ohta
Masataka Ohta wrote on 03/09/2023 08:59:
the proper thing to do is to use the links with round robin fashion without hashing. Without buffer bloat, packet reordering probability within each TCP connection is negligible.
Can you provide some real world data to back this position up? What you said reminds me of the old saying: in theory, there's no difference between theory and practice, but in practice there is. Nick
Nick Hilliard wrote:
the proper thing to do is to use the links with round robin fashion without hashing. Without buffer bloat, packet reordering probability within each TCP connection is negligible.
Can you provide some real world data to back this position up?
See, for example, the famous paper "Sizing Router Buffers".

With the thousands of TCP connections in the backbone recognized by the paper, buffers of thousands of packets won't cause packet reordering.
What you said reminds me of the old saying: in theory, there's no difference between theory and practice, but in practice there is.
In theory, you can always fabricate unrealistic counter examples against theories by ignoring essential assumptions of the theories. In this case, "Without buffer bloat" is an essential assumption. Masataka Ohta
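For reference, the headline result of that paper (Appenzeller, Keslassy and McKeown, SIGCOMM 2004) replaces the classic B = RTT x C rule of thumb with B = RTT x C / sqrt(n) for n desynchronised long-lived flows. A quick back-of-the-envelope in Python, with illustrative figures close to the paper's worked example:

    from math import sqrt

    C = 10e9     # 10 Gbps link
    RTT = 0.25   # 250 ms round-trip time
    n = 10_000   # long-lived flows sharing the link

    classic = RTT * C             # ~2.5 Gbit of buffer
    small = RTT * C / sqrt(n)     # ~25 Mbit of buffer
    print(classic / 1e6, small / 1e6)  # in Mbit: 2500.0 25.0

Buffers sized this way keep queueing delay small and roughly equal across parallel links, which is the "without buffer bloat" assumption doing the work in the argument above.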
Masataka Ohta wrote on 03/09/2023 14:32:
See, for example, the famous paper of "Sizing Router Buffers".
With thousands of TCP connections at the backbone recognized by the paper, buffers with thousands of packets won't cause packet reordering.
What you said reminds me of the old saying: in theory, there's no difference between theory and practice, but in practice there is.
In theory, you can always fabricate unrealistic counter examples against theories by ignoring essential assumptions of the theories.
In this case, "Without buffer bloat" is an essential assumption.
I can see how this conclusion could potentially be reached in specific styles of lab configs, but the real world is more complicated and the assumptions you've made don't hold there, especially the implicit ones. Buffer bloat will make this problem worse, but small buffers won't eliminate the problem.

That isn't to say that packet / cell spray arrangements can't work. There are some situations where they can work reasonably well, given specific constraints, e.g. limited distance transmission path and path congruence with far-side reassembly (!), but these are the exception. Usually this only happens inside network devices rather than between devices, but occasionally you see products on the market which support this between devices, with varying degrees of success.

Generally in real world situations on the internet, packet reordering will happen if you use round robin, and this will impact performance for higher speed flows. There are several reasons for this, but mostly they boil down to a lack of control over the exact profile of the packets that the devices are expected to transmit, and no guarantee that the individual bearer channels have identical transmission characteristics. Then multiply that across the N load-balanced hops that each flow will take between source and destination.

It's true that per-hash load balancing is a nuisance, but it works better in practice on larger heterogeneous networks than RR.

Nick
Nick Hilliard wrote:
In this case, "Without buffer bloat" is an essential assumption.
I can see how this conclusion could potentially be reached in specific styles of lab configs,
I'm not interested in how poorly you configure your lab.
but the real world is more complicated and
And this thread was initiated because of unreasonable behavior apparently caused by stupid attempts at automatic flow detection followed by policing. That is the real world.

Moreover, it has been well known, both in theory and in practice, that flow-driven architecture relying on automatic detection of flows does not scale and is no good, though MPLS relies on that broken flow-driven architecture.
Generally in real world situations on the internet, packet reordering will happen if you use round robin, and this will impact performance for higher speed flows.
That is my point, which I already stated. You don't have to repeat it.
It's true that per-hash load balancing is a nuisance, but it works better in practice on larger heterogeneous networks than RR.
Here, you implicitly assume a large number of slower-speed flows, against your own statement about "higher speed flows".

Masataka Ohta
participants (13)

- Benny Lyne Amorsen
- Brian Turnbow
- Dave Cohen
- David Bass
- Fred Baker
- Mark Tinka
- Masataka Ohta
- Mike Hammett
- Nick Hilliard
- Randy Bush
- Saku Ytti
- Tom Beecher
- William Herrin