Lossy cogent p2p experiences?
Hi all, curious if anyone who has used Cogent as a point-to-point provider has gone through packet loss issues with them and was able to successfully resolve them? I’ve got a non-rate-limited 10gig circuit between two geographic locations that have about 52ms of latency. Mine is set up to support both jumbo frames and vlan tagging. I do know Cogent packetizes these circuits, so they’re not like waves, and that the expected single-session TCP performance may be limited to a few gbit/sec, but I should otherwise be able to fully utilize the circuit given enough flows.

Circuit went live earlier this year, had zero issues with it. Testing with common tools like iperf would allow several gbit/sec of TCP traffic using single flows, even without an optimized TCP stack. Using parallel flows or UDP we could easily get close to wire speed. Starting about ten weeks ago we had a significant slowdown, to even complete failure, of bursty data replication tasks between equipment that was using this circuit. Rounds of testing demonstrate that new flows often experience significant initial packet loss of several thousand packets, and then see lesser ongoing packet loss every five to ten seconds after that. There are times we can’t do better than 50 Mbit/sec, and it’s rare to achieve even a gigabit unless we run a bunch of streams with a lot of tuning. With UDP we also see the loss, but can still push many gigabits through with one sender, or wire speed with several nodes.

For equipment which doesn’t use a tunable TCP stack, such as storage arrays or VMware, the retransmits completely ruin performance or may result in ongoing failure we can’t overcome.

Cogent support has been about as bad as you can get. Everything is great, clean your fiber, iperf isn’t a good test, install a physical loop oh wait we don’t want that so go pull it back off, new updates come at three to seven day intervals, etc. If the performance had never been good to begin with I’d have just attributed this to their circuits, but since it worked until late June, I know something has changed. I’m hoping someone else has run into this and maybe knows of some hints I could give them to investigate. To me it sounds like there’s a rate limiter / policer defined somewhere in the circuit, or an overloaded interface/device we’re forced to traverse, but they assure me this is not the case and claim to have destroyed and rebuilt the logical circuit.

Thanks!
Cogent has asked many people NOT to purchase their Ethernet private circuit point-to-point service unless they can guarantee that no single flow will exceed 2 Gbps. This works fine as long as the service is used mostly for mixed IP traffic, like a bunch of randomly mixed customers together. What you are trying to do is probably against the guidelines their engineering group has given them for what they can sell now. This is a known, weird limitation of Cogent's private circuit service.

The best working theory that several people I know in the neteng community have come up with is that Cogent does not want to adversely impact all the other customers on their router in some sites, where the site's upstreams and links to neighboring POPs are implemented as something like 4 x 10 Gbps, i.e., places where they have not yet upgraded that specific router to a full 100 Gbps upstream. Moving large flows >2 Gbps could flat-top a traffic chart on just one of those 10 Gbps circuits.
That’s not what I’m trying to do; that’s just what I’m using during testing to demonstrate the loss to them. The circuit is intended to bridge a number of networks with hundreds of flows, including inbound internet sources, but any new TCP flow is subject to numerous dropped packets at establishment and then ongoing loss every five to ten seconds. The initial loss and ongoing bursts of loss cause the TCP window to shrink so much that any single flow, between systems that can’t be optimized, ends up varying from 50 Mbit/sec to something far short of a gigabit. It was also fine for six months before this miserable behavior began in late June.
On Thu, Aug 31, 2023 at 2:42 PM David Hubbard <dhubbard@dino.hostasaurus.com> wrote:
any new TCP flow is subject to numerous dropped packets at establishment and then ongoing loss every five to ten seconds.
Hi David,

That sounds like normal TCP behavior over a long fat pipe. After establishment, TCP sends a burst of 10 packets at wire speed. There's a long delay and then they basically get acked all at once so it sends another burst of 20 packets this time. This doubling burst repeats itself until one of the bursts overwhelms the buffers of a mid-path device, causing one or a bunch of them to be lost. That kicks it out of "slow start" so that it stops trying to double the window size every time. Depending on how aggressive your congestion control algorithm is, it then slightly increases the window size until it loses packets, and then falls back to a smaller size.

It actually takes quite a while for the packets to spread out over the whole round trip time. They like to stay bunched up in bursts. If those bursts align with other users' traffic and overwhelm a midpoint buffer again, well, there you go.

I have a hypothesis that TCP performance could be improved by intentionally spreading out the early packets. Essentially, upon receiving an ack to the first packet that contained data, start a rate limiter that allows only one packet per 1/20th of the round trip time to be sent for the next 20 packets. I left the job where I was looking at that and haven't been back to it.

Regards,
Bill Herrin

--
William Herrin
bill@herrin.us
https://bill.herrin.us/
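As a rough illustration of the burst-doubling pattern described above, here is a toy model in Python; the initial window, segment size, and mid-path queue depth are assumed example values, not measurements from this circuit:

MSS = 1460            # bytes per segment (typical Ethernet payload)
INIT_WINDOW = 10      # segments; common initial window per RFC 6928
BUFFER_PKTS = 512     # hypothetical mid-path queue depth, in packets

cwnd = INIT_WINDOW
rtt_count = 0
while True:
    rtt_count += 1
    # Each round trip the whole window is acked nearly at once, so the next
    # window goes out as a near line-rate burst roughly twice the size of
    # the previous one.
    print(f"RTT {rtt_count}: burst of {cwnd} segments (~{cwnd * MSS / 1e6:.2f} MB)")
    if cwnd > BUFFER_PKTS:
        print("Burst exceeds the mid-path buffer: drops, slow start ends.")
        break
    cwnd *= 2

With these example numbers the burst outgrows the buffer around the seventh round trip, which at 52 ms per round trip is only a few hundred milliseconds into the flow.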
William Herrin wrote:
Hi David,
That sounds like normal TCP behavior over a long fat pipe.
No, not at all. First, though you explain slow start, it has nothing to do with a long fat pipe. The long fat pipe problem is addressed by window scaling (and SACK).

As David Hubbard wrote:

: I've got a non-rate-limited 10gig circuit

and

: The initial and recurring packet loss occurs on any flow of
: more than ~140 Mbit.

the problem is caused not by the wire speed limitation of a "fat" pipe but by artificial policing at 140M.

Masataka Ohta
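For reference, the window arithmetic behind both points, using the 10 Gbps and 52 ms figures from the thread (a back-of-the-envelope sketch, not a statement about what Cogent's gear is doing):

RTT = 0.052          # seconds, from the original post
LINE_RATE = 10e9     # bits per second

bdp = LINE_RATE * RTT / 8
print(f"Bandwidth-delay product: ~{bdp / 1e6:.0f} MB in flight to fill the pipe")

# Without window scaling (RFC 7323) the receive window tops out at 64 KB:
unscaled = 65535 * 8 / RTT
print(f"Max single-flow rate without window scaling: ~{unscaled / 1e6:.0f} Mbit/s")

# Window a single flow needs to sustain ~140 Mbit/s at this RTT:
win_140 = 140e6 * RTT / 8
print(f"Window needed for ~140 Mbit/s: ~{win_140 / 1e3:.0f} KB")

That works out to roughly 65 MB of in-flight data to fill the pipe, about 10 Mbit/s as the ceiling for an unscaled window, and roughly 910 KB of window for a 140 Mbit/s flow.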
On Mon, Sep 4, 2023 at 12:13 AM Masataka Ohta <mohta@necom830.hpcl.titech.ac.jp> wrote:
William Herrin wrote:
That sounds like normal TCP behavior over a long fat pipe.
No, not at all. First, though you explain slow start, it has nothing to do with a long fat pipe. The long fat pipe problem is addressed by window scaling (and SACK).
So, I've actually studied this in real-world conditions and TCP behaves exactly as I described in my previous email for exactly the reasons I explained. If you think it doesn't, you don't know what you're talking about.

Window scaling and SACK make it possible for TCP to grow to consume the entire end-to-end pipe when the pipe is at least as large as the originating interface and -empty- of other traffic. Those conditions are rarely found in the real world.

Regards,
Bill Herrin

--
William Herrin
bill@herrin.us
https://bill.herrin.us/
William Herrin wrote:
No, not at all. First, though you explain slow start, it has nothing to do with a long fat pipe. The long fat pipe problem is addressed by window scaling (and SACK).
So, I've actually studied this in real-world conditions and TCP behaves exactly as I described in my previous email for exactly the reasons I explained.
Yes, of course, which is my point. Your problem is that your point about slow start has nothing to do with a long fat pipe.
Window scaling and SACK make it possible for TCP to grow to consume the entire end-to-end pipe when the pipe is at least as large as the originating interface and -empty- of other traffic.
Totally wrong. Unless the pipe is long and fat, even a plain TCP without window scaling or SACK will grow to consume the entire end-to-end pipe when the pipe is at least as large as the originating interface and -empty- of other traffic.
Those conditions are rarely found in the real world.
It is usual that TCP consumes all the available bandwidth. Exceptions, not so rare in the real world, are plain TCPs over long fat pipes. Masataka Ohta
On Mon, Sep 4, 2023 at 7:07 AM Masataka Ohta <mohta@necom830.hpcl.titech.ac.jp> wrote:
William Herrin wrote:
So, I've actually studied this in real-world conditions and TCP behaves exactly as I described in my previous email for exactly the reasons I explained.
Yes, of course, which is my point. Your problem is that your point about slow start has nothing to do with a long fat pipe.
Well it doesn't show up in long slow pipes because the low transmission speed spaces out the packets, and it doesn't show up in short fat pipes because there's not enough delay to cause the burstiness. So I don't know how you figure it has nothing to do with long fat pipes, but you're plain wrong.

Regards,
Bill Herrin

--
William Herrin
bill@herrin.us
https://bill.herrin.us/
William Herrin wrote:
Well it doesn't show up in long slow pipes because the low transmission speed spaces out the packets,
Wrong. That is a phenomenon of slow access and a fast backbone, which has nothing to do with this thread. If the backbone is as slow as the access, no such "spacing out" is possible.
and it doesn't show up in short fat pipes because there's not enough delay to cause the burstiness.
Short pipe means speed of burst shows up continuously without interruption.
So I don't know how you figure it has nothing to do with long fat pipes,
That's your problem. Masataka Ohta
On Thu, 31 Aug 2023 at 23:56, Eric Kuhnke <eric.kuhnke@gmail.com> wrote:
The best working theory that several people I know in the neteng community have come up with is that Cogent does not want to adversely impact all the other customers on their router in some sites, where the site's upstreams and links to neighboring POPs are implemented as something like 4 x 10 Gbps, i.e., places where they have not yet upgraded that specific router to a full 100 Gbps upstream. Moving large flows >2 Gbps could flat-top a traffic chart on just one of those 10 Gbps circuits.
It is a very plausible theory, and everyone has this problem to a lesser or greater degree. There was a time when edge interfaces were much lower capacity than backbone interfaces, but I don't think that time will ever come back. So this problem is systemic.

Luckily there is quite a reasonable solution to the problem, called 'adaptive load balancing', where software monitors balancing, and biases the hash_result => egress_interface tables to improve balancing when dealing with elephant flows.

--
++ytti
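A rough sketch of that idea, with hypothetical member names and data structures (real implementations live in vendor forwarding code; this only shows the shape of the approach):

from collections import defaultdict

MEMBERS = ["et-0/0/0", "et-0/0/1", "et-0/0/2", "et-0/0/3"]   # example LAG members
NUM_BUCKETS = 256
# Static hashing: flow hash -> bucket -> member link.
bucket_to_member = [MEMBERS[b % len(MEMBERS)] for b in range(NUM_BUCKETS)]

def rebalance(bucket_rates_bps):
    """Re-point the heaviest bucket on the busiest member to the idlest
    member if that reduces the imbalance; bucket_rates_bps is measured
    traffic per hash bucket in bits per second."""
    member_load = defaultdict(float)
    for bucket, rate in enumerate(bucket_rates_bps):
        member_load[bucket_to_member[bucket]] += rate

    busiest = max(member_load, key=member_load.get)
    idlest = min(member_load, key=member_load.get)
    candidates = [b for b in range(NUM_BUCKETS) if bucket_to_member[b] == busiest]
    heaviest = max(candidates, key=lambda b: bucket_rates_bps[b])

    # Moving helps only if the elephant's rate is smaller than the gap
    # between the two members.
    if bucket_rates_bps[heaviest] < member_load[busiest] - member_load[idlest]:
        bucket_to_member[heaviest] = idlest

Run periodically against per-bucket counters, this biases the table so a single elephant flow stops pinning one member at line rate while the others sit idle.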
On 9/1/23 10:50, Saku Ytti wrote:
It is a very plausible theory, and everyone has this problem to a lesser or greater degree. There was a time when edge interfaces were much lower capacity than backbone interfaces, but I don't think that time will ever come back. So this problem is systemic. Luckily there is quite a reasonable solution to the problem, called 'adaptive load balancing', where software monitors balancing, and biases the hash_result => egress_interface tables to improve balancing when dealing with elephant flows.
We didn't have much success with FAT when the PE was an MX480 and the P a CRS-X (FP40 + FP140 line cards). This was regardless of whether the core links were native IP/MPLS or 802.1AX.

When we switched our P devices to PTX1000 and PTX10001, we've had surprisingly good performance of all manner of traffic across native IP/MPLS and 802.1AX links, even without explicitly configuring FAT for EoMPLS traffic.

Of course, our policy is to never transport EoMPLS services in excess of 40Gbps. Once a customer requires 41Gbps of EoMPLS service or more, we move them to EoDWDM. Cheaper and more scalable that way.

It does help that we operate both a Transport and IP/MPLS network, but I understand this may not be the case for most networks.

Mark.
On Fri, 1 Sept 2023 at 14:54, Mark Tinka <mark@tinka.africa> wrote:
When we switched our P devices to PTX1000 and PTX10001, we've had surprisingly good performance of all manner of traffic across native IP/MPLS and 802.1AX links, even without explicitly configuring FAT for EoMPLS traffic.
PTX and MX as LSR look inside pseudowire to see if it's IP (dangerous guess to make for LSR), CSR/ASR9k does not. So PTX and MX LSR will balance your pseudowire even without FAT. I've had no problem having ASR9k LSR balancing FAT PWs. However this is a bit of a sidebar, because the original problem is about elephant flows, which FAT does not help with. But adaptive balancing does. -- ++ytti
and I would say the OP wasn't even about elephant flows, just about a network that can't deliver anything acceptable.

-----
Mike Hammett
Intelligent Computing Solutions
http://www.ics-il.com
Midwest-IX
http://www.midwest-ix.com
On 9/1/23 15:44, Mike Hammett wrote:
and I would say the OP wasn't even about elephant flows, just about a network that can't deliver anything acceptable.
Unless Cogent are not trying to accept (and by extension, may not be able to guarantee) large Ethernet flows because they can't balance them across their various core links, end-to-end... Pure conjecture... Mark.
The initial and recurring packet loss occurs on any flow of more than ~140 Mbit. The fact that it’s loss-free under that rate is what furthers my opinion that it’s config-based somewhere, even though they say it isn’t.
Some interesting new developments on this, independent of the divergent network equipment discussion. 😊

Cogent had a field engineer at the east coast location where my local loop (10gig wave) meets their equipment, i.e. (me – patch cable to loop provider’s wave equipment – wave – patch cable to Cogent equipment). On the other end, the geographically distant west coast direction, it’s Cogent equipment to my equipment in the same facility with just a patch cable. They connected some model of EXFO’s NetBlazer FTBx 8880-series testing device to a port on their east coast network device, without disconnecting my circuit. Originally, they were planning to have someone physically loop at their equipment at the other end, but I volunteered that my Arista gear supports a provider-facing loop at the transceiver level if they wanted to try that, so my loop, cabling, and transceiver could be part of the testing. One direction at a time, they interrupted the point to point config to create a point to point between one direction of my gear, set to loopback mode, and the NetBlazer device. The device was set to use five parallel streams.

In the close direction, where the third-party wave is involved, they ran at the full 5 x 2gbps for thirty minutes, had zero packets lost, no issues. My monitoring confirmed this rate of port input was occurring, although oddly not output, but perhaps Arista doesn’t “see”/count the retransmitted packets in phy loopback mode.

In the distant direction, across their backbone, their equipment at the remote end, and the fiber patch cable to me, they tested at 9.5 Gbit for thirty minutes through my device in loopback mode. The result was, of 2.6B packets sent, only 334 packets lost. They configured for a 9.5 gbps rate of testing, so five 1.9gbps streams. Across the five streams, the report has a “frame loss” and out of sequence section. Zero out of sequence, but among the five streams, loss seconds / count were 3 / 26, 3 / 48, 1 / 5, 13 / 221, 1 / 34. I’m not familiar with this testing device, but to me that suggests it’s stating how many of the total seconds experienced loss, and the counted packet loss. So really the only one that stands out is the one with thirteen seconds where loss occurred, but the packet counts we’re talking about are minuscule. Again, my monitoring at the interface level showed this 9.5gbps of testing occurring for the thirty minutes the report says.

So, now I’m just completely confused. How is this device, traversing the same equipment, ports, and cables, able to achieve far greater average throughput, and almost no loss, across a very long duration? There are times I’ll be able to achieve nearly the same, but never for a test longer than ten seconds, as it just falls off from there. For example, I did a five parallel stream TCP test with iperf just now and did achieve a net throughput of 8.16 Gbps with about 1200 retransmits. Running the same five stream test for a half hour like theirs, I got no better than 2.64 Gbps and 183,000 retransmits. iperf and UDP allow me to see loss at any rate of transmit exceeding ~140mbps in just seconds, not a half hour.

To rule out my gear, I’m also able to perform the same tests from the same systems (both VM and physical) using public addresses and traversing the internet, as these are publicly connected systems. I get far lower loss and much greater throughput on the internet path. For example, a simple ten second test of a single stream at 400 Mbit UDP: 5 packets lost across the internet, 491 across the P2P.
Single stream TCP across the internet for ten seconds; 3.47 Gbps, 162 retransmits. Across the P2P, this time at least, 637 Mbps, 3633 retransmits.

David
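As a sanity check on those numbers, the classic Mathis/Semke/Mahdavi/Ott approximation relates sustainable single-flow TCP throughput to loss rate and RTT. Plugging in the figures quoted above gives an estimate under textbook assumptions, not a measurement of the circuit:

from math import sqrt

MSS = 1460      # bytes
RTT = 0.052     # seconds
C = 1.22        # constant for Reno-style loss recovery

def mathis_bps(loss_rate):
    # throughput ~= (MSS / RTT) * (C / sqrt(p))
    return (MSS * 8 / RTT) * (C / sqrt(loss_rate))

p_exfo = 334 / 2.6e9     # loss rate from the EXFO run quoted above
print(f"Loss {p_exfo:.1e} -> ~{mathis_bps(p_exfo) / 1e6:.0f} Mbit/s per flow")

# Loss rate that would hold a single flow to ~140 Mbit/s at this RTT:
p_140 = ((MSS * 8 / RTT) * C / 140e6) ** 2
print(f"~140 Mbit/s per flow corresponds to p ~= {p_140:.1e}")

Under the model's assumptions, even the tiny loss the tester saw would cap a single untuned Reno-style flow well under a gigabit at 52 ms, and a single-flow ceiling around 140 Mbit/s is consistent with a loss rate on the order of a few packets per million.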
On 9/1/23 15:29, Saku Ytti wrote:
PTX and MX as LSR look inside pseudowire to see if it's IP (dangerous guess to make for LSR), CSR/ASR9k does not. So PTX and MX LSR will balance your pseudowire even without FAT.
Yes, this was our conclusion as well after moving our core to PTX1000/10001. Mark.
On Fri, 1 Sept 2023 at 16:46, Mark Tinka <mark@tinka.africa> wrote:
Yes, this was our conclusion as well after moving our core to PTX1000/10001.
Personally I would recommend turning off LSR payload heuristics, because there is no accurate way for an LSR to tell what the label is carrying, and a wrong guess, while rare, will be extremely hard to root-cause, because you will never hear about it: the person suffering from it is too many hops away for the problem to be in your horizon. I strongly believe the edge imposing entropy or FAT is the right way to give the LSR hashing hints.

--
++ytti
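The mechanics, roughly: the ingress PE, which can safely parse the customer frame, hashes the flow itself and pushes the result as an extra label (a FAT pseudowire flow label per RFC 6391, or an entropy label per RFC 6790), so transit LSRs get per-flow entropy from the label stack without guessing at the payload. A toy sketch of that edge-side hashing, hypothetical and not any vendor's implementation:

import zlib

def flow_label(src_ip, dst_ip, proto, src_port, dst_port):
    """Derive a 20-bit label value from the customer flow's 5-tuple.
    Label values 0-15 are reserved special-purpose labels, so skip them."""
    key = f"{src_ip}|{dst_ip}|{proto}|{src_port}|{dst_port}".encode()
    return 16 + (zlib.crc32(key) % (2**20 - 16))

# For a FAT PW the flow label goes at the bottom of the stack, below the
# PW label; either way, transit LSRs hash the label stack and spread flows
# across ECMP/LAG members without parsing past the labels.
print(flow_label("192.0.2.1", "198.51.100.7", 6, 49152, 443))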
On Fri, 1 Sept 2023 at 15:55, Saku Ytti <saku@ytti.fi> wrote:
On Fri, 1 Sept 2023 at 16:46, Mark Tinka <mark@tinka.africa> wrote:
Yes, this was our conclusion as well after moving our core to PTX1000/10001.
Personally I would recommend turning off LSR payload heuristics, because there is no accurate way for an LSR to tell what the label is carrying, and a wrong guess, while rare, will be extremely hard to root-cause, because you will never hear about it: the person suffering from it is too many hops away for the problem to be in your horizon. I strongly believe the edge imposing entropy or FAT is the right way to give the LSR hashing hints.
If you need to load-balance labelled IP traffic, though, all your edge devices would have to impose entropy/FAT. On the other hand, a workaround at the edge, at least for EoMPLS, would be to enable control-word.

Lukas
On 9/1/23 15:55, Saku Ytti wrote:
Personally I would recommend turning off LSR payload heuristics, because there is no accurate way for an LSR to tell what the label is carrying, and a wrong guess, while rare, will be extremely hard to root-cause, because you will never hear about it: the person suffering from it is too many hops away for the problem to be in your horizon. I strongly believe the edge imposing entropy or FAT is the right way to give the LSR hashing hints.
PTX1000/10001 (Express) offers no real configurable options for load balancing the same way MX (Trio) does. This is what took us by surprise.

This is all we have on our PTX:

tinka@router# show forwarding-options
family inet6 {
    route-accounting;
}
load-balance-label-capability;

[edit]
tinka@router#

Mark.
On Fri, 1 Sept 2023 at 22:56, Mark Tinka <mark@tinka.africa> wrote:
PTX1000/10001 (Express) offers no real configurable options for load balancing the same way MX (Trio) does. This is what took us by surprise.
What in particular are you missing? As I explained, PTX/MX both allow for example speculating on transit pseudowires having CW on them. Which is non-default and requires 'zero-control-word'. You should be looking at 'hash-key' on PTX and 'enhanced-hash-key' on MX. You don't appear to have a single stanza configured, but I do wonder what you wanted to configure when you noticed the missing ability to do so. -- ++ytti
On 9/2/23 08:43, Saku Ytti wrote:
What in particular are you missing? As I explained, PTX/MX both allow for example speculating on transit pseudowires having CW on them. Which is non-default and requires 'zero-control-word'. You should be looking at 'hash-key' on PTX and 'enhanced-hash-key' on MX. You don't appear to have a single stanza configured, but I do wonder what you wanted to configure when you noticed the missing ability to do so.
Sorry for the confusion - let me provide some background context, since we deployed the PTX ages ago (and core nodes are typically boring).

The issue we ran into had to do with our deployment tooling, which was based on the 'enhanced-hash-key' knob that is required for MPCs on the MX. The tooling used to deploy the PTX was largely built on what we use to deploy the MX, with tweaks for critically different items. At the time, we did not know that the PTX required 'hash-key' as opposed to 'enhanced-hash-key'. So nothing got deployed on the PTX specifically for load balancing (we might have assumed it to be a non-existent or incomplete feature at the time).

So the "surprise" I speak of is how well it all worked with load balancing across LAGs and EoMPLS traffic compared to the CRS-X, despite not having any load balancing features explicitly configured, which is still the case today. It works, so we aren't keen to break it.

Mark.
Yes, adaptive load balancing very much helps, but the weakness is that it is normally only fully supported on vendor silicon, not merchant silicon. Much of the transport edge is merchant silicon, due to the far lower per-packet cost and the general requirement to just pass, not manipulate, packets. Using Nokia kit, for example, the 7750 does a great job of "adaptive-load-balancing" but the 7250 is lacklustre at best.
It doesn't help the OP at all, but this is why (thus far, anyway), I overwhelmingly prefer wavelength transport to anything switched. Can't have over-subscription or congestion issues on a wavelength.

-----
Mike Hammett
Intelligent Computing Solutions
http://www.ics-il.com
Midwest-IX
http://www.midwest-ix.com
On 9/1/23 21:52, Mike Hammett wrote:
It doesn't help the OP at all, but this is why (thus far, anyway), I overwhelmingly prefer wavelength transport to anything switched. Can't have over-subscription or congestion issues on a wavelength.
Large IP/MPLS operators insist on optical transport for their own backbone, but are more than willing to sell packet for transport. I find this amusing :-).

I submit that customers who can't afford large links (1Gbps or below) are forced into EoMPLS transport due to cost. Other customers are also forced into EoMPLS transport because there is no other option for long haul transport in their city other than a provider who can only offer EoMPLS.

There is a struggling trend from some medium-sized operators looking to turn an optical network into a packet network, i.e., they will ask for a 100Gbps EoDWDM port, but only seek to pay for a 25Gbps service. The large port is to allow them to scale in the future without too much hassle, but they want to pay for the bandwidth they use, which is hard to limit anyway if it's a proper EoDWDM channel. I am swatting such requests away because you tie up a full 100Gbps channel on the line side for the majority of hardware that does pure EoDWDM, which is a contradiction to the reason a packet network makes sense for sub-rate services.

Mark.
Cogent support has been about as bad as you can get. Everything is great, clean your fiber, iperf isn’t a good test, install a physical loop oh wait we don’t want that so go pull it back off, new updates come at three to seven day intervals, etc. If the performance had never been good to begin with I’d have just attributed this to their circuits, but since it worked until late June, I know something has changed. I’m hoping someone else has run into this and maybe knows of some hints I could give them to investigate. To me it sounds like there’s a rate limiter / policer defined somewhere in the circuit, or an overloaded interface/device we’re forced to traverse, but they assure me this is not the case and claim to have destroyed and rebuilt the logical circuit.
Sure smells like port buffer issues somewhere in the middle (mismatched deep/shallow, or something configured to support jumbo frames but buffers not optimized for them).
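Back-of-the-envelope on that suspicion, using the 52 ms and 10 Gbps figures from the thread; the buffer size below is a hypothetical example value, not a measurement of anything in the path:

RTT = 0.052
LINE_RATE = 10e9

bdp = LINE_RATE * RTT / 8
print(f"One RTT of traffic at 10 Gbps: ~{bdp / 1e6:.0f} MB")

buffer_bytes = 12e6     # hypothetical shallow shared buffer on a mid-path port
absorb_ms = buffer_bytes * 8 / LINE_RATE * 1e3
print(f"A {buffer_bytes / 1e6:.0f} MB buffer absorbs at most ~{absorb_ms:.1f} ms "
      "of line-rate excess before it drops")

With roughly 65 MB of data potentially in flight per round trip, a window-sized burst landing on an already-busy port with a shallow buffer drops packets long before the sender hears about it, which would be consistent with the loss-at-start and loss-every-few-seconds pattern in the original report.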
participants (10)

- David Hubbard
- Eric Kuhnke
- Lukas Tribus
- Mark Tinka
- Masataka Ohta
- Mike Hammett
- Saku Ytti
- Tom Beecher
- Tony Wicks
- William Herrin