400G link L2 flakiness over DWDM

We have a number of new 400G provider waves with odd flakiness that we can't seem to sort out.
These links appear good - light levels are fine, links will be up with working LLDP. But if we admin down / up the link on the IP side, many of them won't come back up on their own. Our provider sees "loss of alignment" alarms on the optical equipment, and we can eventually get the links back up if they bounce them a few times on the DWDM side.
We're using Juniper PTX's, with Juniper 400G FR4 and 400G LR4 optics. Same issue on different circuits/devices/paths with this provider, but no issues with other transport providers using the same optics on our end. Believe our provider is using Ciena optical equipment.
Has anyone seen issues like this before? Our provider has been looking into this for months without much progress. One tantalizing clue was another customer using Cisco Nexus had better results after they enabled "link transmit reset-skip" (https://quickview.cloudapps.cisco.com/quickview/bug/CSCvi45168). Not clear exactly what that option does, and there doesn't seem to be an obviously equivalent config knob in Junos.
Thanks, Oliver
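For anyone wanting to put a number on "many of them won't come back up on their own", here is a minimal Python sketch of a repeatable bounce test. It is not tied to any vendor API: `set_admin_up` and `is_oper_up` are hypothetical callables you would have to wire to your own tooling, and the hold-down, timeout and poll values are arbitrary assumptions.

```python
import time
from typing import Callable


def bounce_and_wait(
    set_admin_up: Callable[[bool], None],  # hypothetical hook: sets the admin state
    is_oper_up: Callable[[], bool],        # hypothetical hook: reads the oper state
    hold_down_s: float = 5.0,              # assumed hold-down, not a recommendation
    timeout_s: float = 60.0,               # assumed recovery window
    poll_s: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> bool:
    """Admin down, hold, admin up, then poll until oper-up or timeout."""
    set_admin_up(False)
    sleep(hold_down_s)
    set_admin_up(True)
    deadline = clock() + timeout_s
    while clock() < deadline:
        if is_oper_up():
            return True
        sleep(poll_s)
    return False


def recovery_rate(trials: int, **kwargs) -> float:
    """Fraction of bounces after which the link recovered on its own."""
    if trials <= 0:
        raise ValueError("trials must be positive")
    recovered = sum(bounce_and_wait(**kwargs) for _ in range(trials))
    return recovered / trials
```

Running the same number of trials per circuit would give the provider a comparable recovery rate for each wave, rather than an anecdotal "sometimes it comes back".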

On 1/10/25 21:57, Oliver Garraux wrote:
We have a number of new 400G provider waves with odd flakiness that we can't seem to sort out.
These links appear good - light levels are fine, links will be up with working LLDP. But if we admin down / up the link on the IP side, many of them won't come back up on their own. Our provider sees "loss of alignment" alarms on the optical equipment, and we can eventually get the links back up if they bounce them a few times on the DWDM side.
We're using Juniper PTX's, with Juniper 400G FR4 and 400G LR4 optics. Same issue on different circuits/devices/paths with this provider, but no issues with other transport providers using the same optics on our end. Believe our provider is using Ciena optical equipment.
Has anyone seen issues like this before? Our provider has been looking into this for months without much progress. One tantalizing clue was another customer using Cisco Nexus had better results after they enabled "link transmit reset-skip" (https://quickview.cloudapps.cisco.com/quickview/bug/CSCvi45168). Not clear exactly what that option does, and there doesn't seem to be an obviously equivalent config knob in Junos.
This sounds like a Ciena-specific issue.
I have found the below on their knowledge portal:
https://my.ciena.com/CienaPortal/s/article/6500-Submarine-How-to-clear-Loss-...
https://my.ciena.com/CienaPortal/s/article/Waveserver-5-Intermittent-errors-...
We don't run Ciena, but I have reached out to a good mate at Ciena to see if he is aware of this and if their kit is implicated. I'll let you know when I hear back.
Mark.

On 1/12/25 09:25, Mark Tinka wrote:
This sounds like a Ciena-specific issue.
I have found the below on their knowledge portal:
https://my.ciena.com/CienaPortal/s/article/6500-Submarine-How-to-clear-Loss-...
https://my.ciena.com/CienaPortal/s/article/Waveserver-5-Intermittent-errors-...
We don't run Ciena, but I have reached out to a good mate at Ciena to see if he is aware of this and if their kit is implicated. I'll let you know when I hear back.
No feedback yet from my Ciena contact, but I finally got full access to those knowledge articles.
The 6500 issue appears to be an AOC cable length difference between M2M (mate-to-mate) ports. Ensuring the AOC cables are the same length clears the LOA alarm.
On Waveserver, the issue appears to be a software defect that shows up when certain ports are configured with both Ethernet and OTU4 profiles for 400G services. Resolution is based on recreating the client service or hard rebooting the line card. There is no mention of a software upgrade, but since the article is from 2023, I'd expect that should already have been done.
Both the 6500 and Waveserver have this issue only when 400G services are delivered.
The articles do point to "submarine" applications, but I'd ignore that since the equipment used for terrestrial and submarine is largely the same (very minor differences in the line cards).
Will ping if my Ciena contact reaches back.
Would be good to validate the above with your provider, and let us know please. Thanks.
Mark.

Mark is the undisputed DWDM wizard!

On Waveserver, the issue appears to be a software defect that appears when certain ports are configured with both Ethernet and OTU4 profiles for 400G services. Resolution is based on recreating the client service or hard rebooting the line card. There is no mention of a software upgrade, but since the article is from 2023, I'd expect that should already have been done.
This is a not uncommon bug on Ciena stuff. Thankfully it doesn't happen often and is a straightforward fix, although bouncing a card is obv not ideal in a lot of cases.

On 1/13/25 17:10, Tom Beecher wrote:
This is a not uncommon bug on Ciena stuff. Thankfully it doesn't happen often and is a straightforward fix, although bouncing a card is obv not ideal in a lot of cases.
Indeed.
My Ciena contact responded. He will check with their support team on what they know about this issue. He deals mostly with subsea (SLTE) design.
Mark.

On 1/13/25 10:10, Tom Beecher wrote:
This is a not uncommon bug on Ciena stuff. Thankfully it doesn't happen often and is a straightforward fix, although bouncing a card is obv not ideal in a lot of cases.
I've been looking at the Waveserver line, so this is somewhat relevant to my interests.
Given that this is "not uncommon", do you or anyone else happen to know if there are other options, e.g. configuring the port for only one of Ethernet or OTU service?

On 1/13/25 18:29, Brandon Martin wrote:
On 1/13/25 10:10, Tom Beecher wrote:
This is a not uncommon bug on Ciena stuff. Thankfully it doesn't happen often and is a straightforward fix, although bouncing a card is obv not ideal in a lot of cases.
I've been looking at the Waveserver line, so this is somewhat relevant to my interests.
Given that this is "not uncommon", do you or anyone else happen to know if there are other options, e.g. configuring the port for only one of Ethernet or OTU service?
Based on how the issue was described in the report, the problem seems to occur in three scenarios.
The 1st is if multiple ports on the line card are configured with different mixes of Ethernet and OTN on ports 3 and 11, as well as on ports 4, 8, 12, and 16.
The 2nd is if 400G Ethernet is enabled on ports 6 and 12, but a mix of both Ethernet and OTN is configured on ports 6, 7, 12, and 13 without 400G.
The 3rd is if 400G Ethernet and OTN are mixed between ports 4 and 10.
It's all very confusing.
I'd certainly raise this with your Ciena account team before shelling out any cash, if you are keen to use them.
Mark.
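As a rough aid for reading those scenarios, the port groupings can be written down as a small Python check. This is only a sketch under loose assumptions: the group names, the split of the first scenario into two groups, and the `"ethernet"`/`"otn"` labels are illustrative choices, the 400G-specific conditions are ignored, and none of it comes from Ciena documentation. It simply flags any of the listed port groups that carries both service types.

```python
from typing import Dict, List, Tuple

# Port groups transcribed from the description above. The names and the
# split of the first scenario into two groups are assumptions.
RISKY_PORT_GROUPS: Dict[str, Tuple[int, ...]] = {
    "scenario 1a": (3, 11),
    "scenario 1b": (4, 8, 12, 16),
    "scenario 2": (6, 7, 12, 13),
    "scenario 3": (4, 10),
}


def mixed_groups(port_services: Dict[int, str]) -> List[str]:
    """Return the groups in which both 'ethernet' and 'otn' are configured.

    port_services maps a client port number to 'ethernet' or 'otn';
    unconfigured ports are simply left out of the mapping.
    """
    flagged = []
    for name, ports in RISKY_PORT_GROUPS.items():
        kinds = {port_services[p] for p in ports if p in port_services}
        if {"ethernet", "otn"} <= kinds:
            flagged.append(name)
    return flagged
```

For example, `mixed_groups({4: "otn", 10: "ethernet"})` flags only the third grouping. Whether a flagged group actually triggers the defect is a question for the Ciena account team.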

One tantalizing clue was another customer using Cisco Nexus had better results after they enabled "link transmit reset-skip" (https://quickview.cloudapps.cisco.com/quickview/bug/CSCvi45168). Not clear exactly what that option does, and there doesn't seem to be an obviously equivalent config knob in Junos.
I have a vague recollection of that option being there to basically ignore certain events that would normally cause link drop in order to keep an unstable link up. It was a long time ago; I sent a message to the guy I used to work with who might remember more.
Participants (5): Brandon Martin, Mark Tinka, Oliver Garraux, TJ Trout, Tom Beecher