100G, input errors and/or transceiver issues
Good day,

Over the last two years, organizations that I've worked with have upgraded equipment and now make regular use of 100G port speeds. For a frame of reference on use cases: these organizations use 100G within their own data centers and in carrier-neutral data centers, and lease 100G transport from larger carriers; they don't currently operate their own 100G coherent/long-haul networks.

- How commonly do other operators experience input errors with 100G interfaces?
- How often do you find that you have to change a transceiver out? Either for errors or another reason.
- Do we collectively expect this to improve as 100G becomes more common and production volumes increase in the future?

Thanks,
Graham
On Mon, 19 Jul 2021 at 19:47, Graham Johnston <johnston.grahamj@gmail.com> wrote:

> How commonly do other operators experience input errors with 100G interfaces? How often do you find that you have to change a transceiver out? Either for errors or another reason. Do we collectively expect this to improve as 100G becomes more common and production volumes increase in the future?

Hey Graham,
New rule. Share your own data before asking others to share theirs.

In the DC and SP markets, 100GE has dominated for several years now, so 'more common' rings odd to many ears. 112G SERDES is shipping on the electrical side, and there is nowhere more mature to go from a 100GE point of view. On the optical side, QSFP112 is really the only thing left to cost-optimise 100GE. We've had our share of MSA ambiguity issues with 100GE, but today 100GE looks mature to our eyes in terms of failure rates and compatibility. 1GE is really hard to support and 10GE is becoming problematic, in terms of hardware procurement.

--
++ytti
Saku,

I don't have long-term data compiled for the issues we've faced at this point. That said, we have two 100G transport links with a regular background level of input errors: one hovers between 0.00055 and 0.00383 PPS, the other between none and 0.00135 PPS (and it jumped to 0.03943 PPS over the weekend). The spread is usually between directions, i.e. the two directions of a link see different error rates, rather than a single direction fluctuating across that range. The data comes from the last 24 hours; the two links are operated by different providers on very different paths (opposite directions).

Over shorter distances, we've definitely seen input errors affect PNI connections within a datacenter as well. In the last PNI case, the other party swapped their transceiver and we didn't even physically touch our side; I note this only to say that I don't think it's just a matter of the transceivers we are sourcing.

Comparatively, other than clear transport system issues, I don't recall this sort of thing at all with the 10G "wavelength" transport we had purchased for years prior. I put wavelength in quotes knowing that it may have been a while since our transport was a literal wavelength rather than being muxed into a 100G+ wavelength.
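For a rough sense of scale, the short Python sketch below converts errored-frame rates like those quoted above into errors per day and an approximate bit error ratio. The 30% utilisation and the one-bit-error-per-errored-frame figures are assumptions for illustration, not data from the thread.

# Rough scale conversion for the errored-frame rates quoted above.
# Assumptions (hypothetical, not from the thread): 30% average link
# utilisation and ~1 bit error per errored frame.

def error_stats(errors_pps, utilization=0.30, link_bps=100e9):
    """Return (errors per day, approximate BER) for an errored-frame rate."""
    errors_per_day = errors_pps * 86_400
    offered_bps = link_bps * utilization      # bits actually carried per second
    approx_ber = errors_pps / offered_bps     # ~1 bit error per errored frame
    return errors_per_day, approx_ber

for pps in (0.00055, 0.00383, 0.03943):
    per_day, ber = error_stats(pps)
    print(f"{pps:.5f} PPS -> {per_day:7.1f} errors/day, BER ~ {ber:.1e}")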
We have a moderately dense deployment of 100-Gig LR4 (both DWDM lambdas and Juniper MX) around our WAN, and we don't clock any background input errors on our interfaces unless there is an ongoing problem.

That said, we have experienced issues with sub-millisecond link state changes between two endpoints that are physically cross-connected to one another with no intermediary Layer 1 (DWDM, etc.). There doesn't seem to be rhyme or reason to this; we've looked at each lane extensively and so far everything has been inconclusive. We also experienced some code issues on Juniper MPC3D-NGs running 100-Gig and on our DWDM client ports, where timing would start to slip and eventually cause the link to fail. Both Juniper and the DWDM vendor found code variances that they patched. We haven't had any such issues on Juniper MPC5s, 7s, or the 10003 line cards.

TL;DR: In my experience, 100-Gig might require some more TLC than 10-Gig to run clean and is more sensitive to variations in transport. Others' mileage may vary.

Best,

JJ Stonebraker | Associate Director
The University of Texas System | Office of Telecommunication Services
(512) 232-0888 | jjs@ots.utsystem.edu
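Where per-lane investigation comes up above, one low-effort starting point is to compare each lane's DOM receive power against the module's warning thresholds. The sketch below is illustrative only: the lane readings and threshold values are hypothetical stand-ins for what a platform reports (for example via Junos "show interfaces diagnostics optics"), not measurements from this thread.

# Flag lanes whose receive power falls outside warning thresholds.
# The readings and thresholds below are hypothetical per-lane DOM data;
# substitute values polled from your own platform.

RX_LOW_WARN_DBM = -10.5   # assumed module warning thresholds
RX_HIGH_WARN_DBM = 2.5

lane_rx_dbm = {0: -2.1, 1: -2.4, 2: -11.3, 3: -2.0}  # hypothetical readings

for lane, rx in sorted(lane_rx_dbm.items()):
    status = "OK"
    if rx < RX_LOW_WARN_DBM:
        status = "LOW (check connector/fiber for this lane)"
    elif rx > RX_HIGH_WARN_DBM:
        status = "HIGH (possible overload)"
    print(f"lane {lane}: {rx:6.1f} dBm  {status}")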
On Mon, 19 Jul 2021 at 20:19, Graham Johnston <johnston.grahamj@gmail.com> wrote:
> I don't at this point have long term data collection compiled for the issues that we've faced. That said, we have two 100G transport links that have a regular background level of input errors at ranges that hover between 0.00055 to 0.00383 PPS on one link, and none to 0.00135 PPS (that jumped to 0.03943 PPS over the weekend). The range is often directionally associated rather than variable
On a typical 100G link we do not see a single FCS error in a typical day. However, the Ethernet spec still allows a very high error rate of 10^-12, i.e. 1 error per 1 Tb (b, not B). That is, 1 error per 10 s, or 0.1 PPS, would be in-spec. We see much better performance than this, and vendors generally accept error rates well below that limit as legitimate errors rather than dismissing them as in-spec.

--
++ytti
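To spell out the arithmetic behind those figures (a restatement of the numbers above, counting roughly one errored frame per errored bit):

# In-spec error rate implied by the 1e-12 BER objective on a 100 Gb/s link.

link_bps = 100e9        # 100 Gb/s line rate
ber_limit = 1e-12       # Ethernet BER objective

errors_per_second = link_bps * ber_limit    # 0.1 errors/s
seconds_per_error = 1 / errors_per_second   # 10 s between errors

print(f"{errors_per_second:.1f} errors/s, i.e. one error roughly every {seconds_per_error:.0f} s")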
I will confirm my experience with this at $dayjob as well. We see interfaces with no errors over much longer periods of time, across several of the technology options. If you are seeing errors, there's likely something wrong, like unclean fiber or a bad optic/linecard.

- Jared
Thank you all for the consensus. What I hear from you is that 100G takes more care to operate error-free than 10G, which wasn't surprising to me; also that, generally, we should be able to operate without errors, or certainly with fewer than I'm currently observing, and that connector and transceiver interface cleanliness is our first likely point of investigation.

Thanks to all who responded.
You could also enable FEC on the link. This will remove any errors until the link quality is really far gone.

Regards,
Baldur
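For anyone wanting to quantify what FEC is absorbing, a common sanity check is to look at corrected versus uncorrected RS-FEC codeword counters and estimate a pre-FEC BER from them. The sketch below uses hypothetical counter values and assumes the RS(528,514) code from Clause 91 (528 ten-bit symbols, 5,280 bits per codeword) with roughly one bit error per corrected symbol; it is an eyeballing approximation, not a vendor formula.

# Estimate pre-FEC BER from RS-FEC (Clause 91) counters.
# Counter values are hypothetical; assumes RS(528,514): 528 symbols
# of 10 bits = 5,280 bits per codeword, ~1 bit error per corrected symbol.

BITS_PER_CODEWORD = 528 * 10

codewords_total = 4.0e12       # hypothetical: codewords received in the interval
corrected_symbols = 9_300      # hypothetical: FEC corrected symbol errors
uncorrected_codewords = 0      # hypothetical: codewords FEC could not repair

pre_fec_ber = corrected_symbols / (codewords_total * BITS_PER_CODEWORD)
print(f"pre-FEC BER ~ {pre_fec_ber:.2e}")

if uncorrected_codewords:
    print("uncorrected codewords present: errors are reaching the MAC layer")
else:
    print("FEC is absorbing every observed error in this interval")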
We... don't see anything like this... On the transport side, FEC is more than sufficient to effectively eliminate errors. On the LAN side, check your connections. Reiterating that this is not normal or expected behavior.

—L.B.

Ms. Lady Benjamin PD Cannon of Glencoe, ASCE
6x7 Networks & 6x7 Telecom, LLC
CEO
lb@6by7.net
"The only fully end-to-end encrypted global telecommunications company in the world."
FCC License KJ6FJJ
Hey Graham,

We're running 6x DWDM PAM4 100G links currently (for approx. 24 months). I just pulled up our 13-month statistics, and we've had fewer than 100 input errors in total across every link over that time period. Every single input error seems to correspond to a fibre cut/flap; I've seen an unprecedented 11 input errors total this month on one link. I'd note that "post covid economic reopening" has meant "run lots of construction equipment through fibre optic cables", and that's caused us more issues than any optic or interface has.

For reference, we're a Cisco/SmartOptics shop, and our links aren't super long distance - ~40-60 km, nothing that would require coherent optics.

Kevin Menzel
participants (7)
- Baldur Norddahl
- Graham Johnston
- Jared Mauch
- Kevin Menzel
- Lady Benjamin Cannon of Glencoe, ASCE
- Saku Ytti
- Stonebraker, Jack J