Hello,

We had a 100G link that started to misbehave and caused customers to notice bad packet loss. The optical values are fine, but we saw packet loss and latency. The interface shows FEC errors on one end and carrier transitions on the other. Otherwise the link stayed up, and our monitoring system completely failed to warn about the failure. We had to find the bad link with traceroute (mtr) and observe where the packet loss started.

The link was between a Juniper MX204 and a Juniper ACX5448. Link length is 2 meters, using 2 km single-mode SFP modules.

What is the best practice for monitoring links to avoid this scenario? What options do we have for link monitoring? I am investigating BFD, but I am unsure whether it would have helped in this situation.

Thanks,
Baldur
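On the BFD question: whether BFD catches this kind of partial loss depends on the detection multiplier and the loss pattern. As a rough illustration (an assumption-laden model, not a statement about Junos BFD internals), with independent random loss the session only drops when an entire detection window of consecutive control packets is lost:

```python
def p_window_lost(loss_rate: float, multiplier: int = 3) -> float:
    """Probability that one BFD detection window fails, i.e. that
    `multiplier` consecutive control packets are all lost, assuming
    independent random loss. Real loss is often bursty, so treat this
    only as a rough lower bound on detectability."""
    return loss_rate ** multiplier

# Even 10% random loss gives only about a 0.1% chance per window of
# tripping a multiplier-3 session, so BFD may keep a lossy link up.
```

Under these assumptions a 10% lossy link would take many detection windows before flapping, which supports the doubt above that BFD alone would have flagged this failure quickly.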
I'll sell you my SolarWinds license - cheap!

Pete Rohrman
Stage2 Support
212 497 8000, Opt. 2

On 4/29/21 4:39 PM, Baldur Norddahl wrote:
The Junipers on both sides should have discrete SNMP OIDs that return a FEC stress value or FEC error value; see the blue-highlighted part about FEC in the KB article below. Depending on which version of Junos you're running, the MIB for it may or may not exist.

https://kb.juniper.net/InfoCenter/index?page=content&id=KB36074&cat=MX2008&actp=LIST

On other equipment it is sometimes found in an SNMP sub-tree adjacent to the optical DOM values. Once you can acquire and poll that value, set it up as a custom metric to graph, and alert on threshold values in your NMS of choice.

Additionally, signs of a failing optic may show up in some of the optical DOM MIB items you can poll: https://mibs.observium.org/mib/JUNIPER-DOM-MIB/

It helps if you have some similar, non-misbehaving line cards and optics that can be polled while configuring the custom graph/OID, to establish a baseline 'no problem' value; exceeding it should trigger whatever threshold you set in your monitoring system.

On Thu, Apr 29, 2021 at 1:40 PM Baldur Norddahl <baldur.norddahl@gmail.com> wrote:
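Once a FEC-error counter OID is found, the poll-and-threshold loop described above can be sketched like this. The SNMP fetch itself is abstracted into a callable, since the exact OID is platform- and release-dependent:

```python
import time
from typing import Callable

def fec_error_rate(poll: Callable[[], int], interval_s: float = 1.0) -> float:
    """Corrected-FEC-error rate (errors/second) between two polls of a
    monotonically increasing counter. `poll` would wrap an SNMP GET of
    the platform's FEC error OID; any callable returning the counter works."""
    first = poll()
    time.sleep(interval_s)
    second = poll()
    # Guard against counter wrap or reset between polls.
    return max(second - first, 0) / interval_s

def should_alert(rate: float, threshold: float) -> bool:
    """Alert when the observed rate exceeds the configured NMS threshold."""
    return rate > threshold
```

In an NMS this would run per interface on each poll cycle, graphing the rate and alerting via the threshold check.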
We monitor light levels and FEC values on all links and have thresholds for early-warning and pre-failure analysis.

Short answer is yes, we see links lose packets before completely failing, and for dozens of reasons that's still a good thing, but you need to monitor every part of a resilient network.

Ms. Lady Benjamin PD Cannon of Glencoe, ASCE
6x7 Networks & 6x7 Telecom, LLC
CEO
lb@6by7.net
"The only fully end-to-end encrypted global telecommunications company in the world."

FCC License KJ6FJJ

Sent from my iPhone via RFC1149.
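The early-warning thresholds described above can be derived from known-good links: poll the same counter on healthy links and set the threshold a few standard deviations above their mean. A simple sketch, assuming the healthy-link noise is roughly normal:

```python
import statistics

def baseline_threshold(healthy_samples: list, k: float = 3.0) -> float:
    """Early-warning threshold derived from samples polled on known-good
    links: mean + k standard deviations. `k` trades sensitivity for
    false-alarm rate."""
    mu = statistics.mean(healthy_samples)
    sigma = statistics.stdev(healthy_samples) if len(healthy_samples) > 1 else 0.0
    return mu + k * sigma
```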
On Apr 29, 2021, at 2:32 PM, Eric Kuhnke <eric.kuhnke@gmail.com> wrote:
If I may add one thing I forgot; this post reminded me. The question was probably about a 100G CWDM4 short-distance link. When monitoring a 100G coherent (QPSK, 16QAM, whatever) longer-distance link, be absolutely sure to poll all of its SNMP OIDs the same as if it were a point-to-point microwave link. Depending on exactly which line card and optic it is, it may behave somewhat like a faded or misaligned radio link under conditions related to degradation of the fiber or the lasers.

In particular, I'm thinking of coherent 100G line cards that can switch on the fly between 'low FEC' and 'high FEC' payload-vs-FEC percentages (much as an ACM-capable 18 or 23 GHz band radio would), which should absolutely trigger an alarm. Also poll the data for FEC decode stress percentage level, etc.

On Thu, Apr 29, 2021 at 2:37 PM Lady Benjamin Cannon of Glencoe, ASCE <lb@6by7.net> wrote:
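For the FEC-stress point above, the quantity worth graphing is the pre-FEC bit error ratio against the code's correction limit. A minimal sketch; the limit itself is a property of the optic and FEC in use (check the datasheet), not something assumed here:

```python
def pre_fec_ber(corrected_bits: int, total_bits: int) -> float:
    """Estimate the pre-FEC bit error ratio from the corrected-bit
    counter over an interval."""
    return corrected_bits / total_bits if total_bits else 0.0

def fec_margin_used(ber: float, ber_limit: float) -> float:
    """Fraction of the FEC correction budget consumed. Approaching 1.0
    means the link is near the post-FEC error cliff, so alarm well
    before that."""
    return ber / ber_limit
```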
Yes, the JNP DOM MIB is what you are looking for.

It also has traps for the warning and alarm thresholds, which are driven by the optic's own parameters. (CLI equivalent: show interfaces diagnostics optics <interface>)

TL;DR:

Realtime: traps; Monitoring: DOM MIB.

PS: I suggest you join the [ juniper-nsp@puck.nether.net ] mailing list.

-----
Alain Hebert    ahebert@pubnix.net
PubNIX Inc.
50 boul. St-Charles, P.O. Box 26770
Beaconsfield, Quebec H9W 6G7
Tel: 514-990-5911  http://www.pubnix.net  Fax: 514-990-9443

On 4/29/21 5:32 PM, Eric Kuhnke wrote:
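A quick way to watch the DOM values without SNMP is to scrape the CLI output of the command above. The sample text below is illustrative only (field names and spacing vary by platform and optic); the parser just assumes the general 'name : value' line layout:

```python
# Illustrative sample of `show interfaces diagnostics optics` output;
# the real field set depends on platform, Junos release, and optic.
SAMPLE = """\
Laser output power                        :  0.5230 mW / -2.81 dBm
Receiver signal average optical power     :  0.4870 mW / -3.12 dBm
Laser rx power low warning threshold      :  0.0631 mW / -12.00 dBm
"""

def parse_optics(cli_output: str) -> dict:
    """Parse 'name : value' lines into a dict keyed by field name."""
    fields = {}
    for line in cli_output.splitlines():
        key, sep, value = line.partition(":")
        if sep:
            fields[key.strip()] = value.strip()
    return fields
```

Comparing the received power against the optic's own warning thresholds then gives the same early warning the traps provide.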
What NMS is everyone using to graph and alert on this data?

On Fri, Apr 30, 2021 at 7:49 AM Alain Hebert <ahebert@pubnix.net> wrote:
Y.1731 or TWAMP, if available on those devices.

On Fri, Apr 30, 2021 at 5:57 PM, Colton Conor <colton.conor@gmail.com> wrote:
On Fri, 30 Apr 2021 at 00:35, Eric Kuhnke <eric.kuhnke@gmail.com> wrote:
The Junipers on both sides should have discrete SNMP OIDs that respond with a FEC stress value, or FEC error value. See blue highlighted part here about FEC. Depending on what version of JunOS you're running the MIB for it may or may not exist.
This feature will be introduced by ER-079886 at some future date. You may be confused by OTN FEC, which is available via MIB but unrelated to the topic. I did plan to open a feature request with other vendors too, but I've been lazy. It is broadly missing; as a community we are doing very little to address problems before they become symptomatic, and we undercapitalise the information we already have from DDM and RS-FEC.

Only slightly on-topic: people who interact with optical vendors might want to ask about propagating RS-FEC correctable errors. RS-FEC is of course point-to-point, so in your active optical system it terminates at the first hop. But technically nothing stops the far-end optical link from inducing an RS-FEC correctable error if there was an error. There could even be a standard way to discriminate an organic near-hop RS-FEC correctable error from an induced one. We have a sort of precedent for this: some cut-through switches can discriminate a near-hop FCS error from other FCS errors, because the sender only knows about an FCS error after it has already sent the frame, but it can append a symbol in that case to let the receiver know the error is not near-end. This allows the receiver to keep two FCS counters.

--
++ytti
We use LibreNMS and Smokeping to monitor latency and dropped packets on all our links, and we set up alerts if they go over a certain threshold. We are working on a script that automatically reroutes traffic based on the alerts, routing around the bad link to give us time to fix it.

Thanks,
Travis

From: NANOG <nanog-bounces+tgarrison=netviscom.com@nanog.org> On Behalf Of Baldur Norddahl
Sent: Thursday, April 29, 2021 3:39 PM
To: nanog@nanog.org
Subject: link monitoring
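The reroute-on-alert script described above needs a debounced decision so a single noisy Smokeping sample doesn't drain a link. A sketch of that logic (class name and thresholds are illustrative; the real action would be raising an IGP metric or paging someone):

```python
from collections import deque

class DrainDecider:
    """Drain a link only after `n` consecutive loss samples exceed
    `loss_pct`, to avoid flapping on a single bad poll."""

    def __init__(self, loss_pct: float = 1.0, n: int = 3) -> None:
        self.loss_pct = loss_pct
        self.samples = deque(maxlen=n)

    def observe(self, loss: float) -> bool:
        """Record one loss sample (percent); return True when the link
        should be drained."""
        self.samples.append(loss)
        return (len(self.samples) == self.samples.maxlen
                and all(s > self.loss_pct for s in self.samples))
```

A good sample immediately clears the condition, since the sliding window must again fill with bad samples before the link is drained.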
participants (9)
- Alain Hebert
- Baldur Norddahl
- Colton Conor
- Eric Kuhnke
- Lady Benjamin Cannon of Glencoe, ASCE
- Michel Blais
- Pete Rohrman
- Saku Ytti
- Travis Garrison