Is there a method or tool(s) to prove network outages?
Hi Everyone
Please, I have a very problematic radio link which goes out and comes back again every few hours. The only way I know this is happening is from my gateway device: a Sophos UTM that sends email any time there's been an outage.
The ISP refuses to accept this as proof of outage or instability, and I'm wondering if there's something I can run behind the gateway UTM that can provide output information over time. They seem to be primarily a Windows+Cisco shop (as is common here in the 4th world). We are primarily Linux. Is there some set of command incantations I can run whose output I can collect and send to them (besides some sort of sustained ping)?
Thanks in advance!
Sina
On 12/1/13, 8:56 AM, Notify Me wrote:
> Is there some set of command incantations I can run whose output I can collect and send to them (besides some sort of sustained ping)?
Given a measurement target on the customer side and a smokeping instance on your side, you can actively measure the availability, latency, and loss rates between them.
On Dec 2, 2013, at 12:19 AM, joel jaeggli <joelja@bogus.com> wrote:
> Given a measurement target on the customer side and smokeping instance on your side you can actively measure the availability/latency/loss rates between them.
I think he's actually the end-customer, and he's saying that his upstream transit ISP won't accept non-RF-specific diags . . .
-----------------------------------------------------------------------
Roland Dobbins <rdobbins@arbor.net> // <http://www.arbornetworks.com>
Luck is the residue of opportunity and design.
-- John Milton
On 12/1/13, 9:23 AM, Dobbins, Roland wrote:
> I think he's actually the end-customer, and he's saying that his upstream transit ISP won't accept non-RF-specific diags . . .
and if you don't control any of the air interfaces you don't get that.
I'm actually halfway through trying to set up a smokeping appliance.
Sent from my BlackBerry wireless device from MTN
I'm a big fan of smokeping personally (http://oss.oetiker.ch/smokeping/). It will install on practically any Linux device, and I've even installed it on ~$50 consumer NAS or router-type devices (e.g. a Pogoplug NAS) if the device has enough RAM (128MB is dicey but works). It shows you loss and latency. You set it up to ping, e.g., the near and far ends of the radio link, and then maybe a few sentinel sites. It can do ICMP and also TCP.
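For readers who want to see the shape of that measurement before installing anything, here is a minimal Python sketch of the same idea: probe the near end, the far end, and a sentinel site, then report loss and median latency per target. It is not smokeping and does not use its configuration format. It uses TCP connects only, and the target names, addresses (documentation ranges), and ports are hypothetical placeholders.

```python
import socket
import statistics
import time


def tcp_probe(host, port, timeout=2.0):
    """Return the TCP connect time in seconds, or None if the connect fails."""
    start = time.monotonic()
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return time.monotonic() - start
    except OSError:
        # Covers timeouts, refused connections, and name resolution failures.
        return None


def summarize(samples):
    """Summarize samples for one target (latency in seconds, None = lost)."""
    answered = [s for s in samples if s is not None]
    lost = len(samples) - len(answered)
    loss_pct = 100.0 * lost / len(samples) if samples else 0.0
    median_ms = statistics.median(answered) * 1000.0 if answered else None
    return {"sent": len(samples), "loss_pct": loss_pct, "median_ms": median_ms}


def probe_round(targets, count=5, probe=tcp_probe):
    """Probe every (host, port) target `count` times and summarize each one."""
    return {
        name: summarize([probe(host, port) for _ in range(count)])
        for name, (host, port) in targets.items()
    }


# Hypothetical targets: substitute the near end and far end of the radio
# link and one or more sentinel sites that accept TCP connections.
TARGETS = {
    "near_end": ("192.0.2.1", 80),
    "far_end": ("198.51.100.1", 80),
    "sentinel": ("203.0.113.10", 443),
}
```

Calling `probe_round(TARGETS)` from cron every minute or so and appending the result to a file gives a crude loss and latency history; the `probe` argument exists so the summary logic can be exercised without touching the network.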
On Dec 1, 2013, at 11:56 PM, Notify Me <notify.sina@gmail.com> wrote:
> Is there some set of command incantations I can run who's output I can collect and send to them (besides some sort of sustained ping)?
Do you have wireless CPE within your span of administrative control?
-----------------------------------------------------------------------
Roland Dobbins <rdobbins@arbor.net> // <http://www.arbornetworks.com>
Luck is the residue of opportunity and design.
-- John Milton
No, I don't.
Sina,
I'd recommend using Zenoss to monitor the remote end of the link, at least with /Status/Ping. You'll get alerts when Zenoss can't ping across the link, and you may be able to set up SNMP traps on your router for the link itself going down.
DISCLOSURE: I work for Zenoss; however, I used Zenoss Core long before they decided to pay me money.
Good luck dealing with your ISP; it's _ALWAYS_ a pain in situations like this.
Andrew
Thanks a lot, I'll definitely consider it.
Ask them for a plot of your SNR (signal-to-noise ratio) and your RSL (receive signal level). In the RF realm, it's pretty difficult to fabricate receive power. :)
Sent from my Mobile Device.
On Sun, Dec 01, 2013 at 05:56:51PM +0100, Notify Me wrote:
> Please I have a very problematic radio link which goes out and back on again every few hours. The only way I know this is happening is from my gateway device: a Sophos UTM that sends email anytime there's been an outage.
> The ISP refuses to accept this as outage/instability proof, and I'm wondering if there's something I can run behind the gateway UTM that can provide output information over time.
I'm surprised nobody's mentioned the root question to answer before you go off spending time setting up anything in particular: what *will* the ISP accept (or be forced to accept) as outage/instability proof?
Contracts are your first line of defence, but it's nigh-on universal that they don't cover these sorts of situations well enough. So you probably need to have a discussion, as a follow-on from being told that your UTM's e-mails *aren't* sufficient, to determine what *is* sufficient. Once you've got that, only then can you evaluate appropriate methods of gathering the necessary data to support a claim of an outage.
I like the *idea* of smokeping, but when gathering data on complete service loss (which was my use case for it as well) I found its methods of collecting and displaying that data to be very suboptimal and counter-intuitive. For something small and once-off like this, I'd probably just break out my text editor and script up something that would collect the relevant data and process it into the acceptable form.
- Matt
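As an illustration of the "script up something" approach, here is a minimal Python sketch that records timestamped reachability samples and collapses them into a list of outages with start time, end time, and duration. It is one possible shape, not a recommendation from the thread. It assumes the Linux iputils `ping` (`-n`, `-c`, `-W` flags); the three-miss threshold, the probe interval, and the function names are arbitrary illustrative choices.

```python
import subprocess
import time
from datetime import datetime


def ping_once(host, timeout_s=2):
    """Send one ICMP echo with the Linux iputils ping; True if it was answered."""
    result = subprocess.run(
        ["ping", "-n", "-c", "1", "-W", str(timeout_s), host],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    return result.returncode == 0


def collect(host, rounds, interval_s=10, probe=ping_once, sleep=time.sleep):
    """Take `rounds` samples of (timestamp, reachable), `interval_s` apart."""
    samples = []
    for _ in range(rounds):
        samples.append((datetime.now(), probe(host)))
        sleep(interval_s)
    return samples


def outage_intervals(samples, min_failures=3):
    """Collapse (timestamp, ok) samples into (start, end, duration) outages.

    A run of failures counts as an outage only if it contains at least
    `min_failures` consecutive misses, so one lost packet is ignored.
    `end` is the timestamp of the first successful probe after the run.
    """
    outages = []
    run_start = None
    run_len = 0
    for ts, ok in samples:
        if not ok:
            if run_start is None:
                run_start = ts
            run_len += 1
            continue
        if run_start is not None and run_len >= min_failures:
            outages.append((run_start, ts, ts - run_start))
        run_start, run_len = None, 0
    # An outage still in progress when the log ends has no end time yet.
    if run_start is not None and run_len >= min_failures:
        outages.append((run_start, None, None))
    return outages
```

Whatever form the ISP agrees to accept, a table of start, end, and duration per outage is usually easier to argue over than raw ping output.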
Hmm. Great points. Didn't think of that.
if you do not control the rf end [0], then i assume the upstream supplies it and is really selling you connectivity to behind the rf cpe. so you should show you do not have that connectivity, the rf is a red herring.
use smokeping or any other tool from immediately behind the rf cpe with a target of the first layer three hop beyond your network.
but, if the upstream is in solid denial and is not actually accepting that they have a contractual obligation, measurement and technology are not going to help you.
randy
--
[0] - literally NO control, e.g. not antenna placement, power supplied, ...
On Dec 1, 2013, at 11:50 AM, Matt Palmer <mpalmer@hezmatt.org> wrote:
> I'm surprised nobody's mentioned the root question to answer before you go off spending time setting up anything in particular: what *will* the ISP accept (or be forced to accept) as outage/instability proof? Contracts are your first line of defence, but it's nigh-on universal that they don't cover these sorts of situations well enough. So you probably need to have a discussion, as a follow-on from being told that your UTM's e-mails *aren't* sufficient, to determine what *is* sufficient.
This.
They may not cooperate, in which case you have to force proof down their throats.
I would go with the Zenoss (or Zabbix, or...) option: a free-to-use, professionally supported, professional-grade, commonly used monitoring package that would meet anyone's basic "credible tool" definition, plus a neat GUI to send a snapshot of the results.
Use it to perform various tests of the net: pings and HTTP GETs of some small target, starting with the next hop outside your premises and working outwards to the outside world. Don't overwhelm your net with tests, but test as often as needed to demonstrate an issue.
-george william herbert
george.herbert@gmail.com
Sent from Kangphone
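To illustrate the "start at the next hop and work outwards" idea, here is a small Python sketch that takes one round of reachability results, ordered from the nearest hop to the farthest, and reports the boundary where connectivity stops. The hop names are hypothetical. A hop that fails while a farther hop still answers is treated as a device that merely ignores probes, which is an assumption of this sketch and not something stated in the thread.

```python
def locate_fault(results):
    """Find where connectivity stops in one round of probe results.

    `results` is a list of (hop_name, reachable) ordered from the nearest
    hop to the farthest. Returns (last_reachable, first_unreachable), with
    last_reachable set to None if nothing answered, or None if the
    farthest hop answered (no fault visible in this round).
    """
    reachable = [i for i, (_, ok) in enumerate(results) if ok]
    last_ok = reachable[-1] if reachable else -1
    if last_ok == len(results) - 1:
        return None
    last_name = results[last_ok][0] if last_ok >= 0 else None
    return (last_name, results[last_ok + 1][0])


# Hypothetical probe order, nearest first.
example_round = [
    ("utm_lan", True),
    ("rf_cpe", True),
    ("isp_first_hop", False),
    ("sentinel_site", False),
]
```

With `example_round`, the function returns `("rf_cpe", "isp_first_hop")`: everything up to the RF CPE answers and nothing beyond it does, which is the kind of boundary worth timestamping and sending upstream.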
On Dec 2, 2013, at 2:50 AM, Matt Palmer <mpalmer@hezmatt.org> wrote:
> I'm surprised nobody's mentioned the root question to answer before you go off spending time setting up anything in particular: what *will* the ISP accept (or be forced to accept) as outage/instability proof?
That was my point - if the upstream won't accept ICMP pings, what's the likelihood he'll accept anything else, either?
-----------------------------------------------------------------------
Roland Dobbins <rdobbins@arbor.net> // <http://www.arbornetworks.com>
Luck is the residue of opportunity and design.
-- John Milton
On Sun, 1 Dec 2013 17:56:51 +0100, Notify Me <notify.sina@gmail.com> said:
> I have a very problematic radio link which goes out and back on again every few hours.
Is "every few hours" regular/cyclical? Does the radio link cross a tidal body of water?
-w
It's cyclical, but I have not tried to graph or measure its repetition before now (when I noticed the emails filling up my inbox). A tidal body of water... could be, but I wasn't involved in the installation, so I can't actually tell where the antennas are pointing.
On Sun, 1 Dec 2013 20:25:36 +0000, Sina Owolabi <notify.sina@gmail.com> said:
> Its cyclical, but I have not tried to graph/measure its repetition before now... Body of tidal water..could be
This is speculation until you have measurements, but if this is the case I'd wager you are having reflected signal interference off of the water. The water acts like a mirror and as it moves up and down the reflected signal will move in and out of phase with the main signal. At certain points you'll get near complete cancellation and the link will fail.
See section 4 here for some explanations, fig 5 and 6 for what you could expect the graphs of signal strength, time, link capacity to look like:
http://homepages.inf.ed.ac.uk/mmarina/papers/mobicom_winsdr08.pdf
But not having access to the RF part you can't measure this directly. If you can get tide tables for a nearby location, what you could do is say that signal strength is 1 if the link is working and 0 if it is not. Measure for a while then scatterplot that against the level of the tide. If the measurements of 0 group tightly together in a few spots then you know definitely what is happening. Perhaps that plot together with a pointer to a nice academic paper would be enough to convince the provider of what is happening.
What could you do about this?
If you are lucky and the interference does not complete a full cycle from destructive to constructive and back with the largest amplitude of the tides that you experience in that place, you could try moving the antenna up or down. How much depends on the frequency and distances involved but I'd try 25cm increments up to a couple of meters if you can. You'll still get degradation but can hopefully avoid the deep nulls that take the link out completely.
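A rough numerical version of the tide scatterplot described above, as a Python sketch: pair each up/down sample with the tide height at that time, bin by height, and look at the fraction of "down" samples per bin. If the outages are tidal, the down fraction should be concentrated in a few bins. The 0.25 m bin width and the sample data are invented for illustration; real input would come from the link log and a published tide table.

```python
from collections import defaultdict


def outage_fraction_by_tide(samples, bin_width_m=0.25):
    """Fraction of 'link down' samples per tide-height bin.

    `samples` is an iterable of (tide_height_m, link_up), where link_up is
    1/True when the link worked and 0/False when it did not. Returns a
    dict mapping the lower edge of each bin (metres) to the down fraction.
    """
    total = defaultdict(int)
    down = defaultdict(int)
    for height, up in samples:
        bin_start = round((height // bin_width_m) * bin_width_m, 6)
        total[bin_start] += 1
        if not up:
            down[bin_start] += 1
    return {b: down[b] / total[b] for b in sorted(total)}


# Synthetic example data, not measurements: (tide height in m, link up?).
example = [
    (0.10, 1), (0.20, 1), (0.60, 1), (0.70, 1),
    (1.05, 0), (1.10, 0), (1.20, 1), (1.60, 1),
]
```

In the synthetic data every outage falls in the 1.0 m to 1.25 m bin, which is the tight grouping the paragraph above describes. Evenly spread outages would point away from a tidal cause.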
If you are able and willing to replace the end-site radios or antennas with your own, and the link uses some sort of 2xN MIMO, you could arrange vertical spacing between the antennas so that you have a good signal at one antenna when the other one is experiencing a null. This should get you on average half the best-case throughput the equipment is capable of but it should get you that consistently. The actual spacing depends on the distances and heights involved.
-w
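To give a feel for how the spacing depends on distances and heights, here is a Python sketch of the textbook flat-surface two-ray model. It assumes a perfectly flat reflecting surface with a reflection coefficient of -1 (a common approximation at grazing angles over water), ignores earth curvature and antenna patterns, and is not taken from the paper linked above. The frequency, distance, and heights in the usage note are hypothetical.

```python
import math

C = 299_792_458.0  # speed of light in m/s


def path_difference_m(h_tx, h_rx, dist):
    """Extra length of the surface-reflected path over the direct path.

    Heights are measured above the reflecting surface, so a rising tide
    reduces both h_tx and h_rx by the same amount.
    """
    direct = math.hypot(dist, h_tx - h_rx)
    reflected = math.hypot(dist, h_tx + h_rx)
    return reflected - direct


def relative_amplitude(h_tx, h_rx, dist, freq_hz):
    """Two-ray field strength relative to the direct ray alone.

    0 is a deep null and 2 is full reinforcement, under the assumption
    that the reflection coefficient is exactly -1.
    """
    wavelength = C / freq_hz
    delta = path_difference_m(h_tx, h_rx, dist)
    return 2.0 * abs(math.sin(math.pi * delta / wavelength))


def diversity_spacing_m(h_tx, dist, freq_hz):
    """Approximate vertical offset between two receive antennas so that
    one sits at a peak when the other is in a null (dist >> heights)."""
    wavelength = C / freq_hz
    return wavelength * dist / (4.0 * h_tx)
```

For example, `diversity_spacing_m(20.0, 10_000.0, 5.8e9)` evaluates the spacing for a hypothetical 5.8 GHz link over 10 km with the far antenna 20 m above the water. Sweeping `h_rx` (or subtracting a tide height from both heights) through `relative_amplitude` shows how quickly the nulls move.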
I would hold off on considering multipath as a problem until you see the RSL. There is no reason to go to the worst-case scenario. In addition, there are some mitigation techniques we use (OFDM, XPIC, etc.) that will help null out some multipath should that be the case.
With that being said, you should probably hire an RF engineer rather than attempt this yourself. If you guys are having path problems, talk to the guy who designed the path. If there "wasn't" a "guy" who "designed" the path, this is what you get.
//warren
On Dec 2, 2013, at 6:26 AM, Warren Bailey <wbailey@satelliteintelligencegroup.com> wrote:
> I would hold off on considering Multipath as a problem until you see the RSL.
Concur. It could also be related to precipitation or other adverse conditions.
Or, in fact, it could be related to the 'UTM' box and/or something else on the endpoint network. It could be a periodic DDoS attack, or traffic causing an availability hit as an unintended consequence.
It's difficult to say without data. Since the OP has the ability to gather IP-level data on his own network, he should utilize whatever instrumentation and telemetry he can set up in order to diagnose the issue as accurately as possible.
And the OP should dig out his SLA and see what it says about the obligations of his upstream.
-----------------------------------------------------------------------
Roland Dobbins <rdobbins@arbor.net> // <http://www.arbornetworks.com>
Luck is the residue of opportunity and design.
-- John Milton
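If the SLA turns out to state an availability figure, the IP-level data can be reduced to a comparable number. The Python sketch below computes measured availability over an observation window from a list of outage intervals. It assumes the intervals do not overlap, and any SLA target it is compared against is a placeholder to be replaced by whatever the contract actually says.

```python
from datetime import datetime, timedelta


def availability_pct(window_start, window_end, outages):
    """Percentage of the window during which the link was up.

    `outages` is a list of non-overlapping (start, end) datetime pairs.
    Outages are clipped to the window before being summed.
    """
    window_s = (window_end - window_start).total_seconds()
    down_s = 0.0
    for start, end in outages:
        clipped_start = max(start, window_start)
        clipped_end = min(end, window_end)
        if clipped_end > clipped_start:
            down_s += (clipped_end - clipped_start).total_seconds()
    return 100.0 * (1.0 - down_s / window_s)
```

Two 36-minute outages in a 24-hour window, for instance, are 72 minutes down out of 1440, or 95% availability. That figure can then be set against the contractual target, if the contract has one.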
Keep in mind that inter web traffic has nothing to do with the overall health of the radio link. In RF land, we really don't care what is going over that link, just that we have enough RSL hitting the receiver to be above threshold, thus allowing the box to demodulate that signal. If your radio is sitting at a threshold RSL of -108 and you're coming in at -105, big trouble in little China (3dB fade murdered your link). Stop thinking like a network engineer. If your DS-1 was taking hits, an ICMP request (or lack thereof) would mean little (read: zero) to me as an RF engineer. I want to see the BER/PER of the circuit over time so I can correlate possible trouble with real-world issues.
With that being said, the tidal issue comes up a lot, and more times than not I see someone who said "Point that dish over there", and when it magically works they have earned the title of "Best RF Engineer in History" until the tide rolls in and their link suddenly has "issues". The invention of cheap wireless has caused many people to believe they have in-depth wireless experience, and that is usually not the case. Not trying to preach, but I've spent a *TON* of time and other people's money in multipath land. If someone had been responsible for the proper design of the link, multipath would not be a factor, as it would have been addressed early on.
You are not going to gain much traction with a wireless company when you call and tell them your pings aren't working. They are kind of like parents: they just don't understand. ;)
//warren
PS - I welcome any replies on or off list. I know how frustrating it can be to have a link that seems to work well until you look at it, so I probably have a bit more compassion than others when talking about broken microwave/satellite hops.
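The threshold arithmetic in the -108 / -105 example can be written out as a small Python sketch. It assumes the levels are in dBm (the message gives no unit) and treats the link as up only while the received level stays strictly above the threshold, following the "above threshold" wording. The function names are illustrative.

```python
def fade_margin_db(rsl_dbm, threshold_dbm):
    """Margin between the received signal level and the receiver threshold."""
    return rsl_dbm - threshold_dbm


def link_survives(rsl_dbm, threshold_dbm, fade_db):
    """True if the faded signal is still strictly above the threshold."""
    return rsl_dbm - fade_db > threshold_dbm


def samples_below_threshold(rsl_series, threshold_dbm):
    """Return the (timestamp, rsl_dbm) samples at or below the threshold."""
    return [(ts, rsl) for ts, rsl in rsl_series if rsl <= threshold_dbm]
```

With a received level of -105 and a threshold of -108 the margin is 3 dB, so `link_survives(-105.0, -108.0, 3.0)` is `False`. If the provider supplies an RSL plot or log, `samples_below_threshold` picks out the moments worth lining up against the outage timestamps.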
Thanks a lot for the in-depth insights, all. I'll be doing a lot of "sleuthing" in the next few days based on all this information.
participants (11)
- Andrew D Kirch
- Dobbins, Roland
- Don Bowman
- George William Herbert
- joel jaeggli
- Matt Palmer
- Notify Me
- Randy Bush
- Sina Owolabi
- Warren Bailey
- William Waites