tools and techniques to pinpoint and respond to loss on a path
Hi,

Does anyone have any recommendations on how to pinpoint and react to packet loss across the internet, preferably in an automated fashion? For detection I'm currently looking at trying smoketrace to run from inside my network, but I'd love to be able to run traceroutes from my edge routers, triggered during periods of loss. I have Juniper MX80s on one end, where I'm hopeful I'll be able to cobble together some combo of RPM and event scripting to kick off a traceroute. We have Cisco 4900Ms on the other end, and maybe the same thing is possible there, but I'm not so sure.

I'd love to hear other suggestions and experience for detection, and also for options on what I might be able to do when loss is detected on a path.

In my specific situation I control equipment on both ends of the path that I care about, with details below.

We are a hosted service company and we currently have two data centers, DC A and DC B. DC A uses Juniper MX routers, advertises our own IP space and takes full BGP feeds from two providers, ISPs A1 and A2. At DC B we have a smaller installation and instead take redundant drops (and IP space) from a single provider, ISP B1, who then peers upstream with two providers, B2 and B3.

We have a fairly consistent bi-directional stream of traffic between DC A and DC B. Both ISP A1 and ISP A2 have good peering with ISP B2, so under normal network conditions traffic flows across ISP B1 to B2 and then to either ISP A1 or A2.

Oversimplified ASCII pic showing only the normal best paths:

      -- ISP A1----------------------ISP B2--
DC A--|                                      |--- ISP B1 ----- DC B
      -- ISP A2----------------------ISP B2--

With increasing frequency we've been experiencing packet loss along the path from DC A to DC B. Usually the periods of loss are brief, 30 seconds to a minute, but they are total blackouts.

I'd like to be able to collect enough relevant data to pinpoint the trouble spot as much as possible so I can take it to the ISPs and request a solution. The blackouts are so quick that it's impossible to log in and get a trace, hence the desire to automate it.

I can provide more details off list if helpful. I'm trying not to vilify anyone, especially without copious amounts of data points.

As a side question, what should my expectation be regarding packet loss when sending packets from point A to point B across multiple providers across the internet? Is 30 seconds to a minute of blackout between two destinations every couple of weeks par for the course? My directly connected ISPs offer me an SLA, but what should I reasonably expect from them when one of their upstream peers (or a peer of their peers) has issues? If this turns out to be BGP reconvergence or similar, do I have any options?

many thanks,
-andy
On Jul 15, 2013, at 5:18 PM, Andy Litzinger <Andy.Litzinger@theplatform.com> wrote:
> I'd like to be able to collect enough relevant data to pinpoint the trouble spot as much as possible so I can take it to the ISPs and request a solution. The blackouts are so quick that it's impossible to log in and get a trace- hence the desire to automate it.
>
> I can provide more details off list if helpful- I'm trying not to vilify anyone- especially without copious amounts of data points.
>
> As a side question, what should my expectation be regarding packet loss when sending packets from point A to point B across multiple providers across the internet? Is 30 seconds to a minute of blackout between two destinations every couple of weeks par for the course? My directly connected ISPs offer me an SLA, but what should I reasonably expect from them when one of their upstream peers (or a peer of their peers) has issues? If this turns out to be BGP reconvergence or similar do I have any options?
I think there are a number of tools available to detect if something is happening:

1) iperf (test network/bandwidth usage)
2) owamp (one-way ping) - you can use this to detect when reordering or other events happen. This will collect nearly continuous data. It requires good NTP references, or accepting that you may see skewed data.
3) Some other UDP/low-latency responder. I've built something of my own that does this; I can provide a pointer if you are interested. I have graphs of my connection at home to someplace remote that crosses 3 carriers. You can see the queuing delay increase throughout the day until peak times and taper off at night. No loss, but the increase is quite visible.
4) Some vendor SLA/SAA product. Cisco and others have SAA responders that work on their devices, which you can configure to collect data.

That being said, I would expect losing the network for 30 seconds once every 2 weeks to be fairly common. Someone will be doing network upgrades/work, or there will be a hardware/transmission error, etc.

30 seconds sounds a lot like BGP convergence. On older platforms, e.g. 6500/Sup720, expect about 8k prefixes/second max to be downloaded into the TCAM/FIB. With 400k+ prefixes, it takes a while to pump the tables into the forwarding side.

- Jared
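As a rough back-of-the-envelope check of the convergence figures quoted above, the sketch below divides a table size by a FIB download rate. The function name and the round 400,000 figure (the message says "400k+") are illustrative assumptions. Real convergence also includes update propagation and best-path selection, so treat the result as a lower bound under those assumptions, not a measurement.

```python
def fib_download_seconds(prefixes: int, prefixes_per_second: int) -> float:
    """Time to program `prefixes` routes into the forwarding table at a fixed rate."""
    if prefixes_per_second <= 0:
        raise ValueError("prefixes_per_second must be positive")
    return prefixes / prefixes_per_second


if __name__ == "__main__":
    # Figures quoted in the message: about 8k prefixes/second, 400k+ prefixes.
    seconds = fib_download_seconds(400_000, 8_000)
    print(f"{seconds:.0f} seconds")  # prints "50 seconds"
```

At those rates a full table takes on the order of 50 seconds to install, which is in the same range as the 30 to 60 second blackouts described in the original post.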
Personally I would never expect simple routed connectivity across the public internet to deliver such a high level of reliability, without at least diverse-path tunnels running routing protocols internally.

While any provider will attempt to fix peer/upstream issues as they can, any SLA you would have is between two points on their private network, not from point A to point Z across multiple peers and the public internet itself, which they have no control over.

The much more common design is to use a single provider for each thread between sites. Then at least you have an end-to-end SLA in effect, as well as a single entity that is responsible for the entire link in question.

This sounds like you're trying to achieve private-link IGP/FRR-level site-to-site failover/convergence across the public internet. Perhaps you should rethink your goals here or your design?

-Blake
Have you looked into Cisco's OER?

-James

-----Original Message-----
From: Andy Litzinger [mailto:Andy.Litzinger@theplatform.com]
Sent: Monday, July 15, 2013 2:19 PM
To: nanog@nanog.org
Subject: tools and techniques to pinpoint and respond to loss on a path
IP SLA + EEM on the 4900. You can have the 4900 run pings/latency tests and then run commands and pipe them to flash when the issue happens.

-Pete
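The "run tests, then run commands when the issue happens" pattern boils down to a threshold trigger. The following is a minimal Python sketch of that logic only; it is not IOS IP SLA or EEM syntax, and the class name, the threshold of three probes and the callback are all hypothetical choices made for illustration.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class LossTrigger:
    """Fire `on_outage` once when `threshold` consecutive probes fail."""
    threshold: int
    on_outage: Callable[[], None]
    failures: int = 0
    fired: bool = False

    def record(self, probe_ok: bool) -> None:
        if probe_ok:
            # Path recovered: re-arm so the next outage fires again.
            self.failures = 0
            self.fired = False
            return
        self.failures += 1
        if self.failures >= self.threshold and not self.fired:
            self.fired = True
            self.on_outage()


if __name__ == "__main__":
    events = []
    trigger = LossTrigger(threshold=3, on_outage=lambda: events.append("capture"))
    for ok in [True, False, False, False, False, True, False, False, False]:
        trigger.record(ok)
    print(events)  # prints ['capture', 'capture']
```

Firing once per outage rather than once per failed probe keeps the capture step from piling up during a 30 to 60 second blackout. The threshold trades detection speed against false triggers on a single dropped probe.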
What I have done in the past (and this presumes you have a /29 or bigger on the peering session to your upstreams) is to check with the direct upstream provider at each site and get approval to put a Linux diagnostics server on the peering side of each BGP upstream connection you have, default-routed out to their BGP router(s). This is typically not a problem with the upstream as long as they know it is for diagnostic purposes and will be taken down later. It also helps the upstreams know you are seriously looking at the reliability they and their competitors are giving you.

On that diagnostics box, run some quick and dirty tools to try to start isolating whether the problem is related to one upstream link or another, or a combination of them. Have each one monitor all the distant peer connections, and possibly even each other's local peers for connectivity if you are uber-detailed. The problem could be anywhere in between, but if you notice that one link has the issues and the other one does not, and/or a particular combination of src/dst, then you are in better shape to help your upstreams diagnose as well.

A couple of tools like smokeping, plus running traceroute and ping on a scripted basis, are not perfect, but they are easy to set up. Log it all, so that when the problem impacts production systems you can go back and look at those logs and see if there are any clues. nettop is another handy tool to dump stuff out with, and it is also very handy in the nearly impossible case that you happen to be on the console when the problem occurs.

From there, let that run for a while (hours, days or weeks, depending on the frequency of the problem). Typically you will find that the 'hiccup' happens either via one peering partner or all of them, and/or from one end or the other. More than likely something will fall out of the data as to where the problem is, and often it is not with your direct peers, but with their peers or somebody else further down the chain.

This kind of stuff is notoriously difficult to troubleshoot, and I generally agree with the opinions that, for better or worse, global IP connectivity is still just on a 'best effort' basis without spending immense amounts of money.

I remember a few years ago having blips and near one-hour outages from NW Washington State over to Europe. The problem was that Global Crossing was doing a bunch of maintenance and it was not going well for them. They were the 'man in the middle' for the routing from two different peers. Just knowing the problem was a big help, and with some creative BGP announcements we were able to minimize the impact.

- mike
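The scripted ping-and-traceroute logging described above might look roughly like the sketch below. It assumes a Linux host with the iputils `ping` and a standard `traceroute` on the path. The function names, the 5-second interval and the log format are hypothetical. The `run` parameter exists only so the logic can be exercised without touching the network.

```python
import subprocess
import time
from typing import Callable, List, Optional

Runner = Callable[[List[str]], subprocess.CompletedProcess]


def run_command(argv: List[str]) -> subprocess.CompletedProcess:
    """Run a command and capture its output as text."""
    return subprocess.run(argv, capture_output=True, text=True)


def probe_once(target: str, log: List[str], run: Runner = run_command) -> bool:
    """Ping `target` once; on failure, log a traceroute taken right away."""
    stamp = time.strftime("%Y-%m-%d %H:%M:%S")
    # Assumes Linux iputils ping: -c 1 sends one probe, -W 1 waits 1 second.
    ping = run(["ping", "-c", "1", "-W", "1", target])
    if ping.returncode == 0:
        log.append(f"{stamp} {target} ok")
        return True
    # -n skips DNS lookups; -q 1 -w 1 sends one probe per hop with a 1 s wait,
    # so the trace has a chance to finish while a short blackout is ongoing.
    trace = run(["traceroute", "-n", "-q", "1", "-w", "1", target])
    log.append(f"{stamp} {target} LOSS\n{trace.stdout}")
    return False


def monitor(targets: List[str], log_path: str, interval: float = 5.0,
            rounds: Optional[int] = None, run: Runner = run_command) -> None:
    """Probe every target each `interval` seconds; `rounds=None` runs forever."""
    done = 0
    while rounds is None or done < rounds:
        entries: List[str] = []
        for target in targets:
            probe_once(target, entries, run)
        with open(log_path, "a") as fh:
            fh.write("\n".join(entries) + "\n")
        done += 1
        time.sleep(interval)
```

For example, `monitor(["192.0.2.1", "198.51.100.1"], "path-probe.log")` would probe two placeholder addresses until interrupted. Running one copy on each diagnostics box and comparing the timestamps of the LOSS entries is one way to see whether an outage follows a single upstream or a particular src/dst pair.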
participants (6)
- Andy Litzinger
- Blake Dunlap
- James Sink
- Jared Mauch
- Michael DeMan
- Pete Lumbis