BGP Failover Question

Chris Wallace

21 Feb 2011

9:10 p.m.

I am looking for some help with an issue we recently had with one of our BGP peers. I currently have two DIA providers, each terminated into its own edge router, and I am running iBGP to exchange routes between the two edge routers. Last week Provider A made a policy change "somewhere" in their network in the middle of the day, causing traffic to stop routing. Of course, this connection happens to be the preferred route for the majority of our inbound and outbound traffic. I never saw our physical link go down and never saw our peer drop, so BGP did not stop advertising routes; this caused most of our customers' traffic to go nowhere. To fix the issue I had to manually shut down the peer until Provider A confirmed that the change they made had been reverted. This isn't the first time we have seen this issue with our various providers. How can I prevent issues like this from happening in the future?

---Chris


Brian Johnson

21 Feb

9:21 p.m.

Chris,

The best way to resolve this issue is to not use a service provider who takes down your connectivity outside of maintenance windows, but I digress.

This is the nature of BGP. You send your providers routes for your network prefixes, and they send you routes to, say, the DFZ. Once you forward packets to them because they sent you routes saying they can reach those destinations, what happens next is outside your control. It is up to the peer to forward the packets as they said they would by sending you those prefixes. This is a trust relationship: you trust that they will forward your packets, because that is what you are paying them for.

- Brian J.
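
To make the distinction above concrete, here is a minimal sketch of telling "session down" apart from "session up but traffic going nowhere". It assumes you already have some way to learn the BGP session state and to probe a few destinations through that provider; the `PathState` names, the `classify_path` helper, the 50% cutoff, and the documentation-range target addresses are all hypothetical.

```python
from enum import Enum


class PathState(Enum):
    HEALTHY = "healthy"
    SESSION_DOWN = "session_down"
    SUSPECTED_BLACKHOLE = "suspected_blackhole"


def classify_path(session_established: bool,
                  probe_results: dict[str, bool],
                  min_success_ratio: float = 0.5) -> PathState:
    """Combine control-plane and data-plane signals for one provider.

    probe_results maps a probe target to whether it answered when
    probed through this provider (how the probe is sent is out of scope).
    """
    if not session_established:
        # BGP itself withdraws the routes in this case.
        return PathState.SESSION_DOWN
    if not probe_results:
        raise ValueError("at least one probe target is required")
    successes = sum(probe_results.values())
    ratio = successes / len(probe_results)
    if ratio < min_success_ratio:
        # Session is up and routes are present, yet packets are not delivered.
        return PathState.SUSPECTED_BLACKHOLE
    return PathState.HEALTHY


if __name__ == "__main__":
    # Placeholder targets from the documentation address ranges.
    probes = {"192.0.2.1": False, "198.51.100.1": False, "203.0.113.1": True}
    print(classify_path(True, probes))
```

Probing several unrelated targets rather than one keeps a single unreachable host from being mistaken for a provider-wide problem.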

Max Pierson

9:25 p.m.

I would simply monitor PPS on those links and set a threshold that will at least kick off an alert. If you're scripting-savvy, other tools such as IP SLA and EEM on Cisco could be used to automate the failover. Juniper also has a similar scripting tool that can probably do the same.

I've had this happen before, and it is a real pain.

Regards,
M
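
A minimal sketch of the PPS threshold idea, assuming you can read a cumulative packet counter for the provider-facing interface at regular intervals (for example via SNMP). The `CounterSample` shape, the function names, and the threshold value are hypothetical; which direction to watch and what counts as "too quiet" depend on your traffic.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CounterSample:
    """One reading of a cumulative interface packet counter."""
    timestamp: float  # seconds
    packets: int      # cumulative packet count at that time


def packets_per_second(prev: CounterSample, curr: CounterSample,
                       counter_bits: int = 64) -> float:
    """Average packet rate between two samples.

    A negative delta is treated as a single counter wrap; a counter
    reset (e.g. after a reboot) would be misread, which is a known
    limitation of this sketch.
    """
    elapsed = curr.timestamp - prev.timestamp
    if elapsed <= 0:
        raise ValueError("samples must be in increasing time order")
    delta = curr.packets - prev.packets
    if delta < 0:
        delta += 2 ** counter_bits
    return delta / elapsed


def below_threshold(pps: float, min_pps: float) -> bool:
    """True when the observed rate is low enough to raise an alert."""
    return pps < min_pps


if __name__ == "__main__":
    prev = CounterSample(timestamp=0.0, packets=1_000_000)
    curr = CounterSample(timestamp=60.0, packets=1_000_600)
    rate = packets_per_second(prev, curr)
    if below_threshold(rate, min_pps=500.0):
        print(f"ALERT: only {rate:.1f} pps on the provider link")
```

This only raises an alert; what to do about it (page someone, or trigger an automated failover) is a separate decision.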

Seth Mattinen

9:35 p.m.


I had a provider like that a long time ago; it was an ATG T1 (which was fine) but when they were bought by Eschelon the exact problem you're describing would happen every other month like clockwork. The first time was forgivable. The second time I was annoyed. After the third I was angry, unplugged it, and told them to stuff it because apparently they didn't know how to deal with BGP.

You can't prevent it from happening. You can only come up with band-aids to notify you. Save yourself the headache and find a new provider that knows how to handle BGP. What happens if the other circuit is not available (outage, planned maintenance, etc.) at the same time the problem one decides to black hole you? If you're facing the same repeating problem they are obviously not the best fit for you.

~Seth

Max Pierson

9:44 p.m.

...

Save yourself the headache and find a new provider that knows how to handle BGP

I've had this happen with providers that do know how to handle BGP. Just because you peer with 3356, 701, etc., doesn't mean operators can't make a mistake. I've even seen this happen due to some weird BGP behavior caused by some cool new "features".

IMHO, better to plan for it and deploy it as a policy (by whatever means).

M

Seth Mattinen

10:19 p.m.

On 2/21/2011 13:44, Max Pierson wrote:

...

...
Save yourself the headache and find a new provider that knows how to handle BGP

I've had this happen with providers that do know how to handle BGP. Just because you peer with 3356, 701, etc., doesn't mean operators can't make a mistake. I've even seen this happen due to some weird BGP behavior caused by some cool new "features".

IMHO, better to plan for it and deploy it as a policy (by whatever means).

On a predictable schedule? That's where I drew the line: they were "fixing" something that was not "normal" to them every two months that resulted in the problem the OP described. Yes, mistakes happen, but identical repeating mistakes don't count in my book. I would expect my providers to document changes and whoever is making changes to consult it when they see a deviation from common config. ~Seth

Charles Gucker

11:36 p.m.

On Mon, Feb 21, 2011 at 4:10 PM, Chris Wallace <lists@iamchriswallace.com> wrote:

...

This isn't the first time we have seen this issue with our various providers, how can I prevent issues like this from happening in the future?

Quick question, are you running with a default route from your provider? If so, you're better off either finding another provider, or upgrading the router (if necessary) to carry a full table. If they do something to partition their network, you will see the decrease in routes learned from them, provided you see those routes and not the default route as asked above. charles
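
A minimal sketch of the "watch the number of routes learned" idea, assuming you already collect the received-prefix count per provider somehow (SNMP, CLI scraping, a route collector; the collection itself is left out). The function names and the 20% drop limit are hypothetical and would need tuning against normal table churn.

```python
def prefix_drop_fraction(baseline: int, current: int) -> float:
    """Fraction of the baseline prefix count that has disappeared.

    Growth above the baseline is reported as 0.0 rather than a
    negative drop.
    """
    if baseline <= 0:
        raise ValueError("baseline must be positive")
    return max(0.0, (baseline - current) / baseline)


def should_alert(baseline: int, current: int, max_drop: float = 0.2) -> bool:
    """True when the provider is sending noticeably fewer routes than usual."""
    return prefix_drop_fraction(baseline, current) >= max_drop


if __name__ == "__main__":
    # Hypothetical counts for a full table from one provider.
    baseline_count = 350_000
    for current_count in (348_500, 200_000):
        print(current_count, should_alert(baseline_count, current_count))
```

As noted above, this only helps when you take full routes; with a default route alone there is nothing to count.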

Chris Wallace

22 Feb

5:17 p.m.

We are receiving full routes from both providers.

---Chris

Hammer

5:23 p.m.

As Max stated, you can set triggers based on thresholds that are monitored via multiple methods in Cisco IOS. That way you could force the route down dynamically. There's always a risk when letting the machines do the thinking, but this would help in situations like this. Can't speak for other vendors, but I'm sure the features are similar.

-Hammer-

"I was a normal American nerd." -Jack Herer
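
One way to limit the risk of letting the machines do the thinking is to require several consecutive bad results before acting, and several consecutive good ones before undoing it. A minimal sketch of that decision logic follows; the `FailoverDecider` class and its default counts are hypothetical, and how the probe is run and how the peer is actually disabled or re-enabled are deliberately left out.

```python
class FailoverDecider:
    """Debounced up/down decision for one provider."""

    def __init__(self, fail_after: int = 3, recover_after: int = 5) -> None:
        if fail_after < 1 or recover_after < 1:
            raise ValueError("thresholds must be at least 1")
        self.fail_after = fail_after
        self.recover_after = recover_after
        self.peer_enabled = True
        self._streak = 0  # consecutive results that contradict the current state

    def observe(self, probe_ok: bool) -> bool:
        """Feed one probe result; return whether the peer should be enabled."""
        contradicts = probe_ok != self.peer_enabled
        if not contradicts:
            # Result agrees with the current state, so any streak is broken.
            self._streak = 0
            return self.peer_enabled
        self._streak += 1
        needed = self.fail_after if self.peer_enabled else self.recover_after
        if self._streak >= needed:
            self.peer_enabled = not self.peer_enabled
            self._streak = 0
        return self.peer_enabled


if __name__ == "__main__":
    decider = FailoverDecider(fail_after=3, recover_after=2)
    for result in (False, False, True, False, False, False, True, True):
        print(result, decider.observe(result))
```

Making recovery slower than failure is a design choice: it keeps a provider that is flapping from being brought back into service too eagerly.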

Bret Clark

6:11 p.m.

Well, as someone else stated, if an upstream provider can't provide BGP reliably, then it's time to give them the boot. Once in a year, okay, but beyond that it's time to read the riot act to that provider.

Bret

Hammer

6:15 p.m.

I'm not arguing that at all. But it wasn't relevant to the question at hand. And depending on the scale of your business, dumping providers is not something done on a whim. It's not like you're fed up with DSL and want to convert to cable.

-Hammer-

Owen DeLong

6:38 p.m.

Assuming that he has provider-independent space (why run full BGP feeds if you are not multihomed?), then it's actually about on par and less disruptive in general. Add the new provider, wait a day or two, then disconnect the old provider.

If he's using provider-assigned space, then the big hurdle is switching to provider-independent space (which requires a renumber), but that's a good idea for a variety of reasons.

I would hardly call the type and frequency of outages described a "whim" when using that as a reason to change providers. Sounds like he is suffering severe impact to his business.

Owen

Hammer

6:52 p.m.

I agree. But swapping providers is not the default answer in some environments. I work in an enterprise with multiple GE circuits from multiple providers to the Internet. The lead time on calling up a different carrier and saying "I need a gigabit connection to the Internet" would probably be 90-120 days. And then you get to go through the contracts/negotiations and MSAs. You don't just flip. In smaller operations I understand. But I was simply saying that it's not always that easy. If I went to my boss and said one of our carriers sucks and we should dump them, he would just laugh and throw me out.

1. What are the SLAs with the carrier in question? Do you have them clearly defined? Are they out of SLA? If so, what compensation are you entitled to based on violation of said SLA?

2. What trending are you doing to document the failures in SLA of the carrier in question? Do we have a documented pattern of poor performance from that trending?

3. What are our contractual or legal options based on items 1 and 2?

4. Don't forget about the Layer 8 (political) factor. If your telco manager is buddies with the carrier, then you have to double your documentation against them. Some companies spend tens of millions a month on circuits. You had better be ready to justify yourself.

-Hammer-
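
On the trending and SLA questions, the arithmetic is simple enough to keep in a script next to the outage log. A minimal sketch, assuming outages are recorded as durations in minutes and the SLA is expressed as a monthly availability percentage; the function names, the 30-day period, and the figures used are hypothetical, not taken from any real contract.

```python
def availability_percent(outage_minutes: list[float], period_days: int = 30) -> float:
    """Availability over the period, given each outage's duration in minutes."""
    total_minutes = period_days * 24 * 60
    downtime = sum(outage_minutes)
    if downtime > total_minutes:
        raise ValueError("recorded downtime exceeds the period length")
    return 100.0 * (total_minutes - downtime) / total_minutes


def violates_sla(outage_minutes: list[float], sla_percent: float,
                 period_days: int = 30) -> bool:
    """True when measured availability falls below the contracted figure."""
    return availability_percent(outage_minutes, period_days) < sla_percent


if __name__ == "__main__":
    # Two hypothetical black-hole events in one 30-day month.
    outages = [120.0, 96.0]
    print(f"{availability_percent(outages):.3f}% available")
    print("SLA violated:", violates_sla(outages, sla_percent=99.9))
```

Keeping the raw outage durations, rather than only the monthly percentage, is what makes a repeating pattern visible later.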

Hammer

6:54 p.m.

Funny, I was just at your IPv6 site this morning while researching multihoming scenarios. "That name sounds familiar....."

-Hammer-

Owen DeLong

7:20 p.m.

On Feb 22, 2011, at 10:52 AM, Hammer wrote:

...

I agree. But swapping providers is not the default answer in some environments. I work in an enterprise with multiple GE circuits from multiple providers to the Internet. The lead time on calling up a different carrier and saying "I need a gigabit connection to the Internet" would probably be 90-120 days. And then you get to go thru the contracts/negotiations and MSAs. You don't just flip. In smaller operations I understand. But I was simply saying that it's not always that easy. If I went to my boss and said one of our carriers sucks and we should dump them he would just laugh and throw me out.

That depends on where you are. If you have a router in one or more of the many "carrier hotels" around the world, you can usually order a new Gig-E cross-connect with service in less than a week. If you need to have a circuit engineered, then, 30-90 days is probably about right. If you need to have facilities installed to provide said circuit, it can be as much as 180 days. However, I don't think the point was "disconnect them tomorrow". I think the point was "If the impact is that severe, the sooner you start the new provider process, the sooner you get relief."

...

1. What are the SLAs with the carrier in question? Do you have them clearly defined? Are they out of SLA? If so, what compensation is entitled based on violation of said SLA?

99.99% of all SLAs are a pittance of money refunded IF you jump through extreme hoops to collect. They are rarely sufficient to resolve or even compensate for outages.

...

2. What trending are you doing to document the failures in SLA of the carrier in question? Do we have a documented pattern of poor performance by using that trending?

3. What are our contractual or legal options based on items 1 and 2?

4. Don't forget about the Layer8 (political) factor. If your telco manager is buddies with the carrier then you have to double your documentation against them. Some companies spend tens of millions a month on circuits. You better be ready to justify yourself.

Yeah, this is usually the biggest problem. Owen


Hammer

8:58 p.m.

Uncle!

-Hammer-


15 comments

8 participants


Bret Clark
Brian Johnson
Charles Gucker
Chris Wallace
Hammer
Max Pierson
Owen DeLong
Seth Mattinen