Unless useful to others, feel free to just reply off-list.

Background: Tuesday (yesterday) morning around 1am, I got a phone call from one of my transit customers (which seems more like a dream). I, sadly, didn't have the router they are on logging to a server, so it's impossible for me to see exactly what happened. Here's what I have.

They received a minor spike in traffic going to them. My router shows the last BGP peer reset about that time, so this could be me sending the global table. His bandwidth then drops to 0 for almost exactly 30 minutes (MRTG isn't an exact graph). My guess (authoritative answer) was the customer flapped their routes once too many times and was suppressed by both of my providers, as I seem to recall the penalty heal rate is in 30 minute increments.

First issue is, am I right? If I am, then I need to develop ways to limit the damage done to my customer. Is there a way to set up route suppression just under what most people use, so that I can have the client fix the problem and then clear the suppression on my network, allowing them to come back up immediately while still just under the suppress threshold elsewhere?

Another possibility, although I've not seen reference to it: since the customer only transits through my network and depends on my redundancy, is it possible to hold his routes in the tables and keep advertising them out unless they are down for a set time period (i.e., ignore flaps, but drop them if he's down 15-30 minutes)?

I've never seen this issue. I was aware suppression was possible when I first started learning BGP, and so I have never risked bouncing my peers more than three times in a day, and at that point usually quit playing until the next week. When my peers flap due to DDoS attacks, either BGP never stabilizes fully or my providers have protected my networks (though I haven't seen how 69.8/18 will react in this scenario, which doesn't have a shorter prefix at the peer).

My customer is thinking of multi-homing again after this. Of course, it wouldn't have saved the customer. The reason they left multi-homing is that their network is in the same building and they only have one BGP router. I don't think multiple paths would have saved them.

Opinions? Suggestions? Options?

-Jack
~We now return you to the 69/8 threads
On Wed, 12 Mar 2003, Jack Bates wrote:
traffic going to them. My router shows the last BGP peer reset about that time, so this could be me sending the global table. His bandwidth then drops to 0 for almost exactly 30 minutes (MRTG isn't an exact graph). My guess (authoritative answer) was the customer flapped their routes once too many times and was suppressed by both of my providers, as I seem to recall the penalty heal rate is in 30 minute increments.
Were there more flaps than just that last one before everything became very quiet? A flap (up->down transition) has a penalty of 1000. By default (if dampening is enabled), the dampen threshold is 2000. You need at least three flaps to trigger dampening.
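To make the arithmetic above concrete, here is a minimal sketch of penalty-based dampening. The per-flap penalty of 1000 and the suppress threshold of 2000 are the figures quoted in this message; the 15-minute half-life and the reuse threshold of 750 are assumptions (commonly cited defaults, not something stated in this thread), and any maximum suppress time is ignored. The function names are hypothetical.

```python
import math

# From the thread: each flap adds 1000, suppression starts above 2000.
PENALTY_PER_FLAP = 1000
SUPPRESS_THRESHOLD = 2000
# Assumptions, not stated in the thread: reuse threshold and half-life.
REUSE_THRESHOLD = 750
HALF_LIFE_MIN = 15.0


def decay(penalty, minutes):
    """Exponentially decay a penalty over the given number of minutes."""
    return penalty * 0.5 ** (minutes / HALF_LIFE_MIN)


def simulate(flap_times_min):
    """Return (penalty, suppressed) immediately after the last flap."""
    penalty = 0.0
    suppressed = False
    last = None
    for t in flap_times_min:
        if last is not None:
            # Penalty decays between flaps; a suppressed route is
            # released once the penalty falls below the reuse threshold.
            penalty = decay(penalty, t - last)
            if suppressed and penalty < REUSE_THRESHOLD:
                suppressed = False
        penalty += PENALTY_PER_FLAP
        if penalty > SUPPRESS_THRESHOLD:
            suppressed = True
        last = t
    return penalty, suppressed


def minutes_until_reuse(penalty):
    """Minutes of quiet needed for a penalty to decay to the reuse threshold."""
    if penalty <= REUSE_THRESHOLD:
        return 0.0
    return HALF_LIFE_MIN * math.log2(penalty / REUSE_THRESHOLD)


if __name__ == "__main__":
    print(simulate([0, 1]))        # two flaps one minute apart
    print(simulate([0, 1, 2]))     # three flaps one minute apart
    print(minutes_until_reuse(3000))
```

Under these assumed parameters, two flaps never exceed 2000 because the first penalty has already decayed a little by the time the second arrives, which is why a third flap is needed. A penalty of 3000 would take two half-lives, or 30 minutes, to decay to 750, which is at least consistent with the outage length Jack describes, though it does not prove that dampening was the cause.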
First issue is, am I right? If I am, then I need to develop ways to limit the damage done to my customer.
Yell at your upstreams.
Is there a way to set up route suppression just under what most people use so that I can have client fix the problem and then clear the suppress on my network to allow them to come back up immediately just under the suppress threshold?
Dampening doesn't work on direct eBGP sessions: when the session is lost the dampening info is removed from memory. So dampening your own customers doesn't really do anything. For this reason, it seems curious to me that both your upstreams use rather aggressive dampening. (See RIPE-229 for some considerations on good dampening practices.)
Opinions? Suggestions? Options?
If this happens again you can simply reset your sessions to your upstreams (one at a time of course) to get rid of the dampening IN THE NEXT HOP AS. However, if the trouble is further upstream this only makes matters worse.
On Wed, 12 Mar 2003, Randy Bush wrote:
You need at least three flaps to trigger dampening.
i guess you really need to look at that pdf.
You are right, it is depressing. However, I don't see how the penalty multiplication could happen here, you need a few hops in between for that.
Iljitsch van Beijnum wrote:
On Wed, 12 Mar 2003, Randy Bush wrote:
You need at least three flaps to trigger dampening.
i guess you really need to look at that pdf.
You are right, it is depressing. However, I don't see how the penalty multiplication could happen here, you need a few hops in between for that.
Ah, but this is the Internet. Jack's two upstreams likely have direct or indirect links between them where they will also receive the route updates in question. Should we change the subject (back) to "BGP to doom us all"? Peter E. Fry (I believe I said that without swearing. $#&%*@!)
On Wed, 12 Mar 2003, Peter E. Fry wrote:
You are right, it is depressing. However, I don't see how the penalty multiplication could happen here, you need a few hops in between for that.
Ah, but this is the Internet. Jack's two upstreams likely have direct or indirect links between them where they will also receive the route updates in question.
Dampening is done on the eBGP router where the route enters the AS, and, unless I'm mistaken, per route/path and not per prefix. So the flapping that ISP A sees from ISP B is a completely separate thing from the flapping that ISP A sees from its customer's customer as far as the dampening algorithm is concerned.
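The per-path point can be illustrated with a small sketch, assuming (as the message itself hedges) that dampening state is kept per neighbor and prefix rather than per prefix alone. The class, the neighbor labels, and the documentation prefix 192.0.2.0/24 are all hypothetical, and decay over time is left out to keep the example short.

```python
from collections import defaultdict

PENALTY_PER_FLAP = 1000  # figure quoted earlier in the thread


class DampeningTable:
    """Hypothetical per-path penalty store, keyed by (neighbor, prefix)."""

    def __init__(self):
        self.penalty = defaultdict(float)

    def record_flap(self, neighbor, prefix):
        # Each (neighbor, prefix) pair accumulates its own penalty,
        # so flaps heard via one neighbor do not count against another.
        key = (neighbor, prefix)
        self.penalty[key] += PENALTY_PER_FLAP
        return self.penalty[key]


if __name__ == "__main__":
    table = DampeningTable()
    prefix = "192.0.2.0/24"  # documentation prefix, illustration only
    table.record_flap("customer", prefix)
    table.record_flap("customer", prefix)
    table.record_flap("isp-b", prefix)
    print(dict(table.penalty))
```

With state keyed this way, two flaps heard from the customer and one heard via ISP B leave penalties of 2000 and 1000 on separate entries instead of a single 3000 on the prefix.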
Should we change the subject (back) to "BGP to doom us all"?
For all the criticism that BGP is subjected to, I find it curious that nobody has proposed a replacement protocol (that I'm aware of).
On Wed, 12 Mar 2003, Randy Bush wrote:
You need at least three flaps to trigger dampening.
i guess you really need to look at that pdf.
randy
"Better Algorithms" -- http://www.kotovnik.com/~avg/flap-rfc.txt http://www.kotovnik.com/~avg/flap-rfc.ps I didn't publish that one because I wanted to compare that with penalty-based dampening on historical (pre-dampening) flap records, but then got distracted by other projects. Preliminary data (from frequency analysis) indicates that "unwarranted downtime" (defined as suppression after the last flap prior to entering stable state) is reduced by a factor of 3 to 4 compared with penalty-based algorithm tuned to produce the same post-dampening flap rate. --vadim
On Wed, Mar 12, 2003 at 06:53:03AM -0600, Jack Bates wrote:
traffic going to them. My router shows the last BGP peer reset about that [...] I've not seen reference to it, since the customer only transits through my network and depends on my redundancy, is it possible to hold his routes in the tables and keep advertising them out unless they are down for a set time period (ie, ignore flaps, but drop them if he's down 15-30 minutes)?
While perhaps not always an ideal solution, is it possible for the customer to set default to you rather than having to use BGP? You could in turn use static routing back to them for their netblock(s). John
you might want to look at <http://psg.com/~randy/021028.zmao-nanog.pdf>. then again, you may not. it's depressing. randy
participants (6)
- Iljitsch van Beijnum
- Jack Bates
- John Kristoff
- Peter E. Fry
- Randy Bush
- Vadim Antonov