Re: Persistent BGP peer flapping - do you care?
Here's my two cents...
A good rule of thumb (possibly from RFC 822) is, be liberal in what you accept and strict in what you send.
When applied to BGP, I would suggest that any implementation should choose a canonical form for constructing updates, but a parser that allows for rule-bending without rule-breaking.
On the issue of existing vendor implementations, and how to build the specs to prevent meltdowns:
I would suspect that during implementation, brand C routers were the victims during testing, and perhaps the change was made to avoid that happening.
The current state of affairs is very much like the classical game-theory "prisoner's dilemma".
The new spec should have two goals - discourage any implementation which can lead to meltdowns, and encourage strict adherence to the spec. The latter can be achieved via the former, in fact, if the mechanisms are well chosen.
My suggestion would be, rather than a back-off of resetting BGP sessions, that first attempt strict interpretation (to insulate against completely insane routers), and then loose interpretation. The model is "Fool me once, shame on you, fool me twice, shame on me."
On first receiving a bad update, reset. If upon re-establishing the session, the same bad update is heard, drop the bad update but keep the session up (along with the messages back, etc.)
One additional optional behaviour I would suggest - look at the AS path and/or path length and/or announcing router IP address. If heard from the originator, drop the session (and either keep it down, or try one more time before requiring operator intervention); it may be the case that only these conditions strictly require a reset, and that all other situations may only require the "ignore bad routes" behaviour.
Resetting BGP more than a small, finite number of times is, IMHO, a bad idea. After all, BGP is a stateful protocol, and state changes should be triggered deterministically, even if that requires operator input.
Brian Dickson Velocita
Dickson, Brian wrote:
A good rule of thumb (possibly from RFC 822) is, be liberal in what you accept and strict in what you send.
While it has appeared in several docs, the earliest I have found is RFC760 (not to be confused with the most prophetic RFC706 - On the junk mail problem) which was the predecessor to RFC791 - IPv4. Tony
A good rule of thumb (possibly from RFC 822) is, be liberal in what you accept and strict in what you send.
That's a good rule of thumb in general, but I'm not sure it makes sense to apply it to the routing fabric of the entire Internet. Any router that sends you a malformed update is without question broken. I think the point of saying "drop the session on receipt of a bad update" is that accepting updates from broken routers is bad practice when those updates are being used as the basis for routing in the global Internet. If you leave the session up, and just drop the malformed update, you are then accepting and passing on (assuming you have peers or downstreams) routing updates from a router known to be broken. In terms of the original rule, if you're liberal in what you accept from BGP (i.e. you reject the malformed update, but accept the other updates from the same router), you are also (if you have any peers or downstreams) effectively being liberal (instead of strict) in what you send them. (Sure, you'll be strict about the *formatting* of what you send. But you're being liberal in the sense that you're passing on routes from a known-to-be-broken router.)
I would suspect that during implementation, brand C routers were the victims during testing, and perhaps the change was made to avoid that happening.
ISTM that if that were the case, Brand C would have chosen to reject the update but maintain the session, as opposed to accepting and passing on the update. My guess is that it was just an ordinary bug in the AS_PATH validation code, that resulted in the BGP implementation failing to realize that the update was malformed. The ensuing meltdowns caused by the bug were essentially a problem of the homogeneity of the Internet. The malformed update could only spread from Brand C to Brand C; had there been a lot more diversity in the core of the Internet, the update would probably not have spread as far or had as great an impact.
My suggestion would be, rather than a back-off of resetting BGP sessions, that first attempt strict interpretation (to insulate against completely insane routers), and then loose interpretation. The model is "Fool me once, shame on you, fool me twice, shame on me."
On first receiving a bad update, reset. If upon re-establishing the session, the same bad update is heard, drop the bad update but keep the session up (along with the messages back, etc.)
The potential risk I see is that you are still passing on updates from a router known to be broken. From a purely reactive perspective, we look at past failures and say "when it happened last time, that would have been a good idea, because all the other updates were good". But from a proactive, more general perspective, the receiving router really has no way of knowing just how broken the router on the other end of the link is. I do agree, though, with the observation that this can vary on a case by case basis. For example, a multi-homed end user isn't generally propagating any of its BGP-received routes. So it might make sense in such a case to reject only the malformed packet, because the alternative is to significantly degrade their connectivity over a single routing update. (And there is no offsetting benefit to the core routing fabric of the Internet, because such an end-user isn't really participating in that.) So, yes, having a knob to control the behavior might be a good idea. But I would stop short of saying that everyone in the core should configure that knob to leave the session up with a known-to-be-broken neighbor.
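To make the two positions concrete, here is a minimal sketch of the "fool me once" policy with the knob discussed above. It is not taken from any vendor implementation; the names PeerPolicy, tolerate_repeat, and on_malformed_update are hypothetical, and comparing raw update bytes is only an assumed way of deciding that "the same bad update" was heard again.

```python
from dataclasses import dataclass, field
from enum import Enum


class Action(Enum):
    RESET_SESSION = "reset_session"  # strict: tear the session down
    DROP_UPDATE = "drop_update"      # loose: ignore the update, keep the session


@dataclass
class PeerPolicy:
    """Per-peer policy; all names here are hypothetical, for illustration only."""

    # The "knob": True suits a multi-homed end user that does not propagate
    # routes; False keeps the strict always-reset behavior argued for the core.
    tolerate_repeat: bool = True
    # Malformed updates already seen from this peer. This set is assumed to
    # survive session resets, otherwise "fool me twice" cannot be detected.
    seen_bad: set = field(default_factory=set)

    def on_malformed_update(self, raw_update: bytes) -> Action:
        if self.tolerate_repeat and raw_update in self.seen_bad:
            # Second time the same bad update is heard: drop it, stay up.
            return Action.DROP_UPDATE
        # First time (or strict mode): remember it and reset.
        self.seen_bad.add(raw_update)
        return Action.RESET_SESSION
```

With tolerate_repeat left at True, the first malformed update resets the session and a byte-identical repeat is dropped; with it set to False, every malformed update resets the session.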
Resetting BGP more than a small, finite number of times is, IMHO, a bad idea. After all, BGP is a stateful protocol, and state changes should be triggered deterministically, even if that requires operator input.
Yes, I agree with that also. Dropping a session to a misbehaving peer is a good idea; restarting it immediately after every drop (so you can just drop it again when it misbehaves again) is bad. -- Brett
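The "small, finite number of resets" idea can be sketched as a simple counter that holds the session down until an operator clears it. This is an illustration under assumptions, not a description of any real BGP implementation: RestartLimiter, max_auto_restarts, on_session_drop, and operator_clear are all hypothetical names, and the limit of one automatic restart is an arbitrary choice.

```python
from dataclasses import dataclass


@dataclass
class RestartLimiter:
    """Bounds automatic session restarts; names and defaults are hypothetical."""

    max_auto_restarts: int = 1  # assumed limit, chosen only for illustration
    restarts_used: int = 0
    held_down: bool = False

    def on_session_drop(self) -> bool:
        """Return True if the session may be re-established automatically."""
        if self.held_down or self.restarts_used >= self.max_auto_restarts:
            # Budget exhausted: stay down until an operator intervenes.
            self.held_down = True
            return False
        self.restarts_used += 1
        return True

    def operator_clear(self) -> None:
        """Explicit operator action: allow the session to come back up."""
        self.restarts_used = 0
        self.held_down = False
```

The state change back to "allowed" happens only through operator_clear, which keeps the behavior deterministic rather than timer-driven.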
Brian: Thank you for your 2 cents. I'm gathering all the input until Sunday night. I really appreciate your comments. I'll summarize all the input to the list at that time, and suggest some ideas. I'll try to boil all the input on this problem into a document that I can post to IDR and NANOG. Sue PS - I'm away from email from now until Monday am. Thanks nanog folks!!
At 07:30 PM 1/17/2002 -0500, Dickson, Brian wrote:
Here's my two cents...
A good rule of thumb (possibly from RFC 822) is, be liberal in what you accept and strict in what you send.
When applied to BGP, I would suggest that any implementation should choose a canonical form for constructing updates, but a parser that allows for rule-bending without rule-breaking.
On the issue of existing vendor implementations, and how to build the specs to prevent meltdowns:
I would suspect that during implementation, brand C routers were the victims during testing, and perhaps the change was made to avoid that happening.
The current state of affairs is very much like the classical game-theory "prisoner's dilemma".
The new spec should have two goals - discourage any implementation which can lead to meltdowns, and encourage strict adherence to the spec. The latter can be achieved via the former, in fact, if the mechanisms are well chosen.
My suggestion would be, rather than a back-off of resetting BGP sessions, that first attempt strict interpretation (to insulate against completely insane routers), and then loose interpretation. The model is "Fool me once, shame on you, fool me twice, shame on me."
On first receiving a bad update, reset. If upon re-establishing the session, the same bad update is heard, drop the bad update but keep the session up (along with the messages back, etc.)
One additional optional behaviour I would suggest - look at the AS path and/or path length and/or announcing router IP address. If heard from the originator, drop the session (and either keep it down, or try one more time before requiring operator intervention); it may be the case that only these conditions strictly require a reset, and that all other situations may only require the "ignore bad routes" behaviour.
Resetting BGP more than a small, finite number of times is, IMHO, a bad idea. After all, BGP is a stateful protocol, and state changes should be triggered deterministically, even if that requires operator input.
Brian Dickson Velocita
participants (4): Brett Frankenberger; Dickson, Brian; Susan Hares; Tony Hain