Re: Time to revise RFC 1771
This ignores three basic facts:
1) Networks tend to be homogenous in platform.
2) Platforms tend to accept their own implementation quirks.
3) Networks peer at borders.
Therefore, under the "drop the session rule," my bad announcement gets to all my borders fine, and all my external peers who are not running forgiving/compatible implementations drop their connections to me and all my traffic to/from them hits the floor.
One CRC error does not make PPP drop. Why make one route cause a catastrophic loss of connectivity? Report the bad route, drop it, and move on; let layer 8 resolve it.
-Dave
On 6/26/2001 at 13:08:46 -0700, Clayton Fiske said:
The basic issue is one of scale vs integrity. However, I think this particular case is one in which the RFC-dictated behavior is the correct choice. The problem is that one [set of] router[s] did not follow such behavior and thus escalated the scale of the problem significantly.
Given that the malformed route in question was most likely originated from a single router, the only damage that should have been done was a loss of routability for networks behind that one router. While of course that could be arguably a significant number of networks, I think it's a safe assumption that X losing its peers is pretty much always a smaller impact than all of X's peers losing -their- peers. If network XYZ's routers have N peers each, the RFC-dictated behavior gives us N peering sessions lost (assuming the offending route was advertised to all peers), instead of N^2 (or greater) sessions as was the case.
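To put rough numbers on the N versus N^2 estimate above, here is a minimal Python sketch. It assumes an idealized tree-like topology in which every router has exactly N peers and no two routers share a peer; the function names and the `hops` parameter are made up for illustration and do not come from the thread or from any BGP implementation.

```python
def sessions_lost_compliant(n_peers: int) -> int:
    """RFC-style behavior: only the originating router's own sessions drop."""
    return n_peers


def sessions_lost_forward_then_drop(n_peers: int, hops: int = 2) -> int:
    """Each receiver forwards the bad route to its other peers, then drops
    the session it arrived on. Assumes a tree: no shared peers, no loops."""
    lost = 0
    frontier = 1  # routers currently announcing the bad route (the originator)
    for hop in range(hops):
        # The originator announces to all N peers; every later router
        # forwards to its N - 1 peers other than the one it heard it from.
        fanout = n_peers if hop == 0 else n_peers - 1
        lost += frontier * fanout
        frontier *= fanout
    return lost


print(sessions_lost_compliant(10))          # 10
print(sessions_lost_forward_then_drop(10))  # 100, i.e. N + N*(N-1) = N^2
```

With two hops the count comes out to exactly N^2 under these assumptions; each additional hop of forward-before-drop routers multiplies the damage again, which is the "or greater" case.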
I think the logic of dropping the session is sound. If a router originates one malformed route, who's to say the rest of its routes are correct? Perhaps other routes are corrupted, but not in ways detectable by the router's sanity checks. Since the offending route is indeed malformed, it's not unreasonable to stop trusting the router from which it originated. Since it's likely[1] only a single router is originating the route, dropping sessions to that one router controls the blast radius[2].
This is not to say that the issue of scale is unimportant. It most certainly is. However, again, if the first router(s) to receive the route had behaved properly, the scale of the problem would have been small. The only place you'd see a flap of 100,000 routes is if the offending router was your upstream's. Everyone else would only see (at most) a flap of the routes originated by and/or behind that router (in BGP topology terms).
Perhaps a knob to control the behavior would be an acceptable compromise for some. I think it's a bad idea for two reasons. First, it allows bugs such as this to go unfixed, because when it happens people just adjust the knob to keep their BGP sessions stable. Second, it circumvents the integrity control. If a router has many corrupted routes, but only a few trigger the sanity checks for malformation, the session stays alive and the remaining corrupted routes are then propagated network-wide. While this may seem like a paranoid philosophy, a little paranoia can be good when considering the integrity of the larger whole.
-c
[1] = Yes, "likely" is a relative term. I know there are plenty of cases where the same route is originated by multiple routers, however the odds of more than one of them corrupting a route at the same time are probably slim compared to the odds of a single one doing so.
[2] = In this specific case, as I understand it, the direct peers did in fact drop the offending BGP session, however they propagated the offending announcement to their peers before doing so. In this case, of course, the blast radius is not controlled.
--
Dave Israel
Senior Manager, IP Backbone
Intermedia Business Internet
On Tue, Jun 26, 2001 at 04:27:49PM -0400, Dave Israel wrote:
This ignores three basic facts:
1) Networks tend to be homogenous in platform.
2) Platforms tend to accept their own implementation quirks.
3) Networks peer at borders.
Therefore, under the "drop the session rule," my bad announcement gets to all my borders fine, and all my external peers who are not running forgiving/compatible implementations drop their connections to me and all my traffic to/from them hits the floor.
In this case, vendor C's implementation was neither forgiving nor compatible. It still dropped the peer(s) in question. It just had the much more harmful quirk that it forwarded the bad route on to its peers before doing so. In this case, a homogenous network would not only lose its border sessions, it would lose all internal ones through which the route was advertised.
One CRC error does not make PPP drop. Why make one route cause a catastrophic loss of connectivity? Report the bad route, drop it, and move on; let layer 8 resolve it.
Because, arguably, we don't know that it's just one route. We just know that one route set off the alarm. Do you feel safe assuming that whatever bug caused one corrupted route left all the other routes alone?
Plus, a CRC error can occur between two valid, compliant, bug-free implementations. A bad route, by definition, can't. We're not talking about external faults here, but broken implementations. When one side of a protocol session simply breaks the rules, I don't think it's reasonable to say that the other side needs to be "fixed" to accept that breakage. Fix the broken side.
The reason this has got everyone's attention is because of the unique way in which the breakage occurred. If all implementations were changed to drop the single bad route and keep the sessions intact, the damage would not have been what it was. If all implementations followed the current specs and dropped the session with the router which first originated the bad route, the damage would not have been what it was. To say that one way causes massive damage and the other doesn't is inaccurate. The damage was caused by the implementation in question doing something resembling one but with harmful behavior thrown in.
-c
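The three behaviors compared in this thread (drop the session, drop only the route, forward the route and then drop the session) can be contrasted with a toy simulation. This is only a sketch under stated assumptions: the topology, the router names, and the policy labels are hypothetical, and the model tracks session teardowns for a single malformed announcement rather than anything resembling real BGP processing.

```python
from collections import deque

# Hypothetical topology: X originates the bad route; A and B are its peers.
TOPOLOGY = {
    "X": ["A", "B"],
    "A": ["X", "A1", "A2"],
    "B": ["X", "B1", "B2"],
    "A1": ["A"], "A2": ["A"],
    "B1": ["B"], "B2": ["B"],
}


def simulate(adjacency, origin, policy):
    """Return the set of sessions torn down after `origin` announces one
    malformed route. `policy` maps each router to one of:
    "drop_session", "drop_route", "forward_then_drop"."""
    dropped = set()
    forwarded = set()  # routers that have already passed the route along
    queue = deque((origin, peer) for peer in adjacency[origin])
    while queue:
        sender, receiver = queue.popleft()
        session = frozenset((sender, receiver))
        if receiver == origin or session in dropped:
            continue
        behavior = policy[receiver]
        if behavior == "drop_route":
            continue  # discard the route, keep the session up
        if behavior == "forward_then_drop" and receiver not in forwarded:
            forwarded.add(receiver)
            queue.extend((receiver, peer)
                         for peer in adjacency[receiver] if peer != sender)
        dropped.add(session)  # both remaining behaviors tear the session down
    return dropped


everyone = lambda behavior: dict.fromkeys(TOPOLOGY, behavior)
mixed = everyone("drop_session")
mixed["A"] = mixed["B"] = "forward_then_drop"

print(len(simulate(TOPOLOGY, "X", everyone("drop_session"))))  # 2
print(len(simulate(TOPOLOGY, "X", everyone("drop_route"))))    # 0
print(len(simulate(TOPOLOGY, "X", mixed)))                     # 6
```

In this toy model both uniform policies contain the damage (two sessions lost, or none), while the mixed case, where the first-hop routers forward before dropping, takes down every session in the topology. It does not model the other concern raised above, namely undetected corruption in routes that pass the sanity checks.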