New subject: Time to revise RFC 1771

26 Jun 2001

      This ignores three basic facts:

1) Networks tend to be homogeneous in platform.
2) Platforms tend to accept their own implementation quirks.
3) Networks peer at borders.

Therefore, under the "drop the session" rule, my bad announcement
gets to all my borders fine, and all my external peers who are not
running forgiving/compatible implementations drop their connections
to me, and all my traffic to/from them hits the floor.

One CRC error does not make PPP drop.  Why make one route cause
a catastrophic loss of connectivity?  Report the bad route,
drop it, and move on; let layer 8 resolve it.
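The two policies being argued over can be sketched as follows. This is only an illustration, not any vendor's BGP code: the `Session` class, the `is_malformed` check, and the update dictionaries are all hypothetical stand-ins.

```python
from dataclasses import dataclass, field


@dataclass
class Session:
    """Toy model of one peering session (hypothetical, not a real BGP API)."""
    peer: str
    up: bool = True
    routes: set = field(default_factory=set)
    log: list = field(default_factory=list)


def is_malformed(update):
    # Assumption: a real check would validate the path attributes.
    return update.get("malformed", False)


def handle_drop_session(session, update):
    """Strict policy: one malformed update tears down the whole session."""
    if is_malformed(update):
        session.log.append(f"malformed {update['prefix']}: closing session")
        session.up = False
        session.routes.clear()  # every route learned from this peer is lost
        return
    session.routes.add(update["prefix"])


def handle_drop_route(session, update):
    """Lenient policy: report the bad route, drop it, and move on."""
    if is_malformed(update):
        session.log.append(f"malformed {update['prefix']}: route discarded")
        session.routes.discard(update["prefix"])
        return
    session.routes.add(update["prefix"])


def replay(handler, updates):
    """Feed updates to a fresh session until it goes down."""
    session = Session(peer="example-peer")
    for update in updates:
        if not session.up:
            break
        handler(session, update)
    return session


updates = [
    {"prefix": "10.0.0.0/8"},
    {"prefix": "192.0.2.0/24", "malformed": True},
    {"prefix": "172.16.0.0/12"},
]
strict = replay(handle_drop_session, updates)
lenient = replay(handle_drop_route, updates)
```

Under the strict handler the session ends up down with no routes; under the lenient one it stays up with the two good prefixes and a log entry for the bad one.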

-Dave

On 6/26/2001 at 13:08:46 -0700, Clayton Fiske said:
...
The basic issue is one of scale vs integrity. However, I think this
particular case is one in which the RFC-dictated behavior is the
correct choice. The problem is that one [set of] router[s] did not
follow such behavior and thus escalated the scale of the problem
significantly.

Given that the malformed route in question was most likely originated
from a single router, the only damage that should have been done was
a loss of routability for networks behind that one router. While of
course that could arguably be a significant number of networks, I
think it's a safe assumption that X losing its peers is pretty much
always a smaller impact than all of X's peers losing -their- peers.

If network XYZ's routers have N peers each, the RFC-dictated
behavior gives us N peering sessions lost (assuming the offending
route was advertised to all peers), instead of N^2 (or greater)
sessions as was the case.
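The N versus N^2 comparison can be worked through with a small count. This is a rough sketch under simplifying assumptions that are not in the original message: every router has exactly `n_peers` peers, and when the first-hop peers forward the bad route, each of their own peers drops its session. The function name `sessions_lost` is made up for illustration.

```python
def sessions_lost(n_peers, first_hop_enforces):
    """Count peering sessions dropped because of one malformed route.

    Assumption: every router involved has exactly n_peers peers.
    """
    if first_hop_enforces:
        # The originator's direct peers drop the session and do not
        # forward the route: only the originator's N sessions are lost.
        return n_peers
    # Each of the N direct peers forwards the route, and each then
    # loses sessions to its own N peers: N * N sessions in total.
    return n_peers * n_peers


for n in (5, 10, 50):
    print(n, sessions_lost(n, True), sessions_lost(n, False))
```

With 10 peers per router this gives 10 sessions lost when the first hop enforces the check and 100 when it does not, which is the "N^2 (or greater)" case once further hops are counted.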
I think the logic of dropping the session is sound. If a router
originates one malformed route, who's to say the rest of its routes
are correct? Perhaps other routes are corrupted, but not in ways
detectable by the router's sanity checks. Since the offending route
is indeed malformed, it's not unreasonable to stop trusting the
router from which it originated. Since it's likely[1] only a single
router is originating the route, dropping sessions to that one
router controls the blast radius[2].

This is not to say that the issue of scale is unimportant. It most
certainly is. However, again, if the first router(s) to receive
the route had behaved properly, the scale of the problem would
have been small. The only place you'd see a flap of 100,000
routes is if the offending router was your upstream's. Everyone
else would only see (at most) a flap of the routes originated by
and/or behind that router (in BGP topology terms).

Perhaps a knob to control the behavior would be an acceptable
compromise for some. I think it's a bad idea for two reasons.
First, it allows bugs such as this to go unfixed, because when
it happens people just adjust the knob to keep their BGP sessions
stable. Second, it circumvents the integrity control. If a router
has many corrupted routes, but only a few trigger the sanity
checks for malformation, the session stays alive and the remaining
corrupted routes are then propagated network-wide. While this may
seem like a paranoid philosophy, a little paranoia can be good
when considering the integrity of the larger whole.

-c

[1] = Yes, "likely" is a relative term. I know there are plenty of
      cases where the same route is originated by multiple routers;
      however, the odds of more than one of them corrupting a route
      at the same time are probably slim compared to the odds of
      a single one doing so.
[2] = In this specific case, as I understand it, the direct peers did
      in fact drop the offending BGP session, however they propagated
      the offending announcement to their peers before doing so. In
      this case, of course, the blast radius is not controlled.
-- 
Dave Israel
Senior Manager, IP Backbone
Intermedia Business Internet
