Tony Li <tli@jnx.com> wrote:
I.e. no amount of routing updates should cause false link flapping due to delayed keepalive messages. The same is true for keepalive messages vs routing updates on a link.
If you have strict priority of keepalives over routing (and we should distinguish between hellos and updates for those protocols where you can tell the difference), you can (theoretically) starve the routing updates. Thus, equal priority makes sense.
Well, keepalives cannot starve routing protocols, simply because they're pretty much rate-limited. A scenario where the routing protocol starves keepalives is clearly disastrous -- you get spurious line flaps, which produce even more routing updates, and the network collapses. On the other hand, insufficient capacity to handle routing updates is ok, as it does not have that kind of positive feedback. The routing-level hellos are useful only to find out whether the routing process is still running and/or to resynchronize protocols. They are certainly too slow to handle regular link failures. (So there's a need for a gateway-to-gateway keepalive protocol over shared media like Ethernet or another trendy madness.)
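The starvation argument above can be illustrated with a toy model. This is only a sketch under stated assumptions: the link sends one packet per tick, a keepalive becomes due every `keepalive_every` ticks and is sent with strict priority, and everything else goes to a backlog of routing updates. The function name and parameters are hypothetical, not taken from any router implementation.

```python
def simulate(ticks, keepalive_every, update_backlog):
    """Toy strict-priority link: one packet per tick (assumption).

    A keepalive is due every `keepalive_every` ticks and always wins the
    slot; any other slot carries a routing update if one is queued.
    Returns (keepalives_sent, updates_sent).
    """
    keepalives_sent = 0
    updates_sent = 0
    for t in range(ticks):
        if t % keepalive_every == 0:
            # Keepalive is due: strict priority, it takes this slot.
            keepalives_sent += 1
        elif update_backlog > 0:
            # Slot is free of keepalives: drain one routing update.
            update_backlog -= 1
            updates_sent += 1
    return keepalives_sent, updates_sent


# Even with a backlog far larger than the link can carry, updates still
# get every slot that a keepalive does not need.
print(simulate(ticks=100, keepalive_every=5, update_backlog=1000))
```

Because the keepalive rate is fixed by its timer, its share of the link is bounded no matter how large the update backlog grows, which is why strict priority for keepalives does not starve updates in this model.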
d) sub-second keepalive intervals -- this is probably the only method to discover _fast_ that the remote end is dead. As it is now, it takes 30 sec or so for a local router to find out that the remote one is wedged and to take appropriate action.
Do you really want that? Given your comments in b, I would think that you would want to ride out a 1-second outage. I also wouldn't be thrilled about the overhead involved given much more precision. Note that what you said originally was about ping, which is slightly different.
Well, a packet per line per, say, 200 msec is not much. OTOH, you want to avoid connectivity losses longer than 0.5-1 sec. To provide reasonable redundancy you may want to wait for 2 missing keepalives in a row or so before deciding that the link's down. I.e. the hold-down interval determines the maximal rate of link-status updates as seen from routing protocols; the keepalive interval determines the maximal delay between a link failure and the reaction of the routing system. If sub-1sec recovery is the goal, and IGP convergence is on the order of 500 msec, that leaves only 500 msec for the keepalives, or a 250 msec keepalive interval. It would also be a good idea to wait for, say, 20 consecutive good keepalives before deciding that the link is up. Cisco brings the line up at the very first good keepalive (at least it appears so).

--vadim
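The timing budget and the asymmetric up/down thresholds from the message above can be sketched as follows. This is an illustration only: the function, class, and parameter names are hypothetical, and the numbers (1 sec recovery goal, 500 msec IGP convergence, 2 missed keepalives to go down, 20 good keepalives to come up) are the ones suggested in the message, not measured values.

```python
def keepalive_interval_ms(recovery_budget_ms, igp_convergence_ms, missed_to_down):
    """Detection time left after IGP convergence, split across the
    number of consecutive missed keepalives required to declare 'down'."""
    detection_budget_ms = recovery_budget_ms - igp_convergence_ms
    return detection_budget_ms / missed_to_down


class LinkMonitor:
    """Hysteresis on link state: quick to go down, slow to come up."""

    def __init__(self, missed_to_down=2, good_to_up=20):
        self.missed_to_down = missed_to_down
        self.good_to_up = good_to_up
        self.up = False      # start pessimistic: link must prove itself
        self.missed = 0      # consecutive missed keepalives
        self.good = 0        # consecutive good keepalives

    def observe(self, received):
        """Feed one keepalive interval's outcome; return current state."""
        if received:
            self.missed = 0
            self.good += 1
            if not self.up and self.good >= self.good_to_up:
                self.up = True
        else:
            self.good = 0
            self.missed += 1
            if self.up and self.missed >= self.missed_to_down:
                self.up = False
        return self.up


print(keepalive_interval_ms(1000, 500, 2))  # 250.0

mon = LinkMonitor()
states = [mon.observe(True) for _ in range(20)]
print(states[18], states[19])  # False True: up only after 20 in a row
```

A single lost keepalive does not take the link down, while a link that has just recovered has to stay clean for 20 intervals (5 sec at 250 msec) before the routing protocols see it as up, which limits how fast a marginal line can flap.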