Re: Keepalives, WAS: NAP/ISP Saturation
Tony Li <tli@jnx.com> wrote:
No flap dampening, no hold-down "blackholing" after a failure (so as not to generate route withdrawals for transient link outages), silly priority and no sub-second ping intervals, and forget about LQM).
None of these have anything to do with the link keepalive protocol and everything to do with internal link implementation. Let's not confuse the issue.
Well, let us see:

a) dampening -- it makes a lot of sense to flap-dampen at the circuit level, where it is cheap and efficient, and not pass the flap to routing protocols, where it is a lot more expensive to process. I.e. if a DS3 CSU lost its marbles, there's no reason to recompute some 20K routes dependent on that particular circuit every 60 or so seconds.

b) blackholing -- if a circuit went down it makes sense to wait for some time (0.1 sec or so) and just drop traffic on the floor, in hope that the outage is transient. There are a lot of momentary carrier losses and other glitches in the telco transmission networks. Only when the outage is prolonged does it make sense to notify the routing level.

c) priority -- the link keepalive processes must have priorities _higher_ than that of routing protocols. I.e. no amount of routing updates should cause false link flapping due to delayed keepalive messages. The same is true for keepalive messages vs routing updates on a link.

d) sub-second keepalive intervals -- this is probably the only method to discover _fast_ that the remote end is dead. The way it is now, it takes 30 sec or so for a local router to find out that the remote one is wedged and take appropriate action.

e) Link Quality Monitoring -- the usefulness is obvious. For most link-level failures there are sufficient advance warnings (corrupted checksums, etc). Also, there are a lot of things (like "stealth" rerouting by the transmission fabric) which in some cases make a circuit worse than a disconnected one. One particular case i have in mind suddenly increased link latency by some 400 ms (Satellites-R-Us, that's it), so manual intervention was required to move traffic off the link. In any case, some automatic LQM shut-offs (on conditions like "latency is more than N ms" or "error frequency is higher than 1:10e6") are clearly in order.

So i think the things i noted are rather relevant, and are necessary if we're going to build a real production network.
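The hold-down in (b) can be sketched roughly as below. This is only an illustration under stated assumptions, not how any particular router does it: the class `HoldDownFilter`, its methods, and the "link-down"/"link-up" strings are hypothetical names, and timestamps are passed in by the caller rather than read from a clock. The 0.1 second default is the figure from the post.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class HoldDownFilter:
    """Hide carrier losses shorter than hold_down seconds from the routing layer.

    Hypothetical sketch: names and event strings are assumptions.
    """
    hold_down: float = 0.1               # seconds ("0.1 sec or so" in the post)
    down_since: Optional[float] = None   # when carrier was lost, if it is lost now
    reported_down: bool = False          # has routing been told the link is down?

    def poll(self, now: float) -> List[str]:
        """Call periodically; reports link-down once the hold-down has expired."""
        if (self.down_since is not None and not self.reported_down
                and now - self.down_since >= self.hold_down):
            self.reported_down = True
            return ["link-down"]
        return []

    def on_carrier(self, now: float, up: bool) -> List[str]:
        """Feed a carrier transition; returns notifications for the routing layer."""
        events = self.poll(now)  # flush a hold-down that expired before this event
        if up:
            if self.reported_down:
                events.append("link-up")
            self.down_since = None
            self.reported_down = False
        elif self.down_since is None:
            self.down_since = now  # start the hold-down; traffic is dropped meanwhile
        return events
```

With this sketch, a carrier loss at t=0.0 followed by recovery at t=0.05 produces no notification at all, while a loss that is still present when `poll` runs 0.1 seconds or more later produces a single "link-down".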
I'm sorry if the link keepalive protocol digression confused the original discussion of security of the routing system.

--vadim
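The automatic LQM shut-off conditions in (e) amount to a threshold check, which might look something like the following. The names `LqmThresholds` and `should_shut_off` are hypothetical, and the thresholds are left as parameters because the post gives only "N ms" for latency and leaves the exact error ratio to the operator.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LqmThresholds:
    """Operator-chosen limits (hypothetical structure, not from any real LQM spec)."""
    max_latency_ms: float   # the "N ms" from the post
    max_error_rate: float   # errored frames divided by total frames


def should_shut_off(latency_ms: float, errored: int, total: int,
                    limits: LqmThresholds) -> bool:
    """Return True when measured link quality violates either threshold."""
    # With no frames counted yet there is no evidence of errors.
    error_rate = errored / total if total else 0.0
    return (latency_ms > limits.max_latency_ms
            or error_rate > limits.max_error_rate)
```

For example, with an assumed limit of 250 ms, a link whose latency jumps by some 400 ms (as in the satellite reroute case) would trip the check even though its error counters stay clean.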
> Well, let us see:
>
> a) dampening -- it makes a lot of sense to flap-dampen at the circuit level, where it is cheap and efficient, and not pass the flap to routing protocols, where it is a lot more expensive to process. I.e. if a DS3 CSU lost its marbles, there's no reason to recompute some 20K routes dependent on that particular circuit every 60 or so seconds.

No argument. But this should not be part of the protocol per se. This should simply be the "device driver", if you will.

> b) blackholing -- if a circuit went down it makes sense to wait for some time (0.1 sec or so) and just drop traffic on the floor, in hope that the outage is transient. There are a lot of momentary carrier losses and other glitches in the telco transmission networks. Only when the outage is prolonged does it make sense to notify the routing level.

This is certainly already taken care of by any of the keepalive protocols that I know of. The real question is what the device driver does with loss of carrier.

> c) priority -- the link keepalive processes must have priorities _higher_ than that of routing protocols.

I dunno about that one. I would argue that equal to the keepalives of the protocols is sufficient. Neither should use enough bandwidth to out-compete the other.

> I.e. no amount of routing updates should cause false link flapping due to delayed keepalive messages. The same is true for keepalive messages vs routing updates on a link.

If you have strict priority of keepalives over routing (and we should distinguish between hellos and updates for those protocols where you can tell the difference), you can (theoretically) starve the routing updates. Thus, equal priority makes sense.

> d) sub-second keepalive intervals -- this is probably the only method to discover _fast_ that the remote end is dead. The way it is now, it takes 30 sec or so for a local router to find out that the remote one is wedged and take appropriate action.

Do you really want that? Given your comments in b, I would think that you would want to ride out a 1 second outage. I also wouldn't be thrilled about the overhead involved given much more precision. Note that what you said originally was about ping, which is slightly different.

> e) Link Quality Monitoring -- the usefulness is obvious. For most link-level failures there are sufficient advance warnings (corrupted checksums, etc). Also, there are a lot of things (like "stealth" rerouting by the transmission fabric) which in some cases make a circuit worse than a disconnected one. One particular case i have in mind suddenly increased link latency by some 400 ms (Satellites-R-Us, that's it), so manual intervention was required to move traffic off the link. In any case, some automatic LQM shut-offs (on conditions like "latency is more than N ms" or "error frequency is higher than 1:10e6") are clearly in order.

This is certainly relevant.

> So i think the things i noted are rather relevant, and are necessary if we're going to build a real production network.

Seems reasonable to me.

Tony
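The starvation argument under (c) can be illustrated with a toy scheduler. This is a sketch under assumptions, not a model of any real router: the function names are hypothetical, each "slot" stands for one transmit opportunity, and round-robin is used here as one possible reading of "equal priority". The keepalive queue is deliberately kept backlogged, which is the theoretical case rather than the expected one.

```python
from collections import deque
from typing import Deque, List


def strict_priority(keepalives: Deque[str], updates: Deque[str],
                    slots: int) -> List[str]:
    """Serve keepalives whenever any are queued; updates get only leftover slots."""
    sent: List[str] = []
    for _ in range(slots):
        if keepalives:
            sent.append(keepalives.popleft())
        elif updates:
            sent.append(updates.popleft())
    return sent


def equal_priority(keepalives: Deque[str], updates: Deque[str],
                   slots: int) -> List[str]:
    """Alternate between the two queues, skipping whichever one is empty."""
    sent: List[str] = []
    queues = [keepalives, updates]
    turn = 0  # index of the queue that is offered the next slot first
    for _ in range(slots):
        for offset in range(2):
            queue = queues[(turn + offset) % 2]
            if queue:
                sent.append(queue.popleft())
                turn = (turn + offset + 1) % 2  # the other queue goes first next
                break
    return sent
```

With four keepalives and four updates queued and four slots available, the strict version sends only keepalives, while the alternating version sends two of each.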
participants (2): Tony Li, Vadim Antonov