Routing System Scaling - Disaster Looming, but Medium-Term Fixes Known
It's very simple: ANYONE can build a multi-Gpps router which can handle a million or two static routes. Hell, if the Internet were completely undynamic, a lot of things would be much easier.

Likewise, anyone can build a 0 pps router that can handle an order of magnitude more dynamism in the global routing system than we observe today.

Finally, anyone can build a multi-Gpps router that can handle heavy dynamism in a global routing table consisting of a dozen prefixes.

The Internet: bigger, faster, more dynamic. Choose two.

One usually sees "cheaper" as one of the terms in the "choose two" joke. In this case, it's implicit, since the total price of a large router is greater than the sum of the (increasingly expensive) costs of providing the oomph necessary to do each of these three things.

So, there are several Deaths of the Internet which are possible:

  -- it's too expensive to keep up with growth, so utilization falls off, since nobody wants to pay the real bills

  -- we blow up on one of the scaling axes

      -- too many routes: poof, we run out of memory
         routers are unhappy when out of memory...
         (ok, more memory is cheap for static storage, but ever-faster memory seems to be a general problem)

      -- too much dynamism: poof, our memory isn't fast enough, or the distribution of stuff via BGP/TCP isn't fast enough, or we have no CPU cycles to generate tables fast enough, or our communication among routing-process CPUs and forwarding engine isn't fast enough, etc...
         everyone is unhappy when it takes minutes, hours, or DAYS to see the end of blackholes/loops/other effects of incomplete convergence after a change
         (ok, everyone knows about route damping, yes? http://www.ripe.net/docs/ripe-210.html )

  -- we blow up because while we can handle some of the technical aspects of routing system complexity, we lose on human factors

      -- inefficient/broken routing: we don't have the power needed to apply decent policy to the routes we learn from elsewhere, which is already seen in the form of route leakage, inadvertent transit, etc., and we don't have the skills to quickly find & fix problems that have global effect

The reaction to scale driven by increasing numbers of ever-longer prefixes should not be a panicky turn to something radical, or sky-is-falling moaning.

It has been demonstrated (by me) that there is little reachability lost when applying prefix-length filtering, or when applying progressive flap-damping (e.g., if you flap something longer than a /24, the route's gone from my tables until tomorrow). Dennis Ferguson's arguments for maximum-prefix limits from peers are also very interesting, and have gotten some testing in the real world.

Ultimately the problem here is that this just makes introducing ever-longer prefixes harder, but does not expose a real cost to someone who is thinking of doing this.

If various people, or someone sufficiently large, ever get around to implementing the following idea, I think people may be more inclined to renumber and/or be aggregated, rather than try to introduce "exception routes" into the global routing system.

What I wish I had finished @ Sprint: a web-based form that lets one arrange an exception to access-list 112 for some small fee, like $50/month/change. For your 50 bucks, you see no filtering on one of your long prefixes for a whole month (or year or whatever).

These days it may be that your $50/month will make your long prefix less likely to be damped out of the routing table for a long time. However, there is scope for further innovation, such as charging only $5 to clear your prefix out of the flap-damping penalty box of one router.
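To make the mechanics concrete, here is a rough Python sketch of the two knobs being priced above: a prefix-length filter with paid exceptions, and a flap-damping penalty box that can be cleared for a fee. It is an illustration only, not what access-list 112 or any router actually does. Every name and number in it (MAX_PREFIX_LEN, the penalty and half-life values, accept_prefix, DampingState, pay_to_clear) is made up for the example, and the damping is the generic penalty-with-exponential-decay scheme rather than the progressive per-length parameters of ripe-210.

```python
import ipaddress

# All constants are illustrative assumptions, not values from any real config.
MAX_PREFIX_LEN = 24        # hypothetical filter boundary: drop anything longer
PENALTY_PER_FLAP = 1000.0  # penalty added each time the route flaps
SUPPRESS_LIMIT = 2000.0    # at or above this, the route is suppressed
REUSE_LIMIT = 750.0        # below this, a suppressed route is usable again
HALF_LIFE_S = 900.0        # penalty halves every 15 minutes


def accept_prefix(prefix, paid_exceptions=frozenset()):
    """Prefix-length filter with a paid-exception list (hypothetical)."""
    net = ipaddress.ip_network(prefix)
    return net.prefixlen <= MAX_PREFIX_LEN or net in paid_exceptions


class DampingState:
    """Per-prefix flap-damping state: a penalty that decays exponentially."""

    def __init__(self):
        self.penalty = 0.0
        self.last_update = 0.0
        self.suppressed = False

    def _decay(self, now):
        # Halve the penalty once per HALF_LIFE_S of elapsed time.
        elapsed = now - self.last_update
        self.penalty *= 0.5 ** (elapsed / HALF_LIFE_S)
        self.last_update = now
        if self.suppressed and self.penalty < REUSE_LIMIT:
            self.suppressed = False

    def flap(self, now):
        """Record one flap at time `now` (seconds)."""
        self._decay(now)
        self.penalty += PENALTY_PER_FLAP
        if self.penalty >= SUPPRESS_LIMIT:
            self.suppressed = True

    def usable(self, now):
        """True if the route is not currently in the penalty box."""
        self._decay(now)
        return not self.suppressed

    def pay_to_clear(self):
        """The '$5 to get out of the penalty box' idea: reset the state."""
        self.penalty = 0.0
        self.suppressed = False


if __name__ == "__main__":
    paid = frozenset({ipaddress.ip_network("192.0.2.128/25")})
    print(accept_prefix("192.0.2.128/25"))        # filtered by length
    print(accept_prefix("192.0.2.128/25", paid))  # exception was paid for
```

The point of the sketch is only that both the filter and the penalty box are a few lines of per-prefix state, so hanging a billing hook off either one is not the hard part.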
This revenue stream would surely pay for a lot of the costs of upgrading routers with more memory, faster CPUs, more CPUs, faster memory, better code, etc. to deal with the increase in the number of routes and the dynamism of the current Internet's global routing system. It might even fund the switchover to something better than the IDR architecture we have now.

Sean.
smd@clock.org