The Internet *FAILED* in 1994. The number of prefixes carried globally exceeded the ability of the global routing system to carry them, and as a result, some parts of the world deliberately summarized away whole large-scale ISPs merely to survive.

The Internet *FAILED* again in 1996, when the dynamism in the global routing system exceeded the ability of many border routers to handle it, and as a result, some networks (not just core ones) deliberately made the Internet less quick to adjust to changes in topology.

Thus, the routing system FAILED twice: once because of memory, once because of processor power. If the size or the dynamism of the global routing system grows for a sustained period faster than the price/performance curve of EITHER memory OR processing power, the Internet will FAIL again. I do mean the Internet, and not just some pieces of it. The old saw that "the Internet detects damage and routes around it" simply isn't true when your routing system is the thing that isn't working.

It was unpleasant both times. Three continents were effectively isolated from one another for a couple of days while a response to the memory crisis was organized. Three of the largest ISPs at the time were crippled on and off for days during the processing-power crisis, and even after mechanisms were put in place, relatively unimportant bugs destabilized the entire Internet from time to time.

Note that the processor-power issue is the one that has been scariest, since it has produced small-scale failures fairly regularly. Things like selective packet drop *exist* because of the price/performance curve and the engineering gap involved in deploying and USING more processing power.

So, Moore's Law, or more specifically the underlying curve which tracks the growth of useful computational power, is exactly what we should compare with the global routing system's growth curve. Note that when Moore is doing better than the Internet, it allows either a cheaper supply of dynamic connectivity or the deployment of more complex handling of the global NLRI.

The major problem, as you have pointed out, is that the processing requirement is often bursty, such as when everyone is trying to do a dynamic recovery from a router crash or a major line fault. We could still use 68030s in our core routers; it's just that a global partition repair would take a lot longer than it used to, which means your TCP sessions or your patience would probably time out a lot more frequently.

| I think that what we need to do is have a fourth group, call them Internet
| Engineers for lack of a better word, come in and determine what the sign
| should read.

Structures built according to best known engineering practices still fall down from time to time. That's the problem with anticipating unforeseen failures. Consider yourself lucky that you haven't had to experience a multi-day degradation (or complete failure!) of service due to resource constraints, and that you haven't run into a sign that says: "please note: if you try to have an automatic partition repair in the event this path towards {AS SET} fails, your local routing system will destabilize".

| Finally, we have a sixth group, call them the IETF, come in
| and invent a flying car that doesn't need the bridge at all.

As Randy (with his IETF hat) says: "send code".

	Sean.
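
P.S. Since Randy says "send code": for concreteness, here is a minimal sketch of the penalty-based flap-damping idea behind "making the Internet less quick to adjust". The class name and all the constants are my own illustrative assumptions, not any router vendor's actual implementation or defaults.

    # Sketch of penalty-based route flap damping.  Every flap adds a
    # penalty; the penalty decays exponentially over time; a route whose
    # penalty crosses the suppress threshold stops being advertised until
    # the penalty decays back below the reuse threshold.
    # All names/values below are illustrative assumptions.
    import math
    import time

    FLAP_PENALTY = 1000.0        # penalty added per flap (assumed)
    SUPPRESS_THRESHOLD = 2000.0  # stop advertising above this (assumed)
    REUSE_THRESHOLD = 750.0      # resume advertising below this (assumed)
    HALF_LIFE = 900.0            # seconds for the penalty to halve (assumed)

    class DampedRoute:
        def __init__(self):
            self.penalty = 0.0
            self.last_update = time.time()
            self.suppressed = False

        def _decay(self):
            # Exponential decay: penalty halves every HALF_LIFE seconds.
            now = time.time()
            elapsed = now - self.last_update
            self.penalty *= math.exp(-math.log(2) * elapsed / HALF_LIFE)
            self.last_update = now

        def flap(self):
            # Called each time the route is withdrawn or re-announced.
            self._decay()
            self.penalty += FLAP_PENALTY
            if self.penalty > SUPPRESS_THRESHOLD:
                self.suppressed = True

        def usable(self):
            # True if the route may currently be advertised to peers.
            self._decay()
            if self.suppressed and self.penalty < REUSE_THRESHOLD:
                self.suppressed = False
            return not self.suppressed

The exponential decay is the important part of the trade: a prefix that flaps once recovers quickly, while one that flaps persistently stays suppressed, deliberately sacrificing convergence speed to buy back processor headroom.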