---------- Forwarded message ----------
Date: Fri, 28 Jun 1996 17:42:30 -0400 (EDT)
From: Brian Tao <taob@io.org>
Reply-To: inet-access@earth.com
To: inet-access@earth.com
Subject: NETCOM downtime a programming error. (fwd)
Resent-Date: Fri, 28 Jun 1996 16:45:28 -0500 (CDT)
Resent-From: inet-access@earth.com

Not sure where this first showed up, but it sounds like some sort of trade publication...

---------- Forwarded message ----------

IA: Take us through what happened...

GARRISON: Think about the network in three layers. The first layer is Network Access Points, where we have peer agreements with other carriers. It's the entry point to the Internet. (There are about a half dozen of them across the U.S.) The next level down is the hubs, which are our internal virtual private network routing hubs that look at traffic and direct it along the speediest line available. The third level down is where the customer actually logs on, at an access POP (Point Of Presence).

At each of those levels there are routers made by Cisco and others that have instruction tables on them -- "IF THEN" statements that tell the traffic where to go, what route to take to get to its destination. At the network access level, you have some pretty complex code that says, 'If the traffic comes from this party, then do the following thing with it.' And because of the number of new access providers or changes in the access providers, there are daily changes made at the network access layer. And these are changes that are made in software to the routers. It's done in a language called BGP, or Border Gateway Protocol.

So, there was one line of code that said, literally, "No redist bgp access list 25 in," just a line of code that revised an instruction. Because the two sentences were put together as opposed to being done on separate lines, the network read it as an "AND" statement instead of an "OR" or an "IF" statement.

So, what happens is the network automatically replicates the instruction set from the network access point where this was entered, which was Washington, DC, and it replicated itself to the other network access points. Because of the way the code was written, it then said, 'ah hah, it's a network instruction, not a peering instruction -- I'd better send it out to the hubs.' The hubs saw it, and said, 'ah hah, I'd better send it out to the POPs.'

Well, the POPs -- the routers at the lower levels of the network -- do not have the memory or capacity for the peering instructions because they don't interface with anybody else, so they don't need that capacity. So, when they got it, it basically froze the routers down at the third level of the network.

In the meantime, we're sitting reprogramming the routers, but as fast as we can reprogram them, the replication feature of the intelligent network overwhelms our ability to reprogram. Basically our decision was to shut down the network to reboot the routers, to put in a fresh instruction set.

That's a long-winded explanation, but because your readers are more technical, it's worthwhile!

============================== ISP Mailing List ==============================
Email ``unsubscribe'' to inet-access-request@earth.com to be removed.
inet-access archives are at ftp://ftp.earth.com/pub/archive/inet-access/
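
The cascade Garrison describes (an instruction entered at one network access point, replicated to the other access points, then pushed down to the hubs and the POPs, where the routers lack room for it) can be illustrated with a toy simulation. This is only a rough sketch of the story as told in the interview, not a model of how BGP or any Cisco router actually behaves. The `Router` class, the router names, and the capacity numbers are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Router:
    """Hypothetical router with a fixed-size instruction table."""
    name: str
    tier: str          # "nap", "hub", or "pop"
    capacity: int      # assumed: how many peering instructions fit in memory
    table: list = field(default_factory=list)
    frozen: bool = False

    def receive(self, instruction: str) -> None:
        """Store the instruction, or freeze if there is no room for it."""
        if self.frozen:
            return
        if len(self.table) >= self.capacity:
            self.frozen = True
            return
        self.table.append(instruction)


def replicate(instruction, tiers):
    """Push one instruction down tier by tier; return names of frozen routers."""
    frozen = []
    for tier in tiers:
        for router in tier:
            router.receive(instruction)
            if router.frozen:
                frozen.append(router.name)
    return frozen


def build_network():
    """Three tiers as in the interview; all names and sizes are made up."""
    naps = [Router("nap-dc", "nap", 100), Router("nap-west", "nap", 100)]
    hubs = [Router("hub-1", "hub", 50)]
    # POPs do not peer with anyone, so assume no room for peering instructions.
    pops = [Router("pop-1", "pop", 0), Router("pop-2", "pop", 0)]
    return [naps, hubs, pops]


if __name__ == "__main__":
    print(replicate("peering-rule", build_network()))  # ['pop-1', 'pop-2']
```

In this sketch only the third tier freezes, which matches the account above: the access points and hubs accept the instruction, while the POP routers have no capacity for it.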