Recent MBONE Events

4 Dec 1992

      ANSNET/NSFNET Operational Experiences with MBONE

                         Jordan Becker, ANS
                         Mark Knopper, Merit

During the weeks of 11/9 and 11/16, there were a number of operational
problems during the preparation and actual operation of the IETF MBONE
packet video/audiocast.  This note summarizes the problems observed as
we currently understand them, describes the corrective actions that were
taken, and offers some recommendations for avoiding similar problems
during future MBONE video/audiocasts.

The use of loose source route packets and the large volume of MBONE
traffic appear to have caused fairly widespread problems for several
Internet service providers.  However, the volume of MBONE traffic and
source route optioned packets did not seem to adversely affect the
ANSNET/NSFNET, as was earlier believed.  There were severe routing
instabilities with peer networks at several ANSNET/NSFNET border
gateways including E128 (Palo Alto), E144 (FIX-W), E145 (FIX-E) and
most notably at E133 (Ithaca) due to the MBONE traffic and processing
of source route packets.  The instability in these peer networks coupled
with inefficient handling of very large and frequent routing changes
introduced through EGP resulted in some ANSNET/NSFNET instabilities. 
Networks carrying MBONE traffic frequently stopped being advertised by
external peers, and were timed out by the ENSS.  The external peer then
stabilized and these networks were then advertised to the ENSS by the
external peer soon thereafter.  This process repeated itself in a cyclical
fashion.  This seems to have resulted in the following problems as
recorded in our BGP/EGP logs on the ENSS and neighboring CNSS
routers: 

(1)  The general flapping of routing in the Internet was keeping the
     ANSNET routers fairly busy processing external updates. 

(2)  The routing implementation employed by the MBONE was very slow
     about stopping flows to destinations which had become unreachable. 
     We observed unidirectional flows, most likely due to use of default
     routing long after the destination routes had been removed from our
     tables.  When the ANSNET routers had lost the route to the
     destination host (from the peer network), the route from the
     destination host to the source still seemed to be working.  This
     means the source host may still have been hearing MBONE routing
     updates from the destination host, even though the destination was
     receiving nothing, and this seemed to keep tunnels up far longer
     than they should have.  We are seeking more information from the
     mrouted developers on the dynamics of this.

(3)  As a result of (2) above, some ANSNET routers were supporting a
     moderate amount of card-to-system traffic as the flows to
     unreachable destinations continued.  While this was not a problem
     by itself, since the volumes were only on the order of a few hundred
     packets per second, the processing of these packets on an ENSS
     did further reduce the amount of system CPU available for the
     processing of routing packets, and this slowed down the updates of
     routing information on the ENSS interfaces.

(4)  The general routing performance degradation caused by (3) above
     occasionally left the ANSNET routers with insufficient resources to
     deal with major routing events, such as a large EGP neighbor's
     routing session being broken, in a speedy enough fashion to prevent
     timeouts on internal sessions (either IS-IS or IBGP) from kicking in
     and causing reachability of the ENSS to be lost by the external
     peers.

This caused a few connectivity problems at various places on the
ANSNET, but was by far the worst at ENSS133 (Ithaca).  One reason that
the degradation at ENSS133 was worse than at other places was that
Cornell was on the sending end of a fair number of MBONE
tunnels, which meant the card-to-system traffic for unreachable
destinations tended to be higher on the ENSS133 router than elsewhere. 
The events which precipitated each and every routing failure on ENSS133
were either the withdrawal of most or all of PSI's routes by its router or
the timeout of the ENSS133 EGP session with the PSI router.  Fully
withdrawing these routes required the ENSS to withdraw the 600-700
networks PSI advertises to ANSNET from all 78 ANSNET backbone
routers, and shortly thereafter, add them back to all 78 backbone routers. 
The same nets were also flapping on the T1 NSFNET, adding slightly to
the load on ENSS133.  Due to inefficiencies in redistributing EGP via
IBGP (lack of aggregation of routes resulting in one system call per route
per IBGP peer), ENSS133 had trouble processing back to back changes,
causing an outage until both routers stabilized.  There were also a couple
of failures at ENSS144 during the week of 11/16 which seem to have
been prompted by the EGP sessions with two ENSS144 FIX-West peers
(the MILNET peer was one) flapping at about the same time. 
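
The scale of this load can be estimated with some rough arithmetic.  The
sketch below is illustrative only.  It assumes exactly one system call per
route per IBGP peer, as described above, treats the 78 backbone routers as
the peer count, and compares this with a hypothetical batched scheme that
issues one call per peer.  The function names are invented for this
example and are not taken from the ENSS routing software.

```python
def per_route_calls(routes: int, peers: int) -> int:
    """One system call per route per IBGP peer (the unaggregated case)."""
    return routes * peers


def batched_calls(peers: int) -> int:
    """Hypothetical aggregated case: one update call per IBGP peer."""
    return peers


PEERS = 78  # backbone routers, taken from the figures in the text

for routes in (600, 700):
    # A flap is a withdrawal followed shortly by a re-advertisement,
    # so each pass over the routes happens twice.
    unaggregated = 2 * per_route_calls(routes, PEERS)
    aggregated = 2 * batched_calls(PEERS)
    print(f"{routes} routes: {unaggregated} calls unaggregated, "
          f"{aggregated} calls batched")
```

Under these assumptions a single flap of the PSI routes costs on the order
of 100,000 calls without aggregation, which gives a sense of why back to
back changes were hard to absorb.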

There were several actions taken during the week of 11/16 (IETF
video/audiocast) which reduced the severity of this problem, including:

(a)  ICMP unreachable messages were turned off on the external
     interfaces of ENSS routers that experienced problems.  These
     messages were not being processed directly on the external
     ENSS interfaces, which resulted in some inefficiency.

(b)  SprintLink rerouted traffic (and the MBONE tunnel) from the IETF to
     Cornell, moving it from the CIX (via an internal PSInet path) to the
     T3 ANSNET path.  This improved stability within PSInet and within
     ANSNET.

(c)  Cornell rerouted traffic (MBONE tunnel) to SDSC from the PSInet
     path to the T3 ANSNET path.

(d)  One of the two parallel IETF audio/video channels was disabled.

(e)  A default route was established on ENSS133 pointing to its adjacent
     internal router (CNSS49).  This ensured that card<->system traffic
     being processed due to unreachable destinations was moved to the
     CNSS router which was not involved in processing EGP updates.

(f)  A new version of the routing software was installed on the four
     ENSS nodes that experienced route flapping to aggregate EGP
     updates from external peers before sending IBGP messages to other
     internal T3 routers.

The combination of all of these actions stabilized ENSS133 and the other
ENSS routers that experienced instabilities. 
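
The effect of the default route in (e) can be illustrated with a small
longest-prefix-match lookup.  This is a minimal sketch, not the ENSS
forwarding code: the prefix 192.0.2.0/24 is a placeholder documentation
network, "external-peer" is an invented next-hop name, and only CNSS49
comes from the text above.

```python
import ipaddress


def lookup(table, dst):
    """Longest-prefix match: return the next hop for dst, or None."""
    addr = ipaddress.ip_address(dst)
    best_net, best_hop = None, None
    for prefix, next_hop in table.items():
        net = ipaddress.ip_network(prefix)
        if addr in net and (best_net is None
                            or net.prefixlen > best_net.prefixlen):
            best_net, best_hop = net, next_hop
    return best_hop


# Placeholder prefix and peer name, not real 1992 routes.
table = {"0.0.0.0/0": "CNSS49", "192.0.2.0/24": "external-peer"}
print(lookup(table, "192.0.2.10"))  # specific route wins while it exists

del table["192.0.2.0/24"]           # the peer withdraws the network
print(lookup(table, "192.0.2.10"))  # traffic now follows the default
```

Without the default entry the second lookup would find no route at all,
which is the case that left packets to be handled on the ENSS itself.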

There are several actions which we have already taken, or will soon
take, to avoid ANSNET border router instabilities during future MBONE
multicast events:

(1)  The ENSS EGP software has been enhanced to support improved
     aggregation of updates from external peers into IBGP update
     messages.  The ENSS will now aggregate EGP derived routes
     together into a single update before flooding this to other routers
     across the backbone via IBGP.  This improves the efficiency of the
     ENSS dramatically.

(2)  A change to the ANSNET router interface microcode has been
     implemented (and will be deployed during the next week) so that
     problems resulting from large amounts of ENSS card-system traffic
     will be eliminated when destinations become unreachable.  Even if
     mrouted keeps sending traffic, this will be dropped on the incoming
     ENSS interface.

(3)  The T1 NSFNET backbone was disconnected on 12/2.  The T1
     network (particularly the interconnect points with the T3 system) was
     a major source of route flapping, and eliminating it should provide
     an additional margin for handling instability from other peer networks.
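
The intent of the microcode change in (2) can be sketched as a simple
per-packet decision.  This is only a model under stated assumptions: the
names handle, drop_on_card and punt_to_system are invented here, the
prefixes are placeholder documentation networks, and the packet counts
are arbitrary rather than measured values.

```python
import ipaddress
from collections import Counter


def handle(routes, dst, drop_on_card):
    """Classify one packet arriving on an interface card."""
    addr = ipaddress.ip_address(dst)
    for prefix in routes:
        if addr in ipaddress.ip_network(prefix):
            return "forward"
    # No route: either discard on the card or hand to the system CPU.
    return "drop_on_card" if drop_on_card else "punt_to_system"


routes = ["198.51.100.0/24"]  # placeholder reachable network
# Five packets to a reachable host, 300 to an unreachable one.
packets = ["198.51.100.7"] * 5 + ["203.0.113.9"] * 300

before = Counter(handle(routes, d, drop_on_card=False) for d in packets)
after = Counter(handle(routes, d, drop_on_card=True) for d in packets)
print("to system CPU before:", before["punt_to_system"])
print("to system CPU after: ", after["punt_to_system"])
```

In this model the forwarded traffic is unchanged, and only the packets for
unreachable destinations stop reaching the system CPU.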

While the changes we are making to the T3 network will significantly
improve T3 network performance in dealing with external EGP peer
flapping, and related MBONE routing problems, our changes will *NOT*
improve the problems that other Internet networks may experience when
processing source route packets, and handling routing transitions with
MBONE tunnels.

We recommend that each service provider develop its own internal
routing plan to address this.  We continue to recommend the migration to
use of BGP at all border gateways, and we recommend that MBONE
software be upgraded to support IP encapsulation to avoid the problems
with routers that do not process loose source route optioned packets
efficiently.  We also are recommending that the MBONE developers
explore optimizing the mrouted software to avoid the sustained
unidirectional flows to unreachable destinations that we observed.
Finally, it is recommended that an mrouted machine be maintained on the
ENSS DMZ of each participating regional, and that this node be used as a
hierarchical distribution point to locations in the local campus and regional.
Backhauling of traffic across campuses and regionals should be
discouraged.

There is another MBONE packet video/audiocast scheduled to coincide
with the Concert Packet Video conference on 12/10.  We would like to
test the setup of the proposed tunnel topology with participating service
providers prior to this event to ensure stable operation.  We would
suggest an off-hours maintenance window with interested service providers
to test the stability of the MBONE prior to 12/10.  We are open to
suggestions on the timeframe for this.  Tuesday evening 12/8 might be a
good time for this.

Jordan Becker
