Re: problem at mae-west tonight?
Here's an edited copy of mail I just sent elsewhere, which I believe deserves some thought by other network operators :

Just to update you on this situation: The problem was caused by a bad NetEdge device on the link from Netcom to MAE-West. It caused a partial failure of layer 2 connectivity from July 13 08:30 PDT until July 15 16:00 PDT. That's over 2 days.

The current design and implementation of the route server suffers from two fatal flaws which, until they are fixed, will mean that I avoid using route servers at all costs:

1. The route server does not have access to the true state of layer 2 interconnectivity, and thus cannot adjust its announcements to avoid "black holes" in the layer 2 fabric, be they caused by missing ATM PVCs or bad bridging devices. Clearly we need a protocol whereby the route server can gain information from the routers about who they can and cannot see. (Others here have suggested an approach similar to the new OSPF multipoint model)

2. The route server does not have a mechanism whereby a client router can force the route server to NOT advertise information to another client, except by editing the policy information and waiting for it to reload. That time-frame is excessive from a network operations standpoint, because every minute that the route server announces a route into a black hole, customers are without service... service which can be restored simply by tearing down the session with the route server, which of course forces it to stop announcing routes to _every_ client... solving the one problem, and causing others at the same time.

-matthew kaufman matthew@scruz.net

ps. Given that I'd like to refuse to use the route servers, should I _not_ peer with them, but make my RADB policy look right, so that other people can tell what my policy is? or should I peer with them, to help their statistics gathering, but set my policy to "don't advertise anything to anyone" ?
Matthew,
Matthew Kaufman writes:
Here's an edited copy of mail I just sent elsewhere, which I believe deserves some thought by other network operators :
Just to update you on this situation: The problem was caused by a bad NetEdge device on the link from Netcom to MAE-West. It caused a partial failure of layer 2 connectivity from July 13 08:30 PDT until July 15 16:00 PDT. That's over 2 days. The current design and implementation of the route server suffers from two fatal flaws which, until they are fixed, will mean that I avoid using route servers at all costs:
1. The route server does not have access to the true state of layer 2 interconnectivity, and thus cannot adjust its announcements to avoid "black holes" in the layer 2 fabric, be they caused by missing ATM PVCs or bad bridging devices. Clearly we need a protocol whereby the route server can gain information from the routers about who they can and cannot see. (Others here have suggested an approach similar to the new OSPF multipoint model)
2. The route server does not have a mechanism whereby a client router can force the route server to NOT advertise information to another client, except by editing the policy information and waiting for it to reload. That time-frame is excessive from a network operations standpoint, because every minute that the route server announces a route into a black hole, customers are without service... service which can be restored simply by tearing down the session with the route server, which of course forces it to stop announcing routes to _every_ client... solving the one problem, and causing others at the same time.
-matthew kaufman matthew@scruz.net
ps. Given that I'd like to refuse to use the route servers, should I _not_ peer with them, but make my RADB policy look right, so that other people can tell what my policy is? or should I peer with them, to help their statistics gathering, but set my policy to "don't advertise anything to anyone" ?
We would encourage you to continue to export routes to the route servers and to register your policy in the RADB. We will explore potential solutions to the issues you raise. --Elise
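For readers following the thread, the two requests in the original message (announcements that respect reported layer 2 reachability, and a per-peer suppression that takes effect without a policy reload) can be illustrated with a rough sketch. This is not how any actual route server is implemented; the function name `select_announcements`, the `sees` reachability reports, and the `suppress` pairs are all hypothetical, and the prefixes are documentation addresses.

```python
def select_announcements(routes, sees, suppress=frozenset()):
    """Decide which prefixes a (hypothetical) route server sends to each client.

    routes:   list of (prefix, origin_client) pairs learned from clients
    sees:     dict mapping each client to the set of clients it reports
              it can reach across the layer 2 fabric (an assumed protocol)
    suppress: set of (origin_client, recipient_client) pairs that the origin
              has asked the server not to advertise across (assumed mechanism)
    """
    out = {client: [] for client in sees}
    for prefix, origin in routes:
        for recipient in sees:
            if recipient == origin:
                continue  # never reflect a route back to its origin
            # Issue 1: skip the route if either side reports it cannot see
            # the other, so traffic is not sent into a layer 2 black hole.
            if origin not in sees[recipient]:
                continue
            if recipient not in sees.get(origin, set()):
                continue
            # Issue 2: honour a per-peer suppression request immediately,
            # leaving announcements to all other clients untouched.
            if (origin, recipient) in suppress:
                continue
            out[recipient].append(prefix)
    return out


# Three clients; A and B cannot see each other (e.g. a bad bridging device),
# but both can still see C.
routes = [("192.0.2.0/24", "A"), ("198.51.100.0/24", "B"), ("203.0.113.0/24", "C")]
sees = {"A": {"C"}, "B": {"C"}, "C": {"A", "B"}}

normal = select_announcements(routes, sees)
with_suppression = select_announcements(routes, sees, suppress={("C", "A")})
```

In this sketch A and B stop receiving each other's prefixes while both keep exchanging routes with C, and the suppression of C's routes toward A leaves B unaffected, which is the behaviour that tearing down the whole route server session cannot provide.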
participants (2): epg@merit.edu, matthew@scruz.net