Re: RS/960 upgrade ... status report
From: William Manning <bmanning@is.rice.edu>
Subject: Re: RS/960 upgrade ... status report
Date: Thu, 30 Apr 92 23:07:02 CDT

Mark said:

observed during RS/960 testing. The problem involved some aberrant behavior of rcp_routed when a misconfigured regional peer router would advertise a route via BGP to an ENSS whose next hop was the ENSS itself. The old rcp_routed could go into a loop sending multiple redirect packets out onto the subnet. The new rcp_routed will close the BGP session if it receives an announcement of such a route. The new rcp_routed software also has support for externally administered inter-AS metrics, an auto-restart capability, and bug fixes for BGP overruns with peer routers.

This deployment caused a few problems. One is that this new feature of rcp_routed pointed out a misconfigured peer router at Rice University in Houston. This caused the BGP connection to open and close rapidly, which caused further problems on the peer router. Eventually the peer was reconfigured to remove the bad route, which fixed the problem. Another problem was on the Argonne ENSS. This node crashed in such a way that it was

----------------------------------

What he did not say:

The new rcp_routed will, (virtually) immediately after the close, reopen a BGP session and pump all known routes at the BGP peer. If the peer is already working on processing the old ones, this adds an unneeded burden on the regional peer. I have not reviewed the spec in detail, but there should be something in place to prevent a constant cycle of close, open, slam 5k routes, close, open, slam 5k routes, close... well, you get the picture. This points out a configuration problem with BGP in the ANS T3 routers, in addition to the less than optimal configuration that we had in our ciscos.

Just my two cents from the other side of the fence.
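To make the "close, open, slam 5k routes" cycle concrete, here is a minimal sketch of one way a router could space out its reconnect attempts: wait longer after each failed session, up to a ceiling. This is not how rcp_routed works, and it is not taken from the BGP spec. The class name ReconnectBackoff and the default timings are assumptions chosen only for illustration.

```python
class ReconnectBackoff:
    """Hypothetical helper: grows the wait between BGP session reopen attempts.

    All names and default values here are illustrative assumptions,
    not rcp_routed behavior.
    """

    def __init__(self, base=5.0, factor=2.0, cap=600.0):
        self.base = base        # first wait, in seconds
        self.factor = factor    # multiplier applied after every failure
        self.cap = cap          # longest wait we will ever return
        self._next = min(cap, base)

    def next_delay(self):
        """Return the wait before the next reopen, then lengthen the one after."""
        delay = self._next
        self._next = min(self.cap, self._next * self.factor)
        return delay

    def reset(self):
        """Call once a session has stayed up, so the next failure starts small."""
        self._next = min(self.cap, self.base)


if __name__ == "__main__":
    backoff = ReconnectBackoff(base=5.0, factor=2.0, cap=60.0)
    for attempt in range(1, 7):
        print(f"attempt {attempt}: wait {backoff.next_delay():.0f}s before reopening")
```

With a scheme like this, a peer that keeps triggering a close gets progressively more time to finish processing the previous batch of routes before the next full table arrives.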
Bill,

Given that we haven't heard of anyone else's peer router dying a horrible death since the new rcp_routed went in, I assume that the "less than optimal configuration" isn't a *common* mistake. But, for the benefit of those of us who are soon to run BGP (and who seem to have a knack for encountering uncommon mistakes :-), can you fill us in on what the actual config problem was? Were you redistributing your EGP-learned-routes back to BGP or something?

Mark,

Wouldn't it make more sense for the ENSS to just ignore the offending route rather than close the BGP session, especially given the lack of a delay before the session gets reestablished?

Dan
Bill,

I agree, and we have debated both whether our recovery from this is correct according to the spec, and (more importantly in my view) whether the spec is reasonable in this case. I believe the BGP RFC says that it is ok to close the connection, but I would prefer that it just log this problem and pretend that it did not hear this bogus route. That way the NOC and site people can take a look at the problem after the fact when it is noticed that packets are not getting to or from a particular net, but they will not be forced to look at the problem right away with the whole AS being disconnected.

We are working on this and will eventually have some modifications to rcp_routed to deploy.

Mark
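The "log it and pretend we did not hear it" behavior can be sketched roughly as follows. This is only an illustration of the idea, not the planned rcp_routed change. The Route type, the accept_routes function, and the example addresses (taken from documentation ranges) are all assumptions.

```python
import logging
from dataclasses import dataclass

log = logging.getLogger("bgp_sketch")  # hypothetical logger name


@dataclass(frozen=True)
class Route:
    """Hypothetical, stripped-down view of one announced route."""
    prefix: str
    next_hop: str


def accept_routes(announced, local_addresses):
    """Keep routes whose next hop is not one of our own addresses.

    Offending routes are logged and dropped; the session stays up.
    """
    accepted = []
    for route in announced:
        if route.next_hop in local_addresses:
            # The peer is pointing us back at ourselves: record it for the
            # NOC to look at later, and otherwise act as if it never arrived.
            log.warning("ignoring %s: next hop %s is this router",
                        route.prefix, route.next_hop)
            continue
        accepted.append(route)
    return accepted


if __name__ == "__main__":
    logging.basicConfig(level=logging.WARNING)
    announced = [
        Route("192.0.2.0/24", "198.51.100.1"),
        Route("203.0.113.0/24", "198.51.100.2"),  # next hop is "us"
    ]
    kept = accept_routes(announced, local_addresses={"198.51.100.2"})
    print([r.prefix for r in kept])
```

The trade-off is the one described above: only the net behind the bad route loses reachability, and the rest of the AS stays connected while someone reads the log.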
From: bmanning@is.rice.edu (William Manning)
To: regional-techs@merit.edu
CC: mak@merit.edu
Mark said: observed during RS/960 testing. The problem involved some aberrant behavior of rcp_routed when a misconfigured regional peer router would advertise a route via BGP to an ENSS whose next hop was the ENSS itself. The old rcp_routed could go into a loop sending multiple redirect packets out on to the subnet. The new rcp_routed will close the BGP session if it receives an announcement of such a route. The new rcp_routed software also has support for externally administered inter-AS metrics, an auto-restart capability, and bug fixes for BGP overruns with peer routers.
This deployment caused a few problems. One is that this new feature of rcp_routed pointed out a misconfigured peer router at Rice University in Houston. This caused the BGP connection to open and close rapidly which caused further problems on the peer router. Eventually the peer was reconfigured to remove the bad route, which fixed the problem. Another problem was on the Argonne ENSS. This node crashed in such a way that it was
----------------------------------
What he did not say:
The new rcp_routed will, (virtually) immediately after the close, reopen a BGP session and pump all known routes at the BGP peer. If the peer is already working on processing the old ones, this adds an unneeded burden on the regional peer. I have not reviewed the spec in detail, but there should be something in place to prevent a constant cycle of close, open, slam 5k routes, close, open, slam 5k routes, close... well, you get the picture. This points out a configuration problem with BGP in the ANS T3 routers, in addition to the less than optimal configuration that we had in our ciscos.
Just my two cents from the other side of the fence.

--
Regards,
Bill Manning      bmanning@rice.edu    PO Box 1892
713-285-5415      713-527-6099         Houston, Texas
R.U. (o-kome)                          77251-1892