2008.02.20 NANOG 42 Graceful Restart, NSF and NSR
Last set of three notes...was going to send them from the ballroom, but they started tearing down as soon as the closing finished, even though it was 30 minutes early. ^_^;;

Matt

2008.02.20 Graceful restart, non-stop routing, stateful switchover, and non-stop forwarding.
Ken Weissner, kweissne@cisco.com
systems and technology architect

Introduction to HA technologies

Lots of moving parts, so we start with some definitions:

HA -- High Availability
  general term
SSO -- Stateful Switchover
  two processors/processes; state is transferred from the first processor so the second can pick up where the first left off.
NSF -- NonStop Forwarding
  forwarding is the key portion; router continues moving packets as the control plane recovers/restarts.
GR -- Graceful Restart
  IETF-specified mechanism; allows peers to give a restarting peer time to come back before its routes are flushed; both sides have to agree on it first.
NSR -- NonStop Routing
  maintain the session completely without losing state as it shifts from one processor to the other, without having to alert the peer via graceful restart.

Those four all work together to allow an unplanned switchover to occur with minimal interruption of service (not quite "none").

ISSU -- In Service Software Upgrade
  use the above items to allow upgrading of software while packets continue to move.

Why do people care about SSO/NSF?
Concerns about single points of failure; a customer aggregation or customer connect point which would otherwise impact many, many customers when issues arise.
They can also do it on non-distributed platforms; very little packet loss on them.

Diagram showing what impacts happen with adjacency failures prior to SSO/NSF.

Another slide showing nonstop forwarding due to graceful restart mechanisms.
Supported for LDP, OSPF, BGP, ISIS.
Graceful restart for EIGRP, and two different draft mechanisms for OSPF.

Two modes: "aware" and "capable". If you're a device that is "capable", your peers need to be "aware" that you are capable of doing it.
The capable device tells its "aware" peers that it will come back within a timeout interval. "Aware" and "helper" are generally synonymous.

Configuration for capability needs to be turned on; it is not on by default. Awareness is turned on by default in Cisco code.

TCP-based protocols have to have it configured on both sides for graceful restart to be able to function.

Graceful restart concerns voiced at NANOG 40:
If I'm an aware peer, how do I tell if my neighbor really went away, and I should reroute quickly, or if it's going to be back shortly, and I should continue to move packets towards it?
Need to know if NSF is active or not; but NSF isn't something you configure.

So, graceful restart concerns addressed:

For BGP, there's a restart timer which limits how long a helper waits for the peer to come back, which limits the amount of blackhole time; default is 120 seconds, but it can be set shorter to limit the duration of blackhole events.
Other conditions also apply; if the link is POS point to point, linkdown will abort GR, tear down the session, and flush routes.
Once an open message with the restart bit set comes across, routes are put into the stale bucket; still used, but the stale counter begins to count down until they get flushed.
Also, added "end of RIB" update message; at that point, the helper can build a new table based on updates, and can act upon changes in the stale table: clear truly stale entries, stop the stale timer, and the process is done.

For OSPF, two ways to handle it: RFC3623 vs draft-nguyen-ospf-restart-06.
Slight differences; does a new route update cause an abort out of graceful restart or not?

Cisco supports nonstop routing for ISIS and BGP in IOS; in-box solution, no other communication needed.
BGP is configured on a per-neighbor basis; ISIS is box-wide.

Hybrid BGP NSR: run GR on route reflectors.

Per peer GR/NSR config for BGP:
Currently BGP GR is globally enabled for all peers in the routing process.
Where BGP NSR is available, it is configured on a per peer basis.
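
A rough sketch of the helper-side ("aware") BGP behaviour from the notes above: restart timer, stale bucket, end of RIB. This is only an illustration in Python, not Cisco's implementation. The class, its method names, and the stale_time value are made up; only the 120-second restart default comes from the talk.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional, Set


@dataclass
class HelperPeer:
    """Hypothetical helper-side view of one BGP neighbor."""
    restart_time: float = 120.0   # restart timer default quoted in the talk
    stale_time: float = 300.0     # assumed stale countdown, not from the talk
    routes: Dict[str, str] = field(default_factory=dict)  # prefix -> next hop
    stale: Set[str] = field(default_factory=set)
    phase: str = "up"             # "up", "waiting" or "resync"
    deadline: Optional[float] = None

    def session_lost(self, now: float, link_down: bool = False) -> None:
        # Link-down (e.g. POS point to point) aborts GR and flushes routes.
        if link_down:
            self._flush_all()
            return
        # Otherwise keep forwarding and wait for the peer to come back.
        self.phase = "waiting"
        self.deadline = now + self.restart_time

    def open_received(self, now: float, restart_bit: bool) -> None:
        if not restart_bit:
            self._flush_all()
            return
        # Routes go into the stale bucket: still used, but on a countdown.
        self.stale = set(self.routes)
        self.phase = "resync"
        self.deadline = now + self.stale_time

    def update(self, prefix: str, next_hop: str) -> None:
        # A fresh update takes the prefix back out of the stale bucket.
        self.routes[prefix] = next_hop
        self.stale.discard(prefix)

    def end_of_rib(self) -> None:
        # Clear truly stale entries, stop the stale timer, done.
        self._flush_stale()

    def tick(self, now: float) -> None:
        if self.deadline is None or now < self.deadline:
            return
        if self.phase == "waiting":
            self._flush_all()      # peer never came back
        else:
            self._flush_stale()    # peer came back but never finished

    def _flush_stale(self) -> None:
        for prefix in self.stale:
            self.routes.pop(prefix, None)
        self.stale.clear()
        self.phase = "up"
        self.deadline = None

    def _flush_all(self) -> None:
        self.routes.clear()
        self.stale.clear()
        self.phase = "up"
        self.deadline = None
```

Setting restart_time lower in this model shortens the window in which packets can be blackholed towards a peer that is not coming back, which is the trade-off described above.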
GR gets used over NSR if the peer supports GR.

Need a mechanism to protect interfaces/L2 state and forwarding; protect both data plane and control plane.

Features work together to provide protection and redundancy; should really use them all together as designed. E.g., enabling SSO also enables NSF (FIB checkpointing).

Each routing protocol on the box needs to be aware and configured to get the full benefit of NSR.

Routing protocol timers are often cranked down for faster detection of failures. When a switchover happens, it takes a little while for the new processor to catch up; if the dead timer is set too low, graceful restart may not kick in before the sessions are torn down.
First packet can be pushed in 10 seconds for BGP; but setting timers down to 1/5 or anything below 10 seconds can result in oscillation or other problematic interactions.
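
The timer arithmetic above can be checked with a few lines of Python. This is a sketch under assumptions: "1/5" is read here as keepalive 1 second / hold 5 seconds, the 10-second figure is the one quoted in the talk, and the function name and sample timer pairs are invented for illustration.

```python
# From the notes: the new processor can push its first BGP packet
# roughly 10 seconds after a switchover.
FIRST_PACKET_S = 10.0


def survives_switchover(hold_s: float, first_packet_s: float = FIRST_PACKET_S) -> bool:
    """Return True if the peer's hold (dead) timer outlasts the recovery gap."""
    return hold_s > first_packet_s


# (keepalive, hold) pairs in seconds; the values are illustrative only.
for keepalive_s, hold_s in [(1, 5), (3, 9), (10, 30), (60, 180)]:
    verdict = "ok" if survives_switchover(hold_s) else "session torn down first"
    print(f"keepalive={keepalive_s}s hold={hold_s}s -> {verdict}")
```

Any hold timer at or below the recovery gap means the peer declares the session dead before the new processor speaks, so graceful restart never gets a chance to help.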
participants (1)
-
Matthew Petach