Re: Normal config window?
What are the hours of the config window on Tues and Fri mornings? Do you have a rough idea of the typical outage duration? That is, do you upgrade the boxes serially, so that a 5 minute restart occurs on each node one after the other, or do you hit the big red button so they all go at once, or something else? Do you usually start right at the beginning of the window, or, if nothing tricky is going on, do you wait until near the end?
Thanks, Dan
Dan, I'm glad you asked.
The current routing config install windows are from 05:00 to 08:00 EST on Tue and Fri mornings. Starting this week, we will be adding a window on Thu morning, also from 05:00 to 08:00 EST. We need the additional window for the next couple of months in order to complete the remaining T3 AS cutovers.
To install a new configuration file, we need to send one of two signals to the routing daemon: kill -USR1 or kill -USR2. A kill -USR1 just re-reads the new configuration file into rcp_routed. A kill -USR2 is a little more severe: it briefly stops and restarts rcp_routed. The restart results in a brief period of down EGP/BGP sessions and loss of connectivity, and it usually takes about 5 minutes for all backbone routing to re-converge. These kills get done on the RCPs on the T1 backbone, and on the CNSS, DNSS, or ENSS boxes on the T3 backbone.
The severity of the kill, and whether a kill needs to be done at all, depends on the type of change or addition made in the new configuration file for that node. Most sites running EGP will only need a kill -USR1, while most sites using BGP need a kill -USR2, or an equivalent (stopping and restarting the BGP session), to install the new changes. Adding or deleting a peer always takes a kill -USR2 to the node. Adding a new ENSS to the T3 backbone requires a kill -USR2 on each backbone node to enable the new IBGP sessions.
A few exceptions: On the T3 backbone, ENSS 134 (Boston) and ENSS 135 (San Diego) have explicitly configured outbound announcements in order to give them the T1 backbone route (129.140) without also sending them all the other routes coming in from the T1 backbone via the interconnects. Installing any new T3 routes always requires a kill -USR2 on E134 and E135 (usually each config day). The active T1/T3 interconnects in Ann Arbor and Houston also almost always need a kill -USR2 or the equivalent with each configuration update.
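The rules above for choosing a signal can be summarized as a small decision function. This is only an illustrative sketch in Python, not a tool used on the backbone: the `Node` and `Change` types, their field names, and the node-name strings are all hypothetical, and the final EGP/BGP fallback encodes the "most sites" rule of thumb rather than a guarantee.

```python
from dataclasses import dataclass
from typing import Optional

# Nodes singled out above as needing USR2 whenever new T3 routes are installed.
# The name strings are an assumed spelling for this sketch.
EXPLICIT_ANNOUNCE_NODES = {"ENSS134", "ENSS135"}


@dataclass
class Change:
    """Hypothetical summary of what a new config file changes for one node."""
    config_changed: bool = True
    peer_added_or_deleted: bool = False
    new_enss_on_t3: bool = False   # a new ENSS is joining the T3 backbone
    new_t3_routes: bool = False


@dataclass
class Node:
    """Hypothetical description of a backbone node."""
    name: str
    protocol: str                  # "EGP" or "BGP"
    on_t3_backbone: bool = False
    is_interconnect: bool = False  # active T1/T3 interconnect


def required_signal(node: Node, change: Change) -> Optional[str]:
    """Return "USR1", "USR2", or None when no kill is needed."""
    if not change.config_changed:
        return None
    if change.peer_added_or_deleted:
        return "USR2"              # always needs the restart
    if change.new_enss_on_t3 and node.on_t3_backbone:
        return "USR2"              # enables the new IBGP sessions
    if change.new_t3_routes and node.name in EXPLICIT_ANNOUNCE_NODES:
        return "USR2"              # explicitly configured announcements
    if node.is_interconnect:
        return "USR2"              # "almost always", treated as always here
    # Rule of thumb only: most BGP sites need USR2, most EGP sites USR1.
    return "USR2" if node.protocol == "BGP" else "USR1"
```

For example, an EGP site with an ordinary config change maps to USR1, while the same site with a peer added or deleted maps to USR2.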
The kills usually take place in a serial fashion but can begin at varying times during the window, depending on the workload, special problems, and when we or the NOC can schedule the time to do the kills. Each window tends to be different, except that the interconnects are usually done last. Sometimes we piggy-back the routing configuration installs on top of some other backbone-wide software update that gets scheduled on the same night, to minimize downtime. Sometimes all the kills can be accomplished in 15-20 minutes, while other times we literally need the entire window, because we wait for routing to readjust on one node before disrupting routing on the next node. If we are doing kill -USR2s on the entire T3 backbone, we'll usually do all boxes at each POP at the same time.
--Steve Widmayer
Merit/NSFNET
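The serial procedure described above (signal every box at one POP, wait for routing to readjust, move to the next POP, interconnects last) can be sketched as follows. This is a hedged illustration in Python, not the actual procedure: the POP and node names are made up, and `send_signal` and `wait_for_convergence` are hypothetical callbacks standing in for however the kill is delivered and however convergence is judged.

```python
from typing import Callable, Dict, List


def plan_order(pops: Dict[str, List[str]], interconnect_pops: List[str]) -> List[str]:
    """Return POP names in rollout order, with interconnect POPs moved to the end."""
    regular = [p for p in pops if p not in interconnect_pops]
    last = [p for p in pops if p in interconnect_pops]
    return regular + last


def serial_rollout(
    pops: Dict[str, List[str]],
    interconnect_pops: List[str],
    send_signal: Callable[[str], None],
    wait_for_convergence: Callable[[str], None],
) -> List[str]:
    """Signal all boxes at one POP, wait for routing to settle, then move on."""
    log: List[str] = []
    for pop in plan_order(pops, interconnect_pops):
        for node in pops[pop]:
            send_signal(node)          # e.g. a kill -USR2 on that box
            log.append(f"signal {node}")
        wait_for_convergence(pop)      # do not disrupt the next POP until this returns
        log.append(f"converged {pop}")
    return log


if __name__ == "__main__":
    # Made-up POPs and node names, with no-op callbacks, just to show the ordering.
    example = {"AnnArbor": ["a1"], "Chicago": ["c1", "c2"]}
    for line in serial_rollout(example, ["AnnArbor"], lambda n: None, lambda p: None):
        print(line)
```

Passing the two actions in as callbacks keeps the ordering logic separate from how a kill is delivered, so the order can be checked without touching any real box.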