>What are the hours of the config window on Tues and Fri mornings?
>Do you have a rough idea of the typical outage duration (i.e. do you upgrade
>the boxes serially so a 5 minute restart occurs on each node one after the
>other or do you hit the big red button and they all go at once or what?). Do
>you usually start right at the beginning of the window or if nothing tricky is
>going on do you wait till near the end?
>
>Thanks,
>Dan

Dan,

I'm glad you asked. The current routing config install windows run
from 05:00 to 08:00 EST on Tue and Fri mornings. Starting this week, we
will add a third window on Thu morning, also 05:00 to 08:00 EST. We
need the extra window for the next couple of months in order to
complete the remaining T3 AS cutovers.
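
For anyone who wants to check a timestamp against these windows, here is a
minimal Python sketch. The names (in_config_window, WINDOW_DAYS, and so on)
are made up for illustration, EST is treated as a fixed UTC-5 offset, and
the 08:00 end of the window is assumed to be exclusive.

```python
from datetime import datetime, time, timedelta, timezone

# Assumption: EST as a fixed UTC-5 offset, no daylight saving handling.
EST = timezone(timedelta(hours=-5), "EST")

# datetime.weekday() numbers Monday as 0, so Tue = 1, Thu = 3, Fri = 4.
WINDOW_DAYS = {1, 3, 4}
WINDOW_START = time(5, 0)
WINDOW_END = time(8, 0)   # assumed exclusive


def in_config_window(moment):
    """Return True if a timezone-aware datetime falls in a config window."""
    local = moment.astimezone(EST)
    on_window_day = local.weekday() in WINDOW_DAYS
    return on_window_day and WINDOW_START <= local.time() < WINDOW_END
```

Pass in an aware datetime in any zone; the function converts it to EST
before comparing.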
We need to send one of two signals to the routing daemon in order to
install the new configuration file: kill -USR1 or kill -USR2. A kill
-USR1 just re-reads the new configuration file into the rcp_routed.
A kill -USR2 is a little more severe and actually briefly stops and
restarts the rcp_routed. The EGP/BGP sessions go down for a short time,
connectivity is lost, and it usually takes about 5 minutes for all
backbone routing to re-converge.
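
To make the difference between the two signals concrete, here is a toy
Python sketch of a daemon that treats SIGUSR1 as "re-read the config" and
SIGUSR2 as "stop, re-read, restart". This is not the rcp_routed code; the
class, its fields, and install_handlers are hypothetical and only mirror
the behaviour described above.

```python
import signal


class ToyRoutingDaemon:
    """Stand-in for a routing daemon. Hypothetical, not the real rcp_routed."""

    def __init__(self, read_config):
        self._read_config = read_config   # callable returning the config text
        self.config = read_config()
        self.sessions_up = True
        self.session_resets = 0

    def on_usr1(self, signum, frame):
        # Soft install: re-read the configuration, leave peer sessions alone.
        self.config = self._read_config()

    def on_usr2(self, signum, frame):
        # Hard install: stop, re-read the configuration, start again.
        # Peer sessions drop and have to be re-established afterwards.
        self.sessions_up = False
        self.session_resets += 1
        self.config = self._read_config()
        self.sessions_up = True


def install_handlers(daemon):
    # signal.signal() has to be called from the main thread of the process.
    signal.signal(signal.SIGUSR1, daemon.on_usr1)
    signal.signal(signal.SIGUSR2, daemon.on_usr2)
```

From the operator side, os.kill(pid, signal.SIGUSR1) in Python does the
same job as kill -USR1 pid at the shell.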
These kills get done on the RCPs on the T1 backbone, and on the CNSS, DNSS,
or ENSS boxes on the T3 backbone. The severity of the kill and whether a
kill needs to be done at all depends on the type of change/addition made
in the new configuration file for that node. Most sites running EGP will
only need a kill -USR1, while most sites using BGP need a kill -USR2, or
an equivalent (stopping and restarting the BGP session), to install the
new changes. Adding or deleting a peer always takes a kill -USR2 to the
node. Adding a new ENSS to the T3 backbone requires that a kill -USR2
be done on each backbone node to enable the new IBGP sessions.

A few exceptions:

On the T3 backbone, ENSS 134 (Boston) and ENSS 135 (San Diego) have
explicitly configured outbound announcements in order to give them the
T1 backbone route (129.140) without also sending them all the other routes
coming in from the T1 backbone via the interconnects. Installing any new
T3 routes always requires a kill -USR2 on E134 and E135 (usually each
config day).

The active T1/T3 interconnects in Ann Arbor and Houston also almost always
need a kill -USR2 or the equivalent with each configuration update.
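
The rules and exceptions above can be written down as a small decision
function. The Python sketch below is only a rough encoding: the Change
record and its field names are invented for illustration, and the "most
sites" and "almost always" cases are treated as if they were firm rules,
which they are not.

```python
from dataclasses import dataclass

# ENSS nodes with explicitly configured outbound announcements (per the text).
EXPLICIT_ANNOUNCE_NODES = {"E134", "E135"}


@dataclass
class Change:
    """Hypothetical description of one node's pending config change."""
    node: str                            # e.g. "E134"
    protocol: str                        # "EGP" or "BGP"
    config_changed: bool = True
    peer_added_or_deleted: bool = False
    new_enss_on_backbone: bool = False
    is_active_interconnect: bool = False
    new_t3_routes: bool = False


def required_signal(change):
    """Return "USR2", "USR1", or None as the usual choice for this node."""
    # Peer changes and a new ENSS (new IBGP sessions) always need a restart.
    if change.peer_added_or_deleted or change.new_enss_on_backbone:
        return "USR2"
    # E134/E135 need a restart whenever new T3 routes are installed.
    if change.node in EXPLICIT_ANNOUNCE_NODES and change.new_t3_routes:
        return "USR2"
    if not change.config_changed:
        return None
    # "Almost always" in the text; simplified to "always" here.
    if change.is_active_interconnect:
        return "USR2"
    # Default by protocol: "most" BGP sites need USR2, "most" EGP sites USR1.
    return "USR2" if change.protocol == "BGP" else "USR1"
```

For example, required_signal(Change("E134", "BGP", new_t3_routes=True))
returns "USR2".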
The kills usually take place in a serial fashion but can begin at varying
times during the window, depending on the workload, special problems,
and when we or the NOC can schedule the time to do the kills. Each window
is usually different, except that the interconnects are usually done
last. Sometimes we piggy-back the routing configuration installs on top
of some other backbone-wide software update that gets scheduled on the
same night, to minimize down time. Sometimes all the kills can be
accomplished in 15-20 minutes, while other times we literally need the
entire window, because we wait for routing to readjust on one node before
disrupting routing on the next node. If we are doing kill -USR2s on the
entire T3 backbone, we'll usually do all boxes at each POP at the same
time.
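
As a back-of-the-envelope illustration of the timing, the Python sketch
below orders the kills one POP at a time with the interconnect POPs last,
and estimates the elapsed time if every step waits roughly 5 minutes for
routing to re-converge. The function names and the POP-to-node mapping are
hypothetical, and real windows vary for the reasons given above.

```python
RECONVERGE_MIN = 5    # approximate re-convergence time after a kill -USR2
WINDOW_MIN = 180      # 05:00 to 08:00


def plan(pops, interconnects):
    """Order the kills: one step per POP, interconnect POPs last.

    pops maps a POP name to the list of node names at that POP;
    interconnects is the set of POP names hosting T1/T3 interconnects.
    """
    regular = [p for p in pops if p not in interconnects]
    last = [p for p in pops if p in interconnects]
    return [(p, pops[p]) for p in regular + last]


def estimated_minutes(steps, wait=RECONVERGE_MIN):
    # Serial execution: wait for routing to settle after every step.
    return len(steps) * wait


def max_serial_steps(window=WINDOW_MIN, wait=RECONVERGE_MIN):
    # Upper bound on fully serialized USR2 steps that fit in one window.
    return window // wait
```

At about 5 minutes per step, a 3 hour window has room for roughly 36 fully
serialized kill -USR2 steps, which is one reason to do all boxes at a POP
together.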
--Steve Widmayer
Merit/NSFNET