Yeah, could have been one of those gone-from-bad-to-worse things like Dave mentioned... the initial problem and the chosen course of action perhaps led to a worse problem.

I’ve had DWDM issues that have taken down multiple locations far apart from each other because of how the transport guys hauled stuff.

A few years back I had about 15 routers all reboot suddenly... they were all far apart from each other. It turned out that one of the dual BGP sessions to the RR cluster flapped, and all 15 routers crash-rebooted.
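(For reference, the redundancy in question was roughly the usual dual-iBGP-to-RR-cluster setup per edge router. A hypothetical IOS-style sketch, with the AS number and addresses invented, not the actual config:)

```
router bgp 64500
 ! dual iBGP sessions to the route-reflector cluster --
 ! losing either one should be a non-event
 neighbor 192.0.2.1 remote-as 64500          ! RR cluster member 1
 neighbor 192.0.2.2 remote-as 64500          ! RR cluster member 2
 neighbor 192.0.2.1 update-source Loopback0
 neighbor 192.0.2.2 update-source Loopback0
```

The point being that a flap of one of those two sessions is supposed to be harmless; here it triggered a crash-reboot instead.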

But ~50 hours of downtime!?

Aaron

On Dec 31, 2018, at 11:41 AM, Dave Temkin <dave@temk.in> wrote:

On Mon, Dec 31, 2018 at 11:33 AM Naslund, Steve <SNaslund@medline.com> wrote:

They shouldn’t need OOB to operate existing lambdas, just to configure new ones.  One possibility is that the management interface also handles master timing, which would be a really bad idea but possible (it should be redundant, and it should be able to free-run for a reasonable amount of time).  The main issue exposed is that the management interface is obviously critical and is not redundant enough.  That is, if we believe the OOB explanation in the first place (which, by the way, is obviously not OOB, since it wiped out the in-band network when it failed).

Steven Naslund

Chicago IL

A theory, and only a theory: in order to troubleshoot a much smaller problem (OOB/etc.), they deployed an optical configuration change that, when faced with inaccessibility to multiple nodes, ended up causing a significant inconsistency in their optical network, wreaking havoc on all sorts of other systems. With the OOB network already in chaos, card reseats were required to stabilize that network, and only then could they rebuild the optical network from a fully reachable state.

Again, only a theory.

-Dave

>This seems entirely plausible, given that DWDM amplifiers and lasers, being a complex analog system, need OOB to align.

>--

>Eric