Hi James,
From: James Bensley <jwbensley+nanog@gmail.com> Sent: Thursday, June 27, 2019 1:48 PM
On Thu, 27 Jun 2019 at 12:46, <adamv0025@netconsultings.com> wrote:
From: James Bensley <jwbensley@gmail.com> Sent: Thursday, June 27, 2019 9:56 AM
One experience I have made is that when there is an outage on a large PE, even when it still has spare capacity, is that the business impact can be too much to handle (the support desk is overwhelmed, customers become irate if you can't quickly tell them what all the impacted services are, when service will be restored, the NMS has so many alarms it's not clear what the problem is or where it's coming from etc.).
I see what you mean, my hope is to address these challenges by having a "single source of truth" provisioning system that will have, among other things, also HW-customer/service mapping -so Ops team will be able to say that if particular LC X fails then customers/services X,Y,Z will be affected.
But yes I agree with smaller PEs any failure fallout is minimized proportionally.
Hi Adam,
My experience is that it is much more complex than that (although it also depends on what sort of service you're offering). One can't easily model the inter-dependency between multiple physical assets like links, interfaces, line cards, racks, DCs etc. and logical services such as VRFs/L3VPNs, cloud hosted proxies and the P&T edge.
Consider this, in my opinion, relatively simple example: Three PEs in a triangle. Customer is dual-homed to PE1 and PE2 and their link to PE1 is their primary/active link. Transit is dual-homed to PE2 and PE3 and your hosted filtering service cluster is also dual-homed to PE2 and PE3 to be near the Internet connectivity.
I agree the scenario you proposed is perfectly valid: it seems simple but might contain a high degree of complexity in terms of traffic patterns.

Thinking about this, I'd propose to separate the problem into two parts.

The simpler one to solve is the physical resource allocation part of the problem. This is where the hierarchical record of physical assets could give us the right answers to what happens if this card fails (example of hierarchy: POP->PE->LineCard->PhysicalPort(s)->PhysicalPort(s)->Aggregation-SW->PhysicalPort(s)->Customer/Service).

The other part of the problem is much harder and has two subparts:
- The first subpart is to model interactions between a number of protocols to accurately predict traffic patterns under various failure conditions. (I'd argue that this to some extent should be part of the design documentation and well understood and tested during POC testing for a new design -although entropy...)
- And now the tricky subpart is to be able to map individual customer->service/service->customer traffic flows onto the first subpart. (I didn't give this subpart much thought, so can't possibly comment.)

adam
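
P.S. To make the "simpler" physical part concrete, here is a minimal sketch of such a hierarchical record in Python. It is only an illustration under stated assumptions: the `Asset` class, the `impact` helper and all asset names (POP1, PE1-LC0, AGG1, cust-A etc.) are hypothetical and not taken from any real provisioning system. It simply walks the POP->PE->LineCard->...->Customer/Service tree downwards from a failed element.

```python
from __future__ import annotations
from dataclasses import dataclass, field


@dataclass
class Asset:
    """One node in the asset hierarchy (names and kinds are illustrative)."""
    name: str
    kind: str  # e.g. "POP", "PE", "LineCard", "PhysicalPort", "Customer/Service"
    children: list[Asset] = field(default_factory=list)

    def add(self, name: str, kind: str) -> Asset:
        """Attach a new child asset and return it so calls can be chained."""
        child = Asset(name, kind)
        self.children.append(child)
        return child


def find(root: Asset, name: str) -> Asset | None:
    """Depth-first search for an asset by name."""
    if root.name == name:
        return root
    for child in root.children:
        hit = find(child, name)
        if hit is not None:
            return hit
    return None


def affected_services(failed: Asset) -> set[str]:
    """Collect every Customer/Service leaf that hangs below the failed asset."""
    if failed.kind == "Customer/Service":
        return {failed.name}
    result: set[str] = set()
    for child in failed.children:
        result |= affected_services(child)
    return result


def impact(root: Asset, name: str) -> set[str]:
    """Answer 'what is touched if asset <name> fails?'."""
    node = find(root, name)
    if node is None:
        raise KeyError(f"unknown asset: {name}")
    return affected_services(node)


# Hypothetical inventory following the hierarchy described above.
pop = Asset("POP1", "POP")
pe1 = pop.add("PE1", "PE")
lc0 = pe1.add("PE1-LC0", "LineCard")
lc1 = pe1.add("PE1-LC1", "LineCard")
agg1 = (lc0.add("PE1-LC0-P0", "PhysicalPort")
           .add("AGG1-UP0", "PhysicalPort")
           .add("AGG1", "Aggregation-SW"))
agg1.add("AGG1-P1", "PhysicalPort").add("cust-A", "Customer/Service")
agg1.add("AGG1-P2", "PhysicalPort").add("cust-B", "Customer/Service")
lc1.add("PE1-LC1-P0", "PhysicalPort").add("cust-C", "Customer/Service")

print(sorted(impact(pop, "PE1-LC0")))
```

Note that this only answers which customers/services are attached below a failed element. A dual-homed customer would show up under two branches, and whether they are actually down is the harder traffic-pattern part of the problem, which a plain tree like this does not capture.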