On Thu, 27 Jun 2019 at 12:46, <adamv0025@netconsultings.com> wrote:
From: James Bensley <jwbensley@gmail.com> Sent: Thursday, June 27, 2019 9:56 AM
One thing I have experienced is that when there is an outage on a large PE, even one that still has spare capacity, the business impact can be too much to handle (the support desk is overwhelmed; customers become irate if you can't quickly tell them which services are impacted and when service will be restored; the NMS has so many alarms that it's not clear what the problem is or where it's coming from; etc.).
I see what you mean. My hope is to address these challenges with a "single source of truth" provisioning system that will hold, among other things, a HW-to-customer/service mapping, so the Ops team will be able to say that if a particular LC X fails then customers/services X, Y, Z will be affected. But yes, I agree that with smaller PEs the fallout from any failure is reduced proportionally.
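To make the HW-to-service mapping idea concrete, here is a minimal sketch of such a lookup. It assumes a flat inventory of (PE, line card, service) attachments; the inventory contents and the function names `build_index` and `impacted_services` are hypothetical and do not come from any real provisioning system.

```python
from collections import defaultdict

# Hypothetical inventory: (PE, line card, service attached to that card).
# All names are illustrative placeholders.
ATTACHMENTS = [
    ("PE1", "LC0", "CustomerA-L3VPN"),
    ("PE1", "LC0", "CustomerB-Internet"),
    ("PE1", "LC1", "CustomerC-L3VPN"),
]


def build_index(attachments):
    """Index services by the (PE, line card) they are attached to."""
    index = defaultdict(set)
    for pe, lc, service in attachments:
        index[(pe, lc)].add(service)
    return index


def impacted_services(index, pe, lc):
    """Return the services directly attached to a failed line card."""
    return sorted(index.get((pe, lc), set()))


index = build_index(ATTACHMENTS)
print(impacted_services(index, "PE1", "LC0"))
# → ['CustomerA-L3VPN', 'CustomerB-Internet']
```

A lookup like this only answers "what is attached here?"; it says nothing about services that merely transit the failed hardware.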
Hi Adam,

My experience is that it is much more complex than that (although it also depends on what sort of service you're offering). One can't easily model the inter-dependency between multiple physical assets like links, interfaces, line cards, racks, DCs etc. and logical services such as VRFs/L3VPNs, cloud hosted proxies and the P&T edge.

Consider this, in my opinion, relatively simple example: Three PEs in a triangle. Customer is dual-homed to PE1 and PE2, and their link to PE1 is their primary/active link. Transit is dual-homed to PE2 and PE3, and your hosted filtering service cluster is also dual-homed to PE2 and PE3 to be near the Internet connectivity.

How will you record the inter-dependency that means an outage on PE3 impacts Customer? When that Customer sends traffic to PE1 (let's say all their operations are hosted in a public cloud provider), and PE1 has learned the shortest path to 0/0 or ::/0 from PE2, the Internet traffic is sent from PE1 to PE2, and from PE2 into your filtering cluster. When the traffic comes back into PE2 after passing through the filters, it is then sent to PE3, because the transit provider attached to PE3 has a better route to Customer's destination (AWS/Azure/GCP/whatever) than the one directly attached to PE2.

That to me is a simple scenario, and it can be mapped with a dependency tree. But in my experience, and maybe it's just me, things are usually a lot more complicated than this. The root cause is probably a bad design introducing too much complexity, which is another vote for smaller PEs from me. With more service-dedicated PEs one can reduce or remove the possibility of piling multiple services and more complexity onto the same PE(s).

Most places I've seen (managed service providers) simply can't map the complex inter-dependencies they have between physical and logical infrastructure without having some super bespoke and also complex asset management / CMDB / CI system.

Cheers,
James.
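As a rough illustration of the three-PE example above, the sketch below contrasts an attachment-based lookup with one based on the path traffic actually takes. The service name, the `filter-cluster` and `transit-via-PE3` node labels, and both function names are assumptions made for the example; in a real network the forwarding path would have to be derived from routing state rather than typed in by hand.

```python
# Where the customer physically attaches (what a simple HW-to-service
# mapping would record). Names are hypothetical.
ATTACHMENT_POINTS = {
    "Customer-Internet": {"PE1", "PE2"},
}

# The path the customer's Internet traffic takes in the example:
# PE1 -> PE2 -> filtering cluster -> PE2 -> PE3 -> transit.
FORWARDING_PATHS = {
    "Customer-Internet": [
        "PE1", "PE2", "filter-cluster", "PE2", "PE3", "transit-via-PE3",
    ],
}


def impacted_by_attachment(node):
    """Services that attach directly to the failed node."""
    return sorted(s for s, nodes in ATTACHMENT_POINTS.items() if node in nodes)


def impacted_by_path(node):
    """Services whose traffic traverses the failed node anywhere on its path."""
    return sorted(s for s, path in FORWARDING_PATHS.items() if node in path)


print(impacted_by_attachment("PE3"))  # → []
print(impacted_by_path("PE3"))        # → ['Customer-Internet']
```

The attachment view reports no impact from a PE3 outage, while the path view shows the customer is affected, which is the gap the example describes.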