From: James Bensley <jwbensley@gmail.com> Sent: Thursday, June 27, 2019 9:56 AM
One thing I have experienced is that when there is an outage on a large PE, even one that still has spare capacity, the business impact can be too much to handle (the support desk is overwhelmed; customers become irate if you can't quickly tell them which services are impacted and when service will be restored; the NMS has so many alarms that it's not clear what the problem is or where it's coming from; etc.).
I see what you mean. My hope is to address these challenges with a "single source of truth" provisioning system that will hold, among other things, the HW-to-customer/service mapping, so the Ops team will be able to say that if a particular LC X fails then customers/services X, Y, Z will be affected. But yes, I agree that with smaller PEs the fallout from any failure is reduced proportionally.
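To make the HW-to-customer/service mapping idea a bit more concrete, here is a rough Python sketch of the kind of lookup I have in mind. It is only an illustration: the inventory records, router and line-card names, and the build_index/impacted helpers are all made up for the example and do not come from any real provisioning system.

```python
from collections import defaultdict

# Hypothetical inventory records: (router, line_card, customer, service).
# In practice these would come from the provisioning system.
INVENTORY = [
    ("pe1", "LC0", "cust-a", "l3vpn-100"),
    ("pe1", "LC0", "cust-b", "l2vpn-200"),
    ("pe1", "LC1", "cust-c", "inet-300"),
    ("pe2", "LC0", "cust-a", "l3vpn-100"),
]


def build_index(records):
    """Index (customer, service) pairs by the (router, line_card) they sit on."""
    index = defaultdict(set)
    for router, line_card, customer, service in records:
        index[(router, line_card)].add((customer, service))
    return index


def impacted(index, router, line_card=None):
    """Return the sorted (customer, service) pairs hit by a failure.

    line_card=None is treated as a failure of the whole router.
    """
    hits = set()
    for (r, lc), services in index.items():
        if r == router and (line_card is None or lc == line_card):
            hits |= services
    return sorted(hits)


if __name__ == "__main__":
    idx = build_index(INVENTORY)
    print(impacted(idx, "pe1", "LC0"))  # services on one failed line card
    print(impacted(idx, "pe1"))         # services on the whole PE
```

The point of keying on (router, line_card) is that the same query answers both "LC X failed" and "the whole PE failed", which is the question the support desk needs answered quickly.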
This doesn’t mean there isn’t a place for large routers. For example, in a typical network, by the time we get to the P nodes layer in the core we tend to have high levels of redundancy, i.e. any PE is dual-homed to two or more P nodes and will have 100% redundant capacity.
Exactly. While the service edge topology might be dynamic as a result of horizontal scaling, the core on the other hand should, I'd say, be fairly static and scaled vertically. That is, I wouldn't want to scale core routers horizontally and as a result have the core topology change with every P scale-out iteration at any POP; that would be bad news for capacity planning and traffic engineering...
I've tried to write up some of my experiences here (https://null.53bits.co.uk/index.php?page=few-larger-routers-vs.-many-smaller-routers). The tl;dr version though is that there's rarely a technical restriction on having fewer, larger routers; it's an operational/business impact problem.
I'll give it a read, cheers. adam