Re: few big monolithic PEs vs many small PEs
On Wed, 19 Jun 2019 at 21:23, <adamv0025@netconsultings.com> wrote:
Hi folks,
Recently I ran into a peculiar situation where we had to cap a couple of PEs even though only about half of the rather big chassis was populated with cards, the reason being that the central RE/RP was not able to cope with the combined number of routes/VRFs/BGP sessions, etc.
So this made me think about the best strategy for building out the SP edge nowadays (yes, I'm aware of the centralize/decentralize pendulum swinging every couple of years). The conclusion I came to was that *currently the best approach would be to use several medium-to-small (fixed) PEs to replace a big monolithic chassis-based system.
So what I was thinking is: yes, it will cost a bit more (a router is more expensive than an LC), and we will end up with more prefixes in the IGP, more BGP sessions, etc. - don't care. But the benefits are fewer eggs in one basket, simplified and hence faster testing in the case of specialized PEs, and obviously a better RP CPU/MEM-to-port ratio. Am I missing anything, please?
*currently: yes, some old chassis systems or even multi-chassis systems used to support additional RPs and offloading some of the processes (e.g. BGP) onto those - the problem is that these are custom hacks and still a single OS, which needs the LCs/ASICs rebooted when being upgraded - so the problem of too many eggs in one basket still exists (yes, Cisco NCS6k and recent ASR9k Lightspeed LCs are an exception). And yes, there is the "node-slicing" approach from Juniper, where one can offload the CP onto multiple x86 servers and assign LCs to each server (virtual node) - which would solve my chassis-full problem - but honestly, how many of you are running such a setup? Exactly. And that's why I'd be hesitant to deploy this solution in production just yet. I don't know of any other vendor solution like this one, but who knows, maybe in 5 years this is going to be the new standard. Anyway, I need a solution/strategy for the next 3-5 years.
Would like to hear your thoughts on this conundrum.
adam
netconsultings.com ::carrier-class solutions for the telecommunications industry::
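To put rough numbers on the trade-off described above (more IGP prefixes and BGP sessions in exchange for less control-plane state per RE/RP), here is a minimal sketch. Everything in it is an assumption made for illustration: the `EdgeDesign` type, the `summarize` helper, the even spread of VRFs and routes across PEs, one loopback per PE in the IGP, and every PE peering with two route reflectors. None of the figures come from the thread.

```python
from dataclasses import dataclass


@dataclass
class EdgeDesign:
    """Hypothetical description of one way to build the service edge."""
    name: str
    pe_count: int              # number of PE routers in the design
    ports_per_pe: int          # customer-facing ports served by one RE/RP
    route_reflectors: int = 2  # assumed: every PE peers with every RR


def summarize(design: EdgeDesign, total_vrfs: int, total_routes: int) -> dict:
    """Rough per-design figures, assuming VRFs and routes spread evenly."""
    return {
        # Network-wide cost that grows with the number of PEs.
        "igp_loopbacks": design.pe_count,
        "ibgp_sessions": design.pe_count * design.route_reflectors,
        # Control-plane load that each RE/RP has to carry.
        "vrfs_per_pe": total_vrfs / design.pe_count,
        "routes_per_pe": total_routes / design.pe_count,
        "ports_per_rp": design.ports_per_pe,
    }


if __name__ == "__main__":
    few_big = EdgeDesign("few big chassis", pe_count=4, ports_per_pe=400)
    many_small = EdgeDesign("many small PEs", pe_count=16, ports_per_pe=100)
    for design in (few_big, many_small):
        print(design.name, summarize(design, total_vrfs=8000, total_routes=2_000_000))
```

With these made-up inputs, going from 4 to 16 PEs quadruples the loopbacks and iBGP sessions but cuts the VRFs and routes each RE/RP must hold to a quarter. In a real network VRFs are rarely spread evenly, so treat the per-PE numbers as a best case.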
Hi Adam,
Over the years I have been bitten multiple times by having fewer big routers with either far too many services/customers connected to them or too much traffic going through them. These days I always go for more, smaller routers rather than fewer, larger routers.
One experience I have had is that when there is an outage on a large PE, even when it still has spare capacity, the business impact can be too much to handle (the support desk is overwhelmed; customers become irate if you can't quickly tell them what all the impacted services are and when service will be restored; the NMS has so many alarms it's not clear what the problem is or where it's coming from; etc.).
I've seen networks place a change freeze on devices, with the exception of changes that migrate customers or services off the PE, because any outage would create too great an impact on the business or risk customers terminating their contracts. I've also seen change freezes placed upon large PEs because the complexity was too great: trying to work out the impact of a change on one of the original PEs from when the network was first built, which is somehow linked to virtually every service on the network in some obscure and unforeseeable way.
This doesn't mean there isn't a place for large routers. For example, in a typical network, by the time we get to the P node layer in the core we tend to have high levels of redundancy, i.e. any PE is dual-homed to two or more P nodes and will have 100% redundant capacity. Down at the access layer, customers may be connected to a single access layer device, or the access layer device might have a single backhaul link. So technically we have lots of customers, services and traffic passing through larger P node devices, but these devices have a low rate of change / low touch, perform a low number of functions, are operationally simple, and are highly redundant.
Conversely, at the service edge, which I guess is your main concern here, I'm all about more, smaller devices, each dedicated to a single service. I've tried to write up some of my experiences here (https://null.53bits.co.uk/index.php?page=few-larger-routers-vs.-many-smaller...). The tl;dr version, though, is that there's rarely a technical restriction on having fewer large routers; it's an operational/business impact problem.
I'd like to hear from anyone who has had great success with fewer larger PEs.
Cheers, James.
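The contrast drawn above between large PEs and large P nodes can be sketched as simple arithmetic. This is only an illustration under stated assumptions: the function names `pe_outage_impact` and `p_outage_survivable` are hypothetical, customers are assumed to be single-homed and spread evenly across PEs, and the load of a failed P node is assumed to be absorbed entirely by the remaining P nodes.

```python
def pe_outage_impact(total_customers: int, pe_count: int) -> float:
    """Customers losing service when one PE fails.

    Assumes single-homed customers spread evenly across all PEs.
    """
    return total_customers / pe_count


def p_outage_survivable(p_nodes: int,
                        load_per_node_gbps: float,
                        capacity_per_node_gbps: float) -> bool:
    """True if the remaining P nodes can carry the load of one failed P node."""
    if p_nodes < 2:
        return False  # a lone P node has nothing to fail over to
    total_load = p_nodes * load_per_node_gbps
    remaining_capacity = (p_nodes - 1) * capacity_per_node_gbps
    return total_load <= remaining_capacity


if __name__ == "__main__":
    # Same (made-up) customer base on 4 large PEs versus 24 small PEs.
    print(pe_outage_impact(12_000, 4))   # customers hit by one large-PE outage
    print(pe_outage_impact(12_000, 24))  # customers hit by one small-PE outage
    # Two P nodes each running at half capacity: one can absorb the other.
    print(p_outage_survivable(2, 400, 800))
```

The first function shows why a single large-PE outage is felt so widely, while the second captures the "100% redundant capacity" condition: a pair of P nodes each loaded to at most half of its capacity survives the loss of either one.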
On 27/Jun/19 10:58, James Bensley wrote:
Hi Adam,
Over the years I have been bitten multiple times by having fewer big routers with either far too many services/customers connected to them or too much traffic going through them. These days I always go for more, smaller routers rather than fewer, larger routers.
One experience I have had is that when there is an outage on a large PE, even when it still has spare capacity, the business impact can be too much to handle (the support desk is overwhelmed; customers become irate if you can't quickly tell them what all the impacted services are and when service will be restored; the NMS has so many alarms it's not clear what the problem is or where it's coming from; etc.).
I've seen networks place a change freeze on devices, with the exception of changes that migrate customers or services off the PE, because any outage would create too great an impact on the business or risk customers terminating their contracts. I've also seen change freezes placed upon large PEs because the complexity was too great: trying to work out the impact of a change on one of the original PEs from when the network was first built, which is somehow linked to virtually every service on the network in some obscure and unforeseeable way.
I would tend to agree when the edge routers are massive, e.g., boxes like the Cisco ASR9922 or the Juniper MX2020 are simply too large, and present a real risk re: that level of customer aggregation (even for low-revenue services such as Broadband). I don't think I'd ever justify buying these towers to aggregate customers, mainly due to the risk.
For us, even the MX960 is too big, which is why we focus on the MX480 (ASR9906 being the equivalent). It's a happy medium between the small and large end of the spectrum.
And as I mentioned before, we just look at a totally different box for 100Gbps customers.
Mark.
On 27 June 2019 16:26:03 BST, Mark Tinka <mark.tinka@seacom.mu> wrote:
I would tend to agree when the edge routers are massive, e.g., boxes like the Cisco ASR9922 or the Juniper MX2020 are simply too large, and present a real risk re: that level of customer aggregation (even for low-revenue services such as Broadband). I don't think I'd ever justify buying these towers to aggregate customers, mainly due to the risk.
For us, even the MX960 is too big, which is why we focus on the MX480 (ASR9906 being the equivalent). It's a happy medium between the small and large end of the spectrum.
And as I mentioned before, we just look at a totally different box for 100Gbps customers.
Mark.
Yeah, if you want to name specific boxes then yes, I've had similar experiences with the same boxen. Even the MX960 is slightly too big for a PE, depending on how you load it (port combinations). Large boxes like the MX2020, ASR9922, NCS6K, etc. can only reasonably be used as P nodes, in my opinion.
Cheers, James.
On 27/Jun/19 21:41, James Bensley wrote:
Large boxes like the MX2020, ASR9922, NCS6K, etc. can only reasonably be used as P nodes, in my opinion.
The NCS6000 was always designed as a core router to replace the CRS. We just haven't seen the need for one, since the CRS-X we operate (8-slot chassis) is still more than enough for our requirements. But yes, all of these edge routers, nowadays, are very decent core boxes also, particularly if you run a BGP-free core and have no need to support non-Ethernet links to any reasonable degree in there.
Mark.