You're getting to the core of the question (sorry, I could not resist...) -- and again the complexity is as much in the terminology as anything else.

In EZchip, at least as we used it on the ASR9k, the chip had a bunch of processing cores, and each core performed some of the work on each packet. I honestly don't know if the cores themselves were different or if they were all the same physical design, but they were definitely attached to different memories, and they definitely ran different microcode. These cores were allocated to separate stages, and had names along the lines of {parse, search, encap, transmit} etc. I'm sure these aren't 100% correct but you get the point. Importantly, there were NOT the same number of cores for each stage, so when a packet went from stage A to stage B there was some kind of mux in between. If you knew precisely that each stage had the same number of cores, you could choose to arrange it such that the packet always followed a "straight line" through the processing pipe, which would make some parts of the implementation cheaper/easier.

You're correct that the instruction set for stuff like this is definitely not ARM (nor x86 nor anything else standard) because the problem space you're optimizing for is a lot smaller than what you'd have on a more general purpose CPU.

The (enormous) challenge for running the same ucode on multiple targets is that networking has exceptionally high performance requirements -- billions of packets per second is where this whole thread started! Fortunately, we also have a much smaller problem space to solve than general purpose compute, although in a lot of places that's because we vendors have told operators "Look, if you want something that can forward a couple hundred terabits in a single system, you're going to have to constrain what features you need, because otherwise the current hardware just can't do it".
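To make the mux point a bit more concrete, here's a rough Python sketch. Everything in it is an assumption for illustration: the core counts are made up, and the round-robin selection stands in for "some kind of mux" (I'm not claiming that's how the real hardware picked a core). It just shows that with unequal core counts a packet's core index changes from stage to stage, while with equal counts it can stay in one lane.

```python
from itertools import cycle

# Hypothetical stage names and core counts, for illustration only.
# These are NOT the real EZchip numbers.
STAGES = [("parse", 4), ("search", 8), ("encap", 2), ("transmit", 4)]


def build_muxes(stages):
    """One round-robin selector per stage (an assumed mux policy)."""
    return {name: cycle(range(cores)) for name, cores in stages}


def forward(stages, muxes):
    """Return the (stage, core) path the next packet takes through the pipe."""
    return [(name, next(muxes[name])) for name, _ in stages]


def straight_line(packet_id, stages):
    """Only valid when every stage has the same core count: no mux needed."""
    counts = {cores for _, cores in stages}
    if len(counts) != 1:
        raise ValueError("stages have different core counts; a mux is required")
    lane = packet_id % counts.pop()
    return [(name, lane) for name, _ in stages]


if __name__ == "__main__":
    muxes = build_muxes(STAGES)
    for packet_id in range(3):
        print(packet_id, forward(STAGES, muxes))
```

By the third packet the encap stage has already wrapped back to core 0 while the other stages are on core 2, which is exactly the case where a fixed lane through the pipe no longer works.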
To get that kind of performance without breaking the bank requires -- or at least has required up until this point in time -- some very tight integration between the hardware forwarding design and the microcode. I was at Barefoot when P4 was released, and Tofino was the closest thing I've seen to a "general purpose network ucode machine" -- and even that was still very much optimized in terms of how the hardware was designed and built, and it VERY much required the P4 programmer to have a deep understanding of what hardware resources were available.

When you write a P4 program and compile it for an x86 machine, you can basically create as many tables and lookup stages as you want -- you just have to eat more CPU and memory accesses for more complex programs and they run slower. But on a chip like Tofino (or any other NPU-like target) you're going to have finite limits on how many processing stages and memory tables exist... so it's more the case that when your program gets bigger it no longer "just runs slower" but rather it "doesn't run at all".

The industry would greatly benefit from some magical abstraction layer that would let people write forwarding code that's both target-independent AND high-performance, but at least so far the performance penalty for making such code target independent has been waaaaay more than the market is willing to bear.

--lj

-----Original Message-----
From: Saku Ytti <saku@ytti.fi>
Sent: Sunday, August 7, 2022 4:44 AM
To: ljwobker@gmail.com
Cc: Jeff Tantsura <jefftant.ietf@gmail.com>; NANOG <nanog@nanog.org>; Jeff Doyle <jdoyle@juniper.net>
Subject: Re: 400G forwarding - how does it work?

On Sat, 6 Aug 2022 at 17:08, <ljwobker@gmail.com> wrote:
For a while, GSR and CRS type systems had linecards where each card had a bunch of chips that together built the forwarding pipeline. You had chips for the L1/L2 interfaces, chips for the packet lookups, chips for the QoS/queueing math, and chips for the fabric interfaces. Over time, we integrated more and more of these things together until you (more or less) had a linecard where everything was done on one or two chips, instead of a half dozen or more.

Once we got here, the next step was to build linecards where you actually had multiple independent things doing forwarding -- on the ASR9k we called these "slices". This again multiplies the performance you can get, but now both the software and the operators have to deal with the complexity of having multiple things running code where you used to only have one.

Now let's jump into the 2010s, where the silicon integration allows you to put down multiple cores or pipelines on a single chip, and each of these is now (more or less) its own forwarding entity. So now you've got yet ANOTHER layer of abstraction. If I can attempt to draw out the tree, it looks like this now:

1) you have a chassis or a system, which has a bunch of linecards.
2) each of those linecards has a bunch of NPUs/ASICs
3) each of those NPUs has a bunch of cores/pipelines

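As a rough illustration of the tree above, here is a small Python sketch. The class names and the counts in the example are hypothetical and not taken from any real platform; the only point is that the number of things software has to address multiplies at every layer.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class NPU:
    cores: int  # cores or pipelines on this chip


@dataclass
class Linecard:
    npus: List[NPU] = field(default_factory=list)


@dataclass
class Chassis:
    linecards: List[Linecard] = field(default_factory=list)

    def forwarding_entities(self):
        """Yield a (linecard, npu, core) address for every forwarding entity."""
        for lc_idx, lc in enumerate(self.linecards):
            for npu_idx, npu in enumerate(lc.npus):
                for core_idx in range(npu.cores):
                    yield (lc_idx, npu_idx, core_idx)


if __name__ == "__main__":
    # Made-up example: 2 linecards, 2 NPUs each, 4 cores per NPU.
    chassis = Chassis([Linecard([NPU(4), NPU(4)]) for _ in range(2)])
    print(len(list(chassis.forwarding_entities())))
```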
Thank you for this. I think we may have some ambiguity here. For now I'll ignore multichassis designs, as those went out of fashion, and describe only 'NPU', not express/brcm style pipeline.

1) you have a chassis with multiple linecards
2) each linecard has 1 or more forwarding packages
3) each package has 1 or more NPUs (Juniper calls these slices, unsure if EZchip vocabulary is same here)
4) each NPU has 1 or more identical cores

On 4), I can't really name any NPU with 1 core, I reckon; an NPU, like a GPU, pretty inherently has many, many cores. And unlike some in this thread, I don't think they ever are ARM instruction set, that makes no sense. You create an instruction set targeting the application at hand, which the ARM instruction set is not. Maybe some day we will have some forwarding-IA, allowing customers to provide ucode that runs on multiple targets, but this would reduce the pace of innovation.

Some of those NPU core architectures are flat, like Trio, where a single core handles the entire packet. Other core architectures, like FP, are matrices, where you have multiple lines and the packet picks 1 of the lines and traverses each core in that line. (FP has many more cores in line, compared to leaba/pacific stuff)

--
  ++ytti
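
The flat vs. matrix distinction can be sketched in a few lines of Python. This is only a toy model: the function names are made up, and picking a core or line by modulo is an assumption for illustration, not how Trio or FP actually select one.

```python
def flat_dispatch(packet_id, num_cores):
    """Flat model: a single core does all the work for this packet."""
    core = packet_id % num_cores  # assumed selection policy
    return [core]


def matrix_dispatch(packet_id, num_lines, cores_per_line):
    """Matrix model: the packet picks one line, then visits every core in it."""
    line = packet_id % num_lines  # assumed selection policy
    return [(line, position) for position in range(cores_per_line)]


if __name__ == "__main__":
    print(flat_dispatch(5, num_cores=4))
    print(matrix_dispatch(5, num_lines=3, cores_per_line=4))
```

In the flat case the packet's whole path is one core; in the matrix case the path is a fixed row of cores, each doing its part of the work in order.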