So the most important bits are pipelining and parallelism. And this is substantially simplified, but hopefully it helps.

Pipelining basically means that you have a whole bunch of different operations that you need to perform to forward a packet. Lots of these are lookups into things like the FIB tables, the encap tables, the MAC tables, and literally dozens of other places where you store configuration and network state. Some of these are very small, simple tables (“give me the value for a packet with TOS = 0b101”) and some are very complicated, like multi-level longest-prefix trees/tries that are built from lots of custom hardware logic and memory. It varies a lot from chip to chip, but there are on the order of 50-100 different tables for the current generation of “fast” chips doing lots of 400GE interfaces. Figuring out how to distribute all this forwarding state across all the different memory banks/devices in a big, fast chip is one of the Very Hard Problems that the chip makers and system vendors have to figure out.

So once you build out this pipeline, you’ve got a bunch of different steps that all happen sequentially. The “length” of the pipeline puts a floor on the latency for switching a single packet… if I have to do 25 lookups and they’re all dependent on the one before, it’s not possible for me to switch the packet in any less than 25 clocks… BUT, if I have a bunch of hardware all running these operations at the same time, I can push the aggregate forwarding capacity way higher. This is the parallelism part. I can take multiple instances of these memory/logic pipelines and run them in parallel to increase the throughput. Now there’s plenty of complexity in terms of HOW I do all that parallelism: figuring out whether I have to replicate entire memory structures or if I can come up with sneaky ways of doing multiple lookups more efficiently, but that’s getting into the magic secret sauce type stuff.
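
To put rough numbers on the latency-versus-throughput point, here is a minimal Python sketch. It is a toy model, not how any real chip is built: it assumes every lookup stage takes exactly one clock and that each pipeline can accept a new packet on every clock. The function name `clocks_to_forward`, the burst size, and the count of four parallel pipelines are made up for illustration; only the depth of 25 comes from the example above.

```python
import math

def clocks_to_forward(packets, depth, pipelines=1, pipelined=True):
    """Clock cycles until the last of `packets` has left the lookup pipeline.

    Toy model (assumption): each of the `depth` dependent lookups takes one
    clock, and a pipeline can accept one new packet per clock.
    """
    if packets <= 0:
        return 0
    # Packets are spread evenly over the parallel pipelines.
    per_pipeline = math.ceil(packets / pipelines)
    if not pipelined:
        # A packet must finish all lookups before the next one may start.
        return per_pipeline * depth
    # The first packet needs `depth` clocks; each later packet exits
    # one clock after the packet in front of it.
    return depth + per_pipeline - 1

DEPTH = 25           # dependent lookups per packet (figure from the text)
PACKETS = 1_000_000  # illustrative burst size (assumption)

single = clocks_to_forward(1, DEPTH)
serial = clocks_to_forward(PACKETS, DEPTH, pipelined=False)
piped = clocks_to_forward(PACKETS, DEPTH)
piped_x4 = clocks_to_forward(PACKETS, DEPTH, pipelines=4)

print(f"one packet, any design:       {single:>12,} clocks")
print(f"burst, no pipelining:         {serial:>12,} clocks")
print(f"burst, one pipeline:          {piped:>12,} clocks")
print(f"burst, four pipelines:        {piped_x4:>12,} clocks")
```

In this model the single-packet latency never drops below 25 clocks, but the cost per packet across the burst falls to roughly one clock with one pipeline and roughly a quarter of a clock with four.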

I work on/with a chip that can forward about 10B packets per second… so if we go back to the order-of-magnitude number that I’m doing about “tens” of memory lookups for every one of those packets, we’re talking about something like a hundred BILLION total memory lookups per second… and since memory does NOT give me answers in 1 picosecond… we get back to pipelining and parallelism.

Hopefully that helps at least some.

Disclaimer: I’m a Cisco employee, these words are mine and not representative of anything awesome that I may or may not work on in my day job…

-lj

________________________________
From: NANOG <nanog-bounces+ljwobker=gmail.com@nanog.org> on behalf of James Bensley <jwbensley+nanog@gmail.com>
Sent: Monday, July 25, 2022 8:55 AM
To: NANOG <nanog@nanog.org>
Subject: 400G forwarding - how does it work?

Hi All,

I've been trying to understand how forwarding at 400G is possible, specifically in this example, in relation to the Broadcom J2 chips, but I don't think the mystery is anything specific to them...

According to the Broadcom Jericho2 BCM88690 data sheet it provides 4.8Tbps of traffic processing and supports packet forwarding at 2Bpps. According to my maths that means it requires packet sizes of 300B to reach line rate across all ports. The data sheet says packet sizes above 284B, so I guess this is excluding some headers like the inter-frame gap and CRC (nothing after the PHY/MAC needs to know about them if the CRC is valid)? As I interpret the data sheet, J2 should support a chassis with 12x 400Gbps ports at line rate with 284B packets then.

Jericho2 can be linked to a BCM16K for expanded packet forwarding tables and lookup processing (i.e. to hold the full global routing table; in such a case, forwarding lookups are offloaded to the BCM16K).
The BCM16K documentation suggests that it uses TCAM for exact matching (e.g., for ACLs) in something called the "Database Array" (with 2M 40b entries?), and SRAM for LPM (e.g., IP lookups) in something called the "User Data Array" (with 16M 32b entries?).

A BCM16K supports 16 parallel searches, which means that each of the 12x 400G ports on a Jericho2 could perform a forwarding lookup at the same time. This means that the BCM16K "only" needs to perform forwarding look-ups at a line rate of 1x 400Gbps, not 4.8Tbps, and "only" for packets larger than 284 bytes, because that is the Jericho2 line-rate pps rate. This means that each of the 16 parallel searches in the BCM16K needs to support a rate of 164Mpps (164,473,684) to reach 400Gbps. This is much more in the realm of feasible, but still pretty extreme...

1 second / 164473684 packets = 1 packet every 6.08 nanoseconds, which is within the access time of TCAM and SRAM, but this needs to include some computing time too, e.g. generating a key for a lookup and passing the results along the pipeline etc. The BCM16K has a clock speed of 1GHz (1,000,000,000 cycles per second, or one cycle every 1 nanosecond) and supports an SRAM memory access in a single clock cycle (according to the data sheet). If one cycle is required for an SRAM lookup, the BCM16K only has 5 cycles to perform other computation tasks, and the J2 chip needs to do the various header re-writes and various counter updates etc., so how is this magic happening?!?

The obvious answer is that it's not magic and my understanding is fundamentally flawed, so please enlighten me.

Cheers,
James.
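
For anyone wanting to re-check the arithmetic in the question, here is a short Python sketch. The 4.8Tbps, 2Bpps, 400Gbps, 284B and 1GHz figures are the ones quoted in the message. The 20 bytes of per-packet wire overhead (preamble, start-of-frame delimiter and inter-frame gap) is an assumption about how the 164,473,684 figure was derived; it is the standard Ethernet value and it does reproduce that number. The function names are made up for illustration.

```python
def min_packet_bytes(bandwidth_bps, packets_per_sec):
    """Smallest average packet size (bytes) at which the bandwidth limit,
    rather than the packet-rate limit, becomes the bottleneck."""
    return bandwidth_bps / 8 / packets_per_sec

def packets_per_second(port_bps, packet_bytes, overhead_bytes=20):
    """Packet rate on one port at line rate.

    overhead_bytes=20 is an assumption: 7B preamble + 1B start-of-frame
    delimiter + 12B inter-frame gap per Ethernet frame on the wire.
    """
    return port_bps / ((packet_bytes + overhead_bytes) * 8)

CHIP_BPS = 4_800_000_000_000   # 4.8Tbps (figure from the message)
CHIP_PPS = 2_000_000_000       # 2Bpps (figure from the message)
PORT_BPS = 400_000_000_000     # one 400G port
CLOCK_HZ = 1_000_000_000       # 1GHz clock quoted for the BCM16K

size = min_packet_bytes(CHIP_BPS, CHIP_PPS)
pps = packets_per_second(PORT_BPS, 284)
ns_per_packet = 1e9 / pps
cycles_per_packet = CLOCK_HZ / pps

print(f"packet size for line rate: {size:.0f} bytes")
print(f"per-port packet rate:      {pps:,.0f} pps")
print(f"time budget per packet:    {ns_per_packet:.2f} ns")
print(f"clock cycles per packet:   {cycles_per_packet:.2f}")
```

Note that this budget is the interval between successive packets on one search path, not a bound on how long any one packet may spend being processed.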