Hi All,

I've been trying to understand how forwarding at 400G is possible, specifically in this example in relation to the Broadcom J2 chips, but I don't think the mystery is anything specific to them...

According to the Broadcom Jericho2 BCM88690 data sheet it provides 4.8Tbps of traffic processing and supports packet forwarding at 2Bpps. According to my maths that means it requires packet sizes of 300B to reach line rate across all ports. The data sheet says packet sizes above 284B, so I guess this is excluding some overheads like the inter-frame gap and CRC (nothing after the PHY/MAC needs to know about them if the CRC is valid)? As I interpret the data sheet, J2 should then support a chassis with 12x 400Gbps ports at line rate with 284B packets.

Jericho2 can be linked to a BCM16K for expanded packet forwarding tables and lookup processing (e.g. to hold the full global routing table; in such a case, forwarding lookups are offloaded to the BCM16K). The BCM16K documentation suggests that it uses TCAM for exact matching (e.g. for ACLs) in something called the "Database Array" (with 2M 40b entries?), and SRAM for LPM (e.g. IP lookups) in something called the "User Data Array" (with 16M 32b entries?).

A BCM16K supports 16 parallel searches, which means that each of the 12x 400G ports on a Jericho2 could perform a forwarding lookup at the same time. This means that the BCM16K "only" needs to perform forwarding lookups at a linear rate of 1x 400Gbps, not 4.8Tbps, and "only" for packets of 284 bytes or larger, because that is the Jericho2 line-rate pps limit. This means that each of the 16 parallel searches in the BCM16K needs to support a rate of 164Mpps (164,473,684) to reach 400Gbps. This is much more in the realm of feasible, but still pretty extreme... 1 second / 164,473,684 packets = 1 packet every 6.08 nanoseconds, which is within the access time of TCAM and SRAM, but this needs to include some computing time too, e.g. generating a key for a lookup and passing the results along the pipeline etc.

The BCM16K has a clock speed of 1Ghz (1,000,000,000 cycles per second, or one cycle every nanosecond) and supports an SRAM memory access in a single clock cycle (according to the data sheet). If one cycle is required for an SRAM lookup, the BCM16K only has 5 cycles left to perform other computation tasks, and the J2 chip needs to do the various header re-writes and various counter updates etc., so how is this magic happening?!?

The obvious answer is that it's not magic and my understanding is fundamentally flawed, so please enlighten me.

Cheers,
James.
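As a quick sanity check, here is the arithmetic above worked through in Python. The figures are the ones quoted from the data sheets; the only added assumption is that the 164,473,684 pps figure includes 20 bytes of per-packet wire overhead (preamble + inter-frame gap) on top of the 284B frame, which is what reproduces that number.

    # Figures from the post; only the 20B of preamble + IFG per packet is assumed.
    THROUGHPUT_BPS = 4.8e12   # J2 traffic processing
    FORWARDING_PPS = 2e9      # J2 packet forwarding
    PORT_BPS       = 400e9    # one 400G port
    PKT_BYTES      = 284      # line-rate packet size from the data sheet
    WIRE_OVERHEAD  = 20       # preamble (8B) + inter-frame gap (12B), assumed

    # Smallest packet at which the chip is bandwidth-limited rather than pps-limited:
    print(THROUGHPUT_BPS / FORWARDING_PPS / 8)            # 300.0 bytes

    # Per-port packet rate and inter-packet interval at 400G with 284B frames:
    port_pps = PORT_BPS / ((PKT_BYTES + WIRE_OVERHEAD) * 8)
    print(f"{port_pps:,.0f} pps")                         # ~164,473,684 pps
    print(f"{1 / port_pps * 1e9:.2f} ns per packet")      # ~6.08 ns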
Once upon a time, James Bensley <jwbensley+nanog@gmail.com> said:
The obvious answer is that it's not magic and my understanding is fundamentally flawed, so please enlighten me.
So I can't answer your specific question, but I just wanted to say that your CPU analysis is simplistic and doesn't really match how CPUs work now. Something can be "line rate" but not push the first packet through in the shortest time.

CPUs break operations down into a series of very small operations and then run those operations in a pipeline, with different parts of the CPU working on the micro-operations for different overall operations at the same time. The first object out of the pipeline (packet destination calculated in this case) may take more time, but after that you keep getting a result every cycle/few cycles. For example, it might take 4 times as long to process the first packet, but as long as the hardware can handle 4 packets in a queue, you'll get a packet result every cycle after that, without dropping anything. So maybe the first result takes 12 cycles, but then you can keep getting a result every 3 cycles as long as the pipeline is kept full.

This type of pipelined+superscalar processing was a big deal with Cray supercomputers, but made it down to PC-level hardware with the Pentium Pro. It has issues (see all the Spectre and Retbleed CPU flaws with branch prediction, for example), but in general it allows a CPU to handle a chain of operations faster than it can handle each operation individually.
-- 
Chris Adams <cma@cmadams.net>
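A toy model of the latency-vs-throughput point, using the illustrative numbers above (first result after 12 cycles, then one result every 3 cycles); this is just a sketch of the general idea, not any particular chip:

    FIRST_RESULT = 12   # cycles until the first packet's result appears
    STEADY_STATE = 3    # cycles between results once the pipeline is full

    def completion_cycle(n: int) -> int:
        """Cycle at which the n-th packet (0-indexed) is finished."""
        return FIRST_RESULT + STEADY_STATE * n

    for n in (0, 1, 2, 3):
        print(f"packet {n} done at cycle {completion_cycle(n)}")

    # Per-packet latency never drops below 12 cycles, but the sustained
    # rate converges on one packet every 3 cycles:
    n = 1_000_000
    print(f"average cycles per packet over {n} packets: "
          f"{completion_cycle(n - 1) / n:.3f}")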
Thanks for the responses Chris, Saku… On Mon, 25 Jul 2022 at 15:17, Chris Adams <cma@cmadams.net> wrote:
Once upon a time, James Bensley <jwbensley+nanog@gmail.com> said:
The obvious answer is that it's not magic and my understanding is fundamentally flawed, so please enlighten me.
So I can't answer to your specific question, but I just wanted to say that your CPU analysis is simplistic and doesn't really match how CPUs work now.
It wasn't a CPU analysis because switching ASICs != CPUs. I am aware of the x86 architecture, but know little of network ASICs, so I was deliberately trying not to apply my x86 knowledge here, in case it sent me down the wrong path. You made references to typical CPU features:
For example, it might take 4 times as long to process the first packet, but as long as the hardware can handle 4 packets in a queue, you'll get a packet result every cycle after that, without dropping anything. So maybe the first result takes 12 cycles, but then you can keep getting a result every 3 cycles as long as the pipeline is kept full.
Yes, in the x86/x64 CPU world keeping the instruction cache and data cache hot indeed results in optimal performance, and as you say modern CPUs use parallel pipelines amongst other techniques like branch prediction, SIMD, (N)UMA, and so on, but I would assume (because I don't know) that not all of the x86 feature set maps nicely to packet processing in ASICs (VPP uses these techniques on COTS CPUs to emulate a fixed pipeline, rather than a run-to-completion model). You and Saku both suggest that heavy parallelism is the magic sauce:
Something can be "line rate" but not push the first packet through in the shortest time.
On Mon, 25 Jul 2022 at 15:16, Saku Ytti <saku@ytti.fi> wrote:
I.e. say JNPR Trio PPE has many threads, and only one thread is running, rest of the threads are waiting for answers from memory. That is, once we start pushing packets through the device, it takes a long ass time (like single digit microseconds) before we see any packets out. 1000x longer than your calculated single digit nanoseconds.
In principle I accept this idea. But let's try and do the maths, I'd like to properly understand.

The non-drop rate of the J2 is 2Bpps @ 284 bytes == 4.8Tbps, and my example scenario was a single J2 chip in a 12x400G device. If each port is receiving 400G @ 284 bytes (164,473,684 pps), that's one packet every 6.08 nanoseconds coming in. What kind of parallelism is required to stop ingress dropping? Say it takes 5 microseconds to process and forward a packet (which seems reasonable looking at some Arista data sheets for products which use J2 variants); that means we need to be operating on 5,000ns / 6.08ns == 822 packets per port simultaneously, so roughly 9,868 packets being processed across all 12 ports simultaneously, to stop ingress dropping on all interfaces.

I think the latest generation Trio has 160 PPEs per PFE, but I'm not sure how many threads per PPE. Older generations had 20 threads/contexts per PPE, so if it hasn't increased that would make for 3,200 threads in total. That is a 1.6Tbps FD chip, although it's not apples to apples of course, as Trio is run-to-completion. The Nokia FP5 has 1,200 cores (I have no idea how many threads per core) and is rated for 4.8Tbps FD. Again it's doing something quite different to a J2 chip, and again it's RTC. J2 is a partially-fixed pipeline, but slightly programmable if I have understood correctly, and definitely at the other end of the spectrum compared to RTC.

So are we to surmise that a J2 chip has circa 10k parallel pipelines, in order to process 9,868 packets in parallel? I have no frame of reference here, but in comparison to Gen 6 Trio or FP5, that seems very high to me (to the point where I assume I am wrong).

Cheers,
James.
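Sanity-checking the parallelism estimate with Little's law (packets in flight = arrival rate x time in the box); the ~5 microsecond latency is the assumption taken from the post above, not a measured figure:

    PORT_PPS  = 164_473_684   # 400G at 284B frames (from earlier in the thread)
    LATENCY_S = 5e-6          # assumed ~5 us to process and forward a packet
    PORTS     = 12

    in_flight_per_port = PORT_PPS * LATENCY_S
    print(round(in_flight_per_port))          # ~822 packets in flight per port
    print(round(in_flight_per_port * PORTS))  # ~9868 packets across 12 ports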
On Mon, 25 Jul 2022 at 21:51, James Bensley <jwbensley+nanog@gmail.com> wrote:
I have no frame of reference here, but in comparison to Gen 6 Trio of NP5, that seems very high to me (to the point where I assume I am wrong).
No, you are right, FP has many, many more PPEs than Trio.

For a fair comparison, you compare how many lines of PPEs the FP has to how many PPEs Trio has, because in Trio a single PPE handles the entire packet, and all PPEs run identical ucode, performing the same work. In FP, each PPE in a line has its own function: the first PPE in the line could be parsing the packet and extracting keys from it, the second could be doing ingress ACL, the 3rd ingress QoS, the 4th the ingress lookup, and so forth.

Why choose this NP design instead of the Trio design, I don't know; I don't understand the upsides. The downside is easy to understand: picture yourself as a ucode developer, and you get a task to 'add this magic feature in the ucode'. Implementing it in Trio seems trivial: add the code in the ucode, rock on. On FP, you might have to go 'aww shit, I need to do this before PPE5 but after PPE3 in the pipeline, but the instruction cost it adds isn't in the budget that I have in PPE4, crap, now I need to shuffle around and figure out which PPE in the line runs what function, to keep the PPS we promise to customers'.

Let's look at it from another vantage point: let's cook up an IPv6 header with a crapton of EHs. In Trio, the PPE keeps churning through it, taking a long time, but eventually it gets there or raises an exception and gives up. Every other PPE in the box is fully available to perform work. Same thing in FP? You have HOLB; the PPEs in the line after this PPE are not doing anything and can't do anything until the PPE before them in the line is done.

Today Cisco and Juniper do 'proper' CoPP, that is, they do ingress ACL before and after lookup; before is normally needed for ingress ACL, but the after-lookup ingress ACL is needed for CoPP (we only know after the lookup whether it is a control-plane packet). Nokia doesn't do this at all, and I bet they can't do it, because if they'd add it in the core, where it needs to be in the line, total PPS would go down, as there is no budget for the additional ACL. Instead all control-plane packets from the ingress FP are sent to the control-plane FP, and inshallah we don't congest the connection there, or the FP itself.
-- ++ytti
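A toy simulation of the head-of-line-blocking contrast described above (entirely a made-up model, not based on any vendor documentation): in a line of PPEs an expensive packet stalls every packet behind it in that line, while in a run-to-completion pool it only ties up the one PPE that drew it.

    import heapq

    def pipeline_finish_times(stage_costs):
        """In-order line of PPEs: stage_costs[i][s] = cycles packet i spends in stage s."""
        n, k = len(stage_costs), len(stage_costs[0])
        finish = [[0] * k for _ in range(n)]
        for i in range(n):
            for s in range(k):
                ready = finish[i][s - 1] if s else 0   # packet done with previous stage
                free  = finish[i - 1][s] if i else 0   # stage done with previous packet
                finish[i][s] = max(ready, free) + stage_costs[i][s]
        return [row[-1] for row in finish]

    def rtc_finish_times(total_costs, workers):
        """Run-to-completion pool: each packet runs on the first free PPE."""
        free_at = [0] * workers
        heapq.heapify(free_at)
        done = []
        for cost in total_costs:
            start = heapq.heappop(free_at)
            heapq.heappush(free_at, start + cost)
            done.append(start + cost)
        return done

    STAGES, PACKETS = 4, 100
    costs = [[1] * STAGES for _ in range(PACKETS)]
    costs[10][1] = 50   # one pathological packet (e.g. a long EH chain)

    print("last packet done, line of 4 PPEs:",
          max(pipeline_finish_times(costs)))
    print("last packet done, RTC pool of 4 PPEs:",
          max(rtc_finish_times([sum(c) for c in costs], workers=STAGES)))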
All high-performance networking devices on the market have a pipeline architecture. The pipeline consists of "stages", and ASICs have stages fixed to particular functions [diagram of fixed pipeline stages omitted]. Well, some stages are driven by code nowadays (a little flexibility).

Juniper is pipeline-based too (like any ASIC). They just invented one special stage in 1996 for lookup (a sequential search by nibble in a big external memory tree) – it was public information up to the year 2000. It is a different principle from a TCAM search – performance is traded for flexibility/simplicity/cost.

Network Processors emulate stages on general-purpose ARM cores. It is a pipeline too (different cores for different functions, many cores for every function), it is just a virtual pipeline.

Ed/
On Tue, 26 Jul 2022 at 10:52, Vasilenko Eduard <vasilenko.eduard@huawei.com> wrote:
Juniper is pipeline-based too (like any ASIC). They just invented one special stage in 1996 for lookup (sequence search by nibble in the big external memory tree) – it was public information up to 2000year. It is a different principle from TCAM search – performance is traded for flexibility/simplicity/cost.
How do you define a pipeline? My understanding is that fabric and WAN connections are in a chip called MQ; the 'head' of the packet, some 320B or so (a bit less on more modern Trio, I didn't measure specifically), is then sent to the LU complex for lookup. The LU then sprays packets to one of many PPEs, but once a packet hits a PPE, it is processed until done; it doesn't jump to another PPE. Reordering will occur, which is later restored for flows, but outside of flows reordering may remain. I don't know what the cores are, but I'm comfortable betting money they are not ARM. I know Cisco used ezchip in the ASR9k but is now jumping to their own NPU called Lightspeed, and Lightspeed, like CRS-1 and ASR1k, uses Tensilica cores, which are decidedly not ARM. Nokia, as mentioned, kind of has a pipeline, because a single packet hits every core in the line, and each core does a separate thing.
-- ++ytti
Nope, ASIC vendors are not ARM-based for the PFE. Every "stage" is a very specialized ASIC block with limited programmability (less limited for P4 and some latest-generation ASICs). ARM cores are for Network Processors (NPs). ARM cores (with the proper microcode) could emulate any "stage" of an ASIC. That is the typical explanation for why NPs are more flexible than ASICs.

Stages are connected to a common internal memory where enriched packet headers are stored. The pipeline is just the order of stages that process these internal enriched headers. The size of this internal header is the critical restriction of the ASIC, never disclosed or discussed (but people know it anyway for the most popular ASICs – it is possible to google "key buffer"). Hint: the smallest one in the industry is 128 bytes, the biggest 384 bytes. It is not possible to process longer headers in one PFE pass. A non-compressed SRv6 header could be 208 bytes (+TCP/UDP +VLAN +L2 +ASIC-internal stuff). Hence the need for compression.

It was a big marketing announcement from one famous ASIC vendor just a few years ago that some ASIC stages are capable of dynamically sharing a common big external memory (used for MAC/IP/filters). It may be internal memory too for small scalability, but typically it is external memory. This memory is always discussed in detail – it is needed by the operations team.

This is only about headers. The packet itself (the payload) is stored in a separate memory (buffer) that is not visible to the pipeline stages.

There were times when it was difficult to squeeze everything into one ASIC. Then one chip would prepare an internal (enriched) header and maybe do some processing (some simple stages), then send this header to the next chip for the other "stages" (especially the complicated lookup with external memory attached). That is an artifact now.

Ed/
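A rough size check of the uncompressed SRv6 example above; the header sizes are the standard RFC 8200 / RFC 8754 values, and the 10-SID count is an assumption chosen to arrive at the quoted 208 bytes:

    IPV6_FIXED = 40   # bytes, fixed IPv6 header
    SRH_BASE   = 8    # bytes, Segment Routing Header before the SID list
    SID_BYTES  = 16   # bytes per uncompressed IPv6 SID

    def srv6_header_bytes(sids: int) -> int:
        return IPV6_FIXED + SRH_BASE + SID_BYTES * sids

    print(srv6_header_bytes(10))   # 208 bytes, before L2/VLAN/TCP/UDP and the
                                   # ASIC's own internal metadata are added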
How do you define a pipeline?
For what it's worth, and with just a cursory look through this email, and without wishing to offend anyone's knowledge: a pipeline in processing is the division of the instruction cycle into a number of stages. General-purpose RISC processors are often organized into five such stages. Under optimal conditions, which can be fairly, albeit loosely, interpreted as "one instruction does not affect its peers which are already in one of the stages", a pipeline can increase the number of instructions retired per second, often quoted as MIPS (millions of instructions per second), by a factor equal to the number of stages in the pipeline.

Cheers,
Etienne
-- Ing. Etienne-Victor Depasquale Assistant Lecturer Department of Communications & Computer Engineering Faculty of Information & Communication Technology University of Malta Web. https://www.um.edu.mt/profile/etiennedepasquale
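For reference, the textbook version of the "factor equal to the number of stages" claim, for an ideal k-stage pipeline retiring n instructions with one cycle per stage and no hazards (note the speedup only approaches k as n grows):

    T_{\text{serial}} = n\,k\,\tau, \qquad
    T_{\text{pipelined}} = (k + n - 1)\,\tau, \qquad
    \text{Speedup} = \frac{n\,k}{k + n - 1} \;\xrightarrow{\;n \to \infty\;}\; k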
Pipeline stages are like separate computers (each with its own ALU) sharing the same memory. In the ASIC case, the computers are of different types (different capabilities).
On 25 July 2022 19:02:50 UTC, Saku Ytti <saku@ytti.fi> wrote:
On Mon, 25 Jul 2022 at 21:51, James Bensley <jwbensley+nanog@gmail.com> wrote:
I have no frame of reference here, but in comparison to Gen 6 Trio of NP5, that seems very high to me (to the point where I assume I am wrong).
No you are right, FP has much much more PPEs than Trio.
Can you give any examples?
Why choose this NP design instead of Trio design, I don't know. I don't understand the upsides.
I think one use case is fixed latency. If you have minimal variation in your traffic you can provide a guaranteed upper bound on latency. This should be possible with the RTC model too of course, just harder, because any variation in traffic at all will result in a different run-time duration, and I imagine it is easier to measure, find, and fix/tune chunks of code running on separate cores (like in a pipeline) than more code all running on one core (like in RTC). So that's possibly a second benefit: maybe FP is easier to debug and measure changes on?
Downside is easy to understand, picture yourself as ucode developer, and you get task to 'add this magic feature in the ucode'. Implementing it in Trio seems trivial, add the code in ucode, rock on. On FP, you might have to go 'aww shit, I need to do this before PPE5 but after PPE3 in the pipeline, but the instruction cost it adds isn't in the budget that I have in the PPE4, crap, now I need to shuffle around and figure out which PPE in line runs what function to keep the PPS we promise to customer.
That's why we have packet recirc <troll face>
Let's look it from another vantage point, let's cook-up IPv6 header with crapton of EH, in Trio, PPE keeps churning it out, taking long time, but eventually it gets there or raises exception and gives up. Every other PPE in the box is fully available to perform work. Same thing in FP? You have HOLB, the PPEs in the line after thisPPE are not doing anything and can't do anything, until the PPE before in line is done.
This is exactly the benefit of FP vs NPU: less flexible, more throughput. The NPU has served us (the industry) well at the edge, and FP is serving us well in the core.
Today Cisco and Juniper do 'proper' CoPP, that is, they do ingressACL before and after lookup, before is normally needed for ingressACL but after lookup ingressACL is needed for CoPP (we only know after lookup if it is control-plane packet). Nokia doesn't do this at all, and I bet they can't do it, because if they'd add it in the core where it needs to be in line, total PPS would go down. as there is no budget for additional ACL. Instead all control-plane packets from ingressFP are sent to control plane FP, and inshallah we don't congest the connection there or it.
Interesting. Cheers, James.
"Pipeline" in the context of networking chips is not a terribly well-defined term. In some chips, you'll have a pipeline that is built from very rigid hardware logic blocks -- the first block does exactly one part of the packet forwarding, then hands the packet (or just the header and metadata) to the second block, which does another portion of the forwarding. You build the pipeline out of as many blocks as you need to solve your particular networking problem, and voila!
From a performance standpoint, on current "fast" chips, many (but certainly not all) of the "pipelines" are designed to forward one packet per clock cycle for "normal" use cases. (Of course we sneaky vendors get to decide what is normal and what's not, but that's a separate issue...) So if I have a chip that has one pipeline and it's clocked at 1.25Ghz, that means
"Pipeline", in the context of networking chips, is not a terribly well-defined term! In some chips, you'll have an almost-literal pipeline that is built from very rigid hardware logic blocks. The first block does exactly one part of the packet forwarding, then hands the packet (or just the header and metadata) to the second block, which does another portion of the forwarding. You build the pipeline out of as many blocks as you need to solve your particular networking problem, and voila! The advantages here is that you can make things very fast and power efficient, but they aren't all that flexible, and deity help you if you ever need to do something in a different order than your pipeline! You can also build a "pipeline" out of software functions - write up some Python code (because everyone loves Python, right?) where function A calls function B and so on. At some level, you've just build a pipeline out of different software functions. This is going to be a lot slower (C code will be faster but nowhere near as fast as dedicated hardware) but it's WAY more flexible. You can more or less dynamically build your "pipeline" on a packet-by-packet basis, depending on what features and packet data you're dealing with. "Microcode" is really just a term we use for something like "really optimized and limited instruction sets for packet forwarding". Just like an x86 or an ARM has some finite set of instructions that it can execute, so do current networking chips. The larger that instruction space is and the more combinations of those instructions you can store, the more flexible your code is. Of course, you can't make that part of the chip bigger without making something else smaller, so there's another tradeoff. MOST current chips are really a hybrid/combination of these two extremes. You have some set of fixed logic blocks that do exactly One Set Of Things, and you have some other logic blocks that can be reconfigured to do A Few Different Things. The degree to which the programmable stuff is programmable is a major input to how many different features you can do on the chip, and at what speeds. Sometimes you can use the same hardware block to do multiple things on a packet if you're willing to sacrifice some packet rate and/or bandwidth. The constant "law of physics" is that you can always do a given function in less power/space/cost if you're willing to optimize for that specific thing -- but you're sacrificing flexibility to do it. The more flexibility ("programmability") you want to add to a chip, the more logic and memory you need to add. that it can forward 1.25 billion packets per second. Note that this does NOT mean that I can forward a packet in "a one-point-two-five-billionth of a second" -- but it does mean that every clock cycle I can start on a new packet and finish another one. The length of the pipeline impacts the latency of the chip, although this part of the latency is often a rounding error compared to the number of times I have to read and write the packet into different memories as it goes through the system. So if this pipeline can do 1.25 billion PPS and I want to be able to forward 10BPPS, I can build a chip that has 8 of these pipelines and get my performance target that way. I could also build a "pipeline" that processes multiple packets per clock, if I have one that does 2 packets/clock then I only need 4 of said pipelines... and so on and so forth. 
The exact details of how the pipelines are constructed and how much parallelism I built INSIDE a pipeline as opposed to replicating pipelines is sort of Gooky Implementation Details, but it's a very very important part of doing the chip level architecture as those sorts of decisions drive lots of Other Important Decisions in the silicon design... --lj
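The scaling arithmetic above as a small helper; the function name is made up, and the last line reuses the J2-ish figures from earlier in the thread purely as an illustration:

    import math

    def pipelines_needed(target_pps: float, clock_hz: float, pkts_per_clock: int = 1) -> int:
        """How many pipelines are needed to hit a given pps target."""
        return math.ceil(target_pps / (clock_hz * pkts_per_clock))

    print(pipelines_needed(10e9, 1.25e9))                    # 8 pipelines at 1 pkt/clock
    print(pipelines_needed(10e9, 1.25e9, pkts_per_clock=2))  # 4 pipelines at 2 pkts/clock
    print(pipelines_needed(2e9, 1.0e9))                      # J2's 2 Bpps at ~1 GHz -> 2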
mandatory slide of laundry analogy for pipelining https://cs.stanford.edu/people/eroberts/courses/soco/projects/risc/pipelinin...
"Pipeline" in the context of networking chips is not a terribly well-defined term. In some chips, you'll have a pipeline that is built from very rigid hardware logic blocks -- the first block does exactly one part of the packet forwarding, then hands the packet (or just the header and metadata) to the second block, which does another portion of the forwarding. You build the pipeline out of as many blocks as you need to solve your particular networking problem, and voila!
"Pipeline", in the context of networking chips, is not a terribly well-defined term! In some chips, you'll have an almost-literal pipeline that is built from very rigid hardware logic blocks. The first block does exactly one part of the packet forwarding, then hands the packet (or just the header and metadata) to the second block, which does another portion of the forwarding. You build the pipeline out of as many blocks as you need to solve your particular networking problem, and voila! The advantages here is that you can make things very fast and power efficient, but they aren't all that flexible, and deity help you if you ever need to do something in a different order than your pipeline!
You can also build a "pipeline" out of software functions - write up some Python code (because everyone loves Python, right?) where function A calls function B and so on. At some level, you've just build a pipeline out of different software functions. This is going to be a lot slower (C code will be faster but nowhere near as fast as dedicated hardware) but it's WAY more flexible. You can more or less dynamically build your "pipeline" on a packet-by-packet basis, depending on what features and packet data you're dealing with.
"Microcode" is really just a term we use for something like "really optimized and limited instruction sets for packet forwarding". Just like an x86 or an ARM has some finite set of instructions that it can execute, so do current networking chips. The larger that instruction space is and the more combinations of those instructions you can store, the more flexible your code is. Of course, you can't make that part of the chip bigger without making something else smaller, so there's another tradeoff.
MOST current chips are really a hybrid/combination of these two extremes. You have some set of fixed logic blocks that do exactly One Set Of Things, and you have some other logic blocks that can be reconfigured to do A Few Different Things. The degree to which the programmable stuff is programmable is a major input to how many different features you can do on the chip, and at what speeds. Sometimes you can use the same hardware block to do multiple things on a packet if you're willing to sacrifice some packet rate and/or bandwidth. The constant "law of physics" is that you can always do a given function in less power/space/cost if you're willing to optimize for that specific thing -- but you're sacrificing flexibility to do it. The more flexibility ("programmability") you want to add to a chip, the more logic and memory you need to add.
From a performance standpoint, on current "fast" chips, many (but certainly not all) of the "pipelines" are designed to forward one packet per clock cycle for "normal" use cases. (Of course we sneaky vendors get to decide what is normal and what's not, but that's a separate issue...) So if I have a chip that has one pipeline and it's clocked at 1.25Ghz, that means that it can forward 1.25 billion packets per second. Note that this does NOT mean that I can forward a packet in "a one-point-two-five-billionth of a second" -- but it does mean that every clock cycle I can start on a new packet and finish another one. The length of the pipeline impacts the latency of the chip, although this part of the latency is often a rounding error compared to the number of times I have to read and write the packet into different memories as it goes through the system.
So if this pipeline can do 1.25 billion PPS and I want to be able to forward 10BPPS, I can build a chip that has 8 of these pipelines and get my performance target that way. I could also build a "pipeline" that processes multiple packets per clock, if I have one that does 2 packets/clock then I only need 4 of said pipelines... and so on and so forth. The exact details of how the pipelines are constructed and how much parallelism I built INSIDE a pipeline as opposed to replicating pipelines is sort of Gooky Implementation Details, but it's a very very important part of doing the chip level architecture as those sorts of decisions drive lots of Other Important Decisions in the silicon design...
--lj
On Tue, 26 Jul 2022 at 21:39, Lawrence Wobker <ljwobker@gmail.com> wrote:
So if this pipeline can do 1.25 billion PPS and I want to be able to forward 10BPPS, I can build a chip that has 8 of these pipelines and get my performance target that way. I could also build a "pipeline" that processes multiple packets per clock, if I have one that does 2 packets/clock then I only need 4 of said pipelines... and so on and so forth.
Thanks for the response Lawrence. The Broadcom BCM16K KBP has a clock speed of 1.2Ghz, so I expect the J2 to have something similar (as someone already mentioned, most chips I've seen are in the 1-1.5Ghz range), so in this case "only" 2 pipelines would be needed to maintain the headline 2Bpps rate of the J2, or even just 1 if they have managed to squeeze out two packets per cycle through parallelisation within the pipeline. Cheers, James.
Apologies for the garbage/HTMLed email, not sure what happened (thanks Brian F for letting me know). Anyway, the podcast with Juniper (mostly around Trio/Express) was broadcast today and is available at https://www.youtube.com/watch?v=1he8GjDBq9g

Next in the pipeline are Cisco Silicon One and Broadcom DNX (Jericho/Qumran/Ramon). For both, the guests are the main architects of the silicon.

Enjoy

On Wed, Aug 3, 2022 at 5:06 PM Jeff Tantsura <jefftant.ietf@gmail.com> wrote:
Hey,
This is not an advertisement but an attempt to help folks to better understand networking HW.
Some of you might know (and love 😊) “between 0x2 nerds” podcast Jeff Doyle and I have been hosting for a couple of years.
Following up on the discussion, we have decided to dedicate a number of upcoming podcasts to networking HW, a topic where more information and better education is very much needed (no, you won't have to sign an NDA before joining 😊). We have lined up a number of great guests, people who design and build ASICs and can talk firsthand about the evolution of networking HW, the complexity of the process, the differences between fixed and programmable pipelines, memories and databases. This Thursday (08/04) at 11:00 PST we are joined by the one and only Sharada Yeluri - Sr. Director, ASIC at Juniper. Other vendors will be joining in later episodes; the usual rules apply – no marketing, no BS.
More to come, stay tuned.
Live feed: https://lnkd.in/gk2x2ezZ
Between 0x2 nerds playlist, videos will be published to: https://www.youtube.com/playlist?list=PLMYH1xDLIabuZCr1Yeoo39enogPA2yJB7
Cheers,
Jeff
Thank you for this. I wish there would have been a deeper dive into the lookup side. My open questions:

a) The Trio model of a packet staying in a single PPE until done vs. the FP model of a line of PPEs (identical cores). I don't understand the advantages of the FP model; the Trio model advantages are clear to me. Obviously the FP model has to have some advantages, what are they?

b) What exactly are the gains of putting two Trios on-package in Trio6? There is no local switching between the WANs of the Trios in-package; they are, as far as I can tell, ships in the night, and packets between the Trios go via fabric, as they would with separate Trios. I can understand the benefit of putting a Trio and HBM2 on the same package, to reduce distance so wattage goes down or frequency goes up.

c) What evolution are they thinking of for the shallow ingress buffers in Trio6? The collateral damage potential is significant, because the WAN which asks most gets most, instead of each having their fair share; thus a potentially arbitrarily low-rate WAN ingress might not get access to the ingress buffer, causing drops. Would it be practical in terms of wattage/area to add some sort of pre-QoS towards the shallow ingress buffer, so each WAN ingress has a fair guaranteed rate into the shallow buffers?
-- ++ytti
Disclaimer: I work for Cisco on a bunch of silicon. I'm not intimately familiar with any of these devices, but I'm familiar with the high-level tradeoffs. There are also exceptions to almost EVERYTHING I'm about to say, especially once you get into the second- and third-order implementation details. Your mileage will vary... ;-)

If you have a model where one core/block does ALL of the processing, you generally benefit from lower latency, simpler programming, etc. A major downside is that to do this, all of these cores have to have access to all of the different memories used to forward said packet. Conversely, if you break up the processing into stages, you can only connect the FIB lookup memory to the cores that are going to be doing the FIB lookup, and only connect the encap memories to the cores/blocks that are doing the encapsulation work. Those interconnects take up silicon space, which equates to higher cost and power.

Packaging two cores on a single device is beneficial in that you only have one physical chip to work with instead of two. This often simplifies the board designers' job, and is often lower power than two separate chips. This starts to break down as you get to exceptionally large chips and you bump into the various physical/reticle limitations of how large a chip you can actually build. With newer packaging technology (2.5D chips, HBM and similar memories, chiplets down the road, etc.) this becomes even more complicated, but the answer to "why would you put two XYZs on a package?" is that it's just cheaper and lower power from a system standpoint (and often also from a pure silicon standpoint...)

Buffer designs are *really* hard in modern high-speed chips, and there are always lots and lots of tradeoffs. The "ideal" answer is an extremely large block of memory that ALL of the forwarding/queueing elements have fair/equal access to... but this physically looks more or less like a full mesh between the memory/buffering subsystem and all the forwarding engines, which becomes really unwieldy (expensive!) from a design standpoint. The amount of memory you can practically put on the main NPU die is on the order of 20-200 **mega** bytes, where a single stack of HBM memory comes in at 4GB -- it's literally 100x the size. Figuring out which side of this gigantic gulf you want to live on is a super important part of the basic architecture and also drives lots of other decisions down the line... Once you've decided how much buffering memory you're willing/able to put down, the next challenge is coming up with ways to provide access to that memory from all the different potential clients. It's a LOT easier to wire up/design a chip where you have four separate pipelines/cores/whatever and each one of them accesses 1/4 of the buffer memory... but that also means that any given port only has access to 1/4 of the memory for burst absorption. Lots and lots of Smart People Time has gone into different memory designs that attempt to optimize this problem, and it's a major part of the intellectual property of various chip designs.

--lj
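To put the 100x on-die-vs-HBM gulf described above into time terms, here is roughly how long each buffer size can absorb traffic at full line rate; the 4.8 Tbps figure is borrowed from the J2 discussion earlier in the thread purely for illustration:

    def buffer_time_ms(buffer_bytes: float, line_rate_bps: float) -> float:
        """How long a buffer can absorb traffic arriving at full line rate."""
        return buffer_bytes * 8 / line_rate_bps * 1e3

    LINE_RATE_BPS = 4.8e12                        # e.g. a 4.8 Tbps forwarding chip
    print(buffer_time_ms(20e6,  LINE_RATE_BPS))   # 20 MB on-die   -> ~0.03 ms
    print(buffer_time_ms(200e6, LINE_RATE_BPS))   # 200 MB on-die  -> ~0.33 ms
    print(buffer_time_ms(4e9,   LINE_RATE_BPS))   # 4 GB of HBM    -> ~6.7 ms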
FP model of line-of-PPE (identical cores). I don't understand the advantages of the FP model, the Trio model advantages are clear to me. Obviously the FP model has to have some advantages, what are they? b) What exactly are the gains of putting two trios on-package in Trio6, there is no local-switching between WANs of trios in-package, they are, as far as I can tell, ships in the night, packets between trios go via fabric, as they would with separate Trios. I can understand the benefit of putting trio and HBM2 on the same package, to reduce distance so wattage goes down or frequency goes up. c) What evolution they are thinking for the shallow ingress buffers for Trio6. The collateral damage potential is significant, because WAN which asks most, gets most, instead each having their fair share, thus potentially arbitrarily low rate WAN ingress might not get access to ingress buffer causing drop. Would it be practical in terms of wattage/area to add some sort of preQoS towards the shallow ingress buffer, so each WAN ingress has a fair guaranteed-rate to shallow buffers? On Fri, 5 Aug 2022 at 02:18, Jeff Tantsura <jefftant.ietf@gmail.com> wrote:
Apologies for the garbage/HTMLed email, not sure what happened (thanks Brian F for letting me know). Anyway, the podcast with Juniper (mostly around Trio/Express) was broadcast today and is available at https://www.youtube.com/watch?v=1he8GjDBq9g

Next in the pipeline are: Cisco SiliconOne and Broadcom DNX (Jericho/Qumran/Ramon). For both, the guests are the main architects of the silicon.
Enjoy
On Wed, Aug 3, 2022 at 5:06 PM Jeff Tantsura <jefftant.ietf@gmail.com> wrote:
Hey,
This is not an advertisement but an attempt to help folks to better understand networking HW.
Some of you might know (and love 😊) the "between 0x2 nerds" podcast Jeff Doyle and I have been hosting for a couple of years.
Following up on the discussion, we have decided to dedicate a number of upcoming podcasts to networking HW, a topic where more information and better education is very much needed (no, you won't have to sign an NDA before joining 😊). We have lined up a number of great guests: people who design and build ASICs and can talk firsthand about the evolution of networking HW, the complexity of the process, the differences between fixed and programmable pipelines, memories and databases. This Thursday (08/04) at 11:00 PST we are joined by the one and only Sharada Yeluri, Sr. Director ASIC at Juniper. Other vendors will be joining in later episodes; the usual rules apply – no marketing, no BS.
More to come, stay tuned.
Live feed: https://lnkd.in/gk2x2ezZ
Between 0x2 nerds playlist, videos will be published to: https://www.youtube.com/playlist?list=PLMYH1xDLIabuZCr1Yeoo39enogPA2yJB7
Cheers,
Jeff
From: James Bensley Sent: Wednesday, July 27, 2022 12:53 PM To: Lawrence Wobker; NANOG Subject: Re: 400G forwarding - how does it work?
On Tue, 26 Jul 2022 at 21:39, Lawrence Wobker <ljwobker@gmail.com> wrote:
So if this pipeline can do 1.25 billion PPS and I want to be able to forward 10BPPS, I can build a chip that has 8 of these pipelines and get my performance target that way. I could also build a "pipeline" that processes multiple packets per clock, if I have one that does 2 packets/clock then I only need 4 of said pipelines... and so on and so forth.
Thanks for the response Lawrence.
The Broadcom BCM16K KBP has a clock speed of 1.2 GHz, so I expect the J2 to have something similar (as someone already mentioned, most chips I've seen are in the 1-1.5 GHz range), so in this case "only" 2 pipelines would be needed to maintain the headline 2Bpps rate of the J2, or even just 1 if they have managed to squeeze out two packets per cycle through parallelisation within the pipeline.
Cheers,
James.
-- ++ytti
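To make the pipeline arithmetic above easy to check, here is a back-of-the-envelope sketch in plain Python. The 1.2 GHz clock and the 2 Bpps target are the figures quoted in this thread; the packets-per-clock values are purely illustrative assumptions, not anything taken from a data sheet:

import math

# Figures quoted in this thread (treat as illustrative, not authoritative):
clock_hz = 1.2e9      # BCM16K-class clock, ~1.2 GHz
target_pps = 2.0e9    # headline J2 forwarding rate, 2 Bpps

def pipelines_needed(packets_per_clock):
    # Each pipeline retires `packets_per_clock` packets every cycle.
    per_pipeline_pps = clock_hz * packets_per_clock
    return math.ceil(target_pps / per_pipeline_pps)

for ppc in (0.5, 1.0, 2.0):
    print(ppc, "packets/clock ->", pipelines_needed(ppc), "pipeline(s)")
# 0.5 packets/clock -> 4 pipeline(s)
# 1.0 packets/clock -> 2 pipeline(s)
# 2.0 packets/clock -> 1 pipeline(s)

The takeaway is the same as in the email above: the headline Bpps number says very little about how the work is split between pipelines, packets-per-clock, or cores.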
On Fri, 5 Aug 2022 at 20:31, <ljwobker@gmail.com> wrote: Hey LJ,
Disclaimer: I work for Cisco on a bunch of silicon. I'm not intimately familiar with any of these devices, but I'm familiar with the high level tradeoffs. There are also exceptions to almost EVERYTHING I'm about to say, especially once you get into the second- and third-order implementation details. Your mileage will vary... ;-)
I expect it may come to this; my question may be too specific to be answered without violating some NDA.
If you have a model where one core/block does ALL of the processing, you generally benefit from lower latency, simpler programming, etc. A major downside is that to do this, all of these cores have to have access to all of the different memories used to forward said packet. Conversely, if you break up the processing into stages, you can only connect the FIB lookup memory to the cores that are going to be doing the FIB lookup, and only connect the encap memories to the cores/blocks that are doing the encapsulation work. Those interconnects take up silicon space, which equates to higher cost and power.
While this is an interesting answer (that is, the statement that the cost of giving every core access to every memory, versus having a harder-to-program pipeline of cores, is a balanced tradeoff), I don't think it applies to my specific question, though it may apply to the generic one. We can roughly think of FP as having a similar number of lines as Trio has PPEs; therefore a similar number of cores need access to memory, and possibly more, as more than one core in a line will need memory access. So the question is more: why many less performant cores, where performance is achieved by pipelining, compared to fewer more performant cores, where each core works on a packet to completion, when the former has about as many core lines as the latter has cores?
Packaging two cores on a single device is beneficial in that you only have one physical chip to work with instead of two. This often simplifies the board designers' job, and is often lower power than two separate chips. This starts to break down as you get to exceptionally large chips as you bump into the various physical/reticle limitations of how large a chip you can actually build. With newer packaging technology (2.5D chips, HBM and similar memories, chiplets down the road, etc) this becomes even more complicated, but the answer to "why would you put two XYZs on a package?" is that it's just cheaper and lower power from a system standpoint (and often also from a pure silicon standpoint...)
Thank you for this. It does confirm that the benefits aren't perhaps as revolutionary as the presentation in the thread proposed: the presentation divided Trio evolution into 3 phases, and multiple Trios on a package was presented as one of those big evolutions; perhaps some other division of generations would have been more communicative.
Lots and lots of Smart People Time has gone into different memory designs that attempt to optimize this problem, and it's a major part of the intellectual property of various chip designs.
I choose to read this as 'where a lot of innovation happens, a lot of mistakes happen'. Hopefully we'll figure out a good answer here soon, as the answers vendors are ending up with are becoming increasingly visible compromises in the field. I suspect a large part of this is that cloudy shops represent, if not disproportionate revenue, disproportionate focus and their networks tend to be a lot more static in config and traffic than access/SP networks. And when you have that quality, you can make increasingly broad assumptions, assumptions which don't play as well in SP networks. -- ++ytti
I don't think I can add much here to the FP and Trio specific questions, for obvious reasons... but ultimately it comes down to a set of tradeoffs where some of the big concerns are things like "how do I get the forwarding state I need back and forth to the things doing the processing work" -- that's an insane level of oversimplification, as a huge amount of engineering time goes into those choices.

I think the "revolutionary-ness" (to vocabulate a useful word?) of putting multiple cores or whatever onto a single package is somewhat in the eye of the beholder. The vast majority of customers would never know nor care whether a chip on the inside was implemented as two parallel "cores" or whether it was just one bigger "core" that does twice the amount of work in the same time. But to the silicon designer, and to a somewhat lesser extent the people writing the forwarding and associated chip-management code, it's definitely a big big deal. Also, having the ability to put two cores down on a given chip opens the door to eventually doing MORE than two cores, and if you really stretch your brain you get to where you might be able to put down "N" pipelines.

This is the story of integration: back in the day we built systems where everything was forwarded on a single CPU. From a performance standpoint all we cared about was the clock rate and how much work was required to forward a packet. Divide the second number by the first, and you get your answer. In the late 90's we built systems (the 7500 for me) that were distributed, so now we had a bunch of CPUs on linecards running that code. Horizontal scaling -- sort of. In the early 2000's the GSR came along and now we're doing forwarding in hardware, which is an order of magnitude or two faster, but a whole bunch of features are now too complex to do in hardware, so they go over the side and people have to adapt. To the best of my knowledge, TCP intercept has never come back...

For a while, GSR and CRS type systems had linecards where each card had a bunch of chips that together built the forwarding pipeline. You had chips for the L1/L2 interfaces, chips for the packet lookups, chips for the QoS/queueing math, and chips for the fabric interfaces. Over time, we integrated more and more of these things together until you (more or less) had a linecard where everything was done on one or two chips, instead of a half dozen or more. Once we got here, the next step was to build linecards where you actually had multiple independent things doing forwarding -- on the ASR9k we called these "slices". This again multiplies the performance you can get, but now both the software and the operators have to deal with the complexity of having multiple things running code where you used to only have one.

Now let's jump into the 2010's, where silicon integration allows you to put down multiple cores or pipelines on a single chip, each of which is now (more or less) its own forwarding entity. So now you've got yet ANOTHER layer of abstraction. If I can attempt to draw out the tree, it looks like this now:

1) you have a chassis or a system, which has a bunch of linecards.
2) each of those linecards has a bunch of NPUs/ASICs
3) each of those NPUs has a bunch of cores/pipelines

And all of this stuff has to be managed and tracked by the software. If I've got a system with 16 linecards, each of those has 4 NPUs, and each of THOSE has 4 cores, I've got over *two hundred and fifty* separate things forwarding packets at the same time.
Now a lot of the info they're using is common (the FIB is probably the same for all these entities...) but some of it is NOT. There's no value in wasting memory for the encapsulation data to host XXX if I know that none of the ports on my given NPU/core are going to talk to that host, right? So figuring out how to manage the *state locality* becomes super important. And yes, this code breaks like all code, but no one has figured out any better way to scale up the performance. If you have a brilliant idea here that will get me the performance of 250+ things running in parallel but the simplicity of it looking and acting like a single thing to the rest of the world, please find an angel investor and we'll get phenomenally rich together.

--lj
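As a trivial way to see how quickly those layers multiply, here is the counting exercise from the paragraph above in a couple of lines of Python (the 16/4/4 values are just the hypothetical example used above, not any particular product):

# Hypothetical example from the text above: 16 linecards,
# 4 NPUs per linecard, 4 cores/pipelines per NPU.
linecards, npus_per_lc, cores_per_npu = 16, 4, 4
forwarding_entities = linecards * npus_per_lc * cores_per_npu
print(forwarding_entities)   # 256 independent forwarding engines to program and track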
On Sat, 6 Aug 2022 at 17:08, <ljwobker@gmail.com> wrote:
For a while, GSR and CRS type systems had linecards where each card had a bunch of chips that together built the forwarding pipeline. You had chips for the L1/L2 interfaces, chips for the packet lookups, chips for the QoS/queueing math, and chips for the fabric interfaces. Over time, we integrated more and more of these things together until you (more or less) had a linecard where everything was done on one or two chips, instead of a half dozen or more. Once we got here, the next step was to build linecards where you actually had multiple independent things doing forwarding -- on the ASR9k we called these "slices". This again multiplies the performance you can get, but now both the software and the operators have to deal with the complexity of having multiple things running code where you used to only have one. Now let's jump into the 2010's where the silicon integration allows you to put down multiple cores or pipelines on a single chip, each of these is now (more or less) it's own forwarding entity. So now you've got yet ANOTHER layer of abstraction. If I can attempt to draw out the tree, it looks like this now:
1) you have a chassis or a system, which has a bunch of linecards. 2) each of those linecards has a bunch of NPUs/ASICs 3) each of those NPUs has a bunch of cores/pipelines
Thank you for this. I think we may have some ambiguity here. I'll ignore multichassis designs for now, as those went out of fashion, and describe only 'NPU', not Express/BRCM-style pipelines.

1) you have a chassis with multiple linecards
2) each linecard has 1 or more forwarding packages
3) each package has 1 or more NPUs (Juniper calls these slices; unsure if the EZchip vocabulary is the same here)
4) each NPU has 1 or more identical cores (well, I can't really name any with 1 core; I reckon an NPU, like a GPU, pretty inherently has many many cores, and unlike some in this thread, I don't think they are ever an ARM instruction set; that makes no sense, you create an instruction set targeting the application at hand, which the ARM instruction set is not. Maybe some day we'll have some forwarding-IA, allowing customers to provide ucode that runs on multiple targets, but this would reduce the pace of innovation.)

Some of those NPU core architectures are flat, like Trio, where a single core handles the entire packet. Other core architectures, like FP, are matrices, where you have multiple lines and a packet picks one of the lines and traverses each core in that line. (FP has many more cores per line, compared to the Leaba/Pacific stuff.)

-- ++ytti
You're getting to the core of the question (sorry, I could not resist...) -- and again the complexity is as much in the terminology as anything else. In EZChip, at least as we used it on the ASR9k, the chip had a bunch of processing cores, and each core performed some of the work on each packet. I honestly don't know if the cores themselves were different or if they were all the same physical design, but they were definitely attached to different memories, and they definitely ran different microcode. These cores were allocated to separate stages, and had names along the lines of {parse, search, encap, transmit} etc. I'm sure these aren't 100% correct but you get the point. Importantly, there were NOT the same number of cores for each stage, so when a packet went from stage A to stage B there was some kind of mux in between. If you knew precisely that each stage had the same number of cores, you could choose to arrange it such that the packet always followed a "straight line" through the processing pipe, which would make some parts of the implementation cheaper/easier.

You're correct that the instruction set for stuff like this is definitely not ARM (nor x86 nor anything else standard) because the problem space you're optimizing for is a lot smaller than what you'd have on a more general purpose CPU. The (enormous) challenge for running the same ucode on multiple targets is that networking has exceptionally high performance requirements -- billions of packets per second is where this whole thread started! Fortunately, we also have a much smaller problem space to solve than general purpose compute, although in a lot of places that's because we vendors have told operators "Look, if you want something that can forward a couple hundred terabits in a single system, you're going to have to constrain what features you need, because otherwise the current hardware just can't do it". To get that kind of performance without breaking the bank requires -- or at least has required up until this point in time -- some very tight integration between the hardware forwarding design and the microcode.

I was at Barefoot when P4 was released, and Tofino was the closest thing I've seen to a "general purpose network ucode machine" -- and even that was still very much optimized in terms of how the hardware was designed and built, and it VERY much required the P4 programmer to have a deep understanding of what hardware resources were available. When you write a P4 program and compile it for an x86 machine, you can basically create as many tables and lookup stages as you want -- you just have to eat more CPU and memory accesses for more complex programs and they run slower. But on a chip like Tofino (or any other NPU-like target) you're going to have finite limits on how many processing stages and memory tables exist... so it's more the case that when your program gets bigger it no longer "just runs slower" but rather it "doesn't run at all". The industry would greatly benefit from some magical abstraction layer that would let people write forwarding code that's both target-independent AND high-performance, but at least so far the performance penalty for making such code target-independent has been waaaaay more than the market is willing to bear.

--lj
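A toy model of the staged arrangement described a couple of messages above: each stage has a different number of cores, and a simple round-robin "mux" hands packets to the next stage, so a packet's path is generally not a straight line through the pipe. The stage names are borrowed from the rough {parse, search, encap, transmit} description in that message; the core counts and everything else are invented purely for illustration and do not describe any real chip:

import itertools

# Stage names from the rough description above; core counts are invented.
STAGES = [("parse", 4), ("search", 8), ("encap", 4), ("transmit", 2)]

class Stage:
    def __init__(self, name, cores):
        self.name = name
        # The "mux" between stages: hand packets to cores round-robin.
        self.next_core = itertools.cycle(range(cores))

    def assign(self, pkt):
        pkt["path"].append(f"{self.name}[{next(self.next_core)}]")
        return pkt

stages = [Stage(name, cores) for name, cores in STAGES]

for i in range(6):
    pkt = {"id": i, "path": []}
    for stage in stages:          # every packet visits every stage, in order
        stage.assign(pkt)
    print(pkt["id"], " -> ".join(pkt["path"]))
# Packet 4 comes out as parse[0] -> search[4] -> encap[0] -> transmit[0]:
# because the stages have different core counts, packets do not follow a
# "straight line" through the pipe, which is exactly why a mux is needed.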
ljwobker@gmail.com wrote:
Buffer designs are *really* hard in modern high speed chips, and there are always lots and lots of tradeoffs. The "ideal" answer is an extremely large block of memory that ALL of the forwarding/queueing elements have fair/equal access to... but this physically looks more or less like a full mesh between the memory/buffering subsystem and all the forwarding engines, which becomes really unwieldly (expensive!) from a design standpoint. The amount of memory you can practically put on the main NPU die is on the order of 20-200 **mega** bytes, where a single stack of HBM memory comes in at 4GB -- it's literally 100x the size.
I'm afraid you imply too much buffer bloat only to cause unnecessary and unpleasant delay. With 99% load M/M/1, 500 packets (750kB for 1500B MTU) of buffer is enough to make packet drop probability less than 1%. With 98% load, the probability is 0.0041%. But, there are so many router engineers who think, with bloated buffer, packet drop probability can be zero, which is wrong. For example, https://www.broadcom.com/products/ethernet-connectivity/switching/stratadnx/... Jericho2 delivers a complete set of advanced features for the most demanding carrier, campus and cloud environments. The device supports low power, high bandwidth HBM packet memory offering up to 160X more traffic buffering compared with on-chip memory, enabling zero-packet-loss in heavily congested networks. Masataka Ohta
On Sun, 7 Aug 2022 at 12:16, Masataka Ohta <mohta@necom830.hpcl.titech.ac.jp> wrote:
I'm afraid you imply too much buffer bloat only to cause unnecessary and unpleasant delay.
With 99% load M/M/1, 500 packets (750kB for 1500B MTU) of buffer is enough to make packet drop probability less than 1%. With 98% load, the probability is 0.0041%.
I feel like I'll live to regret asking: which congestion control algorithm are you thinking of?

If we estimate BW and pace TCP window growth at the estimated BW, we don't need much buffering at all. But Cubic and Reno will burst TCP window growth at the sender rate, which may be much more than the receiver rate; someone has to store that growth and pace it out at the receiver rate, otherwise the window won't grow and the receiver rate won't be achieved. So in an ideal scenario, no, we don't need a lot of buffer; in practical situations today, yes, we need quite a bit of buffer.

Now add to this multiple logical interfaces, each having 4-8 queues, and it adds up. "Big buffers are bad, mmmm'kay" is frankly simplistic and inaccurate.

Also, the shallow ingress buffers discussed in the thread are not delay buffers, and the problem is complex because no marketable device can accept wire rate at minimum packet size. So what trade-offs do we carry when we get bad traffic at wire rate at small packet size? We can't empty the ingress buffers fast enough; do we have physical memory for each port, do we share, and how do we share?

-- ++ytti
Saku Ytti wrote:
I'm afraid you imply too much buffer bloat only to cause unnecessary and unpleasant delay.
With 99% load M/M/1, 500 packets (750kB for 1500B MTU) of buffer is enough to make packet drop probability less than 1%. With 98% load, the probability is 0.0041%.
I feel like I'll live to regret asking. Which congestion control algorithm are you thinking of?
I'm not assuming LAN environment, for which paced TCP may be desirable (if bandwidth requirement is tight, which is unlikely in LAN).
But Cubic and Reno will burst tcp window growth at sender rate, which may be much more than receiver rate, someone has to store that growth and pace it out at receiver rate, otherwise window won't grow, and receiver rate won't be achieved.
When many TCPs are running, bursts are averaged out and the traffic is Poisson.
So in an ideal scenario, no we don't need a lot of buffer, in practical situations today, yes we need quite a bit of buffer.
That is an old theory known to be invalid (Ethernet switches with small buffers are enough for IXes) and theoretically denied by:
Now add to this multiple logical interfaces, each having 4-8 queues, it adds up.
Having so many queues requires sorting of queues to properly prioritize them, which costs a lot of computation (and performance loss) for no benefit and is a bad idea.
Also the shallow ingress buffers discussed in the thread are not delay buffers and the problem is complex because no device is marketable that can accept wire rate of minimum packet size, so what trade-offs do we carry, when we get bad traffic at wire rate at small packet size? We can't empty the ingress buffers fast enough, do we have physical memory for each port, do we share, how do we share?
People who use irrationally small packets will suffer, which is not a problem for the rest of us. Masataka Ohta
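For readers who don't want to chase the reference: the headline result of the "Sizing router buffers" paper cited above is that a link carrying N desynchronized long-lived TCP flows needs roughly C*RTT/sqrt(N) of buffer rather than the classic C*RTT bandwidth-delay product. A small sketch of that rule of thumb in Python; the 100G link, 100 ms RTT and flow counts are purely illustrative numbers, not anything claimed in this thread:

from math import sqrt

def buffer_bytes(link_bps, rtt_s, n_flows):
    # Stanford "Sizing router buffers" rule of thumb: C * RTT / sqrt(N).
    return link_bps * rtt_s / (8 * sqrt(n_flows))

# Illustrative numbers only: a 100 Gb/s link with a 100 ms RTT.
for n in (1, 100, 10000):
    mb = buffer_bytes(100e9, 0.100, n) / 1e6
    print(f"{n} flows -> {mb:.1f} MB")
# 1 flows -> 1250.0 MB   (the classic C*RTT bandwidth-delay product)
# 100 flows -> 125.0 MB
# 10000 flows -> 12.5 MB

Whether the flow count on a given link is large and desynchronized enough for this to hold is, of course, exactly what the rest of this thread is arguing about.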
There are MANY real world use cases which require high throughput at 64 byte packet size. Denying those use cases because they don't fit your world view is short sighted. The world of networking is not all I-Mix.
On Sun, 7 Aug 2022 at 17:58, <sronan@ronan-online.com> wrote:
There are MANY real world use cases which require high throughput at 64 byte packet size. Denying those use cases because they don’t fit your world view is short sighted. The word of networking is not all I-Mix.
Yes but it's not an addressable market. Such a market will just buy silly putty for 2bucks and modify the existing face-plate to do 64B. No one will ship that box for you, because the addressable market gladly will take more WAN ports as trade-off for large minimum mean packet size. -- ++ytti
You are incredibly incorrect, in fact the market for those devices is in the Billions of Dollars. But you continue to pretend that it doesn’t exist. Shane
sronan@ronan-online.com wrote:
There are MANY real world use cases which require high throughput at 64 byte packet size.
Certainly, there were imaginary world use cases which required guaranteeing throughput as high as 64 kbps with a 48B payload size, for which a 20(40)B IP header was obviously painful and a 5B header was used. At that time, poor fair queuing was assumed, which requires small packet sizes for short delay. But as fair queuing does not scale at all, they disappeared long ago.
Denying those use cases because they don’t fit your world view is short sighted.
That could have been a valid argument 20 years ago. Masataka Ohta
On Sun, Aug 7, 2022 at 11:24 PM Masataka Ohta <mohta@necom830.hpcl.titech.ac.jp> wrote:
sronan@ronan-online.com wrote:
There are MANY real world use cases which require high throughput at 64 byte packet size.
Certainly, there were imaginary world use cases which require to guarantee so high throughput of 64kbps with 48B payload size for which 20(40)B IP header was obviously painful and 5B header was used. At that time, poor fair queuing was assumed, which requires small packet size for short delay.
But as fair queuing does not scale at all, they disappeared long ago.
What do you mean by FQ, exactly? "5-tuple FQ" is scaling today on shaping middleboxes like preseem and LibreQos to over 10 Gbit/s, and ISPs report that customer calls about speed simply vanish. Admittedly the AQM is dropping or marking some 0.x% of packets, but tests with FQ with short buffers vs AQM alone showed the former the clear winner, and fq+aqm took it in for the score. On Linux, fq_codel is the near-universal default, also. The Linux TCP stack does fq+pacing at nearly 100 Gbit/s today with "BIG" TCP.

"Disappeared"? No. Invisible, possibly. Transitioning from 10+ Gbit down to 1 Gbit or less, it is really, really useful. IMHO, and desperately needed, in way more places.

Lastly, VOQs and LAG and switch fabrics essentially FQ ports. In the context of aggregating up to 400 Gbit from that, you are FQing also. Now, FQ-ing inline against the IP headers at 400 Gbit appeared impossible until this convo, when the depth of the pipeline and hardware hashing was discussed, but I'll settle for more RFC 7567 behavior just in stepping down from that to 100, and from 100 stepping down, adding in fq+aqm.
Denying those use cases because they don’t fit your world view is short sighted.
That could have been a valid argument 20 years ago.
Masataka Ohta
-- FQ World Domination pending: https://blog.cerowrt.org/post/state_of_fq_codel/ Dave Täht CEO, TekLibre, LLC
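For anyone unfamiliar with what "5-tuple FQ" means in practice: the basic mechanism is to hash each packet's 5-tuple into one of many queues and serve the queues with a deficit-round-robin scheduler, so no single flow can monopolize the link. A heavily simplified Python sketch follows; this is not fq_codel itself, it has no AQM at all, and the queue count, quantum and packet fields are illustrative assumptions:

from collections import deque

NUM_QUEUES = 1024
QUANTUM = 1514                      # bytes of credit per queue per round

queues = [deque() for _ in range(NUM_QUEUES)]
deficit = [0] * NUM_QUEUES

def enqueue(pkt):
    # Hash the 5-tuple so all packets of one flow land in the same queue.
    flow = (pkt["src"], pkt["dst"], pkt["sport"], pkt["dport"], pkt["proto"])
    queues[hash(flow) % NUM_QUEUES].append(pkt)

def dequeue_round():
    # One deficit-round-robin pass over all backlogged queues.
    sent = []
    for i, q in enumerate(queues):
        if not q:
            continue
        deficit[i] += QUANTUM
        while q and q[0]["size"] <= deficit[i]:
            pkt = q.popleft()
            deficit[i] -= pkt["size"]
            sent.append(pkt)
        if not q:
            deficit[i] = 0          # idle queues don't accumulate credit
    return sent

# Example: one flow sending 1500-byte packets, one sending 100-byte packets;
# the small flow is not starved behind the big one.
for _ in range(4):
    enqueue({"src": "10.0.0.1", "dst": "10.0.0.2", "sport": 1111, "dport": 80,  "proto": 6, "size": 1500})
    enqueue({"src": "10.0.0.3", "dst": "10.0.0.2", "sport": 2222, "dport": 443, "proto": 6, "size": 100})
print([p["size"] for p in dequeue_round()])   # e.g. [1500, 100, 100, 100, 100]

Doing the equivalent of the hash-and-schedule step in hardware at hundreds of Gbit/s is, as noted above, the hard part.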
Dave Taht wrote:
But as fair queuing does not scale at all, they disappeared long ago.
What do you mean by FQ, exactly?
Fair queuing is "fair queuing" not some queuing idea which is, by someone, considered "fair". See, for example, https://en.wikipedia.org/wiki/Fair_queuing
"5 tuple FQ" is scaling today
Fair queuing does not scale w.r.t. the number of queues. Masataka Ohta
On Sun, 7 Aug 2022 at 14:16, Masataka Ohta <mohta@necom830.hpcl.titech.ac.jp> wrote:
When many TCPs are running, burst is averaged and traffic is poisson.
If you grow a window, and the sender sends the delta at 100G, and receiver is 10G, eventually you'll hit that 10G port at 100G rate. It's largely an edge problem, not a core problem.
People who use irrationally small packets will suffer, which is not a problem for the rest of us.
Quite. Unfortunately, the problem I have exists in the Internet; the problem you're solving exists in Ohtanet. Ohtanet is much more civilized and allows for elegant solutions. The Internet just has different shades of bad solutions to pick from.

-- ++ytti
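A quick way to see why that burst has to be buffered somewhere: if a window-growth burst of B bytes arrives at the sender's 100G rate but drains at the receiver's 10G rate, roughly B*(1 - 10/100) = 0.9*B of it sits in a queue at the speed step-down while the burst lasts. A tiny sketch of that arithmetic; the 1 MB burst size is just an example, not a measured value:

def peak_queue_bytes(burst_bytes, in_bps, out_bps):
    # While a back-to-back burst arrives at in_bps and drains at out_bps,
    # the queue at the step-down point peaks at burst * (1 - out/in).
    return burst_bytes * (1 - out_bps / in_bps)

# Example only: a 1 MB window-growth burst hitting a 100G -> 10G step-down.
print(peak_queue_bytes(1_000_000, 100e9, 10e9))   # 900000.0 bytes to buffer (or drop)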
Saku Ytti wrote:
When many TCPs are running, burst is averaged and traffic is poisson.
If you grow a window, and the sender sends the delta at 100G, and receiver is 10G, eventually you'll hit that 10G port at 100G rate.
Wrong. If it's local communication where RTT is small, the window is not so large (smaller than an unbloated router buffer). If RTT is large, your 100G runs over several 100/400G backbone links with much other traffic, which makes the burst much slower than 10G.

Masataka Ohta
On Mon, 8 Aug 2022 at 14:02, Masataka Ohta <mohta@necom830.hpcl.titech.ac.jp> wrote:
which is, unlike Yttinet, the reality.
Yttinet has pesky customers who care about single TCP performance over long fat links, and observe poor performance with shallow buffers at the provider end. Yttinet is cost sensitive and does not want to do work, unless sufficiently motivated by paying customers. -- ++ytti
Saku Ytti wrote:
which is, unlike Yttinet, the reality.
Yttinet has pesky customers who care about single TCP performance over long fat links, and observe poor performance with shallow buffers at the provider end.
With such an imaginary assumption, according to the end to end principle, the customers (the ends) should use paced TCP instead of paying unnecessarily bloated amount of money to intelligent intermediate entities of ISPs using expensive routers with bloated buffers.
Yttinet is cost sensitive and does not want to do work, unless sufficiently motivated by paying customers.
I understand that if customers follow the end to end principle, revenue of "intelligent" ISPs will be reduced. Masataka Ohta
On Mon, 8 Aug 2022 at 14:37, Masataka Ohta <mohta@necom830.hpcl.titech.ac.jp> wrote:
With such an imaginary assumption, according to the end to end principle, the customers (the ends) should use paced TCP instead
I fully agree, unfortunately I do not control the whole problem domain, and the solutions available with partial control over the domain are less than elegant. -- ++ytti
Saku Ytti wrote:
With such an imaginary assumption, according to the end to end principle, the customers (the ends) should use paced TCP instead
I fully agree, unfortunately I do not control the whole problem domain, and the solutions available with partial control over the domain are less than elegant.
OK. But, you should be aware that, with bloated buffer, all the customers sharing the buffer will suffer from delay. Masataka Ohta
You keep using the term “imaginary” when presented with evidence that does not match your view of things. There are many REAL scenarios where single-flow high-throughput TCP is a real requirement, as well as high throughput at extremely small packet sizes. In the case of the latter, the market is extremely large, but it's not Internet traffic. Shane
Also, for data center traffic, especially real-time market data and other UDP multicast traffic, micro-bursting is one of the biggest issues, especially as you scale out your backbone. We have two 100G switches, and we have to distribute the traffic over a LACP link with 4 different 100G ports on different ASICs, even though the traffic is < 1%, just to account for micro-bursts.
Matthew Huff wrote:
Also, for data center traffic, especially real-time market data and other UDP multicast traffic, micro-bursting is one of the biggest issues especially as you scale out your backbone.
Are you saying you rely on multicast even though the loss of a packet means the loss of a large amount of money? Is that the reason you use large buffers: to eliminate the possibility of packet drops caused by buffer overflow, though not those caused by other reasons?

Masataka Ohta
How do you propose to fairly distribute market data feeds to the market if not multicast? Shane
On Wed, 10 Aug 2022 at 06:48, <sronan@ronan-online.com> wrote:
How do you propose to fairly distribute market data feeds to the market if not multicast?
I expected your aggressive support for small packets was for fintech. An anecdote: one of the largest exchanges in the world used MX for multicast replication, which is btree (or today utree) replication; that is, each NPU gets the replicated packet at a wildly different time, and therefore so do the receivers. This wasn't a problem for them, because they didn't know that's how it works and suffered no negative consequence from it, which arguably should have been a show stopper if we need receivers to receive it at a remotely similar time.

Also, this is not in disagreement with my statement that it is not an addressable market, because this market can use products which do not do 64B wire-rate, for two separate reasons (either/and): a) the port is nowhere near congested; b) the market is not cost sensitive, so they buy the device with many WAN ports and don't provision it so heavily that they can't get 64B on each port actually used.

-- ++ytti
sronan@ronan-online.com wrote:
How do you propose to fairly distribute market data feeds to the market if not multicast?
Unicast with randomized order. To minimize latency, bloated buffer should be avoided and TCP with configured small (initial) RTT should be used. Masataka Ohta
On Mon, Aug 8, 2022 at 5:39 AM <sronan@ronan-online.com> wrote:
You keep using the term “imaginary” when presented with evidence that does not match your view of things.
There are many REAL scenarios where single flow high throughout TCP is a real requirements as well as high throughput extremely small packet size. In the case of the later, the market is extremely large, but it’s not Internet traffic.
I believe this all started with ASIC experts saying trade-offs need to be made to operate at crazy speeds in a single package. Ohta-san is simply saying your use case did not make the cut, which is clear. That said, ASIC makers have gotten things wrong (for me), and some things they can adjust in code, others not so much. The LPM / LEM lookup table distribution is certainly one that has burned me in IPv6 and MPLS label scale, but thankfully SiliconOne can make some adjustments… but watch out if your network is anything other than /48s. The only thing that will change that is $$$.
Shane
Disclaimer: I often use the M/M/1 queuing assumption for much of my work to keep the maths simple and believe that I am reasonably aware in which context it's a right or a wrong application :). Also, I don't intend to change the core topic of the thread, but since this has come up, I couldn't resist.
With 99% load M/M/1, 500 packets (750kB for 1500B MTU) of buffer is enough to make packet drop probability less than 1%. With 98% load, the probability is 0.0041%.
To expand the above a bit so that there is no ambiguity: the above assumes that the router behaves like an M/M/1 queue. The expected number of packets in the system is L = rho / (1 - rho), where rho is the utilization, and the probability that at least B packets are in the system is rho**B. For a link utilization of 98%, the drop probability is 0.98**500 = 4.1e-5, i.e. about 0.0041%. For a link utilization of 99%, 0.99**500 = 6.6e-3, i.e. about 0.66%.
When many TCPs are running, burst is averaged and traffic is poisson.
M/M/1 queuing assumes that traffic is Poisson, and the Poisson assumption is: 1) the number of sources is infinite, and 2) the traffic arrival pattern is random. I think the second assumption is where I often question whether the traffic arrival pattern is truly random; I have seen cases where traffic behaves more like a self-similar process. Most Poisson models rely on the central limit theorem, which loosely states that the sample distribution will approach a normal distribution as we aggregate more from various distributions; the mean will smooth towards a value. Do you have any good pointers to research showing that today's Internet traffic can be modeled accurately as Poisson? For as many papers supporting Poisson, I have seen as many papers saying it's not Poisson. https://www.icir.org/vern/papers/poisson.TON.pdf https://www.cs.wustl.edu/~jain/cse567-06/ftp/traffic_models2/#sec1.2
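To make the M/M/1 numbers above easy to reproduce, here is a small Python sketch. The formulas are the standard M/M/1 results quoted above, and B = 500 packets is the buffer size used in the example; whether the M/M/1 assumption itself holds is exactly the open question in this sub-thread:

def mm1_expected_packets(rho):
    # Expected number of packets in an M/M/1 system: rho / (1 - rho).
    return rho / (1 - rho)

def mm1_tail_prob(rho, b):
    # P(at least b packets in the system) = rho**b for M/M/1.
    return rho ** b

print(f"expected occupancy at 98% load: {mm1_expected_packets(0.98):.0f} packets")  # 49
for rho in (0.98, 0.99):
    p = mm1_tail_prob(rho, 500)
    print(f"load {rho:.0%}: P(>= 500 packets) ~= {p:.3g}")
# load 98%: P(>= 500 packets) ~= 4.1e-05   (about 0.0041%)
# load 99%: P(>= 500 packets) ~= 0.00657   (about 0.66%)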
If it's of any help... the bloat mailing list at lists.bufferbloat.net has the largest concentration of queue theorists and network operator + developers I know of. (also, bloat readers, this ongoing thread on nanog about 400Gbit is fascinating) There is 10+ years worth of debate in the archives: https://lists.bufferbloat.net/pipermail/bloat/2012-May/thread.html as one example. On Sun, Aug 7, 2022 at 10:14 AM dip <diptanshu.singh@gmail.com> wrote:
Disclaimer: I often use the M/M/1 queuing assumption for much of my work to keep the maths simple and believe that I am reasonably aware in which context it's a right or a wrong application :). Also, I don't intend to change the core topic of the thread, but since this has come up, I couldn't resist.
With 99% load M/M/1, 500 packets (750kB for 1500B MTU) of buffer is enough to make packet drop probability less than 1%. With 98% load, the probability is 0.0041%.
To expand the above a bit so that there is no ambiguity. The above assumes that the router behaves like an M/M/1 queue. The expected number of packets in the systems can be given by
L = rho / (1 - rho), where rho is the utilization. The probability that at least B packets are in the system is rho**B, where B is the number of packets in the system. For a link utilization of 98%, the drop probability is 0.98**500 = 4.1e-5, i.e. about 0.0041%. For a link utilization of 99%, 0.99**500 = 6.6e-3, i.e. about 0.66%.
Regrettably, TCP congestion controls by design do not stop window growth until you get that drop, i.e. at 100+% utilization.
When many TCPs are running, burst is averaged and traffic
is poisson.
M/M/1 queuing assumes that traffic is Poisson, and the Poisson assumption is 1) The number of sources is infinite 2) The traffic arrival pattern is random.
I think the second assumption is where I often question whether the traffic arrival pattern is truly random. I have seen cases where traffic behaves more like self-similar. Most Poisson models rely on the Central limit theorem, which loosely states that the sample distribution will approach a normal distribution as we aggregate more from various distributions. The mean will smooth towards a value.
Do you have any good pointers where the research has been done that today's internet traffic can be modeled accurately by Poisson? For as many papers supporting Poisson, I have seen as many papers saying it's not Poisson.
https://www.icir.org/vern/papers/poisson.TON.pdf https://www.cs.wustl.edu/~jain/cse567-06/ftp/traffic_models2/#sec1.2
I am firmly in the not-Poisson camp; however, by inserting (especially) FQ and AQM techniques on the bottleneck links it is very possible to smooth traffic into this more easily analyzed model, and gain enormous benefits from doing so.
-- FQ World Domination pending: https://blog.cerowrt.org/post/state_of_fq_codel/ Dave Täht CEO, TekLibre, LLC
dip wrote:
I have seen cases where traffic behaves more like self-similar.
That could happen if there are a small number of TCP streams, or if multiple TCPs are synchronized through interactions on bloated buffers, which is one reason why we should avoid bloated buffers.
Do you have any good pointers where the research has been done that today's internet traffic can be modeled accurately by Poisson? For as many papers supporting Poisson, I have seen as many papers saying it's not Poisson.
It is based on observations between 1989 and 1994, when the Internet backbone was slow and the number of users was small, which means the number of TCP streams running in parallel was small. For example, merely 124M packets over 36 days of observation [LBL-1] is slower than 500 kbps, which could be filled up by a single TCP connection even by computers of that time, and so is not a meaningful measurement.
https://www.cs.wustl.edu/~jain/cse567-06/ftp/traffic_models2/#sec1.2
It merely states that some use non Poisson traffic models. Masataka Ohta
Buffering is a near-religious topic across a large swath of the network industry, but here are some opinions of mine: a LOT of operators/providers need more buffering than you can realistically put directly onto the ASIC die. Fast chips without external buffers measure capacity in tens of microseconds, which is nowhere near enough for a lot of the market. We can (and do) argue about exactly where and what network roles can be met by this amount of buffering, but it's absolutely not a large enough part of the market to totally go away from "big" external buffers.

Once you "jump off the cliff" of needing something more than on-chip SRAM, you're in this weird area where nothing exists in the technology space that *really* solves the problem, because you really need access rate and bandwidth more than you need capacity. HBM is currently the best (or at least the most popular) combination of capacity, power, access rate, and bandwidth... but it's still nowhere near perfect. A common HBM2 implementation gives you 8GB of buffer space, about 2Tbps of raw bandwidth, and a few hundred million IOPS. (A lot of that gets gobbled up by various overheads....) These values are a function of two things:

1) memory physics - I don't know enough about how these things are Like Really Actually Built to talk about this part.
2) market forces... the market for this stuff is really GPUs, ML/AI applications, etc. The networking silicon market is a drop in the ocean compared to the rest of compute, so the specific needs of my router aren't going to ever drive enough volume to get big memory makers to do exactly what **I** want. I'm at the mercy of what they build for the gigantic players in the rest of the market.

If you told me that someone had a memory technology that was something like "one-fourth the capacity of HBM, but four times the bandwidth and four times the access rate" I would do backflips and buy a lot of it, because it's a way better fit for the specific performance dimensions I need for A Really Fast Router. But nothing remotely along these lines exists... so like a lot of other people I just have to order off the menu. ;-)

--lj

-----Original Message-----
From: NANOG <nanog-bounces+ljwobker=gmail.com@nanog.org> On Behalf Of Masataka Ohta
Sent: Sunday, August 7, 2022 5:13 AM
To: nanog@nanog.org
Subject: Re: 400G forwarding - how does it work?

ljwobker@gmail.com wrote:
Buffer designs are *really* hard in modern high speed chips, and there are always lots and lots of tradeoffs. The "ideal" answer is an extremely large block of memory that ALL of the forwarding/queueing elements have fair/equal access to... but this physically looks more or less like a full mesh between the memory/buffering subsystem and all the forwarding engines, which becomes really unwieldy (expensive!) from a design standpoint. The amount of memory you can practically put on the main NPU die is on the order of 20-200 **mega** bytes, where a single stack of HBM memory comes in at 4GB -- it's literally 100x the size.
I'm afraid you imply too much buffer bloat only to cause unnecessary and unpleasant delay. With 99% load M/M/1, 500 packets (750kB for 1500B MTU) of buffer is enough to make packet drop probability less than 1%. With 98% load, the probability is 0.0041%. But, there are so many router engineers who think, with bloated buffer, packet drop probability can be zero, which is wrong. For example, https://www.broadcom.com/products/ethernet-connectivity/switching/stratadnx/... Jericho2 delivers a complete set of advanced features for the most demanding carrier, campus and cloud environments. The device supports low power, high bandwidth HBM packet memory offering up to 160X more traffic buffering compared with on-chip memory, enabling zero-packet-loss in heavily congested networks. Masataka Ohta
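The M/M/1 figures above can be reproduced with a few lines of Python. My reading is that the drop probability is being approximated by the probability that an infinite M/M/1 queue exceeds the buffer size, i.e. rho**(K+1); note this assumes Poisson arrivals, which is exactly the modelling assumption being debated earlier in the thread.

  def overflow_probability(rho, buffer_packets):
      """P(queue occupancy > buffer) for an M/M/1 queue at offered load rho."""
      return rho ** (buffer_packets + 1)

  for rho in (0.99, 0.98):
      p = overflow_probability(rho, 500)
      print(f"load {rho:.0%}, 500-packet buffer: {p:.4%}")
  # load 99%, 500-packet buffer: ~0.65%   (less than 1%)
  # load 98%, 500-packet buffer: ~0.0040% (the 0.0041% figure above)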
Sharada’s answers:

a) Yes, the run-to-completion model of Trio is superior to the FP5/Nokia model when it comes to flexible processing engines. In Trio, the same engines can do either ingress or egress processing. Traditionally, there is more processing on ingress than on egress. When that happens, by design, fewer processing engines are used for egress and more engines are available for ingress processing. Trio gives full flexibility. Unless Nokia is optimizing the engines (not all engines are identical, and some are area-optimized for specific processing) to save overall area, I do not see any other advantage.

b) Trio provides on-chip shallow buffering on ingress for fabric queues. We share this buffer between the slices on the same die. This gives us the flexibility to go easy on the size of SRAM we want to support for buffering.

c) I didn't completely follow the question. Shallow ingress buffers are for fabric-facing queues, and we do not expect sustained fabric congestion. This, combined with the fact that we have some speed-up over the fabric, ensures that all WAN packets do reach the egress PFE buffer. On ingress, if packet processing is oversubscribed, we have line-rate pre-classifiers do proper drops based on WAN queue priority.

Cheers, Jeff
On Aug 9, 2022, at 16:34, Jeff Tantsura <jefftant.ietf@gmail.com> wrote:
Saku,
I have forwarded your questions to Sharada.
All,
For this week – at 11:00am PST, Thursday 08/11, we will be joined by Guy Caspary (co-founder of Leaba Semiconductor, acquired by Cisco -> SiliconOne): https://m.youtube.com/watch?v=GDthnCj31_Y
For the next week, I’m planning to get one of the main architects of Broadcom DNX (Jericho/Qumran/Ramon).
Cheers, Jeff
From: Saku Ytti Sent: Friday, August 5, 2022 12:15 AM To: Jeff Tantsura Cc: NANOG; Jeff Doyle Subject: Re: 400G forwarding - how does it work?
Thank you for this.
I wish there would have been a deeper dive to the lookup side. My open questions
a) The Trio model, where a packet stays in a single PPE until done, vs. the FP model of a line of PPEs (identical cores). I don't understand the advantages of the FP model; the Trio model's advantages are clear to me. Obviously the FP model has to have some advantages, what are they?
b) What exactly are the gains of putting two Trios on-package in Trio6? There is no local switching between the WANs of the Trios in-package; they are, as far as I can tell, ships in the night, and packets between the Trios go via fabric, as they would with separate Trios. I can understand the benefit of putting the Trio and HBM2 on the same package, to reduce distance so wattage goes down or frequency goes up.
c) What evolution are they considering for the shallow ingress buffers in Trio6? The collateral damage potential is significant, because the WAN port which asks for the most gets the most, instead of each having its fair share, so a potentially arbitrarily low-rate WAN ingress might not get access to the ingress buffer, causing drops. Would it be practical in terms of wattage/area to add some sort of pre-QoS towards the shallow ingress buffer, so that each WAN ingress has a fair guaranteed rate to the shallow buffers?
On Fri, 5 Aug 2022 at 02:18, Jeff Tantsura <jefftant.ietf@gmail.com> wrote:
Apologies for garbage/HTMLed email, not sure what happened (thanks Brian F for letting me know). Anyway, the podcast with Juniper (mostly around Trio/Express) has been broadcasted today and is available at https://www.youtube.com/watch?v=1he8GjDBq9g Next in the pipeline are: Cisco SiliconOne Broadcom DNX (Jericho/Qumran/Ramon) For both - the guests are main architects of the silicon
Enjoy
On Wed, Aug 3, 2022 at 5:06 PM Jeff Tantsura <jefftant.ietf@gmail.com> wrote:
Hey,
This is not an advertisement but an attempt to help folks to better understand networking HW.
Some of you might know (and love 😊) “between 0x2 nerds” podcast Jeff Doyle and I have been hosting for a couple of years.
Following up on the discussion, we have decided to dedicate a number of upcoming podcasts to networking HW, a topic where more information and better education is very much needed (no, you won’t have to sign an NDA before joining 😊). We have lined up a number of great guests, people who design and build ASICs and can talk firsthand about the evolution of networking HW, the complexity of the process, the differences between fixed and programmable pipelines, memories and databases. This Thursday (08/04) at 11:00 PST we are joined by the one and only Sharada Yeluri, Sr. Director ASIC at Juniper. Other vendors will be joining in later episodes; the usual rules apply – no marketing, no BS.
More to come, stay tuned.
Live feed: https://lnkd.in/gk2x2ezZ
Between 0x2 nerds playlist, videos will be published to: https://www.youtube.com/playlist?list=PLMYH1xDLIabuZCr1Yeoo39enogPA2yJB7
Cheers,
Jeff
From: James Bensley Sent: Wednesday, July 27, 2022 12:53 PM To: Lawrence Wobker; NANOG Subject: Re: 400G forwarding - how does it work?
On Tue, 26 Jul 2022 at 21:39, Lawrence Wobker <ljwobker@gmail.com> wrote:
So if this pipeline can do 1.25 billion PPS and I want to be able to forward 10BPPS, I can build a chip that has 8 of these pipelines and get my performance target that way. I could also build a "pipeline" that processes multiple packets per clock, if I have one that does 2 packets/clock then I only need 4 of said pipelines.. and so on and so forth.
Thanks for the response Lawrence.
The Broadcom BCM16K KBP has a clock speed of 1.2Ghz, so I expect the
J2 to have something similar (as someone already mentioned, most chips
I've seen are in the 1-1.5Ghz range), so in this case "only" 2
pipelines would be needed to maintain the headline 2Bpps rate of the
J2, or even just 1 if they have managed to squeeze out two packets per
cycle through parallelisation within the pipeline.
Cheers,
James.
-- ++ytti
On Tue, 26 Jul 2022 at 21:28, <jwbensley+nanog@gmail.com> wrote:
No you are right, FP has many, many more PPEs than Trio.
Can you give any examples?
Nokia FP is like >1k, Juniper Trio is closer to 100 (earlier Trio LUs had far fewer). I could give exact numbers for EA and YT if needed; they are visible in the CLI, and the end user can even profile them to see which ucode function they are spending their time on. -- ++ytti
It wasn't a CPU analysis because switching ASICs != CPUs.
I am aware of the x86 architecture, but know little of network ASICs, so I was deliberately trying to not apply my x86 knowledge here, in case it sent me down the wrong path. You made references towards typical CPU features;
A CPU is 'jack of all trades, master of none'. An ASIC is 'master of one specific thing'. If a given feature or design paradigm found in a CPU fits with the use case the ASIC is being designed for, there's no reason it cannot be used. On Mon, Jul 25, 2022 at 2:52 PM James Bensley <jwbensley+nanog@gmail.com> wrote:
Thanks for the responses Chris, Saku…
On Mon, 25 Jul 2022 at 15:17, Chris Adams <cma@cmadams.net> wrote:
Once upon a time, James Bensley <jwbensley+nanog@gmail.com> said:
The obvious answer is that it's not magic and my understanding is fundamentally flawed, so please enlighten me.
So I can't answer to your specific question, but I just wanted to say that your CPU analysis is simplistic and doesn't really match how CPUs work now.
It wasn't a CPU analysis because switching ASICs != CPUs.
I am aware of the x86 architecture, but know little of network ASICs, so I was deliberately trying to not apply my x86 knowledge here, in case it sent me down the wrong path. You made references towards typical CPU features;
For example, it might take 4 times as long to process the first packet, but as long as the hardware can handle 4 packets in a queue, you'll get a packet result every cycle after that, without dropping anything. So maybe the first result takes 12 cycles, but then you can keep getting a result every 3 cycles as long as the pipeline is kept full.
Yes, in the x86/x64 CPU world keeping the instruction cache and data cache hot indeed results in optimal performance, and as you say modern CPUs use parallel pipelines amongst other techniques like branch prediction, SIMD, (N)UMA, and so on, but I would assume (because I don’t know) that not all of the x86 feature set maps nicely to packet processing in ASICs (VPP uses these techniques on COTS CPUs to emulate a fixed pipeline, rather than a run-to-completion model).
You and Saku both suggest that heavy parallelism is the magic sauce;
Something can be "line rate" but not push the first packet through in the shortest time.
On Mon, 25 Jul 2022 at 15:16, Saku Ytti <saku@ytti.fi> wrote:
I.e. say JNPR Trio PPE has many threads, and only one thread is running, rest of the threads are waiting for answers from memory. That is, once we start pushing packets through the device, it takes a long ass time (like single digit microseconds) before we see any packets out. 1000x longer than your calculated single digit nanoseconds.
In principle I accept this idea. But let's try and do the maths, as I'd like to properly understand;
The non-drop rate of the J2 is 2Bpps @ 284 bytes == 4.8Tbps, and my example scenario was a single J2 chip in a 12x400G device. If each port is receiving 400G @ 284 bytes (164,473,684 pps), that’s one packet every 6.08 nanoseconds coming in. What kind of parallelism is required to avoid ingress drops?
Say it takes 5 microseconds to process and forward a packet (which seems reasonable looking at some Arista data sheets which use J2 variants); that means we need to be operating on 5,000ns / 6.08ns == 822 packets per port simultaneously, so 9868 packets are being processed across all 12 ports simultaneously, to avoid ingress drops on all interfaces.
I think the latest generation Trio has 160 PPEs per PFE, but I’m not sure how many threads per PPE. Older generations had 20 threads/contexts per PPE, so if that hasn’t increased it would make for 3200 threads in total. That is a 1.6Tbps FD chip, although it's not apples to apples of course, as Trio is run-to-completion.
The Nokia FP5 has 1,200 cores (I have no idea how many threads per core) and is rated for 4.8Tbps FD. Again it is doing something quite different to a J2 chip, and again it's RTC.
J2 is a partially fixed but slightly programmable pipeline, if I have understood correctly, and definitely at the other end of the spectrum compared to RTC. So are we to surmise that a J2 chip has circa 10k parallel pipelines, in order to process 9868 packets in parallel?
I have no frame of reference here, but in comparison to Gen 6 Trio or FP5, that seems very high to me (to the point where I assume I am wrong).
Cheers, James.
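For what it's worth, the arithmetic above can be reproduced as follows; the 5 microsecond latency and the 20 bytes of preamble/IFG per packet are assumptions carried over from this thread rather than data-sheet values, and as the replies below point out, the ~9868 number is packets that must be in flight at once (contexts/threads), not independent pipelines.

  PORT_SPEED = 400e9                 # bits/s
  PKT_WIRE_BITS = (284 + 20) * 8     # 284B frame + assumed 20B preamble/IFG on the wire
  LATENCY_S = 5e-6                   # assumed ingress-to-egress processing latency
  PORTS = 12

  pps_per_port = PORT_SPEED / PKT_WIRE_BITS           # ~164.5 Mpps (164,473,684)
  interarrival_ns = 1e9 / pps_per_port                # ~6.08 ns
  in_flight_per_port = LATENCY_S * pps_per_port       # ~822 packets
  in_flight_total = in_flight_per_port * PORTS        # ~9868 packets

  print(f"{pps_per_port/1e6:.1f} Mpps/port, one packet every {interarrival_ns:.2f} ns")
  print(f"~{in_flight_per_port:.0f} packets in flight per port, ~{in_flight_total:.0f} across {PORTS} ports")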
I'm not sure what your specific question is. So I answer my question instead.

Q: how can we do lookup fast enough to do 'big number' per second, while underlying hardware inherently takes longer
A: we throw memory at the problem

I.e. say JNPR Trio PPE has many threads, and only one thread is running, rest of the threads are waiting for answers from memory. That is, once we start pushing packets through the device, it takes a long ass time (like single digit microseconds) before we see any packets out. 1000x longer than your calculated single digit nanoseconds.

On Mon, 25 Jul 2022 at 15:56, James Bensley <jwbensley+nanog@gmail.com> wrote:
Hi All,
I've been trying to understand how forwarding at 400G is possible, specifically in this example, in relation to the Broadcom J2 chips, but I don't the mystery is anything specific to them...
According to the Broadcom Jericho2 BCM88690 data sheet it provides 4.8Tbps of traffic processing and supports packet forwarding at 2Bpps. According to my maths that means it requires packet sizes of 300Bs to reach line rate across all ports. The data sheet says packet sizes above 284B, so I guess this is excluding some headers like the inter-frame gap and CRC (nothing after the PHY/MAC needs to know about them if the CRC is valid)? As I interpret the data sheet, J2 should supports chassis with 12x 400Gbps ports at line rate with 284B packets then.
Jericho2 can be linked to a BCM16K for expanded packet forwarding tables and lookup processing (i.e. to hold the full global routing table, in such a case, forwarding lookups are offloaded to the BCM16K). The BCM16K documentation suggests that it uses TCAM for exact matching (e.g.,for ACLs) in something called the "Database Array" (with 2M 40b entries?), and SRAM for LPM (e.g., IP lookups) in something called the "User Data Array" (with 16M 32b entries?).
A BCM16K supports 16 parallel searches, which means that each of the 12x 400G ports on a Jericho2 could perform an forwarding lookup at same time. This means that the BCM16K "only" needs to perform forwarding look-ups at a linear rate of 1x 400Gbps, not 4.8Tbps, and "only" for packets larger than 284 bytes, because that is the Jericho2 line-rate Pps rate. This means that each of the 16 parallel searches in the BCM16K, they need to support a rate of 164Mpps (164,473,684) to reach 400Gbps. This is much more in the realm of feasible, but still pretty extreme...
1 second / 164473684 packets = 1 packet every 6.08 nanoseconds, which is within the access time of TCAM and SRAM but this needs to include some computing time too e.g. generating a key for a lookup and passing the results along the pipeline etc. The BCM16K has a clock speed of 1Ghz (1,000,000,000, cycles per second, or cycle every 1 nano second) and supports an SRAM memory access in a single clock cycle (according to the data sheet). If one cycle is required for an SRAM lookup, the BCM16K only has 5 cycles to perform other computation tasks, and the J2 chip needs to do the various header re-writes and various counter updates etc., so how is magic this happening?!?
The obvious answer is that it's not magic and my understanding is fundamentally flawed, so please enlighten me.
Cheers, James.
-- ++ytti
So the most important bits are pipelining and parallelism. And this is substantially simplified, but hopefully it helps.

Pipelining basically means that you have a whole bunch of different operations that you need to perform to forward a packet. Lots of these are lookups into things like the FIB tables, the encap tables, the MAC tables, and literally dozens of other places where you store configuration and network state. Some of these are very small simple tables (“give me the value for a packet with TOS = 0b101”) and some are very complicated, like multi-level longest-prefix trees/tries that are built from lots of custom hardware logic and memory. It varies a lot from chip to chip, but there are on the order of 50-100 different tables for the current generation of “fast” chips doing lots of 400GE interfaces. Figuring out how to distribute all this forwarding state across all the different memory banks/devices in a big, fast chip is one of the Very Hard Problems that the chip makers and system vendors have to figure out.

So once you build out this pipeline, you’ve got a bunch of different steps that all happen sequentially. The “length” of the pipeline puts a floor on the latency for switching a single packet… if I have to do 25 lookups and they’re all dependent on the one before, it’s not possible for me to switch the packet in any less than 25 clocks…. BUT, if I have a bunch of hardware all running these operations at the same time, I can push the aggregate forwarding capacity way higher.

This is the parallelism part. I can take multiple instances of these memory/logic pipelines, and run them in parallel to increase the throughput. Now there’s plenty of complexity in terms of HOW I do all that parallelism — figuring out whether I have to replicate entire memory structures or if I can come up with sneaky ways of doing multiple lookups more efficiently, but that’s getting into the magic secret sauce type stuff.

I work on/with a chip that can forward about 10B packets per second… so if we go back to the order-of-magnitude number that I’m doing about “tens” of memory lookups for every one of those packets, we’re talking about something like a hundred BILLION total memory lookups… and since memory does NOT give me answers in 1 picosecond… we get back to pipelining and parallelism.

Hopefully that helps at least some. Disclaimer: I’m a Cisco employee, these words are mine and not representative of anything awesome that I may or may not work on in my day job… —lj
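A minimal numeric sketch of the latency-versus-throughput point above; the 25 stages, 1 GHz clock and 8 parallel pipelines are illustrative assumptions, not any particular chip.

  CLOCK_HZ = 1.0e9      # one pipeline stage completes per clock
  STAGES = 25           # sequential, dependent lookups per packet
  PIPELINES = 8         # identical pipelines running in parallel

  first_packet_latency_ns = STAGES * 1e9 / CLOCK_HZ    # floor on per-packet latency
  per_pipeline_pps = CLOCK_HZ                          # one result per clock once the pipeline is full
  aggregate_pps = per_pipeline_pps * PIPELINES

  print(f"first packet out after {first_packet_latency_ns:.0f} ns")
  print(f"{per_pipeline_pps/1e9:.1f} Gpps per pipeline, {aggregate_pps/1e9:.0f} Gpps aggregate")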
Hi Lawrence, thanks for your response. On Mon, 25 Jul 2022 at 15:34, Lawrence Wobker <ljwobker@gmail.com> wrote:
This is the parallelism part. I can take multiple instances of these memory/logic pipelines, and run them in parallel to increase the throughput. ... I work on/with a chip that can forwarding about 10B packets per second… so if we go back to the order-of-magnitude number that I’m doing about “tens” of memory lookups for every one of those packets, we’re talking about something like a hundred BILLION total memory lookups… and since memory does NOT give me answers in 1 picoseconds… we get back to pipelining and parallelism.
What level of parallelism is required to forward 10Bpps? Or 2Bpps like my J2 example :) Cheers, James.
On Mon, Jul 25, 2022 at 11:58 AM James Bensley <jwbensley+nanog@gmail.com> wrote:
On Mon, 25 Jul 2022 at 15:34, Lawrence Wobker <ljwobker@gmail.com> wrote:
This is the parallelism part. I can take multiple instances of these memory/logic pipelines, and run them in parallel to increase the throughput. ... I work on/with a chip that can forwarding about 10B packets per second… so if we go back to the order-of-magnitude number that I’m doing about “tens” of memory lookups for every one of those packets, we’re talking about something like a hundred BILLION total memory lookups… and since memory does NOT give me answers in 1 picoseconds… we get back to pipelining and parallelism.
What level of parallelism is required to forward 10Bpps? Or 2Bpps like my J2 example :)
I suspect many folks know the exact answer for J2, but it's likely under NDA to talk about said specific answer for a given thing.

Without being platform or device-specific, the core clock rate of many network devices is often in a "goldilocks" zone of (today) 1 to 1.5GHz with a goal of 1 packet forwarded 'per-clock'. As LJ described the pipeline, that doesn't mean a latency of 1 clock ingress-to-egress but rather that every clock there is a forwarding decision from one 'pipeline', and the MPPS/BPPS packet rate is achieved by having enough pipelines in parallel to achieve that. The number here is often "1" or "0.5" so you can work the number backwards (e.g. it emits a packet every clock, or every 2nd clock).

It's possible to build an ASIC/NPU to run a faster clock rate, but that gets back to what I'm hand-waving describing as "goldilocks". Look up power vs frequency and you'll see it's non-linear. Just as CPUs can scale by adding more cores (vs increasing frequency), ~the same holds true on network silicon, and you can go wider with multiple pipelines. But it's not 10K parallel slices; there are some parallel parts, but there are multiple 'stages' on each doing different things.

Using your CPU comparison, there are some analogies here that do work:
- you have multiple CPU cores that can do things in parallel -- analogous to pipelines
- they often share some common I/O (e.g. CPUs have PCIe, maybe sharing some DRAM or LLC) -- maybe some lookup engines, or centralized buffer/memory
- most modern CPUs are out-of-order execution, where under-the-covers a cache miss or DRAM fetch has a disproportionate hit on performance, so it's hidden away from you as much as possible by speculative out-of-order execution -- no direct analogy to this one; it's unlikely most forwarding pipelines do speculative execution like a general-purpose CPU does, but they definitely do 'other work' while waiting for a lookup to happen

A common-garden x86 is unlikely to achieve such a rate for a few different reasons:
- if packets-in or packets-out go via DRAM then you need sufficient DRAM (page opens/sec, DRAM bandwidth) to sustain at least one write and one read per packet. Look closer at DRAM and see its speed; pay attention to page opens/sec and what that consumes.
- one 'trick' is to not DMA packets to DRAM but instead have them go into SRAM of some form - e.g. Intel DDIO, ARM Cache Stashing - which at least potentially saves you that DRAM write+read per packet
- ... but then do e.g. an LPM lookup, and best case that is back to a memory access per packet. Maybe it's in L1/L2/L3 cache, but likely at large table sizes it isn't.
- ... do more things to the packet (uRPF lookups, counters) and it's yet more lookups.

Software can achieve high rates, but note that a typical ASIC/NPU does on the order of >100 separate lookups per packet, and 100 counter updates per packet. Just as forwarding in an ASIC or NPU is a series of tradeoffs, forwarding in software on generic CPUs is also a series of tradeoffs.

cheers, lincoln.
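Working the numbers backwards as described above, a small sketch; the 1.25 GHz clock is an assumed value inside the 1-1.5 GHz range mentioned, and the 2 Bpps / 10 Bpps targets are the figures quoted earlier in this thread.

  import math

  def pipelines_needed(target_pps, clock_hz, pkts_per_clock):
      return math.ceil(target_pps / (clock_hz * pkts_per_clock))

  for target_pps, label in [(2e9, "2 Bpps (J2 headline)"), (10e9, "10 Bpps")]:
      for ppc in (1.0, 0.5):
          n = pipelines_needed(target_pps, 1.25e9, ppc)
          print(f"{label}: {n} pipeline(s) at 1.25 GHz and {ppc} pkt/clock")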
As Lincoln said - all of us directly working with BCM/other silicon vendors have signed numerous NDAs. However, if you ask a well crafted question - there’s always a way to talk about it ;-)

In general, if we look at the whole spectrum, on one side there are massively parallelized “many core” RTC ASICs, such as Trio, Lightspeed, and similar (as the last gasp of the Redback/Ericsson venture we built a 1400-HW-thread ASIC (Spider)). On the other side of the spectrum are fixed pipeline ASICs, from BCM Tomahawk at its extreme (max speed/radix - min features) moving through BCM Trident, Innovium, Barefoot (quite a different animal wrt programmability), etc - usually with shallow on-chip buffer only (100-200M).

In between we have so-called programmable pipeline silicon; BCM DNX and Juniper Express are in this category, usually a combo of OCB + off-chip memory (most often HBM) (2-6G), usually with line-rate/high-scale security/overlay encap/decap capabilities. They usually have highly optimized RTC blocks within a pipeline (RTC within a macro). The way and speed to access DBs and memories is evolving with each generation, and the number/speed of non-networking cores (usually ARM) keeps growing - OAM, INT and local optimizations are the primary users of it.

Cheers, Jeff
On Jul 25, 2022, at 15:59, Lincoln Dale <ltd@interlink.com.au> wrote:
On Mon, Jul 25, 2022 at 11:58 AM James Bensley <jwbensley+nanog@gmail.com> wrote:
On Mon, 25 Jul 2022 at 15:34, Lawrence Wobker <ljwobker@gmail.com> wrote:
This is the parallelism part. I can take multiple instances of these memory/logic pipelines, and run them in parallel to increase the throughput. ... I work on/with a chip that can forwarding about 10B packets per second… so if we go back to the order-of-magnitude number that I’m doing about “tens” of memory lookups for every one of those packets, we’re talking about something like a hundred BILLION total memory lookups… and since memory does NOT give me answers in 1 picoseconds… we get back to pipelining and parallelism.
What level of parallelism is required to forward 10Bpps? Or 2Bpps like my J2 example :)
I suspect many folks know the exact answer for J2, but it's likely under NDA to talk about said specific answer for a given thing.
Without being platform or device-specific, the core clock rate of many network devices is often in a "goldilocks" zone of (today) 1 to 1.5GHz with a goal of 1 packet forwarded 'per-clock'. As LJ described the pipeline that doesn't mean a latency of 1 clock ingress-to-egress but rather that every clock there is a forwarding decision from one 'pipeline', and the MPPS/BPPS packet rate is achieved by having enough pipelines in parallel to achieve that. The number here is often "1" or "0.5" so you can work the number backwards. (e.g. it emits a packet every clock, or every 2nd clock).
It's possible to build an ASIC/NPU to run a faster clock rate, but gets back to what I'm hand-waving describing as "goldilocks". Look up power vs frequency and you'll see its non-linear. Just as CPUs can scale by adding more cores (vs increasing frequency), ~same holds true on network silicon, and you can go wider, multiple pipelines. But its not 10K parallel slices, there's some parallel parts, but there are multiple 'stages' on each doing different things.
Using your CPU comparison, there are some analogies here that do work: - you have multiple cpu cores that can do things in parallel -- analogous to pipelines - they often share some common I/O (e.g. CPUs have PCIe, maybe sharing some DRAM or LLC) -- maybe some lookup engines, or centralized buffer/memory - most modern CPUs are out-of-order execution, where under-the-covers, a cache-miss or DRAM fetch has a disproportionate hit on performance, so its hidden away from you as much as possible by speculative execution out-of-order -- no direct analogy to this one - it's unlikely most forwarding pipelines do speculative execution like a general purpose CPU does - but they definitely do 'other work' while waiting for a lookup to happen
A common-garden x86 is unlikely to achieve such a rate for a few different reasons: - packets-in or packets-out go via DRAM then you need sufficient DRAM (page opens/sec, DRAM bandwidth) to sustain at least one write and one read per packet. Look closer at DRAM and see its speed, Pay attention to page opens/sec, and what that consumes. - one 'trick' is to not DMA packets to DRAM but instead have it go into SRAM of some form - e.g. Intel DDIO, ARM Cache Stashing, which at least potentially saves you that DRAM write+read per packet - ... but then do e.g. a LPM lookup, and best case that is back to a memory access/packet. Maybe it's in L1/L2/L3 cache, but likely at large table sizes it isn't. - ... do more things to the packet (urpf lookups, counters) and it's yet more lookups.
Software can achieve high rates, but note that a typical ASIC/NPU does on the order of >100 separate lookups per packet, and 100 counter updates per packet. Just as forwarding in a ASIC or NPU is a series of tradeoffs, forwarding in software on generic CPUs is also a series of tradeoffs.
cheers,
lincoln.
On Tue, 26 Jul 2022 at 23:15, Jeff Tantsura <jefftant.ietf@gmail.com> wrote:
In general, if we look at the whole spectrum, on one side there’re massively parallelized “many core” RTC ASICs, such as Trio, Lightspeed, and similar (as the last gasp of Redback/Ericsson venture - we have built 1400 HW threads ASIC (Spider). On another side of the spectrum - fixed pipeline ASICs, from BCM Tomahawk at its extreme (max speed/radix - min features) moving with BCM Trident, Innovium, Barefoot(quite different animal wrt programmability), etc - usually shallow on chip buffer only (100-200M).
In between we have got so called programmable pipeline silicon, BCM DNX and Juniper Express are in this category, usually a combo of OCB + off chip memory (most often HBM), (2-6G), usually have line-rate/high scale security/overlay encap/decap capabilities. Usually have highly optimized RTC blocks within a pipeline (RTC within macro). The way and speed to access DBs, memories is evolving with each generation, number/speed of non networking cores(usually ARM) keeps growing - OAM, INT, local optimizations are primary users of it.
What do we call Nokia FP? Where you have a pipeline of identical cores doing different things, and the packet has to hit each core in the line in order? How do we contrast this to an NPU where a given packet hits exactly one core? I think ASIC, NPU, pipeline and RTC are all quite ambiguous. When we say pipeline, usually people assume purpose-built unique HW blocks that the packet travels through (like DNX, Express), and not a fully flexible pipeline of identical cores like FP. So I guess I would consider a 'true pipeline' to be a pipeline of unique HW blocks, a 'true NPU' to be one where a given packet hits exactly 1 core, and anything else as more or less hybrid. I expect once you get to the details of implementation, all of these generalisations lose communicative power. -- ++ytti
FYI https://community.juniper.net/blogs/nicolas-fevrier/2022/07/27/voq-and-dnx-p... Cheers, Jeff
James Bensley wrote:
The BCM16K documentation suggests that it uses TCAM for exact matching (e.g.,for ACLs) in something called the "Database Array" (with 2M 40b entries?), and SRAM for LPM (e.g., IP lookups) in something called the "User Data Array" (with 16M 32b entries?).
Which documentation? According to https://docs.broadcom.com/docs/16000-DS1-PUB, figure 1 and the related explanations: Database records 40b: 2048k/1024k. Table width configurable as 80/160/320/480/640 bits. User Data Array for associated data, width configurable as 32/64/128/256 bits. This means that the header extracted by the 88690 is analyzed by the 16K, finally resulting in 40b of information (a lot shorter than IPv6 addresses, but still perhaps enough for an IPv6 backbone to identify sites) via the "database" lookup, which is obviously by CAM because 40b is painful for SRAM, and is then converted to "32/64/128/256 bits" of data.
1 second / 164473684 packets = 1 packet every 6.08 nanoseconds, which is within the access time of TCAM and SRAM
As high-speed TCAM and SRAM should be pipelined, the cycle time, which is what matters, is shorter than the access time. Finally, it should be pointed out that most, if not all, performance figures such as MIPS and FLOPS are merely guaranteed not to be exceeded. In this case, if deep packet inspection of lengthy headers is required, for some complicated routing schemes or to satisfy NSA requirements, the communication speed between the 88690 and the 16K will be the limiting factor for PPS, resulting in a lot less than the maximum possible PPS. Masataka Ohta
The Broadcom KBP -- often called an "external TCAM" -- is really closer to a completely separate NPU than just an external TCAM. "Back in the day" we used external TCAMs to store forwarding state (FIB tables, ACL tables, whatever) on devices that were pretty much just a bunch of TCAM memory and an interface for the "main" NPU to ask for a lookup. Today the modern KBP devices have WAY more functionality, they have lots of different databases and tables available, which can be sliced and diced into different widths and depths. They can store lots of different kinds of state, from counters to LPM prefixes and ACLs.

At risk of correcting Ohta-san, note that most ACLs are implemented using TCAMs with wildcard/masking support, as opposed to an exact match lookup. Exact match lookups are generally used for things that do not require masking or wildcard bits: MAC addresses and MPLS label values are the canonical examples here.

The SRAM memories used in fast networking chips are almost always built such that they provide one lookup per clock, although hardware designers often use multiple banks of these to increase the number of *effective* lookups per clock. TCAMs are also generally built such that they provide one lookup/result per clock, but again you can stack up multiple devices to increase this.

Many hardware designs also allow for more flexibility in how the various memories are utilized by the software -- almost everyone is familiar with the idea of "I can have a million entries of X bits, or half a million entries of 2*X bits". If the hardware and software complexity was free, we'd design memories that could be arbitrarily chopped into exactly the sizes we need, but that complexity is Absolutely Not Free.... so we end up picking a few discrete sizes and the software/forwarding code has to figure out how to use those bits efficiently. And you can bet your life that as soon as you have a memory that can function using either 80b or 160b entries, you will immediately come across a use case that really really needs to use entries of 81b.

FYI: There's nothing particularly magical about 40b memory widths. When building these chips you can (more or less) pick whatever width of SRAM you want to build, and the memory libraries that you use spit out the corresponding physical design.

Ohta-san correctly mentions that a critical part of the performance analysis is how fast the different parts of the pipeline can talk to each other. Note that this concept applies whether we're talking about the connection between very small blocks within the ASIC/NPU, or the interface between the NPU and an external KBP/TCAM, or for that matter between multiple NPUs/fabric chips within a system. At some point you'll always be constrained by whatever the slowest link in the pipeline is, so balancing all that stuff out is Yet One More Thing for the system designer to deal with.

--lj

-----Original Message----- From: NANOG <nanog-bounces+ljwobker=gmail.com@nanog.org> On Behalf Of Masataka Ohta Sent: Wednesday, July 27, 2022 9:09 AM To: nanog@nanog.org Subject: Re: 400G forwarding - how does it work?

James Bensley wrote:
The BCM16K documentation suggests that it uses TCAM for exact matching (e.g.,for ACLs) in something called the "Database Array" (with 2M 40b entries?), and SRAM for LPM (e.g., IP lookups) in something called the "User Data Array" (with 16M 32b entries?).
Which documentation? According to: https://docs.broadcom.com/docs/16000-DS1-PUB figure 1 and related explanations: Database records 40b: 2048k/1024k. Table width configurable as 80/160/320/480/640 bits. User Data Array for associated data, width configurable as 32/64/128/256 bits. means that header extracted by 88690 is analyzed by 16K finally resulting in 40b (a lot shorter than IPv6 addresses, still may be enough for IPv6 backbone to identify sites) information by "database" lookup, which is, obviously by CAM because 40b is painful for SRAM, converted to "32/64/128/256 bits data".
1 second / 164473684 packets = 1 packet every 6.08 nanoseconds, which is within the access time of TCAM and SRAM
As high speed TCAM and SRAM should be pipelined, cycle time, which matters, is shorter than access time. Finally, it should be pointed out that most, if not all, performance figures such as MIPS and Flops are merely guaranteed not to be exceeded. In this case, if so deep packet inspections by lengthy header for some complicated routing schemes or to satisfy NSA requirements are required, communication speed between 88690 and 16K will be the limitation factor for PPS resulting in a lot less than maximum possible PPS. Masataka Ohta
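A toy illustration of the exact-match versus TCAM wildcard-match distinction drawn above; the tables, masks and widths are invented for illustration, and real hardware evaluates every TCAM entry in parallel in a single cycle rather than looping, which is exactly what makes TCAM expensive in area and power.

  # Exact match: a plain hash/dictionary keyed on the full field (MACs, MPLS labels).
  mac_table = {0x001122334455: "port 1", 0xAABBCCDDEEFF: "port 2"}

  def exact_lookup(key):
      return mac_table.get(key)

  # TCAM-style: ordered (value, mask) entries, first hit wins.
  # A 0 bit in the mask means "don't care" (wildcard).
  acl_tcam = [
      (0x0A000000, 0xFF000000, "deny 10.0.0.0/8"),
      (0x00000000, 0x00000000, "permit any"),
  ]

  def tcam_lookup(key):
      for value, mask, action in acl_tcam:
          if (key & mask) == (value & mask):
              return action
      return "no match"

  print(exact_lookup(0x001122334455))   # port 1
  print(tcam_lookup(0x0A0A0A01))        # deny 10.0.0.0/8  (10.10.10.1)
  print(tcam_lookup(0xC0A80101))        # permit any       (192.168.1.1)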
This convo is giving me some hope that the sophisticated FQ and AQM algorithms I favor can be made to run in more hardware at high rates, but most of the work I'm aware of has targeted Tofino and P4. The only thing I am aware of shipping is AFD in some Cisco hw. Anyone using that?
On Wed, 27 Jul 2022 at 15:11, Masataka Ohta <mohta@necom830.hpcl.titech.ac.jp> wrote:
James Bensley wrote:
The BCM16K documentation suggests that it uses TCAM for exact matching (e.g.,for ACLs) in something called the "Database Array" (with 2M 40b entries?), and SRAM for LPM (e.g., IP lookups) in something called the "User Data Array" (with 16M 32b entries?).
Which documentation?
According to:
https://docs.broadcom.com/docs/16000-DS1-PUB
figure 1 and related explanations:
Database records 40b: 2048k/1024k. Table width configurable as 80/160/320/480/640 bits. User Data Array for associated data, width configurable as 32/64/128/256 bits.
means that header extracted by 88690 is analyzed by 16K finally resulting in 40b (a lot shorter than IPv6 addresses, still may be enough for IPv6 backbone to identify sites) information by "database" lookup, which is, obviously by CAM because 40b is painful for SRAM, converted to "32/64/128/256 bits data".
Hi Masataka, Yes I had read that data sheet. If you have 2M 40b entries in CAM, you could also have 1M 80b entries (or a mixture); the 40b CAM blocks can be chained together to store IPv4/IPv6/MPLS/whatever entries. Cheers, James.
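Quick arithmetic for the chaining point above, treating the database as a fixed pool of bits carved into the widths listed in the data sheet excerpt quoted earlier; the 2M x 40b figure is the one from this thread, and real hardware will of course have allocation constraints this ignores.

  TOTAL_BITS = 2 * 1024 * 1024 * 40    # "2048k" entries x 40 bits

  for width in (40, 80, 160, 320, 480, 640):
      entries = TOTAL_BITS // width
      print(f"{width:3d}b-wide entries: {entries/1048576:.2f}M")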
participants (18)
- Ca By
- Chris Adams
- Dave Taht
- dip
- Etienne-Victor Depasquale
- James Bensley
- Jeff Tantsura
- jwbensley+nanog@gmail.com
- Lawrence Wobker
- Lincoln Dale
- ljwobker@gmail.com
- Masataka Ohta
- Matthew Huff
- Nick Hilliard
- Saku Ytti
- sronan@ronan-online.com
- Tom Beecher
- Vasilenko Eduard