Thanks for the responses Chris, Saku…

On Mon, 25 Jul 2022 at 15:17, Chris Adams <cma@cmadams.net> wrote:
> Once upon a time, James Bensley <jwbensley+nanog@gmail.com> said:
> > The obvious answer is that it's not magic and my understanding is fundamentally flawed, so please enlighten me.
> So I can't answer your specific question, but I just wanted to say that your CPU analysis is simplistic and doesn't really match how CPUs work now.
It wasn't a CPU analysis, because switching ASICs != CPUs. I'm aware of the x86 architecture, but know little about network ASICs, so I was deliberately trying not to apply my x86 knowledge here, in case it sent me down the wrong path. You made references to typical CPU features:
> For example, it might take 4 times as long to process the first packet, but as long as the hardware can handle 4 packets in a queue, you'll get a packet result every cycle after that, without dropping anything. So maybe the first result takes 12 cycles, but then you can keep getting a result every 3 cycles as long as the pipeline is kept full.
Yes, in the x86/x64 CPU world, keeping the instruction cache and data cache hot indeed results in optimal performance, and as you say, modern CPUs use parallel pipelines amongst other techniques like branch prediction, SIMD, and (N)UMA. However, I would assume (because I don't know) that not all of the x86 feature set maps nicely to packet processing in ASICs (VPP uses these techniques on COTS CPUs to emulate a fixed pipeline, rather than a run-to-completion model). You and Saku both suggest that heavy parallelism is the magic source:
> Something can be "line rate" but not push the first packet through in the shortest time.
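If I model that naively (purely my own illustration, not a claim about how any real ASIC is built), a 4-stage pipeline at 3 cycles per stage gives exactly those numbers:

    # Toy model of the pipelining example above (my own sketch):
    # 4 stages, 3 cycles per stage. The first packet completes after
    # 12 cycles, then one completes every 3 cycles for as long as
    # the pipeline is kept full.
    STAGES = 4
    CYCLES_PER_STAGE = 3

    def completion_cycle(packet_index: int) -> int:
        # Cycle at which packet N exits a full pipeline (0-indexed).
        fill_latency = STAGES * CYCLES_PER_STAGE   # 12 cycles for packet 0
        return fill_latency + packet_index * CYCLES_PER_STAGE

    for n in range(4):
        print(f"packet {n} exits at cycle {completion_cycle(n)}")
    # packet 0 exits at cycle 12, packet 1 at 15, packet 2 at 18, ...

So latency to the first result is high, but steady-state throughput is one result per stage interval.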
On Mon, 25 Jul 2022 at 15:16, Saku Ytti <saku@ytti.fi> wrote:
> I.e. say JNPR Trio PPE has many threads, and only one thread is running, rest of the threads are waiting for answers from memory. That is, once we start pushing packets through the device, it takes a long ass time (like single digit microseconds) before we see any packets out. 1000x longer than your calculated single digit nanoseconds.
In principle I accept this idea, but let's try to do the maths, as I'd like to properly understand it.

The non-drop rate of the J2 is 2 Bpps @ 284 bytes == 4.8 Tbps, and my example scenario was a single J2 chip in a 12x400G device. If each port is receiving 400G @ 284 bytes (164,473,684 pps), that's one packet every 6.08 nanoseconds coming in. What kind of parallelism is required to avoid ingress drops? Say it takes 5 microseconds to process and forward a packet (which seems reasonable looking at some Arista data sheets that use J2 variants); that means we need to be operating on 5,000 ns / 6.08 ns == 822 packets per port simultaneously, so 9,868 packets being processed across all 12 ports simultaneously, to avoid ingress drops on all interfaces (the maths is repeated as a small Python sketch at the end of this mail).

I think the latest generation Trio has 160 PPEs per PFE, but I'm not sure how many threads per PPE. Older generations had 20 threads/contexts per PPE, so if that hasn't increased, it would make for 3,200 threads in total. That is a 1.6 Tbps FD chip, although it's not apples to apples of course; Trio is run-to-completion too. The Nokia FP5 has 1,200 cores (I have no idea how many threads per core) and is rated for 4.8 Tbps FD. Again, it's doing something quite different to a J2 chip, and again it's RTC. J2 is a partially-fixed pipeline, but slightly programmable if I have understood correctly, and definitely at the other end of the spectrum compared to RTC.

So are we to surmise that a J2 chip has circa 10k parallel pipelines, in order to process 9,868 packets in parallel? I have no frame of reference here, but in comparison to Gen 6 Trio or FP5, that seems very high to me (to the point where I assume I am wrong).
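For anyone who wants to poke at the numbers, here is the back-of-the-envelope maths above as a few lines of Python (a sketch only; the 20 B per-frame overhead and the 5 us processing time are my assumptions, not vendor figures):

    # Back-of-the-envelope check of the parallelism maths above.
    # Assumptions (mine): 284 B frames + 20 B Ethernet overhead
    # (preamble + IFG) = 304 B on the wire, 12x400G ports, and
    # ~5 us to process and forward one packet.
    WIRE_BITS = (284 + 20) * 8     # bits per frame on the wire
    PORT_BPS = 400e9               # 400G per port
    PORTS = 12
    PROC_NS = 5_000                # assumed per-packet processing time, ns

    pps_per_port = PORT_BPS / WIRE_BITS        # ~164,473,684 pps
    gap_ns = 1e9 / pps_per_port                # ~6.08 ns between packets
    in_flight_port = PROC_NS / gap_ns          # ~822 packets per port
    in_flight_total = in_flight_port * PORTS   # ~9,868 packets chip-wide

    print(f"{pps_per_port:,.0f} pps/port, one packet every {gap_ns:.2f} ns")
    print(f"{in_flight_port:,.0f} in flight per port, "
          f"{in_flight_total:,.0f} across all {PORTS} ports")

Cheers,
James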