Once upon a time, James Bensley <jwbensley+nanog@gmail.com> said:
> The obvious answer is that it's not magic and my understanding is fundamentally flawed, so please enlighten me.
So I can't answer your specific question, but I just wanted to say that your CPU analysis is simplistic and doesn't really match how modern CPUs work. Something can be "line rate" but not push the first packet through in the shortest time.

CPUs break operations down into a series of very small steps and then run those steps through a pipeline, with different parts of the CPU working on the micro-operations of different overall operations at the same time. The first result out of the pipeline (the packet destination calculation, in this case) may take more time, but after that you keep getting a result every cycle or every few cycles. For example, it might take 4 times as long to process the first packet, but as long as the hardware can hold 4 packets in a queue, you'll get a packet result every cycle after that without dropping anything. So maybe the first result takes 12 cycles, but then you keep getting a result every 3 cycles as long as the pipeline is kept full.

This type of pipelined+superscalar processing was a big deal with Cray supercomputers, but made its way down to PC-level hardware with the Pentium Pro. It has issues (see all the Spectre and Retbleed CPU flaws around branch prediction, for example), but in general it lets a CPU handle a chain of operations faster than it could handle each operation individually.

-- 
Chris Adams <cma@cmadams.net>
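As a rough sketch of that arithmetic, here is a toy C model using the illustrative 12-cycle fill / 3-cycle issue numbers from above (they are not figures from any real CPU, and the constant names are made up for illustration): the first lookup pays the full pipeline latency, and every later one only the issue interval, so the per-packet cost converges on the issue interval rather than the full latency.

/*
 * Toy model: pipelined vs. unpipelined packet lookups.
 * The 12-cycle fill latency and 3-cycle issue interval are the
 * illustrative numbers from the post, not real hardware values.
 */
#include <stdio.h>

#define PIPELINE_FILL_CYCLES 12UL /* latency until the first result */
#define ISSUE_INTERVAL        3UL /* one result every N cycles once full */

int main(void)
{
    unsigned long packets = 1000000UL;

    /* Unpipelined: every packet pays the full latency. */
    unsigned long serial_cycles = packets * PIPELINE_FILL_CYCLES;

    /* Pipelined: the first packet pays the fill latency, then the
     * pipeline delivers one result per issue interval as long as
     * it is kept full. */
    unsigned long pipelined_cycles =
        PIPELINE_FILL_CYCLES + (packets - 1) * ISSUE_INTERVAL;

    printf("serial:    %lu cycles (%.2f cycles/packet)\n",
           serial_cycles, (double)serial_cycles / packets);
    printf("pipelined: %lu cycles (%.2f cycles/packet)\n",
           pipelined_cycles, (double)pipelined_cycles / packets);
    return 0;
}

With a million packets the pipelined model works out to just over 3 cycles per packet, versus the full 12 cycles per packet if each lookup had to wait for the previous one to finish.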