> My understanding of Juniper's approach to the problem is that instead
> of employing TCAMs for next-hop lookup, they use general purpose CPUs
> operating on a radix tree, exactly as you would for an all-software
> router.

They are absolutely not doing that with "general purpose CPUs".

The LU block on early-gen Trios was a dedicated ASIC (LU by itself, then consolidated slightly). Later-gen Trio put everything on a single chip, but again a dedicated ASIC.

> To achieve an -aggregate- lookup speed comparable to a TCAM, they
> implement a bunch of these lookup engines as dedicated parallel
> subprocessors rather than using the router's primary compute engine.

You're correct that there is parallelism in the LU functions, but I still think you're kinda smushing together a bunch of stuff that's happening in different places.

On Fri, Sep 29, 2023 at 4:44 PM William Herrin <bill@herrin.us> wrote:
On Thu, Sep 28, 2023 at 10:29 PM Saku Ytti <saku@ytti.fi> wrote:
> On Fri, 29 Sept 2023 at 08:24, William Herrin <bill@herrin.us> wrote:
> > Maybe. That's where my comment about CPU cache starvation comes into
> > play. I haven't delved into the Juniper line cards recently so I could
> > easily be wrong, but if the number of routes being actively used
> > pushes past the CPU data cache, the cache miss rate will go way up and
> > it'll start thrashing main memory. The net result is that the
> > achievable PPS drops by at least an order of magnitude.
>
> When you say, you've not delved into the Juniper line cards recently,
> which specific Juniper linecard does your comment apply to?

Howdy,

My understanding of Juniper's approach to the problem is that instead
of employing TCAMs for next-hop lookup, they use general purpose CPUs
operating on a radix tree, exactly as you would for an all-software
router. This makes each lookup much slower than a TCAM can achieve.
However, that doesn't matter much: the lookup delays are much shorter
than the transmission delays so it's not noticeable to the user. To
achieve an -aggregate- lookup speed comparable to a TCAM, they
implement a bunch of these lookup engines as dedicated parallel
subprocessors rather than using the router's primary compute engine.

A TCAM lookup is approximately O(1) while a radix tree lookup is
approximately O(log n). (Neither description is strictly correct but
it's close enough to understand the running time.) Log n is pretty
small so it doesn't take much parallelism for the practical run time
to catch up to the TCAM.
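
To make the tree-walk cost concrete, here is a minimal sketch of longest-prefix match on a plain binary trie. It is an illustration only: the names (Node, BinaryTrie, insert, lookup) are made up for this example, it is an uncompressed one-bit-per-level trie rather than a path-compressed radix tree, and it is not a description of what any Juniper hardware does. The point is just that a lookup is a short walk of dependent steps, bounded by the address length, where a TCAM compares all entries at once.

```python
import ipaddress


class Node:
    """One trie node: two children (bit 0 / bit 1) and an optional next hop."""
    __slots__ = ("children", "next_hop")

    def __init__(self):
        self.children = [None, None]
        self.next_hop = None


class BinaryTrie:
    """Hypothetical IPv4 longest-prefix-match table, one bit per level."""

    def __init__(self):
        self.root = Node()

    def insert(self, prefix, next_hop):
        net = ipaddress.ip_network(prefix)
        bits = int(net.network_address)
        node = self.root
        # Walk (and create) one node per prefix bit, most significant first.
        for i in range(net.prefixlen):
            bit = (bits >> (31 - i)) & 1
            if node.children[bit] is None:
                node.children[bit] = Node()
            node = node.children[bit]
        node.next_hop = next_hop

    def lookup(self, address):
        """Return (next_hop, steps): the longest match and nodes visited."""
        bits = int(ipaddress.ip_address(address))
        node = self.root
        best = node.next_hop  # a 0.0.0.0/0 default route, if present
        steps = 0
        for i in range(32):
            node = node.children[(bits >> (31 - i)) & 1]
            if node is None:
                break
            steps += 1
            if node.next_hop is not None:
                best = node.next_hop  # remember the longest match so far
        return best, steps
```

Each step in lookup depends on the previous one, which is why a single lookup is slower than a TCAM hit. Independent packets can still be looked up concurrently, since lookups only read the table.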

Feel free to correct me if I'm mistaken or fill in any important
details I've glossed over.

Regards,
Bill Herrin


--
William Herrin
bill@herrin.us
https://bill.herrin.us/