My understanding of Juniper's approach to the problem is that instead of employing TCAMs for next-hop lookup, they use general purpose CPUs operating on a radix tree, exactly as you would for an all-software router.
They are absolutely not doing that with "general purpose CPUs". The LU block on early-generation Trios was a dedicated ASIC (LU by itself, then consolidated slightly); later-generation Trio put everything on a single chip, but again a dedicated ASIC.
To achieve an -aggregate- lookup speed comparable to a TCAM, they implement a bunch of these lookup engines as dedicated parallel subprocessors rather than using the router's primary compute engine.
You're correct that there is parallelism in the LU functions, but I still think you're kinda smushing together a bunch of stuff that's happening in different places.
On Fri, Sep 29, 2023 at 4:44 PM William Herrin <bill@herrin.us> wrote:
On Thu, Sep 28, 2023 at 10:29 PM Saku Ytti <saku@ytti.fi> wrote:
On Fri, 29 Sept 2023 at 08:24, William Herrin <bill@herrin.us> wrote:
Maybe. That's where my comment about CPU cache starvation comes into play. I haven't delved into the Juniper line cards recently so I could easily be wrong, but if the number of routes being actively used pushes past the CPU data cache, the cache miss rate will go way up and it'll start thrashing main memory. The net result is that the achievable PPS drops by at least an order of magnitude.
When you say you've not delved into the Juniper line cards recently, to which specific Juniper linecard does your comment apply?
Howdy,
My understanding of Juniper's approach to the problem is that instead of employing TCAMs for next-hop lookup, they use general purpose CPUs operating on a radix tree, exactly as you would for an all-software router. This makes each lookup much slower than a TCAM can achieve. However, that doesn't matter much: the lookup delays are much shorter than the transmission delays so it's not noticeable to the user. To achieve an -aggregate- lookup speed comparable to a TCAM, they implement a bunch of these lookup engines as dedicated parallel subprocessors rather than using the router's primary compute engine.
A TCAM lookup is approximately O(1) while a radix tree lookup is approximately O(log n). (Neither description is strictly correct but it's close enough to understand the running time.) Log n is pretty small so it doesn't take much parallelism for the practical run time to catch up to the TCAM.
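To make the running-time comparison concrete, here is a minimal sketch of longest-prefix match on a plain one-bit-at-a-time binary trie. It is an illustration only, not a description of how Juniper's lookup engines work: real radix trees compress single-child paths, and hardware implementations typically consume several bits per step. The names `TrieNode`, `insert`, and `lookup`, and the sample prefixes and next-hop labels, are made up for the example. The point it shows is that the number of nodes touched per lookup is bounded by the address width (32 for IPv4), however many routes are installed.

```python
import ipaddress


class TrieNode:
    __slots__ = ("children", "next_hop")

    def __init__(self):
        self.children = [None, None]  # child for bit 0, child for bit 1
        self.next_hop = None          # set only on nodes where a prefix ends


def insert(root, prefix, next_hop):
    """Add an IPv4 prefix such as '10.1.0.0/16' to the trie."""
    net = ipaddress.ip_network(prefix)
    addr = int(net.network_address)
    node = root
    for i in range(net.prefixlen):
        bit = (addr >> (31 - i)) & 1  # walk from the most significant bit
        if node.children[bit] is None:
            node.children[bit] = TrieNode()
        node = node.children[bit]
    node.next_hop = next_hop


def lookup(root, address):
    """Return (next_hop, nodes_visited) for the longest matching prefix."""
    addr = int(ipaddress.ip_address(address))
    node = root
    best = root.next_hop              # a /0 default route lives on the root
    visited = 1
    for i in range(32):
        node = node.children[(addr >> (31 - i)) & 1]
        if node is None:
            break                     # no longer prefix can match
        visited += 1
        if node.next_hop is not None:
            best = node.next_hop      # remember the most specific match so far
    return best, visited


# Hypothetical routing table for illustration.
root = TrieNode()
insert(root, "0.0.0.0/0", "default")
insert(root, "10.0.0.0/8", "hop-A")
insert(root, "10.1.0.0/16", "hop-B")

for dst in ("10.1.2.3", "10.9.9.9", "192.0.2.1"):
    print(dst, lookup(root, dst))
```

Each step of the walk is a dependent memory read, which is why the cache behaviour raised earlier in the thread matters for a software implementation: the step count stays small, but the cost of each step depends on where the node happens to live.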
Feel free to correct me if I'm mistaken or fill in any important details I've glossed over.
Regards,
Bill Herrin

-- 
William Herrin
bill@herrin.us
https://bill.herrin.us/