On Fri, 5 Aug 2022 at 20:31, <ljwobker@gmail.com> wrote:

Hey LJ,
> Disclaimer: I work for Cisco on a bunch of silicon. I'm not intimately familiar with any of these devices, but I'm familiar with the high level tradeoffs. There are also exceptions to almost EVERYTHING I'm about to say, especially once you get into the second- and third-order implementation details. Your mileage will vary... ;-)
I expected it might come to this; my question may be too specific to be answered without violating some NDA.
> If you have a model where one core/block does ALL of the processing, you generally benefit from lower latency, simpler programming, etc. A major downside is that to do this, all of these cores have to have access to all of the different memories used to forward said packet. Conversely, if you break up the processing into stages, you can only connect the FIB lookup memory to the cores that are going to be doing the FIB lookup, and only connect the encap memories to the cores/blocks that are doing the encapsulation work. Those interconnects take up silicon space, which equates to higher cost and power.
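(To check that I'm reading the interconnect argument right, here's a toy back-of-the-envelope sketch in Python. The table names, core counts and the one-memory-per-stage assumption are all made up for illustration, not taken from any real chip.)

# Toy model: count core-to-memory paths under each forwarding model.
# Everything below is hypothetical, purely to show the shape of the tradeoff.

MEMORIES = ["fib", "nexthop", "encap", "counters"]  # hypothetical lookup tables

def run_to_completion_paths(cores: int) -> int:
    # Every core can execute any step of the feature chain, so every
    # core needs a path to every memory.
    return cores * len(MEMORIES)

def pipelined_paths(lines: int) -> int:
    # Each line has one stage per step, and each stage is wired only
    # to the single memory its step uses.
    stages_per_line = len(MEMORIES)
    return lines * stages_per_line  # one memory path per stage

# Same total core count either way: 64 run-to-completion cores versus
# 16 lines of 4 stages (= 64 stage engines).
print(run_to_completion_paths(cores=64))  # 64 cores * 4 memories = 256 paths
print(pipelined_paths(lines=16))          # 16 lines * 4 stages * 1 = 64 paths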
That's an interesting answer to the generic question, i.e. that the cost of giving every core access to every memory versus the cost of a harder-to-program pipeline of cores is a balanced tradeoff, but I don't think it applies to my specific question. We can roughly think of FP as having a similar number of pipeline lines as Trio has PPEs, so a similar number of cores needs memory access, and possibly a higher number, as more than one core in each line will need it. So the question is more: why many less performant cores, where performance is achieved by arranging them in a pipeline, rather than fewer more performant cores, where each core works on a packet to completion, when the former has about as many lines as the latter has cores?
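(Continuing the same made-up model, this is roughly what I mean: if the pipelined part has about as many lines as the run-to-completion part has cores, the fan-in to the FIB memory ends up similar. Again, all names and numbers are hypothetical.)

# Count how many engines need a path to the FIB memory specifically.

def fib_clients_run_to_completion(cores: int) -> int:
    # Any core may perform the FIB lookup, so all of them are clients.
    return cores

def fib_clients_pipeline(lines: int, fib_stages_per_line: int = 1) -> int:
    # At least one stage per line does the FIB lookup; possibly more
    # if a line has to revisit the FIB.
    return lines * fib_stages_per_line

print(fib_clients_run_to_completion(cores=64))  # 64
print(fib_clients_pipeline(lines=64))           # 64 -- a similar fan-in to the FIB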
> Packaging two cores on a single device is beneficial in that you only have one physical chip to work with instead of two. This often simplifies the board designers' job, and is often lower power than two separate chips. This starts to break down as you get to exceptionally large chips as you bump into the various physical/reticle limitations of how large a chip you can actually build. With newer packaging technology (2.5D chips, HBM and similar memories, chiplets down the road, etc) this becomes even more complicated, but the answer to "why would you put two XYZs on a package?" is that it's just cheaper and lower power from a system standpoint (and often also from a pure silicon standpoint...)
Thank you for this, it does confirm that the benefits perhaps aren't as revolutionary as the presentation in the thread proposed. The presentation divided Trio's evolution into three phases, and putting multiple Trios on a package was presented as one of those big evolutions; perhaps some other division of generations would have been more communicative.
> Lots and lots of Smart People Time has gone into different memory designs that attempt to optimize this problem, and it's a major part of the intellectual property of various chip designs.
I choose to read this as 'where a lot of innovation happens, a lot of mistakes happen'. Hopefully we'll figure out a good answer here soon, as the answers vendors are ending up with are becoming increasingly visible compromises in the field. I suspect a large part of this is that cloudy shops represent, if not disproportionate revenue, then disproportionate focus, and their networks tend to be a lot more static in configuration and traffic than access/SP networks. When you have that quality, you can make increasingly broad assumptions, assumptions which don't play as well in SP networks.

--
++ytti