On Tue, Dec 24, 2024 at 8:26 AM Mike Hammett <nanog@ics-il.net> wrote:
In the articles I've read and videos I've watched, they have mentioned varying amounts of reduced power. I didn't commit them to memory because that wasn't the part I was interested in at the moment.


I'd think that, especially as data rates climb, power consumption is going to get really important, fast.
When a single device requires ~50 kW to run ... I think you'll want to make sure you have the space/power to deal with that :(

I'm not sure that distributed fabric plans make that problem any better? (Maybe it's all the same problem in the end, because the fabric interconnect is
going to be distance-limited/etc. too.)
 
Management of the things is a big thing I've been concerned about going into more modern systems. So often there's hand-waving regarding the orchestration piece of non-traditional systems. From what I've seen (and I would love to be wrong), you either build it in-house (not a small lift) or you buy something that ends up taking away all of the cost advantages that path had.


You almost certainly get into (pretty quickly) something that smells a bunch like:
  "here's my pile of ansible recipes for...."
  (choice of ansible here for example only, s/ansible/<whatever>/ of course to whatever you feel like)

That's maybe fine if that's your jam? I think it's hard to reason/plan/build without some automation plan 'now',
and it looks like a ton of folks start without one, then try to retrofit once "omg this is very large now... ugh" happens.
  (1-10 devices? sure, fine, do it by hand; 5-><bunches more>, you really ought to have had an automation plan at ~5 ... my opinion, clearly)
 
Failure domain stuff is part of what I'm trying to learn more about, which goes back to more about the fundamentals of how the fabric works.


Yeah... this part (reasoning about failure domains) I assume is also a tad hard.
A scenario is:
  "I built this 200T fabric, I interconnect to the outside with ~100T max and internally with ~100T"
Now that external ~100T breaks and (ideally!) everything on the outside re-routes around to a different front door... oops, are you prepared for an extra ~100T arriving there?
How do you deal with parts (fabric parts) failing in part? "oops, only 50T of my 100T can get through here and ... I am also still telling my external neighbors all's good"
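
For what it's worth, here's a rough sketch of that arithmetic in Python. Everything in it is made up for illustration (the FrontDoor type, the function names, the door names and the Tbps numbers), and it assumes traffic from a failed front door spreads across the survivors in proportion to their capacity, which real routing almost certainly won't do that neatly:

```python
from dataclasses import dataclass


@dataclass
class FrontDoor:
    """One external interconnect point (hypothetical model, not a real API)."""
    name: str
    capacity_tbps: float  # usable capacity toward the fabric
    load_tbps: float      # traffic currently arriving here


def overloaded_after_failure(doors, failed):
    """Return {door name: excess Tbps} for survivors that end up over capacity.

    Assumption: the failed door's load is re-spread across the surviving
    doors in proportion to their capacity.
    """
    survivors = [d for d in doors if d.name != failed]
    shifted = sum(d.load_tbps for d in doors if d.name == failed)
    total_cap = sum(d.capacity_tbps for d in survivors)
    over = {}
    for d in survivors:
        share = shifted * d.capacity_tbps / total_cap if total_cap else 0.0
        excess = d.load_tbps + share - d.capacity_tbps
        if excess > 0:
            over[d.name] = excess
    return over


def partial_failure_excess(door, surviving_capacity_tbps):
    """Traffic that no longer fits if only part of a door's capacity survives
    while the neighbors are still being told "all's good"."""
    return max(0.0, door.load_tbps - surviving_capacity_tbps)


# Illustrative numbers only.
doors = [
    FrontDoor("A", capacity_tbps=100.0, load_tbps=90.0),
    FrontDoor("B", capacity_tbps=100.0, load_tbps=70.0),
    FrontDoor("C", capacity_tbps=100.0, load_tbps=60.0),
]

print(overloaded_after_failure(doors, "A"))    # whole-door failure
print(partial_failure_excess(doors[0], 50.0))  # only 50T of 100T gets through
```

Even something this crude makes the point: a front door can be "fine" every day and still have nowhere near the headroom to absorb a neighbor's failure, and the partial-failure case only shows up if you stop advertising capacity you no longer have.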

Really, that failure-domain problem is tightly linked to the 'manage a ton of things' problem too, at least when it comes to containing damage quickly.