Thousands of hosts on a gigabit LAN, maybe not
Some people I know (yes really) are building a system that will have several thousand little computers in some racks. Each of the computers runs Linux and has a gigabit ethernet interface. It occurs to me that it is unlikely that I can buy an ethernet switch with thousands of ports, and even if I could, would I want a Linux system to have 10,000 entries or more in its ARP table. Most of the traffic will be from one node to another, with considerably less to the outside. Physical distance shouldn't be a problem since everything's in the same room, maybe the same rack. What's the rule of thumb for number of hosts per switch, cascaded switches vs. routers, and whatever else one needs to design a dense network like this? TIA R's, John
Sounds interesting. I wouldn't do more than a /23 (assuming IPv4) per subnet. Join them all together with a fast L3 switch. I'm still trying to visualize what several thousand tiny computers in a single rack might look like. Other than a cabling nightmare. 1000 RJ-45 switch ports is a good chunk of a rack itself.

Chuck
On Fri, May 8, 2015 at 2:53 PM, John Levine <johnl@iecc.com> wrote:
Some people I know (yes really) are building a system that will have several thousand little computers in some racks. Each of the computers runs Linux and has a gigabit ethernet interface. It occurs to me that it is unlikely that I can buy an ethernet switch with thousands of ports, and even if I could, would I want a Linux system to have 10,000 entries or more in its ARP table.
Most of the traffic will be from one node to another, with considerably less to the outside. Physical distance shouldn't be a problem since everything's in the same room, maybe the same rack.
What's the rule of thumb for number of hosts per switch, cascaded switches vs. routers, and whatever else one needs to design a dense network like this? TIA
consider the pain of also ipv6's link-local gamery. look at the nvo3 WG and its predecessor (which shouldn't have really existed anyway, but whatever, and apparently my mind helped me forget about the pain involved with this wg). I think 'why one lan'? why not just small (/26 or /24 max?) subnet sizes... or do it all in v6 on /64's with 1/rack or 1/~200 hosts.
Morrow's comment about the ARMD WG notwithstanding, there might be some useful context in https://tools.ietf.org/html/draft-karir-armd-statistics-01 Cheers, -Benson
On Fri, May 8, 2015 at 11:53 AM, John Levine <johnl@iecc.com> wrote:
Some people I know (yes really) are building a system that will have several thousand little computers in some racks.
Very cool-ly crazy.
Each of the computers runs Linux and has a gigabit ethernet interface. It occurs to me that it is unlikely that I can buy an ethernet switch with thousands of ports, and even if I could, would I want a Linux system to have 10,000 entries or more in its ARP table.
Agreed. :) You don't really want 10,000 entries in a routing FIB table either, but I was seriously encouraged by the work going on in linux 4.0 and 4.1 to improve those lookups. https://netdev01.org/docs/duyck-fib-trie.pdf I'd love to know the actual scalability of some modern routing protocols (isis, babel, ospfv3, olsrv2, rpl) with that many nodes too....
Most of the traffic will be from one node to another, with considerably less to the outside. Physical distance shouldn't be a problem since everything's in the same room, maybe the same rack.
That is an awful lot of ports to fit in a rack (48 ports, 36 2U slots in the rack (and is that too high?) = 1728 ports) A thought is you could make it meshier using multiple interfaces per tiny linux box? Put, say 3-6 interfaces and have a very few switches interconnecting given clusters (and multiple paths to each switch). That would reduce your arp table (and fib table) by a lot at the cost of adding hops...
What's the rule of thumb for number of hosts per switch, cascaded switches vs. routers, and whatever else one needs to design a dense network like this? TIA
max per vlan 4096. Still a lot. Another approach might be max density on a switch (48?) per cluster, routed (not switched) 10GigE to another 10GigE+ switch. I'd love to know the rule of thumbs here also, I imagine some rules must exist for those in the VM or VXLAN worlds.
R's, John
-- Dave Täht Open Networking needs **Open Source Hardware** https://plus.google.com/u/0/+EricRaymond/posts/JqxCe2pFr67
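A minimal sketch of the node side of the multi-interface mesh idea above, assuming /31 point-to-point links and a dynamic routing daemon (babeld, bird, or similar) left to do the actual path selection; the interface names and addresses are invented for illustration:

    # Hypothetical 3-NIC node in a routed mesh: one /31 per link, forwarding on.
    ip addr add 10.1.0.0/31 dev eth0
    ip addr add 10.1.0.2/31 dev eth1
    ip addr add 10.1.0.4/31 dev eth2
    # Let the node forward between its links; a routing protocol learns the rest.
    sysctl -w net.ipv4.ip_forward=1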
to have 10,000 entries or more in its ARP table.
Agreed. :) You don't really want 10,000 entries in a routing FIB table either, but I was seriously encouraged by the work going on in linux 4.0 and 4.1 to improve those lookups.
One obvious way to deal with that is to put some manageable number of hosts on a subnet and route traffic between the subnets. I think we can assume they'll all have 10/8 addresses, and I'm not too worried about performance to the outside world, just within the network. R's, John
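For concreteness, a sketch of what the node side of that could look like, with invented numbers: one /24 per rack carved out of 10/8, and the rack's L3 switch holding .1 as the gateway:

    # Hypothetical addressing plan: rack 7, node 23; 10.0.<rack>.0/24 per rack.
    ip addr add 10.0.7.23/24 dev eth0
    # Default route points at the rack's L3 switch interface.
    ip route add default via 10.0.7.1

Each node then carries at most a couple of hundred ARP entries, and the inter-rack forwarding lives in the L3 switch rather than in every host.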
- The more switches a packet has to go through, the higher the latency, so your response times may deteriorate if you cascade too many switches. Legend says up to 4 is a good number; any further and you risk creating a big mess.
- The more switches you add, the more of your bandwidth is consumed by broadcasts in the same subnet. http://en.wikipedia.org/wiki/Broadcast_radiation
- If you have only one connection between each switch, traffic between switches is going to be limited to that rate (1gbps in this case), possibly creating a bottleneck depending on your application and how exactly it behaves. Consider aggregating uplinks.
- Bundling too many Ethernet cables will cause interference (cross-talk), so keep that in mind. I'd purchase F/S/FTP cables and the like.

Here I am going off on a tangent: if your friends want to build a "super computer" then there's a way to calculate the most "efficient" number of nodes given your constraints (e.g. linear optimization). This could save you time, money and headaches. An example: maximize the number of TFLOPS while minimizing the number of nodes (i.e. number of switch ports). Just a quick thought.
On 05/08/2015 02:53 PM, John Levine wrote:
Some people I know (yes really) are building a system that will have several thousand little computers in some racks. Each of the computers runs Linux and has a gigabit ethernet interface. It occurs to me that it is unlikely that I can buy an ethernet switch with thousands of ports, and even if I could, would I want a Linux system to have 10,000 entries or more in its ARP table.
Most of the traffic will be from one node to another, with considerably less to the outside. Physical distance shouldn't be a problem since everything's in the same room, maybe the same rack.
What's the rule of thumb for number of hosts per switch, cascaded switches vs. routers, and whatever else one needs to design a dense network like this? TIA
Unless you have some dire need to get these all on the same broadcast domain, those kinds of numbers on a single L2 would send me running for the hills for lots of reasons, some of which you've identified. I'd find a good L3 switch and put no more than ~200-500 IPs on each L2 and let the switch handle gluing it together at L3. With the proper hardware, this is a fully line-rate operation and should have no real downsides aside from splitting up the broadcast domains (if you do need multicast, make sure your gear can do it). With a divide-and-conquer approach, you shouldn't have problems fitting the L2+L3 tables into even a pretty modest L3 switch.

The densest chassis switches I know of are going to get you about 96 ports per RU (48 ports each on a half-width blade, but you need breakout panels to get standard RJ45 8P8C connectors as the blades have MRJ21s), less rack overhead for power supplies, management, etc. That should get you ~2000 ports per rack [1]. Such switches can be quite expensive.

The trend seems to be toward stacking pizza boxes these days, though. Get the number of ports you need per rack (you're presumably not putting all 10,000 nodes in a single rack) and aggregate up one or two layers. This gives you a pretty good candidate for your L2/L3 split.

[1] Purely as an example, you can cram 3x Brocade MLX-16 chassis into a 42U rack (with 0RU to spare). That gives you 48 slots for line cards. Leaving at least one slot in each chassis for 10Gb or 100Gb uplinks to something else, 45x48 = 2160 1000BASE-T ports (electrically) in a 42U rack, and you'll need 45 more RU somewhere for breakout patch panels!

-- Brandon Martin
* lists.nanog@monmotha.net (Brandon Martin) [Fri 08 May 2015, 21:42 CEST]:
[1] Purely as an example, you can cram 3x Brocade MLX-16 chassis into a 42U rack (with 0RU to spare). That gives you 48 slots for line cards.
You really can't. Cables need to come from the top, not from the sides, or they'll block the path of other linecards. -- Niels.
On 05/08/2015 04:17 PM, Niels Bakker wrote:
* lists.nanog@monmotha.net (Brandon Martin) [Fri 08 May 2015, 21:42 CEST]:
[1] Purely as an example, you can cram 3x Brocade MLX-16 chassis into a 42U rack (with 0RU to spare). That gives you 48 slots for line cards.
You really can't. Cables need to come from the top, not from the sides, or they'll block the path of other linecards.
Hum, good point. "Cram" may not be a strong enough term :) It'd work on the horizontal slot chassis types (4/8 slot), but not the vertical (16/32 slot). You might be able to make it fit if you didn't care about maintainability, I guess. There's some room to maneuver if you don't care about being able to get the power supplies out, too. I don't recommend this approach... Those MRJ21 cables are not easy to work with as it is. -- Brandon Martin
John Levine wrote:
Some people I know (yes really) are building a system that will have several thousand little computers in some racks. Each of the computers runs Linux and has a gigabit ethernet interface. It occurs to me that it is unlikely that I can buy an ethernet switch with thousands of ports, and even if I could, would I want a Linux system to have 10,000 entries or more in its ARP table.
Most of the traffic will be from one node to another, with considerably less to the outside. Physical distance shouldn't be a problem since everything's in the same room, maybe the same rack.
What's the rule of thumb for number of hosts per switch, cascaded switches vs. routers, and whatever else one needs to design a dense network like this? TIA
It's become fairly commonplace to build supercomputers out of clusters of 100s or 1000s of commodity PCs; see, for example: www.rocksclusters.org http://www.rocksclusters.org/presentations/tutorial/tutorial-1.pdf or http://www.dodlive.mil/files/2010/12/CondorSupercomputerbrochure_101117_kb-3... (a cluster of 1760 playstations at AFRL Rome Labs)

Interestingly, all the documentation I can find is heavy on the software layers used to cluster resources - but there's little about hardware configuration other than pretty pictures of racks with lots of CPUs and lots of wires.

If the people you know are trying to do something similar - it might be worth some nosing around the Rocks community, or some phone calls. I expect that interconnect architecture and latency might be a bit of an issue for this sort of application.

Miles Fidelman

-- In theory, there is no difference between theory and practice. In practice, there is. .... Yogi Berra
Forgot to mention - you might also want to check out Beowulf clusters - there's an email list at http://www.beowulf.org/ - probably some useful info in the list archives, maybe a good place to post your query.

Miles
Linux has a (configurable) limit on the neighbor table. I know in RHEL variants, the default has been 1024 neighbors for a while:

net.ipv4.neigh.default.gc_thresh3
net.ipv4.neigh.default.gc_thresh2
net.ipv4.neigh.default.gc_thresh1
net.ipv6.neigh.default.gc_thresh3
net.ipv6.neigh.default.gc_thresh2
net.ipv6.neigh.default.gc_thresh1

These may be rough guidelines for performance or arbitrary limits someone thought would be a good idea. Either way, you'll need to increase them if you're using IP on Linux at this scale.

Although not explicitly stated, I would assume that these computers may be virtualized or inside some sort of blade chassis (which reduces the number of physical cables to a switch). Strictly speaking, I see no hardware limitation in your way, as most top of rack switches will easily do a few thousand or tens of thousands of MAC entries, and a few thousand hosts can fit inside a single IPv4 or IPv6 subnet. There are some pretty dense switches if you actually do need 1000 ports, but as others have stated, you'll utilize a good portion of the rack in cable and connectors.

--Blake
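For example, something along these lines on each node would raise those limits; the values are illustrative only and should be sized to comfortably exceed the largest subnet the hosts sit in:

    # Raise the neighbor-table GC thresholds well above the expected neighbor
    # count (illustrative values, not a recommendation).
    cat > /etc/sysctl.d/90-neigh.conf <<'EOF'
    net.ipv4.neigh.default.gc_thresh1 = 4096
    net.ipv4.neigh.default.gc_thresh2 = 8192
    net.ipv4.neigh.default.gc_thresh3 = 16384
    net.ipv6.neigh.default.gc_thresh1 = 4096
    net.ipv6.neigh.default.gc_thresh2 = 8192
    net.ipv6.neigh.default.gc_thresh3 = 16384
    EOF
    sysctl --system

gc_thresh1 is the level below which no garbage collection happens, gc_thresh2 is the soft maximum, and gc_thresh3 is the hard cap on table size.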
You may want to look at a Clos / leaf-spine architecture. This design tends to be optimized for east-west traffic, scales easily as bandwidth needs grow, and keeps things simple: L2/L3 boundary on the ToR switch, L3 ECMP from leaf to spine. Not a lot of complexity, and it scales fairly high on both leafs and spines.

Sk.
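As a rough illustration of the ECMP part, using Linux routing syntax as a stand-in for whatever the leaf switches actually run (addresses and interface names are invented):

    # A leaf with two equal-cost next hops toward two spines; how traffic is
    # balanced across the two uplinks depends on the implementation.
    ip route add 10.0.0.0/8 \
        nexthop via 10.255.0.1 dev eth4 \
        nexthop via 10.255.1.1 dev eth5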
Agree with many of the other comments. Smaller subnets (the /23 suggestion sounds good) with L3 between the subnets. <off topic> The first thing that came to mind was "Bitcoin farm!" then "Ask Bitmaintech" and then "I'd be more worried about the number of fans and A/C units". </off topic> Brian
On 2015-05-08 13:53, John Levine wrote:
Some people I know (yes really) are building a system that will have several thousand little computers in some racks.
How many racks? How many computers per rack unit? How many computers per rack? (How are you handling power?) How big is each computer? Do you want network cabling to be contained to each rack, or do you want to run the cable to a central networking/switching rack? Hmmmm, even a 6513 fully populated with PoE 48-port line cards (which could let you do power and network in the same cable - I think? does PoE work on gigabit these days?) would get you 12*48 = 576 ports. So.... a 48U rack minus 15U (I think the 6513 is 15U total) leaves you 33U. Can you fit 576 systems in 33U?

Each of the computers runs Linux and has a gigabit ethernet interface.

Copper?

It occurs to me that it is unlikely that I can buy an ethernet switch with thousands of ports

6515?

and even if I could, would I want a Linux system to have 10,000 entries or more in its ARP table.

Add more ram. That's always the answer. LOL.

Most of the traffic will be from one node to another, with considerably less to the outside. Physical distance shouldn't be a problem since everything's in the same room, maybe the same rack.

What's the rule of thumb for number of hosts per switch, cascaded switches vs. routers, and whatever else one needs to design a dense network like this? TIA

We need more data.
The real answer to this is being able to cram them into a single chassis which can multiplex the network through a backplane. Something like the HP Moonshot ARM system or the way others like Google build high density compute with integrated Ethernet switching.

Phil
On 2015-05-08 18:20, Phil Bedard wrote:
The real answer to this is being able to cram them into a single chassis which can multiplex the network through a backplane. Something like the HP Moonshot ARM system or the way others like Google build high density compute with integrated Ethernet switching.
I was going to suggest moonshot myself (I walk by a number of moonshot units daily). However it seemed like the systems were already selected and then someone was like "oh yeah, better ask netops how to hook these things we bought and didn't tell anyone about to the interwebz". (I mean that's not a 100% accurate description of my $DAYJOB at all). In which case, the standard response is "well gee whizz buddy, ya should of bought moonshot jigs. But now you have to buy pallet loads of chassis switches". Hope you have some money left over in your budget.
The standard 48 port with 2 port uplink 1U switch is far from full depth. You put them in the back of the rack and have the small computers in the front. You might even turn the switches around, so the ports face inwards into the rack. The network cables would be very short and go directly from the mini computers (Raspberry Pi?) to the switch, all within the one unit shelf.

Assuming a max sized rack with a depth of 90 cm and switches of maybe 30 cm, that leaves 60 cm to mount mini computers. That is approximately 12000 cubic cm of space per rack unit. A Raspberry Pi is approximately 120 cubic cm, so you might be able to fit 48 of them in that space. It would be a very tight fit indeed, but maybe not impossible.

As to the original question, I would have 48 computers in a subnet. This is the correct number because you would connect each shelf switch to a top of rack switch, and spend a few extra bucks on the ToR so that it can do layer 3 routing between shelves.

Regards, Baldur
On 2015-05-09 11:57, Baldur Norddahl wrote:
The standard 48 port with 2 port uplink 1U switch is far from full depth. You put them in the back of the rack and have the small computers in the front. You might even turn the switches around, so the ports face inwards into the rack. The network cables would be very short and go directly from the mini computers (Raspberry Pi?) to the switch, all within the one unit shelf.
Yes this. I presumed ras pi, but those don't have gigabit Ethernet. Then I realized: http://www.parallella.org/ (I've got one of these sitting on my standby shelf to be racked, which is what made me think of it). To the OP please do tell us more about what you are doing, it sounds very interesting.
On Fri, May 8, 2015 at 11:53 AM, John Levine <johnl@iecc.com> wrote:
Some people I know (yes really) are building a system that will have several thousand little computers in some racks. Each of the computers runs Linux and has a gigabit ethernet interface.
Though a bit off-topic, I ran into this project at the CascadeIT conference. I'm currently in corp IT that is Notes/Windows based, so I haven't had a good place to test it, but the concept is very interesting. The distributed way they monitor would greatly reduce bandwidth overhead. http://assimproj.org

The Assimilation Project is designed to discover and monitor infrastructure, services, and dependencies on a network of potentially unlimited size, without significant growth in centralized resources. The work of discovery and monitoring is delegated uniformly in tiny pieces to the various machines in a network-aware topology - minimizing network overhead and being naturally geographically sensitive. The two main ideas are:
- distribute discovery throughout the network, doing most discovery locally
- distribute the monitoring as broadly as possible in a network-aware fashion.
- use autoconfiguration and zero-network-footprint discovery techniques to monitor most resources automatically, during the initial installation and during ongoing system addition and maintenance.

-- Joe Hamelin, W7COM, Tulalip, WA, 360-474-7474
On 2015-05-08 12:53, John Levine wrote:
What's the rule of thumb for number of hosts per switch, cascaded switches vs. routers, and whatever else one needs to design a dense network like this? TIA
I won't pretend to know best practices, but my inclination would be to connect the devices to 48-port L2 ToR switches with 2-4 SFP+ uplink ports (a number of vendors have options for this), with the 10gbit ports aggregated to a 10gbit core L2/L3 switch stack (ditto). I'm not sure I'd attempt this without 10gbit to the edge switches, due to Rafael's aforementioned point of the bottleneck/loss of multiple ports for trunking. Not knowing the architectural constraints, I'd probably go with others' advice of limiting L2 zones to 200-500 hosts, which would probably amount to 4-10 edge switches per VLAN. Dang. The more I think about this project, the more expensive it sounds. Jima
On Fri, May 8, 2015 at 5:19 PM, Jima <nanog@jima.us> wrote:

Dang. The more I think about this project, the more expensive it sounds.

Naw, just use WiFi. ;)

-- Joe Hamelin, W7COM, Tulalip, WA, 360-474-7474
On 9 May 2015, at 1:53, John Levine wrote:
What's the rule of thumb for number of hosts per switch, cascaded switches vs. routers, and whatever else one needs to design a dense network like this?
Most of the major switch vendors have design guides and other examples like this available (this one is Cisco-specific): <http://www.cisco.com/c/en/us/td/docs/solutions/Enterprise/Data_Center/VMDC/3-0-1/DG/VMDC_3-0-1_DG/VMDC301_DG3.html> Some organizations like Facebook have also taken the time to write up their approaches and make them publicly available: <https://code.facebook.com/posts/360346274145943/introducing-data-center-fabric-the-next-generation-facebook-data-center-network/> ----------------------------------- Roland Dobbins <rdobbins@arbor.net>
On 05/08/2015 02:53 PM, John Levine wrote:
... Most of the traffic will be from one node to another, with considerably less to the outside. Physical distance shouldn't be a problem since everything's in the same room, maybe the same rack.
What's the rule of thumb for number of hosts per switch, cascaded switches vs. routers, and whatever else one needs to design a dense network like this? TIA
You know, I read this post and immediately thought 'SGI Altix'........ scalable to 512 CPU's per "system image" and 20 images per cluster (NASA's Columbia supercomputer had 10,240 CPUs in that configuration.....twelve years ago, using 1.5GHz 64-bit RISC CPUs running Linux.... my, how we've come full circle.... (today's equivalent has less power consumption, at least....)). The NUMA technology in those Altix CPU's is a de-facto 'memory-area network' and thus can have some interesting topologies.

Clusters can be made using nodes with at least two NICs in them, and no switching. With four or eight ports you can do some nice mesh topologies. This wouldn't be L2 bridging, either, but a L3 mesh could be made that could be rather efficient, with no switches, as long as you have at least three ports per node, and you can do something reasonably efficient with a switch or two and some chains of nodes, with two NICs per node. L3 keeps the broadcast domain size small, and broadcast overhead becomes small.

If you only have one NIC per node, well, time to get some seriously high-density switches..... but even then how many nodes are going to be per 42U rack? A top-of-rack switch may only need 192 ports, and that's only 4U, with 1U 48 port switches. 8U you can do 384 ports, and three racks will do a bit over 1,000. Octopus cables going from an RJ21 to 8P8C modular are available, so you could use high-density blades; Cisco claims you could do 576 10/100/1000 ports in a 13-slot 6500. That's half the rack space for the switching. If 10/100 is enough, you could do 12 of the WS-X6196-21AF cards (or the RJ-45 'two-ports-per-plug' WS-X6148X2-45AF) and get in theory 1,152 ports in a 6513 (one SUP; drop 96 ports from that to get a redundant SUP).

Looking at another post in the thread, these moonshot rigs sound interesting.... 45 server blades in 4.3U. 4.3U?!?!? Heh, some custom rails, I guess, to get ten in 47U. They claim a quad-server blade, so 1,800 servers (with networking) in a 47U rack. Yow. Cost of several hundred thousand dollars for that setup.

The effective limit on subnet size would be of course broadcast overhead; 1,000 nodes on a /22 would likely be painfully slow due to broadcast overhead alone.
On Sat, 2015-05-09 at 17:06 -0400, Lamar Owen wrote:
The effective limit on subnet size would be of course broadcast overhead; 1,000 nodes on a /22 would likely be painfully slow due to broadcast overhead alone.
Would be interesting to see how IPv6 performed, since this is one of the things it was supposed to be able to deliver - massively scalable links (equivalent to an IPv4 broadcast domain) via massively reduced protocol chatter (IPv6 multicast groups vs IPv4 broadcast), plus fully automated L3 address assignment.

IPv4 ARP, for example, hits every on-subnet neighbour; the IPv6 equivalent uses multicast to hit only those neighbours that happen to share the same 24 low-end L3 address bits as the desired target - a statistically much smaller subset of on-link neighbours, and in "normal" subnets typically only one host. Only chatter that really should go to all hosts does so - such as router advertisements.

Regards, K.

-- ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ Karl Auer (kauer@biplane.com.au) http://www.biplane.com.au/kauer http://twitter.com/kauer389 GPG fingerprint: 3C41 82BE A9E7 99A1 B931 5AE7 7638 0147 2C3C 2AC4 Old fingerprint: EC67 61E2 C2F6 EB55 884B E129 072B 0AF0 72AA 9882
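Concretely, with an invented address: a host with fe80::5054:ff:fe12:3456 joins the solicited-node group ff02::1:ff12:3456, i.e. the ff02::1:ff00:0/104 prefix plus the low 24 bits of the target address, so a neighbor solicitation for it only interrupts hosts that happen to share those 24 bits. The groups an interface has actually joined can be listed with:

    # Show the IPv6 multicast groups (including solicited-node groups) on eth0.
    ip -6 maddr show dev eth0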
On 09/05/2015 23:33, Karl Auer wrote:
IPv4 ARP, for example, hits every on-subnet neighbour; the IPv6 equivalent uses multicast to hit only those neighbours that happen to share the same 24 low-end L3 address bits as the desired target - a statistically much smaller subset of on-link neighbours, and in "normal" subnets typically only one host. Only chatter that really should go to all hosts does so - such as router advertisements.
Except when the IPv6 solicited-node multicast groups cause $VENDOR switch meltdown: http://blog.bimajority.org/2014/09/05/the-network-nightmare-that-ate-my-week...
On 10/05/2015 00:33, Karl Auer wrote:
Would be interesting to see how IPv6 performed, since this is one of the things it was supposed to be able to deliver - massively scalable links (equivalent to an IPv4 broadcast domain) via massively reduced protocol chatter (IPv6 multicast groups vs IPv4 broadcast), plus fully automated L3 address assignment.
It will perform badly, because putting large numbers of hosts in a single broadcast domain is a bad idea no matter what the protocol.

If you have a very large L2 domain and you use router advertisements to handle your default gateway announcement, you'll probably end up trashing your routers due to periodic neighbor solicitation messages. If you don't use tight timers, your failover convergence time will be trash. On the other hand, the tighter the timers, the more you'll trash your routers, particularly if there is a failover event - in other words, exactly when you don't want to stress the network. In the best case, the gateway unavailability mttr will be around 5-10 seconds and it will be non-deterministic. This means that if you want router failover which actually works, you will need to use a first-hop routing protocol like vrrp or similar.

You will probably want to disable all multicast snooping on your network because of ipv6 chatter. Pushing state requirements into the L2 forwarding mechanism turns out not to be a good idea, especially at scale - see the bimajority.org url that someone else posted on this thread, which is as much about poor switch implementation as it is about poor protocol design and solving problems that are a lot less relevant on today's networks. This will mean that you will also need to manually prune the scope of your dot1q network domain, because otherwise the multicast chatter will be spammed network-wide across all vlans on which it's defined.

RA gives the operator no way of controlling which IP address is assigned to which hosts, which means that the operator of the large l2 domain is likely to want to disable SLAAC if they plan to have any input on what IP address is assigned to what host. This may or may not be important to the operator. If it's hosts on a hot-seated corporate lan, it probably doesn't matter too much. If it's a service provider selling ipv6 services, it matters a lot. Regardless of whether this is the case, RA guard on each end-point is a necessity, and if you don't have it, your control plane will be compromised. RA guard is more complicated than ARP / DHCP guard and is not well supported on a lot of hardware.

Finally, if you have a connectivity problem in a large l2 domain, your problem surface area is much greater than if you segment your network into smaller chunks, so the scope of any outage will be a lot larger.

Nick
Juniper OCX1100 has 72 ports in 1U. And you can tune the Linux IPv4 neighbor table: https://ams-ix.net/technical/specifications-descriptions/config-guide#11

-- Eduardo Schoedler
You did not mention low cost before ;)

On Saturday, 9 May 2015, John Levine <johnl@iecc.com> wrote:

In article <CAHf3uWyPQn1NS_umjZ-zNuk3i5uFcZBu9L39b-crovG6yUm2qA@mail.gmail.com> you write:

Juniper OCX1100 has 72 ports in 1U.
Yeah, too bad it costs $32,000. Other than that it'd be perfect.
R's, John
-- Eduardo Schoedler
If you need that kind of density, I recommend a Clos fabric. Arista, Juniper, Brocade, Big Switch BCF and Cisco all have solutions that would allow you to build a high-density leaf/spine. You can build the Cisco solution with NXOS or ACI, depending which models you choose. The prices on these solutions are all somewhat in the same ballpark based on list pricing I've seen... even Cisco (the Nexus 9k is surprisingly in the same range as branded whitebox). There is also Pluribus which offers a fabric, but their niche is having server procs on board the switches and it seems like your project involves physical rather than virtual servers. Still, the Pluribus could be used without taking advantage of the on board server compute I suppose.

I also recommend looking into a solution that supports VXLAN (or GENEVE, or whatever overlay works for your needs) simply because MAC is carried in Layer-3 so you won't have to deal with spanning tree or monstrous mac tables. But you don't need to do an overlay if you just segment with traditional VLANs.

I'm guessing you don't need HA (A/B uplinks utilizing LACP) for these servers? Also, do you need line rate forwarding? Having 1,000 devices with 1Gb uplinks doesn't necessarily mean that full throughput is required... the clustering and the applications may be sporadic and bursty? I have seen load-testing clusters, hadoop and data warehousing pushing high volumes but the individual NICs in the clusters never actually hit capacity... If you need line-rate, then you need to do a deep dive with several of the vendors because there are significant differences in buffers on some models.

And... what support do you need? Just one spare on the shelf or full vendor support on every switch? That will impact which vendor you choose.

I'd like to hear more about this effort once you get it going. Which vendor you went with, how you tuned it, and why you selected who you did. Also, how it works.

LFoD
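For what it's worth, Linux itself can terminate such an overlay; a minimal, purely illustrative VXLAN segment using multicast flood-and-learn (a production fabric would more likely use an EVPN control plane, and all the values here are invented):

    # Hypothetical VXLAN segment 100 riding over eth0, flooding unknown and
    # broadcast traffic via multicast group 239.1.1.1 on the standard UDP port.
    ip link add vxlan100 type vxlan id 100 group 239.1.1.1 dev eth0 dstport 4789
    ip link set vxlan100 up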
Also, do you need line rate forwarding? Having 1,000 devices with 1Gb uplinks doesn't necessarily mean that full throughput is required... the clustering and the applications may be sporadic and bursty?
It's definitely sporadic and bursty. There's another network for high-speed traffic among the nodes. The Ethernet is for stuff like program loading from NFS servers.
And... what support do you need? Just one spare on the shelf or full vendor support on every switch?
Spare on the shelf, definitely. R's, John
participants (25)
- Baldur Norddahl
- Benson Schliesser
- Blake Hudson
- Brandon Martin
- Brian R
- Bruce Simpson
- c b
- charles@thefnf.org
- Christopher Morrow
- Chuck Church
- Dave Taht
- Eduardo Schoedler
- Jima
- Joe Hamelin
- John Levine
- John R. Levine
- Karl Auer
- Lamar Owen
- Miles Fidelman
- Nick Hilliard
- Niels Bakker
- Phil Bedard
- Rafael Possamai
- Roland Dobbins
- Sameer Khosla