FYI: An easy way to build a server cluster without top-of-rack switches (MEMO)
Hi all!

We wrote up a TIPS memo on the concept of "an easy way to build a server cluster without top-of-rack switches". This model reduces switch and cable costs and offers high network durability through a lightweight, simple configuration. If you are interested, please try this concept yourself ;-)

An easy way to build a server cluster without top-of-rack switches (MEMO)
http://slidesha.re/1EduYXM

Best regards,

--
Naoto MATSUMOTO
I'm having a hard time seeing how this reduces cable costs or increases network durability. Each individual server is well connected to 3-4 other servers in the rack, but the rack still has only two uplinks. For many servers in the rack you're adding 3-4 routing hops between an end node and the rack uplink.

Additionally, with only two external links tied to two specific nodes, you introduce more risk. If one of the uplink nodes fails, you've got to re-route all of the nodes that were using it as the shortest path to now exit through the other uplink node -- the worst case in the example then increases from the original 4 hops-to-exit to 7 hops-to-exit.

As far as cable costs go, you might have slightly shorter cables, but a far more complex wiring pattern -- so in essence you're trading a small amount of cable cost for a larger amount of installation and troubleshooting cost.

Also, this layout dramatically reduces the effective bandwidth available between devices, since per-device links now have to carry backhaul/transit traffic in addition to device-specific traffic.

Finally, you have to manage per-server routing service configurations to make this work -- more points of failure and increased setup/troubleshooting cost. In a ToR switch scenario, you do one config on one switch, plug in the cables, and you're done; when problems happen, you go to the one switch instead of chasing a needle through a haystack of interconnected servers.

If your RU count is worth more than the combination of increased installation, server configuration, troubleshooting, latency, and capacity costs, then this is a good solution. Either way, it's a neat idea and a fun thought experiment to work through.

Thanks!
Dan
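[Editor's note: to make the hop arithmetic above concrete, here is a minimal Python sketch of the worst-case exit-hop calculation. The topology is a hypothetical 12-node ring with uplinks on two opposite nodes -- the memo's actual wiring differs, so the numbers below illustrate the method, not the slide's 4-to-7 figures.]

from collections import deque

def hops_to_exit(adj, uplinks):
    # Multi-source BFS from the uplink set; returns the worst-case
    # hop count any node needs to reach its nearest uplink.
    dist = {u: 0 for u in uplinks}
    q = deque(uplinks)
    while q:
        v = q.popleft()
        for w in adj[v]:
            if w not in dist:
                dist[w] = dist[v] + 1
                q.append(w)
    return max(dist.values())

N = 12
ring = {i: [(i - 1) % N, (i + 1) % N] for i in range(N)}

print(hops_to_exit(ring, [0, 6]))   # both uplink nodes alive -> 3
# Simulate losing uplink node 0 by dropping it from the graph entirely:
degraded = {i: [n for n in ring[i] if n != 0] for i in ring if i != 0}
print(hops_to_exit(degraded, [6]))  # single surviving uplink -> 5

[Even in this symmetric toy ring, one uplink failure nearly doubles the worst case; a denser mesh changes the constants but not the shape of the tradeoff.]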
We did something similar way back in the day (2001?) when GbE switches were ridiculously expensive and we wanted many nodes instead of expensive gear. The (deplorably hot!) NatSemi 83820 GbE cards were a mere $40 or so, however. The uplink for loading data via NFS/control was the onboard FE (via desktop 8-port Surecoms), but 2x GbE was used for inter-node traffic.

GROMACS, a molecular modeller, only talked to adjacent nodes, so we just hooked up a linear network A-B-C-D-E-F-A in a loop. With 40 nodes, though, some nodes had 3 cards in them to effectively make two separate smaller cluster loops (A-B-C-A and D-E-F-D, for example) without having to visit the machine and move cards around. Perfectly reasonable where A talks only to B and C (or F); a ridiculous concept for A talking to C, however. Network latency was our big constraint on GROMACS' speed, of course -- thus the GbE -- so multihop would have been totally anathema.

While our cluster ran about twice as slow per job (with no net gain in speed beyond 16-20 nodes due to latency catching up with us) as the way more pricey competing quote's InfiniBand-based solution, their total of 8 nodes was no match for us running 5 jobs in parallel on our 40 nodes for the same cost :) Considering the lab had multiple grad students in it, there was ample opportunity for running multiple jobs at the same time -- while this may have thrashed the CPU cache (and increased our memory requirements slightly) in terms of pure compute efficiency, the end throughput per dollar and happiness-per-grad-student was far higher.

Feel free to trawl the archives on beowulf-l ca. 2001-2 for more details of dirt-cheap cluster design (or reply to me directly). Here are some pics of the cluster, but please keep in mind we were young and foolish. :)

http://sizone.org/m/i/velocet/cluster_construction/133-3371_IMG.JPG.html

/kc
-- Ken Chase - ken@heavycomputing.ca skype:kenchase23 +1 416 897 6284 Toronto Canada Heavy Computing - Clued bandwidth, colocation and managed linux VPS @151 Front St. W.
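[Editor's note: a toy model of Ken's loop wiring, with node names assumed for illustration rather than taken from his actual config. It shows the traffic pattern GROMACS' neighbour-only communication needs -- and why anything non-adjacent would have meant latency-fatal forwarding.]

# Each node in the loop talks only to its two ring neighbours.
nodes = ["A", "B", "C", "D", "E", "F"]
ring = {n: {nodes[i - 1], nodes[(i + 1) % len(nodes)]} for i, n in enumerate(nodes)}

assert ring["A"] == {"F", "B"}   # adjacent: one hop, fine for neighbour-only MD traffic
assert "C" not in ring["A"]      # A -> C would need forwarding through B: multihop, hence "anathema"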
Hi Dan and Ken,

I respect your great work. Certainly, our scenario is a network classic, and it is not a "one size fits all" network architecture. Many people built centralized and decentralized networks years ago, and some produced implementations like this one:

Interconnect of the K computer (Tofu: torus fusion)
https://www.fujitsu.com/global/Images/fujitsu-hpc-roadmap-beyond-petascale-c...

I agree with your points. Our approach requires a simpler way of doing physical and logical network engineering, and a change of mindset, I think (e.g. in cabling procedure, troubleshooting, and handling).

But some people need a more cost-effective server cluster environment and don't care about network latency -- low-end web hosting, for example. (See Intel's "Diversity of Server Workloads": http://bit.ly/1BgFH65 [JPG].)

Today, many people do not use Dijkstra and automaton theory on the server side, but it is a great mechanism for network durability if well controlled. Ethernet NIC bandwidth is increasing day by day, and the cost is decreasing too.

I say again: our scenario is not a "one size fits all" network architecture, but we believe it will prove useful for some people's work. ;-)

Best regards,

--
Naoto MATSUMOTO
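[Editor's note: a minimal Python sketch of the "Dijkstra on the server side" durability point. The five-server mesh and its link costs are hypothetical, not taken from the memo; the point is that each server can recompute shortest paths itself, so routes converge on the survivors after a node failure.]

import heapq

def dijkstra(adj, src):
    # Classic Dijkstra: shortest-path cost from src to every reachable node.
    dist = {src: 0}
    pq = [(0, src)]
    while pq:
        d, v = heapq.heappop(pq)
        if d > dist[v]:
            continue  # stale queue entry
        for w, cost in adj[v].items():
            nd = d + cost
            if nd < dist.get(w, float("inf")):
                dist[w] = nd
                heapq.heappush(pq, (nd, w))
    return dist

# Hypothetical 5-server mesh with equal-cost links:
mesh = {
    "s1": {"s2": 1, "s3": 1},
    "s2": {"s1": 1, "s4": 1},
    "s3": {"s1": 1, "s4": 1},
    "s4": {"s2": 1, "s3": 1, "s5": 1},
    "s5": {"s4": 1},
}
print(dijkstra(mesh, "s1"))  # {'s1': 0, 's2': 1, 's3': 1, 's4': 2, 's5': 3}

# Durability: lose s2 and the recomputation converges via s3 automatically.
degraded = {v: {w: c for w, c in nbrs.items() if w != "s2"}
            for v, nbrs in mesh.items() if v != "s2"}
print(dijkstra(degraded, "s1"))  # {'s1': 0, 's3': 1, 's4': 2, 's5': 3}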
BTW: this scenario's combination has another application for us, as below.

High Availability Server Clustering without an ILB (Internal Load Balancer) (MEMO)
http://slidesha.re/1vld6uB

--
Naoto MATSUMOTO