I first sent to an IX-specific mailing list, but as I have yet to see the message hit the list, I figured I would post it here as well.

We've had multiple requests for 100G interfaces (instead of Nx10G) and no one seems to care about the 40G interfaces we have available.

Looking at cost effective options brings us to whitebox switches. Obviously there's a wide range of hardware vendors and there are a few OSes available as well. Cumulus seems to be the market leader, while IP Infusion seems to be the most feature-rich.

We're not doing any automation on the switches at this time, so it would still need decent manual configuration. It wouldn't need a Cisco-centric CLI, as we're quite comfortable managing standard Linux-type config files. We're not going all-in on some overlay either, given that we wouldn't be replacing our entire infrastructure, only supplementing it where we need 100G. I know that LINX has gone IP Infusion. What OS would be appropriate for our usage? I'm not finding many good comparisons of the OSes out there. I'm assuming any of them would work, but there may be gotchas that a "cheapest that meets requirements" approach doesn't quite unveil.

Any particular hardware platforms to go towards or avoid? Broadcom Tomahawk seems to be quite popular with varying control planes. LINX went Edgecore, which was on my list given my experience with other Accton brands. Fiberstore has a switch where they actually publish the pricing vs. a bunch of mystery.

Thoughts?

----- Mike Hammett Intelligent Computing Solutions http://www.ics-il.com
Midwest-IX http://www.midwest-ix.com
On Aug 20, 2017, at 08:45, Mike Hammett <nanog@ics-il.net> wrote: <snip>
Any particular hardware platforms to go towards or avoid? Broadcom Tomahawk seems to be quite popular with varying control planes. LINX went Edgecore, which was on my list given my experience with other Accton brands. Fiberstore has a switch where they actually publish the pricing vs. a bunch of mystery.
Tomahawk and Tomahawk 2 have precious little in the way of packet buffer (e.g. as little as 4 x 4 MB for the original Tomahawk), which might be a problem in an environment where you need to rate-convert 100G-attached peers to a big bundle of 10s. Whitebox Broadcom DNX/Jericho is somewhat less common but does exist. That 40G is less popular is, I think, no surprise; it was / is largely consigned to datacenter applications.
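For anyone who wants to put rough numbers on that, here is a back-of-the-envelope sketch in Python. The 4 x 4 MB figure comes from the message above; the 50 ms RTT and the flow counts are assumptions for illustration only, and the classic RTT x bandwidth / sqrt(n) rule is just a heuristic, not a vendor spec.

    # Rough buffer arithmetic (assumed RTT and flow counts, not measurements)
    from math import sqrt

    def rule_of_thumb_buffer_bytes(rtt_s, bottleneck_bps, n_flows=1):
        """Buffering suggested by the classic RTT * C / sqrt(n) heuristic."""
        return rtt_s * bottleneck_bps / 8 / sqrt(n_flows)

    tomahawk_shared = 4 * 4 * 1024 * 1024                        # ~16 MB total, 4 MB per pipe (per the post)
    one_flow   = rule_of_thumb_buffer_bytes(0.05, 10e9)          # single flow, 50 ms RTT, 10G egress
    many_flows = rule_of_thumb_buffer_bytes(0.05, 10e9, 1000)    # 1000 flows sharing that egress

    print(f"Tomahawk shared buffer : {tomahawk_shared / 1e6:6.1f} MB")
    print(f"Heuristic, 1 flow      : {one_flow / 1e6:6.1f} MB")
    print(f"Heuristic, 1000 flows  : {many_flows / 1e6:6.1f} MB")

With a single large flow stepping down to a 10G egress, the heuristic suggests far more buffer than the chip has; with many concurrent flows the gap narrows, which is essentially the argument for knowing your traffic mix before choosing the silicon.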
DNX/Jericho would have sufficient buffers to handle the rate conversions?

----- Mike Hammett Intelligent Computing Solutions http://www.ics-il.com
Midwest-IX http://www.midwest-ix.com

----- Original Message -----
From: "Joel Jaeggli" <joelja@bogus.com> To: "Mike Hammett" <nanog@ics-il.net> Cc: "NANOG list" <nanog@nanog.org> Sent: Sunday, August 20, 2017 11:07:07 AM Subject: Re: 100G - Whitebox
<snip>
On 08/20/17 13:30, Mike Hammett wrote:
DNX/Jericho would have sufficient buffers to handle the rate conversions?
You could try Mellanox. Some of the promotional material I've seen/heard indicates their focus on appropriately sized buffers and low packet loss during rate conversion.
-- Raymond Burkholder https://blog.raymond.burkholder.net/
Why don't we just swap out your 40g switch for a 100g switch? You've had the 40g one for a while, and we anticipate upgrades every 18-24 months. -Bill
On Aug 20, 2017, at 08:46, Mike Hammett <nanog@ics-il.net> wrote:
<snip>
The only viable merchant silicon that would be useful for an IXP is the StrataDNX family, which houses the Jericho/Qumran/Petra/Arad chips from Broadcom. Having no packet buffer in the exchange point will shred performance significantly, especially when one of your bursty 100G customers starts sending data into 1G/10G customers.

To the best of my knowledge, the only ones that offer DNX in whitebox fashion are Agema and Edgecore. But why whitebox? Except on a very few occasions, whitebox is just "I like paying for hardware and software on different invoices"; the TCO is just about the same. As an exchange point I also see that it can be hard to reap the benefits of all the hipstershit going on in these NOS startups: you want spanning-tree, port-security, something to load-balance over links, and perhaps an overlay technology like VXLAN if the IXP becomes too big and distributed. This is almost too easy.

Whenever I see unbuffered mixed-speed IXPs, I ask if I can pay 25% of the port cost, since that is actually how much oomph I would get through the port.

// hugge @ 2603
20 aug. 2017 kl. 18:40 skrev Bill Woodcock <woody@pch.net>:
Why don't we just swap out your 40g switch for a 100g switch? You've had the 40g one for a while, and we anticipate upgrades every 18-24 months.
-Bill
On Aug 20, 2017, at 08:46, Mike Hammett <nanog@ics-il.net> wrote:
<snip>
Fredrik Korsbäck wrote:
The only viable merchant silicon that would be useful for an IXP is the StrataDNX family, which houses the Jericho/Qumran/Petra/Arad chips from Broadcom. Having no packet buffer in the exchange point will shred performance significantly, especially when one of your bursty 100G customers starts sending data into 1G/10G customers.
To the best of my knowledge, the only ones that offer DNX in whitebox fashion are Agema and Edgecore. But why whitebox? Except on a very few occasions, whitebox is just "I like paying for hardware and software on different invoices"; the TCO is just about the same. As an exchange point I also see that it can be hard to reap the benefits of all the hipstershit going on in these NOS startups: you want spanning-tree, port-security, something to load-balance over links, and perhaps an overlay technology like VXLAN if the IXP becomes too big and distributed. This is almost too easy.
so yeah, hmm. spanning-tree: urgh. port-security: no thank you, static ACLs only. core ECMP/LAG load-balancing: don't use VPLS.

Buffering is hard and comes down to having a good understanding of the cost/benefit trade-offs of your core network and your stakeholders' requirements. The main problem category that IXPs will find difficult to handle is the set of situations where two participant networks are exchanging individual traffic streams at a rate comparable to the egress port speed of the receiving network. This could be 100G-connected devices sending 50G-80G traffic streams to other 100G-connected devices, but the other main situation where this occurs is high-speed CDNs sending traffic to access networks where the individual ISP->customer links are provisioned in roughly the same speed category as the IXP-ISP link. For example, a small provider doing high-speed GPON or DOCSIS with a small IXP port, e.g. because it's only for a couple of small peers, or maybe it's a backup port or something. In that situation, TCP streams will flood the IXP port at rates comparable to the ISP access layer. If you're not buffering in that situation, the ISP will end up in trouble because they'll drop packets like crazy, and the IXP can end up in trouble because shared buffers will be exhausted and may be unavailable for other IXP participants.

Mostly you can engineer around this, but it's not as simple as saying that small-buffer switches aren't suitable for an IXP. They can be, but it depends on the network engineering requirements of the IXP participants and how the IXP is designed. No simple answers here, sorry :-(

The flip side of this is that individual StrataDNX ASICs don't offer the raw performance grunt of the StrataXGS family, and there is a serious difference in cost and performance between a 1U box with a single Tomahawk and a 2U box with a 4-way Jericho config, make no mistake about it.

Otherwise, white boxes can reduce costs if you know what you're doing, and you're spot on that TCO should be one of the main determinants if the performance characteristics are otherwise similar. TCO is a strange beast though: everything from kit capex to FLS to depreciation term counts towards TCO, so it's important to understand your cost base and organisational costing model thoroughly before making any decisions.

Nick
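To make the shared-buffer exhaustion point above concrete, here is a minimal sketch. The 16 MB pool and the 20G-into-10G mismatch are assumed figures chosen for illustration, not measurements from any particular switch.

    # How long a small shared buffer survives a sustained rate mismatch (assumed numbers)
    def time_to_exhaust_s(buffer_bytes, ingress_bps, egress_bps):
        """Seconds until the queue overflows if ingress exceeds egress continuously."""
        excess_bps = ingress_bps - egress_bps
        if excess_bps <= 0:
            return float("inf")   # no standing queue builds up
        return buffer_bytes * 8 / excess_bps

    shared = 16 * 1024 * 1024   # hypothetical shared pool, in bytes

    # a burst arriving at 20 Gb/s toward a member whose port drains at 10 Gb/s:
    print(f"{time_to_exhaust_s(shared, 20e9, 10e9) * 1000:.1f} ms")   # roughly 13 ms
    # the same burst toward a member with 100G of headroom:
    print(time_to_exhaust_s(shared, 20e9, 100e9))                     # inf (never fills)

A shared pool that can be emptied in a few milliseconds by one mismatched pair of participants is exactly the "unavailable for other IXP participants" failure mode described above.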
On Sun, 20 Aug 2017, Nick Hilliard wrote:
Mostly you can engineer around this, but it's not as simple as saying that small-buffer switches aren't suitable for an IXP.
Could you please elaborate on this? How do you engineer around having basically no buffers at all, especially if these very small buffers are shared between ports?

-- Mikael Abrahamsson email: swmike@swm.pp.se
Mikael Abrahamsson wrote:
On Sun, 20 Aug 2017, Nick Hilliard wrote:
Mostly you can engineer around this, but it's not as simple as saying that small-buffer switches aren't suitable for an IXP.
Could you please elaborate on this?
How do you engineer around having basically no buffers at all, and especially if these very small buffers are shared between ports.
You assess and measure, then choose the appropriate set of tools to deal with your requirements and which is cost-appropriate for your financials, i.e. the same as in any engineering situation.

At an IXP, it comes down to the maximum size of TCP stream you expect to transport. This will vary depending on the stakeholders at the IXP, which usually depends on the size of the IXP. Larger IXPs will have a wider traffic remit and probably a much larger variance in this regard. Smaller IXPs typically transport content-to-access-network data, which is usually well-behaved traffic.

Traffic drops in the core need to be kept to a minimum, particularly during normal operation. Eliminating traffic drops entirely is unnecessary and unwanted because of how IP works, so in your core you need to aim for either link overengineering or else enough buffering to ensure that site-to-site latency does not exceed X ms and Y% packet loss. Each option has a cost implication.

At the IXP participant edge, there is a different set of constraints which will depend on what's downstream of the participant, where the traffic flows are, what size they are, etc. In general, traffic loss at the IXP handoff will tend only to be a problem if there is a disparity between the bandwidth left available in the egress direction and the maximum link speed downstream of the IXP participant.

For example, a content network has servers which inject content at 10G, connected through a 100G IXP port. The egress IXP port is a mid-loaded 1G link which connects through to 10 Mbit WISP customers. In this case, the IXP will end up doing negligible buffering because most of the buffering load will be handled on the WISP's internal infrastructure, specifically at the core-to-10 Mbit handoff. The IXP port might end up dropping a packet or two during the initial TCP burst, but that is likely to be latency-specific and won't particularly harm overall performance because of TCP slow start.

On the other hand, if it were a mid-loaded 1G link with 500 Mbit access customers on the other side (e.g. DOCSIS / GPON / FTTH), then the IXP would end up being the primary buffering point between the content source and destination, and this would cause problems. The remedy here is either for the IXP to move the customer to a buffered port (e.g. a different switch), or for the access customer to upgrade their link.

If you want to push 50G-80G streams through an IXP, I'd argue that you really shouldn't, not just because of cost but also because this is very expensive to engineer properly, and you're almost certainly better off with a PNI.

This approach works better on some networks than others. The larger the IXP, the more difficult it is to manage this, both in terms of core and edge provisioning, i.e. the greater the requirement for buffering in both situations, because you have a greater variety of streaming scales per network. So although this isn't going to work as well for top-10 IXPs as for mid- or smaller-scale IXPs, where it works it can provide similar quality of service at a significantly lower cost base.

IOW, know your requirements and choose your tools to match. Same as with all engineering.

Nick
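A small sketch of the "where does the queue form" reasoning above: in steady state a TCP flow settles at the speed of the tightest hop on its path, and the standing queue builds just upstream of that hop. The hop names and per-flow available rates below are hypothetical, loosely following the two scenarios described (a mid-loaded 1G IXP port feeding 10 Mbit vs. 500 Mbit access customers).

    # Per-flow sketch: the tightest hop on the path is where the standing queue forms
    def queueing_point(hops):
        """hops: list of (name, per-flow available bps). Returns the tightest hop."""
        return min(hops, key=lambda h: h[1])

    # 10G content source behind a 100G IXP port, in both scenarios:
    scenario1 = [("IXP egress (1G port, ~500M free)", 500e6),
                 ("ISP access link (10 Mbit subscriber)", 10e6)]
    scenario2 = [("IXP egress (1G port, ~500M free, shared by a few flows)", 250e6),
                 ("ISP access link (500 Mbit subscriber)", 500e6)]

    print(queueing_point(scenario1))  # access link is tightest -> the ISP buffers, the IXP barely notices
    print(queueing_point(scenario2))  # IXP egress is tightest  -> the IXP becomes the buffering point

The per-flow share on the shared IXP port in scenario 2 is an assumption; the point is only that once the IXP port is the narrowest thing on the path, the buffering burden moves onto the exchange.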
In terms of 1G - 10G steps, it looks like UCSC has done some of that homework already.

https://people.ucsc.edu/~warner/Bufs/summary

"Ability to buffer 6 Mbytes is sufficient for a 10 Gb/s sender and a 1 Gb/s receiver." I'd suspect 10x would be appropriate for 100G - 10G (certainly made more accurate by testing).

http://people.ucsc.edu/~warner/I2-techs.ppt

Looking through their table ( https://people.ucsc.edu/~warner/buffer.html ), it looks like more switches than not in the not-100G realm have just enough buffers to handle one, possibly two mismatches at a time. Some barely don't have enough and others are woefully inadequate.

----- Mike Hammett Intelligent Computing Solutions http://www.ics-il.com
Midwest-IX http://www.midwest-ix.com

----- Original Message -----
From: "Nick Hilliard" <nick@foobar.org> To: "Mikael Abrahamsson" <swmike@swm.pp.se> Cc: "NANOG list" <nanog@nanog.org> Sent: Monday, August 21, 2017 6:10:17 AM Subject: Re: 100G - Whitebox

Mikael Abrahamsson wrote:
<snip>
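As a rough cross-check of the quoted 6 MB figure: one RTT of buffering at the receiver's line rate is the classic single-flow worst case for a speed step-down, and scaling the receiver rate by 10x scales the buffer the same way. The 48 ms RTT below is an assumption chosen to make the arithmetic line up with the quoted number; the UCSC methodology may differ.

    # Sanity check of the "6 MB for 10G -> 1G" figure (assumed 48 ms RTT)
    def rtt_buffer_mb(rtt_s, receiver_bps):
        """One RTT of buffering at the receiver's line rate, in MB."""
        return rtt_s * receiver_bps / 8 / 1e6

    print(f"{rtt_buffer_mb(0.048, 1e9):.1f} MB")    # ~6 MB for a 1 Gb/s receiver
    print(f"{rtt_buffer_mb(0.048, 10e9):.1f} MB")   # ~60 MB for a 10 Gb/s receiver (the "10x" guess)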
Mike,

Whether it becomes a practical problem depends on the use case, and by that I mean buffers can cut both ways. If buffers are too small, traffic can be dropped and, even worse, other traffic can be affected depending on factors like ASIC design and HOLB. If they're too large, latency- or order-sensitive traffic can be adversely affected. We're still dealing with the same limitations of switching that were identified 30+ years ago as the technology was developed. Sure, we have better chips, the option of better buffers and years of experience to help minimize those limitations, but they still exist and likely always will with switching.

Honestly, at this point it comes down to understanding what the use case is, understanding the nuances of each vendor's offerings, and determining where things line up. Then test, test, test.

-- Stephen

On 2017-12-04 2:45 PM, Mike Hammett wrote:
<snip>
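To illustrate the "too large" side of that trade-off: the worst-case queuing delay a buffer can add is simply its size divided by the port's drain rate. The sizes below are illustrative rather than any specific vendor's numbers.

    # Worst-case queuing delay added by a buffer that fills completely (illustrative sizes)
    def max_queue_delay_ms(buffer_bytes, port_bps):
        """Milliseconds needed to drain a completely full buffer at the port's rate."""
        return buffer_bytes * 8 / port_bps * 1000

    for size_mb, bps, label in [(4, 10e9, "4 MB draining at 10G"),
                                (60, 10e9, "60 MB draining at 10G"),
                                (60, 1e9,  "60 MB draining at 1G")]:
        print(f"{label}: {max_queue_delay_ms(size_mb * 1e6, bps):.1f} ms of queuing if it fills")

So the same 60 MB that comfortably absorbs a 10G step-down adds nearly half a second of queuing if it ever fills on a 1G port, which is the latency-sensitivity concern in a nutshell.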
Hello Mike,

On 20/08/2017 at 17:45, Mike Hammett wrote:
Looking at cost effective options brings us to whitebox switches. Obviously there's a wide range of hardware vendors and there are a few OSes available as well. Cumulus seems to be the market leader, while IPinFusion seems to be the most feature-rich.
You may want to take a look at this paper before choosing your NOS: https://hal-univ-tlse3.archives-ouvertes.fr/hal-01276379v1

It describes an OpenFlow SDN approach that dumbs down whitebox switches so as to fully avoid broadcast traffic, while performing as well as a complex VPLS fabric. This design has been operating consistently on the Toulouse Internet Exchange since 2015, using Pica8's gear.

So maybe a CLI and a feature-full NOS will better suit your needs, but for an IXP, the programmatic approach has been demonstrated to work just fine.

Best regards,

-- Jérôme Nicolle +33 (0)6 19 31 27 14
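For readers unfamiliar with the idea, here is a purely illustrative sketch of the kind of flow table such a design implies: known member MACs are forwarded directly to their ports, ARP is handled centrally, and broadcast/unknown traffic is dropped instead of flooded. The MAC addresses, port numbers, rule format and "arp proxy" action are hypothetical and do not reproduce the paper's actual rule set or any real controller API.

    # Illustrative flow-table model for a broadcast-free IXP fabric (hypothetical values)
    members = {                          # member MAC -> switch port (made-up entries)
        "00:10:01:aa:bb:01": 1,
        "00:10:01:aa:bb:02": 2,
    }

    flow_table = (
        # ARP never floods; a central proxy answers on behalf of members
        [{"priority": 300, "match": {"eth_type": 0x0806}, "action": "punt_to_arp_proxy"}]
        # known unicast is forwarded straight to the member's port
        + [{"priority": 200, "match": {"eth_dst": mac}, "action": f"output:{port}"}
           for mac, port in members.items()]
        # everything else -- broadcast and unknown unicast -- is dropped, not flooded
        + [{"priority": 0, "match": {}, "action": "drop"}]
    )

    for rule in flow_table:
        print(rule)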
participants (10)
- Bill Woodcock
- Coyo Stormcaller
- Fredrik Korsbäck
- Joel Jaeggli
- Jérôme Nicolle
- Mikael Abrahamsson
- Mike Hammett
- Nick Hilliard
- Raymond Burkholder
- Stephen Fulton