External BGP Controller for L3 Switch BGP routing
Hello,

A while back there was a discussion on how to do optimized (dynamic) BGP routing on an L3 switch that is only capable of handling a subset of the BGP routing table.

Someone pointed out that there was a project to do just that, and posted a link to a presentation by a European operator (Ireland?) who had written some code around ExaBGP to create such a setup (I am going by memory). Needless to say, I am trying to find that link, or the name of that project.

Anyone who can help refresh my memory with the link (my search skills are failing to find that presentation!) would be greatly appreciated.

Many thanks in advance.

Faisal Imtiaz
Tore Anderson: https://www.redpill-linpro.com/sysadvent/2016/12/09/slimming-routing-table.h...
--
Jeremy Austin
(907) 895-2311
(907) 803-5422
jhaustin@gmail.com

Heritage NetWorks
Whitestone Power & Communications
Vertical Broadband, LLC

Schedule a meeting: http://doodle.com/jermudgeon
On 14 January 2017 at 07:32, Jeremy Austin <jhaustin@gmail.com> wrote:

Hey,
https://www.redpill-linpro.com/sysadvent/2016/12/09/slimming-routing-table.h...
--- As described in a previous post, we’re testing a HPE Altoline 6920 in our lab. The Altoline 6920 is, like other switches based on the Broadcom Trident II chipset, able to handle up to 720 Gbps of throughput, packing 48x10GbE + 6x40GbE ports in a compact 1RU chassis. Its price is in all likelihood a single-digit percentage of the price of a traditional Internet router with a comparable throughput rating. ---

This makes it sound like a small-FIB router costs a single-digit percentage of a full-FIB one. Purely on BOM cost, a high-end Xeon costs more than, say, Jericho. If that doesn't match market realities, then full-FIB being expensive is a fabricated problem. I don't believe we're anywhere near full FIB being uneconomic.

Of course, if you have a solution where small FIB imposes very minor or no compromises, it makes no sense to use full FIB, which has a convergence-time cost too. But if you have to take compromises, I'd rather try to fix the underlying economic issue.

What is driving small-FIB boxes is not that full FIB is inherently very expensive, but rather that the densest possible box is the most marketable to DC people; trading large FIB and deep buffers for higher density is a no-brainer in some DC applications. I suspect most ports sold are in the DC, not in access, so the market is not focused on access requirements. For the same cost as a ghetto-cheap Trident II box, you should be able to buy a slightly less dense box with full FIB and deep buffers. In access networks this is fine, because your port utilisation rate is poor anyhow.

Also, having Trident on an Internet-facing interface may be suspect, especially if you need to go from a fast interface to a slow or busy interface, due to its very small packet buffers. This obviously won't be much of a problem for inside-DC traffic. It is quite hard to test in a lab: you can't reasonably test it on IXIA, as the burst sizes it supports are too small; on Spirent you mostly can see the problem.

-- ++ytti
Hi Saku,
This makes it sound like a small-FIB router costs a single-digit percentage of a full-FIB one.
Do you know of any traditional «Internet scale» router that can do ~720 Gbps of throughput for less than 10x the price of a Trident II box? Or even <100kUSD? (Disregarding any volume discounts.)
Also, having Trident on an Internet-facing interface may be suspect, especially if you need to go from a fast interface to a slow or busy interface, due to its very small packet buffers. This obviously won't be much of a problem for inside-DC traffic.
Quite the opposite, changing between different interface speeds happens very commonly inside the data centre (and most of the time it's done by shallow-buffered switches using Trident II or similar chips).

One ubiquitous configuration has the servers and any external uplinks attached with 10GE to leaf switches, which in turn connect to a 40GE spine layer. In this config server<->server and server<->Internet packets will need to change speed twice:

[server]-10GE-(leafX)-40GE-(spine)-40GE-(leafY)-10GE-[server/internet]

I suppose you could for example use a couple of MX240s or something as a special-purpose leaf layer for external connectivity. MPC5E-40G10G-IRB or something towards the 40GE spines and any regular 10GE MPC towards the exits. That way you'd only have one shallow-buffered speed conversion remaining. But I'm very sceptical if something like this makes sense after taking the cost/benefit ratio into account.

Tore
In my setup, I use a BIRD instance to combine multiple full Internet tables, and I use some filters to generate override routes to send to my L3 switch, which does the routing. The L3 switch is configured with a default route to the main transit provider, so if BIRD is down the routing is unoptimized, but everything else remains operable until I fix that BIRD instance.

I've asked around about why there isn't an L3 switch capable of handling full tables; I really don't understand the difference/logic behind it.
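A minimal BIRD 1.x sketch of the kind of setup described above; the ASNs, neighbor addresses and override prefixes are illustrative placeholders, and the real override list would be produced by whatever optimisation logic runs on the full tables:

    filter overrides_to_switch
    {
            # Illustrative hand-picked override prefixes; in practice this
            # list would be generated from the full tables by external logic.
            if net ~ [ 192.0.2.0/24, 198.51.100.0/24 ] then accept;
            reject;
    }

    # Full tables in from two transits (nothing exported back to them here).
    protocol bgp transit_a {
            local as 64500;
            neighbor 203.0.113.1 as 64496;
            import all;
            export none;
    }

    protocol bgp transit_b {
            local as 64500;
            neighbor 203.0.113.5 as 64497;
            import all;
            export none;
    }

    # iBGP session towards the L3 switch: only the override routes are sent;
    # the switch keeps its own static default towards the primary transit.
    protocol bgp l3_switch {
            local as 64500;
            neighbor 203.0.113.9 as 64500;
            import none;
            export filter overrides_to_switch;
    }

The point of the design is that the switch only ever learns the hand-picked overrides over BGP and falls back to its static default if the BIRD instance goes away.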
I'm going to be keeping a close eye on this: http://blogs.cisco.com/sp/a-bigger-helping-of-internet-please
Arista has a version of their switches that can handle a full table. I think what the OP is asking about, though, is something like OpenFlow: some have played around with using it to modify the switch's routing table based on the flows that exist. The same theory applies to the presentation link provided (we don't need the full table 99% of the time, so just insert what you need). Using filters is an "old school" technique that's been around for a long time, and I don't think that's what he's asking about.
On 1/15/17 11:00 PM, Yucong Sun wrote:
I've asked around about why there isn't an L3 switch capable of handling full tables; I really don't understand the difference/logic behind it.
In practice there are several merchant silicon implementations that support the addition of external TCAMs. Building them accordingly increases the COGS and brings various performance and packaging limitations. The Arista 7280R and Cisco NCS 5500 are Broadcom Jericho based devices that are packaged accordingly.

Ethernet merchant silicon is heavily biased towards doing most if not all of the I/O on the same ASIC, with limitations driven by gate size, die size, heat dissipation, pin count and so on. There was a recent Packet Pushers episode with Pradeep Sindhu that touched on some of these issues: http://packetpushers.net/podcast/podcasts/show-315-future-networking-pradeep...
Cisco and Arista are both able to squeeze a current full Internet table into the base FIB space on their Jericho boxes, using the right space partitioning. Cisco added this in 6.1.2 without anything in the release notes, but you’ll notice they bumped the datasheet spec on the base 5502 to 1M FIB where it used to be 256K. It works with the standard Internet table, but may not work if you have a ton of routes with prefix lengths that do not fit well with how the memory is carved up.

Of course Jericho is more expensive than Trident.

Phil
On 16 January 2017 at 08:40, Tore Anderson <tore@fud.no> wrote:

Hey,
Do you know of any traditional «Internet scale» router that can do ~720 Gbps of throughput for less than 10x the price of a Trident II box? Or even <100kUSD? (Disregarding any volume discounts.)
It's really hard to talk about pricing, as it's very dependent on many factors. But I guess pretty much all Jericho boxes would fit that bill? Arista will probably set you back anywhere in the range of 15-35k, will do a full table (for now) and has deep packet buffers. The NCS 5501 is also sub-100k, even with external TCAM: probably around 40k for a single unit without external TCAM and 60k with external TCAM, where you lose 8x10G and 2x100G ports.

But my comment wasn't really about what is available now; it was more fundamentally about the economics of large FIB and large buffers: they are not inherently very expensive in BOM terms.

I wonder if true whitelabel is possible: would some 'real' HW vendor, of BRCM size, release HW docs openly? Then some integrator could start selling the HW at BOM+10-20%, no support, no software at all, and the community could build the actual software on it. It seems to me that what is keeping us away from near-BOM prices is software engineering, and we cannot do it as a community, as HW docs are not available.
Quite the opposite, changing between different interface speeds happens very commonly inside the data centre (and most of the time it's done by shallow-buffered switches using Trident II or similar chips).
Why I said it won't be a problem inside the DC is because of the low RTT, which means small bursts. I'm talking about backend network infra in the DC, not Internet facing. Anywhere you'll see large RTT and a speed/availability step-down you'll need buffers (unless we change TCP to pace window growth instead of bursting as it does now; AFAIK you can already configure your Linux server to pace at estimated BW, but then you'd lose on congested links, as more aggressive TCP stacks would beat you to oblivion).
I suppose you could for example use a couple of MX240s or something as a special-purpose leaf layer for external connectivity. MPC5E-40G10G-IRB or something towards the 40GE spines and any regular 10GE MPC towards the exits. That way you'd only have one shallow-buffered speed conversion remaining. But I'm very sceptical if something like this makes sense after taking the cost/benefit ratio into account.
MPC is indeed on a completely different level in BOM terms, as it's an NPU with lookup and packets in DRAM, fairly complicated and space-inefficient. But we have pipeline chips on the market with deep buffers and full DFZ support. There is no real reason the markup on them should be significant; the control plane should cost more. This is why the promise of the Xeon router is odd to me, as it's a fundamentally very expensive chip, combined with poorly predictable performance (jitter, latency...).

-- ++ytti
* Saku Ytti <saku@ytti.fi>
Why I said it won't be a problem inside the DC is because of the low RTT, which means small bursts. I'm talking about backend network infra in the DC, not Internet facing. Anywhere you'll see large RTT and a speed/availability step-down you'll need buffers (unless we change TCP to pace window growth instead of bursting as it does now; AFAIK you can already configure your Linux server to pace at estimated BW, but then you'd lose on congested links, as more aggressive TCP stacks would beat you to oblivion).
But here you're talking about the RTT of each individual link, right, not the RTT of the entire path through the Internet for any given flow?

Put it another way, my «Internet facing» interfaces are typically 10GEs with a few (kilo)metres of dark fibre that x-connects into my IP-transit providers' routers sitting in nearby rooms or racks (worst case somewhere else in the same metro area). Is there any reason why I should need deep buffers on those interfaces?

The IP-transit providers might need the deep buffers somewhere in their networks, sure. But if so I'm thinking that's a problem I'm paying them to not have to worry about.

BTW, in my experience the buffering and tail-dropping are actually a bigger problem inside the data centre, because of distributed applications causing incast. So we get workarounds like DCTCP and BBR, which are apparently cheaper than using deep-buffer switches everywhere.

Tore
On Mon, Jan 16, 2017 at 01:36:54PM +0100, Tore Anderson wrote:
But here you're talking about the RTT of each individual link, right, not the RTT of the entire path through the Internet for any given flow?
Put it another way, my «Internet facing» interfaces are typically 10GEs with a few (kilo)metres of dark fibre that x-connects into my IP-transit providers' routers sitting in nearby rooms or racks (worst case somewhere else in the same metro area). Is there any reason why I should need deep buffers on those interfaces?
It would be the RTT of the entire path through the Internet, over which TCP establishes its sessions. Longer RTT, larger window, bigger bursts. So yes, you'll need deeper buffers for serving IP transit, regardless of the local link (router-to-router) latency. For data centers, internal traffic within the DC is low-latency end to end, so bursts are relatively small.

James
On 16 January 2017 at 14:36, Tore Anderson <tore@fud.no> wrote:
But here you're talking about the RTT of each individual link, right, not the RTT of the entire path through the Internet for any given flow?
I'm talking about the end-to-end RTT, which determines window size, which determines burst size. Your worst-case burst will be half of the needed window size, and you need to be able to ingest this burst at the sender's rate, regardless of the receiver's rate.
Put it another way, my «Internet facing» interfaces are typically 10GEs with a few (kilo)metres of dark fibre that x-connects into my IP-transit providers' routers sitting in nearby rooms or racks (worst case somewhere else in the same metro area). Is there any reason why I should need deep buffers on those interfaces?
Imagine a content network having a 40Gbps connection and a client having a 10Gbps connection, with a lossless network between them and an RTT of 200ms. To achieve a 10Gbps rate the receiver needs a 10Gbps * 200ms = 250MB window. In the worst case a 125MB window could grow into a 250MB window, and the sender could send that 125MB as a burst at 40Gbps. This means the port the receiver is attached to needs to store the 125MB, as it's only serialising it out at 10Gbps. If it cannot store it, the window will shrink and the receiver cannot get 10Gbps.

This is quite a pathological example, but you can try it with much less pathological numbers, remembering that Trident II has 12MB of buffers.

-- ++ytti
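A back-of-the-envelope restatement of the arithmetic in the example above, in plain Python. The numbers are the ones from the example; the queue estimate assumes the whole half-window burst arrives back-to-back at 40Gbps while draining at 10Gbps:

    # Back-of-the-envelope buffer/burst arithmetic for the 40G -> 10G example.

    def bdp_bytes(rate_bps, rtt_s):
        """Bandwidth-delay product in bytes: the TCP window needed to fill the path."""
        return rate_bps * rtt_s / 8

    receiver_rate = 10e9   # 10 Gbps towards the client
    sender_rate = 40e9     # 40 Gbps on the content side
    rtt = 0.200            # 200 ms end-to-end RTT

    window = bdp_bytes(receiver_rate, rtt)                   # ~250 MB to sustain 10 Gbps
    worst_burst = window / 2                                 # window doubling: up to ~125 MB back-to-back
    queue = worst_burst * (1 - receiver_rate / sender_rate)  # what the 10G egress port must hold

    print(f"window needed : {window / 1e6:.0f} MB")
    print(f"worst burst   : {worst_burst / 1e6:.0f} MB at 40 Gbps")
    print(f"queue at 10G  : {queue / 1e6:.0f} MB (vs ~12 MB on a Trident II)")

Even allowing for the 10Gbps drain during the burst, the shallow-buffered port would need on the order of 94 MB, an order of magnitude more than Trident II offers.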
* Saku Ytti
Imagine a content network having a 40Gbps connection and a client having a 10Gbps connection, with a lossless network between them and an RTT of 200ms. To achieve a 10Gbps rate the receiver needs a 10Gbps * 200ms = 250MB window. In the worst case a 125MB window could grow into a 250MB window, and the sender could send that 125MB as a burst at 40Gbps. This means the port the receiver is attached to needs to store the 125MB, as it's only serialising it out at 10Gbps. If it cannot store it, the window will shrink and the receiver cannot get 10Gbps.

This is quite a pathological example, but you can try it with much less pathological numbers, remembering that Trident II has 12MB of buffers.
I totally get why the receiver needs bigger buffers if he's going to shuffle that data out another interface with a slower speed.

But when you're a data centre operator you're (usually anyway) mostly transmitting data. And you can easily ensure the interface speed facing the servers can be the same as the interface speed facing the ISP. So if you consider this typical spine/leaf data centre network topology (essentially the same one I posted earlier this morning):

(Server) --10GE--> (T2 leaf X) --40GE--> (T2 spine) --40GE--> (T2 leaf Y) --10GE--> (IP-transit/"the Internet") --10GE--> (Client)

If I understand you correctly you're saying this is a "suspect" topology that cannot achieve 10G transmission rate from server to client (or from client to server for that matter) because of small buffers on my "T2 leaf Y" switch (i.e., the one which has the Internet-facing interface)?

If so would it solve the problem just replacing "T2 leaf Y" with, say, a Juniper MX or something else with deeper buffers? Or would it help to use (4x)10GE instead of 40GE for the links between the leaf and spine layers too, so there was no change in interface speeds along the path through the data centre towards the handoff to the IPT provider?

Tore
On 16 January 2017 at 16:53, Tore Anderson <tore@fud.no> wrote:

Hey,
(Server) --10GE--> (T2 leaf X) --40GE--> (T2 spine) --40GE--> (T2 leaf Y) --10GE--> (IP-transit/"the Internet") --10GE--> (Client)
If I understand you correctly you're saying this is a "suspect" topology that cannot achieve 10G transmission rate from server to client (or from client to server for that matter) because of small buffers on my "T2 leaf Y" switch (i.e., the one which has the Internet-facing interface)?
This mostly isn't suspect, depending on how utilised it is. If it's verbatim like the above, then it's never going to be suspect, as T2_leaf_Y is going to see large pauses between the frames coming from the 40Gbps side. It's not going to need to store a large burst of 40Gbps traffic, as no one is generating at 40Gbps, so it can cope with very small buffers.

-- ++ytti
On 1/16/17 6:53 AM, Tore Anderson wrote:
I totally get why the receiver needs bigger buffers if he's going to shuffle that data out another interface with a slower speed.
But when you're a data centre operator you're (usually anyway) mostly transmitting data. And you can easily ensure the interface speed facing the servers can be the same as the interface speed facing the ISP.
Unlikely, given that the interfaces facing the servers are 1/10/25/50G and the ones facing the ISP are n x 10G or n x 100G.
So if you consider this typical spine/leaf data centre network topology (essentially the same one I posted earlier this morning):
(Server) --10GE--> (T2 leaf X) --40GE--> (T2 spine) --40GE--> (T2 leaf Y) --10GE--> (IP-transit/"the Internet") --10GE--> (Client)
If I understand you correctly you're saying this is a "suspect" topology that cannot achieve 10G transmission rate from server to client (or from client to server for that matter) because of small buffers on my "T2 leaf Y" switch (i.e., the one which has the Internet-facing interface)?
You can externalize the cost of the buffer, at the expense of latency, from the T2, e.g. by enabling flow control facing the host or other high-capacity device, or by engaging in packet pacing on the server if the network is fairly shallow.

If the question is "how can I ensure high link utilization" rather than "how do I get maximum throughput for this one flow", the buffer requirement may be substantially lower, e.g. if you size based on

buffer = (RTT * bandwidth) / sqrt(number of flows)

(see http://conferences.sigcomm.org/sigcomm/2004/papers/p277-appenzeller1.pdf) rather than

buffer = RTT * bandwidth

(a quick numeric sketch of both rules follows at the end of this message).
If so would it solve the problem just replacing "T2 leaf Y" with, say, a Juniper MX or something else with deeper buffers?
Broadcom Jericho / PTX / QFX, whatever; sure, it's plausible to have a large buffer without using the feature-rich, extremely-large-FIB ASIC.
Or would it help to use (4x)10GE instead of 40GE for the links between the leaf and spine layers too, so there was no change in interface speeds along the path through the data centre towards the handoff to the IPT provider?
It can reduce the demand on the buffer; you can, however, end up multiplexing two or more flows that might otherwise run at 10Gb/s onto the same LAG member.
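A quick numeric sketch of the two buffer-sizing rules cited above, in plain Python; the RTT, link speed and flow count are illustrative values, not numbers from the thread:

    from math import sqrt

    # Two buffer-sizing rules:
    #   classic rule of thumb   : buffer = RTT * C
    #   Appenzeller (SIGCOMM'04): buffer = RTT * C / sqrt(N), N = long-lived flows

    def classic_buffer_bytes(rtt_s, link_bps):
        return rtt_s * link_bps / 8

    def appenzeller_buffer_bytes(rtt_s, link_bps, n_flows):
        return rtt_s * link_bps / 8 / sqrt(n_flows)

    rtt = 0.200      # 200 ms, illustrative
    link = 10e9      # 10 Gbps Internet-facing port
    flows = 1000     # concurrent long-lived flows, illustrative

    print(f"classic     : {classic_buffer_bytes(rtt, link) / 1e6:.0f} MB")
    print(f"appenzeller : {appenzeller_buffer_bytes(rtt, link, flows) / 1e6:.1f} MB")

With enough concurrent long-lived flows the requirement drops from roughly 250 MB to under 10 MB, which is the point of the Appenzeller result and why a shallow-buffered port can still keep a busy link full.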
❦ 16 January 2017 14:08 +0200, Saku Ytti <saku@ytti.fi> :
I wonder if true whitelabel is possible: would some 'real' HW vendor, of BRCM size, release HW docs openly? Then some integrator could start selling the HW at BOM+10-20%, no support, no software at all, and the community could build the actual software on it. It seems to me that what is keeping us away from near-BOM prices is software engineering, and we cannot do it as a community, as HW docs are not available.
Mellanox, with switches like the SN2700. I don't know how open the hardware documentation is, but they are pushing support for their ASIC directly into Linux (look at drivers/net/ethernet/mellanox/mlxsw). They are also contributing to the switchdev framework, which will at some point allow transparent acceleration of the Linux box (switching, routing, tunneling, firewalling, etc.), as we already have with CumulusOS.

The datasheet is quite scarce: it lists 88k L2 forwarding entries but says nothing about L3, and buffer sizes are not mentioned. But I suppose that someone interested would be able to get more detailed information.

--
"Elves and Dragons!" I says to him. "Cabbages and potatoes are better for you and me." -- J. R. R. Tolkien
Thank you for all the on-list and off-list replies. The project I was looking for was/is called SIR (SDN Internet Router), and the original presentation was done by David Barroso.

Thanks to everyone who responded!

Regards,

Faisal Imtiaz
Snappy Internet & Telecom
7266 SW 48 Street
Miami, FL 33155
Tel: 305 663 5518 x 232
Help-desk: (305) 663-5518 Option 2 or Email: Support@Snappytelecom.net
From: "Tore Anderson" <tore@fud.no> To: "Saku Ytti" <saku@ytti.fi> Cc: "nanog list" <nanog@nanog.org> Sent: Monday, January 16, 2017 1:40:47 AM Subject: Re: External BGP Controller for L3 Switch BGP routing
Hi Saku,
https://www.redpill-linpro.com/sysadvent/2016/12/09/slimming-routing-table.h...
--- As described in a prevous post, we’re testing a HPE Altoline 6920 in our lab. The Altoline 6920 is, like other switches based on the Broadcom Trident II chipset, able to handle up to 720 Gbps of throughput, packing 48x10GbE + 6x40GbE ports in a compact 1RU chassis. Its price is in all likelihood a single-digit percentage of the price of a traditional Internet router with a comparable throughput rating. ---
This makes it sound like small-FIB router is single-digit percentage cost of full-FIB.
Do you know of any traditional «Internet scale» router that can do ~720 Gbps of throughput for less than 10x the price of a Trident II box? Or even <100kUSD? (Disregarding any volume discounts.)
Also having Trident in Internet facing interface may be suspect, especially if you need to go from fast interface to slow or busy interface, due to very minor packet buffers. This obviously won't be much of a problem in inside-DC traffic.
Quite the opposite, changing between different interface speeds happens very commonly inside the data centre (and most of the time it's done by shallow-buffered switches using Trident II or similar chips).
One ubiquitous configuration has the servers and any external uplinks attached with 10GE to leaf switches which in turn connects to a 40GE spine layer with. In this config server<->server and server<->Internet packets will need to change speed twice:
[server]-10GE-(leafX)-40GE-(spine)-40GE-(leafY)-10GE-[server/internet]
I suppose you could for example use a couple of MX240s or something as a special-purpose leaf layer for external connectivity. MPC5E-40G10G-IRB or something towards the 40GE spines and any regular 10GE MPC towards the exits. That way you'd only have one shallow-buffered speed conversion remaining. But I'm very sceptical if something like this makes sense after taking the cost/benefit ratio into account.
Tore
❦ 14 January 2017 05:24 GMT, Faisal Imtiaz <faisal@snappytelecom.net> :
A while back there was a discussion on how to do optimized (dynamic) BGP routing on an L3 switch that is only capable of handling a subset of the BGP routing table.

Someone pointed out that there was a project to do just that, and posted a link to a presentation by a European operator (Ireland?) who had written some code around ExaBGP to create such a setup.
Maybe: https://github.com/dbarrosop/sir

--
The difference between the right word and the almost right word is the difference between lightning and the lightning bug. -- Mark Twain
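For anyone wanting a feel for how an ExaBGP-based controller like SIR hooks in: ExaBGP can spawn an external process and turn one-line commands read from that process's stdout into BGP announcements. The Python sketch below is a generic illustration of that API, not SIR's actual code; the prefix file path and next-hop address are made-up placeholders:

    #!/usr/bin/env python3
    # Generic ExaBGP helper-process sketch: announce a small, externally
    # computed set of "hot" prefixes towards the L3 switch. Illustrative only.
    import sys
    import time

    PREFIX_FILE = "/etc/exabgp/top-prefixes.txt"  # hypothetical path, one prefix per line
    NEXT_HOP = "203.0.113.1"                      # hypothetical preferred exit

    def load_prefixes(path):
        try:
            with open(path) as f:
                return {line.strip() for line in f
                        if line.strip() and not line.startswith("#")}
        except FileNotFoundError:
            return set()

    announced = set()
    while True:
        wanted = load_prefixes(PREFIX_FILE)
        # ExaBGP reads these commands from our stdout and turns them into BGP updates.
        for prefix in wanted - announced:
            sys.stdout.write(f"announce route {prefix} next-hop {NEXT_HOP}\n")
        for prefix in announced - wanted:
            sys.stdout.write(f"withdraw route {prefix} next-hop {NEXT_HOP}\n")
        sys.stdout.flush()
        announced = wanted
        time.sleep(60)

In SIR's case the prefix selection is driven by traffic statistics gathered elsewhere; the sketch only shows the ExaBGP-facing half of such a setup.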
I thought your post was fairly self-explanatory, but people seem to be all over the place... except for what you actually asked about.

-----
Mike Hammett
Intelligent Computing Solutions
Midwest Internet Exchange
The Brothers WISP
participants (12):

- David Bass
- Faisal Imtiaz
- James Jun
- Jeremy Austin
- joel jaeggli
- Josh Reynolds
- Mike Hammett
- Phil Bedard
- Saku Ytti
- Tore Anderson
- Vincent Bernat
- Yucong Sun