RFC 5549 - IPv4 Routes with IPv6 next-hop - Does it really exists? - Test

Douglas Fischer

29 Jul 2020

5:51 a.m.

Let's skip all the arguing about the lack of IPv4, the need for IPv6, etc.

I must confess that I don't know all the RFCs. I would like to, but I don't!

Today I came across https://tools.ietf.org/html/rfc5549

I knew it was possible to carry v4 routes over v6 BGP sessions, or v6 routes over v4 BGP sessions. But I was surprised when I saw this YouTube video of the AMS-IX guys considering a v6-only LAN with v6 next-hops for v4 routes. https://www.youtube.com/watch?v=uJOtfiHDCMw

Well... I guess that idea didn't go to production.

But the questions are: Is there any network that really implements RFC 5549? Can anyone share some information about it?

--
Douglas Fernando Fischer
Control and Automation Engineer

Vincent Bernat

29 Jul

6:59 a.m.

Hello,

This is implemented in FRR and will also be available in BIRD 2.0.8. Linux accepts IPv6 next-hops for IPv4 routes natively since 5.3 (no tunnels). This is the solution Cumulus is advocating to its users, so I suppose they have some real users behind it.

Juniper also supports RFC 5549 but, according to the documentation, the forwarding part is done using lightweight tunnels. Maybe David Ahern is reading this list and could comment more.

I don't use this solution myself, as vendor support is still quite limited, but if I were to start a network from scratch, I would definitely go for it.

--
Let the machine do the dirty work.
- The Elements of Programming Style (Kernighan & Plauger)

Saku Ytti

9:13 a.m.

On Wed, 29 Jul 2020 at 10:03, Vincent Bernat <bernat@luffy.cx> wrote:

...

This is the solution Cumulus is advocating to its users, so I suppose they have some real users behind that. Juniper also supports RFC 5549 but, from the documentation, the forwarding part is done using lightweight tunnels.

I'm not sure if you claim otherwise, but no real 'tunneling' takes place. As far as I know, having an IPv6 next-hop for IPv4 is an internal implementation detail. I don't think there are any additional headers, any additional lookup, or any extra cost.

Cisco supports extended next-hop encoding too, so it is fairly well supported by shipping products.

--
++ytti

Vincent Bernat

9:57 a.m.

❦ 29 July 2020 12:13 +03, Saku Ytti:

...

...
This is the solution Cumulus is advocating to its users, so I suppose they have some real users behind that. Juniper also supports RFC 5549 but, from the documentation, the forwarding part is done using lightweight tunnels.

I'm not sure if you claim otherwise, but no real 'tunneling' takes place, as far as I know, it's internal implementation detail having IPV6 next-hop for IPV4. I don't think there is any additional headers or any additional lookup or cost.

I didn't test, but the documentation states:

...

Starting in Release 17.3R1, Junos OS devices can forward IPv4 traffic over an IPv6-only network, which generally cannot forward IPv4 traffic. As described in RFC 5549, IPv4 traffic is tunneled from CPE devices to IPv4-over-IPv6 gateways. These gateways are announced to CPE devices through anycast addresses. The gateway devices then create dynamic IPv4-over-IPv6 tunnels to remote customer premises equipment and advertise IPv4 aggregate routes to steer traffic. Route reflectors with programmable interfaces inject the tunnel information into the network. The route reflectors are connected through IBGP to gateway routers, which advertise the IPv4 addresses of host routes with IPv6 addresses as the next hop.

https://www.juniper.net/documentation/en_US/junos/topics/topic-map/multiprot...

If you have a pointer on this subject for Juniper, I would be quite interested! Thanks.

--
Write and test a big program in small pieces.
- The Elements of Programming Style (Kernighan & Plauger)

Saku Ytti

10:19 a.m.

On Wed, 29 Jul 2020 at 12:58, Vincent Bernat <bernat@luffy.cx> wrote:

...

I didn't test, but the documentation states:

I think the only disconnect here is the definition of 'tunnel'. There are no additional headers; I don't think the document implies any, and the RFC it refers to does not.

I've not tried it myself, but my expectation is that internally the next-hop is represented as IPv6, with the IPv4 resolution copied for L2. So I anticipate the magic to be local here, and when they talk about a tunnel, I suspect they are referring to that adjacency.

--
++ytti

Owen DeLong

3:51 p.m.

...

On Jul 29, 2020, at 02:13 , Saku Ytti <saku@ytti.fi> wrote:

On Wed, 29 Jul 2020 at 10:03, Vincent Bernat <bernat@luffy.cx> wrote:

...
This is the solution Cumulus is advocating to its users, so I suppose they have some real users behind that. Juniper also supports RFC 5549 but, from the documentation, the forwarding part is done using lightweight tunnels.

I'm not sure if you claim otherwise, but no real 'tunneling' takes place, as far as I know, it's internal implementation detail having IPV6 next-hop for IPV4. I don't think there is any additional headers or any additional lookup or cost. Cisco supports extended nexthop encoding too, so it is fairly well supported by shipping products.

In reality, next hop isn’t really a layer 3 address. The layer 3 address is a stand-in that is resolved to a layer 2 address for forwarding. The layer 3 next-hop address never makes it into the packet.

As such, the relationship between the destination address family and the next-hop address family is mostly to avoid breaking the brains of humans. Software to handle mixed address families in next hop vs. destination should be a relatively trivial difference from software that requires the address families to match.

Owen
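The point that the next-hop address is only a lookup key, never a field in the forwarded packet, can be illustrated with a toy forwarding function. This is a minimal sketch, not any vendor's data path: the `FIB` and `NEIGHBORS` tables, the interface names, and all addresses and MACs are made up for illustration.

```python
import ipaddress

# Hypothetical tables for illustration only; every value here is made up.
# FIB: IPv4 prefix -> (next-hop address of either family, egress interface)
FIB = {
    ipaddress.ip_network("198.51.100.0/24"): (ipaddress.ip_address("fe80::1"), "swp1"),
    ipaddress.ip_network("203.0.113.0/24"): (ipaddress.ip_address("192.0.2.1"), "swp2"),
}

# Neighbor table (ARP and ND entries side by side): (next hop, interface) -> MAC
NEIGHBORS = {
    (ipaddress.ip_address("fe80::1"), "swp1"): "02:00:00:00:00:01",
    (ipaddress.ip_address("192.0.2.1"), "swp2"): "02:00:00:00:00:02",
}


def forward(dst):
    """Return (egress interface, destination MAC) for an IPv4 destination."""
    addr = ipaddress.ip_address(dst)
    matches = [net for net in FIB if addr in net]
    if not matches:
        return None
    # Longest-prefix match picks the most specific route.
    best = max(matches, key=lambda net: net.prefixlen)
    next_hop, ifname = FIB[best]
    # The next hop is used only as a key into the neighbor table.
    # Its address family does not have to match the packet's.
    return ifname, NEIGHBORS[(next_hop, ifname)]
```

Calling `forward("198.51.100.7")` resolves through the IPv6 link-local next hop and `forward("203.0.113.9")` through the IPv4 one; in both cases the result is just an interface and a MAC address.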

Saku Ytti

3:54 p.m.

On Wed, 29 Jul 2020 at 18:51, Owen DeLong <owen@delong.com> wrote:

...

In reality, next hop isn’t really a layer 3 address. The layer 3 address is a stand-in that is resolved to a layer 2 address for forwarding. The layer 3 next-hop address never makes it into the packet.

I wish you had shared in the draft process so they could have benefitted from your insight into the proper verbiage. -- ++ytti

Alejandro Acosta

11:23 a.m.

A long time ago I tried it out:

https://blog.acostasite.com/2013/02/publicar-prefijos-ipv4-sobre-una-sesion....
https://blog.acostasite.com/2013/02/publicando-prefijos-ipv6-sobre-sesiones....

I did not like it: troubleshooting is difficult when something goes wrong (however, I can understand it's a nice feature to have and it might be useful in some scenarios).

But you are right, I do not know of many networks doing it. I would also like to hear about it.

Alejandro

Saku Ytti

11:40 a.m.

Hey, On Wed, 29 Jul 2020 at 14:26, Alejandro Acosta <alejandroacostaalamo@gmail.com> wrote:

...

https://blog.acostasite.com/2013/02/publicar-prefijos-ipv4-sobre-una-sesion.... https://blog.acostasite.com/2013/02/publicando-prefijos-ipv6-sobre-sesiones....

I did not like, difficult troubleshooting in case something goes wrong (however I can understand it's a nice feature to have and in might be useful in some scenarios).

Your experiment predates extended next-hop encoding, but otherwise it is indeed the very same thing, just with less operational overhead now. Of course everyone has been doing 6PE and 6VPE for the longest time, because obviously you can fit an IPv4 next-hop in the IPv6 encoding, so nothing new was needed there. Extended next-hop encoding only exists to fix the problem that the wire format didn't support signalling an IPv6 next-hop for IPv4 NLRI.

--
++ytti
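For readers who want to see what is signalled, here is a small sketch of both halves of that remark. The capability layout (capability code 5, followed by triples of NLRI AFI, NLRI SAFI and next-hop AFI, two octets each) comes from RFC 5549 itself rather than from this thread. The function names are illustrative, and the IPv4-mapped form is shown as the usual way an IPv4 next-hop is carried in an IPv6-sized field.

```python
import ipaddress
import struct

CAP_EXTENDED_NEXT_HOP = 5      # capability code defined by RFC 5549
AFI_IPV4, AFI_IPV6 = 1, 2      # address family identifiers
SAFI_UNICAST = 1


def encode_extended_nexthop(triples):
    """Encode (NLRI AFI, NLRI SAFI, next-hop AFI) triples as one capability TLV."""
    value = b"".join(
        struct.pack("!HHH", afi, safi, nh_afi) for afi, safi, nh_afi in triples
    )
    # One octet of capability code, one octet of length, then the value.
    return struct.pack("!BB", CAP_EXTENDED_NEXT_HOP, len(value)) + value


def mapped_next_hop(v4):
    """Express an IPv4 next-hop as an IPv4-mapped IPv6 address (the 6PE style)."""
    return ipaddress.IPv6Address("::ffff:" + v4)


# "I can receive IPv4 unicast NLRI with an IPv6 next-hop"
cap = encode_extended_nexthop([(AFI_IPV4, SAFI_UNICAST, AFI_IPV6)])
```

The asymmetry is visible here: an IPv4 address fits inside an IPv6-sized field, but the reverse is impossible, which is why the IPv4-NLRI case needed a negotiated capability.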

Douglas Fischer

4:43 p.m.

Does anybody here know what Gambiarra means?

Alejandro mentioned that an IPv6 next-hop on IPv4 routes breaks traceroute and makes troubleshooting difficult.

Well... for a while I have been thinking about a Gambiarra that I'm using in other scenarios, and that I think could help reduce the bad impacts of IPv6 next-hops on IPv4 routing.

Take a router with several IPv6-only interfaces and at least one public IPv4 /32 on its loopback. For the IPv4 address on each of those v6-only interfaces, use "IP address unnumbered loopback 0".

This would make the ICMP responses for TTL expired be sourced from that public IPv4 address.

It would not be as good as one public IP for each interface, but at least, in a traceroute, it would be possible to determine which ASN is responsible for that hop, and exactly on which router it occurs.

Owen DeLong

6:10 p.m.

...

On Jul 29, 2020, at 09:43 , Douglas Fischer <fischerdouglas@gmail.com> wrote:

Does anybody here knows what Gambiarra means?

The English translation would be “Jury Rig” or “Hack”. Synonyms include “MacGyverism”, “Rube Goldberg”, “Kludge”, etc. A foreign address family as next-hop is definitely in this category.

...

Alejandro mentioned that IPv6 NextHop on IPv4 routing breaks traceroute and difficult troubleshooting.

It doesn’t really break trace route, but it does complicate troubleshooting. The next hop device won’t know that the IPv4 packet arrived via IPv6 next hop. If the device has an IPv4 address, it will still report that in the trace route. Of course, that won’t match the expected next-hop from the routing table on the previous device, but it will still be reported. If it doesn’t have an IPv4 address, then one has to wonder how that’s going to work for what it will do with the packet anyway. In such a case, I would expect that it breaks more than trace route. Troubleshooting is difficult because it requires significant indirection to figure out what’s really going on and because it creates a good bit of cognitive dissonance in the human analysis part of the troubleshooting effort.

...

Well... Since a while I have been thinking about a Gambiarra that I'm using on other scenarios, but I think could help to reduce de bad impacts of IPv6 NextHop on IPv4 routing.

O router with several interfaces with IPv6 only and at least one public IPv4 /32 on his loopback. On the IPv4 address on each of that v6 only interfaces, use "IP address unnumbered loopback 0".

This would make the ICMP responses for TTL expired be sourced with that public IPv4.

Would not be as good as one public IP for each interface, but at least, on a traceroute, would be possible to Defined what ASN is responsible for that hop, and exactly in what router it occurs.

You most likely get the same result whether you add the unnumbered configuration or not on a router where the only IPv4 address is on the loopback interface.

Owen

Simon Leinen

1:51 p.m.

Douglas Fischer writes:

...

And today, I reached on https://tools.ietf.org/html/rfc5549 [...] But the questions are: There is any network that really implements RFC5549?

We've been using it for more than two years in our data center networks. We use the Cumulus/FRR implementation on switches and FRR on Ubuntu on servers.

...

Can anyone share some information about it?

Sure. We found the FRR/Cumulus implementation very easy to set up. We have leaf/spine networks interconnecting hundreds of servers (IPv4+IPv6) with very minimalistic configuration. In particular, you generally don't have to configure neighbor addresses or AS numbers, because those are autodiscovered.

I think we're basically following the recommendations in the "BGP in the Data Center" book including the "BGP on the Host" part (though our installation predates the book, so there might be some differences).

The network has been working very reliably for us, so we never really had anything to debug. If you're coming from a world where you used separate BGP sessions to exchange IPv4 and IPv6 reachability information, then the operational commands take a little getting used to, but in the end I find it very intuitive.

For example, here's one of the "show bgp ... summary" commands on a leaf switch:

leinen@sw-f:mgmt-vrf:~$ net show bgp ipv6 uni sum
BGP router identifier 10.1.1.46, local AS number 65111 vrf-id 0
BGP table version 96883
RIB entries 1528, using 227 KiB of memory
Peers 54, using 1041 KiB of memory
Peer groups 2, using 128 bytes of memory

Neighbor        V    AS MsgRcvd MsgSent TblVer InQ OutQ  Up/Down State/PfxRcd
sw-o(swp16)     4 65108  953559  938348      0   0    0 03w5d00h          688
sw-m(swp18)     4 65108  885442  938348      0   0    0 03w5d00h          688
s0001(swp1s0.3) 4 65300  748971  748977      0   0    0 03w5d00h            1
s0002(swp1s1.3) 4 65300  661787  661794      0   0    0 03w1d23h            1
s0003(swp1s2.3) 4 65300  748970  748977      0   0    0 03w5d00h            1
s0004(swp1s3.3) 4 65300  661868  661875      0   0    0 03w1d23h            1
s0005(swp2s0.3) 4 65300  748970  748976      0   0    0 03w5d00h            1
[...]

Note the host names/interface names - this is how you generally refer to neighbors, rather than using literal (IPv6) addresses. Otherwise it should look very familiar if you have used vendor C's "industry-standard CLI" before.

(In case you're wondering, the first two neighbors in the output are spine switches, the others are servers.)

Cheers,
--
Simon.
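Because neighbors show up as `hostname(interface)` rather than as addresses, the summary rows are easy to pick apart in a script. The following is only a sketch that assumes the column layout shown in the output above; the regular expression, the `parse_neighbors` helper and the dictionary keys are illustrative and not part of FRR or Cumulus.

```python
import re

# Matches rows such as "sw-o(swp16) 4 65108 953559 938348 0 0 0 03w5d00h 688".
# The column order is assumed from the sample output, not from a documented format.
ROW = re.compile(
    r"^(?P<host>[\w.-]+)\((?P<ifname>[\w.]+)\)\s+"
    r"(?P<version>\d+)\s+(?P<asn>\d+)\s+(?P<rcvd>\d+)\s+(?P<sent>\d+)\s+"
    r"(?P<tblver>\d+)\s+(?P<inq>\d+)\s+(?P<outq>\d+)\s+"
    r"(?P<updown>\S+)\s+(?P<state>\S+)$"
)


def parse_neighbors(text):
    """Return one dict per neighbor row; header and other lines are skipped."""
    rows = []
    for line in text.splitlines():
        match = ROW.match(line.strip())
        if match:
            rows.append({
                "host": match["host"],
                "ifname": match["ifname"],
                "asn": int(match["asn"]),
                "state_pfxrcd": match["state"],
            })
    return rows


SAMPLE = """\
Neighbor V AS MsgRcvd MsgSent TblVer InQ OutQ Up/Down State/PfxRcd
sw-o(swp16) 4 65108 953559 938348 0 0 0 03w5d00h 688
s0001(swp1s0.3) 4 65300 748971 748977 0 0 0 03w5d00h 1
"""

neighbors = parse_neighbors(SAMPLE)
```

Grouping the parsed rows by `asn` would separate the spines from the servers in this fabric, since each role uses its own set of AS numbers.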

Mark Tinka

2:09 p.m.

On 29/Jul/20 15:51, Simon Leinen wrote:

...

Neighbor        V    AS MsgRcvd MsgSent TblVer InQ OutQ  Up/Down State/PfxRcd
sw-o(swp16)     4 65108  953559  938348      0   0    0 03w5d00h          688
sw-m(swp18)     4 65108  885442  938348      0   0    0 03w5d00h          688
s0001(swp1s0.3) 4 65300  748971  748977      0   0    0 03w5d00h            1
s0002(swp1s1.3) 4 65300  661787  661794      0   0    0 03w1d23h            1
s0003(swp1s2.3) 4 65300  748970  748977      0   0    0 03w5d00h            1
s0004(swp1s3.3) 4 65300  661868  661875      0   0    0 03w1d23h            1
s0005(swp2s0.3) 4 65300  748970  748976      0   0    0 03w5d00h            1
[...]

Note the host names/interface names - this is how you generally refer to neighbors, rather than using literal (IPv6) addresses.

Are the names based on DNS look-ups, or is there some kind of protocol association between the device underlay and its hostname, as it pertains to neighbors? Mark.

Nick Hilliard

2:30 p.m.

Mark Tinka wrote on 29/07/2020 15:09:

...

Are the names based on DNS look-ups, or is there some kind of protocol association between the device underlay and its hostname, as it pertains to neighbors?

afaik, this is an implementation of draft-walton-bgp-hostname-capability. Nick

Mark Tinka

2:51 p.m.

On 29/Jul/20 16:30, Nick Hilliard wrote:

...

afaik, this is an implementation of draft-walton-bgp-hostname-capability.

Nice. I'm curious to know if this is after-the-fact, as I can't think of a way that BGP would find hostnames to setup sessions with, outside of some kind of upper layer name resolution capability. The draft isn't clear on how this happens, if it is, indeed, before-the-fact. Mark.

Nick Hilliard

2:54 p.m.

Mark Tinka wrote on 29/07/2020 15:51:

...

I'm curious to know if this is after-the-fact, as I can't think of a way that BGP would find hostnames to setup sessions with, outside of some kind of upper layer name resolution capability.

The draft isn't clear on how this happens, if it is, indeed, before-the-fact.

it's a capability negotiation, so is handled on session setup. Nick

Mark Tinka

4:06 p.m.

On 29/Jul/20 16:54, Nick Hilliard wrote:

...

it's a capability negotiation, so is handled on session setup.

Meaning the initial setup would still require the use of literal IP addresses? Mark.

Chriztoffer Hansen

4:34 p.m.

On Wed, 29 Jul 2020 at 18:06, Mark Tinka wrote:

...

On 29/Jul/20 16:54, Nick Hilliard wrote:

...
it's a capability negotiation, so is handled on session setup.

Meaning the initial setup would still require the use of literal IP addresses?

Yes, unless your equipment (e.g. DC equipment) is set up for automatic BGP neighbour discovery using IPv6 ND+RA [2].

[0]: https://github.com/FRRouting/frr/commit/04b6bdc0ee6275442464edec1d14b3f4d3ea...
[1]: https://github.com/FRRouting/frr/search?p=3&q=hostname&type=Commits
[2]: https://docs.cumulusnetworks.com/cumulus-linux/Layer-3/Border-Gateway-Protoc...

--
Cheers,
CHRIZTOFFER

Nick Hilliard

4:35 p.m.

Mark Tinka wrote on 29/07/2020 17:06:

...

Meaning the initial setup would still require the use of literal IP addresses?

You can't use hostnames, if that's what you're asking. FRR will also do unnumbered BGP with auto-config. Nick

Mark Tinka

4:39 p.m.

On 29/Jul/20 18:35, Nick Hilliard wrote:

...

You can't use hostnames, if that's what you're asking.

Yes, couldn't fathom how. So really it's convenience of troubleshooting, not convenience of setup :-). I can live with that.

...

FRR will also do unnumbered BGP with auto-config.

Interesting... Mark.

Saku Ytti

2:57 p.m.

On Wed, 29 Jul 2020 at 17:54, Mark Tinka <mark.tinka@seacom.com> wrote:

...

I'm curious to know if this is after-the-fact, as I can't think of a way that BGP would find hostnames to setup sessions with, outside of some kind of upper layer name resolution capability.

The draft isn't clear on how this happens, if it is, indeed, before-the-fact.

I'm not sure I understand what the option space is. This is like ISIS TLV137, protocol will populate some trash there and you'll politely access. It won't allow you to refer to the peer with any name prior to having the session up. Much like you won't see ISIS neighbours name when session is establishing, until it has actually loaded and processed the TLV137. -- ++ytti

Mark Tinka

4:08 p.m.

On 29/Jul/20 16:57, Saku Ytti wrote:

...

I'm not sure I understand what the option space is. This is like ISIS TLV137, protocol will populate some trash there and you'll politely access. It won't allow you to refer to the peer with any name prior to having the session up. Much like you won't see ISIS neighbours name when session is establishing, until it has actually loaded and processed the TLV137.

The IS-IS comparison came to mind as well, yes. But IS-IS is different in that LSPs are dynamically flooded upon activation on an interface, and those LSPs carry router information, including the hostname.

BGP is not dynamically set up, so in my mind, you still need literal IP addresses to set sessions up, and then this BGP hostname capability would translate those IP addresses to remote hostnames after the sessions have been established.

I just wanted to clarify whether this is the practical implementation and operation, as the draft isn't specific on it, and I'm sure others may follow the same thought process.

Mark.
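The ordering described here (configure the session with an address or interface first, learn the name only once the peer's OPEN has arrived) can be sketched as a tiny state holder. This is an illustrative model and not FRR code: the `Peer` class and the `"hostname"` dictionary key are hypothetical stand-ins for however an implementation stores the decoded capability.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Peer:
    # What the operator configured: a literal address or an interface name.
    configured_as: str
    # Unknown until the peer's OPEN message has been received and processed.
    hostname: Optional[str] = None

    def on_open(self, capabilities):
        # "hostname" is a hypothetical key standing in for the decoded
        # hostname capability; a peer that does not send it leaves None.
        self.hostname = capabilities.get("hostname")

    def display_name(self):
        # Before the OPEN exchange, only the configured identifier is available.
        if self.hostname:
            return f"{self.hostname}({self.configured_as})"
        return self.configured_as


peer = Peer("swp16")
before = peer.display_name()
peer.on_open({"hostname": "sw-o"})
after = peer.display_name()
```

Here `before` is the bare interface name and `after` is the `sw-o(swp16)` form, which matches the reading that the capability helps with troubleshooting rather than with session setup.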

Simon Leinen

30 Jul

10 a.m.

New subject: BGP unnumbered examples from data center network using RFC 5549 et al. [was: Re: RFC 5549 - IPv4 Routes with IPv6 next-hop - Does it really exists?]

Mark Tinka writes:

...

On 29/Jul/20 15:51, Simon Leinen wrote:

...

...
Neighbor        V    AS MsgRcvd MsgSent TblVer InQ OutQ  Up/Down State/PfxRcd
sw-o(swp16)     4 65108  953559  938348      0   0    0 03w5d00h          688
sw-m(swp18)     4 65108  885442  938348      0   0    0 03w5d00h          688
s0001(swp1s0.3) 4 65300  748971  748977      0   0    0 03w5d00h            1
[...]

Note the host names/interface names - this is how you generally refer to neighbors, rather than using literal (IPv6) addresses.

...

Are the names based on DNS look-ups, or is there some kind of protocol association between the device underlay and its hostname, as it pertains to neighbors?

As Nick mentions, the hostnames are from the BGP hostname extension.

I should have noticed that, but we use "BGP unnumbered"[1][2], which uses RAs to discover the peer's IPv6 link-local address, and then builds an IPv6 BGP session (that uses RFC 5549 to transfer IPv4 NLRIs as well).

Here are some excerpts of the configuration on such a leaf router.

General BGP boilerplate:

------------------------------
router bgp 65111
 bgp router-id 10.1.1.46
 bgp bestpath as-path multipath-relax
 bgp bestpath compare-routerid
 !
 address-family ipv4 unicast
  network 10.1.1.46/32
  redistribute connected
  redistribute static
 exit-address-family
 !
 address-family ipv6 unicast
  network 2001:db8:1234:101::46/128
  redistribute connected
  redistribute static
 exit-address-family
------------------------------

Leaf switch <-> server connection:

(we use a 802.1q tagged subinterface for the BGP peering and L3 server traffic; the untagged interface is used only for netbooting the servers when (re)installing the OS. Here, servers just get IPv4+IPv6 default routes, and each server will only announce a single IPv4+IPv6 (loopback) address, i.e. the leaf/server links are also "unnumbered". Very simple redundant setup without any LACP/MLAG protocols... it's all just BGP+IPv6 ND. You can basically connect any server to any switch port and things will "just work" without special inter-switch links etc.)

------------------------------
interface swp1s0
 description s0001.s1.scloud.switch.ch p8p1
!
interface swp1s0.3
 description s0001.s1.scloud.switch.ch p8p1
 ipv6 nd ra-interval 3
 no ipv6 nd suppress-ra
!
[...]
router bgp 65111
 neighbor servers peer-group
 neighbor servers remote-as external
 neighbor servers capability extended-nexthop
 neighbor swp1s0.3 interface peer-group servers
 !
 address-family ipv4 unicast
  neighbor servers default-originate
  neighbor servers soft-reconfiguration inbound
  neighbor servers prefix-list DEFAULTV4-PERMIT out
 exit-address-family
 !
 address-family ipv6 unicast
  neighbor servers activate
  neighbor servers default-originate
  neighbor servers soft-reconfiguration inbound
  neighbor servers prefix-list DEFAULTV6-PERMIT out
 exit-address-family
!
ip prefix-list DEFAULTV4-PERMIT permit 0.0.0.0/0
!
ipv6 prefix-list DEFAULTV6-PERMIT permit ::/0
------------------------------

Leaf <-> spine:

------------------------------
interface swp16
 description sw-o port 22
 ipv6 nd ra-interval 3
 no ipv6 nd suppress-ra
!
[...]
router bgp 65111
 neighbor fabric peer-group
 neighbor fabric remote-as external
 neighbor fabric capability extended-nexthop
 neighbor swp16 interface peer-group fabric
 !
 address-family ipv4 unicast
  neighbor fabric soft-reconfiguration inbound
 !
 address-family ipv6 unicast
  neighbor fabric activate
  neighbor fabric soft-reconfiguration inbound
------------------------------

Note the "remote-as external" - this will accept any AS other than the router's own AS. AS numbering in this DC setup is a bit weird if you're used to BGP... each leaf switch has its own AS, all spine switches should have the same AS number (for reasons...), and all servers have the same AS because who cares. (We are talking about three disjoint sets of AS numbers for leaves/spines/servers though.)

--
Simon.

[1] https://cumulusnetworks.com/blog/bgp-unnumbered-overview/
[2] https://support.cumulusnetworks.com/hc/en-us/articles/212561648-Configuring-...
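Since the per-port configuration contains no addresses or peer AS numbers, it lends itself to templating. The following is a rough sketch of how the server-facing stanzas from the excerpts above could be generated for a list of ports. The `server_port_config` helper is hypothetical, it emits only commands that appear in the configuration above, and it leaves out descriptions, address-family sections and prefix-lists.

```python
def server_port_config(local_as, ports, peer_group="servers"):
    """Build the interface and BGP neighbor lines for unnumbered server ports."""
    lines = []
    for port in ports:
        # Router advertisements let each side learn the other's link-local address.
        lines += [
            f"interface {port}",
            " ipv6 nd ra-interval 3",
            " no ipv6 nd suppress-ra",
            "!",
        ]
    lines += [
        f"router bgp {local_as}",
        f" neighbor {peer_group} peer-group",
        # "external" accepts any peer AS other than our own.
        f" neighbor {peer_group} remote-as external",
        # Allows IPv4 NLRI to be exchanged with an IPv6 next-hop.
        f" neighbor {peer_group} capability extended-nexthop",
    ]
    # One interface-based neighbor per port; no addresses are configured.
    lines += [f" neighbor {port} interface peer-group {peer_group}" for port in ports]
    return "\n".join(lines)


config = server_port_config(65111, ["swp1s0.3", "swp1s1.3"])
```

Adding a server port then means adding one entry to the `ports` list, which mirrors the "connect any server to any switch port" property described above.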

Mark Tinka

12:56 p.m.

New subject: BGP unnumbered examples from data center network using RFC 5549 et al. [was: Re: RFC 5549 - IPv4 Routes with IPv6 next-hop - Does it really exists?]

On 30/Jul/20 12:00, Simon Leinen wrote:

...

Interesting. Data centre bits are, interesting :-). Thanks for sharing. Mark.
