Hello

I am investigating Linux as a BNG. The BNG (Broadband Network Gateway) being the thing that acts as default gateway for our customers.

The setup is one VLAN per customer. Because 4095 VLANs is not enough, we have QinQ with double VLAN tagging on the customers. The customers can use DHCP or static configuration. DHCP packets need to be option 82 tagged and forwarded to a DHCP server. Every customer has one or more static IP addresses.

IPv4 subnets need to be shared among multiple customers to conserve address space. We are currently using /26 IPv4 subnets with 60 customers sharing the same default gateway and netmask. In Linux terms this means 60 VLAN interfaces per bridge interface.

However Linux is not quite ready for the task. The primary problem is that the system does not scale to thousands of VLAN interfaces.

We do not want customers to be able to send non-routed packets directly to each other (this needs proxy ARP). Also, customers should not be able to steal another customer's IP address. We want to hard-code the relation between IP address and VLAN tagging. This can be implemented using ebtables, but we are unsure that it could scale to thousands of customers.

I am considering writing a small program or kernel module. This would create two TAP devices (tap0 and tap1). Traffic received on tap0 with VLAN tagging will be stripped of VLAN tagging and delivered on tap1. Traffic received on tap1 without VLAN tagging will be tagged according to a lookup table using the destination IP address and then delivered on tap0. ARP and DHCP would need some special handling.

This would be completely stateless for the IPv4 implementation. The IPv6 implementation would be harder, because link-local addressing needs to be supported and that cannot be stateless. The customer CPE will make up its own link-local address based on its MAC address and we do not know what that is in advance.

The goal is to support a minimum of 10 Gbit/s of traffic per server. Ideally I would have a server with 4x 10 Gbit/s interfaces combined into two 20 Gbit/s channels using bonding (LACP), one channel each for upstream and downstream (customer facing). The upstream would be layer 3 untagged and routed traffic to our transit routers.

I am looking for comments, ideas or alternatives. Right now I am considering what kind of CPU would be best for this. Unless I take steps to mitigate it, the workload would probably go to one CPU core only and be limited by things like CPU cache and PCI bus bandwidth.

Regards,

Baldur
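To make the ebtables idea above concrete, here is a minimal sketch of the per-customer pinning. Everything in it is invented for illustration: a bridge with a per-customer QinQ interface named cust100.200 (outer tag 100, inner tag 200) and an assigned address of 198.51.100.10.

    # assumption: cust100.200 is the customer-facing bridge port, 198.51.100.10 its only allowed IPv4 address
    # drop IPv4 frames from this port with any other source address
    ebtables -A FORWARD -i cust100.200 -p IPv4 --ip-src ! 198.51.100.10 -j DROP
    # drop ARP from this port claiming any other sender address (anti-spoofing)
    ebtables -A FORWARD -i cust100.200 -p ARP --arp-ip-src ! 198.51.100.10 -j DROP
    # block direct customer-to-customer bridging; traffic must go via the router port
    ebtables -A FORWARD -i cust100.200 -o cust+ -j DROP

Whether a flat rule set like this scales to thousands of customers is exactly the open question; the rules are walked linearly unless they are split into per-interface chains.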
interspersed comments .... On 07/14/2018 06:13 AM, Baldur Norddahl wrote:
I am investigating Linux as a BNG. The BNG (Broadband Network Gateway) being the thing that acts as default gateway for our customers.
The setup is one VLAN per customer. Because 4095 VLANs is not enough, we have QinQ with double VLAN tagging on the customers. The customers can use DHCP or static configuration. DHCP packets need to be option82 tagged and forwarded to a DHCP server. Every customer has one or more static IP addresses.
Where do you have this happening? Do you have aggregation switches doing this? Are those already in place, or being planned? Because I would make a suggestion for how to do the aggregation.
IPv4 subnets need to be shared among multiple customers to conserve address space. We are currently using /26 IPv4 subnets with 60 customers sharing the same default gateway and netmask. In Linux terms this means 60 VLAN interfaces per bridge interface.
I suppose it could be made to work, but forcing a layer 3 boundary over a bunch of layer 2 boundaries seems like a lot of work. I suppose that would be the brute-force-and-ignorance approach, given the mechanisms you would be using.
However Linux is not quite ready for the task. The primary problem being that the system does not scale to thousands of VLAN interfaces.
It probably depends upon which Linux based tooling you wish to use. There are some different ways of looking at this which scale better.
We do not want customers to be able to send non routed packets directly to each other (needs proxy arp). Also customers should not be able to steal another customers IP address. We want to hard code the relation between IP address and VLAN tagging. This can be implemented using ebtables, but we are unsure that it could scale to thousands of customers.
I would consider suggesting the concepts of VxLAN (kernel plus FRR and/or Open vSwitch) or OpenFlow (kernel plus Open vSwitch). VxLAN scales to 16 million VLAN equivalents, which is why I ask about your aggregation layers. Rather than trying to do all the addressing across all the QinQ VLANs in the core boxes, the VLANs/VxLANs and addressing are best dealt with at the edge. Then, rather than running a bunch of VLANs through your aggregation/distribution links, you can keep those resilient with a layer-3-only based strategy.
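For what it's worth, a rough sketch of what one customer service could look like with plain iproute2, before FRR/EVPN takes over remote VTEP learning. The names, VNI and address are invented for the example:

    # assumption: 192.0.2.1 is this box's VTEP address, VNI 10100 carries one customer service
    ip link add vx-10100 type vxlan id 10100 dstport 4789 local 192.0.2.1 nolearning
    ip link add br-10100 type bridge
    ip link set vx-10100 master br-10100
    ip link set dev vx-10100 up
    ip link set dev br-10100 up

With EVPN (FRR) the MAC/IP reachability for each VNI is then carried in BGP rather than flooded.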
I am considering writing a small program or kernel module. This would create two TAP devices (tap0 and tap1). Traffic received on tap0 with VLAN tagging, will be stripped of VLAN tagging and delivered on tap1. Traffic received on tap1 without VLAN tagging, will be tagged according to a lookup table using the destination IP address and then delivered on tap0. ARP and DHCP would need some special handling.
I don't think this would be needed. I think all the tools are already available and are robust from daily use. Free Range Routing with EVPN/(VxLAN|MPLS) for a traditional routing mix, or use OpenFlow tooling in Open vSwitch to handle the layer 2 and layer 3 rule definitions you have in mind. Open vSwitch can be programmed via command line rules or can be hooked up to a controller of some sort. So rather than writing your own kernel program, you would write rules for a controller or script which drives the already kernel resident engines.
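As an illustration only, this is the kind of Open vSwitch rule that would express the IP-to-VLAN binding described above. Port numbers, the VLAN ID and the address are placeholders, and real QinQ would additionally need OpenFlow 1.3 push/pop of the outer tag:

    # customer -> core: accept only the bound source IP on the customer VLAN, then pop the tag
    ovs-ofctl add-flow br0 "priority=100,in_port=1,dl_vlan=100,ip,nw_src=198.51.100.10,actions=strip_vlan,output:2"
    # core -> customer: re-tag by destination IP lookup
    ovs-ofctl add-flow br0 "priority=100,in_port=2,ip,nw_dst=198.51.100.10,actions=mod_vlan_vid:100,output:1"
    # anything that matches nothing above is dropped
    ovs-ofctl add-flow br0 "priority=0,actions=drop"

A controller (or a provisioning script) would install one such pair of flows per customer address.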
This would be completely stateless for the IPv4 implementation. The IPv6 implementation would be harder, because Link Local addressing needs to be supported and that can not be stateless. The customer CPE will make up its own Link Local address based on its MAC address and we do not know what that is in advance.
FRR and OVS are IPv4 and IPv6 aware. The dynamics of the CPE MAC would be handled in various ways, depending upon what tooling you decide upon.
The goal is to support traffic of minimum of 10 Gbit/s per server. Ideally I would have a server with 4x 10 Gbit/s interfaces combined into two 20 Gbit/s channels using bonding (LACP). One channel each for upstream and downstream (customer facing). The upstream would be layer 3 untagged and routed traffic to our transit routers.
As mentioned earlier, why make the core boxes do all of the work? Why not distribute the functionality out to the edge? Rather than using traditional switch gear at the edge, use smaller Linux boxes to handle all that complicated edge manipulation, and then keep your high bandwidth core boxes pushing packets only.
I am looking for comments, ideas or alternatives. Right now I am considering what kind of CPU would be best for this. Unless I take steps to mitigate, the workload would probably go to one CPU core only and be limited to things like CPU cache and PCI bus bandwidth.
There is much more to write about, but those writings would depend upon what you already have in place, what you would like to put in place, and how you wish to segment your network. Hope this helps.
Baldur
-- Raymond Burkholder ray@oneunified.net https://blog.raymond.burkholder.net
I agree with all aspects. On 07/14/2018 11:09 AM, Raymond Burkholder wrote:
As mentioned earlier, why make the core boxes do all of the work? Why not distribute the functionality out to the edge? Rather than using traditional switch gear at the edge, use smaller Linux boxes to handle all that complicated edge manipulation, and then keep your high bandwidth core boxes pushing packets only. But I do ask:
Do you (the ISP) control the CPE (modem / ONT)? Could you push the VxLAN (or maybe MPLS) functionality all the way into it? This would have the added advantage of a (presumably) trusted device providing the identification back to your core equipment. Perhaps even minimal L3 routing w/ DHCP helper such that the customer saw the CPE as the default gateway. (Though this might burn a lot more IPs. This might not be an issue if you're using CGNAT.) -- Grant. . . . unix || die
On 14/07/2018 at 19.09, Raymond Burkholder wrote:
Where do you have this happening? Do you have aggregation switches doing this? Are those already in place, or being planned? Because I would make a suggestion for how to do the aggregation.
The POI (Point of Interconnect) with the incumbent telco is one customer per VLAN using QinQ. This telco owns all the copper and runs the VDSL2 DSLAMs. They give us a transparent ethernet tunnel to the CPE. We own the CPE. Internally the incumbent uses an MPLS network to transport the VLANs. We in turn also use MPLS with L2VPN to transport the traffic to one of two datacenters.

In addition we have our own FTTH network in the ground. This is GPON on Zhone equipment. To make things easier we made the Zhone GPON OLT emulate the same one-VLAN-per-customer setup.

The incumbent has telco buildings in each city area. Typically the distance between buildings is 10 km. In each such building they have a room where alternative telcos, like us, can rent rack space. The only available power is -48V DC. We currently only have Zhone MXK GPON switches and ZTE MPLS switch equipment in these facilities.

Our current BNG solution is some big iron routers (ZTE M6000). This is a device that will do things like 4 million routes in hardware and move many Tb/s (not that we have traffic anywhere near that level). It works well enough but is not perfect. I think a discussion of the BNG limitations of the ZTE M6000 would be a different thread. One of the problems with the ZTE M6000 is the price, and that goes double for any alternatives mentioned here (Cisco, Juniper etc). Right now I am facing the prospect of investing in more line cards for the M6000. I can buy a few servers for the price of one line card and perhaps get a solution that is more "perfect".

As for VXLAN, the Zhone MXK cannot do it and it is not an option for the POI with the incumbent. It would be an alternative to running MPLS, but we are happy with the MPLS solution.

I have considered OpenFlow and might do that. We have OpenFlow capable switches and I may be able to offload the work to the switch hardware. But I also consider this solution harder to get right than the idea of using Linux with tap devices. Also it appears that Open vSwitch implements a different flavour of OpenFlow than the hardware switch (the hardware is limited to some fixed tables that Broadcom made up), so I might not be able to start with the software and then move on to hardware.

Regards,
Baldur
On 2018-07-14 22:05, Baldur Norddahl wrote:
I have considered OpenFlow and might do that. We have OpenFlow capable switches and I may be able to offload the work to the switch hardware. But I also consider this solution harder to get right than the idea of using Linux with tap devices. Also it appears the Openvswitch implements a different flavour of OpenFlow than the hardware switch (the hardware is limited to some fixed tables that Broadcom made up), so I might not be able to start with the software and then move on to hardware.

AFAIK OpenFlow is suitable for datacenters, but doesn't scale well for user termination purposes. You will run out of TCAM much sooner than you expect. A Linux tap device has very high overhead; it is suited to no more than acting as a hotspot gateway for hundreds of users.
Regards,
Baldur
On 07/15/2018 09:03 AM, Denys Fedoryshchenko wrote:
On 2018-07-14 22:05, Baldur Norddahl wrote:
I have considered OpenFlow and might do that. We have OpenFlow capable switches and I may be able to offload the work to the switch hardware. But I also consider this solution harder to get right than the idea of using Linux with tap devices. Also it appears the Openvswitch implements a different flavour of OpenFlow than the hardware switch (the hardware is limited to some fixed tables that Broadcom made up), so I might not be able to start with the software and then move on to hardware.
AFAIK openflow is suitable for datacenters, but doesn't scale well for users termination purposes. You will run out from TCAM much sooner than you expect.
Denys, could you expand on this? In a Linux based solution (say with OVS), TCAM is memory/software based, and in following their dev threads, they have been optimizing flow caches continuously for various types of flows: megaflow, tiny flows, flow quantity and variety, caching, ...

When you mate OVS with something like a Mellanox Spectrum switch (via SwitchDev) for hardware based forwarding, I could see certain hardware limitations applying, but I don't have first hand experience with that. But I suppose you will see these TCAM issues on hardware-only specialized OpenFlow switches. On edge based translations, is hardware based forwarding actually necessary, since there are so many software functions being performed anyway?

But I think a clarification on Baldur's speed requirements is needed. He indicates that there are a bunch of locations: does each of the locations require 10G throughput, or was the throughput defined for all sites in aggregate? If the sites individually have smaller throughput, the software based boxes might do, but if that is at each site, then software-only boxes may not handle the throughput.

But then, it may be conceivable that buying a number of servers and load spreading across them will provide some resiliency and will come in at a lower cost than putting in 'big iron' anyway. Because then there are some additional benefits: you can run Network Function Virtualization at the edge and provide additional services to customers.

I forgot to mention this in the earlier thread, but there are some companies out there which provide devices with many ports on them and provide compute at the same time. So software based Linux switches are possible, without resorting to a combination of physical switch and separate compute box. In a Linux based switch, by using IRQ affinity, traffic from ports can be balanced across CPUs. So by collapsing switch and compute, additional savings might be realized.

As a couple of side notes: 1) the DPDK people support a user space dataplane version of OVS/OpenFlow, and 2) an eBPF version of the OVS dataplane is being worked on. In summary, OVS supports three current dataplanes with a fourth on the way: 1) native kernel, 2) hardware offload via TC (SwitchDev), 3) DPDK, 4) eBPF.
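As a concrete example of the IRQ affinity point, here is a small sketch, assuming a NIC driver that exposes one interrupt per RX/TX queue and names them after the interface (as most 10G drivers do), and that irqbalance has been stopped so it does not rewrite the result:

    # spread eth0's per-queue interrupts round-robin over the available cores
    cpus=$(nproc)
    i=0
    for irq in $(awk -F: '/eth0/ {gsub(/ /,"",$1); print $1}' /proc/interrupts); do
        echo $((i % cpus)) > /proc/irq/$irq/smp_affinity_list
        i=$((i + 1))
    done

Combined with RSS hashing on the NIC, this is usually enough to keep a 10G workload from collapsing onto a single core.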
Linux tap device has very high overhead, it suits no more than working as some hotspot gateway for 100s of users.
As does the 'veth' construct.
-- Raymond Burkholder ray@oneunified.net https://blog.raymond.burkholder.net
On 15/07/2018 at 18.00, Raymond Burkholder wrote:
But I think a clarification on Baldur's speed requirements is needed. He indicates that there are a bunch of locations: do each of the locations require 10G throughput, or was the throughput defined for all sites in aggregate? If the sites indivdiually have smaller throughput, the software based boxes might do, but if that is at each site, then software-only boxes may not handle the throughput.
We have considerably more than 10G of total traffic. We are currently transporting it all to one of two locations before doing the BNG function. We then have VRRP to enable failover to the other location. Transport is by MPLS and L2VPN.

I set the goal post at 10G per server. To handle more traffic we will have multiple servers. Load balancing does not need to be dynamic. We would just distribute the customers so each customer is always handled by the same server. 10G per server translates to approximately 5000 customers per server (in 2018; this number is expected to drop as time goes on).

I am wondering if we could make an open source system (it does not strictly have to be Linux) that could do the BNG function at 10G per server, with a server in the price range of 1k - 2k USD. For many sizes of ISP this would be far, far cheaper than any of the solutions from Cisco, Juniper et al. Even if you had to get 10 servers to handle 100G you would likely still come out ahead of the big iron solution. And for a startup (like us) it is great to be able to start out with little investment and then let the solution grow with the business.

Regards,
Baldur
On 2018-07-15 19:00, Raymond Burkholder wrote:
On 07/15/2018 09:03 AM, Denys Fedoryshchenko wrote:
On 2018-07-14 22:05, Baldur Norddahl wrote:
I have considered OpenFlow and might do that. We have OpenFlow capable switches and I may be able to offload the work to the switch hardware. But I also consider this solution harder to get right than the idea of using Linux with tap devices. Also it appears the Openvswitch implements a different flavour of OpenFlow than the hardware switch (the hardware is limited to some fixed tables that Broadcom made up), so I might not be able to start with the software and then move on to hardware.
AFAIK openflow is suitable for datacenters, but doesn't scale well for users termination purposes. You will run out from TCAM much sooner than you expect.
Denys, could you expand on this? In a linux based solution (say with OVS), TCAM is memory/software based, and in following their dev threads, they have been optimizing flow caches continuously for various types of flows: megaflow, tiny flows, flow quantity and variety, caching, ...
When you mate OVS with something like a Mellanox Spectrum switch (via SwitchDev) for hardware based forwarding, I could see certain hardware limitations applying, but don't have first hand experience with that.
But I suppose you will see these TCAM issues on hardware only specialized openflow switches.

Yes, definitely only on hardware switches, and the biggest issue is that it is vendor and hardware dependent. This means that if I find the "right" switch and make your solution depend on it, and the vendor decides to issue a new revision, or even new firmware, there is no guarantee the "unusual" setup will keep working. That is what makes many people afraid to use it.
OpenFlow, IMO, is by nature built to do complex matching; for example, for a typical 12-tuple match it is 750-4000 entries max in switches. If you go to L2-only matching, which at the moment I tested was possible, in my experience, only on the PF5820, then it can go to 80k flows. But again, sticking to a specific vendor is not recommended.

About OVS, I didn't look much at it, as I thought it is not suitable for BNG purposes, i.e. terminating tens of thousands of users; I thought it is more about high speed switching for tens of VMs.
On edge based translations, is hardware based forwarding actually necessary, since there are so many software functions being performed anyway?
IMO, at the current moment 20-40G on a single box is the boundary point where packet forwarding is preferable (but still not necessary) to do in hardware, as passing packets through the whole Linux stack is really not the best option. But it works. I'm trying to find an alternative solution, bypassing the full stack using XDP, so I can go beyond 40G.
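For context, attaching such a program is a one-liner once the eBPF object is built; the object file name here is hypothetical and would contain the VLAN/IP mapping logic compiled with clang's BPF target:

    # attach an XDP program to the customer-facing NIC (driver mode where supported)
    ip link set dev eth0 xdp obj bng_fwd.o sec xdp
    # detach it again
    ip link set dev eth0 xdp off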
But then, it may be conceivable that by buying a number of servers, and load spreading across the servers will provide some resiliency and will come in at a lower cost than putting in 'big iron' anyway.
Because then there are some additional benefits: you can run Network Function Virtualization at the edge and provide additional services to customers.
+1. For IPoE/PPPoE, servers scale very well, while on "hardware" you will eventually hit a limit on how many line cards you can put in a chassis, and then you need to buy a new chassis. And that is not counting the countless unobvious limitations you might hit inside a chassis (in the pretty old Cisco 6500/7600, which is not EOL, it is a nightmare). If an ISP has a big enough chassis, it needs to remember that it needs a second one at the same place, preferably with the same number of line cards, while with servers you are more resilient even with N+M redundancy (where M is, for example, N/4). Also, when premium customers ask me for some unusual things, it is much easier to move them to separate nodes with extended options for termination, where I can implement their demands with a custom vCPE.
On Sun, 15 Jul 2018 at 18.57, Denys Fedoryshchenko <denys@visp.net.lb> wrote:
Openflow IMO by nature is built to do complex matching, and for example for typical 12-tuple it is 750-4000 entries max in switches, but you go to l2 only matching which was possible at moment i tested, on my experience, only on PF5820 - you can do L2 entries only matching, then it can go 80k flows. But again, sticking to specific vendor is not recommended.
It would be possible to implement a general forward-to-controller policy and then upload MAC-address-only matching as an offload strategy. You would have a different device doing the layer 3 stuff. The OpenFlow switch would just add and remove VLAN tagging based on MAC matching. Regards
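A sketch of that split, in the flow syntax ovs-ofctl accepts (port numbers and the MAC are placeholders): unknown traffic is punted to the controller, and once a customer is learned the controller installs plain L2 matches that only rewrite the VLAN:

    # default rule: punt unmatched packets to the controller
    ovs-ofctl add-flow br0 "priority=0,actions=CONTROLLER:128"
    # installed per customer by the controller: MAC-only match, VLAN rewrite, no L3 logic
    ovs-ofctl add-flow br0 "priority=100,in_port=2,dl_dst=02:00:00:00:01:02,actions=mod_vlan_vid:100,output:1"
    ovs-ofctl add-flow br0 "priority=100,in_port=1,dl_vlan=100,dl_src=02:00:00:00:01:02,actions=strip_vlan,output:2"

A hardware switch that can only do the Broadcom fixed tables may still be able to offload MAC/VLAN matches like these, which is the point of keeping the layer 3 work elsewhere.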
On 07/15/2018 10:56 AM, Denys Fedoryshchenko wrote:
About OVS, i didnt looked much at it, as i thought it is not suitable for BNG purposes, like for tens of thousands users termination, i thought it is more about high speed switching for tens of VM.
I would call it more of a generic all-purpose tool for customized L2/L3/L4/L5 packet forwarding. It works well for datacenter as well as ISP related scenarios, due to the wide variety of rule matching, the encapsulations supported, and the ability to attach a customized controller for specialized packet handling.
On edge based translations, is hardware based forwarding actually necessary, since there are so many software functions being performed anyway? IMO at current moment 20-40G on single box is a boundary point when packet forwarding is preferable(but still not necessary) to do in hardware, as passing packets thru whole Linux stack is really not best option. But it works. I'm trying to find an alternative solution, bypassing full stack using XDP, so i can go beyond 40G.
Tied to XDP is eBPF (which is what makes tcpdump fast). Another tool is P4 which provides tools to build customized SW/HW forwarders. But I'm not sure how applicable it is to BNG. -- Raymond Burkholder ray@oneunified.net https://blog.raymond.burkholder.net
The setup is one VLAN per customer. Because 4095 VLANs is not enough, we have QinQ with double VLAN tagging on the customers. The customers can use DHCP or static configuration. DHCP packets need to be option82 tagged and forwarded to a DHCP server. Every customer has one or more static IP addresses.
What you are describing is how the national fibre network delivers customers to the ISPs in New Zealand (with DHCP and PPP being the ISP's choice). Generally in New Zealand we have a very active Linux community, but I have to say I have not seen any of the service providers attempt to use Linux as a BNG in this way for production customers.

Commonly an MPLS network is used to transport these QinQ layer 2 "handovers" to centralised BNGs. These BNGs in my experience are normally Cisco (ASR1k, ASR9k), Juniper (MX, hardware or virtual), Nokia (7750 hardware or virtual) and a small amount of Mikrotik (which tends to get swapped out for the previous vendor solutions when scale (2000+) rises). As much as I appreciate Linux, I personally still also see the value of the vendor offerings in this case (think stability and guaranteed performance).

My biggest issue with the vendor offerings is that they are not making their virtual offerings (VMX, VSR) attractive enough pricing wise at the small scale. We have successful virtual Juniper and Nokia BNGs in production, but pricing wise it generally ends up with the service provider thinking that hardware was probably a better choice in the long run.
On 14/Jul/18 20:29, tony@wicks.co.nz wrote:
My biggest issue with the vendor offerings is that they are not making their virtual offerings (VMX, VSR) attractive enough pricing wise at the small scale, we have successful virtual Juniper and Nokia BNG's in production but pricing wise it generally ends up with the service provider thinking that hardware was probably a better choice in the long run.
vBNGs are fraught with imperceptible license fees. But yes, if you had to deploy a BNG on an x86 box, a vBNG from a well-known vendor will likely be less problematic to set up and scale than a BNG based on Linux. Then again, I last researched this in 2016, so things could have changed a tad. Mark.
On 2018-07-14 15:13, Baldur Norddahl wrote:
Hello
I am investigating Linux as a BNG. The BNG (Broadband Network Gateway) being the thing that acts as default gateway for our customers.
The setup is one VLAN per customer. Because 4095 VLANs is not enough, we have QinQ with double VLAN tagging on the customers. The customers can use DHCP or static configuration. DHCP packets need to be option82 tagged and forwarded to a DHCP server. Every customer has one or more static IP addresses.
IPv4 subnets need to be shared among multiple customers to conserve address space. We are currently using /26 IPv4 subnets with 60 customers sharing the same default gateway and netmask. In Linux terms this means 60 VLAN interfaces per bridge interface.
However Linux is not quite ready for the task. The primary problem being that the system does not scale to thousands of VLAN interfaces.
We do not want customers to be able to send non routed packets directly to each other (needs proxy arp). Also customers should not be able to steal another customers IP address. We want to hard code the relation between IP address and VLAN tagging. This can be implemented using ebtables, but we are unsure that it could scale to thousands of customers.
I am considering writing a small program or kernel module. This would create two TAP devices (tap0 and tap1). Traffic received on tap0 with VLAN tagging, will be stripped of VLAN tagging and delivered on tap1. Traffic received on tap1 without VLAN tagging, will be tagged according to a lookup table using the destination IP address and then delivered on tap0. ARP and DHCP would need some special handling.
This would be completely stateless for the IPv4 implementation. The IPv6 implementation would be harder, because Link Local addressing needs to be supported and that can not be stateless. The customer CPE will make up its own Link Local address based on its MAC address and we do not know what that is in advance.
The goal is to support traffic of minimum of 10 Gbit/s per server. Ideally I would have a server with 4x 10 Gbit/s interfaces combined into two 20 Gbit/s channels using bonding (LACP). One channel each for upstream and downstream (customer facing). The upstream would be layer 3 untagged and routed traffic to our transit routers.
I am looking for comments, ideas or alternatives. Right now I am considering what kind of CPU would be best for this. Unless I take steps to mitigate, the workload would probably go to one CPU core only and be limited to things like CPU cache and PCI bus bandwidth.
accel-ppp supports IPoE termination for both IPv4 and IPv6, with RADIUS and everything. It is also written in such a way that it will utilize a multicore server efficiently (it might need some tuning, depending on the hardware). It should handle 2x10G easily on a decent server; for 4x10G it depends on your hardware and how well the tuning is done.
Hi Baldur, On 14/07/2018 at 14:13, Baldur Norddahl wrote:
I am investigating Linux as a BNG
As we say in France, it's like you're trying to buttfuck flies (a local saying standing for "reinventing the wheel for no practical reason"). Linux's kernel networking stack is not made for this kind of job. 6WIND or fd.io may be right on the spot, but it's still a lot of dark magic for something that has been done over and over for the past 20 years by most vendors. And it just works.

DHCP (implying straight L2 from the CPE to the BNG) may be an option, but most codebases are still young. PPP, on the other hand, is field-tested for extremely large scale deployments with most vendors.

If I were in your shoes, and I don't say I'd want to be (my BNGs are scaled to less than a few thousand subscribers, with 1-4 concurrent sessions each), I'd stick to the plain old bitstream (PPP) model, with a decent subscriber framework on my BNGs (I mostly use Juniper MXs, but I also like Nokia's and Cisco's for some features).

But let's say we would want to go forward and ditch legacy / proprietary code to surf on the NFV bullshit-wave. What would you actually need? Linux does soft-recirculation at every encapsulation level by memory copy. You can't scale anything with that. You need to streamline decapsulation with 6WIND's turborouter or fd.io frameworks. It'll cost you a few thousand man-hours to implement your first prototype.

Let's say you get a working framework to treat subsequent headers on the fly (because decapsulation is not really needed; what you want is just to forward the payload, right?)… Well, you'd need to address provisioning protocols on the same layers. Who would want to rebase a DHCP server with alien packet forms incoming? I guess no one.

Well, I could hold forth on the topic for hours, because I've already spent months addressing such design issues in scalable ISP networks, and the conclusion is:

- PPPoE is simple and proven. Its rigid structure alleviates most of the dual-stack issues. It is well supported and largely deployed.

- DHCP requires hacks (in the form of undocumented options from several vendors) to seemingly work on IPv4, but the multicast boundaries for NDP are a PITA to handle, so no one has implemented that properly yet. So it is to be avoided for now.

- Subscriber frameworks, be they Juniper's, Cisco's or Nokia's, are at the core of the largest residential ISPs out there. It Just Works. Trust them.

That being said, I love the idea of NFV-ing all the things, let it be BNGs first, because those bricks in the wall are the most fragile we have to maintain. But I clearly won't stand for an alternative to traditional offerings just yet: it's too critical, and it's a PITA to build from scratch and scale.

Best regards,

-- Jérôme Nicolle +33 6 19 31 27 14
On 2018-07-15 06:09, Jérôme Nicolle wrote:
Hi Baldur,
On 14/07/2018 at 14:13, Baldur Norddahl wrote:
I am investigating Linux as a BNG
As we say in France, it's like you're trying to buttfuck flies (a local saying standing for "reinventing the wheel for no practical reason").

You can say that about the whole open source ecosystem: why bother, if *proprietary solution name* exists? It is an endless flamewar topic.
Linux' kernel networking stack is not made for this kind of job. 6WIND or fd.io may be right on the spot, but it's still a lot of dark magic for something that has been done over and over for the past 20 years by most vendors.
And it just works.
Linux developers are working continuously to improve this. For example the latest feature, XDP, is able to process several Mpps on a <$1000 server. Ask yourself why Cloudflare "buttfucks flies" instead of buying from some proprietary vendor who has been doing filtering in hardware for 20 years: https://blog.cloudflare.com/how-to-drop-10-million-packets/ I am doing experiments with XDP as well, to terminate PPPoE, and it is handling that quite well.
DHCP (implying straight L2 from the CPE to the BNG) may be an option bust most codebases are still young. PPP, on the other hand, is field-tested for extremely large scale deployments with most vendors.
DHCP has been here at least since RFC 2131 came out in March 1997. Quite old, isn't it? When you stick to PPPoE, you tie yourself to the necessary layers of encapsulation/decapsulation, and this seriously degrades performance, at the _user_ level at least. With some experience developing firmware for routers, I can tell you that hardware offloading of IPv4 routing (DHCP) is obviously much easier and cheaper than offloading PPPoE encap/decap plus IPv4 routing. Also, vendors keep screwing up their routers with PPP; for example, one of them failed to process PADO properly in its newest firmware revision. Another problem: with PPPoE you subscribe to a headache called reduced MTU, which will also give ISP support a lot of unpleasant hours.
If I were in you shooes, and I don't say I'd want to (my BNGs are scaled to less than a few thousand of subscribers, with 1-4 concurrent session each), I'd stick to plain old bitstream (PPP) model, with a decent subscriber framework on my BNGs (I mostly use Juniper MXs, but I also like Nokia's and Cisco's for some features).
I consult for operators ranging from a few hundred to hundreds of thousands of subscribers. It is very rare that a Linux BNG doesn't suit them.
But let's say we would want to go forward and ditch legacy / proprietary code to surf on the NFV bullshit-wave. What would you actually need ?
Linux does soft-recirculation at every encapsulation level by memory copy. You can't scale anything with that. You need to streamline decapsulation with 6wind's turborouter or fd.io frameworks. It'll cost you a few thousand of man-hours to implement your first prototype.
6WIND/fd.io are great solutions, but not suitable for the mentioned task. They are mostly created for very tailor-made tasks, or even as the core of some vendor solution. Implementing your BNG on such frameworks, or on DPDK, really is reinventing the wheel, unless you will sell it or can save millions of US$ by doing so.
Let's say you got a woking framework to treat subsequent headers on the fly (because decapsulation is not really needed, what you want is just to forward the payload, right ?)… Well, you'd need to address provisionning protocols on the same layers. Who would want to rebase a DHCP server with alien packet forms incoming ? I gess no one.
accel-ppp does all that, precisely for IPoE termination, and there is no black magic there.
Well, I could dissert on the topic for hours, because I've already spent months to address such design issues in scalable ISP networks, and the conclusion is :
- PPPoE is simple and proven. Its rigid structure alleviates most of the dual-stack issues. It is well supported and largelly deployed.
PPPoE has VERY serious flaws.

1) The security of PPPoE sucks big time. Anybody who runs a rogue PPPoE server in your network will create a significant headache for you, while with DHCP you at least have "DHCP snooping". DHCP snooping is supported in very many vendors' switches, while for PPPoE most of them have nothing, except... sticking each user in his own VLAN. Why PPPoX them then?

2) DHCP can send circuit information in Option 82. This is very useful for billing and very cost efficient on the last stage of access switches.

3) Modern FTTX (GPON) solutions are built with QinQ in mind, so IPoE fits there flawlessly.
- DHCP requires hacks (in the form of undocumented options from several vendors) to seemingly work on IPv4, but the multicast boundaries for NDP are a PITA to handle, so no one implemented that properly yet. So it is to avoid for now.

While you can do multicast (mostly for IPTV; yes, it is not easy and needs some vendor magic) on the "native" layer (DHCP), with PPP you can forget about multicast entirely.
- Subscriber frameworks, be it Juniper's, Cisco's or Nokia's, are at the core of the largest residential ISPs out there. It Just Works. Trust them.

Sticking to "It Just Works" means "zero innovation" as well. For example, while everybody said so, the Mikrotik guys appeared, and quite possibly in total numbers their solutions now serve more users than Cisco or Nokia, at a much lower cost. But sure, they have their own market niche; Mikrotik doesn't fit well for large deployments.
That being said, I love the idea of NFV-ing all the things, let it be BNGs first because those bricks in the wall are the most fragile we have to maintain.
But I cleraly won't stand for an alternative to traditionnal offerings just yet : it's too critical, and it's a PITA to build from scratch and scale.
Best regards,
A lot of people who just sit in their warm monthly-salary chair and care only about their personal stability (and not their employer's profit) will ask the employer to pay for an expensive *vendor name* solution, as it is the safest bet for them. So, if the person implementing the solution is just a "corporate screw", he will say *vendor*. Nassim Taleb's book "Skin in the Game" perfectly explains why they will make the worst choice. If the person is an entrepreneur, he will start a feasibility study.
Hi Baldur,

These guys made a PPPoE client for VPP - you could probably extend that into a PPP server: https://lists.fd.io/g/vpp-dev/message/9181 https://github.com/raydonetworks/vpp-pppoeclient

Although, I would agree that deploying PPP now is a bit of a step backwards and IPoE is the way to be doing this in 2018. If you want subscribers with an S-TAG/C-TAG landing in unique virtual interfaces with a shared gateway etc., IPv4 + IPv6 (DHCP/v6), and you were deploying this on "real service provider networking kit" [1], then the way to do this is with pseudowire headend termination (PWHE/PWHT). However, you're going to struggle to implement something like PWHT on the native Linux networking stack. Many of the features you want exist in Linux, like DHCP/v6, IPv4/6, MPLS, LDP, pseudowires etc., but not all together as a combined service offering.

My two pence would be to buy kit from someone like Cisco or Juniper, as I don't think the open source world is quite there yet. Alternatively, if it *must* be Linux, look at adding the code to https://wiki.fd.io/view/VPP/Features as it has all the constituent parts (DHCP, IP, MPLS, bridges etc.) but not glued together. VPP is an order of magnitude faster than the native kernel networking stack. I'd be shocked if you could do all that you want to do at 10Gbps line rate with one CPU core.

Cheers,
James.

[1] Which means the expensive stuff big name vendors like Cisco and Juniper sell
On 14.07.2018 at 14:13, Baldur Norddahl <baldur.norddahl@gmail.com> wrote:
I am considering writing a small program or kernel module. This would create two TAP devices (tap0 and tap1). Traffic received on tap0 with VLAN tagging, will be stripped of VLAN tagging and delivered on tap1. Traffic received on tap1 without VLAN tagging, will be tagged according to a lookup table using the destination IP address and then delivered on tap0. ARP and DHCP would need some special handling.
As a proof of concept, a userland implementation using tap is likely the easiest to implement, but it won't give you the throughput you're looking for. I'd look at https://www.dpdk.org if you want to stay in userland. If FreeBSD is an option, netgraph(4) is designed to allow packet filtering, manipulation and distribution in a set of small processing modules. In either case, Ethernet frames would be processed outside the regular network stack, but could be handed over to the kernel for further processing, i.e. DHCP or SLAAC.

Stefan

-- Stefan Bethke <stb@lassitu.de> Fon +49 151 14070811
Hi Baldur,

Based on the information you provided, the CPE connects to the POI via a different service provider (access network provider / middle man) before it reaches your network/POP. With this construct, you are typically responsible for IP allocation and session authentication via DHCP (option 82) with AAA, or via RADIUS for PPPoE. You may also have to deal with the S-TAG and C-TAG at the BNG level. Here are some options to consider:

Option 1. Use RADIUS for session authentication and IP/DNS allocation to the CPE. You can configure a BBA-GROUP on the BNG to overcome the 409x VLAN limitation as well as the S-TAG and C-TAG. A BBA-GROUP can handle multiple sessions and is a well-supported feature. Here is an example of the config for your BNG (Cisco router):

===============================================
bba-group pppoe NAME-1
 virtual-template 1
 sessions per-mac limit 2
!
bba-group pppoe NAME-2
 virtual-template 2
 sessions per-mac limit 2
!
interface GigabitEthernet1/3.100
 encapsulation dot1Q 100 second-dot1q 500-4094
 no ip redirects
 no ip unreachables
 no ip proxy-arp
 ip flow ingress
 ip flow egress
 ip multicast boundary 30
 pppoe enable group NAME-1
 no cdp enable
!
interface GigabitEthernet1/3.200
 encapsulation dot1Q 200 second-dot1q 200-300
 no ip redirects
 no ip unreachables
 no ip proxy-arp
 ip flow ingress
 ip flow egress
 ip multicast boundary 30
 pppoe enable group NAME-2
 no cdp enable

Configure the virtual templates too.
===============================================

Option 2. You can deploy a DHCP server using DHCP option 82 to handle all IP or IPoE sessions. DHCP option 82 provides you with additional flexibility that can scale as your customer base grows. You can perform authentication using a combination of Circuit-ID, Remote-ID, CPE MAC address, etc.

I hope this information helps.

Cheers,
Ahad

On Sat, Jul 14, 2018 at 10:13 PM, Baldur Norddahl <baldur.norddahl@gmail.com> wrote:
Hello
I am investigating Linux as a BNG. The BNG (Broadband Network Gateway) being the thing that acts as default gateway for our customers.
The setup is one VLAN per customer. Because 4095 VLANs is not enough, we have QinQ with double VLAN tagging on the customers. The customers can use DHCP or static configuration. DHCP packets need to be option82 tagged and forwarded to a DHCP server. Every customer has one or more static IP addresses.
IPv4 subnets need to be shared among multiple customers to conserve address space. We are currently using /26 IPv4 subnets with 60 customers sharing the same default gateway and netmask. In Linux terms this means 60 VLAN interfaces per bridge interface.
However Linux is not quite ready for the task. The primary problem being that the system does not scale to thousands of VLAN interfaces.
We do not want customers to be able to send non routed packets directly to each other (needs proxy arp). Also customers should not be able to steal another customers IP address. We want to hard code the relation between IP address and VLAN tagging. This can be implemented using ebtables, but we are unsure that it could scale to thousands of customers.
I am considering writing a small program or kernel module. This would create two TAP devices (tap0 and tap1). Traffic received on tap0 with VLAN tagging, will be stripped of VLAN tagging and delivered on tap1. Traffic received on tap1 without VLAN tagging, will be tagged according to a lookup table using the destination IP address and then delivered on tap0. ARP and DHCP would need some special handling.
This would be completely stateless for the IPv4 implementation. The IPv6 implementation would be harder, because Link Local addressing needs to be supported and that can not be stateless. The customer CPE will make up its own Link Local address based on its MAC address and we do not know what that is in advance.
The goal is to support traffic of minimum of 10 Gbit/s per server. Ideally I would have a server with 4x 10 Gbit/s interfaces combined into two 20 Gbit/s channels using bonding (LACP). One channel each for upstream and downstream (customer facing). The upstream would be layer 3 untagged and routed traffic to our transit routers.
I am looking for comments, ideas or alternatives. Right now I am considering what kind of CPU would be best for this. Unless I take steps to mitigate, the workload would probably go to one CPU core only and be limited to things like CPU cache and PCI bus bandwidth.
Regards,
Baldur
-- Regards, Ahad Swiftel Networks "Where the best is good enough"
participants (10)
- Ahad Aboss
- Baldur Norddahl
- Denys Fedoryshchenko
- Grant Taylor
- James Bensley
- Jérôme Nicolle
- Mark Tinka
- Raymond Burkholder
- Stefan Bethke
- tony@wicks.co.nz