link monitoring and BFD in SDN networks
Hi,
Routers connected back to back often rely on BFD to detect link failures. It's certainly possible that there is a switch between the two routers, so that a link-down event on one side is not visible to the other side. So you run some sort of OAM protocol on the two routers so that they can detect link flaps/failures.
How will this happen in SDN networks, where there is no control plane on the routers? Will the routers send the state of all their links to a central controller, which will then detect that a link has gone down? This just doesn't sound good. I am presuming that some sort of "control plane" will always be required.
Any pointers here?
Is there any reason other than link events for which we would need a control plane on the routers in SDN?
Thanks, Glen
BFD and similar protocols aim to prove that there is end-to-end connectivity between two points, not just that all links along the path are up. All ports could be up while end-to-end connectivity is broken, for example by a misconfigured VLAN across an L2 network. Sending some kind of packet across the network is pretty much the only way to guarantee reachability.
The OpenFlow protocol in particular has a way to instruct a switch to send a frame out of an interface. By default, an OpenFlow switch forwards any frame it has received and doesn't know what to do with back to the controller. This means someone could write an OAM protocol that works via OpenFlow. A quick Google search for 'OpenFlow OAM' brought me to this link, from someone who has done just that: "http://www.rvdp.org/presentations/SC11-SRS-8021ag.pdf"
Of course, if you want fast failover, you need to send packets very rapidly. Every 250ms is not unreasonable. This is going to cause the control plane to get very chatty. Typically on high-end routers, processes such as BFD are actually run on the line cards as opposed to on the routing engine. When a failure is detected, it is reported up into the control plane to trigger a reconvergence event. I see no reason why this couldn't occur using SDN.
Regards, Dave
In reply to Glen Kent <glen.kent@gmail.com>, 19 January 2015 at 22:01.
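To put rough numbers on the 250ms probe interval discussed above, here is a minimal sketch of the arithmetic. It assumes that a failure is declared after a fixed number of consecutive missed probes (the multiplier of 3 is an assumption, not a figure from the thread) and that every probe handled by a controller costs one packet-out and one packet-in per session per interval. The function names are hypothetical.

```python
def detection_time_ms(tx_interval_ms: float, detect_multiplier: int) -> float:
    """Worst-case time to declare a peer down: N consecutive missed probes."""
    return tx_interval_ms * detect_multiplier


def controller_message_rate(sessions: int, tx_interval_ms: float) -> float:
    """Probe-related messages per second handled by a central controller.

    Assumption: each session costs one packet-out (our probe) and one
    packet-in (the peer's probe) per transmit interval.
    """
    probes_per_second = 1000.0 / tx_interval_ms
    return sessions * 2 * probes_per_second


if __name__ == "__main__":
    interval_ms = 250.0   # the interval quoted in the thread
    multiplier = 3        # assumed number of missed probes before failure
    for sessions in (10, 100, 1000):
        rate = controller_message_rate(sessions, interval_ms)
        detect = detection_time_ms(interval_ms, multiplier)
        print(f"{sessions} sessions: {rate:.0f} msgs/s, {detect:.0f} ms to detect")
```

The message rate grows linearly with the number of monitored sessions and inversely with the interval, which is why a centrally run probe protocol gets chatty quickly.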
On Mon, Jan 19, 2015 at 22:55:04 +0000, Dave Bell wrote:
The 802.1ag code used is open source and available on: https://svn.surfnet.nl/trac/dot1ag-utils/
Of course, if you want fast failover, you need to send packets very rapidly. Every 250ms is not unreasonable. This is going to cause the control plane to get very chatty. Typically on high-end routers, processes such as BFD are actually run on the line cards as opposed to on the routing engine. When a failure is detected, it is reported up into the control plane to trigger a reconvergence event. I see no reason why this couldn't occur using SDN.
Exactly. This is something you want to do in hardware, especially if you want to do fast reroute with the OpenFlow group table. The problem is that many 1U OpenFlow switches do not support 802.1ag. We made the prototype mentioned above to show and investigate the benefits of OAM. The closed "open" networking foundation is supposed to be working on this, but I don't know the status because their mailing lists are closed.
In SDN/OpenFlow I think a couple of things are needed:
- configure 802.1ag on the interfaces (via ofconfig?)
- configure OpenFlow paths (e.g. primary and backup) and also create forwarding entries for 802.1ag datagrams along those paths
- configure fast reroute with the group table (ofconfig?)
By doing this, detection and failover are handled in hardware.
rvdp
On Wed, Jan 21, 2015 at 12:22 PM, Ronald van der Pol < Ronald.vanderPol@rvdp.org> wrote:
In SDN/OpenFlow I think a couple of things are needed:
- configure 802.1ag on the interfaces (via ofconfig?)
- configure OpenFlow paths (e.g. primary and backup) and also create forwarding entries for 802.1ag datagrams along those paths
- configure fast reroute with the group table (ofconfig?)
Fast reroute (in the form of fast failover) is supported in the OF spec (1.3+), using Group Tables.
By doing this, detection and failover are handled in hardware.
rvdp
Data plane reachability checks could be performed in SDN/OpenFlow networks using BFD, Ethernet CFM (802.1ag), or Y.1731, preferably in silicon if there is support (which I believe every silicon vendor should work on). It would not be ideal if these OAM frames were forwarded to a central controller. Today, I think it is done in some form of software layer (OVS, SDKs) that resides on these OF switches.
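The fast-failover group behaviour mentioned above can be illustrated with a small model. This is only a sketch of the selection rule (use the first bucket whose watched port is live, drop if none is), not switch or controller code; the `Bucket` type, `select_bucket` function and port numbers are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional, Sequence, Set


@dataclass(frozen=True)
class Bucket:
    watch_port: int  # port whose liveness decides whether this bucket is usable
    out_port: int    # port the bucket's output action forwards to


def select_bucket(buckets: Sequence[Bucket], live_ports: Set[int]) -> Optional[Bucket]:
    """Return the first bucket whose watched port is live, or None (drop)."""
    for bucket in buckets:
        if bucket.watch_port in live_ports:
            return bucket
    return None


if __name__ == "__main__":
    group = [
        Bucket(watch_port=1, out_port=1),  # primary path
        Bucket(watch_port=2, out_port=2),  # backup path
    ]
    print(select_bucket(group, {1, 2}))  # primary while port 1 is up
    print(select_bucket(group, {2}))     # backup once port 1 is down
    print(select_bucket(group, set()))   # nothing live: packet is dropped
```

Because the choice depends only on local port liveness, the switch can change paths without asking the controller, which is what makes the failover fast. How quickly a port is marked down still depends on the detection mechanism (link state, 802.1ag, BFD) feeding that liveness.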
Gents,
We need to separate the context of fast reroute via a control plane topology map from local link protection with OAM at the MAC/PHY sub-layer, and the time frames at which each is relevant. There are efforts going on at the media level, but there are also current solutions that are media and encapsulation independent, which need to be juxtaposed with the SDN paradigm.
Going back to the original question that Glen posed, it is more a question of implementation complexity. The more state machines that are pushed down to the nodes in an SDN network, away from the control plane, the higher the cost and barriers to entry for OEM products, the more interop issues, etc.
Now looking squarely at BFD, the popular application is bootstrapping BFD link state to the routing topology and peer pathway, which may traverse multiple nodes/switches/media and encapsulations. BFD is a next-hop communication failure detection mechanism which may itself rely (bootstrap) on the routing topology to find alternate paths; it is therefore a larger time frame event than PHY/MAC sub-layer protection, and it is media/encapsulation independent. And the fact that such a state change has a high probability of triggering a topology/network-wide event (if not, there is less need to run BFD) makes it controller-centric state on which the controller needs to bootstrap its routing services. Link layer OAM, on the other hand, may be a mechanism that protects the BFD event from triggering. Further, BFD enables faster end-to-end connectivity/reachability detection than hold-down timers allow on hardware that does not support OAM features.
Finally, the scale at which BFD is used is far smaller than the number of links. I.e., if you have a 10K port network, you are likely using BFD on a few tens maybe (for datacenters), and the timescale is typically in the 100s of ms, which any control plane software module can handle at large scale and which should be run just like any hello protocol for routing services.
Link layer state machines on the nodes, on the other hand, operate in the sub-1ms timeframe. It is an overhead, but an insignificantly small tax.
Cheers, Sudeep Khuraijam
In reply to "Nitin Sharma" <nitinics@gmail.com>, 1/21/15, 3:14 PM.
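As an illustration of running liveness detection in control plane software "just like any hello protocol", the following is a minimal sketch of a hold-timer tracker. It is not BFD itself: the class name, the multiplier of 3 and the peer label are assumptions made for the example, and time is passed in explicitly so the logic can be exercised without real timers.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class LivenessTracker:
    interval_s: float        # expected hello interval
    multiplier: int = 3      # assumed number of missed hellos before "down"
    last_seen: Dict[str, float] = field(default_factory=dict)
    up: Dict[str, bool] = field(default_factory=dict)

    def on_hello(self, peer: str, now: float) -> None:
        """Record a hello from a peer and mark it up."""
        self.last_seen[peer] = now
        self.up[peer] = True

    def expire(self, now: float) -> List[str]:
        """Mark peers down whose hold time has elapsed; return the newly down ones."""
        hold_time = self.interval_s * self.multiplier
        newly_down = []
        for peer, seen in self.last_seen.items():
            if self.up[peer] and now - seen > hold_time:
                self.up[peer] = False
                newly_down.append(peer)
        return newly_down


if __name__ == "__main__":
    tracker = LivenessTracker(interval_s=0.25)
    tracker.on_hello("r2", now=0.0)
    print(tracker.expire(now=0.5))   # still within the hold time
    print(tracker.expire(now=1.0))   # hold time exceeded, r2 reported down
```

A controller would call `expire` periodically and treat each newly down peer as the trigger for recomputing paths; with a few tens of sessions at intervals in the hundreds of milliseconds, this loop is very light.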
participants (5)
- Dave Bell
- Glen Kent
- Nitin Sharma
- Ronald van der Pol
- Sudeep Khuraijam