Forwarding issues related to MACs starting with a 4 or a 6 (Was: [c-nsp] Wierd MPLS/VPLS issue)
Hi all,

Ever since the IEEE started allocating OUIs (MAC address ranges) in a randomly distributed fashion rather than sequentially, the operator community has suffered enormously. Time after time issues pop up related to MAC addresses that start with a 4 or a 6. I believe IEEE changed their strategy to purposefully raise the chance of collisions with MAC squatters, to encourage people to register and pay the fee.

The forwarded email at the bottom is yet another example of a widely deployed, but fundamentally broken ASIC. The switch can't forward VPLS frames which contain a payload where the inner packet is destined to a MAC starting with a 4 or a 6. This is with the switch operating in pure layer-2 mode; it doesn't know what MPLS or VPLS even are. The switch is dropping packets on the floor, based on their _payload_. Try selling such circuits to customers: "discounted layer-2 service, some flows might not be forwarded".

Had IEEE continued the sequential OUI allocations, it probably would've taken many years before we ever reached MACs starting with a 4 or a 6, but instead, in 2012 the first linecards started rolling out of factories with MACs burned in which start with a 4 or a 6, and this took some vendors by surprise.

There have been quite some issues, both in hardware and software:

Brocade brought a 24x10GE linecard to market in 2013/2014, with limited FIB scale, meant for a BGP-free MPLS core, but the card can't keep flows together on LACP bundles if the inner packets in a pseudowire are destined for a 4 or 6 MAC. The result: out-of-order delivery, hurting performance.

Cisco ASR 9k's had a bug where, if a payload started with a 6, the router assumed it was an IPv6 packet, compared the calculated packet length with the packet length in the packet, and obviously failed because an Ethernet frame is not an IPv6 packet. The result: packets dropped on the floor. (Fixed in 4.3(0.32)I)

The Nexus 9000 issue described at the top of this mail.
Brocade IronWare had an issue related to packet reordering for flows inside pseudowires, fixed in 2013/2014.

There are probably many more examples out there in the wild, slowly driving operators insane.

At this moment, some issues related to MACs starting with a 4 or a 6 can be mitigated if you enable Pseudowire Control-Word (RFC 4385) _AND_ Flow-Aware Transport (RFC 6391). You need both to mitigate certain issues in multi-vendor networks (for instance if you have Cisco edge + Juniper core). But what to do when the ASIC won't forward the payload? As an ISP you often don't control the payload.

Unfortunately, I don't think we've seen the end of this. The linecards bought in 2012 will trickle down to the grey/second-hand market about now, often without accompanying support contracts. In a world with increased complexity in our interconnectedness, and lack of visibility into the underlying infrastructure (think remote peering, cloud connectivity, resellers reselling layer-2), it will hurt when some flows inexplicably fail to arrive.

Dear IEEE, please pause assigning MAC addresses that start with a 4 or a 6 for the next 6 years. Or at least, next time you change the policy, consult the operational community. This 4/6 MAC issue was well documented in BCP128 back in 2007. The control-word drafts mentioned that there would be dragons related to 4 and 6 back in 2004.

Dear Vendors, take this issue more seriously. Realise that for operators these issues are _extremely_ hard to debug; this is an expensive time sink. Some of these issues are only visible under very specific, rare circumstances, much like chasing phantoms. So take every vague report of "mysterious" packet loss or packet reordering at face value and immediately dispatch smart people to delve into whether your software or hardware makes wrong assumptions based on encountering a 4 or a 6 somewhere in the frame.

And you, my fellow operators, please continue to publicly document these issues and possible workarounds.
Kind regards,

Job

resources:
c-nsp thread "Wierd MPLS/VPLS issue": https://puck.nether.net/pipermail/cisco-nsp/2016-December/thread.html
https://www.nanog.org/meetings/nanog57/presentations/Tuesday/tues.general.Sn...
BCP128: https://tools.ietf.org/html/bcp128

----- Forwarded message from Simon Lockhart <simon@slimey.org> -----

Date: Fri, 2 Dec 2016 11:44:21 +0000
From: Simon Lockhart <simon@slimey.org>
To: cisco-nsp@puck.nether.net
Subject: Re: [c-nsp] Wierd MPLS/VPLS issue

On Wed Nov 23, 2016 at 12:01:20PM +0000, Simon Lockhart wrote:
On Fri Nov 04, 2016 at 03:40:05PM +0000, Simon Lockhart wrote:
To me, everything *looks* right, it's just that some VPLS traffic traversing the new link gets lost.
For those who are interested...
Well, I finally got to the bottom of this, and have pushed it to Cisco TAC for a fix...
Cisco TAC finally accepted the issue. Bug CSCvc33783 has been logged. Nexus BU has investigated. Response is...

"[...] unfortunately this is an ASIC limitation on the Nexus 9000 switches and is therefore not fixable."

If you want a Layer 2 switch that will forward all valid Ethernet frames, I'd suggest avoiding the Nexus 9000 range...

Simon

_______________________________________________
cisco-nsp mailing list cisco-nsp@puck.nether.net
https://puck.nether.net/mailman/listinfo/cisco-nsp
archive at http://puck.nether.net/pipermail/cisco-nsp/

----- End forwarded message -----
On Fri, Dec 2, 2016 at 9:32 AM, Job Snijders <job@instituut.net> wrote:
Dear Vendors, take this issue more seriously. Realise that for operators these issues are _extremely_ hard to debug; this is an expensive time sink. Some of these issues are only visible under very specific, rare circumstances, much like chasing phantoms. So take every vague report of "mysterious" packet loss or packet reordering at face value and immediately dispatch smart people to delve into whether your software or hardware makes wrong assumptions based on encountering a 4 or a 6 somewhere in the frame.
you'd think standard testing of traffic through the asic path somewhere between 'let's design an asic!' and 'here's your board ms customer!' would have found this sort of thing, no? or does testing only use 1 mac address ever?
On Fri Dec 02, 2016 at 10:29:56AM -0500, Christopher Morrow wrote:
you'd think standard testing of traffic through the asic path somewhere between 'let's design an asic!' and 'here's your board ms customer!' would have found this sort of thing, no? or does testing only use 1 mac address ever?
Well, it's actually payload, rather than src/dst MAC used for forwarding, so there's quite a few more combinations to look for...

2^(8*9216) is quite a lot of different packets to test through the forwarding path... But, wait, that assumes every bit combination for 9216 byte packets, but the packet might be shorter than that... So multiply that by (9216-64).

Anyone want to work out how many years that'd take to test, even at 100G?

Simon
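For anyone who does want to work it out, here is a rough back-of-the-envelope sketch. It assumes back-to-back frames at 100 Gbit/s with the usual 20 bytes of per-frame Ethernet overhead (preamble plus inter-frame gap), and it counts only maximum-size frames, so it is a lower bound on the exhaustive case. It also covers the "first 128 bits only" variant raised later in the thread. The constant and function names are illustrative, not taken from any test tool.

```python
import math

LINE_RATE_BPS = 100e9                  # "even at 100G"
WIRE_OVERHEAD_BYTES = 20               # assumed: 8 bytes preamble/SFD + 12 bytes inter-frame gap
SECONDS_PER_YEAR = 365.25 * 24 * 3600


def log10_years_to_test(pattern_bits: int, frame_bytes: int) -> float:
    """log10 of the years needed to send every pattern of `pattern_bits` bits,
    one pattern per frame of `frame_bytes` bytes, back to back at line rate."""
    log10_patterns = pattern_bits * math.log10(2)
    frames_per_second = LINE_RATE_BPS / ((frame_bytes + WIRE_OVERHEAD_BYTES) * 8)
    return log10_patterns - math.log10(frames_per_second) - math.log10(SECONDS_PER_YEAR)


# Every bit pattern of a 9216-byte frame, as in the message above.
full_frame = log10_years_to_test(8 * 9216, 9216)

# Only the first 128 bits vary, sent in minimum-size 64-byte frames.
header_only = log10_years_to_test(128, 64)

print(f"full 9216-byte frames: about 10^{full_frame:.0f} years")
print(f"first 128 bits only:   about 10^{header_only:.0f} years")
```

Either way exhaustive testing is out of reach, which is the point being made. A targeted test that only varies the first nibble after the label stack needs just 16 frames.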
On Fri, Dec 2, 2016 at 11:02 AM, Simon Lockhart <simon@slimey.org> wrote:
On Fri Dec 02, 2016 at 10:29:56AM -0500, Christopher Morrow wrote:
you'd think standard testing of traffic through the asic path somewhere between 'let's design an asic!' and 'here's your board ms customer!' would have found this sort of thing, no? or does testing only use 1 mac address ever?
Well, it's actually payload, rather than src/dst MAC used for forwarding, so there's quite a few more combinations to look for...
2^(8*9216) is quite a lot of different packets to test through the forwarding path... But, wait, that assumes every bit combination for 9216 byte packets, but the packet might be shorter than that... So multiply that by (9216-64).
but most/all forwarding asics (aside from perhaps extreme's?) only deal with the first N bits in the header (128 or so..) so... not quite as many right?
Anyone want to work out how many years that'd take to test, even at 100G?
Simon
On Fri, Dec 2, 2016 at 11:07 AM, Christopher Morrow <morrowc.lists@gmail.com> wrote:
On Fri, Dec 2, 2016 at 11:02 AM, Simon Lockhart <simon@slimey.org> wrote:
On Fri Dec 02, 2016 at 10:29:56AM -0500, Christopher Morrow wrote:
2^(8*9216) is quite a lot of different packets to test through the forwarding path... But, wait, that assumes every bit combination for 9216 byte packets, but the packet might be shorter than that... So multiply that by (9216-64).
but most/all forwarding asics (aside from perhaps extreme's?) only deal with the first N bits in the header (128 or so..) so... not quite as many right?
and REALLY they could have just started ~9 yrs ago: "Hey, maybe this 4/6 thing is really a problem? how about we add 2 other things to our testing framework?" instead of: "High Five! First to market!"
On Fri, Dec 2, 2016 at 11:07 AM, Christopher Morrow <morrowc.lists@gmail.com> wrote:
On Fri, Dec 2, 2016 at 11:02 AM, Simon Lockhart <simon@slimey.org> wrote:
On Fri Dec 02, 2016 at 10:29:56AM -0500, Christopher Morrow wrote:
you'd think standard testing of traffic through the asic path somewhere between 'let's design an asic!' and 'here's your board ms customer!' would have found this sort of thing, no? or does testing only use 1 mac address ever?
Well, it's actually payload, rather than src/dst MAC used for forwarding, so there's quite a few more combinations to look for...
2^(8*9216) is quite a lot of different packets to test through the forwarding path... But, wait, that assumes every bit combination for 9216 byte packets, but the packet might be shorter than that... So multiply that by (9216-64).
but most/all forwarding asics (aside from perhaps extreme's?) only deal with the first N bits in the header (128 or so..) so... not quite as many right?
This sounds related to the well-known (at least 10+ years) issues around guessing the type of IP packet by looking at the first nibble of the encapsulated packet. Take a quick look at RFC 7325, section 2.4.5.1 bullet 6.

This is what using the pseudowire control word is meant to protect against. I don't know if that's an option for networks using this.

Regards,

Alia
Anyone want to work out how many years that'd take to test, even at 100G?
Simon
On 2 December 2016 at 18:16, Alia Atlas <akatlas@gmail.com> wrote:
This sounds related to the well-known (at least 10+ years) issues around guessing the type of IP packet by looking at the first nibble of the encapsulated packet. Take a quick look at RFC 7325, section 2.4.5.1 bullet 6. This is what using the pseudowire control word is meant to protect against.
I don't know if that's an option for networks using this.
Some devices by default look inside pseudowires to find IP inside them. In this case even the control word won't help; you'll also need to disable looking inside the pseudowire.

-- 
++ytti
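The first-nibble guess discussed in the messages above can be illustrated with a small sketch. This is not any vendor's actual forwarding logic; `guess_payload_type` is a hypothetical stand-in for the heuristic, and the MAC addresses are made up. The only protocol fact relied on is that the RFC 4385 control word for pseudowire data starts with the nibble 0000.

```python
def guess_payload_type(after_label_stack: bytes) -> str:
    """Hypothetical heuristic: treat the first nibble after the MPLS label
    stack as an IP version field, as a naive LSR might do."""
    first_nibble = after_label_stack[0] >> 4
    if first_nibble == 4:
        return "ipv4"
    if first_nibble == 6:
        return "ipv6"
    return "unknown"


def with_control_word(ethernet_frame: bytes) -> bytes:
    """Prepend an all-zero 4-byte control word (first nibble 0000; flags,
    fragmentation bits, length and sequence number left at zero here)."""
    return bytes(4) + ethernet_frame


def ethernet_header(dst_mac: str, src_mac: str, ethertype: int) -> bytes:
    to_bytes = lambda mac: bytes.fromhex(mac.replace(":", ""))
    return to_bytes(dst_mac) + to_bytes(src_mac) + ethertype.to_bytes(2, "big")


# Customer frame whose destination MAC happens to start with a 4 (made-up address).
customer_frame = ethernet_header("4c:00:00:00:00:01", "00:00:5e:00:53:02", 0x0800)

without_cw = guess_payload_type(customer_frame)
with_cw = guess_payload_type(with_control_word(customer_frame))

print("without control word:", without_cw)
print("with control word:   ", with_cw)
```

With the control word in place the heuristic sees a 0 nibble and no longer mistakes the customer's destination MAC for an IP header. A device that deliberately parses past the control word, as described above, would need that behaviour disabled separately.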
On Fri, Dec 02, 2016 at 04:02:43PM +0000, Simon Lockhart wrote:
On Fri Dec 02, 2016 at 10:29:56AM -0500, Christopher Morrow wrote:
you'd think standard testing of traffic through the asic path somewhere between 'let's design an asic!' and 'here's your board ms customer!' would have found this sort of thing, no? or does testing only use 1 mac address ever?
Well, it's actually payload, rather than src/dst MAC used for forwarding, so there's quite a few more combinations to look for...
2^(8*9216) is quite a lot of different packets to test through the forwarding path... But, wait, that assumes every bit combination for 9216 byte packets, but the packet might be shorter than that... So multiply that by (9216-64).
Anyone want to work out how many years that'd take to test, even at 100G?
Folks on NLNOG found another gem: http://mailman.nlnog.net/pipermail/nlnog/2016-December/002637.html

Liberal translation below. The big take-away for operators is that service providers need to make it part of the MPLS pseudowire troubleshooting procedure to ask the customer which MACs are involved and raise the red flag when a 4 or 6 is involved.

-----------

Hi,

We've observed a similar problem on the Arista 7150S-52 with EOS 4.12.3.1: issues with passing transit MPLS traffic through a layer-2 domain where the payload contains certain MACs. Arista EOS 4.16.9M apparently contains a fix to address this problem.

MACs that didn't make it through the switch when running 4.12.3.1:

4*:**:**:**:**:**
6*:**:**:**:**:**
*4:**:**:**:**:**
*6:**:**:**:**:**
**:**:*B:**:6*:**
**:**:*F:**:4*:**

Maybe there are more combinations, but we didn't iterate through all possibilities.

Big thank you to Richard van Looijen (Flowmailer) for finding this issue, Edwin Kalle (2hip) for pointing us at this thread, and Job Snijders, whose email prompted us to investigate the intermediate switches.

Kind regards,

Robert
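As a troubleshooting aid, the reported patterns can be turned into a quick check of customer MACs. The sketch below is only an illustration: the pattern list is copied from the report above, and the helper names and sample addresses are made up. A match means "worth a closer look on the affected EOS release", not proof of a problem.

```python
REPORTED_PATTERNS = [
    "4*:**:**:**:**:**",
    "6*:**:**:**:**:**",
    "*4:**:**:**:**:**",
    "*6:**:**:**:**:**",
    "**:**:*B:**:6*:**",
    "**:**:*F:**:4*:**",
]


def matches(pattern: str, mac: str) -> bool:
    """True if `mac` fits `pattern`, where '*' stands for any single hex digit."""
    pattern, mac = pattern.lower(), mac.lower()
    if len(pattern) != len(mac):
        return False
    return all(p == "*" or p == m for p, m in zip(pattern, mac))


def matching_patterns(mac: str) -> list:
    """Return every reported pattern that the given MAC address fits."""
    return [p for p in REPORTED_PATTERNS if matches(p, mac)]


# Made-up example addresses in colon-separated notation.
for mac in ("4c:00:00:00:00:01", "3c:00:00:00:00:01", "00:11:2b:33:61:55"):
    hits = matching_patterns(mac)
    print(mac, "->", hits if hits else "no reported pattern")
```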
MACs that didn't make it through the switch when running 4.12.3.1:
4*:**:**:**:**:** 6*:**:**:**:**:** *4:**:**:**:**:** *6:**:**:**:**:** **:**:*B:**:6*:** **:**:*F:**:4*:**
Can anyone explain the last 2 for me?

I was under the impression that this bug was mainly caused by some optimistic attempt to detect raw IPv4 or IPv6 payloads by checking for a version at the start of the frame. This does not explain why it would be looking at the 5th octet.

I also would assume that there must be something else to the last 2 examples beyond just the B or F and 4 or 6 because otherwise it would match way too many addresses to have not been noticed before. Perhaps the full MAC address looks like some other protocol with a 4 byte header?

Thanks,

Mike
The root cause for that issue is most likely the following bug:

BUG65077: On the DCS-7150 series, the MPLS label of a frame may be incorrectly overwritten by a DSCP field update in the ASIC. Fixed in 4.11.7, 4.12.6, 4.13.0.

It was not related to the MAC values but rather to the incorrect parsing of the MPLS header.

On Tue, Dec 6, 2016 at 12:50 PM, Mike Jones <mike@mikejones.in> wrote:
MACs that didn't make it through the switch when running 4.12.3.1:
4*:**:**:**:**:** 6*:**:**:**:**:** *4:**:**:**:**:** *6:**:**:**:**:** **:**:*B:**:6*:** **:**:*F:**:4*:**
Can anyone explain the last 2 for me?
I was under the impression that this bug was mainly caused by some optimistic attempt to detect raw IPv4 or IPv6 payloads by checking for a version at the start of the frame. This does not explain why it would be looking at the 5th octet.
I also would assume that there must be something else to the last 2 examples beyond just the B or F and 4 or 6 because otherwise it would match way too many addresses to have not been noticed before. Perhaps the full MAC address looks like some other protocol with a 4 byte header?
Thanks, Mike
-- Regards, Alexandru Suciu
Dear Alexandru,
MACs that didn't make it through the switch when running 4.12.3.1:
4*:**:**:**:**:** 6*:**:**:**:**:** *4:**:**:**:**:** *6:**:**:**:**:** **:**:*B:**:6*:** **:**:*F:**:4*:**
Can anyone explain the last 2 for me?
On Wed, Dec 07, 2016 at 12:19:02PM +0000, Alexandru Suciu via NANOG wrote:
The root cause for that issue is most likely due to the following bug:
BUG65077 : On the DCS-7150 series, the MPLS label of a frame may be incorrectly overwritten by a DSCP field update in the ASIC. Fixed in 4.11.7 , 4.12.6 , 4.13.0 .
It was not related to the MAC values but rather to the incorrect parsing of the MPLS header.
That seems phrased somewhat strangely. To me as an end user it really does seem related to the MAC values. The NLNOG folk tested: packets destined to MAC address 4*:**:**:**:**:** do not arrive; on the other hand, the same packet destined to 3*:**:**:**:**:** does arrive.

Keep in mind that in this scenario the 7150 is a layer-2 switch between two MPLS PE routers. The 7150 is used as a layer-2 bridge, NOT as an MPLS Label Switching Router.

Packet layout probably was something like this:

    outer Ethernet header
        Dest MAC A
        Source MAC B
        Type: 0x8847
    MPLS Header
        1 or 2 labels
    inner Ethernet header
        Dest MAC C
        Source MAC D
        Type: 0x0800
    IP
        src X
        dst Y
    ICMP
        type: request

The bug in 4.12.3.1 was triggered if "MAC C" started with a 4 or a 6, but I'd expect that anything below the "outer Ethernet header" (including the MPLS header) is considered just the payload by the switch.

Kind regards,

Job
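To make the layout above concrete, here is a minimal sketch that builds those headers as raw bytes and reads back the nibble that immediately follows the MPLS label stack. All MAC addresses and label values are placeholders chosen for illustration, a two-label stack is assumed, and the IP and ICMP parts are left out because they do not affect that nibble.

```python
import struct

MPLS_UNICAST_ETHERTYPE = 0x8847
IPV4_ETHERTYPE = 0x0800
ETHERNET_HEADER_LEN = 14


def mac_bytes(mac: str) -> bytes:
    return bytes.fromhex(mac.replace(":", ""))


def mpls_entry(label: int, bottom_of_stack: bool, ttl: int = 64) -> bytes:
    # Label stack entry: 20-bit label, 3-bit traffic class (0 here),
    # 1-bit bottom-of-stack flag, 8-bit TTL.
    return struct.pack("!I", (label << 12) | (int(bottom_of_stack) << 8) | ttl)


def build_frame(inner_dst_mac: str) -> bytes:
    """Outer Ethernet header + two labels + inner Ethernet header."""
    outer = (mac_bytes("00:00:5e:00:53:0a") + mac_bytes("00:00:5e:00:53:0b")
             + struct.pack("!H", MPLS_UNICAST_ETHERTYPE))
    labels = mpls_entry(1000, False) + mpls_entry(2000, True)
    inner = (mac_bytes(inner_dst_mac) + mac_bytes("00:00:5e:00:53:0d")
             + struct.pack("!H", IPV4_ETHERTYPE))
    return outer + labels + inner


def first_nibble_after_labels(frame: bytes) -> int:
    """Walk the label stack until the bottom-of-stack bit, then return the
    high nibble of the next byte (here: the first nibble of 'MAC C')."""
    offset = ETHERNET_HEADER_LEN
    while True:
        (entry,) = struct.unpack_from("!I", frame, offset)
        offset += 4
        if entry & 0x100:  # bottom-of-stack flag set
            return frame[offset] >> 4


for mac_c in ("4c:00:00:00:00:01", "3c:00:00:00:00:01"):
    print(mac_c, "-> nibble after label stack:", first_nibble_after_labels(build_frame(mac_c)))
```

A pure layer-2 bridge has no reason to look past the outer 14 bytes, which is why a dependency on this nibble is surprising in the scenario described.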
On 6 December 2016 at 14:28, Job Snijders <job@instituut.net> wrote:
Liberal translation below. The big take-away for operators is that service providers need to make it part of the MPLS pseudowire troubleshooting procedure to ask the customer which MACs are involved and raise the red flag when a 4 or 6 is involved.
Expect also per-vendor behaviour on ethertype values. Result from one vendor: http://ytti.fi/ether_type.png

Granted, these are not technically ethertypes at all but 802.3 frame lengths; still, some other vendors don't care and pass each of these transparently. Here we can observe blackholing and policing depending on the 802.3 frame length value.

The same vendor experiences packet loss on pseudowires if the ethertype says the payload is IPv4, IPv6, MPLS or VLAN and the packet /does not/ contain said payload, potentially because the NPU time-cost increases too much. The vendor never really explained either behaviour.

Another behavioural difference is that some vendors don't accept bad source addresses, such as a multicast source address, while other vendors do.

Pseudowire behaviour is highly dependent on hardware and software release in corner cases. It's easy to argue that bad MACs should be dropped, but it's also easy to argue that perhaps you're testing things, you expect to get a transparent pipe, and you want to test whether your SUT accepts bad MACs or not.

-- 
++ytti
Job Snijders wrote:
Dear IEEE, please pause assigning MAC addresses that start with a 4 or a 6 for the next 6 years.
Disagree that this is an IEEE problem. This is a problem that vendors need to work around. There is limited MAC space, and deprecating 1/8 of it due to the inability of vendors to cope properly with it seems like a really bad long term idea.

It seems that the problem that cropped up on cisco-nsp is that a layer 2 switch, the Nexus 92160 (and possibly everything else which uses the same forwarding ASIC), cannot forward VPLS frames with a 4 or 6 buried at a specific location inside the contents of the frame.

This is an extraordinary bug which renders the hardware useless in specific circumstances. What makes it worse is that this is a well known corner case which should have been shaken out during design, if not found during QA.

Nick
On Fri, Dec 02, 2016 at 09:32:37AM -0800, Leo Bicknell wrote:
I also do not think this is an IEEE/MAC assignment problem. This is a "vendor's box can't forward a particular payload" problem.
On Fri, Dec 02, 2016 at 04:59:37PM +0000, Nick Hilliard wrote:
Job Snijders wrote:
Dear IEEE, please pause assigning MAC addresses that start with a 4 or a 6 for the next 6 years.
Disagree that this is an IEEE problem. This is a problem that vendors need to work around. There is limited MAC space, and deprecating 1/8 of it due to the inability of vendors to cope properly with it seems like a really bad long term idea.
Yes, the vendors are doing a poor job. I also appreciate the argument that IEEE just manages that number space and we should consider these 'just numbers' and the vendors need to make do.

On the other hand, if IEEE had just stuck to the original allocation plan, you and I wouldn't be dealing with this garbage situation. IEEE told one of my friends: "We changed our allocation methods to prevent vendors using unregistered mac addresses."

Does the cost of some squatters on poorly usable MAC space outweigh the cost of the community spending countless hours tracking down where those dropped packets went? IEEE could've shown more restraint by (temporarily, until IPv4 is dead?) avoiding 4 and 6 and still accomplished some of their goal (if this dubious strategy even is effective).

I consider this a cascading failure. Clearly IEEE's change had a ripple effect, surprised a number of implementers, and ended up hurting us.

Kind regards,

Job
Job Snijders wrote:
I consider this a cascading failure. Clearly IEEE's change had a ripple effect, surprised a number of implementers, and ended up hurting us.
this would be credible if this were a previously unknown problem, but it isn't. It's been known for years that you need to be careful when handling mpls encapsulated packets which encapsulate L2 frames and where the destination mac address starts with 4 or 6.

This is not a new problem and because it's not new, there is no good reason for vendors to make the same mistakes again and again. TBH, it beggars belief that new L2 hardware is being thrown out the door which is unable to forward frames of this form due to hardware limitations, and that it's apparently unfixable.

Nick
All,

I just want to come back on behalf of Cisco on this. We just investigated this issue and the issue is not an ASIC bug, but a flag set wrong by SW. We will reach out to the original customer through TAC who posted this in NSP to resolve this issue.

sukumar
Sukumar Subburayan (sukumars) wrote:
I just want to come back on behalf of Cisco on this. We just investigated this issue and the issue is not an ASIC bug, but a flag set wrong by SW. We will reach out to the original customer through TAC who posted this in NSP to resolve this issue.
oh cool - this is great. Thanks for following up and clarifying. Nick
On Fri Dec 02, 2016 at 09:23:24PM +0000, Sukumar Subburayan (sukumars) wrote:
I just want to come back on behalf of Cisco on this. We just investigated this issue and the issue is not an ASIC bug, but a flag set wrong by SW. We will reach out to the original customer through TAC who posted this in NSP to resolve this issue.
Sukumar, Can I just publicly say thank you for taking the time to investigate my issue, and identify that a fix is possible. Looking forward to having a working Nexus... Simon
I just want to come back on behalf of Cisco on this. We just investigated this issue and the issue is not an ASIC bug, but a flag set wrong by SW.
damn! you just took all the fun out of lynching ieee. sheesh! </sarcasm> randy
In a message written on Fri, Dec 02, 2016 at 08:50:40PM +0100, Job Snijders wrote:
IEEE told one of my friends: "We changed our allocation methods to prevent vendors using unregistered mac addresses."
Does the cost of some squatters on poorly usable MAC space outweigh the cost of the community spending countless hours tracking down where those dropped packets went?
That's the wrong question to ask.

The right question is, what could have been done to prevent this entire situation?

This problem has occurred in all sorts of number spaces before. There have been squatters in almost every number space, boxes "optimized" based on the pattern of allocation, code bugs that went unnoticed due to part of the number space not being used. It's happened to MACs, IPs, ports, even protocol numbers.

One of the answers is to better allocate numbers. Starting at the bottom and working up is almost never the optimal solution. Various sparse allocation strategies exist which ensure a wider range of addresses is used early, there is a greater chance of whacking a squatter early, and the number space ends up more efficiently used in many cases.

Had the IEEE allocated a MAC starting with 0 then 2, then 4 then 6 then 8 then 10 then 12 then 14, this problem would likely have been identified early on in vendor labs when testing the pseudowire code, and that would have prevented the "hack" of looking deeper in the packet and guessing, because too many 4 and 6 MACs would already have been deployed.

-- 
Leo Bicknell - bicknell@ufp.org
PGP keys at http://www.ufp.org/~bicknell/
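One way to picture the sparse allocation idea from the message above is bit-reversed ordering, sketched below for the 16 possible values of a first nibble. This is just one of several possible sparse strategies and is chosen here for illustration; it is not how the IEEE allocates OUIs and it is not the exact 0, 2, 4, 6 sequence mentioned above.

```python
def bit_reverse(value: int, width: int) -> int:
    """Reverse the lowest `width` bits of `value`."""
    result = 0
    for _ in range(width):
        result = (result << 1) | (value & 1)
        value >>= 1
    return result


def sparse_order(width: int) -> list:
    """Allocation order that spreads early assignments across the whole space."""
    return [bit_reverse(i, width) for i in range(2 ** width)]


sequential = list(range(16))
sparse = sparse_order(4)

print("sequential:", sequential)
print("sparse:    ", sparse)
print("allocations before a 4 is handed out:", sequential.index(4), "vs", sparse.index(4))
```

In this ordering the values 8, 4, 12 and 2 are handed out among the first five assignments, so anything that treats a particular leading nibble specially tends to be exposed early rather than years into the life of the number space.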
In a message written on Fri, Dec 02, 2016 at 03:32:13PM +0100, Job Snijders wrote:
Dear Vendors, take this issue more seriously. Realise that for operators these issues are _extremely_ hard to debug; this is an expensive time sink. Some of these issues are only visible under very specific, rare circumstances, much like chasing phantoms. So take every vague report of "mysterious" packet loss or packet reordering at face value and immediately dispatch smart people to delve into whether your software or hardware makes wrong assumptions based on encountering a 4 or a 6 somewhere in the frame.
I also do not think this is an IEEE/MAC assignment problem. This is a "vendor's box can't forward a particular payload" problem.

If I had boxes with this issue, I would be talking to my vendor about:

a) How they were going to replace every single one of them with something that does not have the bug.

b) What discount I would get on maintenance/support for having to swap all of the devices.

Then I would follow it up with the other vendors I'm talking to about all of my future purchases if they are unable to produce boxes that work. And if the vendor who supplied these did not fix it, I would give them no more business.

-- 
Leo Bicknell - bicknell@ufp.org
PGP keys at http://www.ufp.org/~bicknell/
participants (11): Alexandru Suciu, Alia Atlas, Christopher Morrow, Job Snijders, Leo Bicknell, Mike Jones, Nick Hilliard, Randy Bush, Saku Ytti, Simon Lockhart, Sukumar Subburayan (sukumars)