Recently, while running a packet capture, I came across some unicast flooding on my network. One of our core switches didn't have the MAC address for a server and was flooding all packets destined to that server. It wasn't learning the MAC address because the server was responding to packets out of a different network card on a different switch. The flooding I was seeing wasn't enough to cause any network issues (it was only a few megs), but it was something that I wanted to fix.

I've run into this issue before, and solved it by statically entering the MAC address into the CAM tables.

I want to avoid this problem in the future, and I'm looking at two different things.

The first is preventing it in the first place. Along those lines, I've seen some recommendations online about changing the ARP and CAM timeouts to be the same. However, there seems to be disagreement on which is better: making the ARP timeouts match the CAM table timeouts, or vice versa. Also, when talking about this, everyone seems to consider only routers, but what about the timers on a firewall? I'm worried that I might cause other issues by changing these timers.

The second thing I'm considering is monitoring. I'd like to set up something to monitor for any excessive unicast flooding in the future. I understand that a little unicast flooding is normal, as the switch has to do a little bit of flooding to find out where people are. While looking for a way to monitor this, I came across the 'mac-address-table unicast-flood' command on Cisco switches. This looked perfect for what I needed, but apparently it is currently not an option on 6500 switches with Sup720s. Since there doesn't appear to be an option on Cisco that monitors specifically for unicast floods, I thought that maybe I could set up a server with a network card in promiscuous mode and then keep stats of all packets received that aren't destined for the server and that also aren't legitimate broadcasts or multicasts. The only problem with that is that I don't want to completely custom-build my own solution. I was hoping that someone may have already created something like this, or that maybe there is a good reporting tool for Wireshark or something that could generate the report that I want.

Does anyone have any suggestions on either prevention or monitoring?

Thanks!!

-Brian
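As a rough illustration of the monitoring idea above (a promiscuous-mode listener that ignores its own traffic, broadcasts, and multicasts), here is a minimal Python sketch. It is not an existing tool. All function names are made up for this example, and it assumes raw Ethernet frames are supplied by some capture mechanism. The optional `capture` helper assumes Linux, root privileges, and an interface that has already been put into promiscuous mode by other means.

```python
import socket
from collections import Counter

BROADCAST = b"\xff" * 6


def parse_mac(text: str) -> bytes:
    """Convert 'aa:bb:cc:dd:ee:ff' (or dash-separated) into 6 raw bytes."""
    return bytes.fromhex(text.replace(":", "").replace("-", ""))


def is_stray_unicast(dst: bytes, local_macs: set) -> bool:
    """True if a frame with this destination MAC was not meant for this host."""
    if dst == BROADCAST:
        return False  # legitimate broadcast
    if dst[0] & 0x01:
        return False  # group bit set in the first octet: multicast
    return dst not in local_macs


def count_stray(frames, local_macs: set) -> Counter:
    """Sum the bytes of stray unicast frames, keyed by destination MAC."""
    totals = Counter()
    for frame in frames:
        if len(frame) < 14:
            continue  # too short to hold an Ethernet header
        dst, src = frame[:6], frame[6:12]
        if src in local_macs:
            continue  # our own outgoing traffic, not flooding
        if is_stray_unicast(dst, local_macs):
            totals[dst.hex(":")] += len(frame)
    return totals


def capture(ifname: str, count: int):
    """Yield `count` raw frames from a Linux interface (assumption: run as root)."""
    eth_p_all = 0x0003  # receive every protocol
    with socket.socket(socket.AF_PACKET, socket.SOCK_RAW, socket.htons(eth_p_all)) as sock:
        sock.bind((ifname, 0))
        for _ in range(count):
            yield sock.recv(65535)
```

A call such as `count_stray(capture("eth0", 10000), {parse_mac("00:11:22:33:44:55")})` would then report the heaviest flooded destinations. The interface name and MAC address here are placeholders.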
Unicast flooding is a common occurrence in large datacenters, especially with asymmetrical paths caused by different first-hop routers (via HSRP, VRRP, etc.). We ran into this some time ago. Most ARP-sensitive systems such as clusters, HSRP, content switches, etc. are smart enough to send out gratuitous ARPs, which eliminates the worry about increasing the timeouts. We haven't had any issues since we made the changes.

After debugging the problem we added "mac-address-table aging-time 14400" to our data center switches. That syncs the MAC aging time to the same timeout value as the ARP timeout.

----
Matthew Huff | One Manhattanville Rd
OTA Management LLC | Purchase, NY 10577
http://www.ox.com | Phone: 914-460-4039
aim: matthewbhuff | Fax: 914-460-4139
-----Original Message----- From: Brian Shope [mailto:blackwolf99999@gmail.com] Sent: Wednesday, June 17, 2009 5:33 PM To: nanog@nanog.org Subject: Unicast Flooding
I have had the same issue in the past. The best fix for this has been to set the Layer 2 and Layer 3 aging timers to the same value.
--
Steve King
Network Engineer - Liquid Web, Inc.
Cisco Certified Network Associate
CompTIA Linux+ Certified Professional
CompTIA A+ Certified Professional
In a layer 3 switch I consider unicast flooding due to an L2 CAM table timeout a design defect. To test vendors' L3 switches for this defect we have used a traffic generator to send 50-100 Mbps of pings to a device that does not reply to the pings, where the L3 switch was routing from one VLAN to another to forward the pings. In defective devices the L2 CAM table entry expires, causing the 50-100 Mbps unicast stream to be flooded out all ports in the destination VLAN.

In my view the L3 and L2 forwarding state machines must be synchronized such that the L3 forwarding continues as long as there are packets entering the L3 switch on one VLAN and exiting the switch on another VLAN via routing. It seems that gratuitous ARPs are a workaround that serves to reset the CAM entry timeout interval, but not an elegant solution.

-----Original Message----- From: Matthew Huff [mailto:mhuff@ox.com] Sent: Wednesday, June 17, 2009 2:58 PM To: 'Brian Shope'; 'nanog@nanog.org' Subject: RE: Unicast Flooding
I wouldn't consider this a defect. Historically, L2 and L3 devices have always been separate. With an L3 switch, those functions are just combined into one device.

In Cisco devices that support CEF, the CEF table is used to make all forwarding decisions, but the CEF table depends on the ARP and routing tables on the L3 side. When it comes to forwarding the frame out of the proper interface, the CAM table comes into play. If that table times out more quickly than the L3 tables, there will be times when the CAM table is incomplete.

This mostly shows up in redundant gateway setups. Inbound traffic is usually load balanced between the two redundant devices. The gateways learn about the servers/workstations from traffic leaving the VLAN, not from traffic coming into the VLAN. With HSRP/VRRP, the servers/workstations use only one of the two redundant devices to send traffic out of the VLAN. In this case, one device will end up with incomplete information every 5 minutes (the default MAC aging timer). This causes traffic coming into the VLAN (usually load balanced with EIGRP or OSPF) to be flooded as unknown unicast out all ports on the standby device.

Making the L2 and L3 timers the same corrects this. The reason is that, for CEF to make a forwarding decision, the layer 3 engine must make an ARP request if the ARP entry is not present. This causes an ARP broadcast. With the ARP reply being returned, the active and standby devices can both keep their CAM/ARP/CEF tables up to date.

While I do not consider it a defect that these are not synchronized by default, I do agree that synchronizing them would be very beneficial and would prevent a lot of confusion and hours of troubleshooting when unsuspecting engineers are trying to figure out why they have a ton of unknown unicast packets.

Just my additional 0.02
--
Steve King
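The timer mismatch discussed in this thread can be put into rough numbers. The following is only a sketch under stated assumptions: a host that never sends a frame through the switch in question, a CAM entry that is refreshed only by the ARP reply at the start of each ARP cycle, and the values mentioned in this thread (300 seconds for MAC aging, 14400 seconds for the ARP timeout). The function names are made up for illustration.

```python
def flood_window(arp_timeout: int, mac_aging: int) -> int:
    """Seconds per ARP cycle during which traffic to a silent host is flooded.

    Assumes the CAM entry is learned from the ARP reply at the start of the
    cycle and then ages out after `mac_aging` seconds with no other refresh.
    """
    return max(0, arp_timeout - mac_aging)


def flood_fraction(arp_timeout: int, mac_aging: int) -> float:
    """Share of each ARP cycle spent flooding, between 0.0 and 1.0."""
    return flood_window(arp_timeout, mac_aging) / arp_timeout


# Values from the thread: 5 minute MAC aging versus a 4 hour ARP timeout.
print(flood_window(14400, 300))    # 14100
# After "mac-address-table aging-time 14400" the window closes.
print(flood_window(14400, 14400))  # 0
```

Under these assumptions, matching the two timers removes the window entirely, which is consistent with the fix reported earlier in the thread.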
Thanks for all the good info.

So it sounds like changing my CAM timeout to 4 hours is the best suggestion. Anyone have any problems when implementing this?
On 6/18/09, Brian Shope <blackwolf99999@gmail.com> wrote:
Thanks for all the good info..
So it sounds like changing my CAM timeout to 4 hours is the best suggestion. Anyone have any problems when implementing this?
Not as long as all the user ports have portfast enabled. Without portfast, when a port goes up or down it causes a topology change notification, which sets the fast aging timer, and the CAM table entries age out in something like 15 seconds.

Regards,
Lee
Relying on a TCN would yield very inconsistent results.
--
Steve King
Holmes,David A wrote:
In a layer 3 switch I consider unicast flooding due to an L2 cam table timeout a design defect. To test vendors' L3 switches for this defect we have used a traffic generator to send 50-100 Mbps of pings to a device that does not reply to the pings, where the L3 switch was routing from one vlan to another to forward the pings.
You don't need an elaborate scenario to create the unicast flooding. Syslog servers can cause this quite frequently if all they do is sink syslog UDP traffic and never (or rarely) generate any packets themselves.

You can push up L2 / CAM / mac-address-table timeouts, but you may see some unexpected results if you have a volatile / mobile network where end devices are not static.

I still don't have a "really comfortable" recommendation on settings, but agree in general that the ARP timeout should be somewhat less than the L2 timeout, and yes, the ARP response will refresh the L2 entry.

It gets even more complicated if you are using a NAC / monitoring function that triggers on mac-address-table tracking / changes / traps: the shorter the L2 timeout, the more frequently mac-address-table changes are generated. You can complicate this even further with "smart" monitors that try to keep a mapping of IP-to-MAC-to-switchport: you may have L2 entries without ARPs, ARPs without L2 entries, etc.

Jeff
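The mismatch between ARP entries and L2 entries described above can be checked mechanically once both tables have been collected. The sketch below assumes the tables are already available as plain Python dictionaries; how they are gathered (CLI scraping, SNMP, etc.) is left out, and the function name and sample data are hypothetical.

```python
def compare_tables(arp_table: dict, mac_table: dict):
    """Cross-check an ARP table against a MAC address table.

    arp_table maps IP address -> MAC address.
    mac_table maps MAC address -> switch port.
    Returns (arp_without_l2, l2_without_arp).
    """
    arp_macs = set(arp_table.values())
    # IPs the router can still resolve but whose frames the switch would flood.
    arp_without_l2 = sorted(ip for ip, mac in arp_table.items() if mac not in mac_table)
    # MACs the switch has learned that have no matching ARP entry.
    l2_without_arp = sorted(mac for mac in mac_table if mac not in arp_macs)
    return arp_without_l2, l2_without_arp


# Hypothetical sample data, not taken from a real device.
arp = {"10.0.0.10": "0011.2233.4455", "10.0.0.20": "0011.2233.6677"}
mac = {"0011.2233.4455": "Gi1/1", "00aa.bbcc.ddee": "Gi1/2"}

flood_candidates, unmatched_macs = compare_tables(arp, mac)
print(flood_candidates)  # ['10.0.0.20']
print(unmatched_macs)    # ['00aa.bbcc.ddee']
```

The first list is the interesting one for this thread: routed traffic to those addresses would be flooded until the switch relearns the MAC.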
In a message written on Wed, Jun 17, 2009 at 02:32:44PM -0700, Brian Shope wrote:
Anyone have any suggestions on either prevention/monitoring?
If you control the servers, writing a small program to emit a packet every 300 seconds or so out of every interface should be nearly trivial, and will ensure the switch dynamically learns where all the MAC addresses are. I think I would find that vastly preferable to static configuration.

--
Leo Bicknell - bicknell@ufp.org - CCIE 3440
PGP keys at http://www.ufp.org/~bicknell/
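A minimal sketch of such a keepalive program might look like the following. It is an illustration, not a tested production tool: the function names are made up, the EtherType 0x88B5 (an IEEE local experimental value) and the broadcast destination are choices made for this example, and the sending part assumes Linux with root privileges. The interface name and source MAC must be supplied by the caller.

```python
import socket
import struct
import time

BROADCAST = b"\xff" * 6
ETHERTYPE_LOCAL_EXPERIMENTAL = 0x88B5  # assumption: any harmless EtherType would do
MIN_PAYLOAD = 46  # zero padding up to the 60-byte minimum frame (without FCS)


def build_keepalive(src_mac: bytes) -> bytes:
    """Build a minimal broadcast frame whose only job is to carry src_mac."""
    if len(src_mac) != 6:
        raise ValueError("src_mac must be exactly 6 bytes")
    header = struct.pack("!6s6sH", BROADCAST, src_mac, ETHERTYPE_LOCAL_EXPERIMENTAL)
    return header + bytes(MIN_PAYLOAD)


def send_keepalives(ifname: str, src_mac: bytes, interval: float = 300.0) -> None:
    """Send one keepalive per interval on a Linux interface. Runs until killed."""
    frame = build_keepalive(src_mac)
    with socket.socket(socket.AF_PACKET, socket.SOCK_RAW) as sock:
        sock.bind((ifname, 0))
        while True:
            sock.send(frame)
            time.sleep(interval)
```

One instance per interface, for example `send_keepalives("eth1", bytes.fromhex("001122334455"))` with placeholder values, would keep the source MAC visible to every switch in the VLAN. To be effective, the interval should be shorter than the switches' MAC aging time.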
Brian,
The first is preventing it in the first place.
As annoying as this might sound, this is one of the standard operating modes for load balancing within a Microsoft server cluster (see NLB). We've tried to avoid it, but it seems to come up around once a year from someone on our campus...

Eric :)
Very true, Eric. Microsoft even acknowledges the issue, and still has not fixed it. I have had a few customers use NLB and have this issue.
--
Steve King
Steven King wrote:
Very true Eric. Microsoft even acknowledges the issue, and still has not fixed it. I have had a few customers use NLB and have this issue.
I understand this is 'working as designed'? Much like the Stonegate (?) firewall redundancy trick? It was a little worse when doing the multicast-L2 to unicast-L3 address trick.

By the way, if you think this is funny in a campus Ethernet backbone, try it in an old ATM/LANE environment. I had a customer that had the chance to try it, and wanted a root cause analysis. The BUS switch was NOT happy about forwarding all the traffic going to the firewall cluster :-)
participants (10)
- Brian Shope
- Deepak Jain
- Eric Gauthier
- Holmes,David A
- Jeff Kell
- Julio Arruda
- Lee
- Leo Bicknell
- Matthew Huff
- Steven King