Strange connectivity issue Frontier EVPL
We have a strange issue that defies logic. We have an NNI at our POP with Frontier serving as an aggregation circuit with different customers on different VLANs. It's working well to several customers.
Bringing up a new customer shows roughly half of the IP addresses unreachable across the link, as if there's some kind of load-balancing or hashing function that's misdirecting half of the traffic. It's consistent: if an address is reachable, it's always reachable; if it's not reachable, it's never reachable. Everything ARPs fine.
The Frontier circuit is layer 2, so it shouldn't care about IP addresses. Frontier's tech shows no trouble. They changed the on-premises RAD device. We've triple-checked configurations, torn down and rebuilt the subinterface, etc., with no joy.
Any suggestions?
-- Jay Hennigan - jay@west.net Network Engineering - CCIE #7880 503 897-8550 - WB6RDV
Could you be running up against a MAC table limit on the circuit?
EVPL (E-Line) should not be learning MACs, so MAC table size should be a non-issue. Unless someone somewhere has constructed a two-part bridge domain (in MEF-speak, an E-Tree or E-LAN of sorts), which would have MAC learning; then Matt's question comes into play.
-Aaron
On 11/6/20 09:08, Matt Hoppes wrote:
Could you be running up against a MAC table limit on the circuit?
Unlikely. The only MACs that should be in play are our gateway on our PE router and the customer's router and those are both in the address table and ARP. At layer 3, customer can consistently reach about 50% of the IP addresses attempted. -- Jay Hennigan - jay@west.net Network Engineering - CCIE #7880 503 897-8550 - WB6RDV
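The symptom described so far (a fixed, repeatable half of the destinations failing) is what per-flow hashing over two paths with one dead path would look like. The following is only a minimal sketch of that idea: the two-member bundle, the CRC32 hash over source and destination IP, and the documentation-range addresses are all assumptions for illustration, not anything known about Frontier's network.

```python
import ipaddress
import zlib

def member_for(src_ip, dst_ip, n_members=2):
    """Pick a path for a flow with a deterministic hash of its IP pair.
    CRC32 stands in for whatever hash a real device would use (assumption)."""
    key = ipaddress.ip_address(src_ip).packed + ipaddress.ip_address(dst_ip).packed
    return zlib.crc32(key) % n_members

def reachable(src_ip, dst_ip, dead_member=1):
    """A flow fails whenever it hashes onto the (hypothetical) dead member."""
    return member_for(src_ip, dst_ip) != dead_member

# Hypothetical test addresses from the documentation ranges.
src = "192.0.2.1"
targets = [str(ipaddress.ip_address("198.51.100.0") + i) for i in range(1, 255)]

ok = [t for t in targets if reachable(src, t)]
print(f"{len(ok)} of {len(targets)} destinations reachable")

# The hash is deterministic, so a second pass gives exactly the same set:
# an address that works always works, one that fails always fails.
assert ok == [t for t in targets if reachable(src, t)]
```

If something like this is happening, changing only the source address used for testing should change which destinations fail, which is one way to tell hashing apart from a per-host filter.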
Jay, I previously ran the engineering org over there, so sent this to my old team to look at, including the best engineer I know in regard to the RADs. Will pass along anything they come back with. Thanks, -Jeff
I have similar Frontier NNIs out of One Wilshire, some 1 gig, some 10.
While I haven't seen the half-IP-reachable issue you describe, I have spent days and days chasing performance issues on them. I finally got gig line-rate capable iperf3 boxes at both ends and see distinct differences in single-TCP-stream performance vs. running 3-4 streams, and the difference disappears like clockwork at "unbusy hours" (1am-7am) every day.
After running hundreds of tests and adjusting my buffering and RED on both ends of these circuits, I have come to the conclusion that they have some LAGs somewhere "in the middle" that get busy during the day, and they don't care if I have to run 4 TCP streams to max out a 1 gig circuit.
It makes browser-based speedtests look really bad, but otherwise the circuits are usable. We're trying to replace the worst ones with wavelength services.
-Will Orton
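Will's single-stream vs. multi-stream observation can be illustrated with a toy model of per-flow hashing on a LAG. This is a rough sketch only: the four-member bundle, the spare-capacity figures, the addresses, and the CRC32 hash are hypothetical and are not measurements from any Frontier circuit.

```python
import zlib

CIRCUIT_MBPS = 1000  # 1 gig circuit, as in the post above

def member_for(flow, n_members):
    """Pin a flow to one LAG member with a deterministic hash of its 5-tuple."""
    key = "|".join(str(field) for field in flow).encode()
    return zlib.crc32(key) % n_members

def achievable_mbps(flows, free_mbps):
    """Upper bound on aggregate throughput: each flow can only use the spare
    capacity of the one member it hashes to, capped at the circuit rate."""
    members_used = {member_for(flow, len(free_mbps)) for flow in flows}
    return min(CIRCUIT_MBPS, sum(free_mbps[m] for m in members_used))

# Hypothetical spare capacity (Mbps) per member; illustrative numbers only.
busy_hours = [250, 300, 350, 400]
quiet_hours = [1000, 1000, 1000, 1000]

# One TCP stream vs. four streams that differ only in source port.
one = [("192.0.2.10", "198.51.100.20", "tcp", 50000, 5201)]
four = [("192.0.2.10", "198.51.100.20", "tcp", 50000 + i, 5201) for i in range(4)]

for label, free in (("busy", busy_hours), ("quiet", quiet_hours)):
    print(label, achievable_mbps(one, free), achievable_mbps(four, free))
```

In this model a single stream never gets more than the spare capacity of one member, while several streams can add up across members, and the gap closes when every member has headroom. How many distinct members the four streams actually land on depends on the hash.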
What hardware is on each side?
On 11/6/20 10:14, Mike Lyon wrote:
What hardware is on each side?
On our aggregate side an ASR920. Customer has a RAD device as the Frontier handoff. We've seen the same issue with multiple devices at the customer side including a laptop direct to the RAD. -- Jay Hennigan - jay@west.net Network Engineering - CCIE #7880 503 897-8550 - WB6RDV
Recently saw a very similar problem when Wave migrated us off of their antiquated 6500 to a brand new ASR920. The EVPL had been working flawlessly for years on the 6500, but stopped working when migrated to the ASR. Tried multiple ports on the ASR and then even another brand new ASR, same problem. Moved the circuit over to another (different) antiquated 6500 and all was good.
On my side, I was using a Mikrotik. I had the port in a bridge group and was seeing all the MAC addresses across the link, but for some reason they weren’t showing up in the ARP table of the Mikrotik. Tried a couple of other Mikrotik devices, same thing. Installed a dumb gigabit switch in the middle, same thing. However, when my laptop was plugged in, that worked.
So yes, I've seen the same weird behavior. As to how to fix it, no idea :)
-Mike
On Friday, 6 November 2020, 10:31:25, Jay Hennigan wrote:
On 11/6/20 10:14, Mike Lyon wrote:
What hardware is on each side?
On our aggregate side an ASR920. Customer has a RAD device as the Frontier handoff. We've seen the same issue with multiple devices at the customer side including a laptop direct to the RAD.
It sounds a bit like load balancing with one broken link...
Have you verified, for example with ACL counters at both sides of the link, in which direction the packets are dropped?
As the customer has changed the devices, does the ASR use a MAC starting with 4 or 6?
My only idea at the moment is to generate load on the link with a UDP traffic generator, using a flow that does not work end to end, and let them check where the traffic dies within their network.
Karsten
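Karsten does not spell out why a MAC starting with 4 or 6 would matter. One likely reading, and it is an assumption here, is the pitfall described in RFC 4928: some MPLS transit routers guess the payload type from the first nibble after the label stack (4 for IPv4, 6 for IPv6), so an Ethernet pseudowire without a control word can be misparsed and load-balanced on bogus "IP" fields when the destination MAC begins with 4 or 6. A small sketch of the check, with made-up example MACs:

```python
import string

def first_nibble(mac):
    """Return the high-order four bits of a MAC address as an integer.
    Accepts colon, dash, or Cisco dotted notation."""
    digits = "".join(c for c in mac if c in string.hexdigits)
    if len(digits) != 12:
        raise ValueError(f"not a 48-bit MAC address: {mac!r}")
    return int(digits[0], 16)

def may_be_mistaken_for_ip(mac):
    """True if the first nibble matches an IP version number (4 or 6)."""
    return first_nibble(mac) in (4, 6)

# Hypothetical addresses, for illustration only.
for mac in ("4c:00:82:aa:bb:cc", "6400.f1aa.bbcc", "00-1b-54-aa-bb-cc"):
    print(mac, may_be_mistaken_for_ip(mac))
```

The MACs to test would be the ones on the ASR920 subinterface and on the customer router, since either can be the destination MAC depending on direction.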
This is my biggest complaint about non-wavelength transport. The provider is overselling a port somewhere in the circuit, unless it's a wave.
----- Mike Hammett Intelligent Computing Solutions Midwest Internet Exchange The Brothers WISP
My coworker is having similar issues with PS Lightwave and Alpheus/Logix from San Antonio to Houston, where some things work and some things don't.
-Aaron
I'm amazed you can get *anything* to work with Logix involved. Haven't heard of many issues with PSLightwave in Houston, however... they seem to be one of the only halfway decent options here.
As it happens, I've just recently turned up a peering circuit with PSL in Houston, and their senior engineer is clue++.
Naturally, he's on vacation this week, but [Aaron] ping me unicast if I might be able to assist, lend eyeballs, or make an introduction for you guys next week.
--Adam
On 11/9/20, 8:05 AM, Tim Burke <tim@mid.net> wrote:
> I'm amazed you can get *anything* to work with Logix involved. Haven't heard of many issues with PSLightwave in Houston, however... they seem to be one of the only halfway decent options here.
participants (10)
- aaron1@gvtc.com
- Adam Korab
- Jay Hennigan
- Jeff Richmond
- Karsten Thomann
- Matt Hoppes
- Mike Hammett
- Mike Lyon
- Tim Burke
- will@loopfree.net