Hi, are there any AWS engineers out there? We are seeing routing problems between NTT and AWS in Ashburn, VA, and would like to find out which side is having the problem. Thanks, Curt
I was just about to email the group for a related issue. We are also seeing some funky routing/peering within the AWS network. We primarily communicate with Verizon Media/Oath - AS10310. Verizon Media has a presence in Singapore, and it's peered locally with AWS AS38895 - we normally see 8ms latency. Verizon Media also peers with AWS AS16509 in Japan, but for Singapore traffic Verizon Media sends a lower MED, so AWS Singapore should prefer that route/peer. That isn't working properly on the AWS side: all of our traffic is going to Japan. This started early AM today.

I had Verizon Media investigate and gave them our AWS Singapore IP addresses; they confirmed that they are not receiving those prefixes/announcements from AWS Singapore (AS38895). So something is broken… hopefully if someone from AWS is reading they can escalate. In my case, the AWS Singapore IP ranges in question are 46.51.216.0/21 and 52.74.0.0/16.

-John
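For reports like John's, one quick outside check is whether the prefixes are globally visible at all and from which origin AS. Below is a minimal sketch, assuming the public RIPEstat prefix-overview endpoint; the response field names ("announced", "asns") are assumptions based on the published Data API, so verify them against the current docs before relying on the output.

import requests

RIPESTAT = "https://stat.ripe.net/data/prefix-overview/data.json"

# Prefixes John reported as missing from Verizon Media's view of AS38895.
PREFIXES = ["46.51.216.0/21", "52.74.0.0/16"]

def prefix_overview(prefix):
    """Fetch global visibility information for one prefix from RIPEstat."""
    resp = requests.get(RIPESTAT, params={"resource": prefix}, timeout=30)
    resp.raise_for_status()
    return resp.json().get("data", {})

for prefix in PREFIXES:
    data = prefix_overview(prefix)
    announced = data.get("announced")                        # field name assumed
    origins = [a.get("asn") for a in data.get("asns", [])]   # field name assumed
    print(f"{prefix}: announced={announced}, origin ASNs={origins}")

This only shows visibility from public route collectors; whether a specific peer such as Verizon Media receives the announcement still has to be confirmed on their side, as John did.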
Are you sure the problem isn't NTT? My buddy's WISP peers with Spirit and had a boatload of problems with random packet loss, initially affecting just SIP and RTP (both UDP). Spirit was blaming NTT. The problems went away when Spirit stopped peering with NTT yesterday. The path is through Telia now to their main SIP trunk provider.

chuck
Interesting. At my 9-5 we use NTT exclusively for SIP traffic and it has been flawless. If there are any tests you want me to run over NTT via their PoP at 111 8th, let me know.

On Thu, May 9, 2019 at 6:35 AM Chuck Church <chuckchurch@gmail.com> wrote:
Are you sure the problem isn’t NTT? My buddy’s WISP peers with Spirit and had a boatload of problems with random packet loss affecting initially just SIP and RTP (both UDP). Spirit was blaming NTT. Problems went away when Spirit stopped peering with NTT yesterday. Path is through Telia now to their main SIP trunk provider.
chuck
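For the kind of test Dovid offers, a simple way to quantify random UDP loss over a specific path is to send sequence-numbered datagrams from a fixed source port and count echoed replies. A minimal sketch, assuming a UDP echo responder you run yourself at the far end; the target address below is a documentation-range placeholder.

import socket
import time

TARGET = ("198.51.100.10", 40000)   # placeholder: your own UDP echo responder
SOURCE_PORT = 5060                  # fixed source port, SIP-like traffic profile
COUNT = 200
TIMEOUT = 1.0

sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
sock.bind(("0.0.0.0", SOURCE_PORT))
sock.settimeout(TIMEOUT)

received = 0
rtts = []
for seq in range(COUNT):
    payload = f"probe {seq}".encode()
    start = time.monotonic()
    sock.sendto(payload, TARGET)
    try:
        # Simple one-at-a-time matching: a late reply to an earlier probe is
        # ignored and may cause the current probe to be counted as lost.
        data, _ = sock.recvfrom(2048)
        if data == payload:
            received += 1
            rtts.append((time.monotonic() - start) * 1000.0)
    except socket.timeout:
        pass                        # no reply within the timeout: counted as lost
    time.sleep(0.02)                # 20 ms pacing, roughly RTP-like

loss = 100.0 * (COUNT - received) / COUNT
avg_rtt = sum(rtts) / len(rtts) if rtts else float("nan")
print(f"sent={COUNT} received={received} loss={loss:.1f}% avg_rtt={avg_rtt:.1f}ms")

Running the same sweep over different upstream paths (for example the NTT path versus another transit) makes loss comparisons between the two sides much less hand-wavy.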
Hi Chuck,

On Thu, May 09, 2019 at 06:34:21AM -0400, Chuck Church wrote:
Are you sure the problem isn’t NTT? My buddy’s WISP peers with Spirit and had a boatload of problems with random packet loss affecting initially just SIP and RTP (both UDP). Spirit was blaming NTT. Problems went away when Spirit stopped peering with NTT yesterday. Path is through Telia now to their main SIP trunk provider.
I don't know the specifics of what you reference, but in a large, geographically dispersed network like NTT's backbone, I can assure you there will always be something down somewhere. Issues can take on many forms: sometimes it is a customer-specific issue related to a single interface, sometimes something larger is going on. It is quite rare that the whole network is on fire, so in the general case it is good to investigate and consider each report about potential issues separately.

The excellent people at the NTT NOC are always available at noc@ntt.net or via the phone numbers listed in PeeringDB.

Kind regards,

Job
Job,

We have had a lot of dialog with the excellent people at the NTT NOC this week, easily over a couple of hours in total. We were told to talk to AWS directly and to have our customers talk to AWS. Basically, an "it's not us" response. So we reached out to our buddies on NANOG. We have no way to get AWS to communicate with us; we don't peer with them directly the way we do with many other cloud providers off the Equinix IX.

We have a workaround: we broke up some of our Ashburn /21 advertisements into /23 and /24 advertisements for the ranges that include our customer IP assignments. Pushing a more specific route out our Ashburn peers, versus our out-of-area peers such as Chicago, is helping. That has resolved our direct customer issues, but it leads us to believe that where we have BGP peering in regions outside of Ashburn, VA, AWS isn't honoring our AS prepending.

The original issue is that our local customers in the DC region get routed from our AS over NTT into AWS in Ashburn for AWS-East region environments, but AWS sends the return traffic to Chicago, to one of our other upstream peers. For a few select customers this breaks their applications completely: they either cannot connect at all, or performance is severely disrupted and the applications slow to a crawl. Yet we can push iperf traffic to our own AWS instances with zero packet loss and no perceivable issue other than the asymmetrical routing, which adds around 30ms to the return latency versus the typical 2ms to 3ms. We do have Layer 2 between our POPs.

Is ignoring AS prepending common? Given my example issue, what direction would you normally take?

Sincerely,
Nick Ellermann
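On the prepending question Nick raises: one way to see how the announcements (including the deaggregated /23s and /24s and any prepends) actually propagate is to inspect the AS paths held by public route collectors. A minimal sketch, assuming the RIPEstat looking-glass endpoint and that its response nests AS paths under data["rrcs"][*]["peers"][*]["as_path"]; those field names, the prefix, and the ASN are placeholders/assumptions to be swapped for real values.

import collections
import requests

LOOKING_GLASS = "https://stat.ripe.net/data/looking-glass/data.json"
PREFIX = "192.0.2.0/24"   # placeholder: one of your deaggregated /23s or /24s
MY_ASN = "64500"          # placeholder: your own AS number

resp = requests.get(LOOKING_GLASS, params={"resource": PREFIX}, timeout=30)
resp.raise_for_status()
data = resp.json().get("data", {})

prepend_counts = collections.Counter()
for rrc in data.get("rrcs", []):
    for peer in rrc.get("peers", []):
        path = peer.get("as_path", "").split()
        # Count trailing copies of our ASN as seen by this collector peer;
        # if prepends never reach a region, the count stays at 1 there.
        trailing = 0
        for asn in reversed(path):
            if asn != MY_ASN:
                break
            trailing += 1
        prepend_counts[trailing] += 1

for copies, peers in sorted(prepend_counts.items()):
    print(f"{peers} collector peers see {copies} trailing copies of AS{MY_ASN}")

This shows whether the prepends propagate at all; whether the far side's egress selection actually honors them is a separate question, which Job touches on below.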
Dear Nick,

I sympathize with your plight; network debugging can be quite a test of character at times. I am snipping some text, as I can't comment on specific details in this case, but you do raise two excellent questions which I can maybe help with.

On Thu, May 09, 2019 at 03:05:43PM +0000, Nick Ellermann wrote:
Is ignoring AS prepending common?
It is not common, but yes, it does happen. Some cloud providers and CDNs have broken away from the traditional BGP best path selection and use SDN controllers to steer traffic. I don't know whether that is in play here or not.
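For context on what stepping away from traditional best path selection implies: prepends only act at the AS-path-length step, and MED even later, so a controller that pins egress per prefix ahead of that comparison can make both appear ignored. A simplified sketch of the classic decision order follows; real implementations have more tie-breakers, MED is normally only compared between routes from the same neighboring AS, and the AS numbers are arbitrary documentation-range examples.

from dataclasses import dataclass
from typing import List

@dataclass
class Route:
    local_pref: int
    as_path: List[str]
    med: int
    origin_rank: int = 1   # 0 = IGP, 1 = EGP, 2 = incomplete

def best_path_key(route):
    # Higher local-pref wins first, then shorter AS path (where prepending
    # acts), then lower origin rank, then lower MED.
    return (-route.local_pref, len(route.as_path), route.origin_rank, route.med)

routes = [
    Route(local_pref=100, as_path=["64496", "64500", "64500", "64500"], med=10),  # prepended
    Route(local_pref=100, as_path=["64497", "64500"], med=50),
]
best = min(routes, key=best_path_key)
print("preferred AS path:", " ".join(best.as_path))

A controller-driven network that pins next hops per prefix effectively skips this comparison, which is why a remote side's prepends or MEDs can look like they are being ignored.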
Given my example issue, what direction would you normally take?
Your issue reminds me of one I encountered some years ago. A member of the Dutch community reported that seemingly random pairs of IP addresses could not reach each other across an Internet Exchange fabric. It drove this person crazy because none of the involved parties could find anything wrong within their own domain. The debugging process was hard: the person had to ask others for ping sweeps and traceroutes, would get information back without timestamps, and didn't have the ability to alter source and destination ports on the packets sent for debugging. It turned out to be a faulty linecard that, under specific circumstances, would hash traffic into a blackhole. It took WEEKS to find this.

So I identified a need for a more advanced debugging platform, one that wouldn't require human-to-human interaction to help operators debug things; in other words, it seemed to make sense to stand up Linux shell servers in lots of networks and share access with each other. This project is the NLNOG RING, and I'd recommend you participate. An introduction can be found here https://www.youtube.com/watch?v=TlElSBBVFLw and a nice use case video is available here https://www.youtube.com/watch?v=mDIq8xc2QcQ

NTT, Amazon, and many others are part of it. I assume you have SSH access to the problematic destination, so I hope you can use tcpdump there to verify whether you can or can't receive packets coming from NLNOG RING nodes. You mentioned that altering your announcements (deaggregating, prepending) resolves the issue; this strongly suggests that something somewhere is broken, and it is a matter of triangulating until you've found the shortest path that exhibits the problem. Perhaps you can narrow it down to something like "between these two nodes, when I use source port X, protocol Y, destination port Z, traffic doesn't arrive."

Website: https://ring.nlnog.net/ There is also an IRC channel where people can perhaps help you make the best use of this tool.

Kind regards,

Job
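As a concrete form of the triangulation Job describes, a small sweep that holds the destination fixed and varies the source port can expose 5-tuple-dependent blackholing, the faulty-linecard hashing case above. A minimal sketch, assuming a far-end host and TCP port you know are listening (for instance an NLNOG RING node you have access to); the address below is a documentation-range placeholder.

import socket

DEST = ("198.51.100.20", 22)        # placeholder: far-end host and a listening TCP port
SOURCE_PORTS = range(40000, 40100)  # sweep 100 source ports
TIMEOUT = 2.0

no_answer = []
for sport in SOURCE_PORTS:
    s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    s.settimeout(TIMEOUT)
    try:
        s.bind(("0.0.0.0", sport))
        s.connect(DEST)             # handshake completed: this 5-tuple gets through
    except ConnectionRefusedError:
        pass                        # an RST still proves the packets arrived
    except OSError:
        no_answer.append(sport)     # timeouts (or local bind failures) land here
    finally:
        s.close()

print(f"{len(no_answer)}/{len(SOURCE_PORTS)} source ports got no answer: {sorted(no_answer)}")

Repeating the sweep in the reverse direction, and with UDP, helps narrow down whether the blackholing depends on the hash inputs, which is exactly the failure pattern described in the linecard story above.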
participants (6)
- Chuck Church
- Curt Rice
- Dovid Bender
- Job Snijders
- John Von Essen
- Nick Ellermann