Dear Nick,

I sympathize with your plight; network debugging can be quite a test of character at times. I am snipping some text as I can't comment on specific details in this case, but you do raise two excellent questions which I can maybe help with.

On Thu, May 09, 2019 at 03:05:43PM +0000, Nick Ellermann wrote:
Is ignoring AS prepending common?
It is not common, but yes, it does happen. Some cloud providers and CDNs have broken away from the traditional BGP best path selection and use SDN controllers to steer traffic. I don't know whether that is in play here or not.
Given my example issue, what direction would you normally take?
Your issue reminds me of one I encountered some years ago. A member of the Dutch community reported that seemingly random pairs of IP addresses could not reach each other across an Internet Exchange fabric. It drove this person crazy because none of the involved parties could find anything wrong within their own domain. The debugging process was hard because the person had to ask others for ping sweeps and traceroutes, would get information back without timestamps, and didn't have the ability to alter source and destination ports on the packets sent for debugging. It turned out to be a faulty linecard that, under specific circumstances, would hash traffic into a blackhole. It took WEEKS to find this.

So, I identified a need for a more advanced debugging platform - one that wouldn't require human-to-human interaction to help operators debug things. In other words, it seemed to make sense to stand up Linux shell servers in lots of networks and share access with each other. This project is the NLNOG RING, and I'd recommend that you participate.

An introduction can be found here: https://www.youtube.com/watch?v=TlElSBBVFLw
and a nice use case video is available here: https://www.youtube.com/watch?v=mDIq8xc2QcQ

NTT, Amazon, and many others are part of it. I assume that you have SSH access to the problematic destination, so I hope you can use tcpdump there to verify whether or not you receive packets coming from NLNOG RING nodes.

You mentioned that altering your announcements (deaggregating, prepending) resolves the issue. This strongly suggests that something somewhere is broken, and it is a matter of triangulating until you've found the shortest path that exhibits the problem. Perhaps you can find something like "Between these two nodes, when I use source port X, protocol Y, destport Z, traffic doesn't arrive".

Website: https://ring.nlnog.net/

There is also an IRC channel where people can perhaps help you make the best use of this tool.

Kind regards,

Job
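
P.S. To make the "source port X, protocol Y, destport Z" idea a bit more concrete, here is a rough Python sketch, not a polished tool. It assumes UDP probes only, and that you note by hand which source ports tcpdump shows on the destination. The function names are my own invention for illustration and are not part of any NLNOG RING tooling.

```python
import socket


def send_udp_probes(dst_host, dst_port, src_ports):
    """Send one UDP datagram per source port (run this on a remote node).

    The payload carries the source port so that a capture on the
    destination shows which flows made it through.
    Returns the list of source ports that were actually used.
    """
    sent = []
    for sport in src_ports:
        with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as s:
            try:
                s.bind(("", sport))  # pin the source port for this probe
            except OSError:
                continue  # port busy or not permitted; skip it
            s.sendto(b"probe sport=%d" % sport, (dst_host, dst_port))
            sent.append(sport)
    return sent


def missing_flows(sent_ports, received_ports):
    """Source ports that were sent but never seen at the destination."""
    return sorted(set(sent_ports) - set(received_ports))


def consistent_losses(runs):
    """Given several (sent, received) runs, keep only the source ports
    that were lost in every run. Random loss tends to drop out here,
    while a hash-dependent blackhole should stay."""
    lost_sets = [set(missing_flows(sent, received)) for sent, received in runs]
    if not lost_sets:
        return []
    return sorted(set.intersection(*lost_sets))
```

You would run send_udp_probes() on the remote node while tcpdump runs on the destination, repeat that a few times, and feed the sent and seen port lists into consistent_losses(). If the same source ports vanish in every run, that points toward a hashing problem on the path rather than random loss.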