On Jan 15, 2014, at 10:26 AM, William Herrin <bill@herrin.us> wrote:
Of course working, monitorable and testable are three different things. If my NMS can't reach the IXP's addresses, my view of the IXP is impaired. And "the Internet is broken" is not a trouble report that leads to a successful outcome with customer support... it helps to be able to pin things down with some specificity.
This approach concerns me for a number of reasons.

First, having your NMS ping your upstream’s IXP peers probably doesn’t scale. If I’m a peer of a reasonably large provider, I’m pretty sure I don’t want all their customers hammering my management plane. Even if you’re the only one doing it, you don’t know whether I’m rate-limiting pings for that or any other reason.

Second, what information do you get that you didn’t already have? If you saw the IP in a traceroute, then you know it exists, is alive, and is in the path, and you have a rough estimate of the latency. Pinging it may even give you misleading information. Platforms vary and all, but in my experience pinging a router, especially a potentially busy one peering at an IXP, shows notably worse performance than “real” traffic experiences (admittedly somewhat true of TTL Expired responses too, but less so in my experience). Now you’re potentially seeing high latency and packet loss that in reality might not be there at all.

Third, you don’t know that your ping to the peering IP is even taking the same path as the packets addressed to the real destination. MTR, for example, looks nice, but it would probably be more accurate if it simply ran the traceroute over and over instead of pinging each hop directly. You would also detect path changes for the real destination that pinging intermediate hops wouldn’t show you.

While I appreciate the desire to do as much of your own detective work as possible, I can also see where you’re now shifting workload onto someone else’s support organization when they’re not necessarily the problem either (“Hey, my NMS says your peering router is causing latency and packet loss, fix it!”). I’m also not saying there isn’t a troubleshooting gap caused by this. I’m just not sure being able to ping the IXP hop solves that problem either.

Semi-related tangent: Working in an IXP setting, I have seen weird corner cases cause issues in conjunction with the IXP subnet existing in BGP.
Say someone’s got proxy ARP enabled on their router (sadly, more common than it should be, and not just from noobs at startups). Now say your IXP is growing and you expand the subnet. No matter how much you harp on the customers to make the change, they don’t all do it at once. Someone announces the new, larger subnet in BGP. Now when anyone ARPs for IPs in the new part of the range, proxy ARP guy (still on the smaller subnet) says “hey I have a route for that, send it here”. That was fun to troubleshoot. :)

-c
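
For anyone who wants to reason through the proxy ARP corner case above, here is a minimal sketch in Python. It is a deliberately simplified model, not how any particular router implements proxy ARP: it assumes a router answers an ARP request whenever the target is outside its own connected subnet but covered by a route it has learned. The prefixes (10.0.0.0/24 growing to 10.0.0.0/23) and the function name `answers_arp_for` are made up for illustration.

```python
import ipaddress


def answers_arp_for(target, connected, learned_routes):
    """Simplified proxy ARP model (an assumption, not vendor behavior).

    The router answers an ARP request for `target` if the address is NOT
    on its connected subnet but IS covered by one of its learned routes.
    """
    addr = ipaddress.ip_address(target)
    if addr in connected:
        # On-link from this router's point of view: the real owner answers.
        return False
    return any(addr in route for route in learned_routes)


# Hypothetical IXP LAN before and after the expansion.
old_lan = ipaddress.ip_network("10.0.0.0/24")
new_lan = ipaddress.ip_network("10.0.0.0/23")

# Someone announces the new, larger subnet in BGP, so every router learns it.
bgp_routes = [new_lan]

# A router still configured with the old /24 and proxy ARP enabled:
print(answers_arp_for("10.0.0.50", old_lan, bgp_routes))  # old half of the range
print(answers_arp_for("10.0.1.50", old_lan, bgp_routes))  # new half of the range

# A router that has already moved to the /23:
print(answers_arp_for("10.0.1.50", new_lan, bgp_routes))
```

Under this model the stale router stays quiet for addresses in the old half of the range, answers for addresses in the new half, and stops doing so once its connected subnet is updated to the /23.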