Re: Refusing Pings on Core Routers??? A new trend?

20 Oct 2006

On Thu, Oct 19, 2006 at 07:18:01PM -0400, Deepak Jain wrote:
> ...
> 1 NOC (that will remain nameless even though they should really be
> shamed) said the following in response to the question -- when we were
> trying to diagnose +50ms jumps in their latency within a single POP.
> Q: "As part of this, can you tell me why your router is prohibiting packets
> being sent to our interface?"
> A: "The reason you cannot hit your interface is it is blocked for
> security reasons."

I've heard this response before, albeit not from the company you're
referring to.  The most common response -- which is at this point a
template response -- I hear is "Well, you can't rely on traceroute
because of ICMP prioritisation".  When you start to explain how
traceroute actually works (both ICMP-based and UDP-based (which
still relies on ICMP responses, of course!)), and that ICMP prio
should only affect the IP on which the router listens (and not
hops beyond or at the dest), most NOCs fire back with another
template of their choice ("We're not aware of any issues", "No that's
incorrect", "I'll check with engineers", or the ever-so-amusing
"traceroute and ping aren't reliable, you need to use a different
method of testing") -- but the most common is: "can you send us
traceroutes of what you're seeing?"

"But! You just said..... argh!!"

I happen to work in a NOC, and I have never -- nor will I ever --
spout off that template response.  When a client or customer calls
about something, I give them the benefit of the doubt.  If it
turns out they're wrong later, they at least (hopefully) learned
something.  I just happen to believe in getting things done,
rather than arguing against doing investigative work.  When an
issue occurs, look at it quickly, not 24-48 hours later.

I am absolutely fine with ICMP being prioritised last, but those
scenarios raise more questions: "so ICMP is prio'd last, which
would mean the router is busy processing other packets, which could
mean your router is over-utilised either CPU-wise or iface-wise
since we're seeing 250ms at your hop and beyond".  48 hours later,
a network technician looks at the router and either finds absolutely
nothing ("It must've gone away on its own") or finds something
conclusive (but only when the issue re-occurs, is still occurring,
or if they keep historic data).
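
To make the distinction concrete (ICMP prioritisation at one hop versus
a problem that shows "at your hop and beyond"), here is a minimal
sketch.  It is a toy model, not a description of any vendor's router:
the Hop class, its two fields and the reported_loss function are names
made up for illustration, and the return path is assumed to be
lossless.  A probe that expires at hop N has to be forwarded by every
hop before N, and then answered by hop N itself.

```python
from dataclasses import dataclass
from math import prod


@dataclass
class Hop:
    """Hypothetical model of one router on the path."""
    name: str
    transit_delivery: float = 1.0  # fraction of transit packets it forwards
    icmp_reply_rate: float = 1.0   # fraction of expired probes it answers


def reported_loss(path):
    """Expected loss % a traceroute-style tool would report for each hop."""
    losses = []
    for n, hop in enumerate(path):
        # Every earlier hop must forward the probe before it can expire here.
        reach = prod(h.transit_delivery for h in path[:n])
        # This hop must then bother to generate the ICMP reply.
        losses.append(round(100.0 * (1.0 - reach * hop.icmp_reply_rate), 1))
    return losses


# Hop "b" deprioritises its own ICMP replies but forwards everything.
deprioritised = [Hop("a"), Hop("b", icmp_reply_rate=0.5), Hop("c"), Hop("d")]
# Hop "b" answers probes fine but drops half of the traffic it should forward.
congested = [Hop("a"), Hop("b", transit_delivery=0.5), Hop("c"), Hop("d")]

print(reported_loss(deprioritised))  # [0.0, 50.0, 0.0, 0.0]
print(reported_loss(congested))      # [0.0, 0.0, 50.0, 50.0]
```

In this model, deprioritised ICMP shows up at exactly one hop, while a
forwarding problem shows up at every hop after the one causing it.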
> ...
> Did I miss the conspiracy?? I know my membership dues are all paid up.
> If this has been going on a while, I apologize I guess I've just noticed
> the trend in our shift reports.

Yes, this has been going on for a while.  Well, not ICMP_UNREACH_NET
(from your example) but general ICMP prioritisation or the explicit
dropping of either ICMP_TIMXCEED (traceroute) or ICMP_ECHOREPLY (ping).

A real-life example, from my own (residential) ISP.  Try to imagine
reporting an issue at hop 6 to a technician (who will always insist
the problem is somewhere prior).  Here's an example of a working
network (no sarcasm; I'm serious!):

1. 192.168.1.1                   0.0%    30    30   0.5   0.5   0.5   0.6
2. ???                          100.0    30     0   0.0   0.0   0.0   0.0
3. 68.87.198.129                 0.0%    30    30   8.5   9.9   7.5  21.1
4. 68.87.192.34                 30.0%    30    21   9.2  11.9   9.2  20.5
5. 68.87.226.134                66.7%    30    10  10.9  12.9  10.2  25.9
6. 12.116.188.13                 0.0%    30    30  10.6  12.6  10.1  25.0
7. 12.123.12.126                 0.0%    30    30  12.9  12.3  10.2  15.3

And an example of when things are broken:

1. 192.168.1.1                   0.0%    30    30   0.5   0.5   0.4   0.6
2. ???                          100.0    30     0   0.0   0.0   0.0   0.0
3. 68.87.198.129                 0.0%    30    30  12.7  12.7   8.1  28.4
4. 68.87.192.34                 20.0%    30    24  13.4  11.5   9.5  14.5
5. 68.87.226.134                96.7%    30     1  12.6  12.6  12.6  12.6
6. 12.116.188.13                50.0%    30    15  15.1  11.8  10.3  15.1
7. 12.123.12.122                50.0%    30    15  11.6  17.5  11.1  60.5
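
One way to read the two tables above, sketched in Python.  This is a
heuristic under stated assumptions, not a diagnosis: I am assuming the
first numeric column is the loss percentage per hop, that real path
loss can only accumulate hop by hop, and that the return path is clean.
Under those assumptions, the real loss up to a given hop can be no
higher than the lowest loss reported at that hop or any hop after it;
anything above that bound is the hop declining to answer.  The function
name loss_upper_bounds is made up for illustration.

```python
def loss_upper_bounds(reported):
    """Upper bound on real cumulative loss at each hop.

    reported: per-hop loss percentages, first hop first.
    The bound for hop i is the minimum loss reported at hop i or later.
    """
    bounds = []
    running = 100.0
    for loss in reversed(reported):
        running = min(running, loss)
        bounds.append(running)
    bounds.reverse()
    return bounds


# Loss columns copied from the two tables above (hops 1 through 7).
working = [0.0, 100.0, 0.0, 30.0, 66.7, 0.0, 0.0]
broken = [0.0, 100.0, 0.0, 20.0, 96.7, 50.0, 50.0]

print(loss_upper_bounds(working))  # [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]
print(loss_upper_bounds(broken))   # [0.0, 0.0, 0.0, 20.0, 50.0, 50.0, 50.0]
```

In the working case every bound is zero, so the 30% and 66.7% at hops
4 and 5 cannot be real path loss.  In the broken case the bound rises
after hop 3 and stays at 50% through the last hop shown.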

Since I'm not a network administrator, I'll ask point blank: why
exactly do your netadmins filter and rate-limit ICMP like this, and
what are you gaining from it?  Most kiddies stick with pure TCP or UDP
these days -- the goal is to saturate the pipe, not cause a literal
service DoS (e.g. crashing Apache).

Additionally, I'll ask another question: exactly what tool are
NOCs (or even network administrators) supposed to use to diagnose
network path problems at layers 3 and 4?

-- 
| Jeremy Chadwick                                 jdc at parodius.com |
| Parodius Networking                        http://www.parodius.com/ |
| UNIX Systems Administrator                   Mountain View, CA, USA |
| Making life hard for others since 1977.               PGP: 4BD6C0CB |