
When the SNMP process receives a poll request, it in turn fires off requests internally to other processes to get the stats being asked for. There is/was (I'm out of touch now) a maximum amount of time SNMP would wait for those other processes to respond. If they didn't respond in time, the SNMP response was sent without those details, or the query that was pending an answer was just dropped and no response was sent. So problem number one was those other processes taking too long to respond.
This is generally true with multiple vendors. The main SNMP process is responsible for receiving and replying to requests, and separate processes handle the actual collection of the data from the elements. If those collector processes wedge or bog down (on their own, or because the element being polled is bogged down, etc.), that timeout bubbles up and you get nothing. It's a pretty standard design to segment things this way.

On Sun, Aug 3, 2025 at 3:12 AM James Bensley via NANOG <nanog@lists.nanog.org> wrote:
On Friday, 1 August 2025 at 15:10, Drew Weaver via NANOG <nanog@lists.nanog.org> wrote:
Hello,
Hi Drew.
I haven't worked with IOS-XR for a few years, but I have had problems with SNMP in the past.
A few years ago I was deploying 9904 chassis with a modest number of services on them (not thousands of services per chassis, but hundreds, so they weren't idle, but certainly not under any mentionable load control-plane-wise).
We noticed that SNMP polling was returning nothing for some of the services, and it ended up being a couple of problems compounding. At that time we had virtually every 9xxx and 99xx chassis variant in the network, and this problem only showed up on these boxes; but they were also the only routers in the network with this exact combination of services on them, so I don't believe it was anything chassis specific. This was on IOS-XR 6.something, for reference.
When the SNMP process receives a poll request, it in turn fires off requests internally to other processes to get the stats being asked for. There is/was (I'm out of touch now) a maximum amount of time SNMP would wait for those other processes to respond. If they didn't respond in time, the SNMP response was sent without those details, or the query that was pending an answer was just dropped and no response was sent. So problem number one was those other processes taking too long to respond.
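To make that concrete, here's a minimal sketch of the dispatcher-plus-collectors pattern (just Python to illustrate the idea, nothing to do with Cisco's actual implementation; the collector names, OID prefixes and timeout are all made up). The SNMP-facing process farms each poll out to per-subsystem collectors and only waits a bounded time for each answer, so a slow collector means an empty varbind or a dropped query:

# Illustrative sketch only: SNMP-facing dispatcher with per-collector timeout.
import time
from concurrent.futures import ThreadPoolExecutor, TimeoutError as FutureTimeout

COLLECTOR_TIMEOUT = 2.0  # made-up IPC deadline in seconds

def interface_collector(oid):
    # Stand-in for a separate stats-collection process; a wedged or busy
    # collector simply takes too long to come back.
    time.sleep(5)
    return 12345

def qos_collector(oid):
    return 678

def handle_snmp_get(requested_oids):
    collectors = {".1.3.6.1.2.1.2": interface_collector,   # fake prefix map
                  ".1.3.6.1.4.1.9": qos_collector}
    results = {}
    with ThreadPoolExecutor() as pool:
        futures = {oid: pool.submit(collectors[prefix], oid)
                   for oid in requested_oids
                   for prefix in collectors if oid.startswith(prefix)}
        for oid, fut in futures.items():
            try:
                results[oid] = fut.result(timeout=COLLECTOR_TIMEOUT)
            except FutureTimeout:
                # Collector missed the deadline: either reply without this
                # varbind or drop the whole query -- the two failure modes
                # described above.
                results[oid] = None
    return results

print(handle_snmp_get([".1.3.6.1.2.1.2.2.1.10.1", ".1.3.6.1.4.1.9.9.1.1"]))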
Problem number two was that those other processes had a bug: after provisioning services, they hadn't picked up on the changes. When the request came from the SNMP process for stats relating to service X, the other processes had no knowledge of X.
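As a toy illustration of that second problem (again just a Python sketch with invented names, not the real IOS-XR code path): think of a collector that snapshots the provisioned-service list once at startup and never refreshes it, so anything provisioned afterwards simply doesn't exist as far as it's concerned:

# Toy illustration of "collector has no knowledge of X".
class StatsCollector:
    def __init__(self, provisioned_services):
        # Snapshot taken when the process starts; the bug is that nothing
        # ever updates it after new services are provisioned.
        self.known_services = set(provisioned_services)

    def get_stats(self, service_id):
        if service_id not in self.known_services:
            return None          # "no knowledge of X" -> empty SNMP answer
        return {"in_octets": 1000, "out_octets": 2000}  # made-up counters

collector = StatsCollector(["svc-1", "svc-2"])   # state at process start
# ... operator provisions svc-3 later; the collector never hears about it ...
print(collector.get_stats("svc-2"))   # answered normally
print(collector.get_stats("svc-3"))   # None: newly provisioned, unknown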
TAC provided us with a short-term workaround, which was to restart some processes after provisioning new services, to ensure those processes were aware of the new services and would respond to the SNMP process with the requested stats. Long term, they created a DDTS and a SMU to fix the inter-process timeout and missing-stats issues.
I don't know exactly what you're polling, and like I said, I'm a bit out of touch here, but I can say that it took quite a lot of digging and working with TAC to bottom out the problem. We could replicate the issue in the lab, which always helps. So if you can replicate the issue in the lab and turn all the debugging settings up to 11, you might be able to find something like we did (TAC sent some debug commands and we could trace the issue in the lab; IPC debugging is hard on these boxes!). Even if TAC are trying to fob you off by saying "oh yeah, this is dropped by LPTS as expected", get them to prove it to you: replicate the issue in the lab and gather the debug info which shows how/where the request is being dropped. If they can't find the drop in LPTS, then LPTS isn't the problem and you need to look elsewhere, like IPC/EOBC.
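If it helps, one low-tech way to catch the symptom in the lab is to hammer the same varbind in a loop and log every timeout or missing-instance answer, then line the timestamps up against the debugs on the box. A rough sketch, assuming net-snmp's snmpget is installed and SNMPv2c is enabled on the test router (hostname, community and OID below are placeholders):

# Rough lab poller: log timeouts and missing instances for one varbind.
import subprocess
import time

HOST = "lab-asr9904"              # placeholder
COMMUNITY = "public"              # placeholder
OID = "IF-MIB::ifHCInOctets.1"    # placeholder varbind

def poll_once():
    proc = subprocess.run(
        ["snmpget", "-v2c", "-c", COMMUNITY, "-t", "2", "-r", "0", HOST, OID],
        capture_output=True, text=True)
    out = proc.stdout + proc.stderr
    if proc.returncode != 0 or "Timeout" in out:
        return "TIMEOUT / NO RESPONSE"
    if "No Such" in out:          # noSuchInstance / noSuchObject in the reply
        return "MISSING INSTANCE"
    return "OK: " + proc.stdout.strip()

while True:                        # Ctrl-C to stop
    print(time.strftime("%H:%M:%S"), poll_once())
    time.sleep(10)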
Cheers, James.