
By design, the LPTS default values are set to be on the "slow but safe" side. As I've already mentioned, picking default values is incredibly hard for stuff like this because you've got a dramatic range of system sizes, shapes, use cases, blah blah. The general consensus is we'd rather force people to open up policers explicitly than have them be too open by default. Feel free to dismiss me as a crybaby apologist, but that's how we got here. Said another way: feel free to say that we made terrible choices and you hate our defaults. But you can't justly accuse us of not thinking about it and/or just making shit up.

Another challenge here (yeah, I know... there goes LJ apologizing again...) is that a "500 pps" policer is not actually 500 packets per second. It's a token bucket meter where the actual parameters are the token fill rate and a burst size. Choosing THESE values is yet another messy problem... if we assumed that the bucket gets refilled once per second, what you'd end up with is a meter that allows 500 packets through as fast as they can be dequeued, but then lets nothing else through for the rest of that one-second window. Then we add 500 "tokens" at T = 1 sec, and lather, rinse, repeat. In the real world every hardware-based meter is slightly different as far as what burst sizes are available, how fast the token interval fills, and more. But if we circle back to this particular case, we might well not have a 500 pps policer; we might instead have a policer that is "50 packets every tenth of a second" or "5 packets every hundredth of a second".

This is where you have to know something about the other side (here, the SNMP client)... does it send a burst of packets all at once? How large is that burst? Does that burst overrun the policer on the router? If it times out and retries, does that make it worse or better?

Your complaint about the thing silently accepting a value that can't be supported in the hardware is 100% valid. We should not let you say "police this to a rate of 4 billion" and reply with "OK, no problem" when in reality we're not doing that. Please ask your TAC engineer to file a bug for this... we might or might not ever get around to fixing it, but it at least needs to be documented somewhere. (I would do it myself, but they deemed me too dangerous to allow continued access to the DDTS database many years ago...)

As to why it takes so much longer to do the same thing on a non-management interface, I'm truly curious about this one. 5 seconds is a bonkers amount of time on a system like this... my best guesses at this point are things like:

- because the rate limiters are different for mgmt vs non-mgmt, somehow we're getting a partial completion each "cycle" and we've got tons of retries in there.

Drew, did you ever get the output of something like "debug snmp packet" or whatever it was the TAC guys asked for? I'd be specifically interested in comparing those traces for the two {mgmt, non-mgmt} cases... once the SNMP process generates its replies, the data plane on the way OUT is pretty much non-blocking, so I'd want to see if somehow we're pacing the arrival of the requests into the SNMP process, and/or if it thinks it's generating the responses in the same amount of time....

--lj
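To make the token-bucket arithmetic above concrete, here is a rough, runnable Python sketch. Every number in it is invented for illustration (the refill intervals, the 200-packet client burst, the drop-rather-than-queue behaviour); it is a toy model of the idea lj describes, not Cisco's actual LPTS implementation.

    # Toy model: three token-bucket meters that all advertise "500 pps" but
    # refill on different intervals, faced with an SNMP poller that fires a
    # burst of 200 requests at once. All figures are hypothetical.

    def burst_pass(burst_size, tokens_available):
        """Packets from an instantaneous burst that clear the policer.
        Excess packets are dropped, not queued, so the client only finds
        out via timeouts and retries."""
        return min(burst_size, tokens_available)

    METERS = [
        ("500 tokens every 1 s",   500),
        ("50 tokens every 100 ms",  50),
        ("5 tokens every 10 ms",     5),
    ]

    BURST = 200  # hypothetical SNMP client burst size

    for label, tokens in METERS:
        passed = burst_pass(BURST, tokens)
        print(f"{label}: {passed}/{BURST} pass, {BURST - passed} dropped "
              f"and left to the client's retry timer")

Same nominal rate in every case, but the coarse refill forgives the burst while the fine one drops most of it and hands the problem to the client's retry timer, which is exactly why the client's burst size and timeout behaviour matter.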
-----Original Message-----
From: Nick Hilliard via NANOG <nanog@lists.nanog.org>
Sent: Friday, August 8, 2025 11:39 AM
To: North American Network Operators Group <nanog@lists.nanog.org>
Cc: Nick Hilliard <nick@foobar.org>
Subject: Re: Cisco ASR9902 SNMP polling ... is interesting

Drew Weaver via NANOG wrote on 08/08/2025 14:31:
It couldn't be bothered to simply set it to 50000 if you set it to the configured maximum of 4294967295.

It couldn't be bothered to simply say: "Hey, we know the max for this platform is 50000 so we set it to 50000, but you probably shouldn't be using 50000 for this value anyway."

It could be bothered to do absolutely nothing and silently reject the command, which made me laugh for about 5 minutes this morning.
Some years ago I was fighting with a low-level pps rate limiter for a telemetry service on a long-obsolete platform. The default limit caused packets to be dropped, and we finally settled on an updated figure based on the usual compromise of performance vs consequence. But: if we increased the limiter above what we had measured to be reasonable, this fairly quickly caused a performance cliff which affected other services, e.g. snmp / lacp timeouts, etc, so production impact. Although this was in the days of in-house NOS schedulers, I'd be fairly cautious in this area, particularly on RTOS platforms like XR.

If Cisco have implemented a pps limiter of 50k/s, that's a lot of snmp pps. Is this a realistic amount of requests to be properly serviced per second? SNMP packet encapsulation / general handling is one thing, but stats collection / intermediation can be more heavyweight. Bear in mind that the failure modes in this sort of situation are often non-linear.

For sure it's a bit annoying that they don't warn that this is the maximum (possibly a platform / LC limit? i.e. possible that this is not a generic limit across all SPs on all types of unit), but at least the box won't fall over in production just because someone tweaked a parameter beyond what the hardware was likely capable of handling.

Nick

_______________________________________________
NANOG mailing list
https://lists.nanog.org/archives/list/nanog@lists.nanog.org/message/WUYR7KRWQCA5EA2IF6RVNE4BKUUD5TZL/
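For a sense of scale on Nick's "is 50k pps realistic?" question, here is a quick back-of-envelope sketch in Python. None of these figures come from the thread; they are made up and should be swapped for your own poller's numbers.

    # Rough, invented-numbers estimate of an SNMP polling workload in pps,
    # to put a (possible) 50k pps LPTS ceiling in context.

    interfaces       = 1_000   # hypothetical interface count on the box
    counters_per_ifc = 40      # hypothetical OIDs polled per interface
    poll_interval_s  = 30      # hypothetical polling cycle
    varbinds_per_pkt = 10      # GETBULK packing; 1 for plain GETNEXT

    packets_per_cycle = interfaces * counters_per_ifc / varbinds_per_pkt
    avg_pps = packets_per_cycle / poll_interval_s
    print(f"~{packets_per_cycle:.0f} request packets per cycle, "
          f"~{avg_pps:.0f} pps averaged over the cycle")

    # The averaged rate is tiny next to 50k pps, but most pollers front-load
    # each cycle into a burst, so the instantaneous rate hitting the policer
    # (and the work the agent must do to answer) is what actually matters.

Even a fairly aggressive polling setup averages a small fraction of 50k pps, so the ceiling mostly matters for front-loaded bursts; and, as Nick points out, whether the agent can actually collect stats and build responses at anything like that rate is a separate question from whether LPTS will admit the packets.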