On Fri, 29 Mar 2024 at 20:10, Steven Bakker <steven.bakker@ams-ix.net> wrote:
> To top it off, both the sFlow and IPFIX specs are sufficiently vague about the meaning of the "frame size", so vendors can implement whatever they want (include/exclude padding, include/exclude FCS). This implies that you shouldn't trust these fields.

I share this concern, but in my experience the market simply does not care at all what the data means. People happily graph L3 rate from Junos and L2 rate from other boxes, using them interchangeably, and use them to determine whether or not there is congestion. In reality, what you want is L1 speed, so you can actually see if the interface is full or not. Luckily we are starting to see more and more devices also support peak-buffer-util over the previous N seconds, which is far more useful for congestion monitoring. Unfortunately it is not in IF-MIB, so most will never collect it. Note that it is possible to get most Juniper gear to report L2 rate like IF-MIB specifies, but it's a non-standard configuration option and therefore very rarely used.

I also wholeheartedly agree on inline templates being near peak insanity: huge complexity for an upside that is completely beyond my understanding. If I decide to collect a new metric, then punching in the metric number+name somewhere is the least of my worries. The idea that costs are lowered by having machines dynamically determine what is being collected and monitored is just bizarre. Most of the cost of starting to collect a new metric is figuring out how it is actionable, what needs to happen to the metric to trigger a given action, and how exactly we are extracting value from this action. NetFlow v9/v10 definitely should have done out-of-band templates and left it to the operator to communicate to the collector what it is seeing.

Even exceedingly trivial things in v9/v10 entities can be broken for years and years before anyone notices. For example, the original sampling entities are deprecated and have been replaced with new entities which communicate 'every N packets, sample C packets'. This is very good, because it allows you to do stateless sampling while still filling the export packet to MTU size or larger, keeping the export PPS rate the same before/after axing the cache.
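
To make the 'every N packets, sample C packets' arithmetic concrete, here is a rough sketch. It is my reading of the PSAMP definitions, not any collector's actual code. It assumes the exporter sends samplingPacketInterval (packets sampled in a row, the C above) and samplingPacketSpace (packets skipped before the next run), so that N = interval + space. The function names are made up for illustration, and the "naive" variant is only one hypothetical way to misread the fields.

```python
from fractions import Fraction


def sampling_fraction(packet_interval: int, packet_space: int) -> Fraction:
    """Fraction of packets sampled: C out of every N, where N = C + space.

    Assumption: packet_interval is the number of consecutively sampled
    packets and packet_space is the number of unsampled packets that
    follow before the next run starts.
    """
    if packet_interval <= 0 or packet_space < 0:
        raise ValueError("interval must be > 0 and space must be >= 0")
    return Fraction(packet_interval, packet_interval + packet_space)


def scale_to_estimate(sampled: int, packet_interval: int, packet_space: int) -> int:
    """Scale a sampled packet (or byte) count up to an estimate of the total."""
    return round(sampled / sampling_fraction(packet_interval, packet_space))


def naive_estimate(sampled: int, packet_space: int) -> int:
    """Hypothetical misreading: treat the space alone as a '1 in N' rate."""
    return sampled * packet_space


# Sample 10 packets in a row, then skip 9990: an effective 1:1000 rate.
correct = scale_to_estimate(5, 10, 9990)  # 5000
wrong = naive_estimate(5, 9990)           # 49950, roughly 10x too high
```

With C = 1 the two readings are nearly identical, which is one plausible reason a mistake like this can sit unnoticed for a long time; it only becomes obvious once C > 1.
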
However, by the time I was looking into this, only pmacct correctly understood how to use these entities; nfcapd and Arbor either didn't understand them or understood them incorrectly (both were fixed in a timely manner by responsible maintainers, thank you).

--
  ++ytti