On Wed, Aug 15, 2018 at 08:49:12AM -0500, Colton Conor wrote:
We are looking for a new network monitoring system. Since there are so many operators on this list, I would like to know which NMS do you use and why? Is there one that you really like, and others that you hate?
For free (open source) options, LibreNMS and NetXMS come highly recommended by many wireless ISPs on low budgets. However, I am not sure what commercial options are available, nor their price points.
For monitoring network device/interface data plane reachability with ping, we are still using an ancient piece of open source software called Autostatus. I find it invaluable for notifying us about reachability issues, thanks to its simple-to-understand parent/child relationships and graph-based fping methodology. It isn't perfect: it doesn't scale very well, it doesn't have HA/clustering, and it has no fancy dependencies (just basic parent-child), no event correlation, no contact scheduling, no API, etc. But it is very easy to understand why you are or are not getting an alert, and to boil that down to a single point of failure. As such, it provides reliable, trustworthy information about data plane reachability from one vantage point on the network.

For monitoring server & network service availability, device/environmental health, etc., we are currently using Nagios. My problems with it are that it has complex rules for how and when to perform a specific health check and send or suppress a notification (and perhaps bugs in our old version, which never ever seems to send any Host notifications, except when it does). Also, the whole idea of "suppress the Host check unless all Service checks for all services on the host are down" doesn't really fit with monitoring device/interface reachability on routers & switches that make up a complex graph of dependencies.

Trying to shoehorn Nagios into alerting on just the one IP address/device/interface that is causing all the others behind it to be unreachable doesn't work very well. You can't use Host Dependencies because Host checks are suppressed by default, and Host Dependencies don't affect Service checks/notifications. Forcing Host checks to always run causes performance problems. Creating a "Ping" service for every host requires manually creating Service Dependencies between all the "Ping" services on every Host. Then you end up with a complex configuration that is very hard to understand. But for things like telling you when a power supply or fan has died, or when a web service has crashed, it works well.

We did a survey of a bunch of open source tools to replace Nagios and have settled on Icinga for its APIs, dynamic rules with pattern matching and boolean logic, and compatibility with Nagios plugins. But it still doesn't change the basic architectural choices of the Nagios core engine, and hence isn't a good fit for network device/interface reachability monitoring IMO.
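
To illustrate the basic parent/child reachability idea mentioned above, here is a minimal Python sketch. It is not how Autostatus is actually implemented; the device names, the `parents` map, and the `root_causes` function are all made up for the example. It assumes each device has at most one parent and that you already have the set of devices that failed a ping. A down device is reported only if its parent is still up (or it has no parent), so everything behind a failed device is suppressed.

```python
from typing import Dict, List, Optional, Set


def root_causes(parents: Dict[str, Optional[str]], down: Set[str]) -> List[str]:
    """Return the down devices whose parent is up, or that have no parent.

    Devices behind another down device are suppressed, since they are
    presumably unreachable only because of the failure in front of them.
    """
    causes = []
    for node in sorted(down):
        parent = parents.get(node)
        if parent is None or parent not in down:
            causes.append(node)
    return causes


# Hypothetical topology: child -> parent, as seen from one vantage point.
parents = {
    "core-rtr": None,
    "dist-sw1": "core-rtr",
    "dist-sw2": "core-rtr",
    "access-sw1": "dist-sw1",
    "access-sw2": "dist-sw1",
}

# Hypothetical ping results: these devices did not answer.
down = {"dist-sw1", "access-sw1", "access-sw2"}

print(root_causes(parents, down))  # ['dist-sw1']
```

Only dist-sw1 is reported here, because both access switches sit behind it. A real graph with redundant paths would need more than a single-parent map, which is exactly where simple parent/child runs out.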