On Monday, June 6, 2016, Manuel Marín <mmg@transtelco.net> wrote:
Dear Nanog community
We are currently planning to upgrade our monitoring system (Opsview) due to scalability issues and I was wondering what do you recommend for monitoring 5000 hosts and 35000 services. We would like to use a monitoring system that is compatible with the nagios plugin format, however we are not sure if systems like Icinga/Shinken/Op5 are the way to go.
Is someone using systems like Op5 or Icinga2 for monitoring > 5000 hosts? Would you recommend commercial systems like Sevone, Zabbix, etc instead of open source ones?

We (op5) have customers running > 50,000 hosts and > 300,000 services. So 5,000 hosts is generally not a problem.

As mentioned by Jeff, the forking model *can* become a problem. Small binaries that don't load a lot of libraries fork pretty fast. A test we made some time ago showed a 15 minute load peak at 3.89 (on 24 cores/hyperthreads) when checking 100,000 services every 5 minutes. Check latencies were 0.8 seconds max and 0.002 seconds avg. Average cpu load was 15%.

Specs for the machine used:
  Dell PowerEdge R620
  2x Intel Xeon E5-2620
  24 GB ram
  Dell PERC H710 hardware RAID card
  RAID10 on 4x300GB 15kRPM SAS drives

So a single (now almost vintage) server can handle 300 plugin executions per second without breaking a sweat. Scaling up is definitely a possibility, but scaling out (using mod gearman, mk or merlin, all open source) is available as well.

Complex plugins, for example check_vmware_api which loads the large VMware perl SDK, can get you in trouble though. I suggest you run a test with the plugin mix you are planning to use.

If scaling out is not an option, and you want to stay in the nagios/naemon world, a custom worker can be developed to get rid of the loading overhead. Documentation is available at http://www.naemon.org/documentation/developer/workers.html

Full disclosure: I work as development team lead at op5

best regards
Mikael Falkvidd
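
The "300 plugin executions per second" figure follows from the benchmark numbers: 100,000 checks spread over a 5 minute interval. A minimal sketch of that arithmetic, which also estimates the rate for the 35,000 services in the original question. The 5 minute interval for those services is an assumption (the question does not state one), and the function name is only illustrative.

```python
def executions_per_second(services: int, interval_seconds: int) -> float:
    """Average plugin executions per second if every service is
    checked once per interval (ignores retries and uneven scheduling)."""
    return services / interval_seconds


# Benchmark from the message: 100,000 services every 5 minutes.
benchmark_rate = executions_per_second(100_000, 5 * 60)

# Planned load from the question: 35,000 services.
# ASSUMPTION: same 5 minute check interval as the benchmark.
planned_rate = executions_per_second(35_000, 5 * 60)

# How many times over the benchmarked rate covers the planned rate.
headroom = benchmark_rate / planned_rate

print(f"benchmark: {benchmark_rate:.0f} checks/s")
print(f"planned:   {planned_rate:.0f} checks/s")
print(f"headroom:  {headroom:.1f}x")
```

This only compares average rates. It says nothing about plugin weight, so a test with the real plugin mix is still needed for heavy plugins such as check_vmware_api.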