Monitoring system recommendation
Dear NANOG community,
We are currently planning to upgrade our monitoring system (Opsview) due to scalability issues, and I was wondering what you would recommend for monitoring 5000 hosts and 35000 services. We would like to use a monitoring system that is compatible with the Nagios plugin format; however, we are not sure if systems like Icinga/Shinken/Op5 are the way to go.
Is anyone using systems like Op5 or Icinga2 for monitoring > 5000 hosts? Would you recommend commercial systems like SevOne, Zabbix, etc. instead of open source ones?
Your input is really appreciated.
Thank you and have a great day.
Regards
While not completely drop-in compatible with Nagios plugins, Xymon (a Big Brother clone) is up to the task of monitoring this many hosts/services. Here's a page with a list of businesses who are publicly reporting their use of Xymon and the number of hosts/services they're monitoring:
https://en.wikibooks.org/wiki/System_Monitoring_with_Xymon/User_Guide/The_Xy...
ServiceNow is the biggest I've seen, with 569,869 hosts and 740,185 status messages (the different service checks being reported back in). It's really hard to find tools that can scale that large, but with the load distributed across a few Xymon proxies reporting to your centralized instance, it will scale as large as you want.
I've used it for years and greatly prefer it to everything else due to its simplicity and config format. I find Nagios's config format extremely tedious.
As for Nagios plugins: Nagios derives the result of a plugin from its exit code: 0 = green, 1 = yellow, 2 = red, if I recall correctly. If you just modify the plugin to execute a Xymon command as the last step and report the color instead of the exit code, it should work fine. There was a tool called "xynagios" that automatically made Nagios plugins work without modification, but I haven't tried it and don't know if it's still out there.
There are two things you might want to be aware of with Xymon. The monitoring data is not encrypted on the wire; it's up to you to handle that at the moment if you feel it is necessary. It also does not support IPv6. A huge rewrite to handle both of these was in progress for years, but it stalled out. Recently development has picked up a lot of steam, and they're scrapping the major rewrite and backporting the important things. I believe Xymon 4.4 will at least have the encrypted transport.
--
Mark Felder
feld@feld.me
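The wrapper idea described above (run the plugin unmodified, then report a color instead of an exit code) could be sketched roughly as follows. This is only an illustration under stated assumptions: the function names are hypothetical, the mapping of exit code 3 (UNKNOWN) to "clear" is a guess rather than something stated in the thread, and the status line layout (hostname with dots replaced by commas, then test name, color, and timestamp) reflects my understanding of the Xymon status message format and should be checked against the Xymon documentation before use.

```python
import subprocess
from datetime import datetime

# Standard Nagios plugin exit codes mapped to Xymon colors.
# ASSUMPTION: mapping 3 (UNKNOWN) to "clear" is a choice made for this sketch.
EXIT_TO_COLOR = {0: "green", 1: "yellow", 2: "red", 3: "clear"}


def exit_code_to_color(code):
    """Translate a plugin exit code into a color; unexpected codes become red."""
    return EXIT_TO_COLOR.get(code, "red")


def build_status_message(host, test, code, output):
    """Build the text of a status message from a plugin result.

    ASSUMPTION: the layout "status host.test color timestamp" with dots in
    the hostname replaced by commas is how Xymon expects status messages.
    """
    color = exit_code_to_color(code)
    xymon_host = host.replace(".", ",")
    stamp = datetime.now().strftime("%a %b %d %H:%M:%S %Y")
    return f"status {xymon_host}.{test} {color} {stamp}\n{output.strip()}"


def run_plugin(argv):
    """Run an unmodified Nagios plugin and return (exit code, stdout)."""
    result = subprocess.run(argv, capture_output=True, text=True)
    return result.returncode, result.stdout
```

The resulting string would then be handed to whatever Xymon client command is installed on the host; that final step is left out here because it depends on the local installation.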
On Mon, Jun 6, 2016, at 09:18, Manuel Marín wrote:
5000 hosts and 35000 services. We would like to use a monitoring system that is compatible with the nagios plugin format, however we are not sure if systems like Icinga/Shinken/Op5 are the way to go.
At that kind of scale, you need to take a serious look at moving away from the Nagios plugin model. Any model based primarily on forking external processes is going to hold you back.
Is someone using systems like Op5 or Icinga2 for monitoring > 5000 hosts?
This kind of scale is easily achieved with OpenNMS with appropriate hardware and planning, largely because it does everything in-process. We do offer limited support for NRPE as a transitional mechanism. Disclosure: I get paid to work with OpenNMS in a consulting capacity.
Would you recommend commercial systems like Sevone, Zabbix, etc instead of open source ones?
Zabbix is open source. I know some of their team and would recommend putting them on your list. I also know a number of brilliant people who work for SevOne, but I don't know much about their product.
-jeff
We are currently planning to upgrade our monitoring system (Opsview) due to scalability issues and I was wondering what do you recommend for monitoring 5000 hosts and 35000 services. We would like to use a monitoring system that
Another consideration is check_mk. We use it in our shop. The check_mk people wrapped a bunch of Python around the Nagios notification engine. You no longer need to worry about the tedium of Nagios config files; those are all built automatically from commands in a GUI or from a single configuration file.
Check_mk has a benchmarking page showing it scaling to more hosts than you specified: https://mathias-kettner.de/checkmk_checkmk_benchmarks.html
For an architecture diagram of how they use Nagios for alerting and Python for scanning: http://mathias-kettner.com/check_mk.html
If an included agent isn't available, new ones can be written.
We are quite happy with the solution. We've replaced Cricket, Cacti, Nagios, Observium, and a little bit of SmokePing with this almost all-in-one tool.
Although I haven't ever scaled it that high, I've had a lot of luck using Gearman (mod_gearman) to make Nagios horizontally scalable. It allows you to use Nagios itself only as a scheduler and reporting UI, and offload all of the actual probing to other servers. There'll be a theoretical limit to the amount of scale you can get out of that, because it relies on a single Nagios instance to schedule checks and receive results, but I imagine it's much higher than your current requirements.
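The scheduler/worker split described above can be illustrated with a toy sketch. It is not mod_gearman's API: in-process threads and Python queues stand in for the remote Gearman workers and job server, and all names (`run_checks`, `worker`, `run_plugin`) are hypothetical. The point is only the shape of the design: one scheduler enqueues checks and collects results, while the probing itself happens elsewhere.

```python
import queue
import subprocess
import threading


def run_plugin(argv):
    """Default runner: execute a plugin and return (exit code, stdout)."""
    proc = subprocess.run(argv, capture_output=True, text=True)
    return proc.returncode, proc.stdout.strip()


def worker(jobs, results, runner):
    """Pull checks off the job queue until a None sentinel arrives."""
    while True:
        job = jobs.get()
        if job is None:
            break
        name, argv = job
        results.put((name, runner(argv)))


def run_checks(checks, worker_count=4, runner=run_plugin):
    """Scheduler side: enqueue every check, then collect the results."""
    jobs, results = queue.Queue(), queue.Queue()
    threads = [
        threading.Thread(target=worker, args=(jobs, results, runner))
        for _ in range(worker_count)
    ]
    for thread in threads:
        thread.start()
    for name, argv in checks.items():
        jobs.put((name, argv))
    for _ in threads:
        jobs.put(None)  # one sentinel per worker so each one exits
    for thread in threads:
        thread.join()
    collected = {}
    while not results.empty():
        name, outcome = results.get()
        collected[name] = outcome
    return collected
```

The single scheduler in this sketch is also where the theoretical limit mentioned above would show up: adding workers raises probing capacity, but every job and every result still passes through one place.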
I once worked for Zenoss and still suggest them. Zenoss supports Nagios plugins, and my $DAYJOB is at a Zenoss partner who can help you achieve your goals. If you need some help with Zenoss, feel free to contact me off list.
Andrew
Things to note, as I prefer Zabbix over Nagios (backed by a real database, more features):
- Zabbix actually is open source. You can buy support from them or from partners if you want.
- Zabbix can be distributed through a central server/proxies architecture to scale.
- Nagios plugins can be adapted for Zabbix, as the latter only needs a numerical value (no status or text).
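One way the last point might look in practice is to pull the numeric values out of a plugin's performance data, which by the Nagios plugin convention follows a `|` in the output as `label=value[UOM];warn;crit;min;max` items. The sketch below is a minimal parser under that assumption; the function name is hypothetical, it ignores units and thresholds, and it only handles single-line output.

```python
import re

# One perfdata item: a quoted or bare label, "=", then the numeric value.
# Units and the warn/crit/min/max fields after the value are ignored.
_PERF_ITEM = re.compile(r"(?:'([^']+)'|([^\s=']+))=(-?\d+(?:\.\d+)?)")


def perfdata_values(plugin_output):
    """Return {label: numeric value} from the perfdata part of plugin output."""
    # Perfdata follows the first "|"; no pipe means the plugin reported none.
    _, sep, perfdata = plugin_output.partition("|")
    if not sep:
        return {}
    values = {}
    for quoted, bare, number in _PERF_ITEM.findall(perfdata):
        values[quoted or bare] = float(number)
    return values
```

A small wrapper script could then print a single value, for example `perfdata_values(output)["rta"]`, for Zabbix to collect as a numeric item; how that script is wired into Zabbix is left to the reader's setup.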
On 7 June 2016 at 07:11, Andrew Kirch <trelane@trelane.net> wrote:
I once worked for Zenoss and still suggest them. Zenoss supports NAGIOS plugins, and my $DAYJOB is at a Zenoss Partner who can help you achieve your goals. If you need some help with Zenoss feel free to contact me off list.
Andrew
We (op5) have customers running > 50,000 hosts and > 300,000 services. So 5,000 hosts is generally not a problem.
As mentioned by Jeff, the forking model *can* become a problem. Small binaries that don't load a lot of libraries fork pretty fast. A test we made some time ago showed the 15-minute load average peaking at 3.89 (on 24 cores/hyperthreads) when checking 100,000 services every 5 minutes. Check latencies were 0.8 seconds max and 0.002 seconds avg. Average CPU load was 15%.
Specs for the machine used:
- Dell PowerEdge R620
- 2x Intel Xeon E5-2620
- 24 GB RAM
- Dell PERC H710 hardware RAID card
- RAID10 on 4x300GB 15kRPM SAS drives
So a single (now almost vintage) server can handle 300 plugin executions per second without breaking a sweat. Scaling up is definitely a possibility, but scaling out (using mod_gearman, mk or merlin, all open source) is available as well.
Complex plugins, for example check_vmware_api, which loads the large VMware Perl SDK, can get you in trouble though. I suggest you run a test with the plugin mix you are planning to use.
If scaling out is not an option, and you want to stay in the Nagios/Naemon world, a custom worker can be developed to get rid of the loading overhead. Documentation is available at http://www.naemon.org/documentation/developer/workers.html
Full disclosure: I work as development team lead at op5.
Best regards,
Mikael Falkvidd
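The throughput figure above follows from simple arithmetic, sketched here for comparison with the original poster's numbers. The 5-minute interval for the 35,000 services is an assumption made for this illustration; the thread does not state the intended check interval.

```python
def checks_per_second(services, interval_seconds):
    """Average plugin executions per second for evenly spread checks."""
    return services / interval_seconds


# Benchmark from the message above: 100,000 services every 5 minutes.
benchmark_rate = checks_per_second(100_000, 5 * 60)

# Original poster's load, ASSUMING the same 5-minute interval.
planned_rate = checks_per_second(35_000, 5 * 60)

print(f"benchmark: {benchmark_rate:.0f} checks/s")
print(f"planned:   {planned_rate:.0f} checks/s")
```

Under that assumption the planned load works out to roughly a third of the benchmarked rate, although the plugin mix matters more than the raw count, as the message notes.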
We use Zabbix here pretty heavily, monitoring roughly 10,000 hosts, 13,000 interfaces, and a myriad of services.
-Brent
On Jun 7, 2016, at 2:42 AM, Mikael Falkvidd <mikael.falkvidd@op5.com> wrote:
We (op5) have customers running > 50,000 hosts and > 300,000 services. So 5,000 hosts is generally not a problem.
I'm not at that scale, but I've seen some fairly impressive performance searching through a friend's NetXMS system with a couple of years of verbose syslog and monitoring data to go through.
-----
Mike Hammett
Intelligent Computing Solutions
http://www.ics-il.com
Midwest Internet Exchange
http://www.midwest-ix.com
----- Original Message -----
From: "Manuel Marín" <mmg@transtelco.net>
To: "NANOG" <nanog@nanog.org>
Sent: Monday, June 6, 2016 9:18:07 AM
Subject: Monitoring system recommendation
Dear Nanog community [...snipped...]
Yes, but it depends on the hardware. They support some pretty huge environments. You have to have "enough" IOPS to keep up with the polling, DB, and RRD data; then there will never be a "heavy" load.
I would contact them and, based on your needs, ask what hardware you will need for your implementation.
You can get real-world info from the mailing lists: https://sourceforge.net/p/opennms/mailman/
I would suggest the opennms-discuss list.
On 6/8/16 4:39 PM, Jeff wrote:
I have not used OpenNMS in production. Does it work well under heavy load?
regards, J
participants (12)
- Andrew Kirch
- Crier, Brent
- Dan Lacey
- Guillaume Tournat
- Jeff
- Jeff Gehlbach
- Manuel Marín
- Mark Felder
- Matthew Pounsett
- Mikael Falkvidd
- Mike Hammett
- Raymond Burkholder