OS, Hardware, Network - Logging, Monitoring, and Alerting
Hi. I've a (theoretically) simple problem and I'm wondering how others solve it.
I've recently deployed ~40 Linux instances on ~20 different Dell blades and PowerEdges (we're big on virtualization), a few 7204s and 3560s, and assorted switchable PDUs and whatnot.
We need to monitor standard things like cpu, memory, disk usage on all OSes. This is straightforward with net-snmp. It would also be cool if I could monitor more esoteric things, like ntp synchronization status, i/o statistics, etc.
Other stuff we really need to keep an eye on is hardware - redundant PSU status in our 7204s and Dells, temperatures and voltages (one of our colos in New York peaked at over 40C a few weeks ago, for instance), and disk array status (I'd like to know of a failed disk in a hardware RAID5 before I get calls about performance issues). Our blade chassis have DRACs in them and I think they export this data via SNMP (I'm trying to avoid the use of SNMP traps), but not all of our other PowerEdges have the DRACs in them so some of this information may need to be pulled via IPMI from within the host OS. Presumably the Cisco gear makes the temperature available via SNMP.
Finally, service checks - standard stuff (dns, http, https, ssh, smtp).
Now, to the questions.
1) Is SNMP the best way to do this? Obviously some of the data (service checks) will need to be collected other ways.
2) Is there any good solution that does both logging/trending of this data and also notification/monitoring/alerting? I've used both Nagios and Cacti in the past, and, due to the number of individual things being monitored (3-5 items per OS instance, 5-10 items per physical server, 10-50 things per network device), setting them both up independently seems like a huge pain. Also, I've never really liked Nagios that much.
I recently entertained the idea of writing a CGI that output all of this information in a standard format (csv?), distributing and installing it, then collecting it periodically at a central location and doing all the rrd/notification myself, but then realized that this problem must've been solved a million times already.
There's got to be a better way. What do you guys use?
(I'm not opposed to non-free solutions, provided they work better.)
Cheers, -jp
--
Rev. Jeffrey Paul -datavibe- sneak@datavibe.net aim:x736e65616b pgp:0xD9B3C17D phone:1-800-403-1126 9440 0C7F C598 01CA 2F17 D098 0A3A 4B8F D9B3 C17D "Virtue is its own punishment."
You may want to have a look at Zenoss, http://www.zenoss.com/ Cheers, Andrew
I have to second the Zenoss recommendation. Setup is fairly automatic for most things, the categorization is great, and it will incorporate Nagios plugins or any script that outputs in that format. It's free, but you can also buy support or install service from them.
-- Alex Thurlow Technical Director Blastro Networks
Rev. Jeffrey Paul (sneak) writes:
1) Is SNMP the best way to do this? Obviously some of the data (service checks) will need to be collected other ways.
SNMP, the vendor MIBs + SNMP extensions for monitoring hardware specifics (PSU, etc...), and something like Nagios to do the TCP/network checks.
2) Is there any good solution that does both logging/trending of this data and also notification/monitoring/alerting? I've used both Nagios and Cacti in the past, and, due to the number of individual things being monitored (3-5 items per OS instance, 5-10 items per physical server, 10-50 things per network device), setting them both up independently seems like a huge pain. Also, I've never really liked Nagios that much.
Well, you could look at Zabbix, Hyperic, Zenoss, OpenNMS and see if they cut it better for you, but the trick with Nagios is to use a DB and generate the include files automatically, then have some other more user-friendly tools to populate the DB. Or use templates extensively. Then make sure your plugins output performance data for perf.data monitoring, and use something like NagiosGraph or PNP4Nagios:
NagiosGraph: http://nagiosgraph.wiki.sourceforge.net/
PNP4Nagios: http://www.pnp4nagios.org/pnp/about#system_requirements
PNP4Nagios screenshots: http://www.pnp4nagios.org/pnp/screenshots
Nagios plugin developer guidelines: http://nagiosplug.sourceforge.net/developer-guidelines.html#AEN203
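As a rough illustration of the "generate the include files automatically" approach, here is a minimal Python sketch that renders Nagios object definitions from an inventory. The inventory contents, the `generic-host` and `generic-service` template names, and the check command names are assumptions made for the example; a real setup would read the inventory from whatever database is in use and would need those templates and commands defined elsewhere in the Nagios configuration.

```python
"""Sketch: render Nagios host/service definitions from an inventory."""

# Hypothetical inventory; in practice this would come from a database query.
INVENTORY = [
    {"host_name": "web01", "address": "192.0.2.10", "services": ["HTTP", "SSH"]},
    {"host_name": "dns01", "address": "192.0.2.53", "services": ["DNS"]},
]

# Assumed mapping from a service label to a check command defined elsewhere.
CHECK_COMMANDS = {"HTTP": "check_http", "SSH": "check_ssh", "DNS": "check_dns"}


def render_object(kind, directives):
    """Render one 'define <kind> { ... }' block from (name, value) pairs."""
    lines = ["define %s {" % kind]
    for key, value in directives:
        lines.append("    %-22s %s" % (key, value))
    lines.append("}")
    return "\n".join(lines)


def render_inventory(inventory):
    """Render a host block plus one service block per listed service."""
    blocks = []
    for host in inventory:
        blocks.append(render_object("host", [
            ("use", "generic-host"),        # assumed host template name
            ("host_name", host["host_name"]),
            ("address", host["address"]),
        ]))
        for label in host["services"]:
            blocks.append(render_object("service", [
                ("use", "generic-service"),  # assumed service template name
                ("host_name", host["host_name"]),
                ("service_description", label),
                ("check_command", CHECK_COMMANDS[label]),
            ]))
    return "\n\n".join(blocks) + "\n"


if __name__ == "__main__":
    # Redirect this output into a file that the main Nagios config includes.
    print(render_inventory(INVENTORY), end="")
```

Regenerating the whole file on every inventory change keeps the database as the single source of truth, at the cost of needing a config check and reload after each run.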
I recently entertained the idea of writing a CGI that output all of this information in a standard format (csv?), distributing and installing it, then collecting it periodically at a central location and doing all the rrd/notification myself, but then realized that this problem must've been solved a million times already.
Yes :) But check out the above links, and with a bit of planning and a small amount of coding/adapting existing components, it will work out.
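For the performance-data point, a plugin only has to print one status line, optionally followed by `|` and `label=value[UOM];warn;crit;min;max`, and exit with 0, 1 or 2 for OK, WARNING or CRITICAL. The sketch below is a hypothetical disk-usage check written that way; the thresholds, the `used_pct` label and the default path are illustrative choices, not anything taken from this thread.

```python
"""Sketch: a Nagios-style plugin that emits performance data."""
import shutil
import sys

OK, WARNING, CRITICAL = 0, 1, 2
STATE_NAMES = {OK: "OK", WARNING: "WARNING", CRITICAL: "CRITICAL"}


def evaluate(percent_used, warn, crit):
    """Map a usage percentage onto a plugin exit code."""
    if percent_used >= crit:
        return CRITICAL
    if percent_used >= warn:
        return WARNING
    return OK


def format_output(path, percent_used, warn, crit):
    """Return (exit_code, output_line) with perfdata after the pipe."""
    state = evaluate(percent_used, warn, crit)
    text = "DISK %s - %s is %.1f%% full" % (STATE_NAMES[state], path, percent_used)
    perfdata = "used_pct=%.1f%%;%d;%d;0;100" % (percent_used, warn, crit)
    return state, "%s | %s" % (text, perfdata)


def main(path="/", warn=80, crit=90):
    # Thresholds and path are illustrative defaults, not recommendations.
    usage = shutil.disk_usage(path)
    percent_used = 100.0 * usage.used / usage.total
    state, line = format_output(path, percent_used, warn, crit)
    print(line)
    return state


if __name__ == "__main__":
    sys.exit(main())
```

A grapher such as NagiosGraph or PNP4Nagios can then pick up the part after the pipe without the check needing to know anything about RRD files.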
There's got to be a better way. What do you guys use?
We rewrote our own NMS from scratch :)
(I'm not opposed to non-free solutions, provided they work better.)
We sell our solution, so I'm biased, but do check out the Nagios route, it works well enough for small to medium, and larger installations with careful planning (problem with Nagios is how to make it perform with thousands of hosts). Hth, Phil
Hi Jeffrey,
I personally prefer Hobbit over Cacti and Nagios:
http://sourceforge.net/projects/hobbitmon/
http://hobbitmon.sourceforge.net/
Thomas Quilling NCIR GmbH Network, Consulting & Internet Services Munich / Germany tier1@ncinet.de
Rev. Jeffrey Paul wrote:
Hi. I've a (theoretically) simple problem and I'm wondering how others solve it.
Taken one at a time, most of them are simple. Most of life is like that.
1) Is SNMP the best way to do this? Obviously some of the data (service checks) will need to be collected other ways.
I've actually been out of the admin biz for some time, but back in the day I was very fond of SNMP tools for all sorts of reporting. For output I liked MRTG for most things; WhatsUpGold had some nice features if you would rather pay money. For alarms, I used some unix hack or another (home-made). I also used home-made hacks to gather data about things that did not have a suitable SNMP interface.
2) Is there any good solution that does both logging/trending of this data and also notification/monitoring/alerting? I've used both Nagios and Cacti in the past, and, due to the number of individual things being monitored (3-5 items per OS instance, 5-10 items per physical server, 10-50 things per network device), setting them both up independently seems like a huge pain. Also, I've never really liked Nagios that much.
See MRTG, RRD, et al.
I recently entertained the idea of writing a CGI that output all of this information in a standard format (csv?), distributing and installing it, then collecting it periodically at a central location and doing all the rrd/notification myself, but then realized that this problem must've been solved a million times already.
There's got to be a better way. What do you guys use?
I had the luxury of management that thought managing was a good idea, so I had a machine pretty much dedicated to systems management and all the machines (including routers, bridges, hubs, and such) reported to it. We had a web interface to the MRTG and MRTG-like presentations.
(I'm not opposed to non-free solutions, provided they work better.)
Just before they fired me for being too old, they bought all the HP and Cisco stuff in the world. I do not recommend any of it.
-- Requiescas in pace o email Two identifying characteristics of System Administrators: Ex turpi causa non oritur actio Infallibility, and the ability to learn from their mistakes. Eppure si rinfresca ICBM Targeting Information: http://tinyurl.com/4sqczs
At 2008-06-26T02:22-0700, Rev. Jeffrey Paul wrote:
Other stuff we really need to keep an eye on is hardware - redundant PSU status in our 7204s and Dells, temperatures and voltages
Do yourself a favor: monitor temp in C. Most stuff only does C, and people burn routers if there's a mix of C and F ("I set the alarm to 90, why didn't it shut down?" "Well, you should have set it to 30; the router only understands C.").
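One way to guard against the mixed-unit mistake described here is to convert every threshold to Celsius once, at the point where it is configured, and compare only Celsius values after that. A small sketch, with hypothetical function names:

```python
"""Sketch: normalise temperature thresholds to Celsius before comparing."""


def to_celsius(value, unit):
    """Convert a temperature in 'C' or 'F' to Celsius."""
    unit = unit.upper()
    if unit == "C":
        return float(value)
    if unit == "F":
        return (float(value) - 32.0) * 5.0 / 9.0
    raise ValueError("unsupported temperature unit: %r" % unit)


def over_threshold(reading_c, threshold, threshold_unit):
    """True if a Celsius reading meets or exceeds the configured threshold."""
    return reading_c >= to_celsius(threshold, threshold_unit)
```

With this, a threshold entered as 90 F is treated as roughly 32 C, so a 35 C reading trips the alarm, whereas the same 90 interpreted as Celsius would not.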
1) Is SNMP the best way to do this? Obviously some of the data (service checks) will need to be collected other ways.
Pretty much. Particularly with NetSNMP, you can hook in external commands etc. Check out the "Arbitrary Extension Commands" section of http://www.net-snmp.org/docs/man/snmpd.conf.html
If you don't use SNMP for everything, you're going to be stuck with hooking SNMP into whatever you do use so that all your networking kit and environmental monitors can be monitored.
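As a sketch of what such an extension command might look like for the NTP synchronization status asked about in the original post, the script below runs `ntpq -pn`, looks for the peer line marked with `*` (the currently selected system peer), and prints a one-line status. The script name and install path in the comment are hypothetical, and the column layout assumed is the usual `ntpq -p` one (remote, refid, st, t, when, poll, reach, delay, offset, jitter); check it against the ntpq version actually deployed.

```python
"""Sketch: NTP sync check suitable for a net-snmp 'extend' directive.

Hypothetical snmpd.conf line (the path is an assumption):
    extend ntpsync /usr/local/bin/check_ntp_sync.py
"""
import subprocess
import sys


def selected_peer_offset_ms(ntpq_output):
    """Return the offset (ms) of the '*' peer, or None if no peer is selected."""
    for line in ntpq_output.splitlines():
        if line.startswith("*"):
            fields = line.split()
            if len(fields) >= 10:
                return float(fields[8])  # assumed position of the offset column
    return None


def main():
    try:
        result = subprocess.run(
            ["ntpq", "-pn"], capture_output=True, text=True,
            timeout=10, check=True,
        )
    except (OSError, subprocess.SubprocessError) as exc:
        print("unknown: could not run ntpq: %s" % exc)
        return 3
    offset = selected_peer_offset_ms(result.stdout)
    if offset is None:
        print("unsynchronized")
        return 2
    print("synchronized offset_ms=%s" % offset)
    return 0


if __name__ == "__main__":
    sys.exit(main())
```

snmpd then exposes the script's output and exit status through the NET-SNMP-EXTEND-MIB tables, so the same poller that collects cpu and disk figures can collect this as well.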
2) Is there any good solution that does both logging/trending of this data and also notification/monitoring/alerting? I've used both Nagios and Cacti in the past, and, due to the number of individual things being monitored (3-5 items per OS instance, 5-10 items per physical server, 10-50 things per network device), setting them both up independently seems like a huge pain. Also, I've never really liked Nagios that much.
Take a look at OpenNMS....
There's got to be a better way. What do you guys use?
We wrote our own, but that's a company culture thing. Paul -- End dual-measurement, let's finish going metric! http://gometric.us/ http://www.metric.org/
I wrote an NMS to do something along these lines. It's focussed more towards graphing than alerting. It knows where to find Dell/Cisco temperature monitors via SNMP and will keep track of hardware and OS types/versions. It's probably still not really ready for general consumption, but if you think it would be useful to you, give me a shout and I'll see if I can help you make it work properly for you.
http://www.project-observer.org
I wrote it mostly due to my own absolute hatred of Nagios and disappointment at the other NMSes around (where are the aesthetics?)! :)
Thanks, adam.
You can do most of this with Cacti out of the box. You can also add the thold and monitoring plugins to get the additional things you need. Cacti mainly uses SNMP, but you can also use external scripts to gather information. It does not have future trending capabilities (that I am aware of), but it can evaluate against baseline thresholds using the thold plugin. The Cacti community has created templates and add-ons for the most common network vendors and system types.
Mike wrote:
You can do most of this with Cacti out of the box. You can also add the thold and monitoring plugins to get the additional things you need. Cacti mainly uses SNMP, but you can also use external scripts to gather information. It does not have future trending capabilities (that I am aware of), but it can evaluate against baseline thresholds using the thold plugin.
The Cacti community has created templates and add-ons for the most common network vendors and system types.
Cacti does graphs, but it's really just not useful enough to me. Neither was Nagios (on top of being a nightmare to configure). I found similar issues with other similar solutions such as OpenNMS and JFFNMS. I generally used Cricket with the config-generation tool for graphing devices and ports; Cacti was prettier, but IMO slightly more complex than necessary.
Observer is intended to be autodiscovering, with as little manually configured as possible. This has made a few things quite hard to do properly, like alerting. It was written firstly to discover the network, secondly to graph and log it, and thirdly to alert you when it breaks. Unfortunately it turns out that I can't get my head around the alerting bit, so it remains a little unfinished!
My personal opinion is that all of the FOSS NMS solutions are sorely disappointing, Observer included. It seems to be something that no one has quite gotten right yet!
Adam.
On 6/27/08, Adam Armstrong <lists@memetic.org> wrote:
My personal opinion is that all of the FOSS NMS solutions are sorely disappointing, Observer included. It seems to be something that no one has quite gotten right yet!
Adam.
Very true. One product (not OSS, somewhat pricey) we've had great luck with is SolarWinds Netmonitor. I can install it and point it at all of our equipment in under a couple of hours. When you want to monitor a server, you just need an SNMP service running on it, point Netmonitor at the IP of the box, and it'll ask you what you'd like to monitor (disk, CPU, memory, etc). Works great with our Cisco and HP networking gear as well. -brandon
participants (10)
- Adam Armstrong
- Alex Thurlow
- Andrew Girling
- Brandon Galbraith
- Laurence F. Sheldon, Jr.
- Mike
- Paul Armstrong
- Phil Regnauld
- Rev. Jeffrey Paul
- Tom Quilling