At 2008-06-26T02:22-0700, Rev. Jeffrey Paul wrote:
Other stuff we really need to keep an eye on is hardware - redundant PSU status in our 7204s and Dells, temperatures and voltages
Do yourself a favor and monitor temperature in Celsius. Most gear only does C, and people burn out routers when C and F get mixed. ("I set the alarm to 90, why didn't it shut down?" "Well, you should have set it to 30; the router only understands C.")
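To make the C/F point concrete, here is a minimal sketch of normalizing alarm thresholds to Celsius before they get pushed to a device. The function names and the 60 C sanity ceiling are my own assumptions for illustration, not anything a particular router or monitoring tool defines.

```python
def to_celsius(value, unit):
    """Return a temperature in Celsius, given a value and its unit ('C' or 'F')."""
    unit = unit.strip().upper()
    if unit == "C":
        return float(value)
    if unit == "F":
        return (float(value) - 32.0) * 5.0 / 9.0
    raise ValueError(f"unknown temperature unit: {unit!r}")


def check_alarm_threshold(value, unit, max_sane_c=60.0):
    """Convert a threshold to Celsius and reject values that look like unit mix-ups.

    max_sane_c is a hypothetical ceiling; choose one that fits your hardware.
    """
    celsius = to_celsius(value, unit)
    if celsius > max_sane_c:
        raise ValueError(
            f"threshold {celsius:.1f} C exceeds {max_sane_c:.1f} C; "
            "was it entered in Fahrenheit?"
        )
    return celsius
```

With this, a threshold entered as 90 F comes out at roughly 32 C, while a bare 90 labeled as C gets rejected instead of silently becoming an alarm that never fires.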
1) Is SNMP the best way to do this? Obviously some of the data (service checks) will need to be collected other ways.
Pretty much. Particularly with Net-SNMP, you can hook in external commands and the like. See the "Arbitrary Extension Commands" section of http://www.net-snmp.org/docs/man/snmpd.conf.html

If you don't use SNMP for everything, you will still end up hooking SNMP into whatever you do use, so that all your networking kit and environmental monitors can be monitored.
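As a rough sketch of what an external command behind Net-SNMP's `extend` directive could look like: the script below prints one reading in whole degrees Celsius on stdout. The sysfs path, the script location, and the `cpu_temp` name are assumptions for illustration; substitute whatever your hardware actually exposes.

```python
#!/usr/bin/env python3
"""Print one temperature reading in whole degrees Celsius for snmpd's `extend`.

Hypothetical snmpd.conf line (the name and install path are made up):
    extend cpu_temp /usr/local/bin/cpu_temp.py
"""
import sys

# Assumption: a Linux host exposing a thermal zone in sysfs, in millidegrees C.
SENSOR_PATH = "/sys/class/thermal/thermal_zone0/temp"


def parse_millidegrees(text):
    """Convert a sysfs reading such as '42000' to degrees Celsius."""
    return int(text.strip()) / 1000.0


def main(path=SENSOR_PATH):
    """Print the reading and return 0, or report the error on stderr and return 1."""
    try:
        with open(path) as sensor:
            celsius = parse_millidegrees(sensor.read())
    except (OSError, ValueError) as err:
        print(f"cannot read {path}: {err}", file=sys.stderr)
        return 1
    print(round(celsius))
    return 0


if __name__ == "__main__":
    main()
```

After restarting snmpd, the output should be readable under NET-SNMP-EXTEND-MIB, for example as nsExtendOutput1Line."cpu_temp", so the same poller that watches your routers can pick it up.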
2) Is there any good solution that does both logging/trending of this data and also notification/monitoring/alerting? I've used both Nagios and Cacti in the past, and, due to the number of individual things being monitored (3-5 items per OS instance, 5-10 items per physical server, 10-50 things per network device), setting them both up independently seems like a huge pain. Also, I've never really liked Nagios that much.
Take a look at OpenNMS....
There's got to be a better way. What do you guys use?
We wrote our own, but that's a company culture thing.

Paul
-- 
End dual-measurement, let's finish going metric!
http://gometric.us/ http://www.metric.org/