Hi,

We have Zabbix IT Services (running on Zabbix 3.2) configured for some test groups. It usually returns good data, but occasionally one service group or trigger seems to get stuck in an alerting state and reports an incorrect SLA. This can occur when a trigger changes to a problem state and then back to OK, but IT Services doesn't reflect that change. The symptom is that the top-level group shows 100% problem time while its subgroups and items show either no problem time or so little that it couldn't add up to 100%.

We have it built with some groups under root, some subgroups and items, and each item has a trigger associated with it. We followed this article to the best of our knowledge: https://www.zabbix.com/documentation/3.2/manual/it_services

For example:

|Data Center
|-Core1
|--Core1 - ICMP - Trigger
|-Core2
|--Core2 - ICMP - Trigger

Each subitem is a child of the item above it. We haven't configured dependencies on any other groups or items.

My question is: has anyone gotten Zabbix IT Services to work correctly? Is there a trick to getting it to work, or some configuration we are doing incorrectly?

Thanks,
Graham
On Tue, 18 Jul 2017 14:33:19 -0000, Graham Johnston said:
> My question is, has anyone gotten the Zabbix IT Services to work correctly? Is there a trick to getting it to work, some configuration we are doing incorrectly?
We're a Zabbix shop, with a large number of boxes being monitored. This may or may not be your problem, but it bit me big time when we were first getting it up and running.

There's a "gotcha" with triggers, in that they may have *TWO* values to provide hysteresis. So if you have a trigger set to go off at 25 wombats/second, and your system hits 32 wps, the trigger will flag a problem. It will *continue* doing so *not* until it drops below 25 wps, but until it drops below the "clear" value (for example 10 wombats/sec). So you can be sitting at 11 or 13 or 12 for a long time, but it won't go to OK until it's below 10 when Zabbix checks. (A side effect is that if it manages to have a very short dip to 9.8 wps and back up to 13, you'll be scratching your head wondering how it went to OK. :)

(And then of course there are the "somebody had a wild hair" cases where the trigger into trouble state is one hand-coded expression checking one thing, and the "OK" trigger checks something entirely different. :)

Hope that helps.
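For what it's worth, the hysteresis behavior described above can be sketched in a few lines of Python. This is only an illustration, not Zabbix code: the function name `trigger_states`, the two thresholds (25 to raise the problem, 10 to clear it), and the sample values are assumptions taken from the wombat example.

```python
def trigger_states(samples, problem_above=25.0, recover_below=10.0):
    """Return the trigger state after each sample, using two thresholds.

    Hypothetical model of a trigger with hysteresis: it goes to PROBLEM
    when a value exceeds `problem_above`, and only returns to OK when a
    value drops below `recover_below`.
    """
    state = "OK"
    states = []
    for value in samples:
        if state == "OK" and value > problem_above:
            state = "PROBLEM"
        elif state == "PROBLEM" and value < recover_below:
            state = "OK"
        # Anything between the two thresholds leaves the state unchanged.
        states.append(state)
    return states


# Made-up readings in wombats/sec, one per Zabbix check.
samples = [20, 32, 13, 11, 12, 9.8, 13, 26]
for value, state in zip(samples, trigger_states(samples)):
    print(f"{value:>5} wps -> {state}")
```

Note that the readings of 13, 11 and 12 after the spike all stay in PROBLEM, and the single dip to 9.8 is what clears it, even though the next reading is back at 13.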
participants (2):
- Graham Johnston
- valdis.kletnieks@vt.edu