On Wed, Jun 5, 2019 at 5:44 AM William Waites <ww@styx.org> wrote:
> It's not enough to have monitoring and a ticket system. You need to pay
> attention to them, care for them and feed them. I can't count the number
> of ticket systems full of ancient and irrelevant things or monitoring
> systems that people have forgotten about or don't know how to add new
> stuff to. Even the cycle of,
Some points to consider when monitoring your network:
1. Beware early automation. If you write a generator to monitor all your stuff without addressing how operators will make one-off changes (which is hard to design well), the other operators will find the monitoring system unusable. That means they won't update it when stuff is added or changed, which quickly makes it useless.
2. Be careful aggregating alarms. A big green or red light by itself is useless. The operator has to be able to start with the alarm and immediately trace back to exactly which tests and results bubbled up into the aggregate, and from there to the malfunctioning component. If you lose this information during the aggregation process, you're just producing noise.
3. Every alarm must be actionable. When the light goes red, what -exactly- do you want the operator to do as a result? Don't create an alarm until you can offer a detailed and specific answer, and link that answer to the alarm so the operator doesn't have to hunt for it.
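
As a rough illustration of the aggregation and actionability points, here is a minimal Python sketch. Everything in it is an assumption made up for the example (the CheckResult and Alarm names, the runbook field, the report layout); it is not taken from any particular monitoring product. The idea is only that the aggregate keeps every underlying result so the operator can trace back from the red light, and that an alarm cannot be created without an action attached.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class CheckResult:
    """One underlying test result (hypothetical structure)."""
    component: str  # the thing that was tested
    test: str       # what was tested
    ok: bool        # did the test pass?
    detail: str     # raw result the operator can inspect


@dataclass
class Alarm:
    """An aggregate alarm that keeps the results it was built from."""
    name: str
    runbook: str  # what exactly the operator should do when this goes red
    results: List[CheckResult] = field(default_factory=list)

    def __post_init__(self) -> None:
        # Refuse to create an alarm that has no action linked to it.
        if not self.runbook.strip():
            raise ValueError(f"alarm {self.name!r} has no runbook; not actionable")

    @property
    def red(self) -> bool:
        # The aggregate is red if any underlying test failed.
        return any(not r.ok for r in self.results)

    def failing(self) -> List[CheckResult]:
        # Trace back from the aggregate to the exact failing tests.
        return [r for r in self.results if not r.ok]

    def report(self) -> str:
        if not self.red:
            return f"{self.name}: GREEN"
        lines = [f"{self.name}: RED", f"  action: {self.runbook}"]
        for r in self.failing():
            lines.append(f"  {r.component}: {r.test} -> {r.detail}")
        return "\n".join(lines)


if __name__ == "__main__":
    # Hypothetical components and runbook text, for illustration only.
    alarm = Alarm(
        name="site-a uplinks",
        runbook="check the failing uplink's interface counters, then call the carrier",
        results=[
            CheckResult("rtr1 uplink", "ping", True, "3 ms"),
            CheckResult("rtr2 uplink", "ping", False, "timeout"),
        ],
    )
    print(alarm.report())
```

The design choice worth noting is that the aggregate stores the individual results rather than a single boolean, so the red light, the failing component, and the operator's instructions all arrive together.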
Regards,
Bill Herrin
--