Hi David - Just a bit of insight from my own experience:

Common reasons I've seen for monitoring (and the associated escalation processes) not working, leading to issues like the ones you described:
- Inconsistent HTTP response codes across services and service layers (nginx vs. the backend Tomcat), which means you can't rely on them.
- Monitoring on arbitrary metrics (90% of something) as opposed to metrics linked to an actual outcome (response times, for example).
- No runbook in place (for example, which setting an engineer changes to switch maintenance mode on or off).
- No central view of what engineer is doing what to which systems.
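
To make the second point concrete, here is a rough Python sketch of an outcome-based check. The function names and the 2000 ms threshold are made up for illustration, so treat it as a starting point rather than a recommendation:

```python
import statistics

def p95(samples_ms):
    """Approximate 95th percentile of response times in ms (needs >= 2 samples)."""
    # quantiles(n=100) returns 99 cut points; index 94 is the 95th percentile.
    return statistics.quantiles(samples_ms, n=100)[94]

def should_alert(samples_ms, threshold_ms=2000.0):
    """Alert on what users experience (slow responses), not on 'something at 90%'."""
    return p95(samples_ms) > threshold_ms
```

The idea is simply that the alert condition is tied to an outcome someone would actually act on.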

A fairly simple example of a setup where I've seen things work pretty well:
- The organisation uses HTTP status code monitoring, alerting on 5xx but not on 503.
- Services are configured (and tested!) to return other, specific 5xx errors, and keep 503 for a 'known and expected maintenance' mode.
- A runbook is in place: let other engineers know what's happening (a Slack message, for example), then enable the maintenance page on the reverse proxy.
- Monitor and report on the common 90% metrics (disk space, memory), but don't alert on them.
- Don't fill up the disk with logs, only to delete them and let it fill up again. :)
- Remove all non-actionable alerts.
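
Roughly, the alerting rule boils down to something like this. It's a minimal Python sketch; the function name and the returned labels are my own invention for illustration, not taken from any particular monitoring tool:

```python
# Hypothetical policy: 503 is reserved for planned maintenance.
MAINTENANCE_STATUS = 503

def classify(status_code: int) -> str:
    """Map an HTTP status code from a health check to a monitoring action."""
    if status_code == MAINTENANCE_STATUS:
        return "maintenance"  # known and expected: record it, don't page anyone
    if 500 <= status_code <= 599:
        return "alert"        # any other 5xx is treated as a real failure
    return "ok"               # everything else is not an outage under this policy

if __name__ == "__main__":
    for code in (200, 404, 500, 502, 503):
        print(code, classify(code))
```

This only works if every layer (reverse proxy and backend) sticks to the convention, which is why the "and tested!" part matters.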

Of course, a good solution could be to implement a rolling-upgrade / HA maintenance strategy, but in reality (depending on how ancient the app is) this can be quite hard.

PS: This is a really good read: https://landing.google.com/sre/sre-book/toc/index.html


Cheers
Heath




On Thu, Dec 6, 2018 at 9:03 AM David H <ispcolohost@gmail.com> wrote:

Hey all, I was curious if anyone knows of a website monitoring service that has the option to incorporate a human component into the decision and escalation tree? I’m trying to help a customer find a way around false positives bogging down their NOC staff, by having a human determine the difference between a real error, desired (but different) content, or something in between like “Hey it’s 3am and we’ve taken our website offline for maintenance, we’ll be back up by 6am.” Automated systems tend to only know that if test A, or steps A through C, are failing, then the site is ‘down’ and they do their preconfigured thing, but that ends up needlessly taking NOC time if the customer themselves is performing work on their own site, or has just changed it so that whatever content was being watched is now gone. So, the goal would be to have the end user be the first point of contact if it looks like more of a customer-side issue. If they can’t be reached to confirm, THEN contact NOC, and unlike email alerts, keep contacting until a human acknowledges receipt of the alert.

 

Thanks