SLA monitoring and reporting to customers
What open-source or low-budget tools are operators using for SLA monitoring when the reports (current state and historical) should be available to customers?
Looking at NANOG archives, NAGIOS is the most prevalent tool, but its authorization mechanisms fall somewhat short of what I would like: customers should not be able to change anything, either in the configuration or in the SLA software state.
I'm looking for something more like Cacti, where customers can be restricted to seeing only some of the generated graphs.
Thanks for any input,
Rubens
On Sun, 18 Mar 2007, Rubens Kuhl Jr. wrote:
> What open-source or low-budget tools are operators using for SLA monitoring when the reports (current state and historical) should be available to customers?
Please define SLA in terms of monitoring.
> Looking at NANOG archives, NAGIOS is the most prevalent tool, but its authorization mechanisms fall somewhat short of what I would like: customers should not be able to change anything, either in the configuration or in the SLA software state.
You can set it up so that a customer sees only the status data for the services he or she has access to, by adding the customer as a contact for those hosts or services. Do you think that your customers should or should not have such access to your central nagios system?
> I'm looking for something more like Cacti, where customers can be restricted to seeing only some of the generated graphs.
Would you be satisfied with a graphing extension to nagios that replicates the nagios security mechanism, where a customer can see graphs for the services he/she is listed as a contact for?
--
William Leibzon
Elan Networks
william@elan.net
>> What open-source or low-budget tools are operators using for SLA monitoring when the reports (current state and historical) should be available to customers?
> Please define SLA in terms of monitoring.
- 99.x% availability (defined by packet loss and response time), monthly
- A certain number of hours from service interruption to service recovery
>> Looking at NANOG archives, NAGIOS is the most prevalent tool, but its authorization mechanisms fall somewhat short of what I would like: customers should not be able to change anything, either in the configuration or in the SLA software state.
> You can set it up so that a customer sees only the status data for the services he or she has access to, by adding the customer as a contact for those hosts or services.
There are 2 main issues, on my reading of http://nagios.sourceforge.net/docs/2_0/cgiauth.html :
- Users can issue commands for hosts/services they are contacts for. They could acknowledge an outage even when we should know about it.
- Some devices of interest to a customer are not specific to that customer: a switch, a router. If customers are considered contacts for such devices, they can issue commands for them.
> Do you think that your customers should or should not have such access to your central nagios system?
That's something I would like to hear opinions on, but even with NAGIOS such an issue could be solved by having one NOC-only NAGIOS and one customers-only NAGIOS. Using NagiosQL would probably make replication easier.
>> I'm looking for something more like Cacti, where customers can be restricted to seeing only some of the generated graphs.
> Would you be satisfied with a graphing extension to nagios that replicates the nagios security mechanism, where a customer can see graphs for the services he/she is listed as a contact for?
Is it http://nagiosgraph.sourceforge.net/ ? Can a user be a nagiosgraph contact without being a NAGIOS contact?
Thanks,
Rubens
On Sun, 18 Mar 2007, Rubens Kuhl Jr. wrote:
>>> What open-source or low-budget tools are operators using for SLA monitoring when the reports (current state and historical) should be available to customers?
>> Please define SLA in terms of monitoring.
> - 99.x% availability (defined by packet loss and response time), monthly
> - A certain number of hours from service interruption to service recovery
So what you're looking for is a number for a monthly report, calculated from the known downtime as measured by monitoring software.
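To illustrate the calculation described above, here is a minimal Python sketch. It assumes the monitoring software can export outages as non-overlapping (start, end) timestamp pairs; the function name, the sample outages and the 4-hour recovery limit are hypothetical and not taken from any tool mentioned in this thread.

```python
from datetime import datetime, timedelta

def monthly_report(outages, month_start, month_end, max_recovery=timedelta(hours=4)):
    """Summarise one month from a list of non-overlapping (start, end) outages.

    max_recovery is a hypothetical contractual limit on time to recovery.
    """
    period = month_end - month_start
    downtime = timedelta(0)
    late_recoveries = 0
    for start, end in outages:
        # Clip each outage to the reporting month.
        start, end = max(start, month_start), min(end, month_end)
        if end <= start:
            continue
        downtime += end - start
        if end - start > max_recovery:
            late_recoveries += 1
    availability = 100.0 * (1 - downtime / period)
    return availability, downtime, late_recoveries

march = (datetime(2007, 3, 1), datetime(2007, 4, 1))
outages = [
    (datetime(2007, 3, 5, 2, 0), datetime(2007, 3, 5, 2, 45)),      # 45 minutes
    (datetime(2007, 3, 20, 13, 0), datetime(2007, 3, 20, 18, 30)),  # 5.5 hours
]
availability, downtime, late = monthly_report(outages, *march)
print(f"availability {availability:.3f}%  downtime {downtime}  late recoveries {late}")
```

Both SLA terms listed earlier (monthly availability and hours to recovery) come out of the same outage list, so a single export from the monitoring system is enough for this kind of report.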
>>> Looking at NANOG archives, NAGIOS is the most prevalent tool, but its authorization mechanisms fall somewhat short of what I would like: customers should not be able to change anything, either in the configuration or in the SLA software state.
>> You can set it up so that a customer sees only the status data for the services he or she has access to, by adding the customer as a contact for those hosts or services.
> There are 2 main issues, on my reading of http://nagios.sourceforge.net/docs/2_0/cgiauth.html :
> - Users can issue commands for hosts/services they are contacts for. They could acknowledge an outage even when we should know about it.
If they acknowledge an outage you'll know about it (acknowledgement notification). I also don't necessarily see it as bad for a user of some service to acknowledge that a certain service you monitor (say HTTP) is down and tell you that they purposely took apache down. But I guess what you're asking for is an additional permission list for nagios users with view-only access...
> - Some devices of interest to a customer are not specific to that customer: a switch, a router. If customers are considered contacts for such devices, they can issue commands for them.
Depends on how you set it up. In the setup that I use, each router & switch port is a separate service and can have a separate list of associated users; they will see no other data about the switch, nor issue commands for anything other than that switch.
>> Do you think that your customers should or should not have such access to your central nagios system?
> That's something I would like to hear opinions on, but even with NAGIOS such an issue could be solved by having one NOC-only NAGIOS and one customers-only NAGIOS. Using NagiosQL would probably make replication easier.
Yes, that can be done. But maintaining separate parallel systems is actually a pain. I would also like to hear opinions on whether a more complex user permission system would be good to have for the nagios web interface and, if so, what those permissions should be.
>>> I'm looking for something more like Cacti, where customers can be restricted to seeing only some of the generated graphs.
>> Would you be satisfied with a graphing extension to nagios that replicates the nagios security mechanism, where a customer can see graphs for the services he/she is listed as a contact for?
> Is it http://nagiosgraph.sourceforge.net/ ? Can a user be a nagiosgraph contact without being a NAGIOS contact?
I'm actually asking because I wrote my own web interface (see ngraph.cgi at http://www.elan.net/~william/nagios/), originally for nagiosgrapher, but it is now being decoupled from any particular graphing package and I plan to have it support multiple nagios data collection & backend systems. The next step on the TODO list is user access & authentication, which is supposed to replicate how nagios itself does it by allowing only authenticated users who are contacts for the service to see the graphs, BUT you do have an opportunity here to say what else such an interface should support as far as user access rights control. (BTW, the current cgi does support specifying users who would have access to graphs but not to nagios itself; however, such a user would then have access to all graphs...)
--
William Leibzon
Elan Networks
william@elan.net
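The access rules discussed in this exchange (contacts see only their own services, an optional view-only mode, and graph-only users who see everything) can be modelled roughly as below. This is only an illustrative Python sketch: the class names and the `read_only` and `all_graphs` flags are hypothetical and do not correspond to actual Nagios or ngraph.cgi options.

```python
from dataclasses import dataclass, field

@dataclass
class Service:
    host: str
    description: str
    contacts: set = field(default_factory=set)

@dataclass
class User:
    name: str
    read_only: bool = True     # hypothetical flag: may view but not issue commands
    all_graphs: bool = False   # hypothetical flag: graph access without being a contact

def can_view(user, service):
    # A user sees a service if listed as a contact, or if granted all graphs.
    return user.all_graphs or user.name in service.contacts

def can_command(user, service):
    # Commands (e.g. acknowledging an outage) need contact status and write access.
    return user.name in service.contacts and not user.read_only

# Each switch port is modelled as its own service with its own contact list.
services = [
    Service("switch1", "port Gi0/1", {"noc", "customer_a"}),
    Service("switch1", "port Gi0/2", {"noc", "customer_b"}),
]
noc = User("noc", read_only=False)
customer_a = User("customer_a")

visible = [s.description for s in services if can_view(customer_a, s)]
print(visible)
print(can_command(customer_a, services[0]), can_command(noc, services[0]))
```

Separating "may view" from "may command" is what would let a customer be a contact for a shared device port without being able to acknowledge outages on it.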
william(at)elan.net wrote:
> On Sun, 18 Mar 2007, Rubens Kuhl Jr. wrote:
>> What open-source or low-budget tools are operators using for SLA monitoring when the reports (current state and historical) should be available to customers?
> Please define SLA in terms of monitoring.
I would say:
- availability
- response time / latency
- utilization
- accuracy and errors
- five nines, six nines, take your pick and define your own holy grail.
>> Looking at NANOG archives, NAGIOS is the most prevalent tool, but its authorization mechanisms fall somewhat short of what I would like: customers should not be able to change anything, either in the configuration or in the SLA software state.
> You can set it up so that a customer sees only the status data for the services he or she has access to, by adding the customer as a contact for those hosts or services. Do you think that your customers should or should not have such access to your central nagios system?
Correct, one can define a user privilege mode that controls what can be drilled into.
regards,
/virendra
>> I'm looking for something more like Cacti, where customers can be restricted to seeing only some of the generated graphs.
> Would you be satisfied with a graphing extension to nagios that replicates the nagios security mechanism, where a customer can see graphs for the services he/she is listed as a contact for?
On Sun, 18 Mar 2007, virendra rode // wrote:
> william(at)elan.net wrote:
>> On Sun, 18 Mar 2007, Rubens Kuhl Jr. wrote:
>>> What open-source or low-budget tools are operators using for SLA monitoring when the reports (current state and historical) should be available to customers?
>> Please define SLA in terms of monitoring.
> I would say:
> - availability
OK - network connection up, or UP/DOWN with a list of when it's down and for how long, and the SLA based on the amount of time it's been down or, more commonly, on time_up/(time_up+time_down)*100
> - response time / latency
OK - ping latency graph for the user to view, with the SLA based on the maximum average latency over a given time period
> - utilization
How is that part of an SLA? Or do you mean you guarantee that your own upstream network connection will not be overutilized?
> - accuracy and errors
Accuracy of what? What type of errors, packet drops?
> - five nines, six nines, take your pick and define your own holy grail.
$ echo "60*24*365*(1-0.99999)" | bc -l
5.25600
You wish to tell me you guarantee the network connection to the customer to be down for no more than 5 minutes during the year? Yeah, right :) (but don't let me discourage any of you from trying to achieve it!)
--
William Leibzon
Elan Networks
william@elan.net
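The same arithmetic as the bc one-liner, extended to other numbers of nines, as a small Python sketch. The function name is made up for illustration, and the year is taken as 365 days, as in the bc example.

```python
MINUTES_PER_YEAR = 60 * 24 * 365  # same non-leap year as the bc example

def allowed_downtime_minutes(nines):
    """Yearly downtime budget, in minutes, for an availability of `nines` nines."""
    unavailability = 10.0 ** -nines  # e.g. 5 nines -> 0.00001
    return MINUTES_PER_YEAR * unavailability

for nines in range(2, 7):
    availability = 100 * (1 - 10.0 ** -nines)
    print(f"{nines} nines ({availability:.4f}%): "
          f"{allowed_downtime_minutes(nines):10.3f} minutes/year")
```

Each additional nine divides the yearly budget by ten, so five nines leaves about 5.256 minutes and six nines roughly half a minute.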
william(at)elan.net wrote:
> On Sun, 18 Mar 2007, virendra rode // wrote:
>> william(at)elan.net wrote:
>>> On Sun, 18 Mar 2007, Rubens Kuhl Jr. wrote:
>>>> What open-source or low-budget tools are operators using for SLA monitoring when the reports (current state and historical) should be available to customers?
>>> Please define SLA in terms of monitoring.
>> I would say:
>> - availability
> OK - network connection up, or UP/DOWN with a list of when it's down and for how long, and the SLA based on the amount of time it's been down or, more commonly, on time_up/(time_up+time_down)*100
>> - response time / latency
> OK - ping latency graph for the user to view, with the SLA based on the maximum average latency over a given time period
>> - utilization
> How is that part of an SLA? Or do you mean you guarantee that your own upstream network connection will not be overutilized?
When an object (e.g. cpu, interface, temperature, routing table, etc.) exceeds a specified threshold, which could cause it to become unavailable, that triggers an event.
>> - accuracy and errors
> Accuracy of what? What type of errors, packet drops?
Availability and reachability, because we care about uptime, correct?
>> - five nines, six nines, take your pick and define your own holy grail.
> $ echo "60*24*365*(1-0.99999)" | bc -l
> 5.25600
> You wish to tell me you guarantee the network connection to the customer to be down for no more than 5 minutes during the year? Yeah, right :) (but don't let me discourage any of you from trying to achieve it!)
regards,
/virendra
-----Original Message-----
From: owner-nanog@merit.edu [mailto:owner-nanog@merit.edu] On Behalf Of william(at)elan.net
Sent: Monday, March 19, 2007 3:20 AM
To: virendra rode //
Cc: Rubens Kuhl Jr.; NANOG list
Subject: Re: SLA monitoring and reporting to customers
> How is that part of an SLA? Or do you mean you guarantee that your own upstream network connection will not be overutilized?
> ...
> Accuracy of what? What type of errors, packet drops?
SLAs are simply contracts that two parties negotiate... any number of metrics can be decided upon as the 'agreed level of service'. Different kinds of providers obviously have different metrics that are important - while the availability of services is pretty ubiquitous, accuracy and utilization might not make sense for an ISP SLA... then again, they're often deal-breakers for an ASP SLA.
> You wish to tell me you guarantee the network connection to the customer to be down for no more than 5 minutes during the year? Yeah, right :) (but don't let me discourage any of you from trying to achieve it!)
Maintenance windows / planned downtime are nearly always present and defined in an SLA, and should be excluded from the calculation of x number of 9's. Furthermore, all SLAs I've come across also include 'Emergency Windows', which can happen at any time given a pre-determined amount of forewarning. Limits on the duration and frequency of these windows should obviously be agreed upon in any good SLA.
Bottom line: it's good practice for an SLA to define exactly what metrics are being used, who is measuring them (read: a third party, e.g. Keynote), how they are measuring them (software/tools), what constitutes a violation and what the recompense should be.
- Gregori
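As a rough illustration of excluding agreed windows from the nines calculation, the Python sketch below subtracts maintenance time both from the counted downtime and from the measured period. Whether the windows also shrink the measured period is a contractual choice, so treat that as an assumption; the function names and sample dates are hypothetical.

```python
from datetime import datetime, timedelta

def overlap(a_start, a_end, b_start, b_end):
    """Length of the overlap between two intervals (zero if they are disjoint)."""
    return max(min(a_end, b_end) - max(a_start, b_start), timedelta(0))

def sla_availability(period_start, period_end, outages, windows):
    """Availability (%) with agreed maintenance windows excluded.

    Assumes outages do not overlap each other, and neither do the windows.
    """
    planned = sum((overlap(period_start, period_end, ws, we) for ws, we in windows),
                  timedelta(0))
    measured = (period_end - period_start) - planned
    counted = timedelta(0)
    for o_start, o_end in outages:
        # Clip the outage to the reporting period.
        o_start, o_end = max(o_start, period_start), min(o_end, period_end)
        if o_end <= o_start:
            continue
        # The part of the outage inside an agreed window does not count.
        excused = sum((overlap(o_start, o_end, ws, we) for ws, we in windows),
                      timedelta(0))
        counted += (o_end - o_start) - excused
    return 100.0 * (1 - counted / measured)

start, end = datetime(2007, 3, 1), datetime(2007, 4, 1)
windows = [(datetime(2007, 3, 11, 2, 0), datetime(2007, 3, 11, 4, 0))]
outages = [
    (datetime(2007, 3, 11, 3, 0), datetime(2007, 3, 11, 5, 0)),     # 1 h inside the window
    (datetime(2007, 3, 20, 13, 0), datetime(2007, 3, 20, 13, 30)),  # unplanned
]
with_windows = sla_availability(start, end, outages, windows)
without_windows = sla_availability(start, end, outages, [])
print(f"{with_windows:.4f}% with windows excluded, {without_windows:.4f}% raw")
```

Emergency windows could be handled the same way by appending them to `windows`, provided the agreed forewarning was actually given.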
> What open-source or low-budget tools are operators using for SLA monitoring when the reports (current state and historical) should be available to customers?
Here is one way to do it on the cheap.
I have worked with Cricket and genDevConfig extensively. genDevConfig will scan a router and automatically create the cricket SNMP commands to pull the IP SLA statistics out, or whatever other statistics you are interested in. These scanning parameters are stored in a cricket config file and the data in rrd files.
A custom Perl script, with or without some Mason templating, could be used along with a connection to a backend PostgreSQL database for user authentication.
It should be relatively easy to create two tables:
a) userid, username or email, password
b) userid, router, interface/id for sla
Then that data can be used in a Perl script to generate a page of customer-specific graphs in a user-authenticated web site.
Ray.
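The two tables described above could look roughly like this. The sketch is in Python with the standard-library sqlite3 module standing in for PostgreSQL and Perl, purely so that it is self-contained; the table and column names, the sample users and the choice to store a salted PBKDF2 hash instead of the plain password are all assumptions.

```python
import hashlib
import hmac
import os
import sqlite3

def hash_password(password, salt):
    # PBKDF2 from the standard library; store the salt and hash, never the password.
    return hashlib.pbkdf2_hmac("sha256", password.encode(), salt, 100_000)

db = sqlite3.connect(":memory:")  # stand-in for the PostgreSQL backend
db.executescript("""
    CREATE TABLE users (
        userid   INTEGER PRIMARY KEY,
        username TEXT UNIQUE NOT NULL,
        salt     BLOB NOT NULL,
        pwhash   BLOB NOT NULL
    );
    CREATE TABLE sla_targets (
        userid    INTEGER NOT NULL REFERENCES users(userid),
        router    TEXT NOT NULL,
        interface TEXT NOT NULL
    );
""")

def add_user(username, password):
    salt = os.urandom(16)
    cur = db.execute("INSERT INTO users (username, salt, pwhash) VALUES (?, ?, ?)",
                     (username, salt, hash_password(password, salt)))
    return cur.lastrowid

def graphs_for(username, password):
    """Return the (router, interface) pairs a user may see, or None if login fails."""
    row = db.execute("SELECT userid, salt, pwhash FROM users WHERE username = ?",
                     (username,)).fetchone()
    if row is None or not hmac.compare_digest(row[2], hash_password(password, row[1])):
        return None
    return db.execute("SELECT router, interface FROM sla_targets WHERE userid = ? "
                      "ORDER BY router, interface", (row[0],)).fetchall()

alice = add_user("alice@example.com", "s3cret")
bob = add_user("bob@example.com", "hunter2")
db.executemany("INSERT INTO sla_targets VALUES (?, ?, ?)", [
    (alice, "edge1", "Serial0/0"),
    (alice, "edge2", "FastEthernet0/1"),
    (bob, "edge1", "Serial0/1"),
])
print(graphs_for("alice@example.com", "s3cret"))
print(graphs_for("alice@example.com", "wrong"))
```

The list returned by `graphs_for` is what a page generator would loop over to pick the rrd files to graph for the logged-in customer.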
Ray Burkholder wrote:
>> What open-source or low-budget tools are operators using for SLA monitoring when the reports (current state and historical) should be available to customers?
> Here is one way to do it on the cheap.
> I have worked with Cricket and genDevConfig extensively. genDevConfig will scan a router and automatically create the cricket SNMP commands to pull the IP SLA statistics out, or whatever other statistics you are interested in. These scanning parameters are stored in a cricket config file and the data in rrd files.
> A custom Perl script, with or without some Mason templating, could be used along with a connection to a backend PostgreSQL database for user authentication.
> It should be relatively easy to create two tables:
> a) userid, username or email, password
> b) userid, router, interface/id for sla
> Then that data can be used in a Perl script to generate a page of customer-specific graphs in a user-authenticated web site.
> Ray.
Generally, if you are responsible for meeting an SLA, you have to take outages into account.
regards,
/virendra
participants (5)
- Gregori Parker
- Ray Burkholder
- Rubens Kuhl Jr.
- virendra rode //
- william(at)elan.net