HP® ServiceGuard™ Cluster Monitoring with Nagios®

Author: Trond Hasle Amundsen
Date: 2010-06-30
Latest version: 1.2.2 Released Thu Jul 23 2009


1   About

check_serviceguard is a plugin for Nagios that checks various aspects of an HP ServiceGuard cluster, including cluster status, node status and package status. To keep the output to a minimum, only the first node in the node list that is up and running reports errors. The other nodes report that everything is OK.

2   Usage

check_serviceguard is designed to be used with NRPE, i.e. run locally. Example:

# check_serviceguard
OK - Cluster 'pgprod' is up, 3 nodes, 12 packages

If something is wrong, the plugin will report it:

# check_serviceguard
[imap-cluster] Package 'lister-prod' is down (halted)
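
Nagios takes the service state from the plugin's exit status, not from the text. The sketch below maps an exit status to a state name when testing by hand. It assumes check_serviceguard follows the standard Nagios plugin convention (0=OK, 1=WARNING, 2=CRITICAL, 3=UNKNOWN); the function name nagios_state_name and the plugin path in the comment are made up for illustration.

```sh
#!/bin/sh
# Map a plugin exit status to its Nagios state name.
# Assumes the standard convention: 0=OK, 1=WARNING, 2=CRITICAL, 3=UNKNOWN.
nagios_state_name() {
    case "$1" in
        0) echo "OK" ;;
        1) echo "WARNING" ;;
        2) echo "CRITICAL" ;;
        *) echo "UNKNOWN" ;;
    esac
}

# Hypothetical use on a cluster node (the path is an assumption):
#   /usr/lib/nagios/plugins/check_serviceguard; nagios_state_name $?
nagios_state_name 2
```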

2.1   Prefix alerts with service state

The option --state prefixes each alert with its corresponding service state as reported by the plugin:

# check_serviceguard --state
CRITICAL: [odont-cluster] Package 'odxray' is down (halted)
WARNING: [odont-cluster] Service 'ODPROD_mon' for package 'odprod' status is unknown

Alternatively, you can use the option --short-state to get an abbreviated, one-letter service state:

# check_serviceguard --short-state
C: [odont-cluster] Package 'odxray' is down (halted)
W: [odont-cluster] Service 'ODPROD_mon' for package 'odprod' status is unknown

The Nagios plugin development guidelines suggest that this is good practice. I'm not a fan of it, but I've included these options for those who disagree.
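
One practical use of the prefixes is summarising a long alert list. The sketch below counts critical and warning lines in --short-state output read from standard input. It is not part of the plugin; the function name count_alerts is made up, and the sample lines are copied from the example above.

```sh
#!/bin/sh
# Count alerts per severity in --short-state output read from stdin.
# Lines are expected to look like "C: [cluster] ..." or "W: [cluster] ...".
count_alerts() {
    awk -F': ' '
        $1 == "C" { crit++ }
        $1 == "W" { warn++ }
        END { printf "critical=%d warning=%d\n", crit, warn }
    '
}

# Sample input copied from the --short-state example.
printf '%s\n' \
    "C: [odont-cluster] Package 'odxray' is down (halted)" \
    "W: [odont-cluster] Service 'ODPROD_mon' for package 'odprod' status is unknown" \
    | count_alerts
```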

2.2   Multiple line output, turn off escaping HTML tags

The output from check_serviceguard contains multiple lines separated by HTML linebreaks (<br/>) if run as a command within Nagios, via NRPE etc. If run from a console which has a TTY, i.e. if you log in via SSH or similar and run check_serviceguard manually, the linebreaks will be regular linebreaks.

Nagios 3.x allows the following option in cgi.cfg:

# This option determines whether HTML tags in host and service
# status output is escaped in the web interface.  If enabled,
# your plugin output will not be able to contain clickable links.

escape_html_tags=1

The default, as seen above in the sample cgi.cfg from the distribution, is that HTML tags are escaped. My advice is to turn this off (escape_html_tags=0). Otherwise, you will see output like this in your Nagios console:

[odont-cluster] Package 'odxray' is down (halted)<br/>[odont-cluster] Service 'ODPROD_mon' for package 'odprod' status is unknown

instead of this:

[odont-cluster] Package 'odxray' is down (halted)
[odont-cluster] Service 'ODPROD_mon' for package 'odprod' status is unknown

With Nagios 3.x, plugins are allowed to output multiple lines with regular linebreaks, but only the first line is shown in the web interface (status.cgi). I have not succeeded in my attempts to show all lines in the alarm console. As such, this multiline feature of Nagios 3.x is pretty useless in my opinion.

If you have tips on how to achieve this then please tell me what I'm doing wrong.
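
If the HTML-separated output ends up somewhere that does not render HTML, such as a log file or a plain-text notification, the separators can be turned back into regular linebreaks. This is a minimal sketch and not part of the plugin; the function name br_to_newlines is made up for illustration.

```sh
#!/bin/sh
# Replace HTML line breaks in plugin output (read from stdin) with
# regular newlines.
br_to_newlines() {
    awk '{ gsub(/<br\/>/, "\n"); print }'
}

# Sample input copied from the escaped-output example.
printf '%s\n' "[odont-cluster] Package 'odxray' is down (halted)<br/>[odont-cluster] Service 'ODPROD_mon' for package 'odprod' status is unknown" \
    | br_to_newlines
```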

2.3   Sudo permissions

check_serviceguard uses the ServiceGuard command cmviewcl for all its work, and needs permission to run this command. The best way to accomplish this is to use sudo. Edit the file /etc/sudoers (e.g. by running visudo as root) and add the following line:

nagios ALL=NOPASSWD:/usr/local/cmcluster/bin/cmviewcl

If you run NRPE as a user other than nagios, replace "nagios" with the appropriate user name.

check_serviceguard will automatically try to use sudo unless it is run as root.
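
The rule above (call cmviewcl directly as root, go through sudo otherwise) can be pictured as the following shell sketch. It only illustrates the logic; the plugin's actual implementation is not shown in this document. The function name cmviewcl_command is made up, and the path is the one from the sudoers line.

```sh
#!/bin/sh
# Print the command line that would be used to run cmviewcl for a given
# numeric user id: direct for root (uid 0), through sudo for anyone else.
cmviewcl_command() {
    cmviewcl="/usr/local/cmcluster/bin/cmviewcl"
    if [ "$1" -eq 0 ]; then
        printf '%s\n' "$cmviewcl"
    else
        printf 'sudo %s\n' "$cmviewcl"
    fi
}

# Show what would be run for the current user.
cmviewcl_command "$(id -u)"
```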

2.4   Primary and alternate nodes

A package can run on its primary node, or one of its alternate nodes. The plugin can give a warning about packages that aren't running on their primary nodes. This is turned off by default, but can be activated with the --primary switch:

# check_serviceguard --primary
[imap-cluster] Package 'lister-prod' is down (halted)
[imap-cluster] Package 'mail-mgmt' is running on alternate node mail-imap6 (primary=mail-imap4)
[imap-cluster] Package 'imap-sg17' is running on alternate node mail-imap6 (primary=mail-imap5)
[imap-cluster] Package 'imap-sg14' is running on alternate node mail-imap5 (primary=mail-imap4)
[imap-cluster] Package 'imap-sg19' is running on alternate node mail-imap6 (primary=mail-imap5)
[imap-cluster] Package 'imap-sg15' is running on alternate node mail-imap5 (primary=mail-imap4)
[imap-cluster] Package 'imap-sg18' is running on alternate node mail-imap6 (primary=mail-imap5)
[imap-cluster] Package 'imap-sg09' is running on alternate node mail-imap4 (primary=mail-imap2)
[imap-cluster] Package 'imap-sg13' is running on alternate node mail-imap5 (primary=mail-imap4)
[imap-cluster] Package 'imap-sg05' is running on alternate node mail-imap2 (primary=mail-imap1)
[imap-cluster] Package 'lister-test' is running on alternate node mail-imap4 (primary=mail-imap5)
[imap-cluster] Package 'jabber' is running on alternate node mail-imap2 (primary=mail-imap1)
[imap-cluster] Package 'imap-sg20' is running on alternate node mail-imap6 (primary=mail-imap5)

We have found that the important thing is that the package is up and running, not that it's running on its primary node. Reporting this by default therefore seemed like overkill.
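
If you do enable --primary, the alert lines are regular enough to post-process, for example to list which packages have moved. The sketch below relies only on the message format shown in the example output; it is not part of the plugin, and the function name alternate_packages is made up.

```sh
#!/bin/sh
# From --primary output on stdin, print "package current-node primary-node"
# for every package running on an alternate node. Other lines are ignored.
alternate_packages() {
    sed -n "s/.*Package '\([^']*\)' is running on alternate node \([^ ]*\) (primary=\([^)]*\)).*/\1 \2 \3/p"
}

# Sample input copied from the --primary example.
printf '%s\n' \
    "[imap-cluster] Package 'lister-prod' is down (halted)" \
    "[imap-cluster] Package 'mail-mgmt' is running on alternate node mail-imap6 (primary=mail-imap4)" \
    | alternate_packages
```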

2.5   Blacklisting

If the cluster contains packages whose status you're not interested in (e.g. test packages), they can be blacklisted with the -b|--blacklist option.

# check_serviceguard
[imap-cluster] Package 'lister-prod' is down (halted)

# check_serviceguard -b lister-prod
OK - Cluster 'imap-cluster' is up, 6 nodes, 30 packages

Blacklisted packages are skipped and are not checked at all. They will not turn up in the verbose output. The argument to the -b|--blacklist option is a string with comma-separated package names, or a filename that contains the string.
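
The "string or file" rule can be sketched in shell as follows. This illustrates the described behaviour only, not the plugin's own code; the function names resolve_blacklist and is_blacklisted are made up, and how the plugin treats whitespace inside a blacklist file is not covered here.

```sh
#!/bin/sh
# If the argument names an existing file, use the file's contents;
# otherwise treat the argument itself as the blacklist string.
resolve_blacklist() {
    if [ -f "$1" ]; then
        cat "$1"
    else
        printf '%s\n' "$1"
    fi
}

# Succeed if package $2 appears in the comma-separated list $1.
is_blacklisted() {
    case ",$1," in
        *",$2,"*) return 0 ;;
        *) return 1 ;;
    esac
}

blacklist=$(resolve_blacklist "lister-prod,lister-test")
if is_blacklisted "$blacklist" "lister-test"; then
    echo "lister-test is skipped"
fi
```

Wrapping both the list and the name in commas makes the match exact, so blacklisting lister-prod does not also hide a package named lister.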

2.6   Full usage information

Usage output gathered with check_serviceguard -h:

    check_serviceguard [OPTION...]

    --primary, --no-primary
        Check that packages are running on their primary nodes. This is
        turned off by default.

    --autorun, --no-autorun
        Enable or disable the check of auto_run for the packages. This is
        turned on by default.

    -b, --blacklist STRING or FILE
        Blacklist one or more packages, e.g. test packages. Blacklisted
        packages are completely ignored. The parameter is either the
        blacklist string, or a file (that may or may not exist) containing
        the string. The blacklist string contains package names separated by
        comma (,). This option can be specified multiple times.

    --state
        Prefix each alert with its corresponding Nagios state. This is
        useful in case of several alerts from the same cluster.

    --short-state
        Same as the --state option above, except that the state is
        abbreviated to a single letter (W=warning, C=critical etc.).

    -t, --timeout SECONDS
        The number of seconds after which the plugin will abort. Default
        timeout is 30 seconds if the option is not present.

    --linebreak STRING
        check_serviceguard will sometimes report more than one line, e.g. if
        there are several alerts. If the script has a TTY, it will use
        regular linebreaks. If not (which is the case with NRPE) it will use
        HTML linebreaks. Sometimes it can be useful to control what the
        plugin uses as a line separator, and this option provides that
        control.

        The argument is the exact string to be used as the line separator.
        There are two exceptions, i.e. two keywords that translates to the
        following:

        REG Regular linebreaks, i.e. "\n".

        HTML
            HTML linebreaks, i.e. "<br/>".

        This is a rather special option that is normally not needed. The
        default behaviour should be sufficient for most users.

    -v, --verbose
        Verbose output. Will report status on everything, even if status is
        ok. Blacklisted packages are ignored (i.e. no output).

    -h, --help
        Display help text.

    -m, --man
        Display man page.

    -V, --version
        Display version info.

See also the man page.

3   Download

3.1   Latest version

You can also download the plugin and the manpage separately. You don't really need the manpage, since 'check_serviceguard -m' displays it.

3.2   Changelog / Old versions

Version Date Changes
1.2.2 2009-07-23
  • Minor feature enhancements
  • Added LANG=C to commands
  • Don't use -S option with sudo, not supported on HP-UX
  • More error checking wrt. sudo
  • A couple of cosmetic output changes
  • License is now GPLv3
1.2.1 2009-04-07
  • Minor bugfixes
  • Only report on disabled auto_run if the package is up
1.2.0 2009-04-07
  • Major feature enhancements
  • Also check that auto_run for a package is enabled. This check is ON by default, but can be disabled with the --no-autorun switch. Thanks to Martin Christov for reporting
  • New option --timeout to specify plugin timeout (default is 30 seconds)
  • New option --linebreak to specify the type of linebreaks between alerts
  • Alerts are now sorted by severity level (criticals first)
  • Improvements in the verbose output
  • RPMs are now architecture dependent (different libdir)
  • Added install script etc. to the tarball and zip archive
  • Lots of other fixes and improvements
1.1.0 2009-01-26
  • Major feature enhancements, minor bugfixes
  • Bugfix for HP-UX 11 + ServiceGuard 11.17 (thanks to Florian Wolf for reporting)
  • New option --state
  • New option --short-state
  • Various small fixes and improvements
1.0.0 2009-01-16
  • Initial release

4   Known bugs & limitations

I have only a limited ability to test this plugin locally. Consequently, it is not regularly tested on many different OS/SG implementations.

5   Reporting bugs, proposing new features etc.

Please send me a note if you are experiencing bugs, have feature requests, or suggestions on how to improve check_serviceguard. We use this plugin in production at the University of Oslo, on many ServiceGuard clusters, but only with RHEL. While the plugin is bug-free for us, it might not be for you, so let me know if you have problems.

6   Disclaimer

This is free software. Use at your own risk.