check_dell_bladechassis

Dell™ Blade Enclosure Monitoring with Nagios®

Author: Trond Hasle Amundsen
Contact: t.h.amundsen@usit.uio.no
Date: 2010-06-30
Latest version: 1.0.0 Released Tue Aug 4 2009

Important

When using this plugin to monitor 1855/1955 enclosures, it is important to use SNMP version 1. To accomplish this, use the -P option, like this:

check_dell_bladechassis -H myhostname -P 1

The management module on the 1855/1955 chassis dies or otherwise becomes unavailable after a while if it is probed with SNMP version 2c. This is merely an annoyance, as removing and reinserting the management controller rectifies the issue. To avoid the problem altogether, use SNMP version 1.

1   Basic Overview

check_dell_bladechassis is a plugin for the Nagios monitoring software which checks the hardware health of Dell blade enclosures via SNMP. The plugin supports both the new M1000e enclosure and the old 1855/1955 enclosures.

This plugin is designed to be a companion plugin to check_openmanage in terms of supported options and functionality. The information that can be gathered via SNMP from these enclosures is limited, so the plugin can't be as detailed as check_openmanage can for Dell servers. In particular, this applies to the old 1855/1955 chassis.

2   Prerequisites

check_dell_bladechassis is written in Perl, and needs a perl interpreter. Nagios' embedded perl interpreter (ePN) can be used, but be aware that the plugin is not well tested against ePN. The plugin assumes that perl is available as /usr/bin/perl, but you can easily change this as you wish by editing the first line in the script.

Since this plugin uses SNMP, you'll also need the perl module Net::SNMP on the Nagios server (or the server running the queries). This module is not part of perl itself, but it is packaged for all modern Linux distributions, so installing it is usually quite easy.

If your package manager does not offer it under an obvious name, consult your OS repository to find Net::SNMP. If all else fails, try installing from CPAN.
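
A quick way to see whether the module is already present is to ask perl to load it. The sketch below does that and, if the module is missing, prints install commands to try. The package names perl-Net-SNMP (Red Hat family) and libnet-snmp-perl (Debian family) are the names those distributions commonly use, but treat them as assumptions and check your own repository. The helper name have_perl_module is made up for this example.

```shell
# Return success if the perl found in PATH can load the named module.
# (have_perl_module is a hypothetical helper, not part of the plugin.)
have_perl_module() {
    perl -M"$1" -e 1 >/dev/null 2>&1
}

if have_perl_module Net::SNMP; then
    echo "Net::SNMP is available"
else
    echo "Net::SNMP is missing. Depending on your distribution, try one of:"
    echo "  yum install perl-Net-SNMP          # assumed Red Hat family package name"
    echo "  apt-get install libnet-snmp-perl   # assumed Debian family package name"
    echo "  cpan Net::SNMP                     # fallback: install from CPAN"
fi
```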

3   Getting started

Attention!

This is a short HOWTO that describes how to get started with using check_dell_bladechassis. This HOWTO assumes that the prerequisites are met, and that you have a Nagios server up and running. Nagios version 3.x is assumed.

The examples below show only very basic usage of check_dell_bladechassis. There are many more or less advanced options that you might find useful. See the usage section for info.

3.1   Creating a hostgroup

The first thing you want to do is create a hostgroup that contains your blade enclosures. If you have very few enclosures you can skip this step and use hosts in the service definition instead, but I think hostgroups are always better:

# hostgroup for Dell blade enclosures
define hostgroup {
    hostgroup_name  dell-bladecenters
    alias           Dell bladecenters
}

3.2   Defining the hosts

You'll need a host definition for each of the enclosures. If you are an experienced Nagios admin you already know this, of course:

define host {
    host_name       my-bladecenter1.foo.org
    alias           my-bladecenter1
    address         192.168.10.12
    use             generic-host
    hostgroups      dell-bladecenters
    contact_groups  example@foo.org
}

3.3   Creating a servicegroup

Next you want to create a servicegroup for this service. This is not required, but it makes things easier when you want to inspect your Dell blade enclosures via Nagios' web interface. Creating a servicegroup is simple:

# Servicegroup for Dell blade enclosures
define servicegroup {
    servicegroup_name         dell-bladechassis
    alias                     Dell server health status
}

The servicegroup is used later in the service definition.

3.4   Defining a command

The next step is to define a command for check_dell_bladechassis:

# Dell blade enclosure check
define command {
    command_name    check_dell_bladechassis
    command_line    /path/to/check_dell_bladechassis -H $HOSTADDRESS$
}

Note that this is a very basic example of check_dell_bladechassis usage. Refer to the usage section for info about the different options that alter the behaviour of check_dell_bladechassis.
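
For 1855/1955 enclosures, which should be probed with SNMP version 1 as described at the top of this document, one option is a second command that always passes -P 1. The sketch below prints such a definition from a shell function so that it can be redirected into a config file. The function name and the command name check_dell_bladechassis_snmpv1 are placeholders chosen for this example, and /path/to is left as a placeholder just as in the definition above.

```shell
# Print a Nagios command definition that forces SNMP version 1.
# The quoted 'EOF' keeps the shell from expanding $HOSTADDRESS$.
# (Function and command names are hypothetical.)
print_snmpv1_command() {
    cat <<'EOF'
# Dell blade enclosure check for 1855/1955 (SNMP version 1)
define command {
    command_name    check_dell_bladechassis_snmpv1
    command_line    /path/to/check_dell_bladechassis -H $HOSTADDRESS$ -P 1
}
EOF
}

print_snmpv1_command
```

Redirect the output into a file that your Nagios configuration includes, and use the new command name as check_command in the service definition for the old enclosures.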

3.5   Defining the service

Finally, you define the service:

define service {
    use                       generic-service
    hostgroup_name            dell-bladecenters
    service_description       Dell blade enclosure health
    servicegroups             dell-bladechassis
    check_command             check_dell_bladechassis
    action_url                https://$HOSTNAME$/
    notes_url                 http://folk.uio.no/trondham/software/check_dell_bladechassis.html
}

The action_url and notes_url directives are optional.

4   Usage

The plugin queries the monitored host remotely via SNMP. Prerequisites for this are that the monitored host is running SNMP, and that the Nagios server is allowed to communicate with the enclosure over SNMP. The -H|--hostname option is needed for the hostname/IP you want to check.

$ check_dell_bladechassis -H my-bladecenter1
OK - System: 'PowerEdge M1000e', SN: 'XXXXXXX', Firmware: '2.00', hardware working fine

You can specify the SNMP community string (for SNMP version 1 and 2c) with the -C|--community option. If the option is not present, the community defaults to "public":

$ check_dell_bladechassis -H my-bladecenter2 -C mycommunity
OK - System: 'DRAC/MC', SN: 'XXXXXXX', Firmware: '1.5.0 (Build 10.01)', hardware working fine

For other SNMP options, refer to the manual page.

4.1   Output control

The default behaviour of the plugin is to print all alerts on separate lines with no extra fuss:

$ check_dell_bladechassis -H my-bladecenter1
Blade subsystem health status is Critical
Global system health status is Critical

There are several options that allow you to alter this, as listed below.

4.1.1   Prefix alerts with the service state

The -s|--state option will prefix each alert with the full service state:

$ check_dell_bladechassis -H my-bladecenter1 -s
CRITICAL: Blade subsystem health status is Critical
CRITICAL: Global system health status is Critical
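
Nagios itself takes the service state from the plugin's exit status, following the usual plugin convention of 0 for OK, 1 for WARNING, 2 for CRITICAL and 3 for UNKNOWN. When testing from a shell it can help to translate that number back into a name. The following is a minimal sketch. It assumes that check_dell_bladechassis follows this convention like other Nagios plugins, and state_name is a helper invented for this example.

```shell
# Translate a Nagios plugin exit status into the state name.
# (state_name is a hypothetical helper, not part of the plugin.)
state_name() {
    case "$1" in
        0) echo "OK" ;;
        1) echo "WARNING" ;;
        2) echo "CRITICAL" ;;
        *) echo "UNKNOWN" ;;
    esac
}

# Example with a fixed status; after a real run you would pass "$?" instead.
state_name 2
```

After running the plugin by hand, calling state_name "$?" on the next line shows the state that Nagios would record for that run.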

4.1.2   Prefix alerts with the service state (abbreviated)

The --short-state option does the same, except that the service state is abbreviated to a single letter, i.e. C for CRITICAL, W for WARNING etc.:

$ check_dell_bladechassis -H my-bladecenter1 --short-state
C: Blade subsystem health status is Critical
C: Global system health status is Critical

4.1.3   Prefix alerts with the service tag

The option -i|--info will prefix all alerts with the service tag:

$ check_dell_bladechassis -H my-bladecenter1 -i
[XXXXXXX] Blade subsystem health status is Critical
[XXXXXXX] Global system health status is Critical

4.1.4   System info after the alert(s)

The option -e|--extinfo will print the system model, service tag and firmware version on a separate line after the alert(s):

$ check_dell_bladechassis -H my-bladecenter1 -e
Blade subsystem health status is Critical
Global system health status is Critical
------ SYSTEM: PowerEdge M1000e, SN: XXXXXXX, FW: 2.00

4.1.5   Combination of output options

You can combine any of these options. Example:

$ check_dell_bladechassis -H my-bladecenter1 -s -e
CRITICAL: Blade subsystem health status is Critical
CRITICAL: Global system health status is Critical
------ SYSTEM: PowerEdge M1000e, SN: XXXXXXX, FW: 2.00

Which (combination) of these options you choose to use, if any, depends on how you use Nagios and your personal preference.

4.2   Debug output

If given the -d or --debug option, check_dell_bladechassis will output messages about all the checked components, along with their respective alert states. If supported by the enclosure (i.e. M1000e), the plugin will also output power supply data and total power usage. Example debug output from an M1000e is given below.

$ check_dell_bladechassis -H my-bladecenter1 -d
   System:      PowerEdge M1000e
   ServiceTag:  XXXXXXX
   Firmware:    2.00
-----------------------------------------------------------------------------
   System Component Status
=============================================================================
  STATE  |  MESSAGE TEXT
---------+-------------------------------------------------------------------
      OK | IO Module (IOM) subsytem health status is Ok
      OK | KVM subsystem health status is Ok
      OK | Redundancy status is Ok
      OK | Power subsystem health status is Ok
      OK | Fan subsystem health status is Ok
CRITICAL | Blade subsystem health status is Critical
      OK | Temperature sensor subsystem health status is Ok
      OK | Chassis Management Controller (CMC) health status is Ok
CRITICAL | Global system health status is Critical
-----------------------------------------------------------------------------
   System Power Readings
=============================================================================
   Power Supply 1 (PS-1) voltage reading: 231.5 V
   Power Supply 2 (PS-2) voltage reading: 233.8 V
   Power Supply 3 (PS-3) voltage reading: 229.2 V
   Power Supply 4 (PS-4) voltage reading: 240.5 V
   Power Supply 5 (PS-5) voltage reading: 240.5 V
   Power Supply 6 (PS-6) voltage reading: 241.8 V
------------------------------------------------------------
   Power Supply 1 (PS-1) amperage reading: 4.66 A
   Power Supply 2 (PS-2) amperage reading: 0.25 A
   Power Supply 3 (PS-3) amperage reading: 0.25 A
   Power Supply 4 (PS-4) amperage reading: 4.61 A
   Power Supply 5 (PS-5) amperage reading: 0.31 A
   Power Supply 6 (PS-6) amperage reading: 0.27 A
------------------------------------------------------------
   Total chassis power usage: 2300 W
   Total chassis current usage: 10.406 A

Debug output from a 1855/1955 chassis is depressing in comparison:

$ check_dell_bladechassis -H my-bladecenter2 -d
   System:      DRAC/MC
   ServiceTag:  XXXXXXX
   Firmware:    1.5.0 (Build 10.01)
-----------------------------------------------------------------------------
   System Component Status
=============================================================================
  STATE  |  MESSAGE TEXT
---------+-------------------------------------------------------------------
      OK | Global system health status is Ok

The limited output from the 1855/1955 is due to limitations in available information via SNMP.

Warning

The option -d|--debug is intended for diagnostics and debugging purposes only. Do not use this option from within Nagios, i.e. in your Nagios config.
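
When diagnosing several enclosures by hand, a small loop saves typing. The sketch below runs the plugin with -d against each host passed to the function. The PLUGIN variable and the function name are assumptions made for this example, and the path is the same placeholder used in the command definition.

```shell
# Path to the plugin; override via the environment if it lives elsewhere.
PLUGIN=${PLUGIN:-/path/to/check_dell_bladechassis}

# Run the plugin in debug mode against every host given as an argument.
# (debug_enclosures is a hypothetical helper for manual use only.)
debug_enclosures() {
    for host in "$@"; do
        echo "=== $host ==="
        "$PLUGIN" -H "$host" -d
    done
}
```

For example, debug_enclosures my-bladecenter1 my-bladecenter2 prints the debug report for both enclosures, each under its own header line.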

4.3   Multiple line output, turn off escaping HTML tags

The output from check_dell_bladechassis contains multiple lines separated by HTML linebreaks (<br/>) if run as a command within Nagios, via NRPE etc. If run from a console which has a TTY, i.e. if you log in via SSH or similar and run check_dell_bladechassis manually, the linebreaks will be regular linebreaks.

Nagios 3.x allows the following option in cgi.cfg:

# ESCAPE HTML TAGS
# This option determines whether HTML tags in host and service
# status output is escaped in the web interface.  If enabled,
# your plugin output will not be able to contain clickable links.

escape_html_tags=1

The default, as seen above in the sample cgi.cfg from the distribution, is that HTML tags are escaped. My advice is to turn this off. If not, you will see output like this in your Nagios console:

Blade subsystem health status is Critical<br/>Global system health status is Critical

instead of this:

Blade subsystem health status is Critical
Global system health status is Critical
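
Turning the option off means setting escape_html_tags=0 in cgi.cfg. A minimal sketch of doing that from the shell follows. The function name is invented here, the location of cgi.cfg varies between installations, and the function keeps a backup copy before changing anything.

```shell
# Set escape_html_tags=0 in the given cgi.cfg, keeping a .bak copy.
# (disable_html_escaping is a hypothetical helper; pass your own cgi.cfg path.)
disable_html_escaping() {
    cfg=$1
    cp "$cfg" "$cfg.bak" || return 1
    # Rewrite only the option line; comments and other settings are untouched.
    sed 's/^escape_html_tags=.*/escape_html_tags=0/' "$cfg.bak" > "$cfg"
}
```

Call it with the path to your own cgi.cfg as the only argument. The original file is left next to it with a .bak suffix.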

With Nagios 3.x, plugins are allowed to output multiple lines with regular linebreaks, but only the first line is shown in the web interface (status.cgi).

4.4   Full usage information

Usage information gathered with check_dell_bladechassis -h:

Usage: check_dell_bladechassis -H <HOSTNAME> [OPTION]...

OPTIONS:
   -H, --hostname      Hostname or IP of the enclosure
   -C, --community     SNMP community string
   -P, --protocol      SNMP protocol version
   --port              SNMP port number
   -p, --perfdata      Ouput performance data
   -t, --timeout       Plugin timeout in seconds
   -i, --info          Prefix any alerts with the service tag
   -e, --extinfo       Append system info to alerts
   -s, --state         Prefix alerts with alert state
   --short-state       Prefix alerts with alert state (abbreviated)
   -d, --debug         Debug output, reports everything
   -h, --help          Display this help text
   -V, --version       Display version info

For more information and advanced options, see the manual page.

5   Performance data

check_dell_bladechassis will output performance data if the --perfdata or -p option is used. Performance data is only available on the M1000e enclosure. An example graph using PNP4Nagios is given below.

[Image: example PNP4Nagios graph]

The template used to generate these graphs is available as check_dell_bladechassis.template. Save the file and rename it to check_dell_bladechassis.php.

Note

The PNP4Nagios template is included in the tarball and zip archive.

6   Download

6.2   Changelog / Old versions

Version   Date         Changes
1.0.0     2009-08-04   Initial release

7   Reporting bugs, proposing new features etc.

Please let me know if you are experiencing bugs, have feature requests, or suggestions on how to improve check_dell_bladechassis. We use this plugin in production at the University of Oslo, but we don't use all the different features of the plugin. While the plugin is bug-free for us, it might not be for you, so let me know if you have problems.

Please send bug reports or feature requests to the Nagios users mailing list. I read postings to this list frequently:

nagios-users@lists.sourceforge.net

You can also email me directly, but then other users won't benefit from the discussion. Unless security issues or other concerns prevent you from using the mailing list, it is better to discuss problems in a public forum.

Depending on the time of day etc., you can also reach me on IRC, as trondham on the #nagios channel on Freenode.

8   Disclaimer

This is free software. Use at your own risk.