Nagios

Nagios: Service Check Timed Out

Since I got the pleasure of watching some Windows boxen with Nagios, I took the Windows Update plugin from Michal Jankowski and implemented it. It took me some time to set up nsclient++ correctly so that it just works, but until now the check plugin sometimes reported the usual “Service Check Timed Out”. Usually I ended up increasing the cscript timeout or the nsclient++ socket timeout, but it still kept showing up. Since I rely heavily on my surveillance tools, I need as few false positives as possible. So I ended up chasing down this error today, and in hindsight I have to say it was quite simple. ...

Nagios: Integrating Cisco switches

Well, as I wrote recently, we received a new BladeCenter a few weeks back. Now, as we slowly take it into service, I was interested in watching the utilization of the backplanes as well as the CPU utilization of the Cisco Catalyst 3012 network switches. The first mistake I made was to trust Cisco's guide on how to get the utilization from the device using SNMP. It listed some OIDs, which I tried with snmpwalk and got a result from:

    snmpwalk -v1 -c public -O n 10.0.0.35 .1.3.6.1.4.1.9.5.1.1.8
    .1.3.6.1.4.1.9.5.1.1.8.0 = INTEGER: 0

Now, as I tried retrieving the SNMP data by means of the check_snmp plugin, I got some flaky results:

    /usr/lib/nagios/plugins/check_snmp -H 10.0.0.35 -C public .1.3.6.1.4.1.9.5.1.1.8
    SNMP problem - No data received from host
    CMD: /usr/bin/snmpget -t 1 -r 5 -m '' -v 1 [authpriv] 10.0.0.35:161

Those of you who read the excerpts carefully will notice the difference between the OID snmpwalk returned and the OID I passed to check_snmp. The point being, the OIDs Cisco gave in their Design tech notes are either old or just not accurate at all. After appending .0 to each OID given by Cisco, check_snmp is all hunky-dory and integrated into Nagios. As usual, the Nagios definitions are further down, for those interested.
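The difference described above (snmpwalk answers with a trailing .0, while check_snmp was handed the OID without it) can be sketched as a small Bash helper. This is only an illustration: scalar_oid is a made-up name, not part of Nagios or Net-SNMP, and it assumes the OID refers to a scalar value whose single instance is .0.

```shell
#!/bin/bash
# Hypothetical helper: append the scalar instance suffix ".0" to an OID
# unless it is already present. snmpget-style tools need the full instance
# OID, whereas snmpwalk descends the subtree and finds the instance itself.
# Assumption: the OID names a scalar object, not a table column.
scalar_oid() {
    local oid="$1"
    case "$oid" in
        *.0) printf '%s\n' "$oid" ;;    # suffix already there, leave as is
        *)   printf '%s.0\n' "$oid" ;;  # add the instance suffix
    esac
}

# Example with the OID from the excerpt above
scalar_oid .1.3.6.1.4.1.9.5.1.1.8
```

The result could then be passed to check_snmp via its -o option, for example `check_snmp -H 10.0.0.35 -C public -o "$(scalar_oid .1.3.6.1.4.1.9.5.1.1.8)"`.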

Nagios virtualization

As virtualization seems to be a trendy thing to do, I went ahead and virtualized our Nagios (while reinstalling the whole thing …). When I got to work today and started my email client, I received 4 Nagios warnings about a LOAD service reaching critical state. I looked at the Nagios box itself, opened up the VM console, looked into the syslog. Nothing. Yet over 3/4 of the services were flapping, and some ping checks were critical (for whatever reason). So I opened the Nagios web interface again and noticed it dropping the connection over and over again (I had to reauthenticate again and again). ...

Nagios Hostgroup Inheritance

As I wrote earlier, I recently virtualized our Nagios. Along with that came a complete "redesign" of how checks are applied. Up till now, I defined checks for each and every single server, thus ending up with ~25 files, each holding roughly 6 checks which are in the same file just sorted by hostname. As you can imagine, it gets quite confusing with that amount of checks (~150). So I spent the last two days working out (with Visio) on which object/hostgroup placing a check would make sense. Now, this is my first result of two days of planning, reorganizing, reordering and moving hosts into different hostgroups. ...

MessPC Ethernetbox 2 and Nagios

As I talked to Tobi yesterday, we came to talk about our Ethernet Box thermometer. It’s a neat device, which works pretty much out of the box. Integrating it with Nagios is a bit of a bummer. That’s what the ~300 EUR box looks like. It’s basically a small black box with an RJ45 jack, and four RJ11 jacks for attached external devices. The box itself only functions as a "management station" and doesn’t come with a sensor. Normally, you can attach up to four RJ11 sensors to it. But MessPC also has RJ11 port splitters, which enable you to attach up to eight RJ11 sensors to the MessPC. As you can see, the box has an RJ45 jack on the other side, which you basically hook up to your network and then configure an IP address (or if you fancy DHCP for those things, it’s possible too). On the opposite side are the RJ11 jacks for the sensors. As you can see, we currently have 4 splitters attached to the box, enabling up to 8 sensors to be measured. Once you have it up and running, you can look at the web interface and you’ll be able to see the state of the sensors right on the first page.

Monitoring Brocade FC switches with SNMP/Nagios

I looked into the mess a bit more, and as it turns out, the weird crap I was talking about only happens if you have a port with a LossofSynchronization, LossofSignal or LinkFailures value with the base of ten (i.e. 10, 101 or 10,000). Additionally, the OIDs for those three failure elements seem to depend on the firmware version, as with 6.3.x they appear under different OIDs. So I may need to introduce another command-line switch, which selects the firmware version and, depending on that, the OID. ...
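The command-line switch mentioned above might look roughly like this in Bash. This is a sketch under loud assumptions: the -f switch, both function names and the table names "fw63" and "default" are placeholders made up for illustration, and no real OIDs are included because the excerpt does not list them.

```shell
#!/bin/bash
# Hypothetical: map a firmware version to the name of an OID table.
# Only 6.3.x is treated specially, as that is the only version the
# excerpt mentions; everything else falls back to "default".
oid_table_for_firmware() {
    local version="$1"
    case "$version" in
        6.3|6.3.*) echo "fw63" ;;
        *)         echo "default" ;;
    esac
}

# Hypothetical: parse a "-f <firmware-version>" switch and print the
# OID table that would be used for it.
parse_firmware_switch() {
    local OPTIND=1 opt firmware=""
    while getopts "f:" opt; do
        case "$opt" in
            f) firmware="$OPTARG" ;;
            *) return 1 ;;
        esac
    done
    oid_table_for_firmware "$firmware"
}

parse_firmware_switch -f 6.3.1
```

A real check script would then look up the three failure-counter OIDs in whichever table was selected.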

Monitoring Brocade FC switches with Nagios

I spent the last four days looking for ways to monitor a Brocade Fibre Channel switch (in my case IBM 2145 B32/F40). The first thing I came up with was using SNMP. As it was already configured for the previous monitoring with Munin, getting information should be quite easy. After looking through Google for a bit, I found one script that worked for me. The only trouble I had with that script is that it crams every single port into one result. As I wanted something that a) could watch a single port and b) return performance data, I went ahead and used the script to do a basic rewrite. But after a short while, I grew antsy and started writing a script from scratch, using the OIDs I got from that script and a Cacti template. ...

NetApp: Monitoring of SnapVault/SnapMirror/LUN/Snapshot information with Nagios

As I wrote before, we have a bunch of filers (and a ton of volumes w/ LUNs on them) that I need to monitor. At first, I tried the existing NetApp Nagios plugin(s), but they all use SNMP, and with that I can either watch all volumes or none. And that didn’t satisfy me. Don’t get me wrong, the existing plugins are okay and I still use them for stuff (like GLOBALSTATUS or FAN/CPU/POWER) which isn’t present in the API or is really hard to get at; however, I wanted more. So I ended up looking at the NetApp API and writing a “short” plugin for Nagios using Perl. Maybe if I’m ever bored, I’ll rewrite it using C, but for now the Perl plugin has to suffice. So far the plugin supports the following things:

- Monitoring FlexVolumes (simply watching the free space)
- Monitoring LUN space (the allocated space inside a FlexVolume for iSCSI/FC LUNs)
- Monitoring Snapshot space (the allocated space inside a FlexVolume for Snapshots)
- Monitoring SnapVault relations (and their age)
- Monitoring SnapMirror relations (and their age)

The plugin will return performance data for most (if not all) of those classes. It needs a user on the filer you wish to monitor, which sadly needs to have the admin role.
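To illustrate what "returning performance data" means for a Nagios plugin, here is a minimal Bash sketch of the reporting side only. It is not the Perl plugin described above: report_usage, the label "VOLSPACE vol1" and the sample value are made up, and the thresholds 92/98 are simply borrowed from the volume check in the config generator script. What it relies on is the standard plugin convention: exit code 0 for OK, 1 for WARNING, 2 for CRITICAL, and performance data after a pipe character in the form label=value;warn;crit;min;max.

```shell
#!/bin/bash
# Hypothetical sketch: compare an integer usage percentage against warning
# and critical thresholds, print a status line with performance data, and
# return the matching Nagios plugin exit code.
report_usage() {
    local label="$1" used="$2" warn="$3" crit="$4"
    local state="OK" rc=0

    if [ "$used" -ge "$crit" ]; then
        state="CRITICAL"; rc=2
    elif [ "$used" -ge "$warn" ]; then
        state="WARNING"; rc=1
    fi

    # Everything after the pipe is performance data for graphing tools
    echo "${label} ${state} - ${used}% used | used=${used}%;${warn};${crit};0;100"
    return "$rc"
}

# Example call with made-up numbers; capture the return code explicitly
rc=0
report_usage "VOLSPACE vol1" 95 92 98 || rc=$?
echo "exit code: $rc"
```

In a real plugin the usage value would come from the filer, and the script would end with `exit "$rc"` so Nagios sees the state.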

Generate Nagios config for check_netapp-api-pl

As so often, I wanted a script that crawls my filers and regenerates the configuration whenever volumes/snapvaults/snapmirrors have been added or removed.

    #!/bin/bash

    FAS_HOSTS="$( ls /etc/nagios/objects/hosts/san/fas*{a,b}.cfg | cut -d/ -f7 | cut -d. -f1 )"

    for host in $FAS_HOSTS; do
        OUTPUT_FILE="/etc/nagios/objects/hosts/san/$host-vol.cfg"

        # Clear the output file
        echo "" > "$OUTPUT_FILE"

        # Get the volume list (everything named vol*, except the root volume vol0)
        for volume in $( ssh "$host" vol status | awk '{ print $1 }' | grep '^vol' | sort -u | grep -v 'vol0$' ); do
            user="$( grep "USER=" "/etc/netapp-sdk/$host" | cut -d= -f2 )"
            pass="$( grep "PASS=" "/etc/netapp-sdk/$host" | cut -d= -f2 )"

            # echo "define service {"
            # echo "    use                     generic-service"
            # echo ""
            # echo "    check_command           check_netapp-volfree!$user!$pass!${volume}!92!98"
            # echo "    check_interval          5"
            # echo "    host_name               ${host}"
            # echo "    notifications_enabled   0"
            # echo "    notification_interval   720"
            # echo "    service_description     VOLSPACE ${volume}"
            # echo "}"
            echo
            echo "define service {"
            echo "    use                     generic-service-san-perfdata"
            echo ""
            echo "    check_command           check_netapp-lunspace!$user!$pass!${volume}"
            echo "    check_interval          5"
            echo "    host_name               ${host}"
            echo "    notifications_enabled   0"
            echo "    notification_interval   720"
            echo "    service_description     LUNSPACE ${volume}"
            echo "}"
            echo

            # Seventh space-separated field of the "snap reserve" output is the percentage
            SR="$( ssh "$host" snap reserve "$volume" | cut -d' ' -f7 )"

            # Only monitor the snap reserve if one is configured
            if [ "$SR" != "0%" ] ; then
                echo "define service {"
                echo "    use                     generic-service-san-perfdata"
                echo ""
                echo "    check_command           check_netapp-snapreserve!$user!$pass!${volume}"
                echo "    check_interval          10"
                echo "    host_name               ${host}"
                echo "    notifications_enabled   0"
                echo "    notification_interval   720"
                echo "    # SR: $SR"
                echo "    service_description     SNAPRESERVE ${volume}"
                echo "}"
                echo
            fi
        done | tee -a "$OUTPUT_FILE"

        # Check snapvault foo
        for sv in $( ssh "$host" snapvault status -l 2>/dev/null | awk '{ print $2 }' | grep vol ); do
            # only do the checks on sv_secondary
            if [ "$( echo "$sv" | grep "$host" | cut -d: -f1 )" == "${host}" ]; then
                vol="$( echo "$sv" | cut -d/ -f3 )"
                user="$( grep "USER=" "/etc/netapp-sdk/$host" | cut -d= -f2 )"
                pass="$( grep "PASS=" "/etc/netapp-sdk/$host" | cut -d= -f2 )"

                echo "define service {"
                echo "    use                     generic-service-san-perfdata"
                echo ""
                echo "    check_command           check_netapp-snapvault!$user!$pass!$vol!38!42!"
                echo "    check_interval          60"
                echo "    host_name               ${host}"
                echo "    notifications_enabled   0"
                echo "    notification_interval   720"
                echo "    service_description     SNAPVAULT ${vol}"
                echo "}"
                echo
            fi
        done | tee -a "$OUTPUT_FILE"

        # Check snapmirror foo
        for sm in $( ssh "$host" snapmirror status 2>/dev/null | awk '{ print $2 }' | grep vol | grep "$host" ); do
            # only do the checks on sm_secondary
            if [ "$( echo "$sm" | grep "$host" | cut -d: -f1 )" == "${host}" ]; then
                vol="$( echo "$sm" | cut -d/ -f3 | cut -d: -f2 )"
                user="$( grep "USER=" "/etc/netapp-sdk/$host" | cut -d= -f2 )"
                pass="$( grep "PASS=" "/etc/netapp-sdk/$host" | cut -d= -f2 )"

                echo "define service {"
                echo "    use                     generic-service-san-perfdata"
                echo ""
                echo "    check_command           check_netapp-snapmirror!$user!$pass!$vol!38!42!"
                echo "    check_interval          60"
                echo "    host_name               ${host}"
                echo "    notifications_enabled   0"
                echo "    notification_interval   720"
                echo "    service_description     SNAPMIRROR ${vol}"
                echo "}"
                echo
            fi
        done | tee -a "$OUTPUT_FILE"
    done

Generate Nagios config for NetApp filers

At some point in the last few weeks, I repeatedly had to recreate my Nagios config for currently six filers. After doing that a few times, I ended up (like sooo often) writing a short Bash script that’ll do this for me - without any fuss. The only thing the script needs is that the filers and their -sp counterparts are registered in DNS … Here’s an example:

    fas3240a     IN A 172.31.76.150
    fas3240a-sp  IN A 172.31.74.150
    fas3240b     IN A 172.31.76.151
    fas3240b-sp  IN A 172.31.74.151

With that done, the script will create the necessary Nagios config for those filers.
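The script itself is not part of this excerpt, but the core idea (turn a name and its address into a Nagios host object) could be sketched like this. Everything here is an assumption for illustration: the function name print_host_definition and the generic-host template are made up, and the address is passed in as an argument instead of being looked up in DNS.

```shell
#!/bin/bash
# Hypothetical sketch: print a Nagios host definition for one filer.
# "generic-host" is an assumed template name; use whatever template
# your own Nagios configuration provides.
print_host_definition() {
    local name="$1" address="$2"
    cat <<EOF
define host {
    use        generic-host
    host_name  ${name}
    address    ${address}
}
EOF
}

# Example with the first DNS record from above
print_host_definition fas3240a 172.31.76.150
```

On a Linux box the address could be resolved first, for example with `getent hosts fas3240a | awk '{ print $1 }'`, and the output redirected into a .cfg file per filer.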