Nagios

From Wikitech
Revision as of 12:58, 23 May 2010 by Midom (Talk | contribs)

Nagios ( http://www.nagios.org/ ) is host and service monitoring software. It consists of a binary daemon, some CGI scripts for the web interface, and binary plugins that perform the individual checks.

It can be set to monitor services such as ssh, squid status and the mysql socket, as well as the number of users logged in, load and disk usage. There are two alarm levels (warning and critical), and the notification system is fully customizable (groups of users; notification by email, IRC or pager; stop notifying after x alarms; and so on).
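
The two alarm levels map onto the exit codes that Nagios plugins conventionally return (0 for OK, 1 for WARNING, 2 for CRITICAL, 3 for UNKNOWN). The following is a minimal sketch of that mapping, not one of our actual checks; the function name classify_value and its integer thresholds are hypothetical and only illustrate the idea.

```shell
# Sketch: map a measured integer onto the standard Nagios plugin states.
# Exit codes: 0 = OK, 1 = WARNING, 2 = CRITICAL, 3 = UNKNOWN.
# Usage: classify_value VALUE WARN CRIT   (hypothetical helper, integers only)
classify_value() {
    if [ $# -ne 3 ]; then
        echo "UNKNOWN - usage: classify_value VALUE WARN CRIT"
        return 3
    fi
    # Reject empty or non-numeric arguments rather than guessing.
    for n in "$1" "$2" "$3"; do
        case "$n" in
            ''|*[!0-9]*) echo "UNKNOWN - non-numeric input: '$n'"; return 3 ;;
        esac
    done
    # Test the critical threshold first so it takes precedence over warning.
    if [ "$1" -ge "$3" ]; then
        echo "CRITICAL - value $1 >= $3"
        return 2
    elif [ "$1" -ge "$2" ]; then
        echo "WARNING - value $1 >= $2"
        return 1
    fi
    echo "OK - value $1"
    return 0
}
```

For example, classify_value 85 80 90 prints "WARNING - value 85 >= 80" and returns 1, which Nagios would treat as a warning-level alarm.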

Our installation can be found at http://nagios.wikimedia.org/ which is currently an alias to bart.

Installation

For FC3 on i386, install every RPM in /home/wikipedia/rpms/nagios . They come from http://dag.wiers.com/ . For other architectures, binary or source RPMs are available, but you'll have to download them yourself. The packages needed are:

  • nagios
  • nagios-plugins
  • fping
  • perl-Net-SNMP
  • perl-Crypt-DES
  • perl-Socket6

After installing, do this:

cp /home/wikipedia/conf/nagios/* /etc/nagios/
service nagios start

and you're away.

Configuration

There's a configurator script for adding hosts, host groups, services and service groups at /home/wikipedia/conf/nagios/conf.php . Run it somewhere with PHP CLI installed, i.e. not bart. The configurator writes to a file called hosts.cfg in the current directory.

cd /home/wikipedia/conf/nagios
./sync

Most host groups (the ones in $hostGroups) are based on dsh node group files. Where such a node group exists, this is preferred for maintainability reasons; otherwise you can list miscellaneous hosts inline using $listedHosts. Some service groups (e.g. Apache and Squid) are just replicas of the host groups, while others (such as Lucene and Memcached) are taken from the MediaWiki configuration. Services may also be listed inline using $listedServices, but again, this is not preferred.

Other configuration should be done by editing the *.cfg files on NFS and then copying to bart. Keeping two up-to-date copies like this protects us against failure of the monitoring host or NFS.

If nagios refuses to restart due to a configuration error, you can get more information by running this on the monitoring host:

nagios -v /etc/nagios/nagios.cfg

The error messages can be cryptic at times.
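
To avoid a failed restart in the first place, the verification can be run before restarting. This is a sketch only: the function name checked_restart is hypothetical, and it assumes the init script is called nagios, as in the installation section.

```shell
# Sketch: restart Nagios only if the configuration verifies cleanly.
# NAGIOS_BIN and NAGIOS_CFG default to the command and path used above.
NAGIOS_BIN="${NAGIOS_BIN:-nagios}"
NAGIOS_CFG="${NAGIOS_CFG:-/etc/nagios/nagios.cfg}"

checked_restart() {
    # "nagios -v" exits non-zero when the configuration has errors.
    if "$NAGIOS_BIN" -v "$NAGIOS_CFG"; then
        service nagios restart
    else
        echo "Configuration check failed; not restarting." >&2
        return 1
    fi
}
```

Run checked_restart as root on the monitoring host after copying new *.cfg files into place; if the check fails, the running daemon is left alone.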

Custom Checks

Custom checks can be found in the private SVN repository under ops/nagios-checks.

Examples include:

  • check_stomp.pl
  • check_all_memcached.php

Authentication

To add a user or update a password:

  1. Log in to bart
  2. Run htpasswd /home/wikipedia/conf/nagios/htpasswd.users <user>
  3. Log in to zwinger
  4. Run /home/wikipedia/conf/nagios/sync

You need to use the htpasswd binary on bart, but the password file then needs to be synchronised between NFS and the local files on bart.

IRC notification

There's a contact called "irc" in a contact group called "admins" which currently does the IRC notification. Messages are appended to /var/log/nagios/irc.log and picked up by an IRC client. Our IRC client (ircecho) can be started with:

tail -n0 -f /var/log/nagios/irc.log | /home/wikipedia/bin/ircecho \#wikimedia-tech nagios-wm irc.freenode.net

or words to that effect.
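
The writing side of this pipeline is just an append to the log file. As a rough sketch of what the notification command for the "irc" contact could look like (the function name irc_notify and the message format are assumptions, not our actual command definition):

```shell
# Sketch: append one notification line to the log that the IRC client tails.
# IRC_LOG defaults to the path given above.
# Usage: irc_notify MESSAGE...
# In a Nagios command definition the arguments would typically be built from
# macros such as $NOTIFICATIONTYPE$, $HOSTNAME$ and $SERVICESTATE$.
IRC_LOG="${IRC_LOG:-/var/log/nagios/irc.log}"

irc_notify() {
    # Join all arguments with spaces and flatten embedded newlines, so that
    # every notification ends up as exactly one line (one IRC message).
    printf '%s\n' "$(printf '%s' "$*" | tr '\n' ' ')" >> "$IRC_LOG"
}
```

Because ircecho reads from tail -f, each appended line shows up in the channel as soon as it is written.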

NRPE

To install NRPE on an Ubuntu server:

apt-get update
apt-get -y install nagios-nrpe-server nagios-plugins
cp /home/wikipedia/conf/nagios/nrpe-debian.cfg /etc/nagios/nrpe.cfg
invoke-rc.d nagios-nrpe-server restart
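
Once the NRPE daemon is running, it can be queried from the monitoring host with check_nrpe. The sketch below makes two assumptions: that check_nrpe is installed at the usual Debian/Ubuntu plugin path, and that the remote nrpe.cfg defines a command named check_load. The wrapper name nrpe_check is hypothetical.

```shell
# Sketch: query a remote NRPE daemon from the monitoring host.
# The check_nrpe path is an assumption; adjust it to your plugin directory.
CHECK_NRPE="${CHECK_NRPE:-/usr/lib/nagios/plugins/check_nrpe}"

# Usage: nrpe_check HOST [COMMAND]
# COMMAND must be defined in the remote nrpe.cfg; check_load is only a guess.
nrpe_check() {
    host=$1
    cmd=${2:-check_load}
    if [ -z "$host" ]; then
        echo "usage: nrpe_check HOST [COMMAND]" >&2
        return 3
    fi
    # -H selects the remote host, -c the remote command to execute.
    "$CHECK_NRPE" -H "$host" -c "$cmd"
}
```

The exit status is passed through from check_nrpe, so the wrapper can be used directly to confirm that a freshly installed server answers.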