Nagios
Nagios ( http://www.nagios.org/ ) is host and service monitoring software. It consists of a binary daemon, some CGI scripts for the web interface, and binary plugins that perform the individual checks.
It can be set to monitor services such as ssh, Squid status and the MySQL socket, as well as the number of users logged in, load and disk usage. There are two levels of alarm (warning and critical), and the notification system is fully customizable (groups of users; notification by email, IRC or pager; stop notifying after x alarms; and so on).
Our installation can be found at http://nagios.wikimedia.org/ which is currently an alias to bart.
Installation
For FC3 on i386, install every RPM in /home/wikipedia/rpms/nagios . They come from http://dag.wiers.com/ . For other architectures, binary or source RPMs are available, but you'll have to download them. The packages needed are:
- nagios
- nagios-plugins
- fping
- perl-Net-SNMP
- perl-Crypt-DES
- perl-Socket6
After installing, do this:
cp /home/wikipedia/conf/nagios/* /etc/nagios/
service nagios start
and you're away.
Configuration
There's a configurator script for adding hosts, host groups, services and service groups at /home/wikipedia/conf/nagios/conf.php . Run it somewhere with PHP CLI installed, i.e. not bart. The configurator writes to a file called hosts.cfg in the current directory.
cd /home/wikipedia/conf/nagios
./sync
Most host groups (the ones in $hostGroups) are based on dsh node group files. Where such a node group exists, this is preferred for maintainability reasons. Otherwise you can list miscellaneous hosts inline using $listedHosts. Some service groups (e.g. Apache and Squid) are just replicas of the host groups; others (such as Lucene and Memcached) are taken from the MediaWiki configuration. Services may also be listed inline using $listedServices, but again, this is not preferred.
Other configuration should be done by editing the *.cfg files on NFS and then copying to bart. Keeping two up-to-date copies like this protects us against failure of the monitoring host or NFS.
If nagios refuses to restart due to a configuration error, you can get more information by running this on the monitoring host:
nagios -v /etc/nagios/nagios.cfg
The error messages can be cryptic at times.
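To avoid taking monitoring down with a broken configuration, the verification step can be chained in front of the restart. The following is a minimal sketch, not an existing script of ours: the function name safe_restart and the variables NAGIOS_BIN, NAGIOS_CFG and RESTART_CMD are illustrative assumptions, and the restart command may differ on your host.

```shell
#!/bin/sh
# Sketch: restart Nagios only if the configuration check passes.
# NAGIOS_BIN, NAGIOS_CFG and RESTART_CMD are illustrative names (assumptions);
# override them to match the monitoring host.
NAGIOS_BIN="${NAGIOS_BIN:-nagios}"
NAGIOS_CFG="${NAGIOS_CFG:-/etc/nagios/nagios.cfg}"
RESTART_CMD="${RESTART_CMD:-service nagios restart}"

safe_restart() {
    # "nagios -v" exits non-zero when the pre-flight check finds errors.
    if "$NAGIOS_BIN" -v "$NAGIOS_CFG"; then
        # Left unquoted on purpose so the command and its arguments are split.
        $RESTART_CMD
    else
        echo "Configuration check failed; not restarting." >&2
        return 1
    fi
}
```

Calling safe_restart on the monitoring host prints the usual verification output and restarts the daemon only when the check succeeds.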
Custom Checks
Custom checks can be found in the private SVN repository under ops/nagios-checks.
Examples include:
- check_stomp.pl
- check_all_memcached.php
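For anyone writing a new check, the sketch below shows the general shape of a plugin and how the two alarm levels map onto exit codes. It is not one of the checks in the repository; the function name check_threshold and its arguments are hypothetical. The only part taken from Nagios itself is the plugin convention of exit codes 0 (OK), 1 (WARNING), 2 (CRITICAL) and 3 (UNKNOWN), with a one-line status message on standard output.

```shell
#!/bin/sh
# Hypothetical check: classify a numeric value against warning and
# critical thresholds. Exit codes follow the Nagios plugin convention:
# 0 = OK, 1 = WARNING, 2 = CRITICAL, 3 = UNKNOWN.
check_threshold() {
    value="$1"; warn="$2"; crit="$3"

    # Reject missing or non-numeric arguments rather than guessing.
    for n in "$value" "$warn" "$crit"; do
        case "$n" in
            ''|*[!0-9]*)
                echo "UNKNOWN - expected three non-negative integers"
                return 3
                ;;
        esac
    done

    # Test the critical threshold first so it wins over warning.
    if [ "$value" -ge "$crit" ]; then
        echo "CRITICAL - value is $value (threshold $crit)"
        return 2
    elif [ "$value" -ge "$warn" ]; then
        echo "WARNING - value is $value (threshold $warn)"
        return 1
    fi
    echo "OK - value is $value"
    return 0
}
```

A real check would measure something first (for instance the number of logged-in users) and pass the result to the function along with the two thresholds, then exit with the function's return code.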
Authentication
To add a user or update a password:
- Log in to bart
- Run htpasswd /home/wikipedia/conf/nagios/htpasswd.users <user>
- Log in to zwinger
- Run /home/wikipedia/conf/nagios/sync
You need to use the htpasswd binary on bart, but the password file lives on NFS, so it must then be synchronised with the local files on bart.
IRC notification
There's a contact called "irc" in a contact group called "admins" which currently does the IRC notification. Messages are appended to /var/log/nagios/irc.log and picked up by an IRC client. Our IRC client (ircecho) can be started with:
tail -n0 -f /var/log/nagios/irc.log | /home/wikipedia/bin/ircecho \#wikimedia-tech nagios-wm irc.freenode.net
or words to that effect.
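The writing side of that pipe is just a command that appends one line per alert to the log. The sketch below illustrates the idea; it is not our actual notification command. The function name notify_irc, the IRC_LOG variable and the message layout are assumptions made for the example.

```shell
#!/bin/sh
# Sketch of a notification helper: append one line per alert to the log
# that the IRC client tails. IRC_LOG and the message layout are
# assumptions, not the real command definition.
IRC_LOG="${IRC_LOG:-/var/log/nagios/irc.log}"

notify_irc() {
    # $1 = notification type, $2 = host, $3 = state, $4 = plugin output
    printf '%s: %s is %s: %s\n' "$1" "$2" "$3" "$4" >> "$IRC_LOG"
}
```

In a Nagios command definition the four arguments would typically be filled from macros such as $NOTIFICATIONTYPE$, $HOSTNAME$, $HOSTSTATE$ and $HOSTOUTPUT$. Because each alert is a single appended line, the tail -f pipeline picks it up as soon as it is written.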
NRPE
To install NRPE on an Ubuntu server:
apt-get update
apt-get -y install nagios-nrpe-server nagios-plugins
cp /home/wikipedia/conf/nagios/nrpe-debian.cfg /etc/nagios/nrpe.cfg
invoke-rc.d nagios-nrpe-server restart