Nagios
Revision as of 18:47, 9 July 2010
Nagios ( http://www.nagios.org/ ) is host and service monitoring software consisting of a binary daemon, CGI scripts for the web interface, and binary plugins that perform the individual checks.
It can be set to monitor services such as ssh, squid status and the mysql socket, as well as the number of users logged in, load and disk usage. There are two levels of alarm (warning and critical), and the notification system is fully customizable (groups of users; notification by email, IRC or pager; stop notifying after x alarms; and so on).
Our installation can be found at http://nagios.wikimedia.org/ which is currently an alias to spence.
Quick Summary
* Nagios is installed on Spence.wikimedia.org (208.80.152.161).
* Nagios can be reached at http://nagios.wikimedia.org
* To set downtime or acknowledge alerts you need to log in, which is done over https.
* Nagios configuration files are automatically generated by /home/w/conf/nagios/conf.php (on any host with NFS home mounted) and synced over to Spence.
* MRTG has been set up and displays Nagios usage data (useful for checking that Nagios is actually doing what it is supposed to be doing).
* Ganglia and Wikitech are loosely integrated with most hosts (G and W icons next to the host name respectively). The icons link to the Ganglia data for that host or, if the host is a single point of failure (SPOF), to its associated Wikitech page.
* There is a bot in #wikimedia-tech that echoes whatever Nagios alerts on (see below).
Installation
On the server
Install from the source packages found at http://www.nagios.org/download/core/thanks/ (both the core and the plugins packages are needed).
After installing, do this:
cp /home/wikipedia/conf/nagios/* /etc/nagios/
service nagios start
and you're away.
On each client
Install NRPE from packages (apt-get):
apt-get update
apt-get -y install nagios-nrpe-server nagios-plugins
cp /home/wikipedia/conf/nagios/nrpe-debian.cfg /etc/nagios/nrpe.cfg
invoke-rc.d nagios-nrpe-server restart
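Once NRPE has been restarted on the clients, it can be useful to confirm from the monitoring host that each one answers. The following is a minimal sketch, assuming the check_nrpe plugin is available on the monitoring host. The plugin path in CHECK_NRPE and the helper name check_clients are assumptions for illustration, not part of our setup.

```shell
#!/bin/sh
# Hypothetical helper: probe the NRPE daemon on each host given as an argument.
# CHECK_NRPE is an assumed path; adjust it to wherever the plugin is installed.
CHECK_NRPE="${CHECK_NRPE:-/usr/local/nagios/libexec/check_nrpe}"

check_clients() {
    failed=0
    for host in "$@"; do
        # With only -H, check_nrpe asks the remote daemon for its version string.
        if "$CHECK_NRPE" -H "$host" >/dev/null 2>&1; then
            echo "$host: NRPE ok"
        else
            echo "$host: NRPE not answering"
            failed=1
        fi
    done
    # Non-zero exit status if at least one host failed.
    return "$failed"
}
```

Call it with the newly installed client hostnames as arguments; a non-zero exit status means at least one client did not respond.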
Configuration
There's a configurator script for adding hosts, host groups, services and service groups at /home/wikipedia/conf/nagios/conf.php. Run it on a host with the PHP CLI installed, e.g. fenari. The configurator writes to a file called hosts.cfg in the current directory.
cd /home/wikipedia/conf/nagios
./sync
Most host groups (the ones in $hostGroups) are based on dsh node group files. If such a node group exists, using it is preferred for maintainability reasons; otherwise you can list miscellaneous hosts inline using $listedHosts. Some service groups (e.g. Apache and Squid) are just replicas of the host groups, while others (such as Lucene and Memcached) are taken from the MediaWiki configuration. Services may also be listed inline using $listedServices, but again, this is not preferred.
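To see which hosts a dsh-based host group would pick up, the node group file can be read directly. This is a minimal sketch, assuming the usual dsh layout of one hostname per line with '#' starting a comment. The directory in DSH_GROUP_DIR and the helper name list_group_hosts are assumptions; conf.php may read the files differently.

```shell
#!/bin/sh
# Assumed default dsh group directory; the real location may differ.
DSH_GROUP_DIR="${DSH_GROUP_DIR:-/etc/dsh/group}"

# Hypothetical helper: print the hosts of one node group, one per line,
# dropping '#' comments, surrounding whitespace and blank lines.
list_group_hosts() {
    sed -e 's/#.*//' \
        -e 's/^[[:space:]]*//' \
        -e 's/[[:space:]]*$//' \
        -e '/^$/d' "$DSH_GROUP_DIR/$1"
}
```

The function takes the node group name as its only argument and prints the hostnames that remain after filtering.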
Other configuration should be done by editing the *.cfg files on NFS and then copying them to spence. Keeping two up-to-date copies like this protects us against failure of the monitoring host or NFS. (Note: the sync script actually replicates every .cfg file to Spence.)
If nagios refuses to restart due to a configuration error, you can get more information by running this on the monitoring host (Spence):
nagios -v /etc/nagios/nagios.cfg
The error messages can be cryptic at times.
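To avoid taking monitoring down with a broken configuration, the check can be run before every restart. The following is a minimal sketch, not an existing script: the helper name verify_and_restart and the restart command in RESTART_CMD are assumptions, while the verification command is the one shown above.

```shell
#!/bin/sh
# All three values can be overridden from the environment.
NAGIOS_BIN="${NAGIOS_BIN:-nagios}"
NAGIOS_CFG="${NAGIOS_CFG:-/etc/nagios/nagios.cfg}"
# Assumed restart command; adjust to the init system in use.
RESTART_CMD="${RESTART_CMD:-service nagios restart}"

# Hypothetical helper: restart Nagios only if the configuration verifies.
verify_and_restart() {
    if output=$("$NAGIOS_BIN" -v "$NAGIOS_CFG" 2>&1); then
        # Intentionally unquoted so the command and its arguments are split.
        $RESTART_CMD
    else
        # Show the tail of the verifier output for diagnosis.
        printf '%s\n' "$output" | tail -n 20 >&2
        echo "Configuration check failed; not restarting." >&2
        return 1
    fi
}
```

Run on the monitoring host, verify_and_restart leaves the running daemon untouched whenever the check fails.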
Custom Checks
Custom checks can be found in the private svn repository under ops/nagios-checks.
Examples include:
- check_stomp.pl
- check_all_memcached.php
Authentication
To add a user or update a password:
- Log in to Spence
- Run htpasswd /usr/local/nagios/etc/htpasswd.users <user>
IRC notification
There's a contact called "irc" in a contact group called "admins" which currently does the IRC notification. Messages are appended to /var/log/nagios/irc.log and picked up by an IRC client. Our IRC client (ircecho) can be started with:
/usr/local/bin/start-nagios-bot
This is just a wrapper around the following shell one-liner:
su -s/bin/sh -c'tail -n0 -f /var/log/nagios/irc.log | /home/wikipedia/bin/ircecho \#wikimedia-tech nagios-wm irc.freenode.net &' nobody > /dev/null 2>&1
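The one-liner covers only the consumer side: whatever is appended to irc.log is relayed to the channel. The producer side is not shown on this page, so the following is only a sketch of what a notification command could call. The helper name log_irc_alert and the message format are assumptions; only the log path comes from the text above.

```shell
#!/bin/sh
# Log file watched by the IRC bot; overridable for testing.
IRC_LOG="${IRC_LOG:-/var/log/nagios/irc.log}"

# Hypothetical helper: append one alert line to the IRC log.
# Arguments: notification type, service, host, state, plugin output.
log_irc_alert() {
    printf '%s: %s on %s is %s: %s\n' "$1" "$2" "$3" "$4" "$5" >> "$IRC_LOG"
}
```

A Nagios notification command for the "irc" contact could invoke such a helper with the relevant notification macros as arguments. Because the bot reads the file with tail -f, every appended line shows up in the channel as one message.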