Nagios
Revision as of 17:31, 27 June 2005
Nagios ( http://www.nagios.org/ ) is host and service monitoring software built from a binary daemon, some CGI scripts for the web interface, and binary plugins that check various things.
It can be set to monitor services such as ssh, squid status, and the mysql socket, as well as the number of users logged in, load, and disk usage. There are two levels of alarm (warning, critical), and the notification system is fully customizable (groups of users; notification by email, IRC, or pager; stop notifying after x alarms; and so on).
A test installation is at: http://noc.wikimedia.org/nagios/
Wikimedia system administrators need to grant themselves access (see /h/w/doc/nagios-password for instructions).
Installing
We are using larousse as the monitoring host. This means all probes are sent by this server. The nagios sources are in /home/hashar/ , built using nagios-2.0b1.
If you want to update, get a tarball (for example Nagios (1.2) stable), put it in /home/hashar/ and unpack it with tar xfvz.
Make sure the user nagios and the group nagios exist.
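A quick way to check for the account and group before building might look like the sketch below. The helper names have_user and have_group are made up for this example, and it assumes groups are kept in the local /etc/group file rather than in a directory service.

```shell
# Sketch only: check that the nagios group and account exist before building.
# have_user / have_group are hypothetical helpers, not part of nagios.
# Assumption: groups are defined in the local /etc/group file.
have_user()  { id -u "$1" >/dev/null 2>&1; }
have_group() { grep -q "^$1:" /etc/group; }

# Report what is missing and hint at the usual fix (run as root).
have_group nagios || echo "group nagios is missing (as root: groupadd nagios)"
have_user nagios  || echo "user nagios is missing (as root: useradd -g nagios nagios)"
```

If nothing is printed, both exist and the build can go ahead.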
./configure --prefix=/usr/local/nagios \
  '--with-htmlurl=/nagios' \
  '--with-cgiurl=/nagios/cgi-bin' \
  --with-nagios-user=nagios \
  --with-nagios-grp=nagios
make all
make install
Some default configuration files can be installed using 'make install-config', but Ashar already has configuration files for the Wikimedia server farm (more on that later).
We want to run the daemon through init.d. Nagios comes with a script for that, so we want to install or update it:
make install-init
Have a look at the Apache configuration and allow CGI execution for the --with-cgiurl path. Nagios does not have a user management system of its own to control access to the website, so you have to configure web authentication (Ashar used htpasswd). Some .htaccess files are in /usr/local/nagios/share/ and /usr/local/nagios/libexec/ .
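For reference, a Basic-auth .htaccess of the kind described here could be generated as sketched below. This is not a copy of the files actually installed on larousse: the helper name print_htaccess and the realm string are invented for the example, and the password file path is assumed from the htpasswd.users entry in the configuration directory.

```shell
# Sketch only: print a Basic-auth .htaccess pointing at a given password file.
# print_htaccess is a hypothetical helper; the realm name is an assumption.
print_htaccess() {
    cat <<EOF
AuthName "Nagios Access"
AuthType Basic
AuthUserFile $1
Require valid-user
EOF
}

# Assumed location of the password file (see htpasswd.users under Configuration).
print_htaccess /usr/local/nagios/etc/htpasswd.users
```

The password file itself is normally managed with Apache's htpasswd tool, for example `htpasswd /usr/local/nagios/etc/htpasswd.users someuser` (add -c only when creating the file for the first time, since it overwrites an existing one).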
Nagios provides plugins to check things. You have to download them separately (Ashar used version 1.4 beta1). Once compiled, you should update them in /home/wikipedia/nagios/libexec/ ; this way every server will have the plugins.
Remote monitoring
To monitor services on remote hosts we are using nrpe (Nagios Remote Plugin Executor). It is a standalone daemon that listens on port 5666 and waits for commands from the monitoring host (larousse).
The sources are in /home/hashar/nrpe-2.0 and provide the daemon (nrpe) as well as a plugin to be used by nagios (check_nrpe). To install it:
cd /home/hashar/nrpe-2.0/
./configure && make
The binary is /home/hashar/nrpe-2.0/src/nrpe , and the plugin is /home/hashar/nrpe-2.0/src/check_nrpe
Update the daemon for all hosts:
cp /home/hashar/nrpe-2.0/src/nrpe /home/wikipedia/nagios/bin/
Now from larousse update the check_nrpe plugin:
cp /home/hashar/nrpe-2.0/src/check_nrpe /usr/local/nagios/libexec/
Don't forget to check the global configuration: /home/wikipedia/nagios/etc/nrpe.cfg .
Ashar launched the daemon on all apaches using:
dsh -f -N apaches \
  -e '/home/wikipedia/nagios/bin/nrpe -c /home/wikipedia/nagios/etc/nrpe.cfg -d'
And blam, all apaches are listening for larousse checks :o)
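To confirm from larousse that the daemons really answer, something like the following sketch could be used. The function name probe_hosts is made up for this example; it relies on check_nrpe being called with only -H, which asks the remote daemon to identify itself without running a plugin.

```shell
# Sketch only: ask each host's nrpe daemon to answer, one host name per input line.
# probe_hosts is a hypothetical helper; override CHECK_NRPE to point elsewhere.
CHECK_NRPE=${CHECK_NRPE:-/usr/local/nagios/libexec/check_nrpe}

probe_hosts() {
    while read -r host; do
        [ -n "$host" ] || continue          # skip blank lines
        # With only -H, check_nrpe just checks that the daemon responds.
        if "$CHECK_NRPE" -H "$host" </dev/null >/dev/null 2>&1; then
            echo "$host ok"
        else
            echo "$host FAILED"
        fi
    done
}

# Usage (the host list file is an assumption): probe_hosts < apaches.txt
```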
Launching
Nagios
As we installed the init.d script, log in on larousse and launch:
/etc/init.d/nagios start
It *should* work :o)
Remote daemon
On remote servers you just need to be sure the nrpe daemon is running. Ashar (or someone else) still has to write the nrpe init.d script.
IRC bots
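As a starting point for that missing script, here is a minimal sketch. It is not an existing or tested init script: the function name nrpe_init is invented, the paths are the shared ones used in the dsh command earlier on this page, and it assumes pgrep and pkill are available on the remote hosts.

```shell
# Sketch only: the core of a possible /etc/init.d/nrpe script.
# Assumptions: shared paths as used earlier on this page; pgrep/pkill installed.
NRPE_BIN=/home/wikipedia/nagios/bin/nrpe
NRPE_CFG=/home/wikipedia/nagios/etc/nrpe.cfg

nrpe_init() {
    case "$1" in
        start)
            "$NRPE_BIN" -c "$NRPE_CFG" -d     # -d: run as a standalone daemon
            ;;
        stop)
            pkill -x nrpe
            ;;
        restart)
            nrpe_init stop
            sleep 1
            nrpe_init start
            ;;
        status)
            if pgrep -x nrpe >/dev/null; then
                echo "nrpe is running"
            else
                echo "nrpe is stopped"
                return 3
            fi
            ;;
        *)
            echo "Usage: nrpe {start|stop|restart|status}" >&2
            return 1
            ;;
    esac
}

# In the installed script, the last line would be: nrpe_init "$1"
```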
See script /home/hashar/bin/wikinagios.sh , have a look at it before launching it though :o) We probably want to run it under user nagios:nagios .
WikiServices outputs the service alarms while WikiHosts outputs the host ones. This way, if needed, we can have the service spam in one channel and the host notifications in #mediawiki .
Configuration
There are two configuration systems! The default one provided with the nagios tarball doesn't work (old system). The configuration files are all in larousse:/usr/local/nagios/etc/ ; the only files that matter are the .cfg ones.
Building and maintaining config files is a boring job. We should probably write scripts that do the task for us using "dsh groups" and "/etc/hosts" as source.
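Such a generator could start out as small as the sketch below, which turns /etc/hosts-style lines into host definitions. The function name hosts_to_cfg is made up, the template name generic-host is an assumption about what hosts.cfg defines, and a real script would also filter by dsh group.

```shell
# Sketch only: turn /etc/hosts-style input ("address name ...") into host definitions.
# hosts_to_cfg is a hypothetical helper; "generic-host" is an assumed template name.
hosts_to_cfg() {
    awk '
        /^[ \t]*#/ || NF < 2 { next }    # skip comments, blank and incomplete lines
        {
            printf "define host{\n"
            printf "    use        generic-host\n"
            printf "    host_name  %s\n", $2
            printf "    address    %s\n", $1
            printf "}\n"
        }'
}

# Usage: hosts_to_cfg < /etc/hosts > hosts.cfg.new   (review before installing)
```

Writing to a separate file first keeps a bad run from clobbering the live hosts.cfg.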
- cgi.cfg
- configuration for the cgi web interface.
- checkcommands.cfg
- defines the check commands later used by other configuration files.
- contactgroups.cfg
- groups people together, useful if we want only some people to receive mysql checks, for example.
- contacts.cfg
- everyone should be configured there !
- dependencies.cfg
- sometimes when a host or service is down (e.g. ping), it is useless to check other hosts or services. For example, if the nrpe daemon is not responding on a server, we can skip all remote checks on it, as they will not work.
- hostgroups.cfg
- groups servers by function (for example)
- hosts.cfg
- defines the hosts and their services
- htpasswd.users
- users and password to access the nagios web interface.
- misccommands.cfg
- stores commands that are not core to nagios monitoring
- nagios.cfg
- the main nagios configuration file. That's where you can define user rights (among other things).
- resource.cfg
- defines some macros to be used in other .cfg files.
- services.cfg
- defines our services and which groups of servers should be monitored for each service.
- timeperiods.cfg
- have a look at it. It is the default.
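After editing any of these files it is worth letting nagios check them before restarting the daemon. The sketch below wraps the daemon's -v (verify) option; the function name verify_config is invented, and the binary path is assumed from the --prefix used in the Installing section.

```shell
# Sketch only: run the nagios pre-flight check on the main configuration file.
# verify_config is a hypothetical helper; paths are assumed from --prefix=/usr/local/nagios.
NAGIOS_BIN=${NAGIOS_BIN:-/usr/local/nagios/bin/nagios}
NAGIOS_CFG=${NAGIOS_CFG:-/usr/local/nagios/etc/nagios.cfg}

verify_config() {
    if [ ! -x "$NAGIOS_BIN" ]; then
        echo "nagios binary not found at $NAGIOS_BIN" >&2
        return 2
    fi
    # -v parses every referenced .cfg file and reports errors without starting the daemon.
    "$NAGIOS_BIN" -v "$NAGIOS_CFG"
}

# Usage on larousse: verify_config && /etc/init.d/nagios restart
```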
Operational Use
It will be necessary to assemble a certain amount of protocol about how to best utilize Nagios in the WP grid. Q&A on that should go here.
- Is there a preferred method for making temporarily decommissioned machines not affect the outage stats? Does 'downtime' do that? Trying on the french machines... --Baylink 19:04, 11 Jan 2005 (UTC)
It may or may not be of interest to the admin crew that Wikipulse has posted a link to Nagios, which includes the guest login and password. Wikipulse is less public than OpenFacts, and certainly less public than Collected Status, but neither of those posted the username and password, either. I'm not sure this is really problematic; I just thought I'd point it out in case someone else did think so. Or wanted to know where those other newbies on IRC were getting all the things they were blathering about. :-)
--24.129.168.240 17:31, 27 Jun 2005 (UTC)
External links
- http://www.nagios.org/
- Ashar Voultoiz presenting nagios in wikitech
- http://noc.wikimedia.org/nagios/ (private access, need root. See above)