Ganglia

(Difference between revisions)
(updated)
{{document me}}
{{fixme|Which machines are aggregators?}}
{{fixme|How to add a group correctly?}}
{{fixme|Why doesn't search group show up?}}

==unicast vs multicast==
*As of version 3.0.0, released 2005-02-07, status messages can be sent over unicast.

==components==
; gmetad
: This daemon collects data from the other hosts and writes it to RRD databases.
: It is installed on [[zwinger]] using <tt>ganglia-monitor-core-gmetad-2.5.6-1.i386.rpm</tt>.

; gmond
: A daemon installed on each machine that collects useful information locally. It is queried periodically by the gmetad hosts.
: gmond is installed on each machine using <tt>ganglia-monitor-core-gmond-2.5.6-1.i386.rpm</tt>.
: <s><tt>/etc/gmond.conf</tt> on each machine '''MUST''' be a symlink to <tt>/home/wikipedia/gmond.conf</tt>.  ''If the default gmond.conf is used, ganglia stats for the entire cluster will not be recorded in the right place and will be effectively lost.'' (But see ''merging RRDs'' below.)</s>

==version mismatches==
*Things go wrong when gmond and gmetad have different versions, or possibly when peer gmonds have different versions. Which of the two is the actual cause has not been confirmed.

==zwinger==
RRDs (ganglia statistics, in this case) are in /home/wikipedia/rrds.

There is some old data in the default location, /var/lib/ganglia/rrds.

==cluster-wide ganglia restart==
If something is amiss with the state of ganglia, and reconfiguring and restarting gmetad isn't enough, do this on zwinger as root:
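<pre>
#!/bin/bash
# Stop gmetad, then stop gmond on the cluster hosts via dsh.
/etc/init.d/gmetad stop
dsh -f -a /etc/init.d/gmond stop
# Pause briefly before bringing everything back up in reverse order.
sleep 5
dsh -f -a /etc/init.d/gmond start
/etc/init.d/gmetad start
</pre>
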
Or you can just run <tt>/home/wikipedia/bin/ganglia-restart-all</tt>.

==merging RRDs==
It is possible, in principle, to merge RRDs using a Perl script available online. It looks like a royal pain though.

The script is listed on [http://people.ee.ethz.ch/~oetiker/webtools/rrdtool/rrdworld/index.en.html RRDWorld on the RRDtool official website]. You might want to look at [http://www.pintori.it:8080/rrd-merger/README the README file].

{{PD}}

[[Category:Bot and monitoring]]

Revision as of 17:06, 30 October 2006

gmond

Configuration of gmond is done via the master file /home/wikipedia/conf/gmond/gmond.conf.master and the configuration generator script /home/wikipedia/conf/gmond/conf.php. The whole /home/wikipedia/conf/gmond directory is copied to every server running gmond, as /etc/gmond. The ./sync script is a shortcut to the relevant rsync command. Each server then has a symlink from /etc/gmond.conf to the relevant specialised configuration file in /etc/gmond/*.conf.
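
The per-server symlink step could look roughly like the sketch below. The function name link_gmond_conf, the root-directory argument (added only so the sketch can be tried outside /etc) and the file name "example" are assumptions for illustration. They are not taken from the real sync script or installer.

```shell
#!/bin/bash
# Hypothetical helper: point <root>/gmond.conf at <root>/gmond/<name>.conf.
# On a real server <root> would be /etc; it is a parameter here only so the
# sketch can be exercised in a scratch directory.
link_gmond_conf() {
    local root="$1" name="$2"
    local target="$root/gmond/$name.conf"
    if [ ! -f "$target" ]; then
        echo "no such configuration: $target" >&2
        return 1
    fi
    # -s: symbolic link, -f: replace an existing link, -n: do not follow an existing link
    ln -sfn "$target" "$root/gmond.conf"
}
```

On a server this would be called as link_gmond_conf /etc example, where "example" stands for whichever specialised file applies to that machine.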

Each cluster has its own multicast channel. Channels are allocated automatically by conf.php. The first cluster in the $clusters array is given the IP address 239.192.0.1, the second is given 239.192.0.2, and so on up. 239.192.0.0/24 should be considered reserved for this purpose, as documented at IP addresses.
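
The allocation rule can be sketched in a few lines of shell. This is not the code in conf.php, which is not reproduced on this page. The function name cluster_channel, the cluster names in the loop and the upper bound of 254 are assumptions made for the example.

```shell
#!/bin/bash
# Sketch of the allocation rule: the Nth cluster (counting from 1) gets 239.192.0.N.
# cluster_channel is a hypothetical name, not a function from conf.php.
cluster_channel() {
    local index="$1"
    case "$index" in
        ''|*[!0-9]*) echo "index must be a positive integer" >&2; return 1 ;;
    esac
    # Assumption: stay inside 239.192.0.0/24 and leave .0 and .255 unused.
    if [ "$index" -lt 1 ] || [ "$index" -gt 254 ]; then
        echo "index $index is outside 239.192.0.1-239.192.0.254" >&2
        return 1
    fi
    echo "239.192.0.$index"
}

# Made-up cluster names, listed in the order they would appear in $clusters.
n=0
for name in apaches squids mysql; do
    n=$((n + 1))
    echo "$name $(cluster_channel "$n")"
done
```

Because the address depends only on position, inserting a cluster in the middle of the array would shift the channel of every cluster after it. Appending new clusters at the end avoids that.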

The *_aggr.conf configuration files are "aggregator" configuration files. These configure ganglia in non-deaf mode, allowing it to listen to the multicast channel, aggregate the state of the cluster in memory and respond to XML requests from gmetad. The remaining configuration files use deaf mode, which saves a small amount of CPU time and memory for those servers.
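
Since the role follows from the file name, a small helper can report which mode a given configuration file implies. This is only a sketch of the naming convention described above. The function name gmond_mode is hypothetical, and it inspects the name only, not the file contents.

```shell
#!/bin/bash
# Classify a generated gmond configuration file by its name:
# *_aggr.conf files are aggregators (non-deaf), other *.conf files are deaf.
gmond_mode() {
    case "$(basename "$1")" in
        *_aggr.conf) echo "aggregator" ;;
        *.conf)      echo "deaf" ;;
        *)           echo "unknown"; return 1 ;;
    esac
}
```

Running gmond_mode "$(readlink /etc/gmond.conf)" on a server would then show which role that machine has been given.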

/home/wikipedia/src/packages/install-ganglia should be considered the canonical installation script.

gmetad

There are two instances of gmetad running on zwinger: gmetad_aggr and gmetad_pmtpa. The instances run via symlinks at /usr/sbin/gmetad_aggr and /usr/sbin/gmetad_pmtpa, so the specialised name will appear in ps. There are two init.d services which start these instances. They do not use PID files; they just signal the processes by name.
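
Signalling by name rather than by PID file might look like the sketch below. It assumes the procps pkill utility is available. The actual init scripts are not shown on this page and may use a different tool, and the function name signal_instance is made up for the example.

```shell
#!/bin/bash
# Send a signal (default TERM) to a gmetad instance by process name.
# pkill -x requires the whole process name to match, so gmetad_aggr and
# gmetad_pmtpa are never mistaken for each other.
signal_instance() {
    local name="$1" sig="${2:-TERM}"
    pkill "-$sig" -x "$name"
}
```

For example, signal_instance gmetad_pmtpa would stop only the pmtpa instance. The call returns non-zero when no process with that name is running.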

gmetad_aggr is responsible for aggregating data from the pmtpa, knams and (eventually) yaseo "grids". Its resource requirements are small.

gmetad_pmtpa aggregates data from all of the servers in the pmtpa grid. Its resource requirements are much larger. Its RRD files are stored on a tmpfs mounted at /mnt/ganglia_tmp, because the disk write rate was too high for them to be written directly to disk. Once per hour, the in-memory RRDs are synced to disk by /usr/local/bin/save-gmetad-rrds. On startup, /usr/local/bin/restore-gmetad-rrds is executed, which loads the RRDs from disk into the tmpfs.
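
The save and restore steps are the same copy in opposite directions, which the following sketch illustrates. The contents of save-gmetad-rrds and restore-gmetad-rrds are not shown on this page, so copy_rrds is a hypothetical stand-in. The on-disk directory is left as a placeholder because it is not stated here.

```shell
#!/bin/bash
# Copy every RRD from one tree to another, preserving timestamps and layout.
# copy_rrds is a hypothetical stand-in for the real save/restore scripts.
copy_rrds() {
    local src="$1" dst="$2"
    [ -d "$src" ] || { echo "missing source: $src" >&2; return 1; }
    mkdir -p "$dst" && cp -a "$src/." "$dst/"
}

# Hourly save:        copy_rrds /mnt/ganglia_tmp <on-disk directory>
# Restore at startup: copy_rrds <on-disk directory> /mnt/ganglia_tmp
```

With an hourly save, a crash or reboot of zwinger can lose up to an hour of the most recent samples. That is the trade-off for keeping the write load off the disk.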

The gmetad configuration files are in /etc/gmetad_aggr.conf and /etc/gmetad_pmtpa.conf, with a copy in /home/wikipedia/conf/gmond.

Web frontend

We run a customised copy of the web frontend with a document root at /home/wikipedia/htdocs/ganglia. There is a symlink from pmtpa to . (the document root itself), so the same files are served under both paths. conf.php detects the request URI and reads from either gmetad_aggr or gmetad_pmtpa as appropriate.
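
The selection logic can be illustrated with a short shell function. The real check lives in the PHP frontend and is not reproduced here. The sketch assumes that requests going through the pmtpa symlink carry a /pmtpa path component and that everything else belongs to gmetad_aggr. The function name gmetad_for_uri is made up.

```shell
#!/bin/bash
# Pick the gmetad instance for a request URI.
# Assumption: a /pmtpa path component selects gmetad_pmtpa; anything else
# falls through to gmetad_aggr.
gmetad_for_uri() {
    local path="${1%%\?*}"   # drop the query string, if any
    case "$path" in
        */pmtpa|*/pmtpa/*) echo "gmetad_pmtpa" ;;
        *)                 echo "gmetad_aggr" ;;
    esac
}
```

Stripping the query string first means a parameter value that happens to contain "pmtpa" does not change which instance is queried.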

gmetricd

A Python daemon called gmetricd collects the diskio_* metrics. It should be running on every server that runs gmond. It should be possible to extend it with other metrics if desired. The code is in SVN in the ganglia_metrics directory, and there are some RPMs in /home/wikipedia/rpms/ganglia/ganglia_metrics/.
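
To get a feel for the raw data behind disk I/O metrics, the sketch below reads the Linux /proc/diskstats layout and prints per-device counters. It is not gmetricd's code, which is Python and not shown on this page. The function name diskio_counters and the metric name in the commented-out gmetric call are made up for illustration.

```shell
#!/bin/bash
# Print "device reads_completed writes_completed" for each line of
# /proc/diskstats-style input on stdin (Linux layout: field 3 is the device
# name, field 4 reads completed, field 8 writes completed).
diskio_counters() {
    awk 'NF >= 8 { print $3, $4, $8 }'
}

# Publishing one sample by hand with the gmetric command-line tool.
# The metric name is hypothetical and is not one of gmetricd's metrics.
# diskio_counters < /proc/diskstats | while read -r dev reads writes; do
#     gmetric --name "diskio_example_reads_$dev" --value "$reads" --type double --units reads
# done
```

These counters are cumulative since boot, so a collector would normally report the difference between two samples rather than the raw value.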
