NTP
Introduction
At pmtpa, csw1-pmtpa broadcasts NTP time-of-day signals. All servers in pmtpa, with the exception of albert and srv5, should be configured as broadcast clients to receive these signals.
For a while, csw1-pmtpa was getting its time from the stratum 1 servers ntp-s1.cise.ufl.edu and clock2.redhat.com. (A stratum 1 server is a server which gets its time from an accurate source outside of the NTP hierarchy, such as GPS.) Apparently, some time in November 2005, this stopped working. The switch declared both of these sources "insane", and thus had nothing to synchronise from. According to the NTP protocol, an unsynchronised clock cannot act as a time server, so the effect of this was for the entire cluster to similarly fall out of sync, with each server relying only on its local clock. Larousse and zwinger seemed to be suffering from the same problem as the switch: they weren't able to understand NTP responses from external servers when I queried them with ntpdate or ntpq, despite the fact that the same commands worked on servers in other clusters.
Albert, however, worked just fine, so I set it up as the local stratum 2 server. csw1-pmtpa happily accepted this as a time source, and thus became a stratum 3 server. Once it came into sync with albert, the rest of the cluster started slewing towards the broadcast source.
The configuration problem with csw1 is now fixed, but albert and srv5 remain as backup stratum 2 servers, in case it happens again.
Installation
To install a client, something along the lines of
yum install ntp
cp -r /home/config/others/etc/ntp* /etc/
chkconfig ntpd on
service ntpd start
should work. The important thing is that /etc/ntp.conf has broadcastclient enabled.
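A quick way to confirm that setting after copying the config is sketched below. The function name has_broadcastclient is hypothetical, and the check only assumes that the directive appears on its own uncommented line in the file.

```shell
# Hypothetical helper: succeed if the given ntp.conf contains an uncommented
# "broadcastclient" directive. Defaults to the path named above.
has_broadcastclient() {
    conf="${1:-/etc/ntp.conf}"
    # Match the directive at the start of a line, allowing leading whitespace.
    # Lines beginning with '#' are comments and do not match.
    grep -Eq '^[[:space:]]*broadcastclient([[:space:]]|$)' "$conf"
}
```

For example, has_broadcastclient || echo "broadcastclient missing from /etc/ntp.conf" prints a warning on a misconfigured host.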
Testing
Testing is very important, as the November 2005 event demonstrated. It's easy to test whether ntpd is working. Run:
/usr/sbin/ntpq -p
Here is the output from a happy server:
     remote           refid      st t when poll reach   delay   offset  jitter
==============================================================================
*vl-2-0.csw1-pmt albert.pmtpa.wm  3 u   43   64  177    0.486    1.155   1.135
Note the asterisk in the first column: that tells you it's happy. It's synchronised to csw1, which is on stratum (st) 3, and the refid gives the stratum 2 server that csw1 itself is synchronised to. The other important columns are:
- when: this tells you how long ago it received a response from the server, in this case 43 seconds
- offset: this tells you how far off the clock is, in milliseconds.
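The asterisk and offset check can be scripted. The following is a minimal sketch, assuming the ten-column layout shown above (offset is the ninth whitespace-separated field). The name ntp_sync_offset is hypothetical.

```shell
# Hypothetical helper: read `ntpq -p` output on stdin and print the peer the
# daemon is synchronised to (the line starting with '*') followed by its
# offset in milliseconds. Exits non-zero if no line carries the asterisk.
ntp_sync_offset() {
    awk '
        /^\*/ {
            # Fields: remote refid st t when poll reach delay offset jitter
            print substr($1, 2), $9
            found = 1
        }
        END { exit !found }
    '
}
```

Usage would be along the lines of /usr/sbin/ntpq -p | ntp_sync_offset || echo "not synchronised".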
Here is the output from a server which is on its way to synchronisation:
     remote           refid      st t when poll reach   delay   offset  jitter
==============================================================================
 vl-2-0.csw1-pmt albert.pmtpa.wm  3 u   81 1024    7    0.573  -203.77   1.635
There's no asterisk, which means it hasn't synchronised yet. The offset is substantial, so it will take a while to get into sync. The fact that the remote, refid, st and when columns are reasonable tells you that it is actually working. If you check back later, the offset should be smaller.
Here is the output from a completely broken server:
     remote           refid      st t when poll reach   delay   offset  jitter
==============================================================================
 vl-2-0.csw1-pmt .B▒▒.           16 u   16   64    0    0.000    0.000 4000.00
It seems to know what server it's meant to be reading from, but the other columns are just silly. Stratum 16 is not a real stratum: it is the value reported for an unsynchronised source. A reach of 0 means no responses have been received, and I'm quite sure the network delay is meant to be more than zero. If you see something like this, you need to fix it.
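The three cases above (happy, on its way, broken) can be told apart mechanically. This is a rough sketch rather than a complete health check: it assumes the standard column layout, treats a reach value of 0 (seventh field) as "nothing received", and uses the hypothetical name ntp_state.

```shell
# Hypothetical helper: read `ntpq -p` output on stdin and print one word:
#   synced   - some peer carries the '*' marker
#   syncing  - no '*', but at least one peer has a non-zero reach
#   broken   - no peer has been heard from at all
ntp_state() {
    awk '
        $1 == "remote" || /^=+$/ { next }          # skip header and rule line
        /^\*/                    { synced = 1 }
        NF >= 10 && $7 + 0 > 0   { reachable = 1 } # reach is the seventh field
        END {
            if (synced)         print "synced"
            else if (reachable) print "syncing"
            else                print "broken"
        }
    '
}
```

Something like /usr/sbin/ntpq -p | ntp_state could then be run from a monitoring script, alerting on anything other than synced that persists.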
ntpq can be run remotely. The output of ntpq -c peers csw1-pmtpa currently shows:
     remote           refid      st t when poll reach   delay   offset  jitter
==============================================================================
 207.142.131.255 0.0.0.0         16 u    -   64    0    0.000    0.000 16000.0
 10.0.255.255    0.0.0.0         16 u    -   64    0    0.000    0.000 16000.0
 clock2.redhat.c .CDMA.           1 -  18d   64    0   73.200    1.322 16000.0
 ntp-s1.cise.ufl 85.83.78.79     16 -  18d 1024    0   18.140    0.673 16000.0
 raptor.tera-byt 0.0.0.0         16 -    - 1024    0    0.000    0.000 16000.0
*albert.pmtpa.wm ntp-s1.cise.ufl  2 u   45   64  377    0.790   -3.597   0.400
Three broken external servers, two broadcast domains and albert, which is a working stratum 2 server. Finally, albert gives:
     remote           refid      st t when poll reach   delay   offset  jitter
==============================================================================
*ntp-s1.cise.ufl .USNO.           1 u   86  128  377   22.090   12.095   5.867
+ip-207-145-113- .GPS.            1 u   89  128  377   76.639   20.227   5.468
+solarnet.ru     hora.cs.tu-berl  2 u   92  128  377  180.877   11.987   3.481
-blah.jabber.dk  ntp2.sth.netnod  2 u   97  128  377  135.076   22.336   5.524
 LOCAL(0)        LOCAL(0)        10 l   12   64  377    0.000    0.000   0.001
A nearby stratum 1 server at the University of Florida is selected as the reference, but I've configured three other servers from pool.ntp.org in case that one goes down. Two of them are contributing to the averaging process; the third is ignored because its clock doesn't agree with the others. If all four are unreachable, the local clock will be used. The local clock is currently treated as undesirable because it's been declared stratum 10, so any reachable real server is preferred over it.
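The character in the first column of each line is what distinguishes these roles. As an illustration, the sketch below labels each peer by that character. It only handles the codes that appear in albert's output ('*', '+', '-' and blank) and lumps anything else under "not used". The name ntp_tally is hypothetical.

```shell
# Hypothetical helper: read `ntpq -p` output on stdin and label each peer by
# the tally character in column 1:
#   '*' selected reference, '+' contributing candidate, '-' discarded outlier.
# Any other first character is reported as "not used".
ntp_tally() {
    awk '
        $1 == "remote" || /^=+$/ || NF == 0 { next }   # header, rule, blanks
        {
            code = substr($0, 1, 1)
            name = $1
            if (code == "*")      { role = "selected";  name = substr($1, 2) }
            else if (code == "+") { role = "candidate"; name = substr($1, 2) }
            else if (code == "-") { role = "outlier";   name = substr($1, 2) }
            else                  { role = "not used" }
            print name ": " role
        }
    '
}
```

Since ntpq can query a remote host, /usr/sbin/ntpq -p albert | ntp_tally would summarise albert's sources from any machine that is allowed to query it.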