LVS
HOWTO
Pool or depool hosts
For pmtpa: edit the files in /etc/pybal/ directly. PyBal will reread the file within a minute.
For esams: edit the files in /home/w/conf/pybal/esams/ and wait a minute - PyBal will fetch the file over HTTP.
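A PyBal pool file lists one server entry per line. The sketch below shows what such entries look like (hostnames and exact key set illustrative; check an existing pool file for the authoritative format). Depooling a host usually means setting its 'enabled' flag to False rather than deleting the line, so the entry can simply be flipped back later:

```
{ 'host': 'sq1.pmtpa.wmnet', 'weight': 10, 'enabled': True }
{ 'host': 'sq2.pmtpa.wmnet', 'weight': 10, 'enabled': False }
```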
LVS installation
esams has a newer setup that uses Puppet and automatic BGP failover. Puppet arranges the service IP configuration and the installation of packages. To configure the service IPs that an LVS balancer should serve (both primary and backup!), set the $lvs_balancer_ips variable:
node /amslvs[1-4]\.esams\.wikimedia\.org/ {
    $cluster = "misc_esams"
    $lvs_balancer_ips = [ "91.198.174.2", "91.198.174.232", "91.198.174.233", "91.198.174.234" ]

    include base,
        ganglia,
        lvs::balancer
}
In this setup, all four hosts amslvs1-amslvs4 are configured to serve all of these service IPs, although in practice each service IP is only ever served by one of two hosts due to the router configuration.
Puppet uses the (now misleadingly named) wikimedia-lvs-realserver package to bind these IPs to the loopback (!) interface. This is to make sure that a server answers on these IPs, but does not announce them via ARP - we'll use BGP for that.
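Conceptually, the effect is equivalent to the following (a sketch of what the package arranges, not its actual maintainer scripts):

```
# Bind the service IP to the loopback interface: the kernel accepts
# traffic routed to this address, but lo never answers ARP queries
# for it on the LAN, so the routers stay in control of where it goes.
ip addr add 91.198.174.232/32 dev lo
```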
PyBal configuration
Puppet currently does not configure PyBal.
Edit the configuration in /etc/pybal/pybal.conf. This is an INI-format file, with one section for each LVS service, plus a [global] section for global settings. The default configuration file should provide hints.
Put a local copy of the conf files in /home/wikipedia/conf/pybal/cluster/ under the hostname; see the README.txt in that directory.
There is more information on Pybal if you need it.
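For orientation, a service section in pybal.conf looks roughly like the following (key names and values are illustrative only; verify against the default configuration file mentioned above):

```
[global]
; global PyBal settings go here

[apaches]
protocol = tcp
ip = 10.2.1.1
port = 80
scheduler = wlc
```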
BGP failover and load sharing
Previously, the LVS balancer that had a certain service IP bound to its eth0 interface was active for that IP. To do failovers, the IP had to be moved manually.
In the new setup, multiple servers announce the service IP(s) via BGP to the router(s), which then pick which server(s) to use based on BGP and OSPF routing policy.
Example BGP configuration for Foundry:
neighbor 10.0.0.210 remote-as 64601
neighbor 10.0.0.210 description "PyBal on lvs3.wikimedia.org"
neighbor 10.0.0.210 timers keep-alive 10 hold-time 30
neighbor 10.0.0.210 update-source loopback 1
neighbor 10.0.0.210 prefix-list LVS in
neighbor 10.0.0.210 prefix-list none out
Prefix-list LVS:
ip prefix-list LVS: 2 entries
seq 5 permit 208.80.152.0/22 ge 32
seq 10 permit 10.0.0.0/8 ge 32
Example BGP configuration for Juniper (csw2-esams):
root@csw2-esams> show configuration protocols bgp
group PyBal {
type external;
multihop {
ttl 1;
}
local-address 91.198.174.244;
hold-time 30;
import LVS_import;
family inet {
unicast {
prefix-limit {
maximum 10;
teardown;
}
}
}
export NONE;
peer-as 64600;
neighbor 91.198.174.111;
neighbor 91.198.174.112;
}
group iBGP {
type internal;
peer-as 43821;
neighbor 91.198.174.247 {
import LVS_exchange;
export LVS_exchange;
}
}
root@csw2-esams> show configuration policy-options
prefix-list LVS {
91.198.174.0/25;
91.198.174.232/30;
}
policy-statement LVS_exchange {
term 1 {
from {
prefix-list-filter LVS longer;
}
then accept;
}
from protocol bgp;
}
policy-statement LVS_import {
term 1 {
from {
protocol bgp;
prefix-list-filter LVS longer;
}
then {
metric add 10;
accept;
}
}
}
SSH checking
As the Apache cluster often suffers from broken disks that break SSH but leave Apache up, I have implemented a RunCommand monitor in PyBal which can periodically run an arbitrary command and check the server's health via its exit code. If the command does not return within a certain timeout, the server is marked down as well.
The RunCommand configuration is in /etc/pybal/pybal.conf:
runcommand.command = /bin/sh
runcommand.arguments = [ '/etc/pybal/runcommand/check-apache', server.host ]
runcommand.interval = 60
runcommand.timeout = 10
- runcommand.command
- The path to the command which is being run. Since we are using a shell script and PyBal does not invoke a shell by itself, we have to do that explicitly.
- runcommand.arguments
- A (Python) list of command arguments. This list can refer to the monitor's server object, as shown here.
- runcommand.interval
- How often to run the check (seconds).
- runcommand.timeout
- The command timeout; after this many seconds the command's entire process group is KILLed, and the server is marked down.
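The timeout semantics described above (kill the whole process group, then mark the server down) can be sketched in a few lines of Python. This is an illustration of the behaviour, not PyBal's actual implementation:

```python
import os
import signal
import subprocess

def run_with_timeout(args, timeout):
    """Run a health-check command in its own process group; if it exceeds
    the timeout, SIGKILL the entire group and report the server as down."""
    proc = subprocess.Popen(args, preexec_fn=os.setsid)  # new process group
    try:
        proc.wait(timeout=timeout)
        return proc.returncode == 0   # healthy iff the command exited 0
    except subprocess.TimeoutExpired:
        os.killpg(proc.pid, signal.SIGKILL)  # kill command and any children
        proc.wait()                          # reap the killed process
        return False                         # timed out: mark server down

print(run_with_timeout(["/bin/sh", "-c", "true"], 10))     # True
print(run_with_timeout(["/bin/sh", "-c", "sleep 30"], 1))  # False
```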
Currently we're using the following RunCommand script, in /etc/pybal/runcommand/check-apache:
#!/bin/sh
set -e

HOST=$1
SSH_USER=pybal-check
SSH_OPTIONS="-o PasswordAuthentication=no -o StrictHostKeyChecking=no -o ConnectTimeout=8"

# Open an SSH connection to the real-server. The command is overridden
# by the authorized_keys file.
ssh -i /root/.ssh/pybal-check $SSH_OPTIONS $SSH_USER@$HOST true

exit 0
The limited ssh accounts on the application servers are managed by the wikimedia-task-appserver package.
Old
To install an LVS load balancer, on a base Ubuntu install, do:
- apt-get install pybal (ignore the warning about the kernel not supporting IPVS)
- Set up configuration in /etc/pybal/
- Restart PyBal and check whether it is working correctly (tail /var/log/pybal.log)
- Bind the LVS IP(s) to the external interface (usually eth0); for persistence across reboots, add the following line to the loopback interface block in /etc/network/interfaces:
up ip addr add <ip>/32 dev $IFACE
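For example, the loopback block for a director carrying 208.80.152.2 (the pmtpa text VIP) would look roughly like this (sketch; adjust the address for the service in question):

```
auto lo
iface lo inet loopback
        up ip addr add 208.80.152.2/32 dev $IFACE
```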
Removing real-servers
Real-servers can be removed from the pool temporarily by simply shutting down apache. Because lvsmon runs in a single thread, checking apaches in turn, it's probably better to remove permanently dead apaches from the apache nodelist.
If a misbehaving realserver is in LVS and for some reason PyBal is not removing it, you can remove it by running a command of the following form:
ipvsadm -d -t <VIP>:<PORT> -r <REALSERVER>
e.g.
ipvsadm -d -t 66.230.200.228:80 -r sq1.pmtpa.wmnet
Diagnosing problems
Run ipvsadm -l on the director. Healthy output looks like this:
IP Virtual Server version 1.2.1 (size=4096)
Prot LocalAddress:Port Scheduler Flags
  -> RemoteAddress:Port           Forward Weight ActiveConn InActConn
TCP  upload.pmtpa.wikimedia.org:h wlc
  -> sq10.pmtpa.wmnet:http        Route   10     5202       5295
  -> sq1.pmtpa.wmnet:http         Route   10     8183       12213
  -> sq4.pmtpa.wmnet:http         Route   10     7824       13360
  -> sq5.pmtpa.wmnet:http         Route   10     7843       12936
  -> sq6.pmtpa.wmnet:http         Route   10     7930       12769
  -> sq8.pmtpa.wmnet:http         Route   10     7955       11010
  -> sq2.pmtpa.wmnet:http         Route   10     7987       13190
  -> sq7.pmtpa.wmnet:http         Route   10     8003       7953
All the servers are getting a decent amount of traffic, there's just normal variation.
If a realserver is refusing connections or doesn't have the VIP configured, it will look like this:
IP Virtual Server version 1.2.1 (size=4096)
Prot LocalAddress:Port Scheduler Flags
  -> RemoteAddress:Port           Forward Weight ActiveConn InActConn
TCP  upload.pmtpa.wikimedia.org:h wlc
  -> sq10.pmtpa.wmnet:http        Route   10     2          151577
  -> sq1.pmtpa.wmnet:http         Route   10     2497       1014
  -> sq4.pmtpa.wmnet:http         Route   10     2459       1047
  -> sq5.pmtpa.wmnet:http         Route   10     2389       1048
  -> sq6.pmtpa.wmnet:http         Route   10     2429       1123
  -> sq8.pmtpa.wmnet:http         Route   10     2416       1024
  -> sq2.pmtpa.wmnet:http         Route   10     2389       970
  -> sq7.pmtpa.wmnet:http         Route   10     2457       1008
Active connections for the problem server are depressed, inactive connections normal or above normal. This problem must be fixed immediately, because in wlc mode, LVS load balances based on the ActiveConn column, meaning that servers that are down get all the traffic.
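The wlc bias is easy to demonstrate: the scheduler sends new connections to the server with the lowest ratio of connections to weight, so a server that instantly refuses connections (and therefore holds almost none) looks the most attractive. A quick sketch using the numbers from the unhealthy output above (only four servers shown; the real kernel scheduler also factors in inactive connections, which does not change the outcome here):

```python
# (ActiveConn, Weight) per real-server, from the unhealthy output above.
servers = {
    "sq10.pmtpa.wmnet": (2, 10),      # refusing connections
    "sq1.pmtpa.wmnet":  (2497, 10),
    "sq4.pmtpa.wmnet":  (2459, 10),
    "sq5.pmtpa.wmnet":  (2389, 10),
}

# wlc's choice for the next incoming connection: lowest ActiveConn/Weight.
best = min(servers, key=lambda s: servers[s][0] / servers[s][1])
print(best)  # the broken server wins, and keeps attracting traffic
```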
LVS director list
| Cluster | Director | VIP |
|---|---|---|
| pmtpa apaches | lvs3 | 10.2.1.1 |
| search backend 1 | lvs3 | 10.2.1.11 |
| search backend 2 | lvs3 | 10.2.1.12 |
| search backend 3 | lvs3 | 10.2.1.13 |
| rendering | lvs3 | 10.2.1.21 |
| pmtpa text | lvs4 | 208.80.152.2 |
| m | lvs4 | 208.80.152.5 |
| pmtpa upload | lvs2 | 208.80.152.3 |
| esams text | amslvs1 / amslvs3 | 91.198.174.232 |
| esams upload | amslvs2 / amslvs4 | 91.198.174.234 |
| esams bits | amslvs1 / amslvs3 | 91.198.174.233 |
A good way to generate this list is:
dsh -N ALL -f -e 'ipvsadm -l'
and look for the hosts that give you a pile of output. Because most hosts have config files for both text and upload squids, they will pretend to serve for both. You can check what they are really doing by looking at the output.
Example: output like
fuchsia: IP Virtual Server version 1.2.1 (size=1048576)
fuchsia: Prot LocalAddress:Port Scheduler Flags
fuchsia:   -> RemoteAddress:Port           Forward Weight ActiveConn InActConn
fuchsia: TCP  rr.esams.wikimedia.org:www wlc
fuchsia:   -> knsq6.esams.wikimedia.org:ww  Route  10  26707  31425
fuchsia:   -> knsq5.esams.wikimedia.org:ww  Route  10  26708  31426
fuchsia:   -> knsq24.esams.wikimedia.org:w  Route  10  26741  31116
... (more lines with lots of ActiveConn)
fuchsia: TCP  upload.esams.wikimedia.org:w wlc
fuchsia:   -> knsq17.esams.wikimedia.org:w  Route  10  0  5
fuchsia:   -> knsq13.esams.wikimedia.org:w  Route  10  0  5
fuchsia:   -> knsq19.esams.wikimedia.org:w  Route  10  0  5
... (more lines with 0 ActiveConn)
means that the host is doing LVS for rr.esams.wikimedia.org but not for upload.esams.wikimedia.org.