LVS

From Wikitech
Revision as of 15:43, 16 July 2010 by Mark (Talk | contribs)

Wikimedia uses LVS for balancing traffic over multiple servers.

Overview

[Diagram: LVS setup at esams (Esams LVS.png)]

We use LVS-DR, or Direct Routing. This means that only forward (incoming) traffic is balanced by the load balancer, and return traffic does not even go through the load balancer. Essentially, the LVS balancer receives traffic for a given service IP and port, selects one out of multiple "real servers", and then forwards the packet to that real server with only a modified destination MAC address. The destination servers also listen to and accept traffic for the service IP, but don't advertise it over ARP. Return traffic is simply sent directly to the gateway/router.

The LVS balancer and the real servers need to be in the same subnet for this to work.
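The forwarding step can be sketched conceptually as follows. This is illustrative Python, not IPVS kernel code; the packet and server structures are hypothetical:

```python
def lvs_dr_forward(packet, realservers):
    """Conceptual sketch of LVS-DR forwarding (not actual IPVS code).

    Only the Ethernet destination is rewritten; the IP header, including
    the service VIP, is left untouched. The chosen real server must
    therefore accept traffic for the VIP locally (bound on lo, not
    ARP-advertised), and its replies go straight to the gateway,
    bypassing the balancer entirely.
    """
    # Pick a real server, e.g. by least active connections.
    server = min(realservers, key=lambda s: s["active_conns"])
    # Rewrite only the L2 destination; the VIP stays in the IP header.
    forwarded = dict(packet, eth_dst=server["mac"])
    assert forwarded["ip_dst"] == packet["ip_dst"]
    return server, forwarded
```

This also makes clear why balancer and real servers must share a subnet: the balancer reaches the real server purely by MAC address, with no routing hop in between.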

The real servers are monitored by a Python program called PyBal. It performs several kinds of health checks to determine which servers are usable, and pools and depools them accordingly. You can follow what PyBal is doing in the log file /var/log/pybal.log.

PyBal also has an integrated BGP module that Mark has written (Twisted BGP, available in the MediaWiki SVN repository). This is used as a failover/high availability protocol between the LVS balancers (PyBal) and the routers. PyBal announces the LVS service IPs to the router(s) to indicate that it is alive and can serve traffic. This also removes the need to manually configure the service IPs on the active balancers. Esams already has this BGP setup; pmtpa still uses the old manual setup, but will follow soon.

HOWTO

Pool or depool hosts

For pmtpa: edit the files in /etc/pybal/ directly. PyBal will reread the file within a minute.

For esams: edit the files in /home/w/conf/pybal/esams/ and wait a minute - PyBal will fetch the file over HTTP.

If you set a host to disabled, PyBal will continue to monitor it, but will not pool it:

{ 'host': 'knsq1.esams.wikimedia.org', 'weight': 10, 'enabled': False }

If you comment the line, PyBal will forget about it completely.
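Each pool file line is a Python dict literal, so the entries can be read roughly like this (a sketch of the file format described above; PyBal's actual parser may differ):

```python
import ast

def parse_pool_file(text):
    """Parse pool-file lines of the form
    { 'host': 'knsq1.esams.wikimedia.org', 'weight': 10, 'enabled': False }

    Commented (#) and blank lines are skipped entirely, matching the
    behaviour described above: a commented-out host is forgotten completely.
    """
    entries = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        entries.append(ast.literal_eval(line))
    return entries

def pooled_hosts(entries):
    # Disabled hosts are still monitored by PyBal, but never pooled.
    return [e["host"] for e in entries if e.get("enabled", True)]
```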

In an emergency, for example if PyBal is not working for some reason, you can do this manually using ipvsadm:

ipvsadm -d -t VIP:PORT -r REALSERVER

For example:

ipvsadm -d -t 91.198.174.232:80 -r knsq1.esams.wikimedia.org

Note that PyBal won't know about this, so make sure you bring the situation back in sync.

See which LVS balancer is active for a given service

Authoritatively, by asking the directly attached routers. You can request the route for a given service IP. E.g. on Foundry:

csw1-esams#show ip route 91.198.174.234
Type Codes - B:BGP D:Connected I:ISIS S:Static R:RIP O:OSPF; Cost - Dist/Metric
Uptime - Days:Hours:Minutes:Seconds 
        Destination        Gateway         Port        Cost     Type Uptime
1       91.198.174.234/32  91.198.174.110  ve 1        20/1     B    10:14:28:44

So 91.198.174.110 (amslvs2) is active for the Upload LVS service IP 91.198.174.234.

On Juniper:

csw2-esams> show route 91.198.174.232 

inet.0: 38 destinations, 41 routes (38 active, 0 holddown, 0 hidden)
+ = Active Route, - = Last Active, * = Both

91.198.174.232/32  *[BGP/170] 19:38:18, localpref 100, from 91.198.174.247
                      AS path: 64600 I
                    > to 91.198.174.109 via vlan.100
                    [BGP/170] 1w3d 14:24:52, MED 10, localpref 100
                      AS path: 64600 I
                    > to 91.198.174.111 via vlan.100

So 91.198.174.109 (the entry marked *) is active for the Text LVS service IP 91.198.174.232.

LVS installation

esams has a newer setup that uses Puppet and automatic BGP failover. Puppet arranges the service IP configuration and the installation of packages. To configure the service IPs that an LVS balancer should serve (both primary and backup!), set the $lvs_balancer_ips variable:

node /amslvs[1-4]\.esams\.wikimedia\.org/ {
        $cluster = "misc_esams"

        $lvs_balancer_ips = [ "91.198.174.2", "91.198.174.232", "91.198.174.233", "91.198.174.234" ]

        include base,
                ganglia,
                lvs::balancer
}

In this setup, all 4 hosts amslvs1-amslvs4 are configured to serve all four service IPs, although in practice every service IP is only ever served by one out of two hosts due to the router configuration.

Puppet uses the (now misleadingly named) wikimedia-lvs-realserver package to bind these IPs to the loopback (!) interface. This is to make sure that a server answers on these IPs, but does not announce them via ARP - we'll use BGP for that.
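Conceptually, what the package does for a set of service IPs amounts to the following (an illustrative sketch only; the package's actual mechanism may differ):

```python
def loopback_bind_commands(service_ips):
    """Sketch of what binding service IPs on the loopback amounts to.

    Per the description above: the host answers on each VIP, but does
    not advertise it via ARP, so the router keeps steering traffic to
    whichever balancer announces the VIP over BGP.
    (Illustrative only; wikimedia-lvs-realserver's actual implementation
    may differ.)
    """
    return ["ip addr add %s/32 dev lo" % ip for ip in service_ips]
```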

PyBal configuration

Puppet currently does not configure PyBal.

Edit the configuration in /etc/pybal/pybal.conf. This is an INI-format file, with one section for each LVS service, and a [global] section with global settings. The default configuration file should provide hints.

Put a local copy of the conf files in /home/wikipedia/conf/pybal/cluster/ under the hostname; see the README.txt in that directory.

There is more information on PyBal if you need it.

BGP failover and load sharing

Previously, the LVS balancer that had a certain service IP bound to its eth0 interface was active for that IP. To do failovers, the IP had to be moved manually.

In the new setup, multiple servers announce the service IP(s) via BGP to the router(s), which then pick which server(s) to use based on BGP routing policy.

PyBal BGP configuration

In the global section, the following BGP related settings exist:

bgp = yes

Enables BGP globally; can be overridden per service.

bgp-local-asn =  64600

The ASN to use when communicating with the routers. All prefixes are announced with this ASN in the AS path.

bgp-peer-address = 91.198.174.247

The IP of the router this PyBal instance speaks BGP to.

#bgp-as-path = 64600 64601

An optional modified AS path. Can be used e.g. to make the AS path longer and thus less attractive (on a backup balancer).
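Taken together, the settings above form a [global] section like the following. It is shown here being read with Python's stdlib configparser purely to illustrate the INI shape; PyBal's own parsing may differ:

```python
import configparser

# The BGP-related [global] settings described above, with the values
# used in this page's examples.
sample = """
[global]
bgp = yes
bgp-local-asn = 64600
bgp-peer-address = 91.198.174.247
# bgp-as-path = 64600 64601
"""

cfg = configparser.ConfigParser()
cfg.read_string(sample)
assert cfg.getboolean("global", "bgp") is True
assert cfg.getint("global", "bgp-local-asn") == 64600
```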

Example BGP configuration for Foundry

router bgp
 neighbor 91.198.174.109 remote-as 64600
 neighbor 91.198.174.109 description "PyBal on amslvs1"
 neighbor 91.198.174.109 timers  keep-alive 10  hold-time 30
 neighbor 91.198.174.109 update-source loopback 1
 neighbor 91.198.174.110 remote-as 64600
 neighbor 91.198.174.110 description "PyBal on amslvs2"
 neighbor 91.198.174.110 timers  keep-alive 10  hold-time 30
 neighbor 91.198.174.110 update-source loopback 1

 neighbor 91.198.174.244 description "iBGP to csw2-esams"
 neighbor 91.198.174.244 timers  keep-alive 10  hold-time 30
 neighbor 91.198.174.244 update-source loopback 1


 neighbor 91.198.174.109 prefix-list LVS in
 neighbor 91.198.174.109 prefix-list none out
 neighbor 91.198.174.110 maximum-prefix 10 teardown
 neighbor 91.198.174.110 prefix-list LVS in                       
 neighbor 91.198.174.110 prefix-list none out

 neighbor 91.198.174.244 maximum-prefix 10 teardown
 neighbor 91.198.174.244 prefix-list LVS in
 neighbor 91.198.174.244 prefix-list LVS out
 neighbor 91.198.174.244 unsuppress-map LVS-IBGP-EXCHANGE
!

ip prefix-list  LVS seq 5 permit 91.198.174.0/25 ge 32 
ip prefix-list  LVS seq 10 permit 91.198.174.232/30 ge 32 


route-map  ospf_bgp_export permit  10 
 match ip address prefix-list LVS
 match as-path  ^64600$
route-map  ospf_bgp_export deny  100 

route-map  LVS-IBGP-EXCHANGE permit  10 
 match ip address prefix-list LVS
route-map  LVS-IBGP-EXCHANGE deny  100

Example BGP configuration for Juniper (csw2-esams)

root@csw2-esams> show configuration protocols bgp    
group PyBal {
    type external;
    multihop {
        ttl 1;
    }
    local-address 91.198.174.244;
    hold-time 30;
    import LVS_import;
    family inet {
        unicast {
            prefix-limit {
                maximum 10;
                teardown;
            }
        }
    }
    export NONE;
    peer-as 64600;
    neighbor 91.198.174.111;
    neighbor 91.198.174.112;
}
group iBGP {
    type internal;
    peer-as 43821;
    neighbor 91.198.174.247 {
        import LVS_exchange;
        export LVS_exchange;
    }
}

root@csw2-esams> show configuration policy-options 
prefix-list LVS {
    91.198.174.0/25;
    91.198.174.232/30;
}
policy-statement LVS_exchange {
    term 1 {
        from {
            prefix-list-filter LVS longer;
        }
        then accept;
    }
    from protocol bgp;
}
policy-statement LVS_import {
    term 1 {
        from {
            protocol bgp;
            prefix-list-filter LVS longer;
        }
        then {
            metric add 10;
            accept;
        }
    }
}

The LVS_import policy adds metric 10 to the "routes" (service IPs) received from the secondary (backup) LVS balancers. This means that the router will regard them as less preferred.
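The effect can be illustrated with a toy best-path choice (simplified: real BGP best-path selection compares several attributes before MED):

```python
# Routes for one service IP as the router might see them: the primary
# balancer's announcement unchanged, the backup's with +10 metric (MED)
# from LVS_import. Lower MED wins, so the primary carries the traffic
# until its announcement disappears, at which point the backup's route
# becomes best automatically.
routes = [
    {"nexthop": "91.198.174.109", "med": 0},   # primary balancer
    {"nexthop": "91.198.174.111", "med": 10},  # backup, "metric add 10"
]
best = min(routes, key=lambda r: r["med"])
```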

At esams, Foundry router csw1-esams and JUNOS router csw2-esams exchange the service IPs over iBGP.

SSH checking

As the Apache cluster often suffers from broken disks which break SSH but keep Apache up, I have implemented a RunCommand monitor in PyBal which periodically runs an arbitrary command and checks the server's health by its return code. If the command does not return within a certain timeout, the server is marked down as well.

The RunCommand configuration is in /etc/pybal/pybal.conf:

runcommand.command = /bin/sh
runcommand.arguments = [ '/etc/pybal/runcommand/check-apache', server.host ]
runcommand.interval = 60
runcommand.timeout = 10
runcommand.command 
The path to the command to run. Since the check is a shell script and PyBal does not invoke a shell by itself, we have to do that explicitly.
runcommand.arguments 
A (Python) list of command arguments. The list can refer to the monitor's server object, as shown here.
runcommand.interval 
How often to run the check, in seconds.
runcommand.timeout 
The command timeout: after this many seconds, the entire process group of the command is killed (SIGKILL) and the server is marked down.
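The timeout semantics can be sketched as follows (illustrative Python, not PyBal's actual implementation):

```python
import os
import signal
import subprocess

def run_check(command, arguments, timeout):
    """Sketch of RunCommand semantics (not PyBal's actual code):
    run the command in its own process group; on timeout, SIGKILL the
    whole group and treat the server as down. Success = exit status 0."""
    proc = subprocess.Popen([command] + arguments,
                            preexec_fn=os.setsid)  # new process group
    try:
        return proc.wait(timeout=timeout) == 0
    except subprocess.TimeoutExpired:
        # Kill the entire process group, not just the direct child,
        # so a hung shell script's children die too.
        os.killpg(os.getpgid(proc.pid), signal.SIGKILL)
        proc.wait()
        return False
```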

Currently we're using the following RunCommand script, in /etc/pybal/runcommand/check-apache:

#!/bin/sh

set -e

HOST=$1
SSH_USER=pybal-check
SSH_OPTIONS="-o PasswordAuthentication=no -o StrictHostKeyChecking=no -o ConnectTimeout=8"

# Open an SSH connection to the real-server. The command is overridden by the authorized_keys file.
ssh -i /root/.ssh/pybal-check $SSH_OPTIONS $SSH_USER@$HOST true

exit 0

The limited ssh accounts on the application servers are managed by the wikimedia-task-appserver package.

Old

To install an LVS load balancer, on a base Ubuntu install, do:

  1. apt-get install pybal (ignore the warning about the kernel not supporting IPVS)
  2. Set up configuration in /etc/pybal/
  3. Restart PyBal and check whether it is working correctly (tail /var/log/pybal.log)
  4. Bind the LVS ip(s) to the external interface (usually eth0); for persistence after booting add the following line to the loopback interface block in /etc/network/interfaces:
up ip addr add ip/32 dev $IFACE

Diagnosing problems

Run ipvsadm -l on the director. Healthy output looks like this:

IP Virtual Server version 1.2.1 (size=4096)
Prot LocalAddress:Port Scheduler Flags
  -> RemoteAddress:Port           Forward Weight ActiveConn InActConn
TCP  upload.pmtpa.wikimedia.org:h wlc
  -> sq10.pmtpa.wmnet:http        Route   10     5202       5295
  -> sq1.pmtpa.wmnet:http         Route   10     8183       12213
  -> sq4.pmtpa.wmnet:http         Route   10     7824       13360
  -> sq5.pmtpa.wmnet:http         Route   10     7843       12936
  -> sq6.pmtpa.wmnet:http         Route   10     7930       12769
  -> sq8.pmtpa.wmnet:http         Route   10     7955       11010
  -> sq2.pmtpa.wmnet:http         Route   10     7987       13190
  -> sq7.pmtpa.wmnet:http         Route   10     8003       7953

All the servers are getting a decent amount of traffic, there's just normal variation.

If a realserver is refusing connections or doesn't have the VIP configured, it will look like this:

IP Virtual Server version 1.2.1 (size=4096)
Prot LocalAddress:Port Scheduler Flags
  -> RemoteAddress:Port           Forward Weight ActiveConn InActConn
TCP  upload.pmtpa.wikimedia.org:h wlc
  -> sq10.pmtpa.wmnet:http        Route   10     2          151577
  -> sq1.pmtpa.wmnet:http         Route   10     2497       1014
  -> sq4.pmtpa.wmnet:http         Route   10     2459       1047
  -> sq5.pmtpa.wmnet:http         Route   10     2389       1048
  -> sq6.pmtpa.wmnet:http         Route   10     2429       1123
  -> sq8.pmtpa.wmnet:http         Route   10     2416       1024
  -> sq2.pmtpa.wmnet:http         Route   10     2389       970
  -> sq7.pmtpa.wmnet:http         Route   10     2457       1008

Active connections for the problem server are depressed, while inactive connections are normal or above normal. This problem must be fixed immediately: in wlc mode, LVS balances on the ActiveConn column, so servers that are down (and therefore accumulate almost no active connections) attract the bulk of the new traffic.
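A toy model of why this happens (simplified: the real IPVS wlc estimator also weighs inactive connections, but this is enough to show the failure mode):

```python
def wlc_pick(servers):
    """Simplified weighted-least-connection choice: the server with the
    lowest active-connection count per unit of weight receives the next
    connection."""
    return min(servers, key=lambda s: s["active"] / s["weight"])

# The broken server from the second listing above: almost no active
# connections, so wlc keeps sending new clients straight to it.
servers = [
    {"name": "sq10", "active": 2,    "weight": 10},
    {"name": "sq1",  "active": 2497, "weight": 10},
]
```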

LVS director list

 Cluster           Director             VIP
 pmtpa apaches     lvs3                 10.2.1.1
 search backend 1  lvs3                 10.2.1.11
 search backend 2  lvs3                 10.2.1.12
 search backend 3  lvs3                 10.2.1.13
 rendering         lvs3                 10.2.1.21
 pmtpa text        lvs4                 208.80.152.2
 m                 lvs4                 208.80.152.5
 pmtpa upload      lvs2                 208.80.152.3
 esams text        amslvs1 / amslvs3    91.198.174.232
 esams upload      amslvs2 / amslvs4    91.198.174.234
 esams bits        amslvs1 / amslvs3    91.198.174.233

A good way to generate this list is:

 dsh -N ALL -f -e 'ipvsadm -l ' 

and look for the hosts that give you a pile of output. Because most hosts have configuration files for both text and upload squids, they will appear to serve both. You can check what they are really doing by looking at the output.

Example: output like

fuchsia:  	IP Virtual Server version 1.2.1 (size=1048576)
fuchsia:  	Prot LocalAddress:Port Scheduler Flags
fuchsia:  	  -> RemoteAddress:Port           Forward Weight ActiveConn InActConn
fuchsia:  	TCP  rr.esams.wikimedia.org:www wlc
fuchsia:  	  -> knsq6.esams.wikimedia.org:ww Route   10     26707      31425     
fuchsia:  	  -> knsq5.esams.wikimedia.org:ww Route   10     26708      31426     
fuchsia:  	  -> knsq24.esams.wikimedia.org:w Route   10     26741      31116     
... (more lines with lots of ActiveConn and InActConn)
fuchsia:  	TCP  upload.esams.wikimedia.org:w wlc
fuchsia:  	  -> knsq17.esams.wikimedia.org:w Route   10     0          5         
fuchsia:  	  -> knsq13.esams.wikimedia.org:w Route   10     0          5         
fuchsia:  	  -> knsq19.esams.wikimedia.org:w Route   10     0          5         
... (more lines with 0 ActiveConn)

means that the host is doing LVS for rr.esams.wikimedia.org but not for upload.esams.wikimedia.org.

Note that with newer kernels, ActiveConn remains 0 always.
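The "look at the output" step can be roughly automated (a heuristic sketch with an arbitrary threshold; both counters are summed because, as noted, ActiveConn alone is useless on newer kernels):

```python
def active_services(ipvsadm_output, threshold=100):
    """Heuristic sketch: sum ActiveConn + InActConn per virtual service
    in `ipvsadm -l` output and report services above a traffic
    threshold. The threshold is arbitrary: a service a host is not
    really balancing still shows a few stray inactive connections, as
    in the upload example above."""
    totals, current = {}, None
    for line in ipvsadm_output.splitlines():
        line = line.strip()
        if line.startswith(("TCP", "UDP")):
            current = line.split()[1]        # the service name/IP:port
            totals[current] = 0
        elif line.startswith("->") and current is not None:
            parts = line.split()
            # Last two columns are ActiveConn and InActConn.
            totals[current] += int(parts[-2]) + int(parts[-1])
    return [svc for svc, total in totals.items() if total > threshold]
```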
