Eqiad Migration Planning/Steps
Day 1: Tue Jan 22
Preparation (before maintenance window)
Check LVS pools apaches, api and rendering for down/depooled machines. A few machines may be broken (and should be removed from the config for the time being), but all others should be up and passing health checks.
# ipvsadm -l
# less /var/log/pybal.log
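To screen a saved `ipvsadm -l` capture for depooled backends, a sketch like the following can help (the pool and host names in the sample input are placeholders, not real pool state; it assumes the standard ipvsadm column layout where weight is the fourth field of each `->` line):

```shell
# Sketch: flag depooled (weight 0) backends in saved `ipvsadm -l` output.
list_depooled() {
  awk '$1 == "->" && $4 == "0" { print "depooled:", $2 }'
}
# Sample input is illustrative only.
list_depooled <<'EOF'
Prot LocalAddress:Port Scheduler Flags
  -> RemoteAddress:Port           Forward Weight ActiveConn InActConn
TCP  appservers.svc.pmtpa.wmnet:http wrr
  -> srv190.pmtpa.wmnet:http      Route   10     100        5
  -> srv191.pmtpa.wmnet:http      Route   0      0          0
EOF
```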
Check whether the Nagios check for these LVS pools exists and is reporting OK.
Check whether all pooled application servers have the right LVS service IPs bound to loopback.
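One way to check this offline is to feed the output of `ip -o addr show dev lo` from each app server to a small filter. This is only a sketch: the service IP below is a placeholder, not the real LVS service IP.

```shell
# Sketch: given `ip -o addr show dev lo` output on stdin, report whether a
# given LVS service IP is bound to loopback. 10.2.1.1 is a placeholder.
has_service_ip() {
  grep -q " inet $1/" && echo "OK: $1 bound to lo" || echo "MISSING: $1"
}
has_service_ip 10.2.1.1 <<'EOF'
1: lo    inet 127.0.0.1/8 scope host lo
1: lo    inet 10.2.1.1/32 scope global lo
EOF
```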
Check deployed MediaWiki revision / git status on all application servers.
MySQL warm up?
Ensure media writes to the NetApp are disabled.
Migrate bits apaches to eqiad
Check whether the 4 bits apaches are healthy according to a bits Varnish server:
# varnishlog -i Backend_health -O
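A saved Backend_health capture can also be screened non-interactively. This is a sketch only; the sample lines below mimic the varnishlog line shape and are illustrative, not real backend state:

```shell
# Sketch: print backends whose Backend_health message is not a "healthy"
# one (e.g. "Went sick" / "Still sick") in a saved capture.
unhealthy() {
  grep -v 'healthy' || echo "all backends healthy"
}
unhealthy <<'EOF'
   0 Backend_health - vbe_srv191 Still healthy 4--X-RH 5 4 5 0.003 0.002
   0 Backend_health - vbe_srv192 Still healthy 4--X-RH 5 4 5 0.003 0.002
EOF
```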
Test a few top bits URLs manually from the new bits app servers to see if valid content is being returned. To retrieve the most requested URLs, on a bits Varnish server:
# varnishtop -i RxURL
To test such a URL, use curl, or:
fenari: $ /home/mark/firstbyte.py apache_host_name 80 bits.wikimedia.org URI
Run varnishtop for a histogram of HTTP status codes, and compare before/after migration:
# varnishtop -i TxStatus
Deploy Gerrit patch set 44251 and run Puppet for node group XXX. This will change the apache backends for the eqiad Varnish servers only, giving us a chance to fall back on pmtpa bits Varnish servers quickly if needed.
Check if the distribution of HTTP status codes changes drastically, esp. HTTP 2xx vs. 4xx/5xx.
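For the before/after comparison, a saved `varnishtop -i TxStatus` snapshot can be collapsed into status-class buckets. A sketch, assuming each input line has the shape `<count> TxStatus <code>` (the sample numbers are made up):

```shell
# Sketch: collapse a TxStatus capture into 2xx/3xx/4xx/5xx buckets so two
# snapshots (before and after the migration) are easy to diff.
status_buckets() {
  awk '$2 == "TxStatus" { total[substr($3, 1, 1) "xx"] += $1 }
       END { for (b in total) printf "%s %d\n", b, total[b] }' | sort
}
status_buckets <<'EOF'
  9500.00 TxStatus 200
   300.00 TxStatus 304
    40.00 TxStatus 404
     2.00 TxStatus 503
EOF
```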
If bits@eqiad is confirmed to work correctly, after a while deploy Gerrit patchset 44252 and run Puppet for node group XXX. This will switch the pmtpa bits Varnish servers to use the eqiad bits appservers as well.
Set all database shards to read-only
Core databases
External storage
Parser Cache
Parser cache configuration currently lives in wmf-config/CommonSettings.php (near line 350).
Gerrit patchset XXX
Redis
Redis configuration currently lives in wmf-config/CommonSettings.php (near line 360).
Gerrit patchset XXX
Redis master switch
Memcached
Memcached configuration currently lives in wmf-config/mc.php.
Note that because memcached content is not replicated between the data center sites, Tampa's memcached servers will need to be cleared prior to switching back.
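A dry-run sketch of what that clearing could look like (hostnames are placeholders; the real run would pipe memcached's `flush_all` command to port 11211 on each Tampa server, e.g. with netcat, rather than just printing the commands):

```shell
# Sketch (dry run): print one flush command per memcached host instead of
# executing it. Replace the placeholder hosts with the real pmtpa pool.
flush_commands() {
  for host in "$@"; do
    printf '%s\n' "printf 'flush_all\r\n' | nc -q 1 ${host} 11211"
  done
}
flush_commands mc1 mc2   # placeholder hostnames
```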
Master switch on all database shards
General master switch instructions are here: Switch_master.
Possibly MHA will be used?
Text Squids backend changes
This is the actual switch of directing clients to eqiad Apaches.
The Squid configuration resides in /home/w/conf/squid on fenari, and is now backed by a git repository. Mark has prepared 3 commits in the eqiad-switchover branch, which migrate the image scalers, the API application servers and the regular application servers to eqiad.
For each of these commits, use the following sequence:
Merge the commit onto master:
$ git merge XXX
As root, run make to generate the new configuration files. Make sure there are no permission errors.
# make
Now, run a diff of all new configurations against the configurations currently deployed. Make sure the differences reflect the backend changes you expect.
# diff -ru deployed/ generated/ | less
Finally, deploy the configurations to all Squids. Make sure you have ssh agent forwarding enabled for this step. The configurations will be deployed directly and become active immediately, but will also be pushed to Puppet's volatile file module.
# ./deploy cache
(you can deploy to just pmtpa.text and eqiad.text if you prefer, as long as you do both.)
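The per-commit sequence above can be sketched as a dry-run helper that only echoes each step (the commit id is a placeholder; the real run happens in /home/w/conf/squid on fenari, with make and the deploy executed as root):

```shell
# Sketch (dry run): echo the per-commit Squid config deploy sequence
# instead of executing it.
squid_deploy_steps() {
  echo "git merge $1"
  echo "make"
  echo "diff -ru deployed/ generated/ | less"
  echo "./deploy cache"
}
squid_deploy_steps '<commit>'   # placeholder commit id
```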
First migrate the image scalers. They run a limited subset of MediaWiki, and any problems are unlikely to cause harm.
Next, the API application servers.
Finally, normal clients: the regular application servers.
Mobile Varnish backend changes
Deploy Gerrit patch set 44257, and run Puppet on hosts cp1041 .. cp1044.