Eqiad Migration Planning
From Wikitech
Revision as of 16:32, 13 September 2012
Coordination
- We now have an incomplete tracking ticket in RT that depends on more specific tickets.
- Sept 12 Update - http://etherpad.wikimedia.org/TechOps-12Sept2012
Needed Server Builds
- App and API Apaches
- Image scalers? (not critical to be on Precise) The current Lucid build is usable if the Precise build is not ready
- Apache - completed, and being deployed in both Tampa (in production) and Eqiad (for testing)
- JobRunners
- will be on separate boxes in eqiad
- Peter will set up separate (i.e. non-appserver) job runners in Tampa first, as part of the precise upgrades
- once done, set up the same in eqiad
- we'll raise the fork limit from 5 to something like 12
- Swift
- servers online; needs cluster replication enabled
- Still need to migrate Math, Captcha, Misc objects from ms7 to Swift
- H/w issues need to be resolved
- Might have to run Swift and ImageScalers in Tampa while the rest of the stack runs in Eqiad
- Aaron to test performance lag
- Memcached servers
- mc1001-mc1010 are wired; cables for mc1011+ are arriving and will be wired next week.
- Networked, but there are issues with the Intel 10G NICs and Dell's SFP+ modules
- None yet installed
- Decided to use Redis: use MediaWiki's multi-write feature to write to both the existing memcached servers and the new ones, then enable Redis replication from Tampa to Eqiad
- Databases - done
- <strike>one more slave is needed per shard</strike>
- Poolcounter
- Done: helium and potassium are installed and puppetized
- Netapp
- /home/wikipedia for deployments
- /home - completed
- Deployment server (fenari's deployment support infrastructure part, misc::deployment etc)
- awaiting new misc server racking next week
- Hume equivalent (misc::maintenance)
- Application logging server - for mediawiki wmerrors + apache syslog
- eqiad version of the udp2log instance on nfs1 that writes to /home/w/logs - Done: server 'flourine' for apache logs
- Upload Varnish
- Reinstall with Precise and the streaming+persistent patches; currently testing the build in production, serving about 100 Mbps
- Still to be deployed on the rest of the servers (from 8 today to 16)
- Deployment Host (a 10-minute job according to Tim, who volunteered to own it)
- Deploy from deployment host to all application servers
- rsync the deployment code from the primary deployment server to the secondary
- Require a clean git repo
- Application servers in the other datacenter will use the secondary deployment system for rsync
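The memcached/Redis multi-write plan above can be sketched as a cache wrapper that fans every write out to both tiers while reads try the new tier first. This is a minimal illustration, not MediaWiki's actual implementation; the `DictBackend` class and key names are hypothetical stand-ins for real memcached/Redis clients.

```python
class DictBackend:
    """Hypothetical stand-in for a memcached or Redis client."""
    def __init__(self):
        self.store = {}
    def set(self, key, value):
        self.store[key] = value
    def get(self, key):
        return self.store.get(key)

class MultiWriteCache:
    # Writes fan out to every tier; reads try tiers in order,
    # so the new tier warms up while the old one stays authoritative.
    def __init__(self, tiers):
        self.tiers = tiers
    def set(self, key, value):
        for tier in self.tiers:
            tier.set(key, value)
    def get(self, key):
        for tier in self.tiers:
            value = tier.get(key)
            if value is not None:
                return value
        return None

old_mc, new_redis = DictBackend(), DictBackend()
cache = MultiWriteCache([new_redis, old_mc])  # prefer the new tier on reads
cache.set("user:42:session", "abc123")       # lands in both tiers
```

Once both tiers hold the same data, cutting over is just dropping the old tier from the list, and Redis replication can then carry the new tier's contents from Tampa to Eqiad.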
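The deployment-host flow above (clean git repo, rsync primary to secondary) can be sketched as follows. The paths and hostname are hypothetical, and this is only an illustration of the checks and commands involved, not the actual tooling.

```python
import subprocess

DEPLOY_ROOT = "/home/wikipedia/common"   # hypothetical deployment tree
SECONDARY = "deploy-secondary.eqiad"     # hypothetical secondary deploy host

def repo_is_clean(path):
    # "Require a clean git repo": refuse to deploy with uncommitted changes.
    out = subprocess.run(["git", "-C", path, "status", "--porcelain"],
                         capture_output=True, text=True)
    return out.returncode == 0 and out.stdout.strip() == ""

def sync_command(src, dest_host, dest_path):
    # Build the rsync invocation that mirrors the deployment tree from the
    # primary deployment server to the secondary (-a preserves metadata,
    # --delete removes files that no longer exist on the primary).
    return ["rsync", "-a", "--delete", src + "/", f"{dest_host}:{dest_path}/"]

cmd = sync_command(DEPLOY_ROOT, SECONDARY, DEPLOY_ROOT)
```

Application servers in the non-primary datacenter would then rsync from the secondary rather than the primary, keeping cross-datacenter traffic to a single copy.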
Software / Config Requirements
- Varnish software to handle media streaming efficiently
- awaiting patch from Varnish Software (target Sept?)
- patch MediaWiki to use a different upload hostname for large files; then Squid or a specialized media-streaming proxy could serve them
- MediaWiki deploy support for per colo config variances [Bugzilla 39082]
- generating eqiad and pmtpa dsh groups
- new mediawiki conf files for eqiad
- replicating the git checkouts, etc. to new /home
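Generating the eqiad and pmtpa dsh groups could look something like the sketch below. It assumes the WMF host-numbering convention where eqiad hosts carry numbers of 1000 and up (e.g. mw1001) while Tampa hosts sit below that; the host list is a made-up example.

```python
import re

# Hypothetical host inventory for illustration.
HOSTS = ["mw50", "mw51", "mw1001", "mw1002", "srv281"]

def datacenter_of(host):
    # Assumed convention: trailing number >= 1000 means eqiad, else pmtpa.
    match = re.search(r"(\d+)$", host)
    number = int(match.group(1)) if match else 0
    return "eqiad" if number >= 1000 else "pmtpa"

def dsh_groups(hosts):
    # Bucket hosts into one dsh node-group file per datacenter.
    groups = {"eqiad": [], "pmtpa": []}
    for host in hosts:
        groups[datacenter_of(host)].append(host)
    return groups

groups = dsh_groups(HOSTS)
```

Each list would then be written out as a dsh group file (one hostname per line), so deploy scripts can target a single datacenter.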
Actually Failing Over
- deploy db.php with all shards set to read-only in both pmtpa and eqiad
- deploy squid and mobile + bits varnish configs pointing to eqiad apaches
- master swap every core db and writable es shard to eqiad
- deploy db.php in eqiad removing the read-only flag, leave it read-only in pmtpa
- the above master-swap + db.php deploys can be done shard by shard to limit the time certain projects are read-only
- DNS changes - our current steady state is to point wikipedia-lb.wikimedia.org in the US to eqiad, but future scenarios may include external DNS switches
- Swift replication reversal - from Eqiad to Tampa
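The shard-by-shard read-only dance above can be laid out as a simple ordered plan, which limits how long any one project stays read-only. This is an illustrative sketch only; the real procedure edits db.php and pushes it through the deployment host, and the shard names here are examples.

```python
def failover_shard(shard):
    # The three-step swap for one database shard, in order.
    steps = []
    steps.append(f"set {shard} read-only in pmtpa and eqiad")
    steps.append(f"promote eqiad replica to master for {shard}")
    steps.append(f"clear read-only for {shard} in eqiad (pmtpa stays read-only)")
    return steps

# Running shards one at a time keeps the read-only window per project short.
plan = [step for shard in ["s1", "s2", "s3"] for step in failover_shard(shard)]
```

Because each shard completes its swap before the next begins, a problem on one shard halts the plan without leaving every project read-only at once.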
Improving Failover
- pre-generate squid + varnish configs for different primary datacenter roles
- implement MHA to better automate the mysql master failovers
- migrate session storage to redis, with redundant replicas across colos
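Pre-generating squid/varnish configs per primary-datacenter role could work roughly as below: render one config per datacenter ahead of time, so failover swaps files instead of editing them under pressure. The template fragment and backend IPs are invented for illustration.

```python
# Hypothetical backend addresses, one per candidate primary datacenter.
BACKEND_BY_DC = {"pmtpa": "10.0.0.10", "eqiad": "10.64.0.10"}

# Minimal made-up config fragment; {{ }} escapes literal braces for str.format.
TEMPLATE = 'backend appservers {{ .host = "{backend}"; .port = "80"; }}'

def render_configs(template, backends):
    # Produce a ready-to-install config for each primary-datacenter role.
    return {dc: template.format(backend=ip) for dc, ip in backends.items()}

configs = render_configs(TEMPLATE, BACKEND_BY_DC)
```

During a failover, the operator installs the pre-rendered file for the new primary datacenter rather than hand-editing configs, removing one error-prone step from the critical path.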
See more
- Records and original tracking doc - http://etherpad.wikimedia.org/EQIAD-rollout-sequence
- Category:Eqiad cluster
Parking Lot Issues
- Identify and plan around the deployment/migration date - tentatively Oct 15, 2012. Need to communicate date.
- Migration needs to happen before Fundraising season starts in Nov.
- Vacation 'freeze'; all hands on deck the week before and after deployment
- Why? Not every person is vital to the migration.
- Migrate ns1 from Tampa to Ashburn; not a critical item