Eqiad Migration Planning

Revision as of 16:32, 13 September 2012

Coordination

Needed Server Builds

  • App and API Apaches
    • Image scalers? (not critical to be on Precise) The current Lucid build is usable if the Precise build is not ready
    • Apache - completed, and being deployed in both Tampa (in production) and Eqiad (for testing)
  • JobRunners
    • Will be on separate boxes in Eqiad
    • Peter will set up separate (i.e. non-appserver) job runners in Tampa first, as part of the Precise upgrades
    • Once done, set up the same in Eqiad
    • We'll raise the fork limit from 5 to something like 12
  • Swift
    • Servers are online; cluster replication still needs to be enabled
    • Still need to migrate Math, Captcha, and Misc objects from ms7 to Swift
    • Hardware issues need to be resolved
    • Might have to run Swift and the ImageScalers in Tampa while the rest of the stack is running in Eqiad
    • Aaron to test performance lag
  • Memcached servers
    • mc1001-mc1010 are wired; cables for mc1011+ are arriving and will be wired next week
    • Networked, but there are issues with the Intel 10G NICs and Dell's SFP+ modules
    • None yet installed
    • Decided to use Redis: use MediaWiki's multi-write feature to write to both the existing memcached servers and the new ones, then enable Redis replication from Tampa to Eqiad (see the first sketch after this list)
  • Databases - done
    • one more slave per shard - done
  • Poolcounter
    • Done: helium and potassium are installed and puppetized
  • Netapp
    • /home/wikipedia for deployments
    • /home - completed
  • Deployment server (the deployment-support part of fenari's infrastructure: misc::deployment, etc.)
    • awaiting new misc server racking next week
  • Hume equivalent (misc::maintenance)
  • Application logging server - for MediaWiki wmerrors + Apache syslog
    • eqiad version of the udp2log instance on nfs1 that writes to /home/w/logs
    • Done: server 'fluorine' for Apache logs
  • Upload Varnish
    • Reinstall with Precise plus the streaming and persistence patches; currently testing the build in production, serving about 100 Mbps
    • Still to be deployed on the rest of the servers (from 8 today to 16)
  • Deployment Host (a 10-minute job according to Tim, who volunteered to own it)
    • Deploy from the deployment host to all application servers
    • rsync the deployment code from the primary deployment server to the secondary
    • Requires a clean git repo
    • Application servers in the other datacenter will use the secondary deployment system for rsync (see the second sketch after this list)
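
The multi-write plan under Memcached above can be illustrated with a minimal sketch (the first one referenced). This is plain Python, not MediaWiki's actual cache configuration; the backend interface here is an assumption for illustration.

<pre>
# Minimal sketch of multi-write caching: every write goes to both the old
# memcached pool and the new Redis pool; reads stay on the old pool until
# the new one is warm. Backends are duck-typed stand-ins for real clients.

class MultiWriteCache:
    def __init__(self, primary, secondaries):
        self.primary = primary          # serves all reads
        self.secondaries = secondaries  # write-only while warming up

    def get(self, key):
        return self.primary.get(key)

    def set(self, key, value):
        self.primary.set(key, value)
        for backend in self.secondaries:
            backend.set(key, value)     # best-effort mirror write

    def delete(self, key):
        self.primary.delete(key)
        for backend in self.secondaries:
            backend.delete(key)


class DictBackend(dict):
    """Trivial in-memory stand-in for a memcached/Redis client."""
    def set(self, key, value):
        self[key] = value

    def delete(self, key):
        self.pop(key, None)


old_mc, new_redis = DictBackend(), DictBackend()
cache = MultiWriteCache(old_mc, [new_redis])
cache.set("user:42", "cached row")
assert new_redis["user:42"] == "cached row"  # new pool is warming up
</pre>

Once the new pool has been taking writes for longer than the longest cache TTL, reads can be flipped to it, and Redis replication then carries the warmed data from Tampa to Eqiad.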
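
Likewise, a rough sketch of the Deployment Host flow (the second one referenced). Hostnames, paths, and dsh group names are placeholders; the notes don't pin down the real ones.

<pre>
# Sketch of the two-tier deployment flow: verify a clean git repo, rsync
# the code to the secondary deployment host, then have each datacenter's
# app servers rsync from their local deployment host via dsh.
import subprocess

DEPLOY_ROOT = "/srv/deployment"        # placeholder checkout location
PRIMARY = "deploy1.pmtpa.example"      # placeholder primary deployment host
SECONDARY = "deploy1.eqiad.example"    # placeholder secondary host

def require_clean_git(path):
    """Enforce the 'clean git repo' requirement before deploying."""
    out = subprocess.run(["git", "-C", path, "status", "--porcelain"],
                         capture_output=True, text=True, check=True)
    if out.stdout.strip():
        raise SystemExit("deployment repo has local changes; aborting")

def sync_secondary():
    """rsync the deployment code from the primary host to the secondary."""
    subprocess.run(["rsync", "-az", "--delete",
                    f"{DEPLOY_ROOT}/", f"{SECONDARY}:{DEPLOY_ROOT}/"],
                   check=True)

def deploy_to(dsh_group, source_host):
    """Each app server pulls from its own datacenter's deployment host."""
    subprocess.run(["dsh", "-g", dsh_group,
                    f"rsync -az {source_host}:{DEPLOY_ROOT}/ {DEPLOY_ROOT}/"],
                   check=True)

if __name__ == "__main__":
    require_clean_git(DEPLOY_ROOT)
    sync_secondary()
    deploy_to("apaches-pmtpa", PRIMARY)
    deploy_to("apaches-eqiad", SECONDARY)
</pre>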

Software / Config Requirements

  • Varnish software to handle media streaming efficiently
    • awaiting patch from Varnish Software (target Sept?)
    • Or patch MediaWiki to use a different upload hostname for large files; Squid or a specialized media-streaming proxy could then serve them (see the first sketch after this list)
  • MediaWiki deploy support for per-colo config variances [Bugzilla 39082]
    • generating the eqiad and pmtpa dsh groups (see the second sketch after this list)
    • new mediawiki conf files for eqiad
  • replicating the git checkouts, etc. to the new /home
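
The large-file idea above (first sketch) amounts to routing by size. The threshold and the streaming hostname here are invented for illustration; neither is decided in the notes.

<pre>
# Sketch of the large-file split: requests for big files go to a dedicated
# streaming host (Squid or a specialized proxy) instead of the Varnish
# upload cluster.
LARGE_FILE_THRESHOLD = 32 * 1024 * 1024  # 32 MB, an assumed cutoff

def upload_url(path, size_bytes):
    if size_bytes > LARGE_FILE_THRESHOLD:
        host = "upload-stream.example.org"   # hypothetical streaming proxy
    else:
        host = "upload.wikimedia.org"        # regular upload cache
    return f"//{host}/{path}"

print(upload_url("wikipedia/commons/video.ogv", 900 * 1024 * 1024))
</pre>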
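
For the second sketch, per-colo dsh groups could be generated by splitting a host list on the naming convention (eqiad hosts carry four-digit 1xxx numbers elsewhere on this page); the output paths and group names are assumptions.

<pre>
# Split a flat host list into per-datacenter dsh group files, using the
# convention that eqiad hosts have four-digit numbers starting with 1
# (mw1001, mc1010) while pmtpa hosts do not (srv190, mw60).

def write_dsh_groups(hosts, outdir="."):
    groups = {"apaches-eqiad": [], "apaches-pmtpa": []}
    for host in hosts:
        digits = "".join(c for c in host if c.isdigit())
        dc = "eqiad" if len(digits) == 4 and digits.startswith("1") else "pmtpa"
        groups[f"apaches-{dc}"].append(host)
    for name, members in groups.items():
        with open(f"{outdir}/{name}", "w") as f:
            f.write("\n".join(sorted(members)) + "\n")

write_dsh_groups(["mw1001", "mw1017", "srv190", "mw60"])
</pre>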

Actually Failing Over

  • deploy db.php with all shards set to read-only in both pmtpa and eqiad
  • deploy squid and mobile + bits varnish configs pointing to eqiad apaches
  • master-swap every core DB and writable ES shard to eqiad
  • deploy db.php in eqiad removing the read-only flag; leave it read-only in pmtpa
    • The above master swap + db.php deploys can be done shard by shard to limit the time certain projects are read-only (sketched below)
  • DNS changes - the current steady state points wikipedia-lb.wikimedia.org in the US to eqiad, but future scenarios may include external DNS switches
  • Swift replication reversal - from Eqiad to Tampa
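
A condensed sketch of the shard-by-shard sequence above; the step bodies are placeholders for the real db.php deploys and master-swap tooling, and the shard list is assumed.

<pre>
# Shard-by-shard failover: stop writes everywhere, swap the master to
# eqiad, then re-enable writes in eqiad only. Each function is a stub
# standing in for the real deploy/master-swap tooling.

SHARDS = ["s1", "s2", "s3", "s4", "s5", "s6", "s7"]  # assumed core shards

def deploy_db_config(shard, read_only_in):
    """Deploy db.php with this shard read-only in the listed datacenters."""
    print(f"db.php: {shard} read-only in {', '.join(read_only_in)}")

def master_swap_to_eqiad(shard):
    """Promote the eqiad replica to be the shard's new master."""
    print(f"master swap: {shard} -> eqiad")

for shard in SHARDS:
    deploy_db_config(shard, ["pmtpa", "eqiad"])  # writes stop for this shard
    master_swap_to_eqiad(shard)
    deploy_db_config(shard, ["pmtpa"])           # eqiad becomes writable
# The squid/varnish repointing, DNS change, and Swift replication reversal
# happen once for the whole datacenter, outside this per-shard loop.
</pre>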

Improving Failover

  • pre-generate squid + varnish configs for different primary datacenter roles (see the sketch below)
  • implement MHA to better automate the MySQL master failovers
  • migrate session storage to redis, with redundant replicas across colos
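
For the first item, pre-generation could be as simple as rendering each proxy config once per candidate primary datacenter, so a failover becomes a file swap rather than an edit. The template and service name below are illustrative, not the real config.

<pre>
# Render a (simplified) varnish backend definition for each datacenter
# that could act as primary, ahead of any failover.
from string import Template

BACKEND = Template('backend apaches { .host = "appservers.svc.$dc.wmnet"; }\n')

for primary in ("pmtpa", "eqiad"):
    with open(f"backend.{primary}.vcl", "w") as f:
        f.write(BACKEND.substitute(dc=primary))
</pre>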

Parking Lot Issues

  • Identify and plan around the deployment/migration date - tentatively Oct 15, 2012. Need to communicate the date.
    • The migration needs to happen before the fundraising season starts in November.
    • Vacation 'freeze'; all hands on deck the week before and after deployment
      • Why? Not every person is vital to the migration.
    • Migrate ns1 from Tampa to Ashburn; not a critical item.