Eqiad Migration Planning

From Wikitech
* We now have an incomplete [https://rt.wikimedia.org/Ticket/Display.html?id=3403 tracking ticket] in RT that depends on more specific tickets.

* Sept 12 Update - http://etherpad.wikimedia.org/TechOps-12Sept2012

=== High Risk & Mitigation ===
* What could make failing back to Tampa difficult, should the migration fail?
** What if Ceph fails?
  
 
== Needed Server Builds ==

* App, Imagescalers and API Apaches
** Image scalers: Ready to deploy @ Eqiad
** Apache/API: Ready to deploy @ Eqiad (mw1017-mw1019 puppetized for deploy testing)
* JobRunners
** Ready to test deploy @ Eqiad
*** dependent on the Deployment system - ready for test in EQIAD
*** deploy API, Apaches, Imagescalers (PY)
*** need to change apache and mw config (AI - tbd/RobLa)
*** Need to identify / document test requirements and success criteria (what are the use cases?) - CM/PY
*** Chris M will work with Ops (PY lead) on setting up the tests
**** Overview of existing UI tests: https://github.com/wikimedia/qa-browsertests/tree/master/features
  
 
* Swift
** servers online; needs cluster replication enabled - netapp replication enabled
** Still need to migrate Math, Captcha, Misc objects from ms7 to Swift
** <s>H/w issues need to be resolved - H/w being installed and final batch to ship on 5th Dec</s>
** Might have to run Swift and ImageScalers in Tampa while the rest of the stack runs in Eqiad
** Aaron to test performance lag
** Ceph update
*** overcame several issues / steep learning curve; cluster is more stable
*** still an option to use Ceph
*** MW multiwrite for thumbs - Aaron/Mark to discuss details (already happening with NAS)
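The performance-lag testing mentioned above could start with a simple latency-comparison harness. This is a hypothetical sketch only: the two stub fetchers stand in for real requests against each colo's Swift/image-scaler endpoints.

```python
import time

def mean_latency(fetch, samples=5):
    """Time repeated calls to fetch() and return the mean latency in seconds."""
    timings = []
    for _ in range(samples):
        start = time.perf_counter()
        fetch()
        timings.append(time.perf_counter() - start)
    return sum(timings) / len(timings)

# Stand-ins for fetching a thumbnail from each datacenter (hypothetical;
# a real test would request actual objects from Swift in pmtpa and eqiad).
def fetch_pmtpa():
    time.sleep(0.001)

def fetch_eqiad():
    time.sleep(0.003)

lag = mean_latency(fetch_eqiad) - mean_latency(fetch_pmtpa)
```

A real test would also compare percentiles, not just means, since cross-colo tails are what image scaling would feel.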
  
 
* Memcached servers
** mc01 - mc16 (Tampa) in production - done
** mc1001-mc1016 OS installed, ready for puppet to be run.
** <s>Networked but issues with the Intel 10G NICs and Dell's SFP+ - workaround available; getting new SFP+ - resolved</s>
** Decided to use Redis: use the MW multi-write feature to write to both the existing MC and the new MC servers, then enable Redis replication from Tampa to Eqiad
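The multi-write idea - write to both cache tiers, read from the current primary, so the new tier is warm before cutover - can be sketched as a tiny wrapper. This is an illustration only, not MediaWiki's actual implementation; plain dicts stand in for memcached/Redis clients.

```python
class MultiWriteCache:
    """Fan writes out to every tier; serve reads from the primary only."""

    def __init__(self, primary, secondaries):
        self.primary = primary
        self.secondaries = secondaries

    def set(self, key, value):
        self.primary[key] = value
        for tier in self.secondaries:
            tier[key] = value  # best-effort in a real client

    def delete(self, key):
        for tier in [self.primary, *self.secondaries]:
            tier.pop(key, None)

    def get(self, key):
        # Reads never touch the secondaries, so a cold new tier
        # cannot serve stale misses during the warm-up period.
        return self.primary.get(key)


pmtpa_mc, eqiad_mc = {}, {}  # stand-ins for the old and new cache clusters
cache = MultiWriteCache(pmtpa_mc, [eqiad_mc])
cache.set("user:123:session", "abc")
```

Cutover then amounts to swapping which tier is primary; replication covers keys written before multi-write was enabled.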
  
 
* Databases - done
** <strike>one more slave is needed per shard</strike>
** Grants needed (SQL)
 
* Poolcounter
** Done: helium and potassium are installed and puppetized

* Netapp
** /home/wikipedia for deployments (probably not using it; use git-deploy)
** /home - completed in Tampa, not strictly necessary in eqiad

* Deployment server (fenari's deployment support infrastructure part, misc::deployment etc)
** awaiting new misc server racking next week - done; server name is Tin

* Hume equivalent (misc::maintenance) - postponed

* Application logging server - for mediawiki wmerrors + apache syslog
** <s>eqiad version of the udp2log instance on nfs1 that writes to /home/w/logs</s>
** Done: server 'fluorine' for apache logs

* Upload Varnish - done

* Deployment Host (a 10-minute job according to Tim, and he volunteered to own it)
** <s>Server OS install</s>
** Deploy from deployment host to all application servers
** rsync the deployment code from the primary deployment server to the secondary
** Require a clean git repo
** Application servers in the other datacenter will use the secondary deployment system for rsync
 
== Software / Config Requirements ==

* Varnish software to handle media streaming efficiently
** awaiting patch from Varnish Software (target Sept?) - done
** <s>patch MediaWiki to use a different upload hostname for large files. Then we could use Squid or some specialized media streaming proxy for large files.</s> -- n/a here
* MediaWiki deploy support for per colo config variances ([https://bugzilla.wikimedia.org/show_bug.cgi?id=39082 Bugzilla 39082])
** generating eqiad and pmtpa dsh groups
** mostly done - rolling out by end of month: https://gerrit.wikimedia.org/r/#/c/32167/ https://gerrit.wikimedia.org/r/#/c/32168/ ..
** new mediawiki conf files for eqiad
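Generating the per-colo dsh group files could look roughly like this. A hypothetical sketch: the hostname-numbering convention and the file-rendering helper are illustrative, not the actual puppet-managed mechanism.

```python
def split_by_colo(hosts):
    """Partition app servers by datacenter, assuming the naming convention
    seen on this page: eqiad hosts are numbered >= 1000 (e.g. mw1017),
    pmtpa hosts below that (an assumption for illustration)."""
    groups = {"eqiad": [], "pmtpa": []}
    for host in hosts:
        num = int("".join(ch for ch in host if ch.isdigit()))
        groups["eqiad" if num >= 1000 else "pmtpa"].append(host)
    return groups

def dsh_group_file(members):
    """Render a dsh group file: one hostname per line."""
    return "\n".join(members) + "\n"

groups = split_by_colo(["mw17", "mw1017", "mw1018", "mw25"])
```

Deploy tooling would then write each rendered group to something like /etc/dsh/group/ (path illustrative) so dsh can target one colo at a time.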

* replicating the git checkouts, etc. to new /home
** not an issue
  
 
== Actually Failing Over ==

* deploy db.php with all shards set to read-only in both pmtpa and eqiad
* deploy squid and mobile + bits varnish configs pointing to eqiad apaches
* master swap every core db and writable es shard to eqiad
* deploy db.php in eqiad removing the read-only flag, leave it read-only in pmtpa
** the above master-swap + db.php deploys can be done shard by shard to limit the time certain projects are read-only
* dns changes - our current steady state is to point wikipedia-lb.wikimedia.org in the US to eqiad, but future scenarios may include external dns switches.
* Swift replication reversal - from Eqiad to Tampa
* Rollback plan - details still need to be added
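The database cutover is planned shard by shard to limit how long any one set of projects is read-only. That sequencing can be summarized in a purely illustrative sketch; the step names are invented, and each real step is a manual db.php/config deploy.

```python
# Each shard passes through the same three states in order; serializing
# per shard keeps the read-only window short for any one set of projects.
CUTOVER_STEPS = ("read-only", "master-swapped-to-eqiad", "writable-in-eqiad")

def cut_over(shards):
    log = []
    for shard in shards:          # one shard at a time
        for step in CUTOVER_STEPS:
            log.append((shard, step))
    return log

log = cut_over(["s1", "s2", "s3"])
```

The point of the loop order is that "s2" stays fully writable (in pmtpa) while "s1" is mid-swap, rather than freezing every shard at once.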
  
 
== Improving Failover ==

* pre-generate squid + varnish configs for different primary datacenter roles
* implement MHA to better automate the mysql master failovers
* migrate session storage to redis, with redundant replicas across colos

== Parking Lot Issues ==

* Identify and plan around the deployment/migration date - tentatively Oct 15, 2012 [see below]. Need to communicate date.
** Migration needs to happen before Fundraising season starts in Nov.
** Vacation 'freeze'; all hands on deck the week before and after deployment
*** Why? Not every person is vital to the migration. --seconded; if you're not vital to the migration, this seems like overkill - who are you, please?
** migrate ns1 from tampa to ashburn, but not a critical item.

* An update from CT Woo from October 2012 regarding the status of the migration is available [http://lists.wikimedia.org/pipermail/wikitech-l/2012-October/063668.html here]. It looks like it'll be pushed back to January or February 2013 (post-annual fundraiser).
 
* Weekly Countdown meeting http://etherpad.wmflabs.org/pad/p/EqiadMigration - meeting minutes

[[Category:Eqiad cluster|*]]
 

Revision as of 22:31, 18 December 2012
