Eqiad Migration Planning

== Coordination ==

* Weekly Countdown meeting http://etherpad.wmflabs.org/pad/p/EqiadMigration - meeting minutes
  
== Outstanding Server/System Readiness ==

* App, Imagescalers, Bits, Jobrunners and API Apaches
** Image scalers: Ready to deploy @ Eqiad - PY
** Apache/API: Ready to deploy @ Eqiad (mw1017-mw1019 puppetized for deploy testing) - PY
** Ready to test deploy @ Eqiad - PY
** update - all up except jobrunners (PY)
 
* Deployment system
** using Git-deploy & ready for testing @ EQIAD - RyanLane, PY and ChrisM
*** need to change apache and mw cfg
**** See [https://bugzilla.wikimedia.org/show_bug.cgi?id=43338 bug 43338]
**** for systems that are already installed, it works
**** need to get
*** Need to identify / document test requirements and success criteria (what are the use cases?) - CM/PY
*** Chris M will work with Ops (PY as lead) on setting up the tests
**** Overview of existing UI tests: https://github.com/wikimedia/qa-browsertests/tree/master/features
** update
*** to test git deploy @ Tampa
*** MW changes:
**** move localization
** Apache cfg changes
* Make the repository layout match the on-disk layout
* Hard coded paths in Apache
 
* Swift in Tampa & Ceph in EQIAD
** Current plan is to have Ceph running at Eqiad (final decision - end of Dec by Mark/Faidon)
** Swift @ Tampa is in production already
** servers online; needs cluster replication enabled - netapp replication enabled
** Still need to migrate Math, Captcha, Misc objects from ms7 to Swift - Aaron
** Might have to run Swift and ImageScalers in Tampa while the rest of the stack is running in Eqiad
** Aaron to test performance lag
** Ceph update
*** overcame several issues / steep learning curve; cluster more stable
*** currently performing stability & stress tests
*** Servers are being provisioned - Faidon
*** MW multiwrite for thumbs - Aaron/Mark to discuss details (already happening with NAS)
*** multiwrite to just swift and ceph (see the sketch after this list)
**** AI - Aaron
** Ceph update (latest)
*** eqiad cluster - being built (7 provisioned; testing on 2); 5 more to go
*** memory leak - main concern at this moment
*** adapt rewrite.py to VCL
*** performance issue - not a showstopper for this deployment
*** waiting for 0.56, about to be released any day now
*** optimization work
*** copied 8TB over
*** Plan B: use Swift in Tampa
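
A minimal sketch of the multi-write pattern described above (hypothetical names, not MediaWiki's actual FileBackendMultiWrite API): every write goes to both stores and reads are served by the primary, so either store can take over without a resync.

<source lang="python">
# Hypothetical multi-write wrapper: Ceph as primary, Swift as the synced
# fallback (Plan B). Backends are any objects exposing put()/get().
class MultiWriteBackend:
    def __init__(self, primary, secondaries):
        self.primary = primary          # e.g. the Ceph-backed store
        self.secondaries = secondaries  # e.g. [swift_store]

    def put(self, path, data):
        # Write everywhere so the stores never drift apart.
        self.primary.put(path, data)
        for backend in self.secondaries:
            backend.put(path, data)

    def get(self, path):
        # Reads always hit the primary; secondaries exist only for failback.
        return self.primary.get(path)
</source>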
  
 
* Memcached servers
** mc01 - mc16 (Tampa) in production - done
** mc1001-mc1016 OS installed, ready for puppet to be run.
** Decided to use Redis and the MW multi-write feature to write to both the existing MC and the new MC servers, then enable Redis replication from Tampa to Eqiad (see the sketch below)
** update
*** servers will be ready by next week, hopefully
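
A quick way to sanity-check the Tampa-to-Eqiad Redis replication once it is enabled - a sketch with assumed hostnames and the default port, using the redis-py client:

<source lang="python">
import time
import redis

tampa = redis.StrictRedis(host="mc1", port=6379)     # assumed Tampa host
eqiad = redis.StrictRedis(host="mc1001", port=6379)  # its eqiad replica

tampa.set("migration:canary", "hello")
time.sleep(1)  # give replication a moment to catch up
assert eqiad.get("migration:canary") == b"hello"
print("pmtpa -> eqiad replication looks healthy")
</source>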
  
 
* Parser Cache servers
** servers are provisioned; awaiting parser cache sharding - Asher/Tim
** update
*** sharding parser cache servers - (AI - Tim) - see [https://bugzilla.wikimedia.org/show_bug.cgi?id=42463 bug 42463] (a sharding sketch follows)
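
For reference, sharding the parser cache generally means hashing each cache key to one of the new servers so entries and load spread evenly - a sketch with assumed hostnames and an illustrative key format:

<source lang="python">
import hashlib

PC_SHARDS = ["pc1001", "pc1002", "pc1003"]  # hostnames are assumptions

def shard_for(key: str) -> str:
    # The same key always hashes to the same shard.
    digest = hashlib.sha1(key.encode()).hexdigest()
    return PC_SHARDS[int(digest, 16) % len(PC_SHARDS)]

print(shard_for("enwiki:pcache:idhash:12345-0!canonical"))  # example key
</source>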
  
 
* Databases
** servers and replication - ready for switchover
** Grants needed (SQL)
 
* Poolcounter
** Done: helium and potassium are installed and puppetized
** how do we test it?
*** test by telnetting to the port and requesting stats (see the sketch below)
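
A minimal version of that check - the port number and the exact stats command are assumptions about the poolcounter daemon, so adjust to match the deployment:

<source lang="python">
import socket

def poolcounter_stats(host: str, port: int = 7531) -> str:
    # Open a TCP connection, ask for stats, return the raw reply.
    with socket.create_connection((host, port), timeout=5) as sock:
        sock.sendall(b"STATS FULL\n")
        return sock.recv(4096).decode()

print(poolcounter_stats("helium"))
</source>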
 
* Netapp
** /home/wikipedia for deployments (probably not using it; use git-deploy)
** /home - completed in Tampa, not strictly necessary in eqiad

* Deployment server (fenari's deployment support infrastructure part, misc::deployment etc)
** done. server name is Tin
** This might not be needed if we are using git-deploy

* Hume equivalent (misc::maintenance) - postponed

* Application logging server - for mediawiki wmerrors + apache syslog
** eqiad version of the udp2log instance on nfs1 that writes to /home/w/logs (see the sketch below)
** Done: server 'flourine' for apache logs
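
Conceptually the udp2log instance is just a UDP listener that appends incoming datagrams to log files - a toy sketch with a placeholder port and path, not the production config:

<source lang="python">
import socket

sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
sock.bind(("0.0.0.0", 8420))  # assumed port

with open("/tmp/apache.log", "ab") as log:
    while True:
        datagram, _addr = sock.recvfrom(65535)  # one log line per packet
        log.write(datagram)
</source>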
 
* Setup and Deploy parsoid servers @ Eqiad

* Upload Varnish - done

== Software / Config Requirements ==

* replicating the git checkouts, etc. to new /home
** not an issue
  
 
== Actually Failing Over ==

Sequence (AI - Asher):

* deploy db.php with all shards set to read-only in both pmtpa and eqiad
* deploy squid and mobile + bits varnish configs pointing to eqiad apaches
* master swap every core db and writable es shard to eqiad
* deploy db.php in eqiad removing the read-only flag, leave it read-only in pmtpa
** the above master-swap + db.php deploys can be done shard by shard to limit the time certain projects are read-only (see the sketch after this list)
* dns changes - our current steady state is to point wikipedia-lb.wikimedia.org in the US to eqiad, but future scenarios may include external dns switches.
* Swift replication reversal - from Eqiad to Tampa
* Rollback plan - details still needed
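
A sketch of the shard-by-shard variant, with stand-in functions for the real db.php deploys and DBA master swaps, so only one shard is read-only at any moment:

<source lang="python">
SHARDS = ["s1", "s2", "s3", "s4", "s5", "s6", "s7"]  # core db shards

def set_read_only(shard, dc, flag):
    print(f"deploy db.php: {shard}@{dc} read_only={flag}")  # stand-in

def swap_master(shard, to):
    print(f"master swap: {shard} -> {to}")                  # stand-in

for shard in SHARDS:
    set_read_only(shard, "pmtpa", True)   # shard read-only in pmtpa
    set_read_only(shard, "eqiad", True)   # ...and in eqiad
    swap_master(shard, to="eqiad")        # promote the eqiad replica
    set_read_only(shard, "eqiad", False)  # re-enable writes in eqiad only
</source>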

== Risk & Mitigation ==

Identify the high-risk migration tasks and ensure we have a way to mitigate or revert without extended downtime.

* What could make falling back to Tampa a big problem should the migration fail?
** should Ceph fail?
** should Swift@Tampa fail?
** Database integrity
  
 
== Improving Failover ==

* pre-generate squid + varnish configs for different primary datacenter roles (see the sketch below)
* implement MHA to better automate the mysql master failovers
* migrate session storage to redis, with redundant replicas across colos
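
A sketch of what pre-generation could look like - one template rendered once per primary-DC role, so failing over means swapping files rather than editing configs by hand (the backend service name is a placeholder):

<source lang="python">
# Double braces render as literal VCL braces; {dc} is substituted.
TEMPLATE = 'backend apaches {{ .host = "appservers.{dc}.example.net"; }}'

for dc in ("pmtpa", "eqiad"):
    with open(f"varnish-backends.{dc}.vcl", "w") as f:
        f.write(TEMPLATE.format(dc=dc) + "\n")
</source>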
 
== Parking Lot Issues ==

* Identify and plan around the deployment/migration date - tentatively Oct 15, 2012 [see below]. Need to communicate the date.
** Migration needs to happen before the fundraising season starts in Nov.
** Vacation 'freeze'; all hands on deck the week before and after deployment
*** Why? Not every person is vital to the migration. --second. if you're not vital to migration, this seems like overkill - who are u pls?
** migrate ns1 from tampa to ashburn, but not a critical item.
* An update from CT Woo from October 2012 regarding the status of the migration is available [http://lists.wikimedia.org/pipermail/wikitech-l/2012-October/063668.html here]. It looks like it'll be pushed back to January or February 2013 (post-annual fundraiser).

AI - create/document checklist - PY/ChrisM

AI - automated test scripts - ChrisM

== Use Cases - Tests ==

* Developer
** check code in/out
** code review
** code push/deploy
** revert deployment

* User (see the smoke-test sketch at the end of this section)
** registers
** searches for an article
** reads an article
** comments on an article
** edits an article
** creates an article
** localization

* Community member
** tags an article
** (exercises special pages features)

* Ops
** monitoring works - ganglia, nagios, torrus, ...
** check amanda backups
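
The user-facing flows above also lend themselves to simple HTTP smoke tests alongside the browser suite - a sketch against an assumed test host:

<source lang="python">
import urllib.request

BASE = "http://test.wikipedia.org/wiki"  # placeholder target

def check(page: str) -> None:
    with urllib.request.urlopen(f"{BASE}/{page}", timeout=10) as resp:
        assert resp.status == 200, f"{page}: HTTP {resp.status}"
        print(f"ok: {page}")

check("Main_Page")                   # reads an article
check("Special:Search?search=test")  # searches for an article
</source>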

[[Category:Eqiad cluster|*]]