Eqiad Migration Planning

 
== Outstanding Server/System Readiness ==  
 
* App, Imagescalers, Bits, Jobrunners and API Apaches
 
** Image scalers: ready to deploy @ Eqiad - PY
** All ready; awaiting code deploy
** Apache/API: ready to deploy @ Eqiad (mw1017-mw1019 puppetized for deploy testing) - PY
** Ready to test deploy @ Eqiad - PY
** Update: all up except jobrunners (PY)
* Parsoid servers @ Eqiad

* Setup pc1001-pc1003
** https://rt.wikimedia.org/Ticket/Display.html?id=3644 / bugzilla: http://bugzilla.wikimedia.org/42463
*** Pending SqlBagOStuff sharding - https://gerrit.wikimedia.org/r/#/c/41023/
*** Target completion: Friday 4th Jan
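The pending SqlBagOStuff change shards parser-cache keys across the pc1001-pc1003 hosts listed above. A minimal sketch of that kind of deterministic key-to-server mapping (the hashing scheme here is an illustrative assumption, not the actual SqlBagOStuff algorithm; see the Gerrit change for the real one):

```python
import hashlib

# Shard pool, matching the hosts named above.
PARSER_CACHE_SERVERS = ["pc1001", "pc1002", "pc1003"]

def server_for_key(key: str, servers=PARSER_CACHE_SERVERS) -> str:
    """Map a cache key to one shard deterministically.

    Assumption: a stable hash mod server count, so every key always
    lands on the same server as long as the pool is unchanged.
    """
    digest = hashlib.sha1(key.encode("utf-8")).hexdigest()
    return servers[int(digest, 16) % len(servers)]
```

Because the mapping is a pure function of the key, cache lookups keep hitting the right shard across appserver restarts; changing the pool size remaps most keys, which is why the pool is sized up front.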
  
 
* Deployment system
 
** Using git-deploy; ready for testing @ Eqiad - RyanLane, PY and ChrisM
** https://bugzilla.wikimedia.org/show_bug.cgi?id=43338 - Dev tasks related to git-deploy migration; ready for use on 1/16/13
*** Need to change Apache and MW config
*** https://bugzilla.wikimedia.org/43339: Deploy git-deploy to the Beta Cluster - Antoine
**** See [https://bugzilla.wikimedia.org/show_bug.cgi?id=43338 bug 43338]
*** https://bugzilla.wikimedia.org/43614: l10n generation in git-deploy - Brad / Ryan
**** For systems that are already installed, it works
*** https://bugzilla.wikimedia.org/43340 - Design new on-disk layout for MediaWiki install on tin/eqiad Apaches - Sam/Tim
**** need to get
*** https://bugzilla.wikimedia.org/43615: Audit of the salt scripts for completeness (looking at current scripts) - Aarons
*** Need to identify and document test requirements and success criteria (what are the use cases?) - CM/PY
*** Chris M will work with Ops (PY as lead) on setting up the tests
**** Overview of existing UI tests: https://github.com/wikimedia/qa-browsertests/tree/master/features
** Update
*** To test git-deploy @ Tampa
*** MW changes:
**** Move localization
** Apache config changes
 
* Make the repository layout match the on-disk layout
* Hard-coded paths in Apache
* Swift in Tampa & Ceph in Eqiad
** Current plan is to have Ceph running at Eqiad (final decision by end of December - Mark/Faidon)
** Swift @ Tampa is in production already
** Servers online; cluster replication still needs to be enabled - NetApp replication enabled
** Still need to migrate Math, Captcha, and misc objects from ms7 to Swift - Aaron
** Might have to run Swift and the image scalers in Tampa while the rest of the stack runs in Eqiad
** Aaron to test performance lag
** Ceph update
*** Overcame several issues / a steep learning curve; cluster is more stable
*** Currently performing stability & stress tests
*** Servers are being provisioned - Faidon
*** MW multiwrite for thumbs - Aaron/Mark to discuss details (already happening with NAS)
*** Multiwrite to just Swift and Ceph
**** AI - Aaron
** Ceph update
*** Eqiad cluster being built (7 provisioned; testing on 2); 5 more to go
*** Memory leak is the main concern at the moment
*** Adapt rewrite.py to VCL
*** Performance issue - not a showstopper for this deployment
*** Waiting for 0.56, about to be released any day now
*** Optimization work
*** Copied 8 TB over
*** Plan B: use Swift in Tampa
 
* Memcached servers
** mc01-mc16 (Tampa) in production - done
** mc1001-mc1016: OS installed, ready for puppet to be run
** Decided to use Redis with the MW multi-write feature to write to both the existing MC servers and the new ones, then enable Redis replication from Tampa to Eqiad
** Update
*** Servers will be ready by next week, hopefully
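The multi-write idea above (MediaWiki's MultiWriteBagOStuff: write to every cache pool, read from the primary) behaves roughly like this toy sketch; the class and backends here are illustrative, not MediaWiki's actual API:

```python
class MultiWriteCache:
    """Toy model of a multi-write cache: reads hit the first (primary)
    backend; writes and deletes go to every backend, so a new pool can
    be warmed up while the old one still serves all traffic."""

    def __init__(self, *backends):
        self.backends = list(backends)  # e.g. [old_memcached, new_redis]

    def get(self, key):
        return self.backends[0].get(key)

    def set(self, key, value):
        for backend in self.backends:
            backend[key] = value

    def delete(self, key):
        for backend in self.backends:
            backend.pop(key, None)
```

Backends are plain dicts here for illustration. Once the new Redis pool is warm, reads can be switched over; replication from Tampa to Eqiad then keeps the remote copy current.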
  
* Setup Ceph in Eqiad for image storage (Swift in Tampa & Ceph in Eqiad)
** 2 more servers set up (up to 4 now); intra-cluster replication ETA is Saturday early morning PST
** Holding off on adding more so as not to disrupt swift->ceph replication speed
** swift->ceph copy: 17.5 TB out of 43 TB copied; completion in 12 days (very rough estimate)
** Some stability issues - close cooperation with Ceph developers, being fixed in real time
** H310 PERC issue - worked around with RAID 0
** 0.56 has been released and deployed to the Eqiad cluster
** Various other hiccups, both hardware- and software-related
** Still pending: puppetization, rewrite.py -> VCL, testing with MediaWiki
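As a sanity check on the rough estimate above: with 17.5 TB of 43 TB copied and roughly 12 days projected for the remainder, the implied sustained copy rate works out as follows (back-of-the-envelope only; the real rate varies):

```python
copied_tb = 17.5
total_tb = 43.0
days_remaining_estimate = 12.0

remaining_tb = total_tb - copied_tb                      # 25.5 TB left
rate_tb_per_day = remaining_tb / days_remaining_estimate  # ~2.1 TB/day
rate_mb_per_s = rate_tb_per_day * 1e6 / 86400             # ~25 MB/s sustained

print(f"{remaining_tb:.1f} TB remaining")
print(f"~{rate_tb_per_day:.2f} TB/day, ~{rate_mb_per_s:.0f} MB/s sustained")
```

About 25 MB/s sustained is the implied throughput, which is why adding more Ceph servers mid-copy (see the bullet above) was deferred: anything competing for that bandwidth pushes the 12-day estimate out.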
 
* Parser Cache servers
 
** Servers are provisioned; awaiting parser cache sharding - Asher/Tim
  
 
* Poolcounter
 
** Done: helium and potassium are installed and puppetized
** How do we test it?
*** Test by telnetting to the port and requesting stats
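The telnet check above can be scripted. A minimal sketch, assuming the PoolCounter daemon speaks its line-based text protocol on port 7531 and answers a `STATS FULL` request (the host, port, and exact command should be verified against the deployed poolcounterd):

```python
import socket

def poolcounter_stats(host: str, port: int = 7531, timeout: float = 5.0) -> str:
    """Connect to a PoolCounter daemon and request its stats report.

    Assumes the daemon accepts a newline-terminated "STATS FULL" command
    and replies with a line-oriented "name: value" report.
    """
    with socket.create_connection((host, port), timeout=timeout) as sock:
        sock.sendall(b"STATS FULL\n")
        return sock.recv(65536).decode("utf-8", errors="replace")

def parse_stats(report: str) -> dict:
    """Turn "name: value" report lines into a dict for monitoring checks."""
    stats = {}
    for line in report.splitlines():
        if ":" in line:
            name, _, value = line.partition(":")
            stats[name.strip()] = value.strip()
    return stats
```

Usage would be something like `parse_stats(poolcounter_stats("helium"))`, with helium being one of the hosts listed above; a Nagios check could then assert on specific counters.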
 
* Netapp
** /home/wikipedia for deployments (probably not using it; using git-deploy instead)
** /home - completed in Tampa; not strictly necessary in Eqiad

* Deployment server (fenari's deployment support infrastructure, misc::deployment etc.)
** Done. Server name is tin
** This might not be needed if we are using git-deploy
  
 
* Hume equivalent (misc::maintenance) - postponed
 
 
* Application logging server - for mediawiki wmerrors + apache syslog
 
** <s>eqiad version of the udp2log instance on nfs1 that writes to /home/w/logs</s>
 
** Done: server 'fluorine' for Apache logs
 
 
* Setup and deploy Parsoid servers @ Eqiad
 
 
 
 
* Upload Varnish - done
 
  
 

Revision as of 17:21, 7 January 2013

* Parser Cache servers - update
** Sharding the parser cache servers (AI - Tim)
** https://bugzilla.wikimedia.org/show_bug.cgi?id=42463

* Databases
** Servers and replication - ready for switchover
** Grants needed (SQL)

== Software / Config Requirements ==


* Replicating the git checkouts, etc. to the new /home
** Not an issue

== Actually Failing Over ==

* Sequence (AI - Asher)
*# Deploy db.php with all shards set to read-only in both pmtpa and eqiad
*# Deploy squid and mobile + bits varnish configs pointing to the eqiad Apaches
*# Master-swap every core DB and writable ES shard to eqiad
*# Deploy db.php in eqiad removing the read-only flag; leave it read-only in pmtpa
*#* The above master-swap + db.php deploys can be done shard by shard to limit the time certain projects are read-only
* No DNS or Ceph/Swift changes required
* Rollback plan - details still need to be added
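A toy model of the shard-by-shard portion of the sequence above, useful for reasoning about the ordering guarantees (the data structures and function are illustrative; the real work happens in db.php deploys and MySQL master swaps):

```python
# Hypothetical model: each shard tracks its master DC and per-DC read-only flag.
shards = {
    name: {"master": "pmtpa", "read_only": {"pmtpa": False, "eqiad": False}}
    for name in ["s1", "s2", "s3"]
}

def fail_over_shard(shard: dict) -> None:
    # 1. Set the shard read-only in BOTH datacenters before touching the
    #    master, so no writes can land mid-swap.
    shard["read_only"]["pmtpa"] = True
    shard["read_only"]["eqiad"] = True
    # 2. Swap the MySQL master to eqiad.
    shard["master"] = "eqiad"
    # 3. Re-enable writes only in eqiad; pmtpa stays read-only.
    shard["read_only"]["eqiad"] = False

# Going shard by shard limits how long each project is read-only.
for shard in shards.values():
    fail_over_shard(shard)
```

The key invariant is that a shard is never writable in both datacenters at once, which is what makes the per-shard rollout safe.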

== Risk & Mitigation ==

Identify the high risk migration tasks and ensure we have a way to mitigate or revert without extended downtime.

* What could make falling back to Tampa a big problem should the migration fail?
** Should Ceph fail?
** Should Swift @ Tampa fail?
** Database integrity

== Improving Failover ==

* Pre-generate squid + varnish configs for the different primary-datacenter roles
* Implement MHA to better automate the MySQL master failovers
* Migrate session storage to Redis, with redundant replicas across colos
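The first item above could look something like this: render one config per datacenter role from a single template instead of hand-editing during a failover. The template text and role names here are invented placeholders, not actual squid/varnish configuration:

```python
# Minimal sketch of pre-generating per-datacenter configs from one template.
BACKEND_TEMPLATE = 'backend appservers {{ .host = "appservers.svc.{dc}.wmnet"; }}\n'

ROLES = {
    "primary-pmtpa": {"dc": "pmtpa"},
    "primary-eqiad": {"dc": "eqiad"},
}

def render_configs(template: str, roles: dict) -> dict:
    """Return {role_name: rendered_config}. Generating every variant ahead
    of time makes a failover a file swap, not an edit under pressure."""
    return {name: template.format(**params) for name, params in roles.items()}
```

The pre-rendered files can be reviewed and syntax-checked long before they are needed, which is the point of the exercise.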

== Parking Lot Issues ==

* Identify and plan around the deployment/migration date - tentatively Oct 15, 2012 [see below]. Need to communicate the date.
** Migration needs to happen before the fundraising season starts in November.
** Vacation 'freeze'; all hands on deck the week before and after deployment
*** Why? Not every person is vital to the migration. --seconded; if you're not vital to the migration, this seems like overkill - who are you, please?
** Migrate ns1 from Tampa to Ashburn, but not a critical item.
* An update from CT Woo from October 2012 regarding the status of the migration is available here. It looks like it'll be pushed back to January or February 2013 (post-annual fundraiser).


AI - create/document checklist - PY/ChrisM

AI - automated test scripts - ChrisM


== Use Cases - Tests ==

* Developer
** Check code in/out
** Code review
** Code push/deploy
** Revert a deployment
* User
** Registers
** Searches for an article
** Reads an article
** Comments on an article
** Edits an article
** Creates an article
** Localization
* Community member
** Tags an article
** (exercise special pages features)

* Ops
** Monitoring works - Ganglia, Nagios, Torrus, ...
** Check Amanda backups
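The use cases above could feed the "automated test scripts" action item as a simple table of named checks. A sketch only; the check functions are stubs standing in for real browser or API tests like the qa-browsertests features linked earlier:

```python
# Map each use case to a callable returning True on success.
# The lambdas are placeholders for real browser/API checks.
USE_CASES = {
    "user: reads an article":     lambda: True,
    "user: searches for article": lambda: True,
    "developer: deploys code":    lambda: True,
}

def run_use_cases(cases: dict) -> dict:
    """Run every check and collect pass/fail results for a report."""
    return {name: bool(check()) for name, check in cases.items()}
```

Keeping the cases in one table means the checklist in the AI above and the automated run can't drift apart.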