Eqiad Migration Planning
From Wikitech
Revision as of 17:39, 28 December 2012
Coordination
- We now have an incomplete tracking ticket in RT that depends on more specific tickets.
- Platform Engineering will be using Bug 39106 for tracking dev tasks
- Sept 12 Update - http://etherpad.wikimedia.org/TechOps-12Sept2012
- Weekly Countdown meeting http://etherpad.wmflabs.org/pad/p/EqiadMigration - meeting minutes
Outstanding Server/System Readiness
- App, Imagescalers, Bits, Jobrunners and API Apaches
- Image scalers: Ready to deploy @ Eqiad - PY
- Apache/API: Ready to deploy @ Eqiad (mw1017-mw1019 puppetized for deploy testing) - PY
- Ready to test deploy @ Eqiad - PY
- update - all up except jobrunners (PY)
- Deployment system
- using Git-deploy & ready for testing @ EQIAD - RyanLane, PY and ChrisM
- need to change apache and mw cfg
- See bug 43338
- for systems that are already installed, it works
- need to get
- Need to identify / doc test requirements and success criteria (what are the use cases?) - CM/PY
- Chris M will work with Ops (PY lead guy) on setting up the tests
- Overview of existing UI tests: https://github.com/wikimedia/qa-browsertests/tree/master/features
- update
- to test git deploy @ Tampa
- MW changes:
- move localization
- Apache cfg changes
- Make the repository layout match the on-disk layout
- Hard coded paths in Apache
- Swift in Tampa & Ceph in EQIAD
- Current plan is to have Ceph running at Eqiad (final decision - end of Dec by Mark/Faidon)
- Swift @ Tampa is in production already
- servers online; cluster replication still needs to be enabled - netapp replication enabled
- Still need to migrate Math, Captcha, Misc objects from ms7 to Swift - Aaron
- Might have to run Swift and ImageScalers in Tampa while the rest of the stack is running in Eqiad
- Aaron to test performance lag
- Ceph update
- overcame several issues/ steep learning curve; cluster more stable
- currently performing stability & stress tests
- Servers are being provisioned - Faidon
- MW multiwrite for thumbs - Aaron/Mark to discuss details (already happening with NAS)
- multiwrite to just swift and ceph
- AI - Aaron
- Ceph update
- eqiad cluster - being built (7 provisioned; testing on 2); 5 more to go
- memory leak - main concern at this moment
- adapt rewrite.py to VCL
- performance issue - not showstopper for this deployment
- waiting for 0.56, about to be released any day now
- optimization work
- copied 8TB over
- Plan B: use Swift in Tampa
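The multi-write idea above (every thumb write goes to both stores, reads come from the primary) can be sketched as follows; the class and the dict-backed stores are illustrative only, not MediaWiki's actual FileBackendMultiWrite implementation:

```python
class MultiWriteBackend:
    """Sketch of a multi-write file backend: every write goes to all
    backends, reads are served from the primary. Illustrative only,
    not MediaWiki's actual FileBackendMultiWrite."""

    def __init__(self, primary, replicas):
        self.primary = primary      # e.g. Swift while it stays authoritative
        self.replicas = replicas    # e.g. [ceph] being warmed up in eqiad

    def put(self, path, data):
        self.primary[path] = data   # a failure here should abort the request
        for backend in self.replicas:
            backend[path] = data    # best-effort copy to the new store

    def get(self, path):
        return self.primary.get(path)


# Plain dicts stand in for the Swift and Ceph object stores.
swift, ceph = {}, {}
store = MultiWriteBackend(swift, [ceph])
store.put("thumb/Example.jpg/120px-Example.jpg", b"jpegdata")
```

With this shape, switching primaries later is a configuration change rather than a data migration, since both stores already hold every object written during the multi-write window.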
- Memcached servers
- mc01 - mc16 (Tampa) in production - done
- mc1001-mc1016 OS installed, ready for puppet to be run.
- Decided to use Redis and use MW multi-write feature to write to both existing MC and the new MC servers, then enable Redis replication from Tampa to Eqiad
- update
- servers will be ready by next week, hopefully
- Parser Cache servers
- servers are provisioned; awaiting parser cache sharding - Asher/Tim
- update
- sharding parser cache servers (AI - Tim) - https://bugzilla.wikimedia.org/show_bug.cgi?id=42463
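Sharding here means mapping each parser cache key to one of several servers. A minimal hash-based sketch, where the hostnames are hypothetical and MediaWiki's actual sharding scheme may differ:

```python
import hashlib

def shard_for(key, servers):
    """Pick a parser cache server for a key by hashing it (mod N).
    A sketch of the sharding idea; MediaWiki's real scheme may differ."""
    h = int(hashlib.sha1(key.encode()).hexdigest(), 16)
    return servers[h % len(servers)]

servers = ["pc1001", "pc1002", "pc1003"]  # hypothetical hostnames
chosen = shard_for("enwiki:pcache:idhash:12345", servers)
```

Plain mod-N sharding is the simplest option but reshuffles most keys when a server is added or removed; consistent hashing avoids that at the cost of some complexity.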
- Databases
- servers and replication - ready for switchover
- Grants needed (SQL)
- Poolcounter
- Done: helium and potassium are installed and puppetized
- how do we test it?
- test with telnetting to the port and requesting stats
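The telnet check above can be scripted. A sketch, where the STATS FULL command and the default port 7531 are assumptions to verify against the deployed poolcounter daemon:

```python
import socket

def poolcounter_stats(host, port=7531, timeout=5):
    """Ask a poolcounter daemon for its statistics over a raw TCP
    connection (the scripted version of the telnet check above).
    The STATS FULL command and default port 7531 are assumptions;
    verify them against the daemon actually deployed."""
    with socket.create_connection((host, port), timeout=timeout) as s:
        s.sendall(b"STATS FULL\n")
        return s.recv(4096).decode()
```

Running this against helium and potassium and checking for a sane reply would make a simple Nagios-style check.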
- Netapp
- /home/wikipedia for deployments (probably not using it; use git-deploy)
- /home - completed in Tampa, not strictly necessary in eqiad
- Deployment server (fenari's deployment support infrastructure part, misc::deployment etc)
- done. server name is Tin
- This might not be needed if we are using git-deploy
- Hume equivalent (misc::maintenance) - postponed
- Application logging server - for mediawiki wmerrors + apache syslog
- eqiad version of the udp2log instance on nfs1 that writes to /home/w/logs - Done: server 'fluorine' for apache logs
- Setup and Deploy parsoid servers @ Eqiad
- Upload Varnish - done
Software / Config Requirements
- MediaWiki deploy support for per colo config variances (Bugzilla 39082)
- generating eqiad and pmtpa dsh groups
- mostly done - rolling out by end of month https://gerrit.wikimedia.org/r/#/c/32167/ https://gerrit.wikimedia.org/r/#/c/32168/ ..
- new mediawiki conf files for eqiad
- replicating the git checkouts, etc. to new /home
- not an issue
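Generating the eqiad and pmtpa dsh groups can be sketched by splitting on the hostname numbering convention; the convention assumed here (4-digit numbers starting with 1 mean eqiad, shorter numbers mean pmtpa) should be checked against the real inventory:

```python
import re

def dsh_groups(hosts):
    """Split hostnames into per-datacenter dsh groups.
    Assumes the naming convention that 4-digit hosts starting with 1
    (mw1017) are eqiad and 1-3 digit hosts (mw60, srv190) are pmtpa."""
    groups = {"eqiad": [], "pmtpa": []}
    for h in hosts:
        m = re.match(r"[a-z]+(\d+)$", h)
        if m and len(m.group(1)) == 4 and m.group(1).startswith("1"):
            groups["eqiad"].append(h)
        else:
            groups["pmtpa"].append(h)
    return groups

groups = dsh_groups(["mw1017", "mw1018", "mw60", "srv190"])
```

The output maps directly onto dsh group files, one hostname per line per datacenter.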
Actually Failing Over
- Sequence (AI - Asher)
- deploy db.php with all shards set to read-only in both pmtpa and eqiad
- deploy squid and mobile + bits varnish configs pointing to eqiad apaches
- master swap every core db and writable es shard to eqiad
- deploy db.php in eqiad removing the read-only flag, leave it read-only in pmtpa
- the above master-swap + db.php deploys can be done shard by shard to limit the time certain projects are read-only
- dns changes - our current steady state is to point wikipedia-lb.wikimedia.org in the US to eqiad but future scenarios may include external dns switches.
- Swift replication reversal - from Eqiad to Tampa
- Rollback plan - details still need to be added
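The shard-by-shard sequence above can be sketched as plain Python, with placeholder state updates standing in for the real db.php/squid/varnish deploys:

```python
def fail_over_shard(shard, state):
    """Walk one shard through the read-only window described above.
    Each step is a placeholder for a real deploy action."""
    state[shard]["read_only"] = {"pmtpa": True, "eqiad": True}  # db.php: read-only everywhere
    state[shard]["master"] = "eqiad"                            # master swap to eqiad
    state[shard]["read_only"]["eqiad"] = False                  # eqiad writable; pmtpa stays read-only

def fail_over(shards):
    # Going shard by shard limits how long each project is read-only.
    state = {s: {"master": "pmtpa"} for s in shards}
    for s in shards:
        fail_over_shard(s, state)
    return state

state = fail_over(["s1", "s2", "s3"])
```

The point of the per-shard loop is that only the shard currently mid-swap is read-only in both datacenters; the others are either still writable in pmtpa or already writable in eqiad.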
Risk & Mitigation
Identify the high risk migration tasks and ensure we have a way to mitigate or revert without extended downtime.
- What could make falling back to Tampa a big problem should the migration fail?
- What if Ceph fails?
- What if Swift @ Tampa fails?
- Database integrity
Improving Failover
- pre-generate squid + varnish configs for different primary datacenter roles
- implement MHA to better automate the mysql master failovers
- migrate session storage to redis, with redundant replicas across colos
See more
- Records and original tracking doc - http://etherpad.wikimedia.org/EQIAD-rollout-sequence
- Category:Eqiad cluster
Parking Lot Issues
- Identify and plan around the deployment/migration date - tentatively Oct 15, 2012 [see below]. Need to communicate the date.
- Migration needs to happen before Fundraising season starts in Nov.
- Vacation 'freeze'; all hands on deck week before and after deployment
- Why? Not every person is vital to migration. --seconded; if you're not vital to the migration, this seems like overkill - who are you, please?
- Migrate ns1 from Tampa to Ashburn, but not a critical item.
- An update from CT Woo from October 2012 regarding the status of the migration is available here. It looks like it'll be pushed back to January or February 2013 (post-annual fundraiser).
AI - create/document checklist - PY/ChrisM
AI - automated test scripts - ChrisM
Use Cases - Tests
- Developer
- Check-in/out codes
- code review
- Code push/deploy
- revert deployment
- User
- registers
- search article
- read article
- comment on article
- edit article
- create article
- localization
- Community member
- tag article
- (exercise special pages features)
- Ops
- monitoring works - ganglia, nagios, torrus, .....
- check amanda backups
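The user-facing checks above could feed the automated test scripts action item. A sketch of a tiny smoke-test runner; the URLs and the injected fetch function are hypothetical, not an existing suite:

```python
def run_smoke_tests(fetch, base="https://test.wikipedia.org"):
    """Run read-path smoke checks against a wiki. `fetch(url)` must
    return an HTTP status code; paths are illustrative, not a full suite."""
    checks = {
        "read article": base + "/wiki/Main_Page",
        "search": base + "/w/index.php?search=test",
        "api siteinfo": base + "/w/api.php?action=query&meta=siteinfo",
    }
    return {name: fetch(url) == 200 for name, url in checks.items()}

# Stub fetcher so the sketch runs offline; swap in urllib/requests for real use.
results = run_smoke_tests(lambda url: 200)
```

Injecting the fetcher keeps the checklist testable offline and lets the same list run against Tampa before the switch and Eqiad after it.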