Eqiad Migration Planning
From Wikitech
(Difference between revisions)
(→Parking Lot Issues) |
(Copied in notes from last meeting) |
||
| Line 3: | Line 3: | ||
* We now have an incomplete [https://rt.wikimedia.org/Ticket/Display.html?id=3403 tracking ticket] in RT that depends on more specific tickets. | * We now have an incomplete [https://rt.wikimedia.org/Ticket/Display.html?id=3403 tracking ticket] in RT that depends on more specific tickets. | ||
* Sept 12 Update - http://etherpad.wikimedia.org/TechOps-12Sept2012 | * Sept 12 Update - http://etherpad.wikimedia.org/TechOps-12Sept2012 | ||
| + | |||
| + | === High Risk & Mitigation === | ||
| + | * What could make failover back to Tampa difficult should the migration fail? | ||
| + | ** What if Ceph fails? | ||
== Needed Server Builds == | == Needed Server Builds == | ||
| − | * App and API Apaches | + | * App, Imagescalers and API Apaches |
** Image scalers: Ready to deploy @ Eqiad | ** Image scalers: Ready to deploy @ Eqiad | ||
** Apache/API: Ready to deploy @ Eqiad (mw1017-mw1019 puppetized for deploy testing) | ** Apache/API: Ready to deploy @ Eqiad (mw1017-mw1019 puppetized for deploy testing) | ||
* JobRunners | * JobRunners | ||
| − | ** Ready to deploy @ Eqiad | + | ** Ready to test deploy @ Eqiad |
| + | **** dependent on Deployment system - ready for test in EQIAD; | ||
| + | **** deploy API, Apaches, Imagescalers ( PY ) | ||
| + | **** need to change apache and mw cfg - (AI - tbd/RobLa) | ||
| + | **** Need to identify / doc test requirements and success criteria (what are the use cases?) - CM/PY | ||
| + | **** Chris M will work with Ops (PY lead guy) on setting up the tests | ||
| + | ***** Overview of existing UI tests: https://github.com/wikimedia/qa-browsertests/tree/master/features | ||
* Swift | * Swift | ||
| − | ** servers online; needs cluster replication enabled - netapp | + | ** servers online; needs cluster replication enabled - netapp replication enabled |
** Still need to migrate Math, Captcha, Misc objects from ms7 to Swift | ** Still need to migrate Math, Captcha, Misc objects from ms7 to Swift | ||
| − | ** H/w issues need to be resolved - H/w being installed and final batch to ship on 5th Dec | + | ** <s>H/w issues need to be resolved - H/w being installed and final batch to ship on 5th Dec</s> |
** Might have to run Swift and ImageScalers in Tampa while the rest of the stack are running in Eqiad | ** Might have to run Swift and ImageScalers in Tampa while the rest of the stack are running in Eqiad | ||
** Aaron to test performance lag | ** Aaron to test performance lag | ||
| + | |||
| + | ** Ceph update | ||
| + | *** overcame several issues/ steep learning curve; cluster more stable | ||
| + | *** still an option to use Ceph | ||
| + | *** MW multiwrite for thumbs - Aaron/Mark to discuss details (already happening with NAS) | ||
* Memcached servers | * Memcached servers | ||
** mc01 - mc16 (Tampa) in production - done | ** mc01 - mc16 (Tampa) in production - done | ||
** mc1001-mc1016 OS installed, ready for puppet to be run. | ** mc1001-mc1016 OS installed, ready for puppet to be run. | ||
| − | ** Networked but issues with the Intel 10G NICs and Dell's SFP+ - workaround available; getting new SFP+ | + | ** <s>Networked but issues with the Intel 10G NICs and Dell's SFP+ - workaround available; getting new SFP+ - resolved</s> |
** Decided to use Redis and use MW multi-write feature to write to both existing MC and the new MC servers, then enable Redis replication from Tampa to Eqiad | ** Decided to use Redis and use MW multi-write feature to write to both existing MC and the new MC servers, then enable Redis replication from Tampa to Eqiad | ||
* Databases - done | * Databases - done | ||
** <strike>one more slave is needed per shard</strike> | ** <strike>one more slave is needed per shard</strike> | ||
| + | ** Grants needed (SQL ) | ||
* Poolcounter | * Poolcounter | ||
** Done: helium and potassium are installed and puppetized | ** Done: helium and potassium are installed and puppetized | ||
* Netapp | * Netapp | ||
| − | ** /home/wikipedia for deployments | + | ** /home/wikipedia for deployments (prolly not using it; use git-deploy) |
** /home - completed in Tampa, not strictly necessary in eqiad | ** /home - completed in Tampa, not strictly necessary in eqiad | ||
| + | |||
* Deployment server (fenari's deployment support infrastructure part, misc::deployment etc) | * Deployment server (fenari's deployment support infrastructure part, misc::deployment etc) | ||
| − | ** awaiting new misc server racking next week | + | ** awaiting new misc server racking next week - done. server name is Tin |
| + | |||
* Hume equivalent (misc::maintenance) - postponed | * Hume equivalent (misc::maintenance) - postponed | ||
| + | |||
* Application logging server - for mediawiki wmerrors + apache syslog | * Application logging server - for mediawiki wmerrors + apache syslog | ||
** <s>eqiad version of the udp2log instance on nfs1 that writes to /home/w/logs</s> | ** <s>eqiad version of the udp2log instance on nfs1 that writes to /home/w/logs</s> | ||
** Done: server 'flourine' for apache logs | ** Done: server 'flourine' for apache logs | ||
| + | |||
* Upload Varnish - done | * Upload Varnish - done | ||
| − | + | ||
** <s>Server OS install</s> | ** <s>Server OS install</s> | ||
** Deploy from deployment host to all application servers | ** Deploy from deployment host to all application servers | ||
| Line 47: | Line 67: | ||
== Software / Config Requirements == | == Software / Config Requirements == | ||
* Varnish software to handle media streaming efficiently | * Varnish software to handle media streaming efficiently | ||
| − | ** awaiting patch from Varnish Software (target Sept?) | + | ** awaiting patch from Varnish Software (target Sept?) - done |
| − | ** patch MediaWiki to use a different upload hostname for large files. Then we could use Squid or some specialized media streaming proxy for large files. | + | ** <s>patch MediaWiki to use a different upload hostname for large files. Then we could use Squid or some specialized media streaming proxy for large files.</s>-- n/a here |
| − | * MediaWiki deploy support for per colo config variances [https://bugzilla.wikimedia.org/show_bug.cgi?id=39082 | + | |
| + | * MediaWiki deploy support for per colo config variances ([https://bugzilla.wikimedia.org/show_bug.cgi?id=39082 Bugzilla 39082]) | ||
** generating eqiad and pmtpa dsh groups | ** generating eqiad and pmtpa dsh groups | ||
| + | ** mostly done - rolling out by end of month https://gerrit.wikimedia.org/r/#/c/32167/ https://gerrit.wikimedia.org/r/#/c/32168/ .. | ||
** new mediawiki conf files for eqiad | ** new mediawiki conf files for eqiad | ||
| + | |||
| + | |||
* replicating the git checkouts, etc. to new /home | * replicating the git checkouts, etc. to new /home | ||
| + | ** not an issue | ||
== Actually Failing Over == | == Actually Failing Over == | ||
| Line 62: | Line 87: | ||
* dns changes - our current steady state is to point wikipedia-lb.wikimedia.org in the US to eqiad but future scenarios may include external dns switches. | * dns changes - our current steady state is to point wikipedia-lb.wikimedia.org in the US to eqiad but future scenarios may include external dns switches. | ||
* Swift replication reversal - from Eqiad to Tampa | * Swift replication reversal - from Eqiad to Tampa | ||
| + | * Rollback plan - details still need to be added | ||
== Improving Failover == | == Improving Failover == | ||
| Line 76: | Line 102: | ||
** Migration needs to happen before Fundraising season starts in Nov. | ** Migration needs to happen before Fundraising season starts in Nov. | ||
** Vacation 'freeze'; all hands on deck week before and after deployment | ** Vacation 'freeze'; all hands on deck week before and after deployment | ||
| − | *** Why? Not every person is vital to migration. | + | *** Why? Not every person is vital to migration. --second. if you're not vital to migration, this seems like overkill - who are u pls? |
** migrate ns1 from tampa to ashburn, but not a critical item. | ** migrate ns1 from tampa to ashburn, but not a critical item. | ||
* An update from CT Woo from October 2012 regarding the status of the migration is available [http://lists.wikimedia.org/pipermail/wikitech-l/2012-October/063668.html here]. It looks like it'll be pushed back to January or February 2013 (post-annual fundraiser). | * An update from CT Woo from October 2012 regarding the status of the migration is available [http://lists.wikimedia.org/pipermail/wikitech-l/2012-October/063668.html here]. It looks like it'll be pushed back to January or February 2013 (post-annual fundraiser). | ||
[[Category:Eqiad cluster|*]] | [[Category:Eqiad cluster|*]] | ||
| − | |||
| − | |||
Revision as of 22:31, 18 December 2012
Coordination
- We now have an incomplete tracking ticket in RT (https://rt.wikimedia.org/Ticket/Display.html?id=3403) that depends on more specific tickets.
- Sept 12 Update - http://etherpad.wikimedia.org/TechOps-12Sept2012
High Risk & Mitigation
- What could make failover back to Tampa difficult should the migration fail?
- What if Ceph fails?
Needed Server Builds
- App, Imagescalers and API Apaches
- Image scalers: Ready to deploy @ Eqiad
- Apache/API: Ready to deploy @ Eqiad (mw1017-mw1019 puppetized for deploy testing)
- JobRunners
- Ready to test deploy @ Eqiad
- dependent on Deployment system - ready for test in EQIAD;
- deploy API, Apaches, Imagescalers ( PY )
- need to change apache and mw cfg - (AI - tbd/RobLa)
- Need to identify / doc test requirements and success criteria (what are the use cases?) - CM/PY
- Chris M will work with Ops (PY lead guy) on setting up the tests
- Overview of existing UI tests: https://github.com/wikimedia/qa-browsertests/tree/master/features
- Swift
- servers online; needs cluster replication enabled - netapp replication enabled
- Still need to migrate Math, Captcha, Misc objects from ms7 to Swift
- <s>H/w issues need to be resolved - H/w being installed and final batch to ship on 5th Dec</s>
- Might have to run Swift and ImageScalers in Tampa while the rest of the stack is running in Eqiad
- Aaron to test performance lag
- Ceph update
- overcame several issues/ steep learning curve; cluster more stable
- still an option to use Ceph
- MW multiwrite for thumbs - Aaron/Mark to discuss details (already happening with NAS)
- Memcached servers
- mc01 - mc16 (Tampa) in production - done
- mc1001-mc1016 OS installed, ready for puppet to be run.
- <s>Networked but issues with the Intel 10G NICs and Dell's SFP+ - workaround available; getting new SFP+ - resolved</s>
- Decided to use Redis and the MW multi-write feature to write to both the existing MC and the new MC servers, then enable Redis replication from Tampa to Eqiad
- Databases - done
- <strike>one more slave is needed per shard</strike>
- Grants needed (SQL)
- Poolcounter
- Done: helium and potassium are installed and puppetized
- Netapp
- /home/wikipedia for deployments (prolly not using it; use git-deploy)
- /home - completed in Tampa, not strictly necessary in eqiad
- Deployment server (fenari's deployment support infrastructure part, misc::deployment etc)
- awaiting new misc server racking next week - done. server name is Tin
- Hume equivalent (misc::maintenance) - postponed
- Application logging server - for mediawiki wmerrors + apache syslog
- <s>eqiad version of the udp2log instance on nfs1 that writes to /home/w/logs</s>
- Done: server 'flourine' for apache logs
- Upload Varnish - done
- <s>Server OS install</s>
- Deploy from deployment host to all application servers
- rsync the deployment code from the primary deployment server to the secondary
- Require a clean git repo
- Application servers in the other datacenter will use the secondary deployment system for rsync
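A rough sketch of the multi-write idea from the Memcached item above: every write goes to both the existing and the new cache servers, so the new set warms up before it takes traffic. This is not MediaWiki's implementation; `DictBackend` and `MultiWriteCache` are hypothetical names, and the in-memory dicts only stand in for real cache servers.

```python
class DictBackend:
    """In-memory stand-in for one cache cluster (hypothetical, illustration only)."""

    def __init__(self, name):
        self.name = name
        self.data = {}

    def set(self, key, value):
        self.data[key] = value
        return True

    def get(self, key):
        return self.data.get(key)


class MultiWriteCache:
    """Write to every backend; read from the first backend that has the key."""

    def __init__(self, backends):
        if not backends:
            raise ValueError("at least one backend is required")
        self.backends = list(backends)

    def set(self, key, value):
        # Attempt the write on every backend, even if an earlier one fails.
        results = [backend.set(key, value) for backend in self.backends]
        return all(results)

    def get(self, key):
        # Backends are tried in order, so list the preferred one first.
        for backend in self.backends:
            value = backend.get(key)
            if value is not None:
                return value
        return None


existing = DictBackend("existing-mc")
new = DictBackend("new-mc")
cache = MultiWriteCache([existing, new])
cache.set("session:abc", "payload")
```

Listing the existing servers first keeps reads on the warm cache until the new servers have filled up; swapping the order is then the cut-over.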
Software / Config Requirements
- Varnish software to handle media streaming efficiently
- awaiting patch from Varnish Software (target Sept?) - done
- <s>patch MediaWiki to use a different upload hostname for large files. Then we could use Squid or some specialized media streaming proxy for large files.</s> -- n/a here
- MediaWiki deploy support for per colo config variances (Bugzilla 39082)
- generating eqiad and pmtpa dsh groups
- mostly done - rolling out by end of month https://gerrit.wikimedia.org/r/#/c/32167/ https://gerrit.wikimedia.org/r/#/c/32168/ ..
- new mediawiki conf files for eqiad
- replicating the git checkouts, etc. to new /home
- not an issue
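One possible way to generate the eqiad and pmtpa dsh groups from a flat host list is sketched below. It assumes, based only on the host names in these notes (mc01-mc16 in Tampa, mc1001-mc1016 and mw1017-mw1019 in Eqiad), that a numeric suffix of 1000-1999 means eqiad and anything else means pmtpa. That rule and the function names are illustrative guesses, not the actual tooling.

```python
import re
from collections import defaultdict


def datacenter_for(host):
    """Guess the datacenter from the numeric suffix (assumed naming rule)."""
    match = re.search(r"(\d+)$", host)
    if match is None:
        # Hosts without a number (e.g. named misc servers) need a manual mapping.
        raise ValueError(f"cannot infer datacenter for {host!r}")
    number = int(match.group(1))
    return "eqiad" if 1000 <= number <= 1999 else "pmtpa"


def build_dsh_groups(hosts):
    """Return {datacenter: [hosts...]} with hosts sorted by name."""
    groups = defaultdict(list)
    for host in sorted(hosts):
        groups[datacenter_for(host)].append(host)
    return dict(groups)


groups = build_dsh_groups(["mw1017", "mc01", "mc1001"])
```

Each list could then be written out one host per line as a dsh group file.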
Actually Failing Over
- deploy db.php with all shards set to read-only in both pmtpa and eqiad
- deploy squid and mobile + bits varnish configs pointing to eqiad apaches
- master swap every core db and writable es shard to eqiad
- deploy db.php in eqiad removing the read-only flag, leave it read-only in pmtpa
- the above master-swap + db.php deploys can be done shard by shard to limit the time certain projects are read-only
- dns changes - our current steady state is to point wikipedia-lb.wikimedia.org in the US to eqiad but future scenarios may include external dns switches.
- Swift replication reversal - from Eqiad to Tampa
- Rollback plan - details still need to be added
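The shard-by-shard variant of the steps above can be sketched as a small state model: set a shard read-only in both datacenters, swap its master, then lift read-only in eqiad only. This is a toy under stated assumptions, not the real procedure; the `Shard` class, the shard names and the step strings are all hypothetical, and no db.php or MySQL command is actually run.

```python
from dataclasses import dataclass, field


@dataclass
class Shard:
    """Toy model of one database shard (hypothetical, illustration only)."""
    name: str
    master_dc: str = "pmtpa"
    # Read-only flag as seen by the db.php deployed in each datacenter.
    read_only: dict = field(
        default_factory=lambda: {"pmtpa": False, "eqiad": False}
    )


def fail_over_shard(shard, source="pmtpa", target="eqiad"):
    """Run the read-only, master-swap, re-enable sequence for one shard."""
    steps = []
    # 1. db.php with the shard read-only in both datacenters.
    shard.read_only[source] = True
    shard.read_only[target] = True
    steps.append(f"{shard.name}: read-only in {source} and {target}")
    # 2. Master swap to the target datacenter.
    shard.master_dc = target
    steps.append(f"{shard.name}: master swapped to {target}")
    # 3. db.php in the target drops the flag; the source stays read-only.
    shard.read_only[target] = False
    steps.append(f"{shard.name}: writable in {target}, read-only in {source}")
    return steps


def fail_over(shards):
    """Process shards one at a time to keep each read-only window short."""
    steps = []
    for shard in shards:
        steps.extend(fail_over_shard(shard))
    return steps


shards = [Shard("shard-a"), Shard("shard-b")]
steps = fail_over(shards)
```

Doing one shard at a time means only the projects on that shard are read-only during its swap.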
Improving Failover
- pre-generate squid + varnish configs for different primary datacenter roles
- implement MHA to better automate the mysql master failovers
- migrate session storage to redis, with redundant replicas across colos
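Pre-generating configs for each possible primary datacenter could look roughly like the sketch below: render one config per datacenter ahead of time so that a failover only needs to deploy the matching file. The template text, the `backend_host` key and the `.example` hostnames are made up for illustration and are not real squid or varnish syntax.

```python
from string import Template

# Hypothetical config fragment; real squid/varnish configs look different.
BACKEND_TEMPLATE = Template(
    "# primary datacenter: $primary\n"
    "backend_host = $backend\n"
)


def pregenerate(backends_by_dc):
    """Return {datacenter: rendered config} for every candidate primary."""
    return {
        dc: BACKEND_TEMPLATE.substitute(primary=dc, backend=host)
        for dc, host in backends_by_dc.items()
    }


configs = pregenerate({
    "pmtpa": "apaches.pmtpa.example",
    "eqiad": "apaches.eqiad.example",
})
```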
See more
- Records and original tracking doc - http://etherpad.wikimedia.org/EQIAD-rollout-sequence
- Category:Eqiad cluster
Parking Lot Issues
- Identify and plan around the deployment/migration date - tentatively Oct 15, 2012 [see below]. Need to communicate date.
- Migration needs to happen before Fundraising season starts in Nov.
- Vacation 'freeze'; all hands on deck week before and after deployment
- Why? Not every person is vital to migration. --second. if you're not vital to migration, this seems like overkill - who are u pls?
- migrate ns1 from tampa to ashburn, but not a critical item.
- An update from CT Woo from October 2012 regarding the status of the migration is available at http://lists.wikimedia.org/pipermail/wikitech-l/2012-October/063668.html. It looks like the migration will be pushed back to January or February 2013 (post-annual fundraiser).