Eqiad Migration Planning

From Wikitech

Revision as of 18:17, 15 January 2013

Coordination

Outstanding Server/System Readiness

  • App, Imagescalers, Bits, Jobrunners, and API Apaches
    • All ready - awaiting code deploy
  • Parsoid servers @ Eqiad
    • Target - 1/11/13 (RobH)
  • Setup pc1001-1003 (PY/Asher)
    • https://rt.wikimedia.org/Ticket/Display.html?id=3644 / bugzilla http://bugzilla.wikimedia.org/42463
    • Deployed 1/14/13
    • https://bugzilla.wikimedia.org/show_bug.cgi?id=39082 - Add support for deploying per-datacenter config variances - Antoine
  • Setup Ceph in eqiad for image storage (Swift in Tampa & Ceph in EQIAD) (Faidon/Mark)
    • 2 more servers set up (up to 4 now), intra-cluster replication ETA is Saturday early morning PST
    • holding off on adding more so as not to disrupt swift->ceph replication speed
    • swift->ceph copy: 17.5 TB out of 43 TB done; complete in ~12 days (very rough estimate)
    • some stability issues - working closely with the Ceph developers, being fixed in real time
    • H310 PERC issue - worked around with RAID 0
    • Ceph 0.56 has been released and deployed to the eqiad cluster
    • various other hiccups, both hardware- and software-related
    • still pending: puppetization, rewrite.py -> VCL, testing with MediaWiki
  • Database Master switchover (PY / Asher)
    • MHA
    • https://bugzilla.wikimedia.org/show_bug.cgi?id=43453 - Checklist/script to switch datacenters - Tim
      • Automated DB/Apache switchover script
        • Tampa - Read-only
        • Eqiad - Grants needed
        • See "Actually Failing Over" below.
      • varnish configuration switchover script - Mark
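The rough 12-day estimate for the swift->ceph copy above can be sanity-checked with simple rate arithmetic. A minimal sketch, assuming a constant copy rate; the 8-day elapsed figure is a hypothetical input, not a number from these notes:

```python
# Back-of-the-envelope ETA for a bulk copy, assuming a constant rate.
# 17.5 TB done and 43 TB total come from the status notes above;
# the elapsed time is a hypothetical input.

def copy_eta_days(copied_tb: float, total_tb: float, elapsed_days: float) -> float:
    """Days remaining if the observed average rate continues."""
    rate = copied_tb / elapsed_days          # TB per day so far
    return (total_tb - copied_tb) / rate

# e.g. if 17.5 TB took 8 days, the remaining 25.5 TB needs ~11.7 more days
print(round(copy_eta_days(17.5, 43.0, 8.0), 1))
```

In practice the rate will not be constant (the notes mention throttling to protect replication speed), so any such estimate stays "very rough".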

Software / Config Requirements


  • replicating the git checkouts, etc. to new /home
    • not an issue

Actually Failing Over

  • Sequence (AI - Asher)
  • deploy db.php with all shards set to read-only in both pmtpa and eqiad
  • deploy squid and mobile + bits varnish configs pointing to eqiad apaches
  • master swap every core db and writable es shard to eqiad
  • deploy db.php in eqiad removing the read-only flag, leave it read-only in pmtpa
    • the above master-swap + db.php deploys can be done shard by shard to limit the time certain projects are read-only
  • No DNS or Ceph/Swift changes required
  • Rollback plan - details need to be added
  • Deployment! - D-day
  • D-day minus 1 (1/21/13) - preparation work
    • Automated test run
  • D-day (1/22/13)
    • see the "Actually Failing Over" sequence above
  • D-day plus 1 (1/23/13)
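The shard-by-shard sequence above can be sketched as a small orchestration skeleton. All of the deploy and master-swap helpers here are hypothetical stubs standing in for real db.php and proxy-config deploys; the shard names are illustrative:

```python
# Sketch of the shard-by-shard failover sequence: set the shard read-only
# in both datacenters, swap its master to eqiad, then re-enable writes in
# eqiad only. Helpers are stubs that just record what they would do.

SHARDS = ["s1", "s2", "s3"]  # illustrative shard names

log = []

def deploy_db_php(datacenter, shard, read_only):
    log.append(f"db.php @{datacenter}: {shard} read_only={read_only}")

def swap_master(shard, new_dc):
    log.append(f"master swap: {shard} -> {new_dc}")

def failover_shard(shard):
    # 1. set the shard read-only in both pmtpa and eqiad
    deploy_db_php("pmtpa", shard, read_only=True)
    deploy_db_php("eqiad", shard, read_only=True)
    # 2. promote the eqiad master for this shard
    swap_master(shard, "eqiad")
    # 3. re-enable writes in eqiad only; pmtpa stays read-only
    deploy_db_php("eqiad", shard, read_only=False)

# Doing shards one at a time limits how long each project is read-only.
for shard in SHARDS:
    failover_shard(shard)
```

Going shard by shard, as the notes suggest, trades total switchover time for a shorter read-only window per project.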

Risk & Mitigation

Identify the high risk migration tasks and ensure we have a way to mitigate or revert without extended downtime.

  • What could make falling back to Tampa a big problem should the migration fail?
    • should Ceph fail?
    • should Swift@Tampa fail?
    • Database integrity
    • Performance
  • Need to determine Switchback Threshold - ??
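Since the switchback threshold is still undetermined, one way to frame the discussion is as an explicit decision rule over a few health metrics. A minimal sketch; the metric names and limit values are assumptions for discussion, not agreed thresholds:

```python
# Illustrative switchback decision helper. The metrics and the limit
# values below are assumptions for discussion, not agreed thresholds.

def should_switch_back(error_rate, p95_latency_ms, replication_ok):
    """Return True if the new primary looks unhealthy enough to fall back."""
    ERROR_RATE_LIMIT = 0.02      # assumed: >2% of requests failing
    LATENCY_LIMIT_MS = 2000      # assumed: p95 latency above 2 seconds
    if not replication_ok:
        return True              # data-integrity risk trumps everything
    return error_rate > ERROR_RATE_LIMIT or p95_latency_ms > LATENCY_LIMIT_MS

print(should_switch_back(0.005, 350, True))   # healthy
print(should_switch_back(0.05, 350, True))    # high error rate
```

Writing the rule down before D-day avoids having to improvise the fallback decision under pressure.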

Improving Switchover

  • pre-generate squid + varnish configs for different primary datacenter roles
  • implement MHA to better automate the mysql master failovers
  • migrate session storage to redis, with redundant replicas across colos
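The first item above, pre-generating proxy configs per primary-datacenter role, can be sketched with a simple template. The template text and backend hostnames below are illustrative placeholders, not the real squid/varnish configs:

```python
# Sketch of pre-generating proxy configs per primary-datacenter role, so a
# switchover only swaps pre-built files instead of hand-editing configs.
# The template text and backend hostnames are illustrative, not real configs.

TEMPLATE = "backend apaches {{ .host = \"appservers.svc.{dc}.wmnet\"; }}\n"

DATACENTERS = ["pmtpa", "eqiad"]

def generate_configs():
    """Return a mapping of role name -> rendered config text."""
    return {f"varnish-primary-{dc}": TEMPLATE.format(dc=dc) for dc in DATACENTERS}

configs = generate_configs()
print(sorted(configs))
```

With both variants generated and reviewed ahead of time, the switchover step reduces to deploying the file for the new primary role.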


Parking Lot Issues

  • Identify and plan around the deployment/migration date - tentatively Oct 15, 2012 [see below]. Need to communicate date.
    • Migration needs to happen before Fundraising season starts in Nov.
    • Vacation 'freeze'; all hands on deck week before and after deployment
    • migrate ns1 from tampa to ashburn, but not a critical item.
  • An update from CT Woo from October 2012 regarding the status of the migration is available here. It looks like it'll be pushed back to January or February 2013 (post-annual fundraiser).

AI - create/document checklist - PY/ChrisM

AI - automated test scripts - ChrisM


Use Cases - Tests

  • Developer
    • check code in/out
    • code review
    • code push/deploy
    • revert deployment
  • User
    • register
    • search for an article
    • read an article
    • comment on an article
    • edit an article
    • create an article
    • localization
  • Community member
    • tag article
    • (exercise special pages features)


  • Ops
    • monitoring works - Ganglia, Nagios, Torrus, ...
    • check Amanda backups
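The use cases above map naturally onto the automated test scripts called for in the AI items. A skeleton for such a test run, where each use case becomes a named check; the check bodies here are stand-in stubs, and real checks would exercise the site (e.g. via the MediaWiki API):

```python
# Skeleton for an automated use-case test run. Each check is a stub that
# would, in a real script, hit the site and verify the observable result.

def check_read_article():
    return True   # stub: fetch a known article and verify its content

def check_search():
    return True   # stub: run a search and verify results come back

def check_edit_article():
    return True   # stub: edit a sandbox page and verify the new revision

CHECKS = {
    "read article": check_read_article,
    "search article": check_search,
    "edit article": check_edit_article,
}

def run_checks():
    """Run every check and report pass/fail by use-case name."""
    return {name: fn() for name, fn in CHECKS.items()}

results = run_checks()
print(all(results.values()))
```

Reporting by use-case name keeps the output readable during the D-day minus 1 test run, when failures need to be triaged quickly.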