Eqiad Migration Planning

From Wikitech
Test checklist: http://wikitech.wikimedia.org/view/Eqiad_Migration_Planning/Checklist
  
 

Revision as of 17:35, 7 January 2013


Coordination

Outstanding Server/System Readiness

  • App, Imagescalers, Bits, Jobrunners and API Apaches
    • All Ready - awaiting code deploy
  • Parsoid servers@Eqiad
    • Target - 1/11/13 (RobH)
  • Setup Ceph in eqiad for image storage (Swift in Tampa & Ceph in EQIAD) (Faidon/Mark)
    • 2 more servers set up (up to 4 now), intra-cluster replication ETA is Saturday early morning PST
    • holding off on adding more so as not to disrupt swift->ceph replication speed
    • swift->ceph copy: 17.5 TB of 43 TB done; estimated ~12 days to complete (very rough estimate)
    • some stability issues - close cooperation with Ceph developers, being fixed realtime
    • PERC H310 controller issue - worked around with RAID 0
    • Ceph 0.56 has been released and deployed to the eqiad cluster
    • various other hiccups, both hardware & software related
    • still pending: puppetization, rewrite.py -> VCL, testing with MediaWiki
  • Database Master switchover (PY / Asher)
    • MHA
    • Grants needed (SQL)
  • Poolcounter
    • test with telnetting to the port and requesting stats
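
The PoolCounter check need not be done by hand with telnet; below is a minimal Python sketch, assuming the daemon answers a STATS FULL command over its plain TCP port (7531 by default, per the PoolCounter documentation). The helper name is ours, not existing tooling.

```python
import socket

def poolcounter_stats(host, port=7531, timeout=5.0):
    """Send a STATS request to a PoolCounter daemon and return its reply."""
    with socket.create_connection((host, port), timeout=timeout) as conn:
        conn.sendall(b"STATS FULL\n")
        chunks = []
        try:
            while True:
                data = conn.recv(4096)
                if not data:  # server closed the connection
                    break
                chunks.append(data)
        except socket.timeout:
            pass  # server kept the connection open; stop after the reply
    return b"".join(chunks).decode("utf-8", "replace")
```

A Nagios-style check could wrap this and alert when the connection is refused or the reply is empty.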

Software / Config Requirements


  • replicating the git checkouts, etc. to new /home
    • not an issue

Actually Failing Over

  • Sequence (AI: Asher)
  • deploy db.php with all shards set to read-only in both pmtpa and eqiad
  • deploy squid and mobile + bits varnish configs pointing to eqiad apaches
  • master swap every core db and writable es shard to eqiad
  • deploy db.php in eqiad removing the read-only flag, leave it read-only in pmtpa
    • the above master-swap + db.php deploys can be done shard by shard to limit the time certain projects are read-only
  • No DNS or Ceph/Swift changes required
  • Rollback plan - details still to be added
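
The shard-by-shard ordering above (read-only in both datacenters, then master swap, then writable in eqiad only) can be sketched as a loop. The function names below are hypothetical stand-ins for the real deployment steps, not actual WMF tooling.

```python
# Illustrative sketch of the failover sequence above; deploy_db_config and
# swap_master are hypothetical stand-ins injected by the caller.

def fail_over_shards(shards, deploy_db_config, swap_master):
    """Fail each shard over to eqiad, limiting read-only time per shard."""
    steps = []
    for shard in shards:
        # 1. deploy db.php with this shard read-only in both pmtpa and eqiad
        deploy_db_config(shard, read_only_in=("pmtpa", "eqiad"))
        steps.append((shard, "read-only"))
        # 2. swap the shard's master to eqiad
        swap_master(shard, new_master_dc="eqiad")
        steps.append((shard, "master-swapped"))
        # 3. clear the read-only flag in eqiad; pmtpa stays read-only
        deploy_db_config(shard, read_only_in=("pmtpa",))
        steps.append((shard, "writable-in-eqiad"))
    return steps
```

Because each iteration completes all three steps before moving on, only one shard's projects are read-only at a time.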

Risk & Mitigation

Identify the high-risk migration tasks and ensure we have a way to mitigate or revert without extended downtime.

  • What could make falling back to Tampa a big problem should the migration fail?
    • should Ceph fail?
    • should Swift@Tampa fail?
    • Database integrity

Improving Failover

  • pre-generate squid + varnish configs for different primary datacenter roles
  • implement MHA to better automate the mysql master failovers
  • migrate session storage to redis, with redundant replicas across colos
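
For the session-storage item, cross-colo redundancy could be configured along the lines of the fragment below; the hostnames are hypothetical examples, and slaveof is the replication directive in the Redis versions of that era.

```
# redis.conf on a pmtpa replica of a hypothetical eqiad session master
slaveof sessions1.eqiad.wmnet 6379
# persist sessions to disk so a restart does not drop them
appendonly yes
```

With a replica in each colo, a primary-datacenter switch only requires repointing MediaWiki at the surviving replica rather than losing all sessions.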


Parking Lot Issues

  • Identify and plan around the deployment/migration date - tentatively Oct 15, 2012 [see below]. Need to communicate date.
    • Migration needs to happen before Fundraising season starts in Nov.
    • Vacation 'freeze'; all hands on deck week before and after deployment
    • migrate ns1 from tampa to ashburn, but not a critical item.
  • An update from CT Woo from October 2012 regarding the status of the migration is available here. It looks like it'll be pushed back to January or February 2013 (post-annual fundraiser).

AI - create/document checklist - PY/ChrisM

AI - automated test scripts - ChrisM


Use Cases - Tests

  • Developer
    • check in / check out code
    • code review
    • Code push/deploy
    • revert deployment
  • User
    • registers
    • search article
    • read article
    • comment on article
    • edit article
    • create article
    • localization
  • Community member
    • tag article
    • (exercise special pages features)


  • Ops
    • monitoring works - Ganglia, Nagios, Torrus, ...
    • check amanda backups
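
The user-facing cases above lend themselves to a small automated harness (per the "automated test scripts" action item). This is an illustrative sketch, not the actual scripts: it takes a fetch function plus (name, url, expected text) cases, so the same runner covers read, search, and edit checks against either datacenter.

```python
def run_smoke_tests(fetch, cases):
    """Run each (name, url, expected_substring) case; return the failures."""
    failures = []
    for name, url, expected in cases:
        try:
            body = fetch(url)
        except Exception as exc:
            failures.append((name, f"fetch error: {exc}"))
            continue
        if expected not in body:
            failures.append((name, "expected text not found"))
    return failures
```

In practice fetch would be an HTTP GET against the wiki (urllib or similar); injecting it keeps the runner testable offline.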