External storage/Update 2011-08
A description of the project to bring the external storage service up to date.
Background and context
The "External_storage" hosts are used to store compressed old revisions of wiki pages. When a user asks for a diff of a sufficiently old revision of a page, MediaWiki grabs the compressed content from the external store database and uses it to display the user's request.
External storage is made up of several (12 maybe?) clusters, each a set of 3 databases running MySQL. Most of these clusters are three copies of the same content, are read-only, and are not set up with replication. One cluster gets all new writes. The sharding to different clusters is done somewhere within MediaWiki and is currently a mystery, but not something we actually need to worry about. The clusters are defined in db.php. Many of the clusters are running MySQL on apache nodes (srv###).
There seem to be two types of compression in use on the clusters. One has a gzip header, the other does not. <fill in examples of each.>
- I believe there may be (or may have been) a few other (old) types in use, but I don't know the details of that. Tim and Ariel would know. -- Mark Bergsma 10:07, 12 August 2011 (UTC)
- Text storage row types are listed at Text storage data -- Tim 14:07, 31 August 2011 (UTC)
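The two formats can be told apart by the first bytes of the blob: a gzip stream always starts with the magic bytes 0x1f 0x8b, while a stream without a gzip header (e.g. raw zlib output) does not. A minimal Python sketch for sniffing a blob; the helper name is illustrative, not something MediaWiki provides:

```python
import gzip
import zlib

GZIP_MAGIC = b"\x1f\x8b"

def sniff_compression(blob: bytes) -> str:
    """Guess which of the two observed compression formats a blob uses:
    a stream with a gzip header versus one without."""
    if blob[:2] == GZIP_MAGIC:
        return "gzip"
    return "no-gzip-header"

# Example: the same text compressed two ways.
text = b"old revision text " * 100
assert sniff_compression(gzip.compress(text)) == "gzip"
assert sniff_compression(zlib.compress(text)) == "no-gzip-header"
```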
Goals
- get external storage replicated to eqiad
- get external storage off of apache nodes and onto dedicated database hardware
- recompress the existing clusters into a new one, for far better compression
- document external storage's setup, maintenance requirements, and procedures
Additional research
- Find and examine the 'recompression' job, figure out what it does, how it works, and how it will affect this project
- Find out how much space is used on each of the clusters
- How much space if they're compressed?
- See http://article.gmane.org/gmane.science.linguistics.wikipedia.technical/47573/match=recompression+results -- Mark Bergsma 10:06, 12 August 2011 (UTC)
- summary: full text -> compressed diff blob gives a 93% reduction.
- How much of our current text is full text vs. compressed full text vs. compressed diff blob?
- Determine if we can consolidate the clusters onto fewer larger boxes, and if so, how many
- Understand how the sharding works to understand how to adjust the shards if we can consolidate onto fewer boxes
- the answer may be to simply maintain sharded tables on a consolidated host, but we'll see.
- find out what hardware is currently available for this project
- ms1-3 in pmtpa (currently in use)
- es1001-1004 in eqiad
- what machines currently back external store?
- http://noc.wikimedia.org/conf/highlight.php?file=db.php - externalLoads array
- other things that will cause problems
- https://bugzilla.wikimedia.org/show_bug.cgi?id=22624 <- mentioned in db.php; are cluster 1 and 2 special?
- what are all the compression methods?
- see Text_storage_data. Re-run the queries that created this page to bring it up to date.
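For context on the sharding question: for externally stored revisions, MediaWiki's text table holds not the text itself but an address of the form DB://cluster/id, and the cluster name selects a server group from the externalLoads array in db.php. A rough Python sketch of that resolution; the address format matches what MediaWiki uses, but the cluster-to-host map below is made up for illustration:

```python
def parse_external_address(addr: str):
    """Split a MediaWiki external-store address like 'DB://cluster24/12345'
    into (cluster, blob_id). Raises ValueError on anything else."""
    scheme, _, rest = addr.partition("://")
    if scheme.lower() != "db" or "/" not in rest:
        raise ValueError("not an external-store DB address: %r" % addr)
    cluster, _, blob_id = rest.partition("/")
    return cluster, blob_id

# Hypothetical cluster map, standing in for the externalLoads array in db.php.
EXTERNAL_LOADS = {
    "cluster1": ["srv101", "srv102", "srv103"],
    "cluster24": ["ms1", "ms2", "ms3"],
}

cluster, blob_id = parse_external_address("DB://cluster24/12345")
hosts = EXTERNAL_LOADS[cluster]  # the 3-host set that can serve this blob
```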
Overall intended approach
Data Migration
- ensure the ms3 cluster is in the correct compression format
- if not:
- take one of the three hosts out of rotation (ms1)
- record replication details - we're going to need to re-attach to get new data
- recompress from the other slave (ms2) onto ms1
- slave a host in eqiad (es1001) off ms1 so there's a backup
- slave ms1 off of ms3 using previously recorded binlog position to get new data
- rotate master so ms1 is getting new writes
- reslave ms2 off of ms1
- wipe ms3
- prep the second cluster
- create 3 empty dbs: ms3, es1002, es1003
- make ms3 the master, set up es1002 and es1003 to slave off ms3
- get all apache-based dbs onto db hardware (note: this is all read-only data)
- for each existing cluster:
- recompress the data into a new cluster table on ms3
- back up and then delete the MySQL data on the apache host
- rearrange slaving for the second cluster
- promote es1002 to master, es1003 and ms3 both slaving off es1002
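Several of the steps above boil down to recording a binlog position on one host and re-pointing a slave at it. A sketch that builds the CHANGE MASTER TO statement in Python so the coordinates stay together; host names, binlog file, and position are placeholders, and the real values come from SHOW MASTER STATUS on the source before it is taken out of rotation:

```python
def change_master_sql(master_host: str, log_file: str, log_pos: int,
                      user: str = "repl") -> str:
    """Build the CHANGE MASTER TO statement used to (re)attach a slave
    at a previously recorded binlog position."""
    return (
        "CHANGE MASTER TO "
        f"MASTER_HOST='{master_host}', "
        f"MASTER_USER='{user}', "
        f"MASTER_LOG_FILE='{log_file}', "
        f"MASTER_LOG_POS={log_pos};"
    )

# e.g. re-attaching ms1 to ms3 at the position recorded before recompression;
# run the generated statement, then START SLAVE; on the re-pointed host.
stmt = change_master_sql("ms3", "ms3-bin.000042", 10671)
```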
Documentation
- Starting from http://wikitech.wikimedia.org/view/External_storage, expand and improve documentation until it sufficiently describes the current environment and how to work with it
- architecture of the current cluster
- how to move new inserts from one cluster to the other
- how to age data off the current cluster
- noop?
- how to set up a new cluster
Database replication - desired end state
This section assumes we will be able to fit all external store data on two clusters. This is likely to be true but has not yet been validated.
We have 7 database machines between pmtpa and eqiad available to devote to the external store replication setup. The desired end state will have two clusters, each replicated to both data centers. Each cluster will have one master and one slave in one colo and one slave in the other, with the masters in opposite colos. New inserts will always go to the 'active' cluster; this is the one whose master is in the same colo as the main database. (Note: the active cluster will be pmtpa until we are able to fail over everything to eqiad.) The seventh box will hang off the active cluster and be available as a spare to swap out failed hosts.
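The end state described above can be written down as data and sanity-checked. The host assignments below are a guess consistent with the migration steps earlier on this page (ms1 ends up as a master in pmtpa, es1002 as a master in eqiad), not a final allocation:

```python
# Hypothetical final layout: two clusters, masters in opposite colos,
# each cluster present in both pmtpa and eqiad.
topology = {
    "cluster_a": {"master": ("ms1", "pmtpa"),
                  "slaves": [("ms2", "pmtpa"), ("es1001", "eqiad")]},
    "cluster_b": {"master": ("es1002", "eqiad"),
                  "slaves": [("es1003", "eqiad"), ("ms3", "pmtpa")]},
}

def check(topology):
    """Assert the replication invariants from the paragraph above."""
    master_colos = set()
    for name, cluster in topology.items():
        colos = {colo for _, colo in cluster["slaves"]}
        colos.add(cluster["master"][1])
        assert colos == {"pmtpa", "eqiad"}, f"{name} is not in both colos"
        master_colos.add(cluster["master"][1])
    # the two masters must sit in opposite data centers
    assert master_colos == {"pmtpa", "eqiad"}

check(topology)
```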
Tasks
- rack and set up eqiad servers (rt-xxx)
- set up puppet for external storage class (rt-91)
