Dumps
Documentation for end-users of the data dumps is at MetaWikipedia:Data dumps.
For current development plans, see Dumps/Development 2011. For status of development, see Dumps/Development status 2011.
For documentation on the "adds/changes" dumps, see Dumps/Adds-changes dumps.
For information about the parallel jobs, see Dumps/Parallelization.
Overview
User-visible files appear at http://download.wikipedia.org/backup-index.html
Dump activity involves a monitor node (running status sweeps) and arbitrarily many worker nodes running the dumps.
Status
To see which hosts serve the data, see Dumps/Dump servers. To see which hosts generate which dumps, see Dumps/Snapshot hosts.
We want mirrors! For more information see Dumps/Mirror status.
Worker nodes
The worker processes automatically go through the set of available wikis to dump. Dumps run on a "longest without a dump runs next" schedule. The plan is to have a complete dump of each wiki every 2 weeks, except for enwiki, which should have a complete dump once a month.
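The "longest without a dump runs next" policy can be sketched in a few lines. This is an illustration only, not the actual worker.py logic; the wiki names and dates are hypothetical:

```python
from datetime import date

def next_wiki_to_dump(last_dumped):
    """Pick the wiki whose last complete dump is oldest.
    Wikis never dumped (None) sort first and so run next."""
    return min(last_dumped,
               key=lambda w: last_dumped[w] or date.min)

# Hypothetical state: last completed dump date per wiki.
state = {
    "frwiki": date(2011, 5, 1),
    "dewiki": date(2011, 4, 20),
    "ruwiki": None,  # never dumped, so it runs first
}
print(next_wiki_to_dump(state))
```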
The shell script worker, which starts one of these processes, simply runs the python script worker.py in an endless loop. Multiple such workers can run at the same time on different hosts, or even on the same host.
The worker.py script creates a lock file on the filesystem containing the dumps (as of this writing, /mnt/data/xmldatadumps/) in the subdirectory private/name-of-wiki/lock. No other process will try to write dumps for that project while the lock file is in place.
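The locking idea can be sketched as follows. This is a minimal illustration, not the actual worker.py code; in particular, writing the hostname and pid into the lock file is an assumption here:

```python
import os
import socket

def acquire_lock(private_dir, wiki):
    """Try to create private/<wiki>/lock; return its path on
    success, or None if another worker already holds the lock."""
    lockdir = os.path.join(private_dir, wiki)
    os.makedirs(lockdir, exist_ok=True)
    lockfile = os.path.join(lockdir, "lock")
    try:
        # O_EXCL makes creation fail atomically if the file exists.
        fd = os.open(lockfile, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
    except FileExistsError:
        return None  # someone else is dumping this wiki
    os.write(fd, f"{socket.getfqdn()} {os.getpid()}\n".encode())
    os.close(fd)
    return lockfile
```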
Local copies of the shell script and the python script live on the snapshot hosts in the directory /backups but currently are run out of /backups-atg (since this code is not yet in trunk) in screen sessions on the various hosts, as the user "backup".
Monitor node
The monitor node checks for and removes stale lock files from dump processes that have died, and updates the central index.html file which shows the dumps in progress and the status of the dumps that have completed (i.e. http://dumps.wikimedia.org/backup-index.html ). It does not start or stop worker processes.
The shell script monitor which starts the process simply runs the python script monitor.py in an endless loop.
As with the worker nodes, local copies of the shell script and the python script live on the snapshot hosts in the directory /backups but currently are run out of /backups-atg (since this code is not yet in trunk) in a screen session on one host, as the user "backup".
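The monitor's stale-lock sweep can be sketched like this; the one-hour staleness threshold is a made-up value for illustration, not the monitor.py default:

```python
import os
import time

STALE_AGE = 3600  # hypothetical threshold, in seconds

def remove_if_stale(lockfile, now=None):
    """Remove a lock file whose mtime is older than STALE_AGE.
    Return True if the lock was removed (a sketch of the monitor's
    stale-lock sweep, not the actual monitor.py code)."""
    now = now or time.time()
    try:
        age = now - os.path.getmtime(lockfile)
    except FileNotFoundError:
        return False  # already gone
    if age > STALE_AGE:
        os.remove(lockfile)
        return True
    return False
```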
Code
Check /branches/ariel/xmldumps-backup for the python code in use. Eventually this will make its way back into trunk; it's still a bit gross right now.
Setup
Adding a new worker box
Install and add to site.pp, copying one of the existing snapshot stanzas in puppet. This does, among other things:
- Set up the base MW install without apache running
- Add worker to /etc/exports/ on dataset2
- Add /mnt/data to /etc/fstab of worker host
- Build the utfnormal php module (done for lucid)
For now:
- Backups are running test code out of /backups-atg on each host, so grab a copy of that directory from any existing host and copy it into /backups-atg on the new host. This includes the conf files; you don't need to set them up separately.
- Check over the configuration file and make sure it looks sane, all the paths point to things that exist, etc. For too many details see the README in svn.
- We run enwiki on its own host. If this host is going to do that work, check /backups-atg/wikidump.conf.enwiki.
- The next 8 or so largest wikis are run on their own separate host so they don't backlog the smaller wikis. For those, check /backups-atg/wikidump.conf.bigwikis.
- The remainder of the wikis run on one host. Check /backups-atg/wikidump.conf for those.
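One way to sanity-check a configuration file, per the advice above, is to verify that every absolute path it mentions actually exists on the host. This is a quick illustrative helper, not part of the dump scripts:

```python
import os
import re

def missing_paths(conf_file):
    """Report absolute paths mentioned in a wikidump.conf-style
    file that don't exist on this host."""
    with open(conf_file) as f:
        paths = set(re.findall(r"/[^\s;,]*", f.read()))
    return sorted(p for p in paths if not os.path.exists(p))
```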
File layout
- <base>/
- index.html - List of all databases and their last-touched status
- <db>/
- index.html - List of items in the database
- <date>/ - Files produced by the dump run of that date
Sites are identified by raw database name currently. A 'friendly' name/hostname can be added for convenience of searching in future.
Error handling
If a dump step returns an error condition, the runner script should detect this and mark the item as "failed" on the HTML pages. The runner will keep on trying other steps and remaining databases, unless the runner script itself fails somehow.
It will also e-mail xmldatadumps-admin-l@lists.wikimedia.org with a notification.
If the server crashes while a dump is running, the status files are left as-is, and the display shows the dump as still running until the monitor node decides the lock file is stale enough to mark it as aborted.
Troubleshooting
If the host runs low on disk space, you can reduce the number of backups that are kept. Edit the file /home/wikipedia/src/backup/wikidump.conf on the monitor host and look for the line that says "keep=<some value>".
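The effect of lowering keep= can be illustrated with a short sketch that deletes all but the newest N dated dump directories for a wiki. The real scripts do their own pruning; this is not their code:

```python
import os
import shutil

def prune_old_dumps(wiki_dir, keep):
    """Delete all but the newest `keep` dated dump directories
    under <base>/<wiki>/ (keep must be >= 1). Date-stamped
    directory names (YYYYMMDD) sort chronologically as strings."""
    dates = sorted(d for d in os.listdir(wiki_dir)
                   if os.path.isdir(os.path.join(wiki_dir, d)))
    for d in dates[:-keep]:
        shutil.rmtree(os.path.join(wiki_dir, d))
    return dates[-keep:]  # the runs that were kept
```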
Testing
If you need to run dumps for one wiki, be the user backup on one of the snapshot hosts and then do
- cd /backups
- python ./worker.py name-of-wiki-here
For example,
- python ./worker.py enwiki
If you have the stub file and want to generate the full XML file with text revisions from the stub file, be the user backup on one of the snapshot hosts, and run just that job for that date:
- cd /backups
- python ./worker.py --job metahistorybz2dump --date YYYYmmdd [--configfile configfilenamehere] name-of-wiki-here
Example:
- python ./worker.py --job metahistorybz2dump --date 20100904 --configfile wikidump.conf.bigwikis ruwiki
For other types of page dumps, change the job name. If you give the command the option --job help, it will print a list of the various jobs; they should be self-explanatory.
Ordinarily this will run with prefetch. If you want no prefetch (i.e. you want to get every text revision directly from the database instead of pulling what you can from the files of the previous run), you can give the additional option --noprefetch.
Note that the name of the wiki must always be last on the command line, and it is the name as seen in /home/wikipedia/common/all.dblist .
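The prefetch idea is to reuse revision text from the previous run's files and fall back to the database only for revisions not found there. A minimal sketch of that lookup-with-fallback logic follows; the function names are illustrative, and this is not the actual dumpTextPass.php mechanism, which streams XML rather than using an in-memory map:

```python
def fetch_text(rev_id, prefetch, fetch_from_db):
    """Return revision text, preferring the previous dump's copy.
    `prefetch` maps rev_id -> text recovered from the old dump;
    `fetch_from_db` is the slow database lookup."""
    text = prefetch.get(rev_id)
    if text is not None:
        return text
    return fetch_from_db(rev_id)  # only for revisions not prefetched
```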
The old way to run jobs (still ok, but you probably won't need it) is described below. Note that the --prefetch path points at the previous run's dump, while the --stub and --output paths use the current run's date:
- cd /backups
- /usr/bin/php -q /apache/common/php-1.5/maintenance/dumpTextPass.php --wiki=name-of-wiki --stub=gzip:/mnt/dumps/public/name-of-wiki/timestamp/name-of-wiki-timestamp-stub-meta-history.xml.gz --prefetch=bzip2:/mnt/dumps/public/name-of-wiki/timestamp/name-of-wiki-timestamp-pages-meta-history.xml.bz2 --force-normal --report=1000 --server=10.0.0.234 --spawn=/usr/bin/php --output=bzip2:/mnt/dumps/public/name-of-wiki/timestamp/name-of-wiki-timestamp-pages-meta-history.xml.bz2 --full
Example:
- /usr/bin/php -q /apache/common/php-1.5/maintenance/dumpTextPass.php --wiki=ruwiki --stub=gzip:/mnt/dumps/public/ruwiki/20100531/ruwiki-20100531-stub-meta-history.xml.gz --prefetch=bzip2:/mnt/dumps/public/ruwiki/20100331/ruwiki-20100331-pages-meta-history.xml.bz2 --force-normal --report=1000 --server=10.0.0.234 --spawn=/usr/bin/php --output=bzip2:/mnt/dumps/public/ruwiki/20100531/ruwiki-20100531-pages-meta-history.xml.bz2 --full
If you want to run against a different type of stub file and produce the text XML output for it, adjust the names appropriately for the stub, prefetch and output files, and change the --full option accordingly (to e.g. --current).
If you want to run without spawning the fetchText.php process for text revision retrievals, leave off the --spawn=/usr/bin/php option.
Backup stages
- First stage: dumps of various database tables, both private and public
- Second stage: list of page titles, page abstracts for Yahoo
- Third stage: page stubs (dumpBackup.php), gzipped
- Possibly additional recombine phase to combine chunks produced in parallel, into one complete file
- Fourth stage: XML files with revision texts, bzipped (dumpTextPass.php, fetchText.php)
- Possibly additional recombine phase to combine chunks produced in parallel, into one complete file
- Fifth stage: 7z compression of the XML file with all revision texts (full history)
- Possibly additional recombine phase to combine chunks produced in parallel, into one complete file
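Per the error-handling section above, a failed step is marked as such and the runner keeps going. The stage sequence can be sketched as follows; the job names here are loosely modeled on the real ones (see worker.py --job help for the actual list), and run_all is an illustration, not the runner's code:

```python
# Illustrative stage names, in the order the stages run.
STAGES = [
    "tables",              # private and public SQL table dumps
    "pagetitles",          # list of page titles
    "abstracts",           # page abstracts
    "stubs",               # gzipped page stubs (dumpBackup.php)
    "metahistorybz2dump",  # revision texts, bzipped (dumpTextPass.php)
    "metahistory7zdump",   # 7z compression of the full history
]

def run_all(stages, run_one):
    """Run each stage in order, recording 'done' or 'failed';
    keep going after a failure, as the runner does."""
    status = {}
    for s in stages:
        try:
            run_one(s)
            status[s] = "done"
        except Exception:
            status[s] = "failed"
    return status
```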
Programs used
See Dumps/Software dependencies.
The scripts call mysqldump, dumpBackup.php, and dumpTextPass.php directly for dump generation.
Missing features
Currently, image tarballs are still not being made.
Static HTML dumps might also be included in this mess in future?
"Incremental" dumps?
Limitations
The scripts in the /backups directory on the snapshot hosts are not updated by scap or any of the usual mechanisms. The php scripts, in contrast, do get updated, and the updated versions will be invoked the next time worker.py starts up, i.e. on the next wiki project by date that is due for a run. This might be a problem since comprehensive testing of XML dumps is usually not done before a code push.
Notes
(This stuff may not be current.)
Probably not all error detection is working right now. Failures of the mysqldump runs are not detected. Tar failures are not detected. <-- current?
Failures of dumpPages.php should be detected, but only indirectly, via the failure of mwdumper to parse its XML output. <-- current?
- The page XML dumps should be consistent: all three outputs draw from one input, which comes from one long SQL transaction plus supplementary data loads that should be independent of changes.
- The other SQL dumps are not going to be 100% time-consistent. But that's not too important.
grantswiki and internalwiki are special-cased so they _should_ get completely backed up into /var/backup/private instead of the public dir. <-- eh??