Dumps
End-user documentation for the data dumps is at MetaWikipedia:Data dumps.
Top-level procedure
The dumped files are stored and served to the web from storage2.
User-visible files appear at http://download.wikipedia.org/backup-index.html
Dump activity involves a monitor node (running status sweeps) and arbitrarily many worker nodes running the dumps.
Monitor node
- srv31
- /home/wikipedia/src/backup/monitor
Worker nodes
The worker threads go through the set of available wikis to dump automatically.
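The selection logic can be pictured roughly as follows. This is a minimal sketch, not the actual WikiBackup.py code; the function name, the per-wiki `lock` and `last_dumped` files, and the directory layout are illustrative assumptions.

```python
import os

def pick_next_wiki(dblist, base_dir):
    """Pick the unlocked wiki whose last dump is oldest (sketch only)."""
    candidates = []
    for db in dblist:
        lock = os.path.join(base_dir, db, "lock")
        if os.path.exists(lock):
            continue  # another worker is already dumping this wiki
        stamp = os.path.join(base_dir, db, "last_dumped")
        mtime = os.path.getmtime(stamp) if os.path.exists(stamp) else 0
        candidates.append((mtime, db))
    if not candidates:
        return None
    candidates.sort()  # ascending mtime: oldest dump first
    return candidates[0][1]
```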
Adding a new worker box
- apt-get install php5-dev libicu-dev g++ php-config gcc-3.4 mysql-client-5.0 p7zip-full swig
- svn co http://svn.wikimedia.org/svnroot/mediawiki/trunk/extensions/normal/
- cd normal; make
- install into the PHP extension directory reported by php -i | grep extension_dir, e.g. /usr/lib/php5/20060613
- mv php_utfnormal.so /usr/lib/php5/20060613
- add /etc/php5/conf.d/utfnormal.ini with extension=php_utfnormal.so
- p7zip 4.58 or later is needed because of https://bugs.edge.launchpad.net/hardy-backports/+bug/370618; alternatively, chmod 644 the output files in place
Configuration notes
Worker nodes need:
- MediaWiki install for batch use
- utfnormal extension available for PHP
- storage2 dumps dir mounted
- dbzip2 installed
Locks and logs
The worker threads use lock files in the private directories. Lock files are touched by a background thread during dump; the monitor node looks for stale lock files and releases them for later re-running by a worker node.
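The touch-and-check mechanics might look like this. A minimal sketch under assumptions: the timeout value, function names, and the use of a `threading.Event` for shutdown are illustrative, not the real runner's API.

```python
import os
import time

LOCK_TIMEOUT = 3600  # assumed staleness threshold, in seconds

def touch_periodically(lock_path, interval, stop_event):
    """Background-thread body: refresh the lock's mtime while the dump runs.

    `stop_event` is a threading.Event; the loop exits when it is set.
    """
    while not stop_event.wait(interval):
        os.utime(lock_path, None)  # bump mtime to "now"

def is_stale(lock_path, timeout=LOCK_TIMEOUT):
    """Monitor-side check: a lock untouched for longer than `timeout` is stale."""
    return time.time() - os.path.getmtime(lock_path) > timeout
```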
Raw output from the script goes into whatever file you tee it to. A separate text log isn't kept at the moment, but status information is saved into HTML files for public consumption:
File layout
- <base>/
  - index.html - List of all databases and their last-touched status
  - <db>/
    - <date>/
      - index.html - List of items in this dump run
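A tiny helper illustrating the path scheme above; this is an illustrative sketch, not a function from the runner, and the argument names are assumptions.

```python
import os

def status_page(base, db=None, date=None):
    """Return the index.html path for the site-wide, or per-db-per-date, status page."""
    parts = [base]
    if db is not None:
        parts.append(db)
        if date is not None:
            parts.append(date)
    return os.path.join(*parts, "index.html")
```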
Sites are currently identified by raw database name. A 'friendly' name/hostname could be added in the future to make searching easier.
Error handling
If a dump step returns an error condition, the runner script should detect this and mark the item as "failed" on the HTML pages. The runner will keep on trying other steps and remaining databases, unless the runner script itself fails somehow.
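The keep-going policy amounts to catching each step's failure, recording it, and moving on. A sketch under assumptions: the step list and the `report` callback are hypothetical stand-ins for the runner's real status machinery.

```python
def run_dump(db, steps, report):
    """Run each (name, func) step for one wiki; mark failures but keep going."""
    for name, step in steps:
        try:
            step(db)
            report(db, name, "done")
        except Exception as exc:
            # Mark this item as failed on the status page, then continue
            # with the remaining steps rather than aborting the whole run.
            report(db, name, "failed: %s" % exc)
```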
It will also e-mail brion with a notification, at least in theory. (This is not yet fully tested)
If the server crashes while a dump is running, the status files are left as-is and the display shows it as still running, until the monitor node decides the lock file is stale enough to mark it as aborted.
Programs used
The dump runner script is available in our CVS repository, in the 'backup' module, as WikiBackup.py.
- mysqldump
- dumpBackup.php, dumpTextPass.php to generate the XML dumps
- Requires working PHP 5 and MediaWiki installation on amaryllis/srv31! Don't remove from mediawiki-installation dsh group!
- need the XMLReader PHP extension, zlib, and bzip2 enabled
- the ActiveAbstract MW extension is used to generate the abstract feed for Yahoo!
- 7za (of p7zip) must be installed and in the path
Other missing features
Currently, image tarballs are still not being made.
Static HTML dumps might also be included in this mess in the future.
Notes
Error detection is probably not all working right now: failures of the mysqldump runs are not detected, and tar failures are not detected.
Failures of dumpBackup.php should be detected, but only indirectly, via mwdumper failing to parse its XML output.
The mysql dumps and page XML dump pull from bacon, as configured in one of the higher-level backup scripts. Currently bacon is *NOT* being stopped from replication...
- The page XML dumps should be consistent: all three outputs draw from one input, which comes from one long SQL transaction plus supplementary data loads that should be independent of concurrent changes.
- The other SQL dumps are not going to be 100% time-consistent. But that's not too important.
grantswiki and internalwiki are special-cased so they _should_ get completely backed up into /var/backup/private instead of the public dir.