Dumps

From Wikitech
Revision as of 18:27, 1 February 2011

Docs for end-users of the data dumps are at MetaWikipedia:Data dumps.

For current development plans, see Dumps/Development 2011. For status of development, see Dumps/Development status 2011.

Top-level procedure

The dumped files are stored and served to the web from dataset2.

User-visible files appear at http://download.wikipedia.org/backup-index.html

Dump activity involves a monitor node (running status sweeps) and arbitrarily many worker nodes running the dumps.

Status

Dumps are being served from dataset2.

We are interested in mirroring of the dumps; please add information to the mirroring page if you can host them or know of an organization that can.

We have copied one complete run of our public XML files (about 1.3T?) off to Google storage, which they have kindly donated to us. We'd like to copy a new run once every two weeks, keep the last four copies, and retain one copy permanently every six months. Script here.

Parallel dumps of en wiki will resume shortly. See Dumps/Parallelization for much more on this.

Monitor node

  • snapshot2
    • /home/wikipedia/src/backup/monitor
      • This shell script runs monitor.py in an endless while loop. Its sole purpose is to check for and remove stale lock files from dump processes that have died, and to update the central index.html file which shows the dumps in progress and the status of the dumps that have completed (i.e. http://dumps.wikimedia.org/backup-index.html ). It does not start or stop worker threads.
      • The local copy of this script is running out of /backups-atg in a screen session on the given host, as the user "backup".
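The stale-lock sweep that monitor.py performs might be sketched in shell roughly as below; the staleness threshold and the exact invocation are assumptions here, not values taken from the real script:

```shell
#!/bin/sh
# Toy version of the monitor's stale-lock sweep. Workers touch their lock
# file while alive, so a lock with an old mtime belongs to a dead dump.
# STALE_MINUTES is an assumed threshold, not the real one.
STALE_MINUTES=60

remove_stale_locks() {
    # $1 is the private dumps tree, e.g. /mnt/dataset1/xmldatadumps/private
    find "$1" -type f -name lock -mmin +"$STALE_MINUTES" -print -delete
}
```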

Worker nodes

The worker threads automatically go through the set of available wikis to dump. Dumps are run on a "longest without a dump runs next" schedule. The plan is to have a complete dump of each wiki every two weeks, except for en wiki, which should have a complete dump once a month.
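The "longest without a dump runs next" rule amounts to picking the wiki whose newest dump directory carries the oldest date. A toy shell version of that selection (the real logic lives inside worker.py and is certainly more involved; the flat wiki/YYYYMMDD directory layout here is an assumption):

```shell
#!/bin/sh
# For each wiki directory under $1, find its newest YYYYMMDD dump dir,
# then return the wiki whose newest dump is oldest overall.
oldest_wiki() {
    for w in "$1"/*; do
        newest=$(ls "$w" | sort | tail -1)   # newest date for this wiki
        echo "$newest ${w##*/}"
    done | sort | head -1 | cut -d' ' -f2    # oldest "newest date" wins
}
```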

  • snapshot2 -- currently running 3 threads which cover all dumps but en wiki and the larger wikis (de es fr etc) out of /backups-atg. Command: ./worker
  • snapshot2 -- currently running 3 threads which cover all of the larger wikis (9 of them) except en wiki, out of /backups-atg. Command: ./worker wikidump.conf.bigwikis
  • snapshot3 -- running en wiki dumps in stages
    Currently doing the recombine from 2010 09 04 history bz2, via
    python ./worker.py --job metahistorybz2dumprecombine --date 20100904 --configfile wikidump.conf.enwiki enwiki
    Also running the history dumps for 2011 01 15, via
    python ./worker.py --job metahistorybz2dump --date 20110115 --configfile wikidump.conf.enwiki.new enwiki
  • snapshot1 -- in process of OS upgrade
    • /home/wikipedia/src/backup/worker
      • This shell script runs worker.py in an endless loop. The python script is invoked without arguments and in this mode will look for the oldest dump directory and will produce sql and XML files for that project.
      • The worker.py script creates a lock file on the filesystem containing the dumps (as of this writing, /mnt/dataset1/xmldatadumps/) in the subdirectory private/name-of-wiki/lock. No other process will try to write dumps for that project while the lock file is in place.
      • Local copies of the shell script and the python script live on the snapshot hosts in the directory /backups. Current copies in use are run out of /backups-atg in screen sessions on the various hosts, as the user "backup".
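worker.py's create-the-lock-or-skip behavior described above can be approximated in shell with an atomic create-if-absent check (a sketch only; the real locking is done in the python script):

```shell
#!/bin/sh
# acquire_lock PRIVATEDIR WIKI -> exit status 0 if we took the lock,
# nonzero if another worker already holds it.
acquire_lock() {
    mkdir -p "$1/$2"
    # set -C (noclobber) makes the redirect fail if the lock already
    # exists, so test-and-create happens in one atomic step.
    ( set -C; : > "$1/$2/lock" ) 2>/dev/null
}
```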

Code

Check /branches/ariel/xmldumps-backup for the python code in use. Eventually this will make its way back into trunk; it's still a bit gross right now.

Configuration notes

The configuration file lives in the directory /backups in the file wikidump.conf. There are separate configuration files for different groups of wikis. Among other useful things that can be set in this file are the directory of the mediawiki installation and the filesystem to which to write the XML dumps.
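For illustration only, a config in this style might look like the fragment below; the section and key names are invented here, not taken from the real wikidump.conf (the paths come from elsewhere on this page):

```
# Hypothetical wikidump.conf fragment -- key names are illustrative only.
[wiki]
mediawiki=/apache/common/php-1.5          ; MediaWiki installation dir
dblist=/home/wikipedia/common/all.dblist  ; wikis covered by this config
[output]
public=/mnt/dataset1/xmldatadumps/public  ; filesystem for the XML dumps
private=/mnt/dataset1/xmldatadumps/private
```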

Adding a new worker box

A portion of this is in the process of being puppetized.

  1. Install like a regular app server but without apache running
  2. Add worker to /etc/exports/ on dataset1 <-- done in puppet
  3. Add /mnt/dumps to /etc/fstab of worker host <-- done in puppet (and it's /mnt/data now)
  4. Build the utfnormal php module: <-- seriously can we build a package and puppetize this?
    • apt-get install php5-dev libicu-dev g++ php-config gcc-3.4 mysql-client-5.0 p7zip-full swig subversion
    • svn co http://svn.wikimedia.org/svnroot/mediawiki/trunk/extensions/normal/
    • cd normal; make
    • find the PHP extension directory via php -i | grep extension_dir (e.g. /usr/lib/php5/20060613)
    • mv php_utfnormal.so /usr/lib/php5/20060613
    • add /etc/php5/conf.d/utfnormal.ini with extension=php_utfnormal.so
  5. Install 7zip version 4.58 or later (because of https://bugs.edge.launchpad.net/hardy-backports/+bug/370618), or chmod 644 the files in place
  6. svn co http://svn.wikimedia.org/svnroot/mediawiki/trunk/backup/ /backups
  7. svn co 'svn+ssh://user@svn.wikimedia.org/svn-private/wmf/xmlsnapshots/conf' conf
  8. mv wikidump.conf ../.

Locks and logs

The worker threads use lock files in the private directories (e.g. .../private/name-of-wiki/lock). Lock files are touched by a background thread during dump; the monitor node looks for stale lock files and deletes them so that the job may be run again later by a worker node.

Raw output from the script goes into whatever file you tee it into. A separate text log isn't kept at the moment, but status information is saved into HTML files for public consumption.

File layout

  • <base>/

Sites are currently identified by raw database name. A 'friendly' name/hostname may be added in future for easier searching.

Error handling

If a dump step returns an error condition, the runner script should detect this and mark the item as "failed" on the HTML pages. The runner will keep on trying other steps and remaining databases, unless the runner script itself fails somehow.

It will also e-mail xmldatadumps-admin-l@lists.wikimedia.org with a notification.

If the server crashes while a dump is running, the status files are left as-is, and the display shows the dump as still running until the monitor node decides the lock file is stale enough to mark it as aborted.

Troubleshooting

If the host runs low on disk space, you can reduce the number of backups that are kept. Edit the file /home/wikipedia/src/backup/wikidump.conf on the monitor host and look for the line that says "keep=<some value>".
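A quick way to inspect and lower that setting from the shell (assuming the file really uses plain `keep=N` lines, which is an inference from this page, not verified against the real config):

```shell
#!/bin/sh
# Report and change the keep=N retention line in a wikidump.conf-style
# file. The flat "keep=N" format is an assumption based on this page.
show_keep() { grep '^keep=' "$1"; }
set_keep()  { sed -i "s/^keep=.*/keep=$2/" "$1"; }
```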

Testing

If you need to run dumps for one wiki, be the user backup on one of the snapshot hosts and then do

  • cd /backups
  • python ./worker.py name-of-wiki-here

For example,

  • python ./worker.py enwiki

If you have the stub file and want to generate the full XML file with text revisions from the stub file, be the user backup on one of the snapshot hosts, and run just that job for that date:

  • cd /backups
  • python ./worker.py --job metahistorybz2dump --date YYYYmmdd [--configfile configfilenamehere] name-of-wiki-here

Example:

  • python ./worker.py --job metahistorybz2dump --date 20100904 --configfile wikidump.conf.bigwikis ruwiki

For other types of page dumps, change the job name. If you give the command with --job help it will produce a list of the various jobs; they should be self explanatory.

Ordinarily this will run with prefetch. If you want no prefetch (i.e. you want to get every text revision directly from the database instead of reusing what you can from the previous run's files), give the additional option --noprefetch.

Note that the name of the wiki must always be last on the command line, and it is the name as seen in /home/wikipedia/common/all.dblist .


The old way to run jobs (still ok but you probably won't need it) is described below:

  • cd /backups
  • /usr/bin/php -q /apache/common/php-1.5/maintenance/dumpTextPass.php --wiki=name-of-wiki --stub=gzip:/mnt/dumps/public/name-of-wiki/timestamp/name-of-wiki-timestamp-stub-meta-history.xml.gz --prefetch=bzip2:/mnt/dumps/public/name-of-wiki/previous-timestamp/name-of-wiki-previous-timestamp-pages-meta-history.xml.bz2 --force-normal --report=1000 --server=10.0.0.234 --spawn=/usr/bin/php --output=bzip2:/mnt/dumps/public/name-of-wiki/timestamp/name-of-wiki-timestamp-pages-meta-history.xml.bz2 --full

Note that the --prefetch file comes from the previous dump run's timestamp, not the current one, as in the example below.

Example:

  • /usr/bin/php -q /apache/common/php-1.5/maintenance/dumpTextPass.php --wiki=ruwiki --stub=gzip:/mnt/dumps/public/ruwiki/20100531/ruwiki-20100531-stub-meta-history.xml.gz --prefetch=bzip2:/mnt/dumps/public/ruwiki/20100331/ruwiki-20100331-pages-meta-history.xml.bz2 --force-normal --report=1000 --server=10.0.0.234 --spawn=/usr/bin/php --output=bzip2:/mnt/dumps/public/ruwiki/20100531/ruwiki-20100531-pages-meta-history.xml.bz2 --full

If you want to run against a different type of stub file and produce the text XML output for it, adjust the names appropriately for the stub, prefetch and output files, and change the --full option accordingly (to e.g. --current).
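The long commands above all build paths from one naming convention, <public>/<wiki>/<date>/<wiki>-<date>-<kind>. A small helper capturing that convention (hypothetical, for illustration only):

```shell
#!/bin/sh
# dumpfile PUBLICDIR WIKI DATE KIND -> conventional dump-file path,
# mirroring the names used in the dumpTextPass.php commands above.
dumpfile() {
    printf '%s/%s/%s/%s-%s-%s\n' "$1" "$2" "$3" "$2" "$3" "$4"
}
```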

If you want to run without spawning the fetchText.php process for text revision retrievals, leave off the --spawn=/usr/bin/php option.

Backup stages

  • First stage: dumps of various database tables, both private and public
  • Second stage: list of page titles, page abstracts for Yahoo
  • Third stage: page stubs (dumpBackup.php)
  • Fourth stage: XML files with revision texts, bzipped (dumpTextPass.php, fetchText.php)
  • Fifth stage: 7z compression of the XML file with all revision texts (full history)

Programs used

The dump runner script was originally available in CVS, in the 'backup' module, as WikiBackup.py; see the Code section above for where the current python code lives.

  • mysqldump
  • dumpBackup.php, dumpTextPass.php, fetchText.php to generate the XML dumps
    • Requires working PHP 5 and MediaWiki installation on the snapshot hosts! Don't remove from mediawiki-installation dsh group!
    • need the XMLReader PHP extension, zlib, and bzip2 enabled
    • using the ActiveAbstract MW extension for Yahoo's wacky stuff
    • 7za (of p7zip) must be installed and in the path

Other missing features

Currently, image tarballs are still not being made.

Static HTML dumps might also be included in this mess in future?

Limitations

There is no mechanism for running one stage of a dump of a given project via the worker.py script. (Update: it now supports running specific jobs for a given dump.)

The only way to prevent a particular project from being dumped is to manually create the lock file private/name-of-wiki/lock. (Update: there is now a "skip file" list which specifies which projects to skip past.)

The scripts in the /backups directory on the snapshot hosts are not updated by scap or any of the usual mechanisms. (The php scripts, in contrast, do get updated, and the updated versions will be invoked the next time worker.py starts up, i.e. on the next wiki project by date that is due for a run.)

If you shoot a worker thread (either the main worker script or the python script) in the middle of a dump and want it to rerun or complete that wiki, you will need to either (re)run the missing steps by hand, or remove the lock file and the directory with the timestamp and newly created files in it, and then start the script again. (Update: running "by hand" now means running each job in turn for that date, e.g. python ./worker.py --date 20110110 --job abstractsdump elwikidb, which, if still a bit of a PITA, is not as bad as having to concoct the dumpTextPass.php --prefetch somethingorother --gzip some other thing etc. string by hand.)

There is no simple mechanism for running only certain wikis on a given snapshot host (e.g. enwiki on one, the next 5-6 large ones on another, and the smaller ones on a third, so that the smaller project dumps aren't held up waiting for the longer ones to complete). (Update: with the skip-db list and with separate config files, we can now create local lists of wikis to be dumped by separate processes, either on the same or different hosts.)

Notes

Not all error detection is working right now: failures of the mysqldump runs are not detected, and neither are tar failures.

Failures of dumpPages.php should be detected, but indirectly from the failure of mwdumper to parse its XML output.

The mysql dumps and page XML dump pull from bacon, as configured in one of the higher-level backup scripts. Currently bacon is *NOT* being stopped from replication... -- Ummmm... from what?? This is completely outdated and I have no idea what it refers to. Bacon? Eggs? Ham?? :-P

  • The page XML dumps should be consistent: all three outputs draw from one input, which is drawn from one long SQL transaction plus supplementary data loads that should be independent of changes.
  • The other SQL dumps are not going to be 100% time-consistent. But that's not too important.

grantswiki and internalwiki are special-cased so they _should_ get completely backed up into /var/backup/private instead of the public dir.
