Dumps/Rerunning a job

From Wikitech
Revision as of 16:38, 21 January 2012 by ArielGlenn


(This needs cleanup.)


Fixing a broken dump

Back in the day, the dumps were meant to be generated in an endless loop with no human intervention; if a dump broke, you waited another 10 days or so until your project's turn came around again, and then there was a new one.

These days folks want the data Right Now, and some dumps take a good long time to run (*cough*en wp*cough*). If you see a broken or failed dump, this is how you can fix it up.

Rerunning a complete dump

If most of the steps failed, or the script failed or died early on, you might as well rerun the entire thing.

  1. Clean up the old directory. From any snapshot host you can get to the top-level directory at /mnt/data/xmldatadumps/public/name-of-wiki; find the subdirectory you want. These will all have names of the form YYYYmmdd, indicating the date the run was started. Make a note of this date; you need it to rerun the dump.
  2. Remove any lock file that may be left lying around. The lock file will be named /mnt/data/xmldatadumps/private/name-of-wiki/lock, but ONLY REMOVE IT if there is not a newer dump (for a more current date) running for this project.
  3. Determine which configuration file you need to use:
    • wikidump.conf.enwiki for en wikipedia
    • wikidump.conf.bigwikis for the wikis listed in /backups-atg/bigwikis.dblist
    • wikidump.conf for the rest
  4. Determine which host you should run from (see Dumps/Snapshot hosts for which host runs which wikis).
  5. If there's already a root screen session on that host, use it; otherwise start a new one. Open a window, then:
    • su - backup
    • bash
    • cd /backups-atg
    • python ./worker.py --date YYYYmmdd --configfile wikidump.conf.XXX --log name-of-wiki
    The date in the above will be the date in the directory name you removed.
    Example: to rerun the enwiki dumps for January 2012 you would run
    python ./worker.py --date 20120104 --configfile wikidump.conf.enwiki --log enwiki
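The cleanup in steps 1 and 2 can be sketched as a short shell snippet. This demonstration runs in a throwaway sandbox directory rather than the real /mnt/data/xmldatadumps tree, and the wiki name and date are made up:

```shell
# Demonstration of cleanup steps 1 and 2 in a throwaway sandbox; on a real
# snapshot host the base directory would be /mnt/data/xmldatadumps instead.
base=$(mktemp -d)
wiki=examplewiki      # hypothetical wiki name
rundate=20120104      # YYYYmmdd noted from the old run directory

# Simulate the leftovers of a broken run
mkdir -p "$base/public/$wiki/$rundate" "$base/private/$wiki"
touch "$base/private/$wiki/lock"

# Step 1: remove the broken run directory (note the date first!)
rm -rf "$base/public/$wiki/$rundate"

# Step 2: remove the stale lock file -- only safe when no newer dump
# for this wiki is in progress
rm -f "$base/private/$wiki/lock"
```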

NOTE THAT if a new dump is already running for the project by the time you discover the broken dump, you can't do this. You'll need to either wait for the new dump to complete or run the old one one step at a time (see below).

Rerunning one piece of a dump

  1. As above, you'll need to determine the date, which configuration file you need, and which host to run from.
  2. You don't need to do anything about lockfiles.
  3. You don't need to clean up any old files.
  4. Determine which job (which step) needs to be re-run. Presumably the failed step has been recorded on the web-viewable page for the particular dump (http://dumps.wikimedia.org/wikiname-here/YYYYmmdd/), in which case it should be marked as status:failed in the dumpruninfo.txt file in the run directory. Use the job name from that file, and remember that the jobs are listed in reverse order of execution. If you learned of the failure from a user report or aren't sure which job it is, see Dumps/Phases of a dump run to figure out the right job(s).
  5. If there's already a root screen session on the host, use it; otherwise start a new one. Open a window, then:
    • su - backup
    • bash
    • cd /backups-atg
    • python ./worker.py --job job-name-you-found --date YYYYmmdd --configfile wikidump.conf.XXX --log name-of-wiki
    The date in the above is the date of the run you are fixing, i.e. the name of its run directory.
    Example: to rerun the generation of the bzip2 pages meta history file for the enwiki dumps for January 2012 you would run
    python ./worker.py --job metahistorybz2dump --date 20120104 --configfile wikidump.conf.enwiki --log enwiki

Do this for each step that needs to be rerun for a given wiki, waiting for each step to complete before doing the next one.

NOTE: don't run multiple steps for a given wiki at the same time. You'll probably get garbage results.
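One way to spot the failed job(s) from the shell is to grep dumpruninfo.txt. The sample file below is made up for illustration, assuming entries of the form "name:jobname; status:...; updated:timestamp":

```shell
# Create a hypothetical dumpruninfo.txt; the real file lives in the
# run directory (/mnt/data/xmldatadumps/public/name-of-wiki/YYYYmmdd/).
cat > dumpruninfo.txt <<'EOF'
name:metahistorybz2dump; status:failed; updated:2012-01-10 04:22:13
name:metacurrentdump; status:done; updated:2012-01-08 11:02:45
name:xmlstubsdump; status:done; updated:2012-01-05 09:14:02
EOF

# Print the names of all failed jobs
grep 'status:failed' dumpruninfo.txt | sed 's/^name:\([^;]*\);.*/\1/'
```

With the sample data above this prints metahistorybz2dump, which is the job name you would pass to worker.py with --job.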

Rerunning a dump from a given step onwards

As above, you'll need to determine the date of the dump, the configuration file needed, and the host the job should run on.

  1. You need to remove any existing lockfile, as described above in the section on rerunning an entire dump.
  2. You don't need to clean up any old files.
  3. Determine which job (which step) is the first one that needs to be re-run. Presumably the point of failure has been recorded on the web-viewable page for the particular dump (http://dumps.wikimedia.org/wikiname-here/YYYYmmdd/), in which case it should be marked as status:failed in the dumpruninfo.txt file in the run directory. Use the job name from that file, and remember that the jobs are listed in reverse order of execution. If you learned of the failure from a user report or aren't sure which job it is, see Dumps/Phases of a dump run to figure out the right job(s).
  4. If there's already a root screen session on the host, use it; otherwise start a new one. Open a window, then:
    • su - backup
    • bash
    • cd /backups-atg
    • python ./worker.py --job job-name-you-found --restartfrom --date YYYYmmdd --configfile wikidump.conf.XXX --log name-of-wiki
    The date in the above is the date of the run you are fixing, i.e. the name of its run directory.
    Example: to rerun the en wiki dump for January 2012 from the generation of the bzip2 pages meta history file on, you would run
    python ./worker.py --job metahistorybz2dump --restartfrom --date 20120104 --configfile wikidump.conf.enwiki --log enwiki

NOTE THAT if a new dump is already running for the project by the time you discover the broken dump, you can't do this. You'll need to either wait for the new dump to complete or run the old one one step at a time (see above).

Generating new dumps

When new wikis are enabled on the site, they are added to all.dblist which is checked by the dump scripts. They get dumped as soon as a worker completes a run already in progress, so you don't have to do anything special for them.
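If you want to confirm that a newly enabled wiki will be picked up, it is enough to check that it appears in all.dblist. A minimal sketch, using a stand-in list file in place of the real one on the cluster:

```shell
# Stand-in for the real all.dblist (one database name per line)
printf '%s\n' aawiki enwiki ruwiki > all.dblist

wiki=enwiki    # hypothetical wiki to check
if grep -qx "$wiki" all.dblist; then
    echo "$wiki will be picked up automatically"
else
    echo "$wiki is not in all.dblist"
fi
```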

Running a (new) specific dump by hand

Once in a while we get a request for a dump of a wiki out of sequence, for example so that it can be archived before the wiki is shut down and removed.

  1. Determine which configuration file you need to use and which host the dump should run from, as above in the section on rerunning an entire dump.
  2. If there's already a root screen session on that host, use it; otherwise start a new one. Open a window, then:
    • su - backup
    • bash
    • cd /backups-atg
    • python ./worker.py --configfile wikidump.conf.XXX --log name-of-wiki
    Example: to run the enwiki dumps you would type
    python ./worker.py --configfile wikidump.conf.enwiki --log enwiki


Other stuff

(This needs cleanup.)

If you have the stub file and want to generate the full XML file with text revisions from the stub file, be the user backup on one of the snapshot hosts, and run just that job for that date:

  • cd /backups-atg
  • python ./worker.py --job metahistorybz2dump --date YYYYmmdd [--configfile configfilenamehere] name-of-wiki-here

Example:

  • python ./worker.py --job metahistorybz2dump --date 20100904 --configfile wikidump.conf.bigwikis ruwiki

Ordinarily this will run with prefetch. If you want no prefetch (i.e. you want to get every text revision directly from the database instead of looking at the files from the previous run to get what you can from there), you can give the additional option --noprefetch.

Note that the name of the wiki must always be last on the command line, and it is the name as seen in /home/wikipedia/common/all.dblist .

The old way to run jobs (still ok but you probably won't need it) is described below:

  • cd /backups-atg
  • /usr/bin/php -q /apache/common/php/maintenance/dumpTextPass.php --wiki=name-of-wiki --stub=gzip:/mnt/dumps/public/name-of-wiki/timestamp/name-of-wiki-timestamp-stub-meta-history.xml.gz --prefetch=bzip2:/mnt/dumps/public/name-of-wiki/timestamp/name-of-wiki-timestamp-pages-meta-history.xml.bz2 --force-normal --report=1000 --server=10.0.0.234 --spawn=/usr/bin/php --output=bzip2:/mnt/dumps/public/name-of-wiki/timestamp/name-of-wiki-timestamp-pages-meta-history.xml.bz2 --full

Example:

  • /usr/bin/php -q /apache/common/php/maintenance/dumpTextPass.php --wiki=ruwiki --stub=gzip:/mnt/dumps/public/ruwiki/20100531/ruwiki-20100531-stub-meta-history.xml.gz --prefetch=bzip2:/mnt/dumps/public/ruwiki/20100331/ruwiki-20100331-pages-meta-history.xml.bz2 --force-normal --report=1000 --server=10.0.0.234 --spawn=/usr/bin/php --output=bzip2:/mnt/dumps/public/ruwiki/20100531/ruwiki-20100531-pages-meta-history.xml.bz2 --full

If you want to run against a different type of stub file and produce the text XML output for it, adjust the names appropriately for the stub, prefetch and output files, and change the --full option accordingly (to e.g. --current).
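As a sketch of that adjustment (the wiki name, timestamps, and base path below are just examples), the stub, prefetch, and output names for a current-only run could be derived like this:

```shell
# Derive dumpTextPass.php file arguments for a "current" run instead of
# a full-history run; wiki, dates, and base path are hypothetical.
wiki=ruwiki
ts=20100531      # this run's timestamp
prev=20100331    # previous run's timestamp (used for prefetch)
base=/mnt/dumps/public/$wiki

stub=gzip:$base/$ts/$wiki-$ts-stub-meta-current.xml.gz
prefetch=bzip2:$base/$prev/$wiki-$prev-pages-meta-current.xml.bz2
output=bzip2:$base/$ts/$wiki-$ts-pages-meta-current.xml.bz2

echo "$stub"
echo "$prefetch"
echo "$output"
# ...and pass --current instead of --full to dumpTextPass.php
```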

If you want to run without spawning the fetchText.php process for text revision retrievals, leave off the --spawn=/usr/bin/php option.

If you don't have the stub, you can generate it as in the following example:

/usr/bin/php -q /var/www/html/elwvtrunk/maintenance/dumpBackup.php --wiki=elwvtrunk --full --stub --report=10000 --force-normal --output=gzip:/home/ariel/src/mediawiki/testing/dumps/public/elwvtrunk/20120121/elwvtrunk-20120121-stub-meta-history.xml.gz --output=gzip:/home/ariel/src/mediawiki/testing/dumps/public/elwvtrunk/20120121/elwvtrunk-20120121-stub-meta-current.xml.gz --filter=latest --output=gzip:/home/ariel/src/mediawiki/testing/dumps/public/elwvtrunk/20120121/elwvtrunk-20120121-stub-articles.xml.gz --filter=latest --filter=notalk --filter=namespace:!NS_USER

This generates all three stubs at once.


