Dumps/Rerunning a job

Revision as of 13:51, 20 January 2012

(This needs cleanup.)

Fixing a broken dump

Back in the day, the dumps were meant to be generated in an endless loop with no human intervention; if a dump broke, you waited another 10 days or so until your project's turn came around again and a new one was produced.

These days folks want the data Right Now, and some dumps take a good long time to run (*cough*en wp*cough*). If you see a broken or failed dump, this is how you can fix it up.

Rerunning a complete dump

If most of the steps failed, or the script failed or died early on, you might as well rerun the entire thing.

  1. Clean up the old directory. From any snapshot host you can get to the top-level directory at /mnt/data/xmldatadumps/public/name-of-wiki and find the subdirectory you want. These will all have names of the form YYYYmmdd, indicating the date the run was started. Make a note of this date; you'll need it to rerun the dump.
  2. Remove any lock file that may be left lying around. The lock file will be named /mnt/data/xmldatadumps/private/name-of-wiki/lock, but ONLY REMOVE IT if there is no newer dump (for a more current date) running for this project. If a new dump is already in progress for the project by the time you discover the broken one, you can't use this approach; you'll need to either wait for the new dump to complete or rerun the old one one step at a time (see the section below).
  3. Determine which configuration file you need to use:
    • wikidump.conf.enwiki for en wikipedia
    • wikidump.conf.bigwikis for the wikis listed in /backups-atg/bigwikis.dblist
    • wikidump.conf for the rest
  4. Determine which host you should run from (see Dumps/Snapshot hosts for which host runs which wikis).
  5. If there's already a root screen session on that host, use it; otherwise start a new one. Open a window, then:
    • su - backup
    • bash
    • cd /backups-atg
    • python ./worker.py --date YYYYmmdd --configfile wikidump.conf.XXX --log name-of-wiki
    The date in the above will be the date in the directory name you removed.
    Example: to rerun the enwiki dumps for January 2012 you would run
    python ./worker.py --date 20120104 --configfile wikidump.conf.enwiki --log enwiki
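Taken together, the steps above amount to the short sequence sketched below. This is a dry-run helper that only prints the commands so you can review them before pasting into the screen session; the wiki, date, and config file are example values.

```shell
# Dry-run sketch of a complete rerun: prints the commands rather than
# executing them, so nothing gets deleted by accident. Values below are
# examples; substitute your own wiki, date, and config file.
wiki="enwiki"
date="20120104"                  # date of the run directory you removed
conf="wikidump.conf.enwiki"

echo "rm -rf /mnt/data/xmldatadumps/public/$wiki/$date"
echo "rm /mnt/data/xmldatadumps/private/$wiki/lock   # ONLY if no newer run is going"
echo "cd /backups-atg"
echo "python ./worker.py --date $date --configfile $conf --log $wiki"
```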

NOTE THAT if a new dump is already running for the project by the time you discover the broken dump, you can't do this. You'll need to either wait for the new dump to complete or run the old one one step at a time (see below).

Rerunning one piece of a dump

  1. As above, you'll need to determine the date, which configuration file you need, and which host to run from.
  2. You don't need to do anything about lockfiles.
  3. You don't need to clean up any old files.
  4. Determine which job (which step) needs to be re-run. The failed step should be recorded on the web-viewable page for the particular dump (http://dumps.wikimedia.org/wikiname-here/YYYYmmdd/), in which case it will be marked as status:failed in the dumpruninfo.txt file in the run directory. Use the job name from that file. If a user reported the failure or you aren't sure which job it is, see Dumps/Phases of a dump run to figure out the right job(s).
  5. If there's already a root screen session on the host, use it; otherwise start a new one. Open a window, then:
    • su - backup
    • bash
    • cd /backups-atg
    • python ./worker.py --job job-name-you-found --date YYYYmmdd --configfile wikidump.conf.XXX name-of-wiki
    The date in the above is the date of the run you are fixing (the YYYYmmdd directory name).
    Example: to rerun the generation of the bzip2 pages meta history file for the enwiki dumps for January 2012 you would run
    python ./worker.py --job metahistorybz2dump --date 20120104 --configfile wikidump.conf.enwiki enwiki
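The check of dumpruninfo.txt in step 4 can be scripted. The name:/status: line format in the sample below is an assumption about the file's layout; inspect a real dumpruninfo.txt from the run directory first.

```shell
# List the jobs marked failed in a run's dumpruninfo.txt. The sample
# file here is made up for illustration; the real one lives in the run
# directory, and its exact line format should be double-checked.
cat > dumpruninfo.txt <<'EOF'
name:xmlstubsdump; status:done; updated:2012-01-20 10:00:00
name:metahistorybz2dump; status:failed; updated:2012-01-20 13:51:00
EOF

failed=$(grep 'status:failed' dumpruninfo.txt | sed 's/^name:\([^;]*\);.*/\1/')
echo "$failed"   # metahistorybz2dump
```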

Do this for each step that needs to be rerun for a given wiki, waiting for each step to complete before doing the next one.

NOTE: don't run multiple steps for a given wiki at the same time. You'll probably get garbage results.
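A minimal sketch of running several jobs strictly one after another. Here run_job is a hypothetical stand-in that just prints the worker.py invocation (on a snapshot host you would call worker.py directly); since worker.py blocks until the job finishes, a plain loop already serializes the steps, and the loop stops at the first failure.

```shell
# run_job is a stand-in that prints the worker.py command; replace the
# echo with the real invocation on a snapshot host. Job names, date,
# and config file below are examples.
run_job () {
    echo "python ./worker.py --job $1 --date $2 --configfile $3 $4"
}

for job in xmlstubsdump metahistorybz2dump; do
    # One job at a time, in order; stop as soon as one fails.
    run_job "$job" 20120104 wikidump.conf.enwiki enwiki || break
done
```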

Rerunning a dump from a given step onwards

Other stuff (needs cleanup)

If you need to run dumps for one wiki, be the user backup on one of the snapshot hosts and then do

  • cd /backups
  • python ./worker.py name-of-wiki-here

For example,

  • python ./worker.py enwiki

If you have the stub file and want to generate the full XML file with text revisions from the stub file, be the user backup on one of the snapshot hosts, and run just that job for that date:

  • cd /backups
  • python ./worker.py --job metahistorybz2dump --date YYYYmmdd [--configfile configfilenamehere] name-of-wiki-here

Example:

  • python ./worker.py --job metahistorybz2dump --date 20100904 --configfile wikidump.conf.bigwikis ruwiki

For other types of page dumps, change the job name. If you give the command with --job help it will produce a list of the various jobs; they should be self-explanatory.

Ordinarily this will run with prefetch. If you want no prefetch (i.e. you want to get every text revision directly from the database instead of pulling what you can from the files of the previous run), you can give the additional option --noprefetch.

Note that the name of the wiki must always come last on the command line, and it is the name as seen in /home/wikipedia/common/all.dblist.
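Since a wrong wiki name is an easy mistake, a quick sanity check against the dblist before running is cheap. The sample list below is made up for illustration; on a snapshot host you would grep /home/wikipedia/common/all.dblist itself.

```shell
# Check that the wiki name appears in the dblist before running
# worker.py. A tiny sample list is used here; the real file is
# /home/wikipedia/common/all.dblist on the hosts.
cat > all.dblist <<'EOF'
dewiki
enwiki
ruwiki
EOF

if grep -qx "ruwiki" all.dblist; then
    echo "ok"
else
    echo "unknown wiki" >&2
fi
```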


The old way to run jobs (still ok but you probably won't need it) is described below:

  • cd /backups
  • /usr/bin/php -q /apache/common/php-1.5/maintenance/dumpTextPass.php --wiki=name-of-wiki --stub=gzip:/mnt/dumps/public/name-of-wiki/timestamp/name-of-wiki-timestamp-stub-meta-history.xml.gz --prefetch=bzip2:/mnt/dumps/public/name-of-wiki/timestamp/name-of-wiki-timestamp-pages-meta-history.xml.bz2 --force-normal --report=1000 --server=10.0.0.234 --spawn=/usr/bin/php --output=bzip2:/mnt/dumps/public/name-of-wiki/timestamp/name-of-wiki-timestamp-pages-meta-history.xml.bz2 --full

Example:

  • /usr/bin/php -q /apache/common/php-1.5/maintenance/dumpTextPass.php --wiki=ruwiki --stub=gzip:/mnt/dumps/public/ruwiki/20100531/ruwiki-20100531-stub-meta-history.xml.gz --prefetch=bzip2:/mnt/dumps/public/ruwiki/20100331/ruwiki-20100331-pages-meta-history.xml.bz2 --force-normal --report=1000 --server=10.0.0.234 --spawn=/usr/bin/php --output=bzip2:/mnt/dumps/public/ruwiki/20100531/ruwiki-20100531-pages-meta-history.xml.bz2 --full

If you want to run against a different type of stub file and produce the text XML output for it, adjust the names appropriately for the stub, prefetch and output files, and change the --full option accordingly (to e.g. --current).
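As a sketch of that substitution, the command for a current-only run can be assembled from the same path layout as the --full example above. The wiki name and dates here are illustrative, and you should verify the exact stub and prefetch file names against the run directory before using them.

```shell
# Assemble (and print for review) the dumpTextPass.php command for a
# current-only run: same layout as the --full example above, with the
# "current" file names and --current substituted. Wiki and dates are
# illustrative.
wiki=ruwiki
ts=20100531        # this run's date
prev=20100331      # previous run's date, used for prefetch
base="/mnt/dumps/public/$wiki"

cmd="/usr/bin/php -q /apache/common/php-1.5/maintenance/dumpTextPass.php \
--wiki=$wiki \
--stub=gzip:$base/$ts/$wiki-$ts-stub-meta-current.xml.gz \
--prefetch=bzip2:$base/$prev/$wiki-$prev-pages-meta-current.xml.bz2 \
--report=1000 --spawn=/usr/bin/php \
--output=bzip2:$base/$ts/$wiki-$ts-pages-meta-current.xml.bz2 \
--current"
echo "$cmd"
```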

If you want to run without spawning the fetchText.php process for text revision retrievals, leave off the --spawn=/usr/bin/php option.
