Dumps/Rerunning a job

From Wikitech

Revision as of 09:36, 20 January 2012

(This needs cleanup.)

Fixing a broken dump

Back in the day, the dumps were meant to be generated in an endless loop with no human intervention; if a dump broke, you waited another 10 days or so until your project's turn came around again and then there was a new one.

These days folks want the data Right Now, and some dumps take a good long time to run (*cough*en wp*cough*). If you see a broken or failed dump, this is how you can fix it up.

Rerunning a complete dump

If most of the steps failed, or the script failed or died early on, you might as well rerun the entire thing.

  1. Clean up the old directory. From any snapshot host, go to the top-level directory at /mnt/data/xmldatadumps/public/name-of-wiki and find the subdirectory you want. These all have names of the form YYYYmmdd, indicating the date the run was started. Make a note of this date, since you need it to rerun the dump, then remove the subdirectory.
  2. Remove any lock file that may be left lying around. The lock file will be named /mnt/data/xmldatadumps/private/name-of-wiki/lock but ONLY REMOVE IT if there is not a new dump (for a more current date) running for this project. If there is a new dump already in progress for the project by the time you discover the broken dump, you can't use this approach. You'll need to either wait for the new dump to complete or run the old one one step at a time (see section below).
  3. Determine which configuration file you need to use:
    • wikidump.conf.enwiki for en wikipedia
    • wikidump.conf.bigwikis for the wikis listed in /backups-atg/bigwikis.dblist
    • wikidump.conf for the rest
  4. Determine which host you should run from (see Dumps/Snapshot hosts for which host runs which wikis).
  5. If there's already a root screen session on that host, use it. Open a window and run:
    • su - backup
    • bash
    • cd /backups-atg
    • python ./worker.py --date YYYYmmdd --configfile wikidump.conf.XXX name-of-wiki
    The date in the above will be the date in the directory name you removed.
    Example: to rerun the enwiki dumps for January 2012 you would run
    python ./worker.py --date 20120104 --configfile wikidump.conf.enwiki enwiki
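
Steps 3 and 5 above can be sketched as a small Python helper that picks the config file and assembles the worker.py command line. This is only an illustration: the function names `choose_config` and `build_rerun_command` are made up for this page, and the list of big wikis is passed in by the caller (reading it from /backups-atg/bigwikis.dblist is left out, since the file layout is not described here).

```python
from datetime import datetime

# Config file names as listed in step 3.
ENWIKI_CONF = "wikidump.conf.enwiki"
BIGWIKIS_CONF = "wikidump.conf.bigwikis"
DEFAULT_CONF = "wikidump.conf"


def choose_config(wiki, bigwikis):
    """Return the config file name for a wiki (hypothetical helper).

    `bigwikis` is a collection of wiki names supplied by the caller.
    """
    if wiki == "enwiki":
        return ENWIKI_CONF
    if wiki in bigwikis:
        return BIGWIKIS_CONF
    return DEFAULT_CONF


def build_rerun_command(wiki, date, bigwikis=()):
    """Build the argument list for rerunning a complete dump."""
    # Insist on exactly eight digits, then check it is a real calendar date.
    if len(date) != 8 or not date.isdigit():
        raise ValueError("date must look like YYYYmmdd: %r" % date)
    datetime.strptime(date, "%Y%m%d")
    return [
        "python", "./worker.py",
        "--date", date,
        "--configfile", choose_config(wiki, bigwikis),
        wiki,  # the wiki name goes last
    ]


print(" ".join(build_rerun_command("enwiki", "20120104")))
```

The helper only builds the command; you would still run it as the backup user from /backups-atg on the right snapshot host.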

NOTE THAT if a new dump is already running for the project by the time you discover the broken dump, you can't do this. You'll need to either wait for the new dump to complete or run the old one one step at a time (see below).
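
Before touching the lock file, it can help to check whether a run directory with a later date already exists for the wiki. The sketch below is one way to do that in Python; `newer_runs` is a hypothetical helper, and a newer dated directory is only a hint that a new dump may be in progress, not proof either way, so still look at what is actually running.

```python
import re

# Run directories are named YYYYmmdd; anything else is ignored.
RUN_DIR = re.compile(r"\d{8}")


def newer_runs(dir_names, broken_date):
    """Return run directory names dated after broken_date, oldest first.

    Names of the form YYYYmmdd sort correctly as plain strings.
    """
    return sorted(
        name for name in dir_names
        if RUN_DIR.fullmatch(name) and name > broken_date
    )


# Made-up directory listing for illustration.
names = ["20111201", "20120104", "not-a-run"]
print(newer_runs(names, "20120104"))  # prints []
```

In practice `dir_names` could come from `os.listdir()` on /mnt/data/xmldatadumps/public/name-of-wiki. An empty result means no later run directory was found.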

Rerunning one piece of a dump

Rerunning a dump from a given step onwards

Other stuff (needs cleanup)

If you need to run dumps for one wiki, be the user backup on one of the snapshot hosts and then do

  • cd /backups
  • python ./worker.py name-of-wiki-here

For example,

  • python ./worker.py enwiki

If you have the stub file and want to generate the full XML file with text revisions from the stub file, be the user backup on one of the snapshot hosts, and run just that job for that date:

  • cd /backups
  • python ./worker.py --job metahistorybz2dump --date YYYYmmdd [--configfile configfilenamehere] name-of-wiki-here

Example:

  • python ./worker.py --job metahistorybz2dump --date 20100904 --configfile wikidump.conf.bigwikis ruwiki

For other types of page dumps, change the job name. If you give the command with --job help it will produce a list of the various jobs; they should be self-explanatory.

Ordinarily this will run with prefetch. If you want no prefetch (i.e. you want to get every text revision directly from the database instead of looking at the files from the previous run to get what you can from there), you can give the additional option --noprefetch.

Note that the name of the wiki must always be last on the command line, and it is the name as seen in /home/wikipedia/common/all.dblist .
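
The single-job invocation can be put together the same way. The following is a minimal sketch with a made-up function name, `build_job_command`; it uses only the options mentioned above and always puts the wiki name last. Where exactly --noprefetch sits among the other options is a guess, since the page only requires that the wiki name come last.

```python
def build_job_command(wiki, job, date, configfile=None, noprefetch=False):
    """Build the worker.py argument list for running one job of a dump."""
    cmd = ["python", "./worker.py", "--job", job, "--date", date]
    if configfile:
        # The config file is optional, as in the usage line above.
        cmd += ["--configfile", configfile]
    if noprefetch:
        # Fetch every text revision from the database instead of old files.
        cmd.append("--noprefetch")
    cmd.append(wiki)  # the wiki name must always be last
    return cmd


print(" ".join(build_job_command(
    "ruwiki", "metahistorybz2dump", "20100904",
    configfile="wikidump.conf.bigwikis",
)))
```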


The old way to run jobs (still ok but you probably won't need it) is described below:

  • cd /backups
  • /usr/bin/php -q /apache/common/php-1.5/maintenance/dumpTextPass.php --wiki=name-of-wiki --stub=gzip:/mnt/dumps/public/name-of-wiki/timestamp/name-of-wiki-timestamp-stub-meta-history.xml.gz --prefetch=bzip2:/mnt/dumps/public/name-of-wiki/timestamp/name-of-wiki-timestamp-pages-meta-history.xml.bz2 --force-normal --report=1000 --server=10.0.0.234 --spawn=/usr/bin/php --output=bzip2:/mnt/dumps/public/name-of-wiki/timestamp/name-of-wiki-timestamp-pages-meta-history.xml.bz2 --full

Example:

  • /usr/bin/php -q /apache/common/php-1.5/maintenance/dumpTextPass.php --wiki=ruwiki --stub=gzip:/mnt/dumps/public/ruwiki/20100531/ruwiki-20100531-stub-meta-history.xml.gz --prefetch=bzip2:/mnt/dumps/public/ruwiki/20100331/ruwiki-20100331-pages-meta-history.xml.bz2 --force-normal --report=1000 --server=10.0.0.234 --spawn=/usr/bin/php --output=bzip2:/mnt/dumps/public/ruwiki/20100531/ruwiki-20100531-pages-meta-history.xml.bz2 --full

If you want to run against a different type of stub file and produce the text XML output for it, adjust the names appropriately for the stub, prefetch and output files, and change the --full option accordingly (to e.g. --current).

If you want to run without spawning the fetchText.php process for text revision retrievals, leave off the --spawn=/usr/bin/php option.
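
The long dumpTextPass.php command is easy to mistype, so here is a rough Python sketch that assembles it from the wiki name and two timestamps. The names `dump_file` and `build_textpass_command` are invented for this page. Note that the ruwiki example above takes its prefetch file from an earlier run (20100331) than the stub and output files (20100531), so the sketch takes the prefetch timestamp as a separate argument. It covers only the full-history case shown above.

```python
PUBLIC_ROOT = "/mnt/dumps/public"
PHP = "/usr/bin/php"
SCRIPT = "/apache/common/php-1.5/maintenance/dumpTextPass.php"


def dump_file(wiki, timestamp, kind, ext):
    """Return the path of a dump file, following the naming shown above."""
    return "%s/%s/%s/%s-%s-%s.xml.%s" % (
        PUBLIC_ROOT, wiki, timestamp, wiki, timestamp, kind, ext)


def build_textpass_command(wiki, timestamp, prefetch_timestamp, spawn=True):
    """Build the dumpTextPass.php argument list for a full-history dump."""
    cmd = [
        PHP, "-q", SCRIPT,
        "--wiki=" + wiki,
        "--stub=gzip:" + dump_file(wiki, timestamp, "stub-meta-history", "gz"),
        "--prefetch=bzip2:" + dump_file(
            wiki, prefetch_timestamp, "pages-meta-history", "bz2"),
        "--force-normal", "--report=1000", "--server=10.0.0.234",
    ]
    if spawn:
        # Leave this off to run without spawning fetchText.php.
        cmd.append("--spawn=" + PHP)
    cmd.append("--output=bzip2:" + dump_file(
        wiki, timestamp, "pages-meta-history", "bz2"))
    cmd.append("--full")
    return cmd


print(" ".join(build_textpass_command("ruwiki", "20100531", "20100331")))
```

Called with those arguments, it should reproduce the ruwiki example above; pass `spawn=False` to drop the --spawn option.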
