Dumps/Rerunning a job
Fixing a broken dump
Back in the day, the dumps were meant to be generated in an endless loop with no human intervention; if a dump broke, you waited another 10 days or so until your project's turn came around again and there was a new one.
These days folks want the data Right Now, and some dumps take a good long time to run (*cough*en wp*cough*). If you see a broken or failed dump, this is how you can fix it up.
Rerunning a complete dump
If most of the steps failed, or the script failed or died early on, you might as well rerun the entire thing.
- Clean up the old directory. From any snapshot host, go to the top-level directory at /mnt/data/xmldatadumps/public/name-of-wiki and find the subdirectory you want. These all have names of the form YYYYmmdd, indicating the date the run was started. Make a note of this date; you will need it to rerun the dump.
- Remove any lock file that may be left lying around. The lock file will be named /mnt/data/xmldatadumps/private/name-of-wiki/lock, but ONLY REMOVE IT if there is not a newer dump (with a more current date) running for this project. If a new dump is already in progress for the project by the time you discover the broken one, you can't use this approach; you'll need to either wait for the new dump to complete or run the old one one step at a time (see the section below).
- Determine which configuration file you need to use:
  - wikidump.conf.enwiki for the English Wikipedia
  - wikidump.conf.bigwikis for the wikis listed in /backups-atg/bigwikis.dblist
  - wikidump.conf for the rest
- Determine which host you should run from (see Dumps/Snapshot hosts for which host runs which wikis).
- If there's already a root screen session on that host, use it; otherwise start a new one. Open a window, then:
- su - backup
- bash
- cd /backups-atg
- python ./worker.py --date YYYYmmdd --configfile wikidump.conf.XXX --log name-of-wiki
- The date in the above is the date from the directory name you noted earlier.
- Example: to rerun the enwiki dumps for January 2012 you would run
python ./worker.py --date 20120104 --configfile wikidump.conf.enwiki --log enwiki
NOTE THAT if a new dump is already running for the project by the time you discover the broken dump, you can't do this. You'll need to either wait for the new dump to complete or run the old one one step at a time (see below).
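The choice of configuration file and the worker.py invocation described above can be sketched as a small helper. The config-file names and the bigwikis.dblist path come from this page; the helper functions themselves are hypothetical and not part of worker.py.

```python
# Sketch: build the worker.py command line for a full rerun, following the
# steps above. pick_config() encodes the config-file rules from this page;
# the function names here are illustrative, not part of the dumps codebase.
from datetime import datetime

def pick_config(wiki, bigwikis):
    """Choose the configuration file per the rules above.

    bigwikis is the set of wiki names from /backups-atg/bigwikis.dblist.
    """
    if wiki == "enwiki":
        return "wikidump.conf.enwiki"
    if wiki in bigwikis:
        return "wikidump.conf.bigwikis"
    return "wikidump.conf"

def rerun_command(wiki, date, bigwikis):
    """Return the argv for rerunning a complete dump (run from /backups-atg)."""
    datetime.strptime(date, "%Y%m%d")  # sanity-check the YYYYmmdd date
    return ["python", "./worker.py",
            "--date", date,
            "--configfile", pick_config(wiki, bigwikis),
            "--log", wiki]
```

For instance, `rerun_command("enwiki", "20120104", bigwikis)` reproduces the January 2012 enwiki example above.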
Rerunning one piece of a dump
- As above, you'll need to determine the date, which configuration file you need, and which host to run from.
- You don't need to do anything about lockfiles.
- You don't need to clean up any old files.
- Determine which job (which step) needs to be re-run. Presumably the failed step has been recorded on the web-viewable page for the particular dump (http://dumps.wikimedia.org/wikiname-here/YYYYmmdd/) in which case it should be marked as status:failed in the dumpruninfo.txt file in the run directory. Use the job name from that file, and remember that they are listed in reverse order of execution. If you were told by a user or aren't sure which job is the one, see Dumps/Phases of a dump run to figure out the right job(s).
- If there's already a root screen session on the host, use it; otherwise start a new one. Open a window, then:
- su - backup
- bash
- cd /backups-atg
- python ./worker.py --job job-name-you-found --date YYYYmmdd --configfile wikidump.conf.XXX --log name-of-wiki
- The date in the above is the date of the run you are fixing (from the name of its run directory).
- Example: to rerun the generation of the bzip2 pages meta history file for the enwiki dumps for January 2012 you would run
python ./worker.py --job metahistorybz2dump --date 20120104 --configfile wikidump.conf.enwiki --log enwiki
Do this for each step that needs to be rerun for a given wiki, waiting for each step to complete before doing the next one.
NOTE: don't run multiple steps for a given wiki at the same time. You'll probably get garbage results.
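Finding the failed job(s) in dumpruninfo.txt can be sketched as below. The line format assumed here ("name:JOB; status:STATUS; updated:TIME") is an assumption; verify it against a real dumpruninfo.txt before relying on this.

```python
# Sketch: list the jobs marked status:failed in a run's dumpruninfo.txt.
# ASSUMPTION: each line has the form "name:JOB; status:STATUS; updated:TIME".

def failed_jobs(dumpruninfo_text):
    """Return failed job names, reversed into execution order
    (the file lists jobs in reverse order of execution)."""
    failed = []
    for line in dumpruninfo_text.splitlines():
        # Split "key:value" fields; split(":", 1) keeps colons in timestamps.
        fields = dict(
            part.strip().split(":", 1)
            for part in line.split(";") if ":" in part
        )
        if fields.get("status") == "failed":
            failed.append(fields["name"])
    return list(reversed(failed))
```

Each name this returns can be passed to worker.py via --job, one step at a time.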
Rerunning a dump from a given step onwards
As above, you'll need to determine the date of the dump, the configuration file needed, and the host the job should run on.
- You need to remove any existing lockfile, as described above in the section on rerunning an entire dump.
- You don't need to clean up any old files.
- Determine which job (which step) is the first one that needs to be re-run. Presumably the point of failure has been recorded on the web-viewable page for the particular dump (http://dumps.wikimedia.org/wikiname-here/YYYYmmdd/) in which case it should be marked as status:failed in the dumpruninfo.txt file in the run directory. Use the job name from that file, and remember that they are listed in reverse order of execution. If you were told by a user or aren't sure which job is the one, see Dumps/Phases of a dump run to figure out the right job(s).
- If there's already a root screen session on the host, use it; otherwise start a new one. Open a window, then:
- su - backup
- bash
- cd /backups-atg
- python ./worker.py --job job-name-you-found --restartfrom --date YYYYmmdd --configfile wikidump.conf.XXX --log name-of-wiki
- The date in the above is the date of the run you are fixing (from the name of its run directory).
- Example: to rerun the enwiki dump for January 2012 from the generation of the bzip2 pages meta history file onwards, you would run
python ./worker.py --job metahistorybz2dump --restartfrom --date 20120104 --configfile wikidump.conf.enwiki --log enwiki
NOTE THAT if a new dump is already running for the project by the time you discover the broken dump, you can't do this. You'll need to either wait for the new dump to complete or run the old one one step at a time (see above).
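Choosing the job to pass to --restartfrom can be sketched as below: since dumpruninfo.txt lists jobs in reverse order of execution, the last failed entry in the file is the earliest failure in execution order. The line format ("name:JOB; status:STATUS; ...") is an assumption; check a real file first.

```python
# Sketch: pick the job for --restartfrom, i.e. the earliest failed job in
# execution order. Because dumpruninfo.txt lists jobs in reverse order of
# execution, that is the *last* failed entry in the file.
# ASSUMPTION: lines look like "name:JOB; status:STATUS; updated:TIME".

def restartfrom_job(dumpruninfo_text):
    """Return the job name to use with --restartfrom, or None if none failed."""
    candidate = None
    for line in dumpruninfo_text.splitlines():
        fields = dict(
            part.strip().split(":", 1)
            for part in line.split(";") if ":" in part
        )
        if fields.get("status") == "failed":
            candidate = fields.get("name")  # keep the last match seen
    return candidate
```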
Other notes
If you need to run dumps for one wiki, be the user backup on one of the snapshot hosts and then do
- cd /backups
- python ./worker.py name-of-wiki-here
For example,
- python ./worker.py enwiki
If you have the stub file and want to generate the full XML file with text revisions from the stub file, be the user backup on one of the snapshot hosts, and run just that job for that date:
- cd /backups
- python ./worker.py --job metahistorybz2dump --date YYYYmmdd [--configfile configfilenamehere] name-of-wiki-here
Example:
- python ./worker.py --job metahistorybz2dump --date 20100904 --configfile wikidump.conf.bigwikis ruwiki
For other types of page dumps, change the job name. If you run the command with --job help, it will produce a list of the various jobs; they should be self-explanatory.
Ordinarily this will run with prefetch. If you want no prefetch (i.e. you want to get every text revision directly from the database instead of pulling what you can from the files of the previous run), you can give the additional option --noprefetch.
Note that the name of the wiki must always be last on the command line, and it is the name as seen in /home/wikipedia/common/all.dblist.
The old way to run jobs (still ok but you probably won't need it) is described below:
- cd /backups
- /usr/bin/php -q /apache/common/php-1.5/maintenance/dumpTextPass.php --wiki=name-of-wiki --stub=gzip:/mnt/dumps/public/name-of-wiki/timestamp/name-of-wiki-timestamp-stub-meta-history.xml.gz --prefetch=bzip2:/mnt/dumps/public/name-of-wiki/timestamp/name-of-wiki-timestamp-pages-meta-history.xml.bz2 --force-normal --report=1000 --server=10.0.0.234 --spawn=/usr/bin/php --output=bzip2:/mnt/dumps/public/name-of-wiki/timestamp/name-of-wiki-timestamp-pages-meta-history.xml.bz2 --full
Example:
- /usr/bin/php -q /apache/common/php-1.5/maintenance/dumpTextPass.php --wiki=ruwiki --stub=gzip:/mnt/dumps/public/ruwiki/20100531/ruwiki-20100531-stub-meta-history.xml.gz --prefetch=bzip2:/mnt/dumps/public/ruwiki/20100331/ruwiki-20100331-pages-meta-history.xml.bz2 --force-normal --report=1000 --server=10.0.0.234 --spawn=/usr/bin/php --output=bzip2:/mnt/dumps/public/ruwiki/20100531/ruwiki-20100531-pages-meta-history.xml.bz2 --full
If you want to run against a different type of stub file and produce the text XML output for it, adjust the names appropriately for the stub, prefetch and output files, and change the --full option accordingly (to e.g. --current).
If you want to run without spawning the fetchText.php process for text revision retrievals, leave off the --spawn=/usr/bin/php option.