Dumps/Rerunning a job
ArielGlenn (Talk | contribs) |
Revision as of 09:36, 20 January 2012
(This needs cleanup.)
Fixing a broken dump
Back in the day, the dumps were meant to be generated in an endless loop with no human intervention; if a dump broke, you waited another 10 days or so until your project's turn came around again and then there was a new one.
These days folks want the data Right Now, and some dumps take a good long time to run (*cough*en wp*cough*). If you see a broken or failed dump, this is how you can fix it up.
Rerunning a complete dump
If most of the steps failed, or the script failed or died early on, you might as well rerun the entire thing.
- Clean up the old directory. From any snapshot host, go to the toplevel directory at /mnt/data/xmldatadumps/public/name-of-wiki and find the subdirectory you want. These all have names of the form YYYYmmdd indicating the date the run was started. Make a note of this date; you need it to rerun the dump.
- Remove any lock file that may be left lying around. The lock file will be named /mnt/data/xmldatadumps/private/name-of-wiki/lock but ONLY REMOVE IT if there is not a new dump (for a more current date) running for this project. If there is a new dump already in progress for the project by the time you discover the broken dump, you can't use this approach. You'll need to either wait for the new dump to complete or run the old one, one step at a time (see section below).
- Determine which configuration file you need to use:
  - wikidump.conf.enwiki for en wikipedia
  - wikidump.conf.bigwikis for the wikis listed in /backups-atg/bigwikis.dblist
  - wikidump.conf for the rest
- Determine which host you should run from (see Dumps/Snapshot hosts for which host runs which wikis).
- If there's already a root screen session on that host, use it. Open a window and run:
  - su - backup
  - bash
  - cd /backups-atg
  - python ./worker.py --date YYYYmmdd --configfile wikidump.conf.XXX name-of-wiki
  The date in the above will be the date in the directory name you removed.
  Example: to rerun the enwiki dumps for January 2012 you would run
  python ./worker.py --date 20120104 --configfile wikidump.conf.enwiki enwiki
NOTE THAT if a new dump is already running for the project by the time you discover the broken dump, you can't do this. You'll need to either wait for the new dump to complete or run the old one, one step at a time (see below).
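The checks in the steps above can be sketched in a few lines of Python. This is only an illustration, not part of the dump scripts: it assumes Python 3, the helper names (run_dates, lock_path, choose_config, rerun_command) are made up for this page, and the list of big wikis is passed in as a set rather than read from bigwikis.dblist. It does not remove anything and does not decide whether a newer dump is running; you still have to check that yourself before touching the lock file.

```python
import os
import re

# Directory layout as described in the steps above.
PUBLIC_ROOT = "/mnt/data/xmldatadumps/public"
PRIVATE_ROOT = "/mnt/data/xmldatadumps/private"


def run_dates(wiki, public_root=PUBLIC_ROOT):
    """Return the YYYYmmdd run directories for a wiki, oldest first."""
    wiki_dir = os.path.join(public_root, wiki)
    return sorted(
        name
        for name in os.listdir(wiki_dir)
        if re.fullmatch(r"\d{8}", name)
        and os.path.isdir(os.path.join(wiki_dir, name))
    )


def lock_path(wiki, private_root=PRIVATE_ROOT):
    """Path of the lock file for a wiki (hypothetical helper)."""
    return f"{private_root}/{wiki}/lock"


def choose_config(wiki, bigwikis):
    """Pick the config file name; bigwikis is a set of wiki names
    that the caller took from bigwikis.dblist (assumption)."""
    if wiki == "enwiki":
        return "wikidump.conf.enwiki"
    if wiki in bigwikis:
        return "wikidump.conf.bigwikis"
    return "wikidump.conf"


def rerun_command(wiki, date, bigwikis):
    """Build the worker.py command line for a full rerun."""
    return [
        "python", "./worker.py",
        "--date", date,
        "--configfile", choose_config(wiki, bigwikis),
        wiki,  # the wiki name goes last
    ]
```

For example, rerun_command("enwiki", "20120104", set()) gives the same arguments as the enwiki example above, and os.path.exists(lock_path("enwiki")) tells you whether a lock file is lying around.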
Rerunning one piece of a dump
Rerunning a dump from a given step onwards
Other stuff (needs cleanup)
If you need to run dumps for one wiki, be the user backup on one of the snapshot hosts and then do
- cd /backups
- python ./worker.py name-of-wiki-here
For example,
- python ./worker.py enwiki
If you have the stub file and want to generate the full XML file with text revisions from the stub file, be the user backup on one of the snapshot hosts, and run just that job for that date:
- cd /backups
- python ./worker.py --job metahistorybz2dump --date YYYYmmdd [--configfile configfilenamehere] name-of-wiki-here
Example:
- python ./worker.py --job metahistorybz2dump --date 20100904 --configfile wikidump.conf.bigwikis ruwiki
For other types of page dumps, change the job name. If you give the command with --job help it will produce a list of the various jobs; they should be self-explanatory.
Ordinarily this will run with prefetch. If you want no prefetch (i.e. you want to get every text revision directly from the database instead of looking at the files from the previous run to get what you can from there), you can give the additional option --noprefetch.
Note that the name of the wiki must always be last on the command line, and it is the name as seen in /home/wikipedia/common/all.dblist.
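If you script these invocations, the easy mistake is putting the wiki name somewhere other than last. A minimal sketch of a command builder, assuming Python 3; worker_args is a hypothetical helper name, and it uses only the options mentioned on this page (--job, --date, --configfile, --noprefetch):

```python
def worker_args(wiki, job=None, date=None, configfile=None, noprefetch=False):
    """Build a worker.py command line with the wiki name last."""
    args = ["python", "./worker.py"]
    if job:
        args += ["--job", job]
    if date:
        args += ["--date", date]
    if configfile:
        args += ["--configfile", configfile]
    if noprefetch:
        args.append("--noprefetch")
    args.append(wiki)  # must always be the final argument
    return args
```

Calling worker_args("ruwiki", job="metahistorybz2dump", date="20100904", configfile="wikidump.conf.bigwikis") reproduces the ruwiki example above; the resulting list can be handed to subprocess.call from the /backups directory as the backup user.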
The old way to run jobs (still ok but you probably won't need it) is described below:
- cd /backups
- /usr/bin/php -q /apache/common/php-1.5/maintenance/dumpTextPass.php --wiki=name-of-wiki --stub=gzip:/mnt/dumps/public/name-of-wiki/timestamp/name-of-wiki-timestamp-stub-meta-history.xml.gz --prefetch=bzip2:/mnt/dumps/public/name-of-wiki/timestamp/name-of-wiki-timestamp-pages-meta-history.xml.bz2 --force-normal --report=1000 --server=10.0.0.234 --spawn=/usr/bin/php --output=bzip2:/mnt/dumps/public/name-of-wiki/timestamp/name-of-wiki-timestamp-pages-meta-history.xml.bz2 --full
Example:
- /usr/bin/php -q /apache/common/php-1.5/maintenance/dumpTextPass.php --wiki=ruwiki --stub=gzip:/mnt/dumps/public/ruwiki/20100531/ruwiki-20100531-stub-meta-history.xml.gz --prefetch=bzip2:/mnt/dumps/public/ruwiki/20100331/ruwiki-20100331-pages-meta-history.xml.bz2 --force-normal --report=1000 --server=10.0.0.234 --spawn=/usr/bin/php --output=bzip2:/mnt/dumps/public/ruwiki/20100531/ruwiki-20100531-pages-meta-history.xml.bz2 --full
If you want to run against a different type of stub file and produce the text XML output for it, adjust the names appropriately for the stub, prefetch and output files, and change the --full option accordingly (to e.g. --current).
If you want to run without spawning the fetchText.php process for text revision retrievals, leave off the --spawn=/usr/bin/php option.
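The three file arguments in the old-style command follow one naming pattern, so they can be derived from the wiki name and the run dates. The sketch below assumes Python 3 and the meta-history file names shown in the example above; text_pass_files is a hypothetical helper, and for other stub types you would have to change the suffixes yourself.

```python
def text_pass_files(wiki, date, prefetch_date, root="/mnt/dumps/public"):
    """Return the --stub, --prefetch and --output values for a full
    history run, following the naming pattern in the example above.

    prefetch_date is the date of the earlier run whose output is
    reused for prefetch.
    """
    def path(run_date, suffix):
        return f"{root}/{wiki}/{run_date}/{wiki}-{run_date}-{suffix}"

    return {
        "stub": "gzip:" + path(date, "stub-meta-history.xml.gz"),
        "prefetch": "bzip2:" + path(prefetch_date, "pages-meta-history.xml.bz2"),
        "output": "bzip2:" + path(date, "pages-meta-history.xml.bz2"),
    }
```

With text_pass_files("ruwiki", "20100531", "20100331") you get the same three values as in the ruwiki example.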