Dumps/Rerunning a job
Fixing a broken dump
Back in the day, the dumps were meant to be generated in an endless loop with no human intervention; if a dump broke, you waited another 10 days or so until your project's turn came around again, and then there was a new one.
These days folks want the data Right Now, and some dumps take a good long time to run (*cough*en wp*cough*). If you see a broken or failed dump, this is how you can fix it up.
Rerunning a complete dump
If most of the steps failed, or the script failed or died early on, you might as well rerun the entire thing.
- Clean up the old directory. From any snapshot host you can get to the top-level directory at /mnt/data/xmldatadumps/public/name-of-wiki; find the subdirectory you want and remove it. These subdirectories all have names of the form YYYYmmdd, indicating the date the run was started. Make a note of this date; you'll need it to rerun the dump.
- Remove any lock file that may be left lying around. The lock file will be named /mnt/data/xmldatadumps/private/name-of-wiki/lock, but ONLY REMOVE IT if there is not a newer dump (for a more current date) running for this project. If a new dump is already in progress for the project by the time you discover the broken one, you can't use this approach; you'll need to either wait for the new dump to complete or run the old one one step at a time (see the section below).
- Determine which configuration file you need to use:
  - wikidump.conf.enwiki for en wikipedia
  - wikidump.conf.bigwikis for the wikis listed in /backups-atg/bigwikis.dblist
  - wikidump.conf for the rest
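If you're not sure whether a given wiki counts as one of the big wikis, you can check the dblist directly. A minimal sketch, assuming the dblist holds one database name per line:
# Check whether a wiki is listed in bigwikis.dblist (one dbname per line assumed).
grep -x 'name-of-wiki' /backups-atg/bigwikis.dblist && echo "use wikidump.conf.bigwikis"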
- Determine which host you should run from (see Dumps/Snapshot hosts for which host runs which wikis).
- If there's already a root screen session on that host, use it; otherwise start a new one. Open a window and run:
- su - backup
- bash
- cd /backups-atg
- python ./worker.py --date YYYYmmdd --configfile wikidump.conf.XXX --log name-of-wiki
- The date in the above will be the date in the directory name you removed.
- Example: to rerun the enwiki dumps for January 2012 you would run
python ./worker.py --date 20120104 --configfile wikidump.conf.enwiki --log enwiki
NOTE THAT if a new dump is already running for the project by the time you discover the broken dump, you can't do this. You'll need to either wait for the new dump to complete or run the old one one step at a time (see below).
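Putting the steps together, here is a minimal sketch of a full rerun as typed interactively (not as a script), assuming you've already identified the wiki, the date, and the config file, and confirmed that no newer run is in progress:
# As root in a screen window on the right snapshot host:
rm -rf /mnt/data/xmldatadumps/public/name-of-wiki/YYYYmmdd    # the broken run directory
rm -f /mnt/data/xmldatadumps/private/name-of-wiki/lock        # ONLY if no newer run is going
su - backup
bash
cd /backups-atg
python ./worker.py --date YYYYmmdd --configfile wikidump.conf.XXX --log name-of-wiki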
Rerunning one piece of a dump
- As above, you'll need to determine the date, which configuration file you need, and which host to run from.
- You don't need to do anything about lockfiles.
- You don't need to clean up any old files.
- Determine which job (which step) needs to be re-run. Presumably the failed step has been recorded on the web-viewable page for the particular dump (http://dumps.wikimedia.org/wikiname-here/YYYYmmdd/), in which case it should be marked as status:failed in the dumpruninfo.txt file in the run directory (a quick way to check is sketched at the end of this section). Use the job name from that file, and remember that the jobs are listed in reverse order of execution. If you were told about the failure by a user, or you aren't sure which job it is, see Dumps/Phases of a dump run to figure out the right job(s).
- If there's already a root screen session on the host, use it; otherwise start a new one. Open a window and run:
- su - backup
- bash
- cd /backups-atg
- python ./worker.py --job job-name-you-found --date YYYYmmdd --configfile wikidump.conf.XXX --log name-of-wiki
- The date in the above is the date of the run you're fixing, i.e. the YYYYmmdd from the run directory's name.
- Example: to rerun the generation of the bzip2 pages meta history file for the enwiki dumps for January 2012 you would run
python ./worker.py --job metahistorybz2dump --date 20120104 --configfile wikidump.conf.enwiki --log enwiki
Do this for each step that needs to be rerun for a given wiki, waiting for each step to complete before doing the next one.
NOTE: don't run multiple steps for a given wiki at the same time. You'll probably get garbage results.
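A quick way to see which jobs a run has marked as failed, assuming the run directory layout described above (the exact line format of dumpruninfo.txt may vary, so check the file by eye if the grep comes up empty):
# List the jobs recorded as failed for this run.
grep 'status:failed' /mnt/data/xmldatadumps/public/name-of-wiki/YYYYmmdd/dumpruninfo.txt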
Rerunning a dump from a given step onwards
- As above, you'll need to determine the date of the dump, the configuration file needed, and the host the job should run on.
- You need to remove any existing lockfile, as described above in the section on rerunning an entire dump.
- You don't need to clean up any old files.
- Determine which job (which step) is the first one that needs to be re-run. Presumably the point of failure has been recorded on the web-viewable page for the particular dump (http://dumps.wikimedia.org/wikiname-here/YYYYmmdd/), in which case it should be marked as status:failed in the dumpruninfo.txt file in the run directory. Use the job name from that file, and remember that the jobs are listed in reverse order of execution. If you were told about the failure by a user, or you aren't sure which job it is, see Dumps/Phases of a dump run to figure out the right job(s).
- If there's already a root screen session on the host, use it; otherwise start a new one. Open a window and run:
- su - backup
- bash
- cd /backups-atg
- python ./worker.py --job job-name-you-found --restartfrom --date YYYYmmdd --configfile wikidump.conf.XXX --log name-of-wiki
- The date in the above is the date of the run you're fixing, i.e. the YYYYmmdd from the run directory's name.
- Example: to rerun the en wiki dump for January 2012 from the generation of the bzip2 pages meta history file on, you would run
python ./worker.py --job metahistorybz2dump --restartfrom --date 20120104 --configfile wikidump.conf.enwiki --log enwiki
NOTE THAT if a new dump is already running for the project by the time you discover the broken dump, you can't do this. You'll need to either wait for the new dump to complete or run the old one one step at a time (see above).
Rerunning a step without using the python scripts
Sometimes you may want to rerun a step using mysql or the MediaWiki maintenance scripts directly, especially if the particular step causes problems more than once.
In order to see what command worker.py runs for a given job, you can either look at the log (dumplog.txt) or you can run the step from worker.py giving the "--dryrun" option, which tells it "don't actually do this; write the commands that would be run to stderr".
- Determine which host the wiki is dumped from, which configuration file is used, the date of the dump and the job name, as described in the section above about rerunning one piece of a dump.
- Give the appropriate worker.py command, as in that same section, adding the option "--dryrun" before the name of the wiki.
Examples
- To see how the category table gets dumped, type:
python ./worker.py --date 20120109 --job categorytable --dryrun elwiktionary
- to get the output
Command to run: /usr/bin/mysqldump -h 10.0.6.21 -u XXX -pXXX --opt --quick --skip-add-locks --skip-lock-tables elwiktionary category | /bin/gzip > /mnt/data/xmldatadumps/public/elwiktionary/20120109/elwiktionary-20120109-category.sql.gz
- To see how the stub xml files get dumped, type:
python ./worker.py --date 20120109 --job xmlstubsdump --dryrun elwiktionary
- to get the output
Command to run: /usr/bin/php -q /apache/common/multiversion/MWScript.php dumpBackup.php --wiki=elwiktionary --full --stub --report=10000 --force-normal --output=gzip:/mnt/data/xmldatadumps/public/elwiktionary/20120109/elwiktionary-20120109-stub-meta-history.xml.gz --output=gzip:/mnt/data/xmldatadumps/public/elwiktionary/20120109/elwiktionary-20120109-stub-meta-current.xml.gz --filter=latest --output=gzip:/mnt/data/xmldatadumps/public/elwiktionary/20120109/elwiktionary-20120109-stub-articles.xml.gz --filter=latest --filter=notalk --filter=namespace:!NS_USER
- As you see from the above, all three stub files are written at the same time.
- To see how the full history xml bzipped file is dumped, type:
python ./worker.py --date 20120109 --job metahistorybz2dump --dryrun elwiktionary
- to get the output
Command to run: /usr/bin/php -q /apache/common/multiversion/MWScript.php dumpTextPass.php --wiki=elwiktionary --stub=gzip:/mnt/data/xmldatadumps/public/elwiktionary/20120109/elwiktionary-20120109-stub-meta-history.xml.gz --prefetch=bzip2:/mnt/data/xmldatadumps/public/elwiktionary/20120117/elwiktionary-20120117-pages-meta-history.xml.bz2 --force-normal --report=1000 --spawn=/usr/bin/php --output=bzip2:/mnt/data/xmldatadumps/public/elwiktionary/20120109/elwiktionary-20120109-pages-meta-history.xml.bz2 --full
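Since --dryrun writes the would-be commands to stderr, you can capture them to a file for hand editing and reuse; a minimal sketch:
# Save the dry-run command(s) for editing (assumes they go only to stderr).
python ./worker.py --date 20120109 --job categorytable --dryrun elwiktionary 2> dryrun-commands.txt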
Generating new dumps
When new wikis are enabled on the site, they are added to all.dblist which is checked by the dump scripts. They get dumped as soon as a worker completes a run already in progress, so you don't have to do anything special for them.
Running a (new) specific dump by hand
Once in a while we get a request for a dump of a wiki out of sequence, so that it can be archived before it is shut down and removed, for example.
- Determine which configuration file you need to use and which host the dump should run from, as above in the section on rerunning an entire dump.
- If there's already a root screen session on that host, use it; otherwise start a new one. Open a window and run:
- su - backup
- bash
- cd /backups-atg
- python ./worker.py --configfile wikidump.conf.XXX --log name-of-wiki
- Example: to run the enwiki dumps you would type
python ./worker.py --configfile wikidump.conf.enwiki --log enwiki
Stubs and text revisions
A few notes about the generation of the files containing the text revisions of each page.
You need to have the "stub" XML files generated first. These get done much faster than the text dumps. For example, generating the stub files for en wikipedia without doing multiple pieces at a time took less than a day in early 2010, but generating the full history file without parallel runs took over a month, and today it would take much longer.
While you can specify a range of pages to the script that generates the stubs, there is no such option for generating the revision text files. The revision ids in the stub file used as input determine which revisions are written as output.
In order to save time and wear and tear on the database servers, old data is reused to the extent possible; the production scripts run with a "prefetch" option which reads revision texts from a previous dump and, if they pass a basic sanity check, writes them out instead of polling the database for them. Thus, only new or restored revisions in the database should be requested by the script.
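If you want to sanity-check a previous dump by hand before trusting it as prefetch input (this is just a quick integrity test, not the check the script itself performs), something like the following works for the bzip2 history files:
# Test that the candidate prefetch file is structurally intact bzip2 data.
bzip2 -t /mnt/data/xmldatadumps/public/elwiktionary/20120101/elwiktionary-20120101-pages-meta-history.xml.bz2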
Using a different prefetch file for revision texts
Sometimes the file used for prefetch may be broken or the XML parser may balk at it for whatever reason. You can deal with this in two ways.
- You could mark the file as bad, by going into the dump directory for the date the prefetch file was generated and editing the file dumpruninfo.txt, changing "status:done" to "status:bad" for the dump job in question (one of articlesdump, metacurrentdump or metahistorybz2dump), then rerun the step using the python script worker.py (see the sketch after this list).
- You could run the step by hand without the python script (see the section above on how to do that), specifying prefetch from another, earlier file or set of files. Example: to regenerate the elwiktionary history file from 20120109 with a prefetch from the 20111224 output instead of the 20120101 files, type:
/usr/bin/php -q /apache/common/multiversion/MWScript.php dumpTextPass.php --wiki=elwiktionary --stub=gzip:/mnt/data/xmldatadumps/public/elwiktionary/20120109/elwiktionary-20120109-stub-meta-history.xml.gz --prefetch=bzip2:/mnt/data/xmldatadumps/public/elwiktionary/20111224/elwiktionary-20111224-pages-meta-history.xml.bz2 --force-normal --report=1000 --spawn=/usr/bin/php --output=bzip2:/mnt/data/xmldatadumps/public/elwiktionary/20120109/elwiktionary-20120109-pages-meta-history.xml.bz2 --full
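For the first option, a sketch of the edit, assuming each job has its own line in dumpruninfo.txt containing the job name (look at the file first; if the format differs, just edit it by hand):
# In the dump directory for the prefetch date, mark the history job as bad.
cd /mnt/data/xmldatadumps/public/elwiktionary/20120101
sed -i '/metahistorybz2dump/s/status:done/status:bad/' dumpruninfo.txt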
Skipping prefetch for revision texts
Sometimes you may not trust the contents of the previous dumps, or you may not have them at all. You can run without prefetch, but it is much slower, so avoid this if possible for larger wikis. To skip prefetch, do one of the following:
- run the worker.py script with the option --noprefetch
- run the step by hand without the python script (see the section above on how to do that), removing the prefetch option from the command. Example: to regenerate the elwiktionary history file from 20120109 without prefetch, you would type:
/usr/bin/php -q /apache/common/multiversion/MWScript.php dumpTextPass.php --wiki=elwiktionary --stub=gzip:/mnt/data/xmldatadumps/public/elwiktionary/20120109/elwiktionary-20120109-stub-meta-history.xml.gz --force-normal --report=1000 --spawn=/usr/bin/php --output=bzip2:/mnt/data/xmldatadumps/public/elwiktionary/20120109/elwiktionary-20120109-pages-meta-history.xml.bz2 --full
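For the first option, the worker.py invocation might look like this (a sketch; it assumes --noprefetch simply combines with the usual options described earlier):
python ./worker.py --job metahistorybz2dump --noprefetch --date 20120109 --configfile wikidump.conf --log elwiktionary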