Dumps/Adds-changes dumps
Revision as of 11:13, 24 November 2011
Adds/changes dumps overview
We have an experimental service available which produces dumps of added/changed content on a daily basis for all projects that have not been closed and are not private.
The code for this service is available in subversion, at http://svn.wikimedia.org/viewvc/mediawiki/branches/ariel/xmldumps-backup/incrementals/.
The job runs in three phases, currently on snapshot3, as the backup user. These three phases are run from cron, from a copy of the scripts that live in the directory /backups-atg/incrementals on that host.
Directory structure:
Everything for a given run is stored in dumproot/projectname/yyyymmdd/ much as we do for regular dumps.
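The per-run path can be sketched as below; a minimal illustration only, with `dumproot` and the project name as placeholders (the real scripts take these from their configuration):

```python
import datetime
import os

def run_directory(dumproot, project, date=None):
    """Build the per-run directory, dumproot/projectname/yyyymmdd/.

    dumproot and project are placeholder arguments here; the real
    scripts read them from configuration.
    """
    if date is None:
        date = datetime.date.today()
    return os.path.join(dumproot, project, date.strftime("%Y%m%d"))

# Example: the run directory for en wikipedia on 23 November 2011.
print(run_directory("/dumps", "enwiki", datetime.date(2011, 11, 23)))
```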
Phase one
This job runs around 7 am UTC. We record the largest revision id for the given project in the file maxrevid.txt.
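Phase one's only output is that one file. A minimal sketch of the write (the function name is hypothetical, and the database query that produces the revision id is project-specific and omitted):

```python
import os
import tempfile

def record_max_revid(rundir, max_revid):
    """Write the largest revision id seen for this project into
    maxrevid.txt in the run directory. How max_revid is obtained
    (a query against the project's revision table) is not shown."""
    if not os.path.isdir(rundir):
        os.makedirs(rundir)
    with open(os.path.join(rundir, "maxrevid.txt"), "w") as f:
        f.write("%d\n" % max_revid)
```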
Phase two
This job runs around 7 pm UTC, 12 hours after phase one. At this writing it takes about 90 minutes to complete. We generate a stubs file containing metadata in XML format for each revision added since the previous day, consulting the previous day's maxrevid.txt to get the start of the range. We then generate a meta-history XML file which contains the text of these revisions, grouped together and sorted by page id. MD5 sums of these files are available in an md5sums.txt file. A status.txt file indicates whether we had a successful run ("done") or not.
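The bookkeeping parts of this phase can be sketched as follows. This is an illustration with hypothetical helper names, not the actual script code; only the revision-range rule, the md5sums.txt format, and the "done" marker come from the description above:

```python
import hashlib
import os
import tempfile

def revision_range(prev_maxrevid, today_maxrevid):
    """Stubs cover revisions after yesterday's recorded maximum,
    up through today's maximum, inclusive."""
    return prev_maxrevid + 1, today_maxrevid

def write_md5sums(rundir, filenames):
    """Write an md5sum-style line for each output file to md5sums.txt."""
    with open(os.path.join(rundir, "md5sums.txt"), "w") as out:
        for name in filenames:
            digest = hashlib.md5()
            with open(os.path.join(rundir, name), "rb") as f:
                for chunk in iter(lambda: f.read(65536), b""):
                    digest.update(chunk)
            out.write("%s  %s\n" % (digest.hexdigest(), name))

def mark_done(rundir):
    """Record a successful run; phase three looks for this marker."""
    with open(os.path.join(rundir, "status.txt"), "w") as f:
        f.write("done\n")
```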
Phase three
This job runs around 11:30 pm UTC, hopefully leaving plenty of time for the previous job to complete. It checks the directories for successful runs and writes a main index.html file with links, for each project, to the stub and content files of the latest successful run.
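The "check the directories for successful runs" step amounts to finding, per project, the newest dated directory whose status.txt says "done". A minimal sketch under that assumption (the function name is hypothetical):

```python
import os
import tempfile

def latest_successful_run(projectdir):
    """Return the newest yyyymmdd subdirectory of projectdir whose
    status.txt reads 'done', or None if no run has succeeded."""
    for date in sorted(os.listdir(projectdir), reverse=True):
        statusfile = os.path.join(projectdir, date, "status.txt")
        if not os.path.isfile(statusfile):
            continue
        with open(statusfile) as f:
            if f.read().strip() == "done":
                return date
    return None
```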
When stuff breaks
You can rerun various jobs by hand for specified dates.
From the directory /backups-atg/incrementals, as the backup user, you can run
python ./generatemaxrevids.py
to retrieve the maxrevids for today at the time of the run. (Avoid this when possible; let cron do it at the scheduled hour.)

python ./generateincrementals.py yyyymmdd
to generate the stubs and revs text files for a given date. This presumes that the revids from the previous step are already recorded in the file maxrevid.txt in the directory for the given date and project. You can add --verbose to get information about what it's doing. If it complains about lock files in place, you can remove these by hand, provided that the cron job is not running at the time and there is no other copy of this job running.

python ./incrmonitor.py
to regenerate the index.html file listing all projects, after the previous two steps are complete.
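When several days have been missed, the second command has to be rerun once per date. A small wrapper sketch for that (the wrapper itself is hypothetical; only the `generateincrementals.py yyyymmdd` invocation comes from the steps above, and it assumes maxrevid.txt already exists for each day):

```python
import datetime

def dates_between(start, end):
    """Yield yyyymmdd strings for each day from start through end."""
    day = start
    while day <= end:
        yield day.strftime("%Y%m%d")
        day += datetime.timedelta(days=1)

def rerun_incrementals(start, end):
    """Shell out to generateincrementals.py once per missed day.
    Run from /backups-atg/incrementals as the backup user."""
    import subprocess
    for yyyymmdd in dates_between(start, end):
        subprocess.check_call(
            ["python", "./generateincrementals.py", yyyymmdd])
```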
TODOs
Need to add info about the lock files for each stage. Maybe add a tool to find and remove lock files for any given stage. Some day one of these phases is going to die in the middle, and it will be a slight pain to recover from. Need to make sure that revids are generated every day *no matter what*; once we start getting a backlog it's going to be annoying. We could get a current max and then interpolate by hand... maybe make a tool for that.
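The interpolation tool suggested above could look something like this: linear interpolation between two known maxrevids across the missed days. This is only a sketch of the idea, and the estimates will of course differ from the real per-day revision ids:

```python
def interpolate_maxrevids(known_start, known_end, missed_days):
    """Estimate one maxrevid per missed day, evenly spaced between
    two known values. Rough estimates for backfilling only."""
    step = (known_end - known_start) / float(missed_days + 1)
    return [int(known_start + step * (i + 1)) for i in range(missed_days)]
```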
Some numbers
Here are a few fun numbers from the November 23 2011 run. Writing the stubs file for 167985 revisions for en wikipedia took 2 minutes, and writing the revisions text file took 24 minutes. Writing the stubs file for 36272 revisions for de wikipedia took less than a minute, and writing the revisions text file took 5 minutes. Writing the stubs file for 43133 revisions for commons took 1 minute, and writing the revisions text file took 2 minutes.
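For comparison, a quick throughput calculation from those figures (text pass only, integer revisions per minute):

```python
# Approximate text-pass throughput from the November 23 2011 run above.
runs = [
    ("en wikipedia", 167985, 24),
    ("de wikipedia", 36272, 5),
    ("commons", 43133, 2),
]
for project, revisions, minutes in runs:
    print("%s: about %d revisions/minute" % (project, revisions // minutes))
```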