Dumps

From Wikitech
 
Docs for end-users of the data dumps at [[MetaWikipedia:Data dumps]].

For a list of various information sources about the dumps, see [[Dumps/Other information sources]].

*For documentation on the "adds/changes" dumps, see [[Dumps/Adds-changes dumps]].
*For documentation on the media dumps, see [[Dumps/media]].
*For current development plans, see [[Dumps/Development 2012]].
*For historical information about the dumps, see [[Dumps/History]].

{| cellspacing="0" cellpadding="0" style="clear: {{{clear|right}}}; margin-bottom: .5em; float: right; padding: .5em 0 .8em 1.4em; background: none; width: {{{width|{{{1|auto}}}}}};"
| __TOC__
|}

== Overview ==

User-visible files appear at http://download.wikipedia.org/backup-index.html

Dump activity involves a monitor node (running status sweeps) and arbitrarily many worker nodes running the dumps.
 
=== Status ===

For which hosts are serving data, see [[Dumps/Dump servers]]. For which hosts are generating which dumps, see [[Dumps/Snapshot hosts]].

We want mirrors!  For more information see [[Dumps/Mirror status]].

=== Worker nodes ===
 
The worker processes go through the set of available wikis to dump automatically. Dumps are run on a "longest without a dump runs next" schedule. The plan is to have a complete dump for each wiki every 2 weeks, except for enwikipedia, which should have a complete dump once a month.
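The "longest without a dump runs next" rule can be sketched as follows. This is a simplified illustration; the function and its input format are invented here, not taken from worker.py:

```python
from datetime import datetime

def next_wiki_to_dump(last_run):
    """Pick the wiki that has gone longest without a dump.

    last_run maps wiki db name -> datetime of the last completed dump;
    None means never dumped, which always wins."""
    never = sorted(w for w, ts in last_run.items() if ts is None)
    if never:
        return never[0]
    return min(last_run, key=lambda w: last_run[w])

# frwiki has the oldest completed dump, so it runs next.
runs = {
    "enwiki": datetime(2012, 6, 1),
    "frwiki": datetime(2012, 5, 2),
    "elwiki": datetime(2012, 5, 20),
}
print(next_wiki_to_dump(runs))  # frwiki
```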
  
The shell script <code>worker</code> which starts one of these processes simply runs the python script <code>worker.py</code> in an endless loop. Multiple such workers can run at the same time on different hosts, as well as on the same host.

The <code>worker.py</code> script creates a lock file on the filesystem containing the dumps (as of this writing, <code>/mnt/data/xmldatadumps/</code>) in the subdirectory <code>private/name-of-wiki/lock</code>.  No other process will try to write dumps for that project while the lock file is in place.
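The locking scheme amounts to roughly the following. This is a sketch only, with made-up helper names; the directory layout matches the description above but the real logic lives in worker.py:

```python
import os

def try_lock(private_dir, wiki):
    """Atomically create private/<wiki>/lock; return the open fd,
    or None if another worker already holds the lock."""
    lockfile = os.path.join(private_dir, wiki, "lock")
    try:
        # O_CREAT|O_EXCL fails if the file already exists, so only one
        # worker can win the race to dump this wiki.
        fd = os.open(lockfile, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
    except OSError:
        return None
    os.write(fd, str(os.getpid()).encode())
    return fd

def touch_lock(private_dir, wiki):
    """Refresh the lock's mtime so the monitor sees it as live."""
    os.utime(os.path.join(private_dir, wiki, "lock"), None)
```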
  
Local copies of the shell script and the python script <small>live on the snapshot hosts in the directory <code>/backups</code> but</small> currently are run out of /backups-atg (since this code is not yet in trunk) in screen sessions on the various hosts, as the user "backup".

=== Monitor node ===

The monitor node checks for and removes stale lock files from dump processes that have died, and updates the central <code>index.html</code> file which shows the dumps in progress and the status of the dumps that have completed (i.e. <code>http://dumps.wikimedia.org/backup-index.html</code>). ''It does not start or stop worker processes.''

The shell script <code>monitor</code> which starts the process simply runs the python script <code>monitor.py</code> in an endless loop.

As with the worker nodes, local copies of the shell script and the python script <small>live on the snapshot hosts in the directory <code>/backups</code> but</small> currently are run out of /backups-atg (since this code is not yet in trunk) in a screen session on one host, as the user "backup".
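The stale-lock sweep can be pictured like this. The helper name and the one-hour threshold are invented for illustration; the real value is whatever monitor.py uses:

```python
import os
import time

STALE_AGE = 3600  # hypothetical threshold in seconds

def remove_stale_locks(private_dir, now=None):
    """Delete lock files whose mtime is too old: the worker that owned
    them has stopped touching them, so it is presumed dead."""
    now = time.time() if now is None else now
    removed = []
    for wiki in sorted(os.listdir(private_dir)):
        lockfile = os.path.join(private_dir, wiki, "lock")
        if os.path.exists(lockfile) and now - os.path.getmtime(lockfile) > STALE_AGE:
            os.remove(lockfile)
            removed.append(wiki)
    return removed
```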
  
=== Code ===

Check [https://gerrit.wikimedia.org/r/gitweb?p=operations/dumps.git;a=tree;f=xmldumps-backup;hb=ariel /operations/dumps.git, branch 'ariel'] for the python code in use.  Eventually this will make its way back into master; it's still a bit gross right now.
  
Getting a copy:
: <code>git clone https://gerrit.wikimedia.org/r/p/operations/dumps.git</code>
: <code>git checkout ariel</code>

Getting a copy as a committer:
: <code>git clone ssh://<user>@gerrit.wikimedia.org:29418/operations/dumps.git</code>
: <code>git checkout ariel</code>

=== Programs used ===

See also [[Dumps/Software dependencies]].

The scripts call mysqldump, getSlaveServer.php, eval.php, dumpBackup.php, and dumpTextPass.php directly for dump generation. These in turn require backup.inc and backupPrefetch.inc and may call ActiveAbstract/AbstractFilter.php and fetchText.php.
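A single step of this kind amounts to something like the following. The runner function is hypothetical, though the <code>php -q script args</code> invocation pattern matches the commands used on this page:

```python
import subprocess

def run_dump_step(php, script, args, logfile):
    """Run one MediaWiki maintenance script (e.g. dumpBackup.php) and
    report whether it exited cleanly; output is appended to the run log."""
    cmd = [php, "-q", script] + args
    with open(logfile, "a") as log:
        result = subprocess.call(cmd, stdout=log, stderr=subprocess.STDOUT)
    return result == 0
```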
  
The generation of XML files relies on Export.php under the hood and of course the entire MW infrastructure.

The worker.py script relies on a few C programs for various bz2 operations: checkforbz2footer and recompressxml, both in /usr/local/bin/. These are in the git repo; see [https://gerrit.wikimedia.org/r/gitweb?p=operations/dumps.git;a=tree;f=xmldumps-backup/mwbzutils;h=e76ee6cb52fd40e570e2e62a969f8b57902de1b9;hb=ariel the mwbzutils directory].

== Setup ==

=== Adding a new worker box ===

Install and add to site.pp, copying one of the existing snapshot stanzas in puppet.  This does, among other things:
# set up the base MW install without apache running
# add worker to '''/etc/exports/''' on [[dataset2]]
# add '''/mnt/data''' to '''/etc/fstab''' of the worker host
# build the utfnormal php module (done for lucid)

For now:
# Backups are running test code out of /backups-atg on each host, so grab a copy of that from any existing host and copy it into /backups-atg on the new host. This will include conf files; you don't need to specify them separately.
#: '''In transition, being moved to /backups. To be updated as soon as the move is complete.'''
# Check over the configuration file and make sure it looks sane: all the paths point to things that exist, etc.  For many more details see [https://gerrit.wikimedia.org/r/gitweb?p=operations/dumps.git;a=blob_plain;f=xmldumps-backup/README.config;hb=ariel the README.config file in the git repo].
#* We run enwiki on its own host.  If this host is going to do that work, check <code>/backups-atg/wikidump.conf.enwiki</code>.
#* The next 8 or so largest wikis are run on their own separate host so they don't backlog the smaller wikis.  For that, check <code>/backups-atg/wikidump.conf.bigwikis</code>.
#* The remainder of the wikis run on one host.  Check <code>/backups-atg/wikidump.conf</code> for those.
<!--We will eventually do...
# '''git pull something for public repo ...  /backups'''
# '''git pull something else for private repo with config files in it... /backups/conf'''
# mv wikidump.conf ../.-->
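The "paths point to things that exist" check on the configuration file can be sketched as below, assuming the INI-style layout; the section and key names passed in are placeholders, not the real ones (see README.config for those):

```python
import configparser
import os

def check_paths(conf_file, keys):
    """Verify that each named config entry points at something that
    exists on disk; return the entries that don't."""
    cfg = configparser.ConfigParser()
    cfg.read(conf_file)
    missing = []
    for section, key in keys:
        path = cfg.get(section, key, fallback=None)
        if path is None or not os.path.exists(path):
            missing.append((section, key, path))
    return missing
```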
+
== Dealing with problems ==

=== Space ===

If the host serving the dumps runs low on disk space, you can reduce the number of backups that are kept.  Edit the appropriate /backups-atg/wikidump.conf* file on the host running the set of dumps you would like to adjust (en wiki = wikidump.conf.enwiki, the next 8 or so big wikis = wikidump.conf.bigwikis, the rest = wikidump.conf) and change the line that says "keep=<some value>" to some smaller number.
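The effect of lowering "keep" amounts to something like this. An illustrative sketch only; the actual cleanup is done by the dump scripts themselves:

```python
import os
import shutil

def prune_old_dumps(wiki_dir, keep):
    """Remove all but the newest `keep` dated dump directories for a wiki.
    Dump dirs are named YYYYMMDD, so a lexical sort is also chronological."""
    dates = sorted(d for d in os.listdir(wiki_dir)
                   if d.isdigit() and len(d) == 8)
    doomed = dates[:-keep] if keep > 0 else dates
    for d in doomed:
        shutil.rmtree(os.path.join(wiki_dir, d))
    return doomed
```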
+
=== Failed runs ===

Logs are kept of each run. You can find them in the directory for the particular dump, filename <code>dumplog.txt</code>.  You can look at them to see if there are any error messages that were generated for a given run.

The worker script can send email if a dump does not complete successfully.  (Better enable this.)  It currently sends email to...

When one or more steps of a dump fail, the index.html file for that dump includes a notation of the failure and sometimes more information about it. Note that one step of a dump failing does not prevent other steps from running unless they depend on the data from that failed step as input.

See [[Dumps/Rerunning a job]] for how to rerun all or part of a given dump. This also explains what files may need to be cleaned up before rerunning.

=== Dumps not running ===

This covers restarting after: rebooting a host, rebooting the dataset host with the nfs share where dumps are written (which may cause dumps to hang), or when the dumps stop running for other reasons.

If the host crashes while the script is running, the status files are left as-is and the display shows it as still running until the monitor node decides the lock file is stale enough to mark it as aborted. To restart, start a screen session on the host as root and fire up the appropriate number of worker scripts with the appropriate config file option. See [[Dumps/Snapshot hosts]] for which hosts do what; this lists which commands get run on each host in how many windows.  If the monitor script is not running, restart it in a separate window of the same screen session; see the Dump servers page for the command and for which host it runs on.

If the worker script encounters more than three failed dumps in a row (currently configured as such? or did I hardcode that?) it will exit; this avoids generation of piles of broken dumps which later would need to be cleaned up.  Once the underlying problem is fixed, you can go to the screen session of the host running those wikis and rerun the previous command in all the windows. See [[Dumps/Snapshot hosts]] for which hosts do what if you're not sure.
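The bail-out behavior amounts to roughly the following; the function and the exact threshold handling are illustrative, not the real worker code:

```python
MAX_CONSECUTIVE_FAILURES = 3  # illustrative; see the worker code/config

def run_until_too_many_failures(wikis, run_dump):
    """Dump each wiki in turn, but give up after too many failures in a
    row so we don't churn out piles of broken dumps.  Returns the wiki
    we bailed out on, or None if we got through the whole list."""
    failures = 0
    for wiki in wikis:
        if run_dump(wiki):
            failures = 0  # any success resets the streak
        else:
            failures += 1
            if failures > MAX_CONSECUTIVE_FAILURES:
                return wiki
    return None
```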
  
=== Running a specific dump on request ===

See [[Dumps/Rerunning a job]] for how to run a specific dump. This is done for special cases only.

== Deploying new code ==

See [[Dumps/How to deploy]] for this.

== Bugs, known limitations, etc. ==

See [[Dumps/Known issues and wish list]] for this.
+
== File layout ==

* <base>/
** [http://dumps.wikimedia.org/index.html index.html] - Information about the server
** [http://dumps.wikimedia.org/backup-index.html backup-index.html] - List of all databases and their last-touched status
** [http://dumps.wikimedia.org/afwiki/ <db>/]
*** <date>/
**** [http://dumps.wikimedia.org/afwiki/20060122/ index.html] - List of items in the database

Sites are identified by raw database name currently. A 'friendly' name/hostname can be added for convenience of searching in future.
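Given this layout, the index page for a particular run can be located mechanically; a convenience sketch, not code from the repo:

```python
def dump_index_url(base, db, date):
    """Build the URL of a run's index page from the layout above,
    e.g. base='http://dumps.wikimedia.org', db='afwiki', date='20060122'."""
    return "%s/%s/%s/index.html" % (base, db, date)

print(dump_index_url("http://dumps.wikimedia.org", "afwiki", "20060122"))
# http://dumps.wikimedia.org/afwiki/20060122/index.html
```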
 
  
 
[[Category:How-To]]
[[Category:Risk management]]
[[Category:dumps]]

Latest revision as of 10:39, 20 June 2012
