Dumps

From Wikitech
Revision as of 22:47, 2 June 2010

Docs for end-users of the data dumps are at MetaWikipedia:Data dumps.


Top-level procedure

The dumped files are stored and served to the web from storage2.

User-visible files appear at http://download.wikipedia.org/backup-index.html

Dump activity involves a monitor node (running status sweeps) and arbitrarily many worker nodes running the dumps.

Status

A dump of the full en wiki history running on snapshot2 is being used to test dumps with no prefetch and no spawn. snapshot3 has live dumps running on a locally hacked copy of the code; fixes to bug 3264 will be checked in shortly.

Monitor node

  • snapshot3
    • /home/wikipedia/src/backup/monitor
      • This shell script runs monitor.py in an endless while loop. Its sole purpose is to check for and remove stale lock files from dump processes that have died, and to update the central index.html file which shows the dumps in progress and the status of the dumps that have completed (i.e. http://dumps.wikimedia.org/backup-index.html ). It does not start or stop worker threads.
      • The local copy of this script on each host, as well as the associated python script, lives in the directory /backups.
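The stale-lock sweep the monitor performs can be sketched as follows. This is an illustrative shell stand-in, not the actual monitor.py logic; the /mnt/dumps layout matches the text, but the 60-minute cutoff and the find-based implementation are assumptions.

```shell
# Hypothetical sketch of the monitor's stale-lock sweep: a lock file
# whose mtime is older than some cutoff is assumed stale (workers touch
# their lock regularly while alive) and is removed so the wiki can be
# re-run later. The cutoff value here is an assumption.
sweep_stale_locks() {
    local private_dir="$1" cutoff_min="$2"
    find "$private_dir" -name lock -type f -mmin +"$cutoff_min" -print -delete
}

# Demonstration against a throwaway directory:
demo=$(mktemp -d)
mkdir -p "$demo/enwiki" "$demo/dewiki"
touch "$demo/enwiki/lock"                   # fresh lock: kept
touch -d '2 hours ago' "$demo/dewiki/lock"  # stale lock: removed
sweep_stale_locks "$demo" 60
```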

Worker nodes

The worker threads go through the set of available wikis to dump automatically.

  • snapshot3 (8-core test server pulled from the apache fleet)
  • snapshot2
    • /home/wikipedia/src/backup/worker
      • This shell script runs worker.py in an endless loop. The python script is invoked without arguments and in this mode will look for the oldest dump directory and will produce sql and XML files for that project.
      • The worker.py script creates a lock file on the filesystem containing the dumps (as of this writing, /mnt/dumps/) in the subdirectory private/name-of-wiki/lock. No other process will try to write dumps for that project while the lock file is in place.
      • Local copies of the shell script and the python script live on the snapshot hosts in the directory /backups.
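The lock behaviour described above can be sketched in shell. The real implementation is in worker.py (python); the noclobber trick below is just an illustrative stand-in for its create-if-absent check, and the directory layout mirrors private/name-of-wiki/lock from the text.

```shell
# Hypothetical sketch of the per-wiki lock: create private/<wiki>/lock
# only if no other worker already holds it. noclobber makes the
# redirect (and so the function) fail if the lock file exists.
acquire_lock() {
    local wiki_dir="$1"
    mkdir -p "$wiki_dir"
    ( set -o noclobber; echo "$$" > "$wiki_dir/lock" ) 2>/dev/null
}

demo=$(mktemp -d)
acquire_lock "$demo/private/enwiki" && first=acquired || first=busy
acquire_lock "$demo/private/enwiki" && second=acquired || second=busy
```

A second worker attempting the same wiki sees the existing lock and moves on, which is the "no other process will try to write dumps" guarantee above.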

Configuration notes

Adding a new worker box

  1. Install like a regular apache app server
  2. Add the worker to /etc/exports on storage2
  3. Add /mnt/dumps to /etc/fstab of the worker host
  4. apt-get install php5-dev libicu-dev g++ php-config gcc-3.4 mysql-client-5.0 p7zip-full swig subversion
  5. svn co http://svn.wikimedia.org/svnroot/mediawiki/trunk/extensions/normal/
  6. cd normal; make
  7. Find the PHP extension directory with php -i | grep extension (e.g. /usr/lib/php5/20060613)
  8. mv php_utfnormal.so /usr/lib/php5/20060613
  9. Add /etc/php5/conf.d/utfnormal.ini with extension=php_utfnormal.so
  10. Install 7zip of at least 4.58 due to https://bugs.edge.launchpad.net/hardy-backports/+bug/370618, or chmod 644 the files in place
  11. svn co http://svn.wikimedia.org/svnroot/mediawiki/trunk/backup/ /backups
  12. svn co 'svn+ssh://user@svn.wikimedia.org/svn-private/wmf/xmlsnapshots/conf' conf
    1. mv wikidump.conf ../.

Locks and logs

The worker threads use lock files in the private directories. Lock files are touched by a background thread during dump; the monitor node looks for stale lock files and releases them for later re-running by a worker node.

Raw output from the script goes into whatever file you tee output into. A separate text log isn't kept at the moment, but status information is saved into HTML files for public consumption.
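For example, a hypothetical way to capture that raw output when running a worker by hand (the echo below stands in for python ./worker.py enwiki, so the snippet is self-contained):

```shell
# Illustrative only: tee sends the worker's combined stdout/stderr to a
# log file while still showing it on the terminal.
demo_log=$(mktemp)
echo "simulated worker output" 2>&1 | tee "$demo_log"
captured=$(cat "$demo_log")
```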

File layout

  • <base>/

Sites are currently identified by raw database name. A 'friendly' name/hostname may be added in the future for convenience of searching.

Error handling

If a dump step returns an error condition, the runner script should detect this and mark the item as "failed" on the HTML pages. The runner will keep on trying other steps and remaining databases, unless the runner script itself fails somehow.

It will also e-mail xmldatadumps-admin-l@lists.wikimedia.org with a notification.

If the server crashes while it's running, the status files are left as-is and the display shows it as still running until the monitor node decides the lock file is stale enough to mark it as aborted.

Troubleshooting

If the host runs low on disk space, you can reduce the number of backups that are kept. Edit the file /home/wikipedia/src/backup/wikidump.conf on the monitor host and look for the line that says "keep=<some value>".
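For example, a hypothetical excerpt of wikidump.conf (only the keep key is described above; the value and comment are assumptions for illustration):

```
# keep only the newest N complete dumps per wiki; lowering this
# frees disk space on the next cleanup pass
keep=10
```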

Testing

If you need to run dumps for one wiki, become the backup user on one of the snapshot hosts and then do

  • cd /backups
  • python ./worker.py name-of-wiki-here

For example,

  • python ./worker.py enwiki

If you have the stub file and want to generate the full XML file with text revisions from the stub file, do

...

Backup stages

  • First stage: dumps of various database tables, both private and public
  • Second stage: list of page titles, page abstracts for Yahoo
  • Third stage: page stubs (dumpBackup.php)
  • Fourth stage: XML files with revision texts, bzipped (dumpTextPass.php, fetchText.php)
  • Fifth stage: 7z compression of the XML file with all revision texts (full history)
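Roughly, the last three stages chain together like the following sketch. These invocations are abbreviated and hypothetical (the real flags and wrappers live in worker.py, and nothing here runs without a working MediaWiki install); they are shown only to orient the reader.

```
# Illustrative only -- not the exact production invocations.
# Stage 3: page stubs
php dumpBackup.php --full --stub --output=gzip:enwiki-stub-meta-history.xml.gz
# Stage 4: fill in revision texts, bzipped
php dumpTextPass.php --stub=gzip:enwiki-stub-meta-history.xml.gz \
    --output=bzip2:enwiki-pages-meta-history.xml.bz2
# Stage 5: recompress the full-history XML as 7z
bzcat enwiki-pages-meta-history.xml.bz2 | \
    7za a -si enwiki-pages-meta-history.xml.7z
```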

Programs used

The dump runner script is available in our cvs, in the 'backup' module, as WikiBackup.py.

  • mysqldump
  • dumpBackup.php, dumpTextPass.php, fetchText.php to generate the XML dumps
    • Requires working PHP 5 and MediaWiki installation on the snapshot hosts! Don't remove from mediawiki-installation dsh group!
    • need the XMLReader PHP extension, zlib, and bzip2 enabled
    • using the ActiveAbstract MW extension for Yahoo's wacky stuff
    • 7za (of p7zip) must be installed and in the path

Other missing features

Currently, image tarballs are still not being made.

Static HTML dumps might also be included in this mess in the future?

Notes

Not all error detection is working right now: failures on the mysqldump runs are not detected, and tar failures are not detected.

Failures of dumpPages.php should be detected, but indirectly from the failure of mwdumper to parse its XML output.

The mysql dumps and page XML dump pull from bacon, as configured in one of the higher-level backup scripts. Currently bacon is *NOT* being stopped from replication...

  • The page XML dumps should be consistent: all three outputs draw from one input, which is drawn from one long SQL transaction plus supplementary data loads that should be independent of changes.
  • The other SQL dumps are not going to be 100% time-consistent. But that's not too important.

grantswiki and internalwiki are special-cased so they _should_ get completely backed up into /var/backup/private instead of the public dir.
