Dumps

From Wikitech

Revision as of 10:31, 22 January 2006

Docs for end-users of the data dumps are at [[MetaWikipedia:Data dumps]].


Updated notes for Wikimedia site setup, 2006-01-22.


== Top-level procedure ==

=== Florida ===

On [[srv31]] as root in a screen session, run:

 # /home/wikipedia/src/backup/backup-pmtpa 2>&1 | tee some-log-file

Files are saved onto benet; make sure there's ~50 gigabytes free before running.
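A pre-flight free-space check along these lines could save an aborted run; this is only a sketch (the path and the 50 GB threshold come from the note above, the function names are made up):

```python
import os

REQUIRED_GB = 50  # rough figure from the note above

def free_gigabytes(path):
    """Return free space at `path` in gigabytes, via statvfs."""
    st = os.statvfs(path)
    return st.f_bavail * st.f_frsize / (1024 ** 3)

def enough_space(path, required_gb=REQUIRED_GB):
    """True if `path` has at least `required_gb` gigabytes free."""
    return free_gigabytes(path) >= required_gb
```

Run it against the dump destination (e.g. the public dump directory on benet) before kicking off the backup.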

User-visible files appear at http://download.wikimedia.org/

=== Korea ===

On [[amaryllis]] as root in a screen session, run:

 # /usr/local/backup/backup-yaseo 2>&1 | tee some-log-file

Files are saved on amaryllis.

User-visible files appear at http://download-yaseo.wikimedia.org/ (this needs fixing)


== Locks and logs ==

At the moment the new dump script doesn't use lock files, so make sure you don't run two sessions on the same cluster. Lock files will be added so it can be automatically started in some reasonably safe fashion...
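When lock files are added, a non-blocking exclusive lock is one reasonably safe shape for them. This is a sketch only: the use of `fcntl.flock` and the recording of the holder's pid are assumptions, loosely modelled on the old per-wiki `backup.lock` scheme.

```python
import fcntl
import os

def acquire_lock(lock_path):
    """Try to take an exclusive, non-blocking flock on lock_path.

    Returns the open file object on success (keep it open for the
    duration of the run; the lock dies with the process), or None
    if another run already holds the lock on this cluster."""
    f = open(lock_path, "a")  # append mode: don't clobber the holder's pid
    try:
        fcntl.flock(f, fcntl.LOCK_EX | fcntl.LOCK_NB)
    except OSError:
        f.close()
        return None
    f.write("%d\n" % os.getpid())
    f.flush()
    return f
```

Because the lock is released automatically when the process exits, a crashed run can't leave the cluster permanently wedged the way a stale lock file can.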

Raw output from the script goes into whatever file you tee it into. A separate text log isn't kept at the moment, but status information is saved into HTML files for public consumption:

* <base>/
** [http://download.wikimedia.org/ index.html] - List of all databases and their last-touched status
** <db>/
*** <date>/
**** [http://download.wikimedia.org/afwiki/20060122/ index.html] - List of items in the database

At the moment there's no handy link back to prior dumps, but you can remove a level from the URL to get a directory listing.

Sites are currently identified by raw database name. A 'friendly' name/hostname can be added in the future for convenience of searching.
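The <base>/<db>/<date>/ layout above can be sketched as a small URL builder (a hypothetical helper, not part of WikiBackup.py):

```python
import posixpath

BASE = "http://download.wikimedia.org"  # base URL from the listing above

def status_index(db=None, date=None):
    """Build the index.html URL for the whole site, one database,
    or one dated dump, following the <base>/<db>/<date>/ layout."""
    parts = [BASE] + [p for p in (db, date) if p]
    return posixpath.join(*parts) + "/index.html"
```

Dropping the `date` argument gives the level you'd reach by trimming the URL by hand, as described above.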

== Error handling ==

If a dump step returns an error condition, the runner script should detect this and mark the item as "failed" on the HTML pages. The runner will keep on trying other steps and remaining databases, unless the runner script itself fails somehow.

It may be wise to add e-mail or other notification of errors.
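The keep-going-on-failure behaviour described above boils down to a loop of this shape; the names are illustrative, not the actual WikiBackup.py internals:

```python
def run_dump(databases, steps):
    """Run each dump step for each database.

    A failing step marks that item "failed" (as on the HTML status
    pages) but does not stop the other steps or the remaining
    databases.  `steps` maps step name -> callable(db)."""
    status = {}
    for db in databases:
        status[db] = {}
        for name, step in steps.items():
            try:
                step(db)
                status[db][name] = "done"
            except Exception:
                status[db][name] = "failed"
    return status
```

Only a failure of the loop itself (say, an unhandled error while writing the status pages) would take down the whole run; the error-notification hook mentioned above would slot naturally into the `except` branch.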

== Programs used ==

The dump runner script is available in our CVS, in the 'backup' module, as WikiBackup.py.

* mysqldump
* dumpBackup.php, dumpTextPass.php to generate the XML dumps
** Requires a working PHP 5 and MediaWiki installation on amaryllis/srv31! Don't remove them from the mediawiki_install dsh group!
** Needs the XMLReader PHP extension, plus zlib and bzip2 support enabled
** Uses the ActiveAbstract MediaWiki extension for Yahoo's wacky stuff
** 7za (from p7zip) must be installed and in the path
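Each of these is an external program, so the runner's view of a step is essentially "run it and check the exit status". A minimal sketch (the helper name is made up; the real argv would be e.g. a full mysqldump or dumpBackup.php command line):

```python
import subprocess

def run_step(argv):
    """Run one external dump tool (mysqldump, dumpBackup.php, 7za, ...)
    and report whether it exited cleanly; a non-zero exit status is
    what the runner treats as a failed step."""
    result = subprocess.run(argv, capture_output=True)
    return result.returncode == 0
```

Capturing output also gives the runner something to put in the log when a step fails.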


== Other missing features ==

Currently, image tarballs are still not being made.

MD5 checksum files aren't being generated.
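Generating the missing checksum files would be a small addition; a sketch producing `md5sum`-compatible lines (chunked reading so multi-gigabyte dump files don't have to fit in memory):

```python
import hashlib

def md5_line(path):
    """Produce one md5sum-style line ("<hex digest>  <filename>")
    for a dump file, reading it in 1 MB chunks."""
    h = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return "%s  %s" % (h.hexdigest(), path)
```

Writing one such line per dump file into a per-date checksum file would let downloaders verify with a plain `md5sum -c`.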

Static HTML dumps might also be included in this mess in future?

== Notes ==

Not all error detection is working right now: failures on the mysqldump runs are not detected, and tar failures are not detected.

Failures of dumpBackup.php should be detected, but only indirectly, from the failure of mwdumper to parse its XML output.

The mysql dumps and page XML dump pull from bacon, as configured in one of the higher-level backup scripts. Currently bacon is *NOT* being stopped from replication...

* The page XML dumps should be consistent: all three outputs draw from one input, which comes from one long SQL transaction plus supplementary data loads that should be independent of changes.
* The other SQL dumps are not going to be 100% time-consistent, but that's not too important.

grantswiki and internalwiki are special-cased so they _should_ get completely backed up into /var/backup/private instead of the public dir.
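The special-casing above amounts to routing a couple of database names to the private tree. A sketch (the private list comes from the note above; treating /var/backup as the common root is an assumption):

```python
# Databases whose dumps must never land in the public directory,
# per the note above.
PRIVATE_DBS = {"grantswiki", "internalwiki"}

def dump_dir(db, root="/var/backup"):
    """Route private wikis to the private tree, everything else to
    the public one."""
    sub = "private" if db in PRIVATE_DBS else "public"
    return "%s/%s/%s" % (root, sub, db)
```

Keeping the private list in one place makes it harder for a new code path to accidentally publish a private wiki's dump.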
