Dumps/Development 2012

Latest revision as of 06:14, 1 October 2012

This is a detailed explanation of ongoing tasks and their status.

As of the start of January 2012

Backups, availability

  • Files backed up to a local host: done
    Currently a copy of the XML dumps through November of 2010 is on tridge, our backup host. There is not room to copy over new files; this is an interim measure only.
    Dataset1001 has an rsync of all public data. (Should rsync /data/private too, I guess.)
  • (Public) dump content backed up to remote storage: In progress
    Google:
    --We have a copy of the latest complete dumps from November 2010 or before copied over to Google Storage.
    --We don't expect to copy all dumps produced but only a selection, probably a full run at six-month intervals.
    --Note that these can be downloaded only by folks with google accounts.
    --We have been advised to use specific naming schemes for Google storage "buckets" which cannot be preempted by other users; the files are being moved now to a bucket with this naming scheme. A script for doing the bimonthly copy from the xml dumps server is ready and needs to be updated with the new naming scheme.
    --We need to get developer keys for each developer who might run the script; this process is also underway.
    Archive.org:
    --We have contacts there now to help shepherd things along.
    --Code is in progress to use their S3-ish api.
    --We don't expect to copy all dumps produced but only a selection, probably a full run at six-month intervals.
    --There is a lab project that was set up by Hydriz to copy all dumps ever produced; we need to discuss this. See the labs project page.
  • Off site backups: Not started
    This means a full copy of dumps, page stats, and mediawiki tarballs, but also all private data, to durable media stored off-site.
    Questions to be answered (need discussion): How often? Do we do incrementals? What third party location would hold these backups? What media would we use?
  • Mirroring of the files: In progress
    We have had discussions with a couple of folks about possible mirrors. Again this would only be public files. Needs followup. More info: Dumps/Mirror status.
    We have three mirror sites; see [1]
  • Make old dumps from every six months or so (2002 through 2009) available In progress
    2002, 2003, 2005, 2006 available for download.
  • Old dumps from community members: In progress
    We have some leads. Needs followup.
  • Files copied to gluster cluster for access to labs: Done
    The last 5 good dumps are available in gluster storage at /publicdata-project of labstore1. It is up to date, and is accessible by any instance at /public/datasets.
  • Manage toolserver copies of dumps somehow: Not started
    Until recently everyone had their own copies of whatever dumps they wanted lying around, taking up lots of space and requiring extra downloads. They were discussing holding all dumps in one centralized location. Can we provide an rsync of the last 5 to them, or (ewww) make the gluster cluster available to them?
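
The "full run at six-month intervals" selection for the remote copies could be a small filter over run dates; a sketch (directory naming assumed to be the usual YYYYMMDD, function name illustrative):

```python
from datetime import datetime

def pick_semiannual(run_dates):
    """From a list of dump run dates ("YYYYMMDD" strings), keep only
    the first complete run in each six-month window (Jan-Jun, Jul-Dec)."""
    chosen = {}
    for d in sorted(run_dates):
        dt = datetime.strptime(d, "%Y%m%d")
        window = (dt.year, 0 if dt.month <= 6 else 1)
        chosen.setdefault(window, d)  # first run in the window wins
    return sorted(chosen.values())
```

E.g. pick_semiannual(["20110105", "20110320", "20110810", "20120102"]) keeps one run per half-year: ["20110105", "20110810", "20120102"].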

Speed

  • de wp takes too long. Folks using the dump aren't interested in parallel runs. Needs discussion.
  • be able to skip parts of prefetch files that are irrelevant, by locating the pageid in the appropriate bz2 block. In progress

Robustness

  • Can rerun a specified checkpoint file, or rerun from that point on: Done
  • Safeguards against wrong or corrupt text in the XML files: In progress
    Need to use sha1 hash, as soon as that code in core is deployed and column populated.
  • Automated random spot checks of dump file content: Not started
  • Restore missing db rev text from older dumps where possible: Not started
  • Scheduled and regular testing of dumps before new MW code deployment: Not started
  • Test suite for dumps: In progress. We now have a contractor, yay! See Dumps/Testing
  • Easy deployment of new python scripts while current jobs are running: In progress
    Need to finish migration to new deployment setup, make sure worker can exit gracefully on demand
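
The sha1 safeguard above amounts to recomputing the hash of each dumped revision and comparing it against the database column. A minimal sketch, assuming the base-36, 31-character encoding MediaWiki uses for rev_sha1 (function names are illustrative):

```python
import hashlib

def base36_sha1(text: bytes) -> str:
    """SHA-1 of revision text, base-36 encoded and zero-padded to 31
    characters -- the encoding assumed for the rev_sha1 column."""
    n = int(hashlib.sha1(text).hexdigest(), 16)
    digits = "0123456789abcdefghijklmnopqrstuvwxyz"
    out = ""
    while n:
        n, r = divmod(n, 36)
        out = digits[r] + out
    return out.rjust(31, "0")

def text_looks_sane(db_sha1: str, revision_text: bytes) -> bool:
    """Flag wrong or corrupt XML text by comparing against the db hash."""
    return base36_sha1(revision_text) == db_sha1
```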

Configuration, running, monitoring

  • Make wikitech docs for dumps suck less: Done
  • Start and stop runs cleanly via script: Done
    We can restart a dump from a given stage now or from any checkpoint and have it run through to completion.
  • Stats for number of downloads, bandwidth, bot downloads: Not started
  • Automated notification when run hangs: Not started
  • Packages for needed php modules etc., puppetization: In progress
    Need to update the "writeuptopageid" package, and mwbzutils needs to be packaged; everything else other than the actual backup scripts is packaged and puppetized.
  • Docs and sample conf files for backup scripts: Done

Enhancement

  • Assign priorities to requests for new fields in dumps, implement: Not started
    See [2].
  • Incremental dumps: In Progress
    Have deployed adds/changes content dumps, does not include deletes/undeletes, also not robust right now.
  • Multistream Bz2 dumps of pages-articles for all wikis, plus scripts to put them to use: In Progress
    Running for enwiki; need to deploy for the rest. A demo [3] using them is now available.
  • Full image dumps: In progress
    copy of production media to server for rsync: done
    rsync to external mirrors setup: done
    generation of tarballs per wiki: first run happening now
    script to "http-sync" media once it's in Swift: in progress
    Old plans: Dumps/Image dumps plans 2012