Dumps/Development 2012
From Wikitech
This is a detailed explanation of ongoing tasks and their status, as of the start of January 2012.
Backups, availability
- Files backed up to a local host: In progress
- Currently a copy of the XML dumps through November of 2010 is on tridge, our backup host. There is not room to copy over new files; this is an interim measure only.
- Dataset1001 is not ready for install; it will get a full rsync of all files as soon as it's ready.
- Dataset1 is still out of order. I've given up all hope on it.
- Files backed up to remote storage: In progress
- We have a copy of the latest complete dumps from November 2010 or before copied over to Google Storage. This does not include private wikis or private files, so it is not a complete solution. Additionally, we expect to retain more copies of XML files than we copy over to Google.
- We have been advised to use specific naming schemes for Google Storage "buckets" which cannot be preempted by other users; the files are being moved now to a bucket with this naming scheme. A script for doing the bimonthly copy from the XML dumps server is ready but needs to be updated with the new naming scheme.
- We need to get developer keys for each developer who might run the script; this process is also underway.
- We can and should copy files up to archive.org; we should contact folks there about an API for facilitating this. Waiting for contact.
- Mirroring of the files: In progress
- We have had discussions with a couple of folks about possible mirrors; again, this would cover only public files. Needs followup. More info: Dumps/Mirror status.
- We have one mirror site now (yay!) but there are some hiccups with the rsync; we're looking into it.
- Make old dumps from every six months or so (2002 through 2009) available: In progress
- Old dumps from community members: In progress
- We have some leads. Needs followup.
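The bimonthly copy to Google Storage mentioned above boils down to invoking gsutil once per public dump file. A minimal sketch, with a hypothetical bucket name and helper names standing in for the real reserved-scheme bucket and the actual script:

```python
import subprocess

# Hypothetical bucket name for illustration only; the real bucket uses
# the reserved naming scheme described above.
BUCKET = "gs://example-wmf-dumps"

def copy_command(local_path, wiki, dump_date):
    """Build the gsutil argv to upload one public dump file to the bucket."""
    return ["gsutil", "cp", local_path, f"{BUCKET}/{wiki}/{dump_date}/"]

def upload(local_path, wiki, dump_date):
    """Run the copy; requires gsutil installed and a developer key
    configured on the host running the script."""
    subprocess.check_call(copy_command(local_path, wiki, dump_date))
```

Since each developer running the script authenticates with their own key, keeping the invocation this simple makes it easy to run from any host that has the files.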
Speed
- The de wp (German Wikipedia) dump takes too long, but folks using the dump aren't interested in parallel runs. Needs discussion.
- Be able to skip irrelevant parts of prefetch files by locating the page ID in the appropriate bz2 block: In progress
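The prefetch item above can be sketched in Python. In a standard bz2 file, blocks are not byte-aligned, so real tooling has to hunt for block boundaries bit by bit; with the multistream format described under Enhancement, each block is an independent bz2 stream at a known byte offset, and the skip reduces to a seek plus a single-stream decompress. A rough sketch (function names are illustrative, not the actual dump code):

```python
import bz2

def decompress_stream_at(path, offset):
    """Decompress the single bz2 stream beginning at byte `offset`,
    ignoring any further streams that follow it in the file."""
    dec = bz2.BZ2Decompressor()
    chunks = []
    with open(path, "rb") as f:
        f.seek(offset)
        while not dec.eof:
            data = f.read(64 * 1024)
            if not data:
                break
            chunks.append(dec.decompress(data))
    return b"".join(chunks)

def stream_contains_pageid(path, offset, pageid):
    """True if the stream at `offset` holds the page with this ID."""
    xml = decompress_stream_at(path, offset)
    return (f"<id>{pageid}</id>").encode() in xml
```

With block offsets in hand, a prefetch pass can binary-search for the block covering a given page ID instead of decompressing everything before it.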
Robustness
- Can rerun a specified checkpoint file, or rerun from that point on: Done
- Safeguards against wrong or corrupt text in the XML files: In progress
- Need to use ms5 hash, as soon as that code in core is deployed.
- Automated random spot checks of dump file content: Not started
- Restore missing db rev text from older dumps where possible: Not started
- Scheduled and regular testing of dumps before new MW code deployment: Not started
- Test suite for dumps: In progress. We now have a contractor, yay! More info: Dumps/Testing
Configuration, running, monitoring
- Make wikitech docs for dumps suck less: Done
- Start and stop runs cleanly via script: Done
- We can restart a dump from a given stage or from any checkpoint and have it run through to completion.
- Stats for number of downloads, bandwidth, bot downloads: Not started
- Automated notification when run hangs: Not started
- Packages for needed php modules etc., puppetization: In progress
- The "writeuptopageid" package needs updating and mwbzutils needs to be packaged; everything else other than the actual backup scripts is packaged and puppetized.
- Docs and sample conf files for backup scripts: Done
Enhancement
- Assign priorities to requests for new fields in dumps, implement: Not started
- See [1].
- Incremental dumps: In progress
- Adds/changes content dumps have been deployed; they do not include deletes/undeletes and are not yet robust.
- Dumps of image subsets: Not started
- Multistream bz2 dumps of pages-articles for all wikis, plus scripts to put them to use: In progress
- Running for enwiki; need to deploy for the rest, need scripts to use them, and need to announce once a rough package is available.
- Full image dumps: Not started
- emijrp's suggestion that we create a series of 200GB tarballs that others can upload/mirror sounds like the most feasible. This could happen once dataset1001 is up, working off the rsynced copy on ms1002 for now. We'll need to rethink it when SWIFT comes online.
- gmaxwell is working on getting his disk array up and running again so that we can rsync off to him.
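A script to put the multistream dumps to use might look like the following sketch: look up a title in the index file (assuming lines of the form offset:pageid:title), then decompress just the one bz2 stream at that byte offset. Names here are illustrative, not the actual tooling:

```python
import bz2

def find_offset(index_path, title):
    """Scan a multistream index (lines of offset:pageid:title) for the
    byte offset of the bz2 stream containing `title`."""
    with open(index_path, encoding="utf-8") as idx:
        for line in idx:
            offset, pageid, name = line.rstrip("\n").split(":", 2)
            if name == title:
                return int(offset)
    return None

def fetch_block(dump_path, offset):
    """Decompress the one bz2 stream starting at `offset` and return
    its XML (a batch of <page> elements) as text."""
    dec = bz2.BZ2Decompressor()
    out = []
    with open(dump_path, "rb") as f:
        f.seek(offset)
        while not dec.eof:
            chunk = f.read(64 * 1024)
            if not chunk:
                break
            out.append(dec.decompress(chunk))
    return b"".join(out).decode("utf-8")
```

Splitting the title on at most two colons keeps titles that themselves contain colons (e.g. namespaced pages) intact; a real tool would sort or index the offsets rather than scan linearly.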