Dumps/media

From Wikitech
Latest revision as of 13:40, 28 June 2012

Media dumps

Media are provided in two formats: as individual files on a flat filesystem, served over http/ftp/rsync by our mirror sites, and as a series of tarballs per wiki project, each series containing the media used by that specific wiki.

Servers

ms1001 in eqiad hosts the flat filesystem, and from here media are rsynced to our mirrors.

How generated

On site:

A cron job runs every day on snapshot3 doing the following:

For each wiki, we dump the image, imagelinks and redirects tables via /backups/imageinfo/wmfgetremoteimages.py. The files are written to /data/xmldatadumps/public/other/imageinfo/ on dataset2.

From those dumps we then generate the list of all remotely stored (i.e. on commons) media per wiki, using different args to the same script. These files are likewise written to /data/xmldatadumps/public/other/imageinfo/ on dataset2.
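The second step above boils down to a set difference: media titles a wiki uses, minus titles it stores locally, with redirects resolved first. The actual logic lives in wmfgetremoteimages.py (whose arguments are not documented here); this is only a minimal sketch of the idea, with hypothetical function and parameter names.

```python
# Hypothetical sketch of the per-wiki "remote media" derivation.
# The real implementation is wmfgetremoteimages.py; table contents
# here are stand-ins for the dumped image/imagelinks/redirects data.

def remote_media(used_titles, local_titles, redirects):
    """Return media titles used on a wiki but not stored locally.

    used_titles:  titles referenced in the wiki's imagelinks table
    local_titles: titles present in the wiki's own image table
    redirects:    mapping of redirect title -> target title
    """
    # Resolve each used title through the redirect table first,
    # so a redirected name counts as its target.
    resolved = {redirects.get(t, t) for t in used_titles}
    # Whatever is not stored locally must live on commons.
    return sorted(resolved - set(local_titles))
```

Anything left after the subtraction is assumed to be hosted remotely on commons.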

Remote:

Tarballs are generated on a server provided by Your.org and made available from that mirror. The rsynced copy of the media, together with an rsynced copy of the files described above (image/imagelinks/redirects info), is used as input to createmediatarballs.py, which creates two series of tarballs per wiki: one containing all locally uploaded media, and one containing all media uploaded to commons and used on the wiki.

Each series of tarballs should contain all such media for a given project, with names like enwiki-20120430-remote-media-1.tar, enwiki-20120430-remote-media-2.tar, and so on for remote media, and enwiki-20120430-local-media-1.tar, enwiki-20120430-local-media-2.tar, and so on for local media. We bundle the media into tarballs of 100k files each for the convenience of downloaders.
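The batching and naming scheme above can be sketched as follows. This is an illustration only, not the code of createmediatarballs.py; the function name and signature are hypothetical, and only the 100k-files-per-tarball split and the <wiki>-<date>-<kind>-media-<n>.tar naming come from the description above.

```python
# Illustrative sketch of the tarball batching scheme (hypothetical helper).

def tarball_batches(wiki, date, kind, files, per_tarball=100_000):
    """Yield (tarball_name, file_batch) pairs for one wiki's media.

    kind is "remote" or "local"; files is the full sorted file list.
    """
    for i in range(0, len(files), per_tarball):
        n = i // per_tarball + 1  # tarball numbering starts at 1
        name = "%s-%s-%s-media-%d.tar" % (wiki, date, kind, n)
        yield name, files[i:i + per_tarball]
```

For example, a wiki with 250,000 remote media files would get enwiki-20120430-remote-media-1.tar through -3.tar, the last one holding the remaining 50,000 files.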

Status

The media rsyncs are running reliably.

We are very early in the tarball production process and still working out bugs at both the hardware and software levels. The first run is still in progress and some tarball series are incomplete; this is strictly a use-at-your-own-risk situation at the moment.

Initially we will try to produce these tarballs once every few weeks or once a month.

Where are the commons tarballs?

Commons media is available via rsync. Making it available as a series of tarballs is not in our plans: unlike the other wikis, for commons it is easy to figure out what to rsync in order to have all of the media it uses, and the tarball series would be huge. We figure the benefits are outweighed by the resources that would be required to generate and host them.

Space needs

Producing tarballs of remote media for each wiki means there is potentially a lot of overlap of images in the tarballs. I would like us to keep 3 full runs of these, but I don't have a space estimate until the first run is complete. Maybe we'll be able to keep more; maybe we will be asked for additional bundles based on specific categories.
