Dumps/media

From Wikitech
Revision as of 12:40, 18 May 2012 by ArielGlenn (Talk | contribs)


Media dumps

Media is provided in two formats: as individual files on a flat filesystem, served over http/ftp/rsync by our mirror sites, and as a series of tarballs per wiki project containing the media used by that wiki.

Servers

ms1001 in eqiad hosts the flat filesystem; media are rsynced from there to our mirrors.

How generated

On site:

A cron job runs every day on snapshot3 doing the following:

For each wiki, we dump the image, imagelinks and redirects tables via /backups/imageinfo/wmfgetremoteimages.py. Files are written to /data/xmldatadumps/public/other/imageinfo/ on dataset2.

From the above we then generate the list of all remotely stored (i.e. on commons) media per wiki, using different arguments to the same script.

Both sets of files (the table dumps and the remote media lists) are written to /data/xmldatadumps/public/other/imageinfo/ on dataset2.
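As a rough illustration of the second step, the list of remotely stored media can be thought of as a set operation over the dumped tables. The sketch below is not taken from wmfgetremoteimages.py; the function name, its arguments and the sample file names are all hypothetical, and it assumes that "remote" means "linked from the wiki, not uploaded locally, and present on commons".

```python
def remote_media(linked_names, local_names, commons_names):
    """Return the sorted names of media used on a wiki but stored on commons.

    Hypothetical helper, not the logic of the real script:
      linked_names  - media names referenced by the wiki's pages (imagelinks)
      local_names   - media names uploaded to the wiki itself (image table)
      commons_names - media names that exist on commons
    """
    linked = set(linked_names)
    local = set(local_names)
    commons = set(commons_names)
    # Used on the wiki, not hosted locally, and actually present on commons.
    return sorted((linked - local) & commons)


if __name__ == "__main__":
    # Made-up sample data for illustration only.
    used = ["A.jpg", "B.png", "C.svg", "Missing.gif"]
    local = ["A.jpg"]
    commons = ["B.png", "C.svg", "D.ogg"]
    for name in remote_media(used, local, commons):
        print(name)
```

Names that are linked but exist in neither place (Missing.gif above) are dropped. Redirect handling is left out of this sketch.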

Remote:

Tarballs are generated on a server provided by Your.org and made available from that mirror. The rsynced copy of the media itself and an rsynced copy of the above files (image/imagelinks/redirs info) are used as input to createmediatarballs.py, which creates two series of tarballs per wiki: one containing all locally uploaded media, and the other containing all media uploaded to commons and used on the wiki.

Taken together, the tarballs from one run should contain all media for a given project. Names look like enwiki-20120430-remote-media-1.tar, enwiki-20120430-remote-media-2.tar and so on for remote media, and enwiki-20120430-local-media-1.tar, enwiki-20120430-local-media-2.tar and so on for local media. We bundle the media into tarballs of 100k files each for the convenience of the downloader.
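The naming and batching scheme can be sketched as follows. This is a minimal illustration, not the code of createmediatarballs.py: the function names and the per_tarball parameter are hypothetical, and the name pattern is inferred only from the example file names above.

```python
def tarball_name(wiki, date, kind, index):
    """Build a name such as enwiki-20120430-remote-media-1.tar.

    kind is "remote" or "local"; index starts at 1.
    """
    return f"{wiki}-{date}-{kind}-media-{index}.tar"


def plan_tarballs(wiki, date, kind, filenames, per_tarball=100_000):
    """Split filenames into consecutive batches and name each batch.

    Returns a list of (tarball name, list of file names) pairs. The last
    batch may hold fewer than per_tarball files.
    """
    plan = []
    for start in range(0, len(filenames), per_tarball):
        index = start // per_tarball + 1
        batch = filenames[start:start + per_tarball]
        plan.append((tarball_name(wiki, date, kind, index), batch))
    return plan


if __name__ == "__main__":
    # Tiny batch size so the grouping is easy to see.
    files = [f"file{i}.jpg" for i in range(5)]
    for name, batch in plan_tarballs("enwiki", "20120430", "remote", files, per_tarball=2):
        print(name, len(batch))
```

A downloader who only needs part of a wiki's media can use the numbered names to fetch and unpack one tarball at a time.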

Status

The media rsyncs are running reliably.

We are very early in the tarball production process and are still working out bugs at the hardware and software level. The first run is still in progress and some tarball series are incomplete. For the moment, use the tarballs at your own risk.
