Dumps/Known issues and wish list

From Wikitech

Revision as of 16:13, 11 January 2013

(This needs cleanup.)

Missing features

Currently, image tarballs *are* being made (if off-site).

There's an extension that can produce static HTML dumps separately, but it would take a long time to complete on today's en wiki. So it's languishing. Parsoid -> HTML may be the solution for this.

We really, really need better import tools; work on that is underway. Something that takes advantage of multiple cores to write multiple SQL files would be good.
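One way the multi-core idea could look, as a minimal sketch: split the work by row-id range and write one SQL file per range, with one worker per core. The `dump_range` function and the chunking scheme here are illustrative stand-ins, not the actual dump tooling.

```python
# Hypothetical sketch: write one SQL file per id range in parallel,
# one worker process per core. dump_range() stands in for whatever
# actually serializes rows to SQL.
import multiprocessing as mp

def dump_range(args):
    start, end, path = args
    # Stand-in for the real work: serialize rows with ids in [start, end)
    # into an INSERT statement in their own .sql file.
    with open(path, "w") as f:
        f.write(f"-- rows {start}..{end - 1}\n")
        f.write("INSERT INTO page VALUES\n")
        f.write(",\n".join(f"({i})" for i in range(start, end)) + ";\n")
    return path

def parallel_dump(total_rows, chunk, outdir="."):
    """Split [0, total_rows) into chunks and dump each chunk in parallel."""
    jobs = [
        (s, min(s + chunk, total_rows), f"{outdir}/chunk_{s:08d}.sql")
        for s in range(0, total_rows, chunk)
    ]
    with mp.Pool() as pool:  # defaults to one worker per core
        return pool.map(dump_range, jobs)
```

An importer could then load the resulting chunk files concurrently, which is where most of the wall-clock win would come from.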

"Incremental" dumps? We have the adds-changes dumps which are a starting point.

A wish list for the dumps is available here.

The list of outstanding bugs is here.

Limitations

The scripts in the /backups directory on the snapshot hosts are not updated by scap or any of the usual mechanisms, but there are now deploy scripts which aren't too bad.

The PHP scripts, in contrast, do get updated, and the updated versions will be invoked the next time worker.py starts up, i.e. on the next wiki project by date that is due for a run, on the next step of a given run, *or* in the middle of a given dump of revision texts if the text fetcher is restarted. This can lead to inconsistency in the output format. It might also be a problem because comprehensive testing of XML dumps is usually not done before a code push.

Notes

(This stuff may not be current.)

Probably not all error detection is working right now. Failures of the mysqldump runs are not detected; tar failures are not detected. <-- current?

Failures of dumpPages.php should be detected, but indirectly from the failure of mwdumper to parse its XML output. <-- current?
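Detecting these failures directly would mostly mean checking exit statuses at each step rather than inferring failure from a downstream parser such as mwdumper. A minimal sketch, where the wrapper and the way it is wired in are illustrative assumptions, not the actual dump code:

```python
# Hypothetical sketch of direct failure detection: run a dump step
# (mysqldump, tar, ...) via subprocess and check its exit status,
# instead of inferring failure from a downstream consumer.
import subprocess

def run_step(cmd, outfile):
    """Run one dump command, capturing stdout; raise if it exits nonzero."""
    with open(outfile, "wb") as out:
        proc = subprocess.run(cmd, stdout=out, stderr=subprocess.PIPE)
    if proc.returncode != 0:
        raise RuntimeError(
            f"{cmd[0]} failed with status {proc.returncode}: "
            f"{proc.stderr.decode(errors='replace').strip()}"
        )
    return outfile
```

Called as, say, `run_step(["mysqldump", ...], "db.sql")`, this would fail loudly at the point of the error instead of leaving a silently truncated output file for a later stage to trip over.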

  • The page XML dumps should be consistent: all three outputs draw from one input, which is drawn from one long SQL transaction plus supplementary data loads that should be independent of changes. Weeelll... we don't lock the tables, so don't count on this either.
  • The other SQL dumps are not going to be 100% time-consistent. But that's not too important.