Compress old revisions

From Wikitech
Revision as of 02:00, 13 February 2005 by Jamesday (Talk)

Jump to: navigation, search

There is a script to compress individual old revisions. Two modes, single revision compression (50% space use) and multiple (20% use).

Contents

Concatenated multiple revision compression

This reduces the size of old records to about 20% of the original by combining multiple revisions and compressing them all into one record. Not available as a configuration setting so you need to apply it as a batch job.

  • cd /home/wikipedia/common/php-new/maintenance
  • nice php compressOld.php en wikipedia -e 20050108000000 -q " cur_namespace not in (10,11,14,15) " | tee -a /home/wikipedia/logs/compressOld/20050108enwiki

If the preceding run was interrupted after getting as far as Burke it would be resumed with nice php compressOld.php en wikipedia -e 20050108000000 -q " cur_namespace not in (10,11,14,15) " -a Burke | tee -a /home/wikipedia/logs/compressOld/20050108enwiki.

The -q " cur_namespace not in (10,11,14,15) " part is optional but should be used at present for Wikimedia hosted projects, while deletion and undeletion of articles with concatenated compressed revisions is unavailable. It disables concatenated compression of template and category pages and their talk pages, which are currently being changed at a high rate.

Not a problem to apply concatenated compression to records which are already compressed.

Normal operation looks like this:

[user@zwinger:/home/wikipedia/common/php-1.4/maintenance]$ nice php compressOld.php en wikipedia -e 20050108000000 -q " cur_namespace not in (10,11,14,15) " -a Cleanthes | tee -a /home/wikipedia/logs/compressOld/20050108enwiki
Depending on the size of your database this may take a while!
If you abort the script while it's running it shouldn't harm anything,
but if you haven't backed up your data, you SHOULD abort now!

Press control-c to abort first (will proceed automatically in 5 seconds)
Starting from Cleanthes
Starting article selection query cur_title >= 'Cleanthes' AND  cur_namespace not in (10,11,14,15)  ...

Cleanthes
Talk:Cleanthes
Wikipedia_talk:Cleanup ..................../...................././
Waiting for 10.0.0.2 10.0.0.1 10.0.0.3 10.0.0.24 10.0.0.23
Cleanup
MediaWiki_talk:Cleanup
Wikipedia:Cleanup ........../........../........../........../
.........../........../.........../............./............../
............../............./............./............./............/
............/............/............./.............../

When there are a large number of revisions for an article it's possible that you'll lose connection (timeout) to one of the database servers. Restarting after that is harmless, if irritating:

Waiting for 10.0.0.2 10.0.0.1 A database error has occurred
Query: COMMIT
Function: Database::immediateCommit
Error: 2013 Lost connection to MySQL server during query (10.0.0.1)

Backtrace:
Database.php line 345 calls wfdebugdiebacktrace()
Database.php line 297 calls databasemysql::reportqueryerror()
Database.php line 1345 calls databasemysql::query()
Database.php line 1262 calls databasemysql::immediatecommit()
compressOld.inc line 249 calls databasemysql::masterposwait()
compressOld.inc line 226 calls waitforslaves()
compressOld.php line 74 calls compresswithconcat()

Cause is being investigated - may be servmon kills of the idle slave threads during long master operations, since it can take a long time to retrieve all old records sometimes, perhaps 400 seconds for 20,000 on a lightly loaded master.

Single revision compression

This produces about a 50% reduction and is also available automatically via a config file setting. Use the batch job either to apply the compression if it wasn't on before.

  • cd /home/wikipedia/common/php-new/maintenance
  • nice php compressOld.php en wikipedia -t 1 -c 100 5467442
  • -t 1 : the time to sleep between batches, in seconds
  • -c 100: the number of old records per batch
  • 5467442: the old_id to start at, usually 1 to start. Displayed as it runs, if you stop the job, note the last value reached and use it to resume the job later. You get a warning for every record which has already been converted, so don't start much below the point you need.
  • batch size of 5000 is OK off peak

Completed. Left about 40GB lost to fragmentation. Will take a table rebuilt to free it but that can't be done on Ariel using an InnoDB table because it will add 40GB of space to the tablespace for the copy.

Full options

* Usage:
*
* Non-wikimedia
* php compressOld.php [-t <type>] [-c <chunk-size>] [-b <begin-date>] [-e <end-date>] [-s <start-id>]
*     [-a <first-article>] [--exclude-ns0]
*
* Wikimedia
* php compressOld.php <database> [-t <type>] [-c <chunk-size>] [-b <begin-date>] [-e <end-date>] [-s <start-id>]
*     [-f <max-factor>] [-h <factor-threshold>] [--exclude-ns0] [-q <query condition>]
*
* <type> is either:
*   gzip: compress revisions independently
*   concat: concatenate revisions and compress in chunks (default)
*
* <start-id> is the old_id to start from
*
* The following options apply only to the concat type:
*    <begin-date> is the earliest date to check for uncompressed revisions
*    <end-date> is the latest revision date to compress
*    <chunk-size> is the maximum number of revisions in a concat chunk
*    <max-factor> is the maximum ratio of compressed chunk bytes to uncompressed avg. revision bytes
*    <factor-threshold> is a minimum number of KB, where <max-factor> cuts in
*    <first-article> is the title of the first article to process
*    <query-condition> is an extra set of SQL query conditions for the article selection query

Database fragmentation

Because the compression reduces record sizes it can result in substantial database record fragmentation. In the case of English language Wikipedia the old text started at 80GB and was reduced to 40GB but the MySQL InnoDB storage engine didn't make the space free for reuse by other tables in the tablespace.

The space can be fully freed by using alter table old engine=InnoDB but this requires as much extra free space in the tablespace as the complete new copy of the table requuires. If the space isn't available in the tablespace, the tablespace will be enlarged to make room. If you're short of disk space that can be impossible or could leave insufficient space for temporary files and logs. In a multiple wiki situation it's best to apply the compression to the smallest wikis first, alter them to free the space, and move on up to larger sizes. By the time you get to the largest you'll have freed much of the space they will need.

Alternatively, you can temporarily convert some tables to MyISAM using alter table tablename engine=MyISAM to move them out of the tablespace and into the normal free space, freeing space in the tablespace. Once the alter table for the big projects has completed you can use alter table tablename engine=InnoDB to convert them back to InnoDB.

A combination of both doing smaller wikis first and converting some tables in some wikis to MyISAM may be necessary if space is very tight. For Wikimedia, the minimum safe free disk space is between 9 and 10GB. Even at 10GB there's the risk that a large set of temporary files can leave the server without sufficient log space and break replication.

If using MySQL version 4.1 there's also the option of putting each database into its own tablespace. You'll still need enough free space for the copy of the table but won't have the main tablespace size expanded.

Personal tools
Namespaces

Variants
Actions
Navigation
Ops documentation
Wiki
Toolbox