Compress old revisions

From Wikitech
Revision as of 01:05, 13 February 2005

There is a script to compress old revisions. It has two modes: single revision compression, which brings space use down to about 50% of the original, and concatenated multiple revision compression, which brings it down to about 20%.

Concatenated multiple revision compression

This reduces the size of old records to about 20% of the original by combining multiple revisions and compressing them all into one record. It is not available as a configuration setting, so you need to apply it as a batch job.

  • cd /home/wikipedia/common/php-new/maintenance
  • nice php compressOld.php en wikipedia -e 20050108000000 -q " cur_namespace not in (10,11,14,15) " | tee -a /home/wikipedia/logs/compressOld/20050108enwiki

If the preceding run was interrupted after getting as far as Burke, it would be resumed by adding -a Burke: nice php compressOld.php en wikipedia -e 20050108000000 -q " cur_namespace not in (10,11,14,15) " -a Burke | tee -a /home/wikipedia/logs/compressOld/20050108enwiki

The -q " cur_namespace not in (10,11,14,15) " part is optional but should be used at present for Wikimedia hosted projects, while deletion and undeletion of articles with concatenated compressed revisions are unavailable. It disables concatenated compression of template and category pages and their talk pages, which are currently being changed at a high rate.
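The command line above can be assembled programmatically, which makes the excluded namespaces explicit. The following is only a sketch: the helper name build_compress_command and the EXCLUDED_NAMESPACES table are hypothetical, and the argument order simply mirrors the example commands in this section rather than anything confirmed about the script.

```python
import shlex

# MediaWiki namespace numbers matching the -q condition in the text:
# templates, categories and their talk pages.
EXCLUDED_NAMESPACES = {
    "Template": 10,
    "Template talk": 11,
    "Category": 14,
    "Category talk": 15,
}


def build_compress_command(end_date, excluded=EXCLUDED_NAMESPACES, first_article=None):
    """Return the argument list for a concatenated compression run.

    Hypothetical helper: it only reproduces the shape of the example
    commands shown above ("en wikipedia", -e, -q and optional -a).
    """
    ids = ",".join(str(n) for n in sorted(excluded.values()))
    args = [
        "nice", "php", "compressOld.php", "en", "wikipedia",
        "-e", end_date,
        # Surrounding spaces are kept exactly as in the documented example.
        "-q", f" cur_namespace not in ({ids}) ",
    ]
    if first_article is not None:
        # Resume from the given article title after an interrupted run.
        args += ["-a", first_article]
    return args


if __name__ == "__main__":
    # Print a shell-quoted version of the resume command.
    print(shlex.join(build_compress_command("20050108000000", first_article="Burke")))
```

Piping the output through tee -a to a log file, as in the examples, is left to the caller.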

It is not a problem to apply concatenated compression to records which are already compressed.
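To see why combining revisions compresses better than compressing each one alone, consider this minimal sketch. It uses Python's zlib on synthetic revisions, each of which is the previous revision plus one sentence. The revision generator and function names are assumptions made for illustration, and the sketch says nothing about the record format compressOld.php actually writes.

```python
import random
import zlib


def make_revisions(count=20, seed=0):
    """Build synthetic revisions: each one appends a sentence to the last."""
    rng = random.Random(seed)
    words = ["wiki", "article", "history", "revision", "text", "edit",
             "page", "link", "category", "template", "source", "note"]

    def sentence():
        return " ".join(rng.choice(words) for _ in range(12)) + ". "

    text = "".join(sentence() for _ in range(40))
    revisions = []
    for _ in range(count):
        text += sentence()
        revisions.append(text.encode("utf-8"))
    return revisions


def individual_size(revisions):
    """Total bytes when every revision is compressed on its own."""
    return sum(len(zlib.compress(r)) for r in revisions)


def concatenated_size(revisions):
    """Bytes when all revisions are joined and compressed as one record."""
    return len(zlib.compress(b"".join(revisions)))


if __name__ == "__main__":
    revs = make_revisions()
    print("uncompressed:", sum(len(r) for r in revs))
    print("individual:  ", individual_size(revs))
    print("concatenated:", concatenated_size(revs))
```

Because consecutive revisions of a page are mostly identical, the compressor can encode each later revision largely as references back to the earlier text, which is where the extra saving comes from. The exact ratios depend on the data and will not match the 50% and 20% figures quoted for real wiki text.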

Single revision compression

This produces about a 50% reduction and is also available automatically via a config file setting. Use the batch job to apply the compression to existing records if the setting wasn't on before.

  • cd /home/wikipedia/common/php-new/maintenance
  • nice php compressOld.php en wikipedia -t 1 -c 100 5467442
  • -t 1 : the time to sleep between batches, in seconds
  • -c 100: the number of old records per batch
  • 5467442: the old_id to start at, usually 1 for a first run. The current old_id is displayed as the job runs. If you stop the job, note the last value reached and use it to resume the job later. You get a warning for every record which has already been converted, so don't start much below the point you need.
  • A batch size of 5000 is OK off peak.
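The batch, sleep and resume behaviour described in the list above follows a common pattern, sketched below. This is not the script's code: batch_ranges, run and the process callback are hypothetical names, and the sketch assumes old_id values are processed in contiguous ranges.

```python
import time


def batch_ranges(start_id, max_id, batch_size):
    """Yield inclusive (low, high) old_id ranges of at most batch_size ids."""
    low = start_id
    while low <= max_id:
        high = min(low + batch_size - 1, max_id)
        yield low, high
        low = high + 1


def run(start_id, max_id, batch_size, sleep_seconds, process, sleep=time.sleep):
    """Process ids in batches, pausing between batches.

    Returns the last old_id reached, the value to note down so that an
    interrupted job can be resumed from that point later.
    """
    last_done = None
    for low, high in batch_ranges(start_id, max_id, batch_size):
        process(low, high)          # hypothetical per-batch work
        last_done = high
        print(f"reached old_id {high}")
        sleep(sleep_seconds)        # give the database a rest between batches
    return last_done
```

A larger batch size with the same sleep time does more work per pause, which is why a bigger value is acceptable off peak.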

Completed. This left about 40GB lost to fragmentation. It will take a table rebuild to free it, but that can't be done on Ariel using an InnoDB table because the rebuild would add 40GB to the tablespace for the copy.

Full options

* Usage:
*
* Non-wikimedia
* php compressOld.php [-t <type>] [-c <chunk-size>] [-b <begin-date>] [-e <end-date>] [-s <start-id>]
*     [-a <first-article>] [--exclude-ns0]
*
* Wikimedia
* php compressOld.php <database> [-t <type>] [-c <chunk-size>] [-b <begin-date>] [-e <end-date>] [-s <start-id>]
*     [-f <max-factor>] [-h <factor-threshold>] [--exclude-ns0] [-q <query-condition>]
*
* <type> is either:
*   gzip: compress revisions independently
*   concat: concatenate revisions and compress in chunks (default)
*
* <start-id> is the old_id to start from
*
* The following options apply only to the concat type:
*    <begin-date> is the earliest date to check for uncompressed revisions
*    <end-date> is the latest revision date to compress
*    <chunk-size> is the maximum number of revisions in a concat chunk
*    <max-factor> is the maximum ratio of compressed chunk bytes to uncompressed avg. revision bytes
*    <factor-threshold> is a minimum number of KB, where <max-factor> cuts in
*    <first-article> is the title of the first article to process
*    <query-condition> is an extra set of SQL query conditions for the article selection query

Database fragmentation

Because the compression reduces record sizes, it can result in substantial database record fragmentation. In the case of English language Wikipedia, the old text started at 80GB and was reduced to 40GB, but the MySQL InnoDB storage engine didn't make the space free for reuse by other tables in the tablespace.

The space can be fully freed by using alter table old engine=InnoDB, but this requires as much extra free space in the tablespace as the complete new copy of the table requires. If the space isn't available in the tablespace, the tablespace will be enlarged to make room. If you're short of disk space, that can be impossible or could leave insufficient space for temporary files and logs. In a multiple wiki situation it's best to apply the compression to the smallest wikis first, alter them to free the space, and move on up to larger sizes. By the time you get to the largest, you'll have freed much of the space they will need.
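The advantage of doing the smallest wikis first can be checked with a little arithmetic. The sketch below uses a deliberately simplified model: each rebuild needs free space equal to the compressed table size, the tablespace grows by any shortfall, and dropping the old copy returns its full allocated size for reuse. The function name and the example sizes are made up for illustration, and temporary files and logs are ignored.

```python
def tablespace_growth(tables, free_gb):
    """Return how many GB the tablespace must grow to rebuild all tables.

    tables is a list of (allocated_gb, compressed_gb) pairs in the order
    the rebuilds are run. This is a simplified model, not a measurement.
    """
    growth = 0
    for allocated, compressed in tables:
        if compressed > free_gb:
            # Not enough room for the new copy: the tablespace is enlarged.
            deficit = compressed - free_gb
            growth += deficit
            free_gb += deficit
        # The new copy consumes `compressed`; dropping the old one frees `allocated`.
        free_gb += allocated - compressed
    return growth


if __name__ == "__main__":
    # Hypothetical wikis: (space currently allocated, size after compression).
    wikis = [(80, 40), (10, 5), (20, 10)]
    smallest_first = sorted(wikis, key=lambda t: t[1])
    largest_first = sorted(wikis, key=lambda t: t[1], reverse=True)
    print("smallest first:", tablespace_growth(smallest_first, free_gb=5))
    print("largest first: ", tablespace_growth(largest_first, free_gb=5))
```

In this model, each small rebuild returns more space than it borrowed, so later and larger rebuilds start with more room and the tablespace needs to grow less.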

Alternatively, you can temporarily convert some tables to MyISAM using alter table tablename engine=MyISAM to move them out of the tablespace and into the normal free space, freeing space in the tablespace. Once the alter table for the big projects has completed you can use alter table tablename engine=InnoDB to convert them back to InnoDB.

A combination of both approaches, doing smaller wikis first and converting some tables in some wikis to MyISAM, may be necessary if space is very tight. For Wikimedia, the minimum safe free disk space is between 9 and 10GB. Even at 10GB there's the risk that a large set of temporary files can leave the server without sufficient log space and break replication.

If using MySQL version 4.1, there's also the option of putting each table into its own tablespace (the innodb_file_per_table setting). You'll still need enough free disk space for the copy of the table, but the main tablespace won't be expanded.
