Compress old revisions
(describe concatenated compression) |
(add normal output, note on disconnections) |
||
| Line 12: | Line 12: | ||
Not a problem to apply concatenated compression to records which are already compressed. | Not a problem to apply concatenated compression to records which are already compressed. | ||
| + | |||
| + | Normal operation looks like this: | ||
| + | |||
| + | <pre>[user@zwinger:/home/wikipedia/common/php-1.4/maintenance]$ nice php compressOld.php en wikipedia -e 20050108000000 -q " cur_namespace not in (10,11,14,15) " -a Cleanthes | tee -a /home/wikipedia/logs/compressOld/20050108enwiki | ||
| + | Depending on the size of your database this may take a while! | ||
| + | If you abort the script while it's running it shouldn't harm anything, | ||
| + | but if you haven't backed up your data, you SHOULD abort now! | ||
| + | |||
| + | Press control-c to abort first (will proceed automatically in 5 seconds) | ||
| + | Starting from Cleanthes | ||
| + | Starting article selection query cur_title >= 'Cleanthes' AND cur_namespace not in (10,11,14,15) ... | ||
| + | |||
| + | Cleanthes | ||
| + | Talk:Cleanthes | ||
| + | Wikipedia_talk:Cleanup ..................../...................././ | ||
| + | Waiting for 10.0.0.2 10.0.0.1 10.0.0.3 10.0.0.24 10.0.0.23 | ||
| + | Cleanup | ||
| + | MediaWiki_talk:Cleanup | ||
| + | Wikipedia:Cleanup ........../........../........../........../ | ||
| + | .........../........../.........../............./............../ | ||
| + | ............../............./............./............./............/ | ||
| + | ............/............/............./.............../ | ||
| + | </pre> | ||
| + | |||
| + | When there are a large number of revisions for an article it's possible that you'll lose connection (timeout) to one of the database servers. Restarting after that is harmless, if irritating: | ||
| + | |||
| + | <pre>Waiting for 10.0.0.2 10.0.0.1 A database error has occurred | ||
| + | Query: COMMIT | ||
| + | Function: Database::immediateCommit | ||
| + | Error: 2013 Lost connection to MySQL server during query (10.0.0.1) | ||
| + | |||
| + | Backtrace: | ||
| + | Database.php line 345 calls wfdebugdiebacktrace() | ||
| + | Database.php line 297 calls databasemysql::reportqueryerror() | ||
| + | Database.php line 1345 calls databasemysql::query() | ||
| + | Database.php line 1262 calls databasemysql::immediatecommit() | ||
| + | compressOld.inc line 249 calls databasemysql::masterposwait() | ||
| + | compressOld.inc line 226 calls waitforslaves() | ||
| + | compressOld.php line 74 calls compresswithconcat()</pre> | ||
| + | |||
| + | Cause is being investigated - may be servmon kills of the idle slave threads during long master operations, since it can take a long time to retrieve all old records sometimes, perhaps 400 seconds for 20,000 on a lightly loaded master. | ||
== Single revision compression == | == Single revision compression == | ||
Revision as of 02:00, 13 February 2005
There is a script to compress individual old revisions. Two modes, single revision compression (50% space use) and multiple (20% use).
Contents |
Concatenated multiple revision compression
This reduces the size of old records to about 20% of the original by combining multiple revisions and compressing them all into one record. Not available as a configuration setting so you need to apply it as a batch job.
- cd /home/wikipedia/common/php-new/maintenance
- nice php compressOld.php en wikipedia -e 20050108000000 -q " cur_namespace not in (10,11,14,15) " | tee -a /home/wikipedia/logs/compressOld/20050108enwiki
If the preceding run was interrupted after getting as far as Burke it would be resumed with nice php compressOld.php en wikipedia -e 20050108000000 -q " cur_namespace not in (10,11,14,15) " -a Burke | tee -a /home/wikipedia/logs/compressOld/20050108enwiki.
The -q " cur_namespace not in (10,11,14,15) " part is optional but should be used at present for Wikimedia hosted projects, while deletion and undeletion of articles with concatenated compressed revisions is unavailable. It disables concatenated compression of template and category pages and their talk pages, which are currently being changed at a high rate.
Not a problem to apply concatenated compression to records which are already compressed.
Normal operation looks like this:
[user@zwinger:/home/wikipedia/common/php-1.4/maintenance]$ nice php compressOld.php en wikipedia -e 20050108000000 -q " cur_namespace not in (10,11,14,15) " -a Cleanthes | tee -a /home/wikipedia/logs/compressOld/20050108enwiki Depending on the size of your database this may take a while! If you abort the script while it's running it shouldn't harm anything, but if you haven't backed up your data, you SHOULD abort now! Press control-c to abort first (will proceed automatically in 5 seconds) Starting from Cleanthes Starting article selection query cur_title >= 'Cleanthes' AND cur_namespace not in (10,11,14,15) ... Cleanthes Talk:Cleanthes Wikipedia_talk:Cleanup ..................../...................././ Waiting for 10.0.0.2 10.0.0.1 10.0.0.3 10.0.0.24 10.0.0.23 Cleanup MediaWiki_talk:Cleanup Wikipedia:Cleanup ........../........../........../........../ .........../........../.........../............./............../ ............../............./............./............./............/ ............/............/............./.............../
When there are a large number of revisions for an article it's possible that you'll lose connection (timeout) to one of the database servers. Restarting after that is harmless, if irritating:
Waiting for 10.0.0.2 10.0.0.1 A database error has occurred Query: COMMIT Function: Database::immediateCommit Error: 2013 Lost connection to MySQL server during query (10.0.0.1) Backtrace: Database.php line 345 calls wfdebugdiebacktrace() Database.php line 297 calls databasemysql::reportqueryerror() Database.php line 1345 calls databasemysql::query() Database.php line 1262 calls databasemysql::immediatecommit() compressOld.inc line 249 calls databasemysql::masterposwait() compressOld.inc line 226 calls waitforslaves() compressOld.php line 74 calls compresswithconcat()
Cause is being investigated - may be servmon kills of the idle slave threads during long master operations, since it can take a long time to retrieve all old records sometimes, perhaps 400 seconds for 20,000 on a lightly loaded master.
Single revision compression
This produces about a 50% reduction and is also available automatically via a config file setting. Use the batch job either to apply the compression if it wasn't on before.
- cd /home/wikipedia/common/php-new/maintenance
- nice php compressOld.php en wikipedia -t 1 -c 100 5467442
- -t 1 : the time to sleep between batches, in seconds
- -c 100: the number of old records per batch
- 5467442: the old_id to start at, usually 1 to start. Displayed as it runs, if you stop the job, note the last value reached and use it to resume the job later. You get a warning for every record which has already been converted, so don't start much below the point you need.
- batch size of 5000 is OK off peak
Completed. Left about 40GB lost to fragmentation. Will take a table rebuilt to free it but that can't be done on Ariel using an InnoDB table because it will add 40GB of space to the tablespace for the copy.
Full options
* Usage: * * Non-wikimedia * php compressOld.php [-t <type>] [-c <chunk-size>] [-b <begin-date>] [-e <end-date>] [-s <start-id>] * [-a <first-article>] [--exclude-ns0] * * Wikimedia * php compressOld.php <database> [-t <type>] [-c <chunk-size>] [-b <begin-date>] [-e <end-date>] [-s <start-id>] * [-f <max-factor>] [-h <factor-threshold>] [--exclude-ns0] [-q <query condition>] * * <type> is either: * gzip: compress revisions independently * concat: concatenate revisions and compress in chunks (default) * * <start-id> is the old_id to start from * * The following options apply only to the concat type: * <begin-date> is the earliest date to check for uncompressed revisions * <end-date> is the latest revision date to compress * <chunk-size> is the maximum number of revisions in a concat chunk * <max-factor> is the maximum ratio of compressed chunk bytes to uncompressed avg. revision bytes * <factor-threshold> is a minimum number of KB, where <max-factor> cuts in * <first-article> is the title of the first article to process * <query-condition> is an extra set of SQL query conditions for the article selection query
Database fragmentation
Because the compression reduces record sizes it can result in substantial database record fragmentation. In the case of English language Wikipedia the old text started at 80GB and was reduced to 40GB but the MySQL InnoDB storage engine didn't make the space free for reuse by other tables in the tablespace.
The space can be fully freed by using alter table old engine=InnoDB but this requires as much extra free space in the tablespace as the complete new copy of the table requuires. If the space isn't available in the tablespace, the tablespace will be enlarged to make room. If you're short of disk space that can be impossible or could leave insufficient space for temporary files and logs. In a multiple wiki situation it's best to apply the compression to the smallest wikis first, alter them to free the space, and move on up to larger sizes. By the time you get to the largest you'll have freed much of the space they will need.
Alternatively, you can temporarily convert some tables to MyISAM using alter table tablename engine=MyISAM to move them out of the tablespace and into the normal free space, freeing space in the tablespace. Once the alter table for the big projects has completed you can use alter table tablename engine=InnoDB to convert them back to InnoDB.
A combination of both doing smaller wikis first and converting some tables in some wikis to MyISAM may be necessary if space is very tight. For Wikimedia, the minimum safe free disk space is between 9 and 10GB. Even at 10GB there's the risk that a large set of temporary files can leave the server without sufficient log space and break replication.
If using MySQL version 4.1 there's also the option of putting each database into its own tablespace. You'll still need enough free space for the copy of the table but won't have the main tablespace size expanded.