PoolCounter


To avoid massive wastage of CPU due to parallel parsing when the cache of a popular article is invalidated (the "Michael Jackson problem"), we now have a pool counter which limits the number of processes that may work on a given parse operation at any given time.

Source

The abstract interface is in MediaWiki core (see $wgPoolCounterConf). The implementation we use is in Extension:PoolCounter.

  • The client source is in the MediaWiki extension itself.
  • The server source comes with the extension source, in the /daemon directory.
  • The server can be installed via APT, with the poolcounter package. The script that comes with the PoolCounter extension is in Debian-native form, so recompilation is just svn export && dpkg-buildpackage.

Architecture

The server is a single-threaded C program based on libevent. It does not use autoconf; it has only a makefile, which is suitable for a normal Linux environment.

The server currently has no daemonize code, and so is backgrounded by start-stop-daemon -b.

The network protocol is line-based, with parameters separated by spaces. The following commands are defined:

ACQ4ANY
This is used to acquire a lock when the client is capable of using the cache entry generated by another process. If the pool worker limit is exceeded, the server will give a delayed response to this command. When a client completes its work, all processes which are waiting with ACQ4ANY will immediately be woken so that they can read the new cache entry.
ACQ4ME
This is used to acquire a lock when cache sharing is not possible or not applicable, for example when a stub threshold is set. When a lock of this kind is released, only one waiting process will be woken, so as to keep the worker population the same.
RELEASE
Releases a lock.
STATS
Shows statistics.
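
The difference between the two acquire commands is in who gets woken on release. The following is a minimal Python sketch of that behaviour for a single lock, written from the description above and not taken from the daemon's C source. The class name LockModel, its method names and the return values are hypothetical, and mixing both acquire modes on one lock is a simplification for illustration.

```python
from collections import deque


class LockModel:
    """Toy in-memory model of one lock; not the daemon's actual code."""

    def __init__(self, workers):
        self.workers = workers   # maximum number of simultaneous holders
        self.holders = set()     # clients currently holding the lock
        self.waiters = deque()   # (client, mode) pairs in arrival order

    def acquire(self, client, mode):
        """mode is 'ACQ4ANY' or 'ACQ4ME'. Returns True if granted at once."""
        if len(self.holders) < self.workers:
            self.holders.add(client)
            return True
        # Worker limit reached: the response is delayed, so the client waits.
        self.waiters.append((client, mode))
        return False

    def release(self, client):
        """Release the lock and return (woken_any, woken_me)."""
        self.holders.discard(client)
        # All ACQ4ANY waiters are woken so they can read the new cache entry.
        woken_any = [c for c, m in self.waiters if m == "ACQ4ANY"]
        remaining = deque((c, m) for c, m in self.waiters if m == "ACQ4ME")
        # Only one ACQ4ME waiter is woken, keeping the worker population steady.
        woken_me = None
        if remaining and len(self.holders) < self.workers:
            woken_me, _ = remaining.popleft()
            self.holders.add(woken_me)
        self.waiters = remaining
        return woken_any, woken_me
```

With workers set to 1 and three ACQ4ANY clients, releasing the first client wakes both of the others at once; with three ACQ4ME clients, it wakes only the next one in line and the third keeps waiting.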

Configuration

The server does not require configuration. Configuration of pool sizes, wait timeouts, etc. is done dynamically by the client. The server currently runs on tarin. Installation of the poolcounter package is done via puppet.

The client settings we use are in Wikimedia's operations/mediawiki-config repository (in wmf-config/PoolCounterSettings.php):

$wgPoolCountClientConf

servers 
An array of server IP addresses. Adding multiple servers causes locks to be distributed on the client side using a consistent hashing algorithm.
timeout 
The connect timeout in seconds.
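
To illustrate what client-side distribution by consistent hashing means, here is a generic hash-ring sketch in Python. It is not the algorithm used by the extension, whose details are not given here; the class name HashRing, the replicas parameter, the choice of MD5 and the example addresses are all assumptions made for illustration.

```python
import bisect
import hashlib


def _hash(value):
    """Map a string to a 64-bit integer position on the ring."""
    digest = hashlib.md5(value.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


class HashRing:
    """Generic consistent-hash ring (illustrative, hypothetical API)."""

    def __init__(self, servers, replicas=100):
        # Each server is placed at several points to even out the load.
        self._ring = sorted(
            (_hash(f"{server}#{i}"), server)
            for server in servers
            for i in range(replicas)
        )
        self._points = [point for point, _ in self._ring]

    def server_for(self, key):
        """Return the server owning the first ring point after the key."""
        idx = bisect.bisect(self._points, _hash(key)) % len(self._ring)
        return self._ring[idx][1]


# Hypothetical addresses from the documentation range 192.0.2.0/24.
ring = HashRing(["192.0.2.1", "192.0.2.2", "192.0.2.3"])
print(ring.server_for("example-lock-key"))
```

The property that matters is that a given lock key always maps to the same server, and that removing one server only moves the keys that server owned.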

$wgPoolCounterConf

The key in this configuration array identifies the MediaWiki class which does the work. Currently only parsing is defined, and the parsing job has the ID "ArticleView". The following parameters must be given for each class:

class 
Must be PoolCounter_Client.
timeout 
The amount of time in seconds that a process should wait for a lock before it gives up and takes some other action. In the current implementation, the other action is to return stale HTML to the user, if it is available. If there is no stale cache entry, an error will be shown.
workers 
The maximum number of processes that may simultaneously hold the lock. Setting this to a value greater than 1 helps to prevent malfunctioning servers from degrading service time, at the expense of wasted CPU.
maxqueue 
The maximum number of processes that may wait for the lock. If this is exceeded, the effect is the same as an instant timeout. Setting this to a sufficiently low value prevents a lock which is held for a very long period of time from jeopardising the stability of the cluster as a whole.
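
The interaction of workers, maxqueue and the stale-HTML fallback can be summarised in a few lines of Python. This is a sketch of the rules as described above, not code from MediaWiki or the daemon; the function names, the returned strings and the exact way waiters are counted against maxqueue are assumptions.

```python
def admit(holders, waiters, workers, maxqueue):
    """Decide what happens to a new request for a lock.

    holders: processes currently holding the lock
    waiters: processes currently waiting for it
    """
    if holders < workers:
        return "acquire"      # a worker slot is free
    if waiters >= maxqueue:
        return "queue_full"   # treated like an instant timeout
    return "wait"             # wait up to the configured timeout


def fallback(stale_html):
    """Action after a timeout or full queue, as described for ArticleView."""
    if stale_html is not None:
        return stale_html     # serve the stale cache entry
    raise RuntimeError("no stale cache entry available")
```

For example, with workers set to 2 and maxqueue set to 5, a request arriving while two processes hold the lock and five are waiting is rejected at once and falls through to the stale-HTML path.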

Testing

$ echo 'STATS FULL' | nc -w1 tarin 7531 
uptime: 633 days, 15209h 42m 26s
total processing time: 85809 days 2059430h 0m 24.000000s
average processing time: 0.957994s
gained time: 1867 days 44820h 50m 24.000000s
waiting time: 390 days 9365h 18m 24.000000s
waiting time for me: 389 days 9343h 3m 28.000000s
waiting time for anyone: 22h 14m 53.898438s
waiting time for good: 520 days 12503h 48m 24.000000s
wasted timeout time: 473 days 11375h 2m 44.000000s
total_acquired: 7739031655
total_releases: 7736374042
hashtable_entries: 119
processing_workers: 119
waiting_workers: 216
connect_errors: 0
failed_sends: 1
full_queues: 10294544
lock_mismatch: 227
release_mismatch: 0
processed_count: 7739031536

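The hour counts in this output are unreliable: judging by the figures, each hour field appears to show the cumulative number of hours rather than the remainder after whole days (633 days is 15192 hours, so "15209h" would correspond to 17 hours). The bug was fixed in version control, but the fix was never deployed by repackaging.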
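
When reading this output from a script, the duration fields need some care. The Python sketch below parses the "key: value" lines and converts a duration to seconds. It assumes that, whenever a day count is present, the hour field is cumulative and has to be reduced modulo 24; that assumption is inferred from the sample output above and is not confirmed by the server source. The function names are hypothetical.

```python
import re


def parse_stats(text):
    """Split STATS output into a dict of raw string values."""
    stats = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if sep:
            stats[key.strip()] = value.strip()
    return stats


# Matches e.g. "633 days, 15209h 42m 26s", "22h 14m 53.898438s", "0.957994s".
_DURATION = re.compile(
    r"(?:(\d+) days,? )?(?:(\d+)h )?(?:(\d+)m )?([\d.]+)s$"
)


def duration_seconds(value):
    """Convert a duration field to seconds, working around the hour bug."""
    match = _DURATION.match(value)
    if match is None:
        raise ValueError(f"unrecognised duration: {value!r}")
    days, hours, minutes, seconds = match.groups()
    days = int(days or 0)
    hours = int(hours or 0)
    if days:
        # Assumption: the hour field already includes the whole days.
        hours %= 24
    return days * 86400 + hours * 3600 + int(minutes or 0) * 60 + float(seconds)
```

A call such as duration_seconds(parse_stats(output)["uptime"]) then gives the uptime in seconds, where output is the text returned by the server.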
