Site issue Sept 16 2012
Outage Summary
Wikimedia sites experienced 3 episodes of intermittent performance and brief unavailability on 16th & 17th August 2012. The first two incidents were reported on 16th at 13:22 UTC and 15:30 UTC. The third incident was on 17th September at 10:40 UTC.
- Duration: The durations of those incidents were 1.5 hours, 1.5 hours and 3.5 hours respectively
- Impact: During those episodes, editors on our Wikimedia sites lost their sessions intermittently and briefly. While readers were mostly not impacted, a small percentage of readers did experience response lag that caused pages not being rendered due to time-out.
- Cause: There were load spikes on the Apache servers causing them to swap intensively and fail intermittently.
- The zhwiki job queue has been growing quicker than the job runners can run them.
- At the moment there are 380303 jobs in the queue, with 808 distinct job_titles. Duplicate removal has been broken for a long time (ever since we had jobs with start/end parameters).
- Template DOM cache issue
- Resolution:
Detail
Below are the reports from Mark, Tim and Asher, following the incidents.
From Mark (following the 1st incident):
- We've just had a site outage, starting around 13:22 UTC when the MediaWiki/Apache cluster became unhappy, and as a result of that, other clusters such as the Text Squids as well. Leslie, Peter and I started looking at it when we got woken up by the texts. The Apaches were running into swap death, causing general unresponsiveness across the board, including the Memcached instances on those boxes and retries/refetches from the Squids, and thus amplifying the problem further. All we really knew was that Apache processes were rapidly growing in memory size, probably PHP. Debugging of the problem was hindered by the fact that these boxes would become completely unresponsive quickly.
- Some measures against the symptoms were taken such as restarting Apache and power cycling unresponsive boxes, but this was of course not helping to fix the root of the problem. As often, the problem would sometimes disappear eventually with the site becoming responsive again, but it never lasted more than 10 minutes.
- Eventually we found through tcpdump/ngrep on individual Apaches an unusually large amount of requests for:
/w/index.php?title=Template%3AAdvancedSiteNotices%2Fajax&variant=zh-tw&action=render
- on zh.wikipedia.org. This seems to be some site notice Gadget enabled on zhwiki, and this page was served as uncacheable. Andrew Garrett then disabled this Gadget. Although we can't definitively say this was the issue, the sites have been stable since.
- We should also really look into getting some better instrumentation for finding problematic requests I think. Currently it's rather hard to figure out what a problematic Apache/Mediawiki instance is really doing, or to find unusual/new request patterns.
From Tim:
- It was probably the template DOM cache, like the ocwiki problem. I've pushed a change to gerrit which will throw an exception if more than a certain number of DOM elements are created:
https://gerrit.wikimedia.org/r/23921 https://gerrit.wikimedia.org/r/23922 https://gerrit.wikimedia.org/r/23923
- It's configured with $wgMaxGeneratedPPNodeCount. The memory usage is probably somewhere in the vicinity of 100 to 300 bytes per element, so the default of 1 million will give a maximum of ~100-300MB memory usage. After deployment, the exception log (fluorine:/a/mw-log/exception.log) needs to be watched closely to make sure the change was not too disruptive to existing articles.
- If the problem happens again, deploying this change would probably be a good idea.
From: Mark following Sept 17 incident:
- So we just got woken up by another outage. The Apaches were not nearly out of memory this time, so it wasn't this problem again. PyBal on lvs4 seemed very unhappy with all apaches, depooling and repooling them at random. I noticed that the Parser cache has been at a consistent 1000ms latency for the past day:
- I cleared profiling data. Then noticed that Memcached gets were at roughly 50% of real time, with 2 down Memcached instances in the top profiling report, so I power cycled those. Actually I power cycled srv250 but instead srv249 went down. (Chris/Rob, please fix!) For some reason "mwscript mctest.php" was not complaining about those two memcached instances, which made me search elsewhere for a while first.
- After the two down memcached instances came back up, everything stabilized again, and PC hit latency was down to normal.
From Asher after further investigation:
- The zhwiki job queue has been growing quicker than the job runners can run them. I've manually deleted around 150k duplicates (our deduplication is completely broken due to the inclusion of mysql master positions in job_params which is used in full for the de-dupe attempt) but the trend still doesn't look good. The jobs are all of the refreshLinks2 variety and most are from template edits. At the moment there are 380303 jobs in the queue, with 808 distinct job_titles.
- I traced the full run of the following job "2012-09-18 06:25:51 refreshLinks2 Template:Country_data_JAM table=templatelinks start=818486 end=854177 masterPos=db54-bin.000375/120945239 t=41901 good"
- This one job took ~42 seconds to run (from 06:25:09.631586 to 06:25:51.544876) - It required 957 template inclusions (a few: Template:Country data Finland Template:Country data Fiji Template:Country data Falkland Islands Template:Country data Federated States of Micronesia) - 4439 SELECT mysql queries - 7463 memcached gets - 0 (zero) INSERT|REPLACE|UPDATE|DELETE mysql queries (including to parsercache) - 0 (zero) memcached writes (based on examining all sendto calls to all 79 open sockets to memcached)