Incident response

From Wikitech
Revision as of 02:09, 24 June 2011 by Tim (Talk | contribs)


OMG THE SITE'S DOWN!!!! What to do?????

In overview:

  • Diagnose
  • Fix
  • Communicate

in approximately that order.

These tasks can all be interleaved. Speculative or temporary fixes can be applied before a full diagnosis is made. Analysis of root causes can be complicated and is often best left until after the site is back up. Operational communication (server admin log, IRC, etc.) starts immediately, and community communication (#wikimedia-tech topic, mailing list) might start about 5-10 minutes in.


Diagnosis

Diagnosis should always start by observing the symptoms.

  • Open Ganglia, reqstats and the site itself in separate browser tabs.
  • Carefully read the reports from users, which typically come in on #wikimedia-tech. Ask for clarifications if they are unclear.

Ganglia is by far the most useful and important diagnosis tool. Interpreting it is complex but essential. Request rate statistics (e.g. reqstats) are useful to get a feel for the scale of the problem, and to confirm that the user reports are representative and not just confined to a few vocal users. Viewing the site itself is the least useful diagnosis tool, and can often be left out if the user reports are clear and trustworthy.

Shell-based tools such as MySQL "show processlist", strace, tcpdump, etc. are useful for providing more detail than Ganglia. However, they are potential time-wasters. Unfamiliarity with the ordinary output of these tools can lead to misdiagnosis. Complex chains of cause and effect can lead responders on a wild goose chase, especially when they are unfamiliar with the system.
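One way to keep "show processlist" from becoming a time-waster is to filter its output down to the queries that are actually symptomatic. A minimal sketch in Python, assuming the tab-separated output of `mysql -e "SHOW PROCESSLIST"`; the function name and the 30-second threshold are illustrative, not an existing tool:

```python
def long_running_queries(processlist, threshold=30):
    """Return (id, seconds, query) tuples for queries running longer
    than `threshold` seconds, given tab-separated SHOW PROCESSLIST output."""
    rows = [line.split("\t") for line in processlist.strip().splitlines()]
    header = rows[0]
    id_i = header.index("Id")
    cmd_i = header.index("Command")
    time_i = header.index("Time")
    info_i = header.index("Info")
    result = []
    for row in rows[1:]:
        if row[cmd_i] == "Sleep":
            # Idle connections routinely show large Time values; they
            # are normal, not a symptom.
            continue
        if int(row[time_i]) > threshold:
            result.append((row[id_i], int(row[time_i]), row[info_i]))
    return result
```

Filtering out Sleep rows matters: a pool of idle connections with high Time values looks alarming to someone unfamiliar with the ordinary output, and is exactly the sort of thing that leads to misdiagnosis.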

Failure modes

Fast fail

Requests fail quickly. Backend resource utilisation drops by a large factor. Frontend request rate typically drops slightly, due to people going away when they see the error message, instead of following links and generating more requests. Frontend network traffic should drop significantly if the error messages are smaller than the average response size.

Example causes:

  • Someone pushes out a PHP source file with a syntax error in it
  • An essential TCP service fails with an immediate "connection refused"

Overload

This is the most common cause of downtime. Overload occurs when the demand for a resource outstrips the supply. The queue length increases at a rate given by the difference between the demand rate and the supply rate.

The growth of the queue length in this situation is limited by two things:

  • Client disconnections. The client may give up waiting and voluntarily leave the queue.
  • Queue size limits. Once the queue reaches some size, something will happen that stops it from growing further. Ideally, this will be a rapidly-served error message. In the worst case, the limit is when the server runs out of memory and crashes.

As long as the server does not have some pathology at high queue sizes (such as swapping), it is normal for some percentage of requests to be properly served during an overload. However, if queue growth is limited by timeouts, the FIFO nature of a queue means that service times will be very long, approximately equal to the average timeout.
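The timeout-limited behaviour above can be seen in a toy discrete-time simulation of a FIFO queue (all rates and the timeout are made-up illustrative numbers, not Wikimedia figures): demand exceeds supply, the queue backs up until client timeouts cap its growth, and the requests that do get served have each waited nearly the full timeout.

```python
def simulate(demand=100, supply=60, timeout=30, seconds=300):
    """Overloaded FIFO queue with impatient clients.

    Returns (final queue length, average wait of recently served requests)."""
    queue = []   # arrival timestamps, oldest first
    waits = []   # observed wait times of served requests
    for t in range(seconds):
        queue.extend([t] * demand)                     # arrivals this second
        queue = [a for a in queue if t - a < timeout]  # clients give up waiting
        for _ in range(min(supply, len(queue))):       # serve from the head
            waits.append(t - queue.pop(0))
    tail = waits[-supply * 10:]                        # steady-state sample
    return len(queue), sum(tail) / len(tail)

qlen, avg_wait = simulate()
# In steady state, avg_wait sits just under the 30-second client timeout,
# even though 60 of every 100 requests are still served successfully.
```

This is the point of the paragraph above: the service rate can remain respectable during an overload, but because the queue is FIFO and capped by timeouts, every served request pays close to the full timeout in latency.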

There are two kinds of overload causes:

Increase in demand 
For example: news event, JavaScript code change, accidental DoS due to an individual running expensive requests, deliberate DoS.
Reduction in supply 
For example: code change causing normal requests to become more expensive, hardware failure, daemon crash and restart, cache clear.

It can be difficult to distinguish between these two kinds of overload.

Note that for whatever reason, successful, deliberate DoS is extremely rare at Wikimedia. If you start with an assumption that the problem is due to stupidity, not malice, you're more likely to find a rapid and successful solution.

Common overload categories

Somewhere in the system, a resource has been exhausted. Problems will extend from the root cause, up through the stack to the user. Low utilisation will extend down through the stack to unrelated services.

For example, if MySQL is slow:

  • Looking up the stack, we will see overloads in MySQL and Apache, and error messages generated in Squid.
  • Looking down the stack, the overload in Apache will cause a large drop in the utilisation of unrelated services such as search.

Squid connection count

For the squid pool, the client is the browser, and disconnections occur both due to humans pressing the "stop" button, and due to automated timeouts. It's rare for any queue size limit to be reached in squid, since queue slots are fairly cheap. Squid's client-side timeouts tend to prevent the queue from becoming too large.

CPU overload in Squid could be the result of deliberate DoS, or it could be due to a spike in demand, say due to a JavaScript application change.

Squid should almost never be restarted during an overload. Restarting Squid makes it run slower, exacerbating the overload condition.

Apache process count

For the apache pool, the client is squid. Squid typically times out and disconnects after 30 seconds, then it begins serving HTTP 503 responses. However, when Squid disconnects, the PHP process is not destroyed (ignore_user_abort=true). This helps to maintain database consistency, but the tradeoff is that the apache process pool can become very large, and often requires manual intervention to reset it back to a reasonable size.
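The size the pool balloons to can be estimated with Little's law: the number of busy apache processes is roughly the request arrival rate multiplied by how long each request holds its process. Because PHP keeps running after Squid disconnects (ignore_user_abort), a slow backend inflates the hold time well past Squid's 30-second patience. A back-of-the-envelope sketch, with purely illustrative numbers:

```python
def busy_workers(arrival_rate, hold_time_s):
    """Expected concurrent apache processes, by Little's law:
    concurrency = arrival rate (req/s) * time each request holds a process (s)."""
    return arrival_rate * hold_time_s

# Healthy backend: requests release their process quickly.
normal = busy_workers(arrival_rate=500, hold_time_s=0.25)      # 125 processes

# Backend failing slowly: each PHP process is held for the full slow
# request, even after Squid has disconnected and served a 503.
slow_backend = busy_workers(arrival_rate=500, hold_time_s=60)  # 30000 processes
```

The two-orders-of-magnitude jump is why a slow backend failure shows up so clearly in the total process count, and why the pool does not shrink on its own once the root cause is fixed: the long-running processes must be killed.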

An apache process pool overload can easily be detected by looking at the total process count in ganglia.

Regardless of the root cause, an apache process pool overload should be dealt with by regularly restarting the apache processes using /home/wikipedia/bin/apache-restart-all. In an overload situation, the bulk of the process pool is taken up with long-running requests, so restarting kills more long-running requests than short requests. Regular restarting of apache allows parts of the site which are still fast to continue working.

Regular restarting is somewhat detrimental to database consistency, but the effects of this are relatively minor compared to the site being completely down.

There are two possible reasons for an apache process pool overload:

  • Some resource on the apache server itself has been exhausted, usually CPU.
  • Apache is acting as a client for some backend, and that backend is failing in a slow way.

If CPU usage on most servers is above 90%, and CPU usage has plateaued (i.e. it has stopped bouncing up and down due to random variations in demand), then you can assume that the problem is an apache CPU overload. Otherwise, the problem is with one of the many remote services that MediaWiki depends on.
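The rule of thumb above can be written down as a small check. This is a hedged sketch, assuming recent per-interval cluster CPU percentages from Ganglia; the 90% and 2-point thresholds are assumptions for illustration, not established policy:

```python
def looks_like_cpu_overload(cpu_samples, high=90.0, flat=2.0):
    """Heuristic: sustained, flat CPU above `high` percent suggests an
    apache CPU overload; anything else points at a backend service.

    cpu_samples: recent cluster CPU percentages, oldest first."""
    if min(cpu_samples) <= high:
        return False  # still dipping below the high-water mark: not saturated
    spread = max(cpu_samples) - min(cpu_samples)
    # A plateau means the random bouncing from demand variation has stopped.
    return spread < flat

looks_like_cpu_overload([97.5, 98.0, 97.8, 98.1])  # plateaued above 90%
looks_like_cpu_overload([95.0, 70.0, 92.0, 88.0])  # still bouncing around
```

The "stopped bouncing" test matters: high-but-variable CPU means demand variation is still visible, so the CPUs are not the binding constraint.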
