Incident response
Revision as of 03:07, 24 June 2011
OMG THE SITE'S DOWN!!!! What to do?????
In overview:
- Diagnose
- Fix
- Communicate
in approximately that order.
The nature of the tasks is that they can all be interleaved. Speculative or temporary fixes can be applied before a full diagnosis is made. Analysis of root causes can be complicated and is often best left until after the site is back up. Operational communication (server admin log, IRC, etc.) starts immediately, and community communication (#wikimedia-tech topic, mailing list) might start about 5-10 minutes in.
Diagnosis
Diagnosis should always start by observing the symptoms.
- Open Ganglia, reqstats and the site itself in separate browser tabs.
- Carefully read the reports from users, which typically come in on #wikimedia-tech. Ask for clarifications if they are unclear.
Ganglia is by far the most useful and important diagnosis tool. Interpreting it is complex but essential. Request rate statistics (e.g. reqstats) are useful to get a feel for the scale of the problem, and to confirm that the user reports are representative and not just confined to a few vocal users. Viewing the site itself is the least useful diagnosis tool, and can often be left out if the user reports are clear and trustworthy.
Shell-based tools such as MySQL "show processlist", strace, tcpdump, etc. are useful for providing more detail than Ganglia. However, they are potential time-wasters. Unfamiliarity with the ordinary output of these tools can lead to misdiagnosis. Complex chains of cause and effect can lead responders on a wild goose chase, especially when they are unfamiliar with the system.
Failure modes
Fast fail
Requests fail quickly. Backend resource utilisation drops by a large factor. Frontend request rate typically drops slightly, due to people going away when they see the error message, instead of following links and generating more requests. Frontend network should drop significantly if the error messages are smaller than the average request size.
Example causes:
- Someone pushes out a PHP source file with a syntax error in it
- An essential TCP service fails with an immediate "connection refused"
Overload
This is the most common cause of downtime. Overload occurs when the demand for a resource outstrips the supply. The queue length increases at a rate given by the difference between the demand rate and the supply rate.
The growth of the queue length in this situation is limited by two things:
- Client disconnections. The client may give up waiting and voluntarily leave the queue.
- Queue size limits. Once the queue reaches some size, something will happen that stops it from growing further. Ideally, this will be a rapidly-served error message. In the worst case, the limit is when the server runs out of memory and crashes.
As long as the server does not have some pathology at high queue sizes (such as swapping), it is normal for some percentage of requests to be properly served during an overload. However, if queue growth is limited by timeouts, the FIFO nature of a queue means that service times will be very long, approximately equal to the average timeout.
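The queue arithmetic above is simple enough to sanity-check by hand. A throwaway sketch with made-up demand and supply rates (these numbers are illustrative, not measurements):

```shell
# Toy numbers: when demand exceeds supply, the queue grows at the
# difference between the two rates.
demand=1200   # requests arriving per minute (hypothetical)
supply=900    # requests served per minute (hypothetical)
echo "queue grows by $((demand - supply)) requests per minute"
```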
There are two kinds of overload causes:
- Increase in demand
- For example: news event, JavaScript code change, accidental DoS due to an individual running expensive requests, deliberate DoS.
- Reduction in supply
- For example: code change causing normal requests to become more expensive, hardware failure, daemon crash and restart, cache clear.
It can be difficult to distinguish between these two kinds of overload.
Note that for whatever reason, successful, deliberate DoS is extremely rare at Wikimedia. If you start with an assumption that the problem is due to stupidity, not malice, you're more likely to find a rapid and successful solution.
Common overload categories
Somewhere in the system, a resource has been exhausted. Problems will extend from the root cause, up through the stack to the user. Low utilisation will extend down through the stack to unrelated services.
For example, if MySQL is slow:
- Looking up the stack, we will see overloads in MySQL and Apache, and error messages generated in Squid.
- Looking down the stack, the overload in Apache will cause a large drop in utilisation of unrelated services such as search.
Squid connection count
For the squid pool, the client is the browser, and disconnections occur both due to humans pressing the "stop" button, and due to automated timeouts. It's rare for any queue size limit to be reached in squid, since queue slots are fairly cheap. Squid's client-side timeouts tend to prevent the queue from becoming too large.
CPU overload in Squid could be the result of deliberate DoS, or it could be due to a spike in demand, say due to a JavaScript application change.
Squid should almost never be restarted during an overload. Restarting Squid makes it run slower, exacerbating the overload condition.
Apache process count
For the apache pool, the client is squid. Squid typically times out and disconnects after 30 seconds, then it begins serving HTTP 503 responses. However, when Squid disconnects, the PHP process is not destroyed (ignore_user_abort=true). This helps to maintain database consistency, but the tradeoff is that the apache process pool can become very large, and often requires manual intervention to reset it back to a reasonable size.
An apache process pool overload can easily be detected by looking at the total process count in ganglia.
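If ganglia is unavailable, the same number can be pulled on a single box; a minimal sketch (the process name apache2 is taken from elsewhere on this page):

```shell
# Count apache worker processes on this host; during a pool overload this
# number climbs toward the configured limit. Prints 0 if apache2 is not
# running here.
count=$(ps -C apache2 --no-headers 2>/dev/null | wc -l)
echo "apache2 processes: $count"
```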
Regardless of the root cause, an apache process pool overload should be dealt with by regularly restarting the apache processes using /home/wikipedia/bin/apache-restart-all. In an overload situation, the bulk of the process pool is taken up with long-running requests, so restarting kills more long-running requests than short requests. Regular restarting of apache allows parts of the site which are still fast to continue working.
Regular restarting is somewhat detrimental to database consistency, but the effects of this are relatively minor compared to the site being completely down.
There are two possible reasons for an apache process pool overload:
- Some resource on the apache server itself has been exhausted, usually CPU.
- Apache is acting as a client for some backend, and that backend is failing in a slow way.
If CPU usage on most servers is above 90%, and CPU usage has plateaued (i.e. it has stopped bouncing up and down due to random variations in demand), then you can assume that the problem is an apache CPU overload. Otherwise, the problem is with one of the many remote services that MediaWiki depends on.
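A crude way to check one apache's CPU from a shell, if ganglia is slow to load. This sketch reads the cumulative counters in /proc/stat, so it shows idle time since boot rather than the instantaneous figure ganglia graphs:

```shell
# Read aggregate CPU counters from the first line of /proc/stat (Linux).
# An idle percentage near zero on most apaches points to a CPU overload;
# otherwise suspect a slow backend service instead.
read -r _ user nice system idle _ < /proc/stat
total=$((user + nice + system + idle))
echo "cpu idle since boot: $((100 * idle / total))%"
```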
If the site is down because a service that MediaWiki depends on has become slow, there are a number of tools that can help to identify the service:
- Ganglia
- High load or resource utilisation at the root cause server may be obvious at a glance on ganglia.
- Profiling
Run /home/wikipedia/bin/clear-profile, and then observe the highest users of real time in the profiling output. This will only show you requests which have completed within the php.ini timeout of 3 minutes, and without the apache process being killed by administrator intervention, so it's not so helpful for the most severe overloads.
- Strace
This is useful for the most severe overloads. Log in to a random apache. Run ps -C apache2 -l and pick a process with a suspicious-looking WCHAN. Run lsof -p PID, then attach to it with strace -p PID. With luck (and perhaps some repetition), this will tell you which FD apache is waiting on. Using the lsof output, you can identify the corresponding remote service.
MySQL overload
MySQL overload can often be detected from Ganglia, by looking for an increase in load, or for anomalies in network usage and CPU utilisation.
Slow queries on MySQL typically lead to exhaustion of disk I/O resources. Fast, numerous queries may lead to overload via high CPU usage and lock contention.
Slow queries can be identified by running SHOW PROCESSLIST. If slow queries are identified as the source of site downtime, the immediate response should be to kill them. To do this, a shell/awk one-liner typically suffices, such as:
- mysql -h $server -e 'show processlist' | awk '$0 ~ /...CRITERIA.../ {print "kill", $1, ";"}' | mysql -h $server
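The awk stage of that one-liner can be rehearsed safely against fake processlist output before pointing it at a live server. In this sketch the slowmodule comment tag and the column layout are invented for illustration; substitute real criteria:

```shell
# Fake SHOW PROCESSLIST rows: thread id first, query text last. Only rows
# matching the (hypothetical) filter pattern become KILL statements.
printf '%s\n' \
  '101 wikiuser 10.0.0.5:3306 enwiki Query 245 Sending data SELECT /* slowmodule */ ...' \
  '102 wikiuser 10.0.0.5:3306 enwiki Sleep 1' \
| awk '$0 ~ /slowmodule/ {print "kill", $1, ";"}'
# emits: kill 101 ;
```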
Once this is done and the site is back up, a secondary response can be considered, such as a temporary fix in MediaWiki by disabling the relevant module.
Monitoring of the number of running slow queries can be done with a related shell one-liner:
- mysql -h $server -e 'show processlist' | grep CRITERIA | wc -l
Important: disabling the source of slow queries in MediaWiki will typically not bring the site back up, if a large number of slow queries are queued in MySQL. That's one of the reasons why it's so important to kill first and patch second. Patching stops new queries from starting; it doesn't stop old queries from running.
Fixing
Get your priorities straight. While the site is down, your priority is to get it back up. Do not let curiosity or a desire for a complete and elegant solution distract you from doing this as quickly as possible.
Analysis of root causes can be done after the site is back up, based on logs. If you can't do it using the logs after the fact, then the logs aren't good enough and you should improve them for next time.
Communicate
Logging
It's absolutely essential that you communicate your actions to other sysadmins as you do them. Here are some reasons:
- It avoids duplication of effort, conflicts over text file edits, etc.
- It avoids confusing other sysadmins about the causes of the site changes that they observe. It is difficult enough to diagnose the cause of downtime. If a sysadmin changes something, and another sysadmin erroneously attributes the results, then that can significantly slow the diagnosis process.
- Bus factor. If you say what you are doing, other sysadmins have a chance of continuing your work should you lose internet connectivity.
- Sanity review. Responding to site downtime is a high-stress activity and is prone to errors. By writing about your actions and your thoughts, you give others the chance to review and comment on them.
- It makes post-mortem analysis possible. If actions are unlogged, then reconstructing the order of events becomes very difficult. If you hinder post-mortem analysis, then you make it more likely that the same problem will happen again.
Paging
You have to be able to recognise when the problem is beyond your ability (or the ability of those people so far assembled) to fix alone.
Some issues require a lot of work to fix. For example, it takes a lot of work to recover from a power outage at Tampa. In such a case, it makes sense to get everyone online from the outset.
Some issues require special expertise. For example, database crashes need Domas to be online (or at least Tim). Network failures need Mark to be online.
If the site has been down for 45 minutes or more, it is time to stop working on the technical issues and to get some perspective. If a small team can't get the site back up in this amount of time, it has failed, and it is time to wake people up.
Of course, it's often prudent to page people long before 45 minutes is up. But 45 minutes is a good point in time to have a reality check.
Post mortem
It's often overlooked that our server admin log is on a wiki. A nice way to start a postmortem is to add server admin log entries that were omitted at the time. Once you've reconstructed the order of events, with precise times attached, you can start looking at logs.
It's sometimes useful to test your theories about the root causes of downtime. If your theory about the root cause is incorrect, it means that the real root cause is still out there, waiting to cause more downtime. So there is a strong incentive to be rigorous.