Bits.wikimedia.org/Varnish testing
From Wikitech
Revision as of 15:28, 11 January 2010
A few notes:
- We hit a bug where all worker threads are writing to the acceptor pipe, but the acceptor thread doesn't seem to pick that up
- Originally thought to be a 2.6.24 kernel problem; a 2.6.32.3 kernel was deployed, but we still hit the same problem
- This happens with both the poll and epoll acceptors (we managed to hit it much earlier with the poll acceptor, which may be a coincidence)
- Currently we are running:
- varnish trunk (standard configure options)
- Following sysctl.conf additional changes loaded:
net.core.rmem_max = 16777216
net.core.wmem_max = 16777216
net.ipv4.tcp_rmem = 4096 87380 16777216
net.ipv4.tcp_wmem = 4096 65536 16777216
net.ipv4.tcp_fin_timeout = 3
net.core.netdev_max_backlog = 30000
net.ipv4.tcp_no_metrics_save = 1
net.core.somaxconn = 262144
net.ipv4.tcp_syncookies = 0
net.ipv4.tcp_max_orphans = 262144
net.ipv4.tcp_max_syn_backlog = 262144
net.ipv4.tcp_synack_retries = 2
net.ipv4.tcp_syn_retries = 2
- ulimit -s 128
- ulimit -n 500000
- varnishd -n /dev/shm -smalloc,1G -f /usr/local/etc/varnish/bits.vcl -T 127.0.0.1:6000 -w 2000 -a 0.0.0.0:80 -p thread_pool_add_delay=1 -p send_timeout=30 -p listen_depth=4096
The last attempt was successful at half load but failed immediately at full load; it had:
- trunk
- send_timeout
- listen_depth
- sysctl changes
It was handling:
- 4500 requests/s
- 350mbps of traffic
- ~65% of CPU, 326MB of RES
Apparently, when varnish reaches 100% CPU usage, the accept thread gets into a state where it leaks worker threads. poll() reaches 100% CPU sooner than epoll does, so the effect is visible much earlier. As sq1 is a single-CPU, single-core machine, the high rate of context switches consumes far more CPU there than it would on a multi-core machine - this is why we saw much better scalability on multi-core hardware.