Server admin log/Archive 9
March 24
- 16:10 jeronim: unfirewalled all ICMP on benet to solve someone's problem with downloading from dumps.wm.org. /u/l/b/firewall-init.sh not altered because i don't know if that's the right script nowadays
- 07:14 Kyle: db1-4 are ready for service at 10.0.0.234-237 with 408GB /a's
- 5:40 jeluf: rebooting iris
March 23
- 19:52 brion: another mystery case of 'Error: 1114 The table '<whatever>' is full' on adler. Various tables (job, text, pagelinks, etc). Plenty of disk space free, dump still running; unclear what's full. Adler's error log shows lots of "InnoDB: many active transactions running concurrently? 060323 19:52:43InnoDB: Warning: cannot find a free slot for an undo log. Do you have too"
- 19:49 jeluf: KNAMS back, switched back to old DNS map.
- 19:05 jeluf: www.kennisnet.nl down, too, no SSH. DC outage assumed. Switched PowerDNS to point all of Europe to Florida.
- 18:50 jeluf: KNAMS squids not responding. Load balancer?
- 02:33 brion: starting enwiki backup again; last run got hit by a mysterious "Error: 1114 The table '#sql_a6a_0' is full (10.0.0.101)"
March 22
- 07:30 domas: srv59, srv51 hit by /h/w/src/memcache/install-fc3, continuing...
March 21
- 20:25 jeluf: srv59 is listed twice in the list of memcached servers. Replaced one of them by srv71.
- 20:00 jeluf: Users complain about bad performance. No servers seem to be broken, but tugelas are behaving odd. There are fast ones (0.05s for 100 requests) and slow ones (5s for 100 requests). Slow ones have bi values of 450, fast ones have bi values of 20. bo is 0. mctest at 20:17 UTC:
10.0.2.51:11000 set: 100 incr: 100 get: 100 time: 4.16831994057
10.0.2.55:11000 set: 100 incr: 100 get: 100 time: 0.0873651504517
10.0.2.53:11000 set: 100 incr: 100 get: 100 time: 0.0911560058594
10.0.2.54:11000 set: 100 incr: 100 get: 100 time: 3.38875198364
10.0.2.56:11000 set: 100 incr: 100 get: 100 time: 0.061262845993
10.0.2.70:11000 set: 100 incr: 100 get: 100 time: 3.37843799591
10.0.2.58:11000 set: 100 incr: 100 get: 100 time: 0.126893043518
10.0.2.59:11000 set: 100 incr: 100 get: 100 time: 6.54098010063
10.0.2.59:11000 set: 100 incr: 100 get: 100 time: 6.14648485184
10.0.2.62:11000 set: 100 incr: 100 get: 100 time: 4.1362080574
10.0.2.64:11000 set: 100 incr: 100 get: 100 time: 4.54642486572
10.0.2.65:11000 set: 100 incr: 100 get: 100 time: 0.0734169483185
10.0.2.66:11000 set: 100 incr: 100 get: 100 time: 3.67762804031
10.0.2.68:11000 set: 100 incr: 100 get: 100 time: 0.155061006546
10.0.2.69:11000 set: 100 incr: 100 get: 100 time: 5.22008705139
localhost set: 100 incr: 0 get: 0 time: 0.0392808914185
- 9:00 jeluf: rebooted hawthorn, mayflower, sage, clematis
- 7:00 Kyle: Racked 4 new database servers, pending names and ip's.
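The 20:00 and 20:25 entries above boil down to two checks on the mctest output: which servers are slow, and whether any server appears twice in the list. A minimal sketch of both checks follows, assuming the one-line-per-server format shown in the log. The function names and the 1.0 s cutoff are hypothetical, chosen only because the logged times fall into two groups (roughly 0.06-0.16 s and 3.4-6.5 s); this is not the actual mctest script.

```python
import re
from collections import Counter

# A few lines copied from the 20:17 UTC mctest output in the log.
SAMPLE = """\
10.0.2.51:11000 set: 100 incr: 100 get: 100 time: 4.16831994057
10.0.2.55:11000 set: 100 incr: 100 get: 100 time: 0.0873651504517
10.0.2.59:11000 set: 100 incr: 100 get: 100 time: 6.54098010063
10.0.2.59:11000 set: 100 incr: 100 get: 100 time: 6.14648485184
10.0.2.65:11000 set: 100 incr: 100 get: 100 time: 0.0734169483185
localhost set: 100 incr: 0 get: 0 time: 0.0392808914185
"""

# One result line: "<host> set: N incr: N get: N time: SECONDS"
LINE_RE = re.compile(
    r"^(?P<host>\S+) set: (?P<set>\d+) incr: (?P<incr>\d+) "
    r"get: (?P<get>\d+) time: (?P<time>[\d.]+)$"
)


def parse_mctest(text):
    """Return a list of (host, set, incr, get, seconds) tuples."""
    rows = []
    for line in text.splitlines():
        m = LINE_RE.match(line.strip())
        if m:
            rows.append((m["host"], int(m["set"]), int(m["incr"]),
                         int(m["get"]), float(m["time"])))
    return rows


def summarize(rows, slow_threshold=1.0):
    """Return (duplicate hosts, slow hosts); the threshold is an assumed cutoff."""
    counts = Counter(host for host, *_ in rows)
    duplicates = sorted(h for h, n in counts.items() if n > 1)
    slow = sorted({host for host, *_, secs in rows if secs >= slow_threshold})
    return duplicates, slow


if __name__ == "__main__":
    dup, slow = summarize(parse_mctest(SAMPLE))
    print("listed more than once:", dup)
    print("slow:", slow)
```

On the sample lines this flags 10.0.2.59:11000 as listed twice, which matches the duplicate srv59 entry noted at 20:25.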
March 20
- 19:51 brion: dumps started up again in pmtpa
- 19:30 jeluf: added symlink to init.d/nfs from rc3.d on benet
- 19:13 brion: manually banged on benet, got it back online on the external IP. Somehow it's switched from using eth0 to using eth1, and config needs to be adjusted.
- 18:54 brion: Someone, somewhere, somehow rebooted benet for some reason around two hours ago and there's a network problem, can't be reached from zwinger.
- 16:44 PM rebooted benet
- 15:30 jeluf: dumps.wikimedia.org down, connection refused when trying to ssh to the box, HTTP times out.
March 19
- 17:40 ævar: Synced a new plwikiquote logo
March 18
- 08:40 jeluf: added srv36 to external storage cluster 3.
March 17
- 21:55 brion: srv60's memcached/tugela/whatever is VERY slow, 120s response time. can't ssh in. temporarily replacing it with srv59 in the mc cluster
March 16
- 23:14 brion: added redirects for quickipedia.(org|net) as requested
- 21:45 jeluf: Set up mysql server on srv36, replicating data from srv34 (cluster3). No old data imported to srv36, yet.
- 20:00 jeluf: Set up squid on srv8, moved one IP from srv6 to srv8
- 19:00 jeluf: restarted srv7's squid, using /usr/sbin/squid instead of /usr/local/squid/bin/squid
March 15
- 20:01 brion: adjusted checkers.php logging to use @ on all error_log() calls, so files that are forgotten on yaseo don't display warnings
- 19:00 jeluf: moved IP .204 from srv7 to srv9 (now they have 3 IPs each)
- 14:00 jeluf: restarted srv7's squid
March 14
- 08:04 brion: fixed bad permissions on some servers which broke sync-dblist script (uses rsync to copy *.dblist out)
- 07:44 brion: set up zh.wikinews.org
- 07:00 brion: setting up spcom s3kr1t wiki
March 13
- 22:55 brion: fixed (hopefully) the fallback for text loading. it was broken, badly, didn't notice before :P
- 22:45 jeluf: fixed replication of srv33. It has a gap from 15:00-22:45. Added back to pool. If a revision does not exist, the master should be asked anyway.
- 15:45 midom: srv32 manually resynced with srv34, srv33 still down
- 14:45 jeluf: srv32 and srv33 have out-of-sync replicas, shut them down. srv34 overloaded, went read-only
- 14:00 ævar: / on srv34 filled up, cleared out /tmp/mediawiki/, approx 70MB left
March 12
- 23:00 mark: Moved back ns0.wikimedia.org's IP to zwinger to get DNS back up
- 00:54 brion: renaming wikimaniawiki to wikimania2005wiki to future-proof and convenience things
March 11
- 22:10 jeluf: set up NFS, NTP, timezone, ... on ixia, added it to the mysql pool
- 07:30 jeluf: ixia doesn't start replication:
060311 2:13:04 Failed to open the relay log './lomaria-relay-bin.312' (relay_log_pos 36322078)
060311 2:13:04 Could not find target log during relay log initialization
060311 2:13:04 Failed to initialize the master info structure
- The file is there, permissions are there, no idea what's wrong
- 06:55 jeluf: restarted mysql on lomaria
- 05:09 brion: fundraising display partially back online. waiting for dns to clear, and will start regularly updating again....
- 01:12 brion: got friedrich switched; on 207.142.131.232. rebooting to test...
March 10
- 23:00 brion: taking friedrich out of apache service to replace tingxi
- 22:20 JeLuF: Taking lomaria down to copy its DB to ixia. Will take some hours.
- 07:20 Solar: yongle back up, but not public interface, just private. (It only had one cat5, let me know if you want me to hook another up to csw1.)
March 9
- 05:22 Solar: Correct password on ixia
March 8
- 23:25 brion: thistle caught up, back in service
- 23:23 brion: taking thistle out of rotation temporarily; it's behind on master. reports of edits overwriting without conflict message may or may not be related
- 19:35 jeluf: Changed IP of mail.wikimedia.org from .207 to .221. This allows us to move ns0 back to zwinger (needs to be done later, when the change is known on all DNS servers)
- 08:00 jeluf: khaldun had two default gateways. Removed default gw 10.0.0.4, ping to goeje works, NFS works
- 08:00 jeluf: Khaldun down, NFS times out. No user complaints yet - is khaldun still in use at all? Update: Zwinger can reach khaldun, but goeje can't. Routing?
March 7
- 23:53 avar: Made Naconkantari sysop on kowiki due to massive WP is communism vandalism which none of the kowiki admins were awake to clean up.
- 02:48 Tim: Added Ozemail proxies to the trusted XFF list
March 6
- 11:50 jeluf: Changed config to use spamd instead of spamassassin
- 11:30 domas: reduced postfix, apache concurrency on goeje
- 11:30 jeluf, domas: goeje up, rebooted by PM.
- 09:00 jeluf: goeje down, postfix and apache shutdowns didn't help
- 08:00 jeluf: goeje overloaded, load avg 260, slow to no response. shut down postfix, shutting down apache
- 07:45 jeluf: replication of srv33 in sync with master. Restarted srv33 with mysql port 3306 enabled.
March 5
- 23:10 brion: added external.log for ExternalStoreDB load failures. we think mysterious text load failures might have been from srv33
- 23:05 jeluf: started srv33 with mysqld port set to 3307
- 22:50 jeluf: killed wiki by starting lagged external storage srv33, killed it.
- 22:40 brion: jens put us back to read/write as the threads finished
- 22:19 brion: adler broken. nobody bothering to update the admin log
- InnoDB: many active transactions running concurrently?
- 060228 20:08:48InnoDB: Warning: cannot find a free slot for an undo log. Do you have too
- InnoDB: many active transactions running concurrently?
- Processlist showed several hundred attempts to invalidate one image page (Vynil_record.jpg). Perhaps from automated job?
March 4
- 21:35 brion: fixed problem (whitespace in language file), captchas back on except for sr.wikipedia, which is reasonably well-populated
- 21:28 brion: disabling captchas on all sr projects; broken on sr for some reason
- 12:05 brion: yaseo uploads resolved (bad symlink into /mnt/wikipedia/htdocs on yaseo docroot), math also fixed (rewrite condition crashed apache; changed it and now works)
- 11:43 brion: noticed amaryllis / part is very small (10g) and full. nice.
- 11:40 brion: yaseo uploads borked for some reason. tossed in a symlink on amaryllis so /mnt/upload works there, but not sure why many still don't work on http
March 3
- 20:00 jeluf: set up new queues info-ch and info-als on OTRS.
- 05:04 Tim: set up daily cron job on goeje, to back up its root directory to hypatia at 06:00.
- 03:13 brion: started another enwiki dump, yaseo dump
- 03:03 brion: installing setproctitle on srv31; php is whining
March 2
- 23:09 brion: adding dns entries for wikimania200[56].wikimedia.org, will set up new wiki and redirects shortly
- 14:00 Tim: started rsync of goeje's root directory to hypatia:/var/backup/ssl-server, for backup and maybe failover capability in the future.
March 1
- 22:16 brion: turning on wgEmailAuthentication on public wikis. Somehow goeje got blacklisted by spamcop, allegedly for sending to blackhole addresses. There's a small possibility that active spamming was attempted through the wiki.
- 05:30 Solar: srv55, srv57, srv61, srv67 have new ram, and are up, but out of sync
- 04:20-04:25 Tim: srv54, a tugela server, was accidentally rebooted. This took the site down for about 5 minutes, probably due to unconfigurable fwrite() timeouts on persistent connections.
Archives
- Server admin log/Archive 1 (2004 Jun - 2004 Sep)
- Server admin log/Archive 2 (2004 Oct - 2004 Nov)
- Server admin log/Archive 3 (2004 Dec - 2005 Mar)
- Server admin log/Archive 4 (2005 Apr - 2005 Jul)
- Server admin log/Archive 5 (2005 Aug - 2005 Oct)
- Server admin log/Archive 6 (2005 Nov - 2006 Feb)