Server admin log/test
From Wikitech
October 18
- 00:03 Brion: trying a log thingay
October 17
- 21:10 brion: enabled Commons foreign image repo on Wikitech
- 18:45 brion: created Wikimedia-Boston list for SJ
- 16:55 brion: adding nomcomwiki to special.dblist so it shows up right in sitematrix
- 16:45 brion: deleted some junk comments from bugzilla
- 16:31 brion: changed autoconfirm settings for 'fishbowl' wikis -- 0 age for autoconfirm, plus set upload & move for all users just in case autoconfirm doesn't kick in right
- 14:22 RobH: srv131 back up.
- 09:03 Tim: copying srv129 and srv139 ES data directories to storage2:/export/backup
- 02:49 Tim: excessive lag on db16, killed long-running queries and temporarily depooled. CUPS odyssey continues.
- 01:59 Tim: removing cups on all servers where it is running
- 00:00 RobH: restarted srv43-47
October 16
- 20:42 brion: added 3 more dump threads on srv31... we need to find some more batch servers to work with for the time being until new dump system is in place :)
- 20:20 RobH: pulled samuel from the rack, decommissioned, RIP samuel.
- 19:35 RobH: migrated rack B4 from asw3 to asw-b4-pmtpa.
- 18:40 RobH: rebooted scs-ext, oops!
- 18:26 RobH: srv61 reinstalled and redeployed.
- 18:24 RobH: Adler re-racked with rails, booted up to maintenance mode prompt.
- 17:34 mark: 208.80.152.0/25 NTP restriction is actually also not broad enough - changed it to /22 in ntpd.conf on zwinger
- 17:02 brion: thumbnails on commons are insanely slow and/or broken
- 14:44 Tim: added a more comprehensive redirection list to squid.conf.php for storage1 images
- 14:04 Tim: redirected images for /wikipedia/en/ to storage1, apparently they were moved a while ago. Refactored the relevant squid.conf section.
- 13:38 Tim: disabled directory index on amane. Was generating massive amounts of NFS traffic by generating a directory index for some timeline directories.
- 12:51 Tim: increased memory limit on srv159 to 8x200MB. Still well under physical.
- 11:38 Tim: cleaned up temporary files on srv159, had filled its disk
- 11:25 Tim: synced upload scripts (including to ms1)
- 10:06 Tim: removed sq50 from the squid node lists and uninstalled squid on it
- 09:22 - 09:52 mark, Tim, JeLuF: initial attempts to bring the squids back up failed due to incorrect permissions on the recreated swap logs. Most were back up by around 09:32, except newer knams and yaseo squids which were missing from the squids_global node group. The node group was updated and the remainder of the squids brought up around 09:52.
- 09:19 JeLuF: deployed squid.conf with an error in it. All squid instances exited.
- 08:26 Tim: Restarted ntpd on search7, was broken
- 06:42 Tim: ntp.conf on zwinger had the wrong netmask for the 208.x net, it was /26 instead of /25. So a lot of squids were out of it, and some had a clock skew of 10 minutes (as visible on ganglia). Fixed ntp.conf, not stepped yet. Will affect squid logs.
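The netmask entries at 06:42 and 17:34 come down to how many addresses each prefix length covers. A minimal sketch using Python's standard `ipaddress` module; the two host addresses are hypothetical, picked only to fall on either side of the boundaries, and are not taken from the log.

```python
import ipaddress

# Prefixes mentioned in the 06:42 and 17:34 entries.
prefixes = ["208.80.152.0/26", "208.80.152.0/25", "208.80.152.0/22"]
networks = [ipaddress.ip_network(p) for p in prefixes]

# Hypothetical host addresses, for illustration only.
hosts = [
    ipaddress.ip_address("208.80.152.100"),
    ipaddress.ip_address("208.80.153.10"),
]


def covering(host):
    """Return the prefixes (as strings) whose address range contains host."""
    return [str(net) for net in networks if host in net]


for net in networks:
    print(net, "covers", net.num_addresses, "addresses, up to", net.broadcast_address)

for host in hosts:
    print(host, "falls inside", covering(host) or "none of them")
```

Under these assumptions, a host in the upper half of the first /25 is missed by a /26 restriction, and a host in a neighbouring /24 is only covered once the restriction is widened to /22.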
October 15
- 19:49 brion: added '<span onmouseover="_tipon' to spam regex; some kind of weird edit submissions coming with this stuff like [1]
- 12:00 Tim: trying to bring srv159 up as an image scaler. Limiting memory usage to 8x100 = 800MB with MediaWiki.
- 11:21 srv127 died just the same. Mark suggests using one with DRAC next.
- 10:20 Tim: all image scalers (srv43 and srv100) swapped to death again. Preparing srv127 as an image scaler with swap off.
- 08:43 Tim: reduced depool-threshold for the scalers to 0.1 since srv100 is quite capable of handling the load by itself while we're waiting for the other servers to come back up.
- 07:45 Tim: half the scaling cluster went down again, ganglia shows high system CPU. Installing wikimedia-task-scaler on srv100.
- 02:30 Tim: moved image scalers into their own ganglia cluster
- 02:17 Tim: apache on srv43-47 hadn't been restarted and so was still running without -DSCALER. This partially explains the swapping. Restarted them. Took srv38-39 back out of the image scaler pool, they have different rsvg and ffmpeg binary paths and break without a MediaWiki reconfiguration.
- 02:13 tomasz: upgraded srv9 to ubuntu 8.04
- 02:00 tomasz: upgraded srv9 to ubuntu 7.10
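The 19:49 entry adds a literal HTML fragment to the spam regex. The sketch below shows the general idea in Python rather than in MediaWiki's PHP configuration: escape the fragment so it matches literally, then reject any submission that contains it. The helper name, the case-insensitive flag, and the sample texts are assumptions made for illustration.

```python
import re

# Fragment quoted in the 19:49 entry; re.escape makes it match literally.
SPAM_FRAGMENT = '<span onmouseover="_tipon'

# Case-insensitive matching is an assumption, not something the log states.
spam_regex = re.compile(re.escape(SPAM_FRAGMENT), re.IGNORECASE)


def is_spam(text):
    """Hypothetical helper: True if the submitted text contains the fragment."""
    return spam_regex.search(text) is not None


print(is_spam('Hello <span onmouseover="_tipon(this)">x</span>'))  # True
print(is_spam("An ordinary edit with <span>markup</span>"))  # False
```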
October 14
- 19:16 brion: restarted lighty on storage1 again -- it was back in 'fastcgi overloaded' mode, possibly due to the previously broken backend, possibly not
- 19:11 mark: Pooled old scaling servers srv38, srv39
- 18:50 brion: at least four of new image scalers are down -- can't reach by SSH. thumbnailing is borked
- 16:41 brion: fixed image scaling for now -- storage1 fastcgi backends were overloaded, so it was rejecting things. did some killall -9s to shut them all down and restarted lighty. ok so far
- 16:20 brion: image scaling is broken in some way, investigating
- 02:54 Tim: fixed srv43-47, this is now the image scaling cluster
- 00:10 Tim: oops, forgot to add VIPs, switched back.
- 00:05 Tim: switched image scaling LVS to srv43-47
October 13
- 23:45 Tim: prepping srv43-47 as image scaling servers
- 21:45 jeluf: moved more image directories to ms1. Now, upload/wikipedia/[abghijmnopqrstuwxy]* are on ms1
- 21:35 jeluf: killed mwsearchd on srv39, removed both the rc3.d link and the cronjob that start mwsearchd
- 21:30 RobH: search8 and search9 are online, awaiting configuration.
- 21:15 brion: thumb rendering failures reported... found some runaway convert procs poking at an animated GIF, killed them.
- rev:42058 will force GIFs over 1 megapixel to render a single frame instead of animations as a quick hackaround...
- 20:48 domas: thistle serving as s2a server
- 20:28 RobH: stopping mysql on adler so it can be re-racked with rails.
- 19:53 RobH: search7 back online, awaiting addition to the search cluster.
- 19:35 mark: Set up an Exim instance on srv9 for outgoing donation mail, as well as incoming for delivery into IMAP for CiviMail (*spit*).
- 17:00 RobH: srv21-srv29 decommissioned and unracked.
- 12:05 domas: put lomaria back in rotation
- 11:50 domas: Enabled write-behind caching on db15. Restarted.
- 10:40 domas: restarted replication on db15 and lomaria
- 10:27 domas: loading dewiki data from SQL dump into thistle
- 09:09 Tim: restarted logmsgbot
- 08:27 Tim: folded s2b back into s2
- 08:06 Tim: db13 in rotation
- 08:02 domas: copying from db15 to lomaria
- 07:38 Tim: started replication on db13
- 04:51 Tim: copying
- 03:27 Tim: Preparing for copy from db15 to db13
- 00:00 domas: something wrong with db15 i/o performance. it is behaving way worse than it should.
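The rev:42058 note under the 21:15 entry describes a size-based fallback: large GIFs get a single frame instead of a full animation. A rough sketch of that decision in Python, assuming "1 megapixel" means width times height above 1,000,000 pixels; the function name and the threshold constant are illustrative and not taken from the MediaWiki change itself.

```python
# Assumed reading of "1 megapixel"; the real cut-off lives in the MediaWiki revision.
MAX_ANIMATED_GIF_PIXELS = 1_000_000


def render_mode(width, height, frame_count):
    """Hypothetical helper: choose how to thumbnail a GIF."""
    if frame_count > 1 and width * height > MAX_ANIMATED_GIF_PIXELS:
        # Too large to animate cheaply: fall back to one frame.
        return "single-frame"
    return "animated" if frame_count > 1 else "still"


print(render_mode(800, 600, 20))    # animated
print(render_mode(2000, 1500, 20))  # single-frame
```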
October 12
- 23:58 brion: updated CodeReview to add a commit so loadbalancer saves our master position. playing with serverstatus extension on yongle to find out wtf it keeps getting stuck
- 22:05 brion: db15 sucks hard. putting categories back to db13
- 22:01 brion: db15 got all laggy with the load. taking back out of general rotation, leaving it on categories/recentchangeslinked
- 21:58 brion: db15 seems all happy. swapping it in in place of db13, and giving it some general load on s2. we'll have to resync db13 at some point? and toolserver?
- 19:41 Tim: shutting down db15 for restart with innodb_flush_log_at_trx_commit=2. But db8 seems to be handling the load now so I'm going to bed.
- 19:20 Tim: depooled db15.
- 19:09 Tim: split off some wikis into s2b and put db8 on it. To reduce I/O and hopefully stop the lag.
- 18:51 Tim: db15 still chronically lagged. Offloading all s2 RCL and category queries to db13.
- 18:38 Tim: offloading commons RCL queries to db13
- 18:36 Tim: dewiki r/w with ixia (master) only
- 18:33 Tim: offloading commons category queries to db13
- 18:25 Tim: balancing load. Fixed ganglia on various mysql servers.
- 18:06 Tim: going to r/w on s2. Not s2a yet because db15/db8 can't handle the load.
- 17:46 Tim: db8->db15 copy finished, deploying
- 17:33 Tim: installed NRPE on thistle.
- 16:54 Tim: copied mysqld binaries from db11 to db15 and thistle. Plan for thistle is to use it for s2a.
- 16:40 Tim: ixia/db8 can't handle the load between them with db13 out, even with s2a diverted. Restored db13 to the pool. Running out of candidates for a copy destination. Need db13 in because it's keeping the site up, can't copy to thistle because it's too small with RAID 10. Plan B: set up virgin server db15. Copying from db8.
- 16:07 Tim: repooled ixia/db8 r/o
- 15:53 Tim: removed ixia binlogs 290-349. 270-289 were deleted during the initial response.
- 14:54 mark: Pooled search6 as part of search cluster 2, by request of rainman
- 14:37 Tim: deployed r41995 as a live patch to replace buggy temp hack.
- 14:14 Tim: cleaned up binlogs on db2. Yes the horse has bolted, but we may as well shut the gate.
- 14:11 Tim: copy now in progress as planned.
- 13:48 Tim: going to try the resync option. Maybe with s2 it won't take as long as s1. Will try to sync up db8 from ixia with db13 serving read-only load for the duration of the copy.
- 13:40 Tim: ixia (s2 master) disk full. Classic scenario, binlogs stopped first, writing continued for 10 minutes before replag was reported.
- 13:00 jeluf: moved wikipedia/m* image directories to ms1
- 08:00 jeluf: restarted lighttpd on ms1, directory listings are now disabled.
- 02:55 Tim: attempted to disable directory listing on ms1. Gave up after a while.
October 11
- 07:00 jeluf: moved wikipedia/s* image directories to ms1
October 10
- 21:30 jeluf: moved wikipedia/[jqtuwxy]* to ms1
- 19:20 RobH: Bayes online.
- 19:11 brion: recreated special page update logs in /home/wikipedia/logs, hopefully fixing special page updates
- 13:05 Tim: reverted live patch and merged properly tested fix r41928 instead.
- 12:31 Tim: deployed a live patch to fix a regression in MessageCache::loadFromDB() concurrency limiting lock
- 12:17 domas: killed long running threads
- ~12:04: s2 down due to slave server overload
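The 21:30 entry uses a shell-style character class to name the directories that were moved. A small sketch with Python's `fnmatch` module showing which names such a pattern selects; the sample directory names are hypothetical, and this is not how the squid configuration itself expresses the routing.

```python
from fnmatch import fnmatchcase

# Pattern from the 21:30 entry.
MOVED_PATTERN = "wikipedia/[jqtuwxy]*"


def on_ms1(path):
    """Hypothetical helper: True if the path matches the moved pattern."""
    return fnmatchcase(path, MOVED_PATTERN)


# Sample directory names, for illustration only.
for path in ["wikipedia/ja", "wikipedia/tr", "wikipedia/de", "wikipedia/zh"]:
    print(path, "->", "ms1" if on_ms1(path) else "not moved in this step")
```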
October 9
- 22:52 brion: enabled Collection on de.wikibooks so they can try it out
- 20:00 jeluf: moved wikipedia/i* images to ms1
- 17:05 RobH: thistle RAID died due to a failed HDD; replaced the HDD, reinstalled as RAID 10.
- 12:00 domas: switched s3 master to db1, erased a bunch of db.php stuff by accident (don't know how :). restored from db.php~ :-)
- 09:31 mark: pascal died yet again, revived it. Will move the htcp proxy tonight...
October 8
- 21:05 brion: yongle still gets stuck from time to time, breaking mobile, apple search, and svn-proxy. i suspect svn-proxy but can't easily prove it still. using separate svn command (in theory) but it's not showing me stuck processes.
- ??:?? rob fixed srv37, then later srv133 into mediawiki-installation node group. he did an audit and didn't see any other problems. i ran a scap to make sure all are now up to date
- Speculation: possible that rumored ongoing image disappearances have been caused by the image-destruction bug still being in place on srv133 for the last month.
- 19:02 mark: Upgraded packages on search1 - search6 and searchidx1
- 18:59 brion: aaron complaining of srv37 not properly updated (doesn't recognize Special:RatingHistory). flaggedrevs.php was out of date there. checking scap infrastructure, stuff seems ok so far...
October 7
- 21:47 brion: started two dump threads (srv31)
- 21:16 RobH: installed and configured gmond on all knams squids.
- 21:00 jeluf: moved wikipedia/g* to ms1
- 18:55 RobH: fixed private uploads issue for arbcom-en and wikimaniateam.
- 17:26 RobH: reinstalled and redeployed knsq24 and knsq29
- 15:00-16:00 robert: switched enwiki to lucene-search 2.1 running on new servers. Test run till tomorrow; if anything goes wrong, reroute search_pool_1 to old searchers on lvs3. Will switch on spell checking when all of the servers are racked. Thanks RobH for tuning config files.
- 15:54 RobH: srv101 crashed again, running tests.
- 15:45 RobH: srv146 was powered down for no reason. Powered back up.
- 15:42 RobH: srv138 locked up, rebooted, back online.
- 15:32 RobH: srv110 was locked up, rebooted, synced, back online.
- 15:31 RobH: srv101 back up and synced.
- 15:22 RobH: rebooted srv56, was locked up, handed off to rainman to finish repair.
- 15:21 RobH: updated lucene.php and synced.
- 15:04 RobH: updated memcached to remove srv110 and add in spare srv137.
- 15:00 RobH: removed all servers from lvs:search_pool_1 and put in search1 and search2 with rainman
October 6
- 23:55 brion: tweaked bugzilla to point rXXXX at CodeReview instead of ViewVC
- 14:29 domas: amane lighty was closing connections immediately, worked properly after restart. upgraded to 1.4.20 on the way.
- 14:36 RobH: setup ganglia on all pmtpa squids.
- 13:50 mark: The slow page loading on the frontend squids appears to be limited to english main page only, for unknown reasons. Set another article as pybal check URL to prevent pooling/depooling oscillation by PyBal for now.
- 09:27 mark: yaseo squids are fully in swap, set DNS scenario yaseo-down
October 5
- 23:14 mark: Frontend squids are not working well at the moment, sometimes serving cached objects with very high delays. I wonder if they are under (socket) memory pressure. Reduced cache_mem on the backend instance on sq25 to free up some memory for testing.
- 20:35 jeluf: wikipedia/b* moved, too
- 19:00 jeluf: switched squids to send requests for upload.wikimedia.org/wikipedia/a* to ms1
- 14:30 jeluf: Moving all wikipedia/a* image directories to ms1
October 4
- 23:17 mark: Repooled knsq16-30 frontends in LVS. Also found that mint was fighting with fuchsia about being LVS master, due to reboot this afternoon.
- 14:30 mark: Several servers in J-16 were shutting down, or going down around this time. Reason unknown, possibly auto shutdown because of high temperature, possibly they were turned off by someone locally.
- 14:03 mark: SARA power failure. Feed B lost power for ~ 6 seconds.
- 00:26 mark: Depooled srv61
- 00:07 brion: found srv37 and srv61 have broken json_decode (wtf!)
- updating packages on srv37. srv61 seems to have internal auth breakage
- updated packages on srv61 too. su still borked, may need LDAP fix or something?
October 3
- 21:40 brion: transferring old upload backups from storage2 to storage3. once complete, can restart dumps!
- 20:01 brion: running updateRestrictions on all wikis (done)
- 17:51 RobH: srv135 & srv136 reinstalled as ubuntu.
- 17:34 RobH: srv132 & srv133 reinstalled as ubuntu.
- 17:13 RobH: srv130 back online.
- 16:40 RobH: depooled srv131, srv132, srv135, srv136 for reinstall.
- 00:25 brion: switched codereview-proxy.wikimedia.org to use local SVN command instead of PECL SVN module; it seemed to be getting bogged down with diffs, but hard to really say for sure
October 1
- 20:02 RobH: srv63 back online.
- 19:35 RobH: srv61 and srv133 back online.
- 18:22 RobH: storage3 online and handed off to brion.
- 17:35 RobH: updated mc-pmtpa.php to put srv61 as spare.
- 17:32 RobH: srv61 faulty fan replaced, back online.
- 09:31 Tim: srv104 (cluster18) hit max_rows, finally. Removed it from the write list.
- 08:36 Tim: fixed ipb_allow_usertalk default on all wikis
- 23:46 mark: Reinstalled knsq24
- 22:55 mark: Reenabled switchports of knsq16 - knsq30
- 20:45 jeluf: fixed resolv.conf on srv131
- 20:45 jeluf: mounted ms1:/export/upload as /mnt/upload5, started lighttpd on ms1
- 19:47 brion: enabled revision deletion on test.wikipedia.org for some public testing.
- 14:25 RobH: Cleaned out the squid cache on knsq16, knsq17, knsq18, knsq19, knsq21, knsq22, knsq23, knsq25, knsq26, knsq27, knsq28, knsq30. DRAC not responsive on knsq20, knsq24, knsq29.
Archives
- Server admin log/Archive 1 (2004 Jun - 2004 Sep)
- Server admin log/Archive 2 (2004 Oct - 2004 Nov)
- Server admin log/Archive 3 (2004 Dec - 2005 Mar)
- Server admin log/Archive 4 (2005 Apr - 2005 Jul)
- Server admin log/Archive 5 (2005 Aug - 2005 Oct)
- Server admin log/Archive 6 (2005 Nov - 2006 Feb)
- Server admin log/Archive 7 (2006 Mar - 2006 Jun)
- Server admin log/Archive 8 (2006 Jul - 2006 Sep)
- Server admin log/Archive 9 (2006 Oct - 2007 Jan)
- Server admin log/Archive 10 (2007 Feb - 2007 Jun)
- Server admin log/Archive 11 (2007 Jul - 2007 Dec)
- Server admin log/Archive 12 (2008 Jan - 2008 Jul)
- Server admin log/2008-08
- Server admin log/2008-09