Server admin log/Archive 20
From Wikitech
September 2
- 16:48 mark: Replaced srv52's memcached with srv69's, off the spare list
- 16:41 srv52 goes down, unreachable
September 1
- 08:00 domas: live revert of DifferenceEngine.php to pre-24607 - requires additional patrolling index (ergh!), which was not created (ergh too). why do people think that reindexing recentchanges because of minor link is a good idea? :-/
- The schema change requirement was noted and made quite clear. If it wasn't taken live before the software was updated, it's no fault of the development team. Rob Church 14:44, 2 September 2007 (PDT)
- 03:16 Tim: fixed upload on advisorywiki
August 31
- 19:50 brion: starting an offsite copy of public upload files from storage2 to gmaxwell's server
- 13:40 brion: srv149 spewing logs with errors about read-only filesystem; can't log in; no ipmi; mark shut its switchport off
- 13:25 Users are reporting empty wiki pages rendered by the new servers, e.g. http://de.wikipedia.org/w/index.php?title=Sebnitz&oldid=36169808
- That test case is gone from the cache now, but I did see it before it went. Can't reproduce directly to the apache. Maybe an overloaded ext store server? -- Tim
- 06:50 mark: Massive packet loss within 3356, prepending AS14907 twice to 30217, removed prepend to 4323
August 30
- 21:40 mark: Python / PyBal on alrazi had crashed with a segfault - restarted it
- 21:37 mark: Installed ganglia on sq31 - sq50
- 17:24 mark: Upgraded srv154 - srv189 to wikimedia-task-appserver 0.21, which has an additional depend on tetex-extra, needed for math rendering
- 16:03 Tim: put srv154-189 into rotation
- 15:55 Set up apaches srv154 - srv189, only waiting to be pooled...
- 12:14 Set up apache on srv161
- 11:52 Set up apache on srv160
- srv159 has NFS /home brokenness, fix later
- 00:55 brion: disabling wap server pending security review
August 29
- 22:40 mark: Installed srv155 - srv189
- 19:44 mark: Reinstalled srv154
- 16:08 Rob: db3 drive 0:5 failed. Replaced with onhand spare, reinstalled ubuntu, needs Dev attn.
- Incident starting at 15:50:
- 15:50: CPU/disk utilisation spike on s2 slaves (not s2a)
- ~15:55: Slaves start logging "Sort aborted" errors
- 15:55: mysqld on ixia crashes
- 15:58: mysqld on thistle crashes
- 16:03: mysqld on lomaria crashes
- 16:09-16:18: Timeout on s2 of $wgDBClusterTimeout=10 brings down entire apache pool in a cascading overload
- 16:19 Tim and Mark investigate
- 16:23 Tim: depooled ixia
- 16:26 Tim: switched s2 to r/o
- 16:28 Tim: reduced $wgDBClusterTimeout to zero on s2 only. Partial recovery of apache pool seen immediately
- 16:37 Tim: 4-CPU apaches still dead, restarted with ddsh
- 16:45 full recovery of apache pool
- 16:59 Tim: Lomaria overloaded with startup traffic, depooled. Also switched s2 back to r/w.
- 17:12 Tim: brought lomaria back in with reduced load and returned $wgDBClusterTimeout back to normal.
- 17:14 - 17:35 Tim: wrote log entry
- 17:37 Tim: returned lomaria to normal load
- 13:55 Rob: Reinstalled db1, Raid0, Ubuntu.
- 13:52 Rob: Restarted srv135 from kernel panic, sync'd and back online.
- 13:50 mark: Set AllowOverrideFrom in /etc/ssmtp/ssmtp.conf on srv153 to allow MW to set the from address. Also made this the default on new Ubuntu installs.
- 13:14 Rob: Rebooted srv18 after cpu temp warnings. Sync'd server, back online, no more temp warnings.
- 12:46 Rob: sq39's replacement power supply arrived. Server back online and squid process cleaned and started.
- 09:54 mark: Ran apt-get dist-upgrade on srv153
- 05:30 domas: kill-STOP recaches, until we get crashed DB servers (3) up (or get new machines? :)
- 00:26 Tim: brought srv153 back into rotation after fixing some issues
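The 16:09 cascade in the incident above is the classic blocking-wait failure mode: each apache child waits up to $wgDBClusterTimeout seconds for the dead s2 cluster, so the whole pool fills with stuck children. A sketch of the emergency override described at 16:28 — the variable name and value are from the log, but the surrounding conditional is illustrative, not the actual configuration change:

```
# Illustrative MediaWiki config sketch, not the literal change:
# fail fast on s2 instead of tying up apache children for 10s each.
if ( $section === 's2' ) {     # hypothetical per-section switch
    $wgDBClusterTimeout = 0;   # restored to normal at 17:12
}
```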
August 28
- 19:33 brion: freeing up some space on amane; trimming and moving various old dumps and other misc files
- 19:08 Tim: fixed ssmtp.conf on srv153
- 16:10 brion: rerunning htdig cronjob on lily.... at least some lists not indexed or something
- 15:53 brion: fixed img_sha1 on pa_uswikimedia and auditcomwiki
- 15:36 Tim: put srv153 into rotation
- 15:08 brion: manually rotated captcha.log, hit 2gb and stopped in june
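A manual rotation like the captcha.log one boils down to moving the full log aside and recreating it empty. Runnable local sketch; a temp dir stands in for the real log location:

```shell
# Local sketch of a manual log rotation; paths are illustrative.
d=$(mktemp -d)
printf 'old entries\n' > "$d/captcha.log"
mv "$d/captcha.log" "$d/captcha.log.1"   # move the full log aside
: > "$d/captcha.log"                     # recreate the log, empty
# NB: a long-running writer keeps appending to the moved file until it
# reopens the path (or is signalled), so restart or HUP the writer too.
```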
August 27
- 15:57 Tim: switch done, r/w restored. New s3 master binlog pos: db5-bin.001, 79
- 15:37 Tim: restarting db5 with log_bin enabled
- 15:28 Tim: switched s3 to read-only mode
- 15:21 db1 down (s3 master)
- 13:17 Tim: running populateSha1.php on commonswiki
August 25
- 22:53 brion: noticed dberror.log is flooded with 'Error selecting database' blah with various wrong dbs; possibly from job runners, but the only place I found suspicious was nextJobDB.php and I tried a live hack to prevent that. Needs more investigation.
- 14:30 Tim: got rid of the APC statless thing, APC is buggy and crashes regularly when used in this way
- 14:00ish brion: scapping again; Tim made the img metadata update on-demand a bit nicer
August 24
- 17:45 brion: fixed SVN conflict in Articles.php in master copy
- 17:29 brion: restarted slave on db2
- 17:27 brion: applying image table updates on db2, didn't seem to make it in somehow. tim was going to run this but i can't find it running and he's not online and didn't log it
- 17:16 brion: restarting slave on ariel
- ??:?? tim is running a batch update of image rows in the background of some kind
- ??:?? tim may have changed which server enwiki watchlists come from while ariel is non-synced
- 16:04 brion: applying img_sha1 update to ariel so we can restart replication and get watchlists for enwiki going again...
- 15:10 tim reverted code to r24312 to avoid image update buggage for now
- 15:10 brion: took db3 (down), db2 (stopped due to schema buggage) out of rotation
- 15:00ish -- massive overload on db8 due to image row updates
- 14:47 brion: starting scap, finally!
- 14:10 brion: unblocked wikibugs IRC mailbox from wikibugs-l list, was autoblocked for excessive bounces
- 13:59 brion: confirmed that schema update job on samuel looks done
- 13:40 Tim: restarted job runners, only 2 were left out of 9. Wiped job log files.
August 23
- 22:00 Tim: replication on samuel stopped due to a replicated event from testwiki that referenced oi_metadata. Applied the new patches for testwiki only and restarted replication. Brion's update script will now get an SQL error from testwiki, but hopefully this won't have serious consequences.
- 20:04 brion: switched -- samuel_bin_log.009 514565744 to db1-bin.009 496650201. samuel's load temporarily off while db changes apply...
- 19:56 brion: switching masters on s3 to apply final db changes to samuel
- 15:34 brion: knams more or less back on the net, mark wants to wait a bit to make sure it stays up. apache load has been heavy for a while, probably due to having to serve more uncached pages. dbs have lots of idle connections
- 15:32 brion: updated setup-apache script to recopy the sudoers file after reinstalling sudo, hopefully this'll fix the bugses
- 15:12 brion: srv61, 68 have bad sudoers files. srv144 missing convert
- 14:08 brion: depooled knams (scenario knams-down)
- 13:50 brion: knams unreachable from FL
- 9:08 mark: Repooled yaseo, apparently depooling it causes inaccessibility in China
August 22
- 20:35 brion: applying patch-oi_metadata.sql, patch-archive-user-index.sql, patch-cu_changes_indexes.sql on db5, will then need a master switch and update to samuel
- 20:20 brion: found that not all schema updates were applied. possibly just s3, possibly more. investigating.
- 14:00ish brion: amane->storage2 rsync completed more or less intact; rerunning with thumbs included for a fuller copy
August 21
- 20:21 brion: amane->storage2 rsync running again with updated rsync from CVS; crashing bug alleged to be fixed
- 14:43 Rob: sq39 offline due to bad power supply, replacement ordered.
- 13:30 Tim, mark: setting up srv37, srv38 and srv39 as an image scaling cluster. Moving them out of ordinary apache rotation for now.
- ~12:00 Tim: convert missing on srv61, srv68, srv144, attempting to reinstall
- 9:30 mark: Reachability problems to yaseo, depooled it
August 20
- 15:00 brion: restarted amane->storage2 sync, this time with gdb sitting on their asses to catch the segfault for debugging
- ~12:00 Tim: started static HTML dump
- 11:15 Tim: running setup-apache on srv135
August 18
- 22:32 brion: schema updates done!
August 17
- 20:42 brion: started schema updates on old masters db2 lomaria db1
- 20:39 brion: s1 switched from db2 (db2-bin.160, 270102185) to db4 (db4-bin.131 835632675)
- 20:32 brion: s2 switched from lomaria (lomaria-bin.051 66321679) to db8 (db8-bin.066 55061448)
- 20:13 brion: s3 switched from db1 (db1-bin.009 496276016) to samuel (samuel_bin_log.001, 79)
- 19:54 brion: noticed amane->storage2 rsync segfaulted again. starting another one skipping thumb directories, will fiddle with updating and investigating further later
- 19:48 brion: doing master switches to prepare for final schema updates this weekend
August 16
- 16:21 Rob: srv135 back up, needs bootstrap for setup.
- 16:01 brion: think I got the mail issue sorted out. Using sendmail mode (really ssmtp), and tweaked ssmtp.conf on isidore: set the host to match smtp.pmtpa.wmnet, and set FromLineOverride=YES so it doesn't mess up the from address anymore
- 15:42 brion: having a lovely goddamn time with bugzilla mail. setting it back from SMTP to sendmail avoids the error messages with dab.'s email address, but appears to just send email to a black hole. setting back to SMTP for now
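The ssmtp fix described at 16:01 amounts to a config fragment like the following (option names and the relay host are taken from the entry above; everything else is illustrative):

```
# /etc/ssmtp/ssmtp.conf on isidore (fragment)
mailhub=smtp.pmtpa.wmnet   # relay through the internal smtp host
FromLineOverride=YES       # let the message's own From: header stand
```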
August 15
- 3:40 jeluf: Start copy of database from db5 to samuel.
August 14
- 20:30 jeluf: Lots of apache problems on nagios after some updates on InitialiseSettings.php. Restarting all apaches.
- 19:50 Rob: srv63 down. Suspected bad mainboard.
- 19:38 Rob: srv51 back online and sync'd.
- 19:24 jeluf srv66 bootstrapped
- 19:20 Rob: srv133 rebooted. Network port re-enabled. Back online and sync'd.
- 19:18 Rob: db3 had a bad disk. Replaced and reinstalled. Needs setup.
- 18:18 brion: db1 in r/w, with reduced read load. seems working
- 18:12 brion: putting db1 back into rotation r/o
- 18:07 brion: applying SQL onto db1 to recover final changes from relay log which were rolled back in innodb recovery
- 17:28 brion: temporarily put s3/s3a/default to use db5 as r/o master, will put db1 back once it's recovered
- 17:24 brion: put s3/s3a/default to r/o while investigating
- 17:20 Rob: db1 crashed! Checked console, kernel panic locked it. Rebooted and it came back up with raid intact.
- 16:33 Rob: srv66 returned from RMA and reinstalled. Requires setup scripts to be run.
- 15:26 Rob: srv134 restarted from heat crash & sync'd.
- 15:17 Rob: srv146 restarted from crash and sync'd.
- 15:00ish brion: restarted amane->storage2 rsync job with rsync 3.0-cvs, much friendlier for huge file trees
- 15:00 Rob: biruni rebooted and FSCK. Back online.
- 15:00 Rob: sq12 cache updated and squid services restarted.
- 14:51 Rob: sq12 rebooted from crash. Back online.
- 14:47 Rob: Rebooted and ran FSCK on srv59. It is back up, needs to be brought back in to rotation.
- Sync'd, and in rotation.
August 13
- 21:00 domas: APC collapsed after sync-file, restarting all apaches helped.
- 18:59 brion: playing with rsync amane->storage2
- 18:17 brion: working on updating the upload copy on storage2. removing the old dump file which eats up all the space, will then start on an internal rsync job
August 11
- 18:50 brion: updated auth server for oai
- 18:34 brion: starting schema updates on slaves
- 15:43 brion: seem to have more or less resolved user-level problems after a bunch of apache restarts. i think there was some hanging going on, maybe master waits or attempts to connect to gone servers
- 15:36 brion: s3 slaves happy now, fixed repl settings
- 15:29 brion: s3 master moved to db1, hopefully back in r/w mode :)
- 15:22 brion: db1/db5/webster appear consistent, so moving s3 master to db1. working on it manually...
- 15:05ish brion: many but not all reqs working in r/o atm
- 14:50ish brion: samuel broken in some way; putting s3/s3a to read-only and falling back to another box temporarily while working it out
August 8
- 21:35 brion: shutting down biruni; read-only filesystem due to 2007-08-07 Biruni hard drive failure
- 21:09 brion: cleaning up srv80, vincent, kluge, hypatia, biruni as well. need to poke at the setup scripts and find out wtf is wrong
- 20:55 brion: same on srv37
- 19:58 brion: updated broken sudoers file on humboldt, was not updating files on scap correctly
- 13:48 brion: adding wikiversity.com redirect
- 13:26 brion: metadata update for commons image row 'Dasha_00010644_edit.jpg' was sticking due to mysterious row locks for last couple days -- may have been related to a reported stickage/outage yesterday. Script run after run after run tried to update the row but couldn't. Finally managed to delete the row so it's no longer trying, but the delete took over three minutes. :P
August 5
- 07:44 Tim: set $wgUploadNavigationUrl on enwiki, left a note on the relevant talk pages
August 4
- 02:10 Tim: Fixed srv7 again
Aug 2
- 06:00 mark: Installed yf1015 for use as HTTPS gateway
Aug 1
- 07:00 domas: amane lighty upgraded
July 29
- 23:43 brion: shut off srv134 via ipmi, since nobody got to it
July 27
- 4:30 jeluf: removed srv120 from the external storage pool, fixed srv130
July 25
- 16:33 brion: srv134 bitching about read-only filesystem, possible hd prob
- 15:44 Rob: srv59 had a kernel panic. Restarted and is now back online.
- 15:38 Rob: will reattached to port 15 of the SCS.
- 15:00 Rob: biruni HDD replaced, FC4 reinstalled. (Had to use 32 bit, system did not support 64.) Requires scripts run and server to be put in rotation.
July 24
- 21:59 brion: srv59 is down; replaced it in memcache pool with spare srv61.
- 02:38 brion: fixed bugzilla queries; had accidentally borked the shadow db configuration -- disabled that (for srv8) a few hours ago due to the earlier reported replication borkage, but did it wrong so it was trying to connect to localhost
July 21
- 22:30 mark: srv7 was out of space again, deleted a few bin logs. Replication I/O thread on srv8 is not running and doesn't want to come up, can any of you MySQL heads please fix that? :)
- 19:26 Tim: db3 is down, removed from rotation.
- 11:38 mark: Installed Ubuntu Feisty on yf1016 for use by JeLuF
July 19
- 19:40 brion: importing pywikipedia svn module
July 18
July 17
- 18:56 brion: set up and mounted upload3 and math mounts on vincent -- these were missing, probably causing bugzilla:10610 and likely some upload-related problems.
- 15:08 river: borrowing storage1 to dump external text clusters
- 13:51 Rob: Rebooted storage1 from a kernel panic.
- 13:50 Rob: biruni filesystem in read-only. Rebooted and running FSCK.
- HDD is toast. Emailed SM for RMA.
July 15
- 15:06 Tim: Biruni was hanging on various operations such as ordinary ssh login, or a "sync" command. Restarted using "echo b > /proc/sysrq-trigger" in a non-pty ssh session.
- Probably hung during startup, ping but no ssh
July 14
- 19:35 mark: Fixed a problem with our DNS SOA records
- 13:33 Tim: put ex-search servers vincent, hypatia, humboldt, kluge, srv37 into apache rotation. Fixed ganglia and nagios.
July 13
- 23:20 mark: Disabled Oscar's accounts on boardwiki/chairwiki per Anthere's request
July 12
- 18:35 brion: pa.us.wikimedia.org up.
- ~12:10 Tim: restarted postfix on leuksman, causing a flood of messages delivered to various locations.
- 11:44 Tim: running setup-apache on ex-search servers (vincent, hypatia, humboldt, kluge, srv37)
- 11:38 mark: Set up Quagga on mint as a test box for my BGP implementation, gave it a multihop BGP feed from csw5-pmtpa / AS14907
- 11:02 Tim: reassigning srv57 and srv58 to search
July 11
- 13:10 Tim: updating lucene for enwiki, will be moving on to the other clusters shortly
- 12:49 Tim: removed /etc/crond.d/search-restart from search servers and restarted crond
July 10
- 21:30 brion: installed DeletedContributions ext
- 19:20 mark: Brought knsq6 back up, DIMM hopefully replaced
- 19:20 jeluf: setting skip-slave-start on read-only external storage clusters.
July 9
- 17:53 brion: temporarily taking db4 out of rotation cause people freak out about lag warnings
- 14:18 brion: running schema changes patch-backlinkindexes.sql on db1/db4 so they don't get forgotten
- 13:55 brion: fixed oai audit db setting (broken by master switch)
July 8
- 20:47 jeluf: added external storage cluster #13. Removed #10 from the write list.
- 18:24 Tim: changed cron jobs on the search servers to restart lsearchd instead of mwsearchd.
- 17:05 brion: switched s1 master, started the rest of the latest schema changes
- 17:00 brion: switched s2 master
- 16:54 brion: switched s3 master
- 16:23 brion: schema changes index tweaks done on slaves, waiting for a master switch to complete
July 7
- 20:10 jeluf: cleaned up binlogs on srv95.
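Cleanup entries like this one usually start with a triage step: list the biggest files under the data directory before purging old binlogs. Runnable sketch; the directory is an illustrative temp tree, not srv95's datadir:

```shell
# Find the largest files under a directory before deleting anything.
datadir=$(mktemp -d)
dd if=/dev/zero of="$datadir/srv95-bin.001" bs=1024 count=64 2>/dev/null
dd if=/dev/zero of="$datadir/srv95-bin.002" bs=1024 count=8 2>/dev/null
biggest=$(du -k "$datadir"/* | sort -rn | head -n 1 | cut -f2)
echo "$biggest"
```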
July 6
- 21:16 brion: did rs.wikimedia.org updates -- imported pages and images from old offsite wiki, set up redirects from *.vikimedija.org domains that are pointed to us
- 19:55 brion: created mediawiki-api list
- 19:39 brion: starting schema changes run for index updates
- 19:25 Tim: restarted PHP on anthony to fix *.wap.wikipedia.org
July 5
- 21:59 river: taking clematis and 5TB space on the backup array to test replicating external text to knams
- 14:07 Rob: Reinstalled srv135 with a new HDD. Needs setup scripts run.
- 13:55 Rob: Rebooted srv99 from a kernel panic. It is back up.
- 13:20 Rob: adler offline, will not boot.
- 13:09 Rob: replaced cable for rose, shows at full duplex speed again.
July 4
- 21:40 jeluf: cleaned up disk space on srv95, 120, 126
July 3
- 20:54 brion: Closed out inactive wikimedical-l list
- 15:56 brion: lag problem resolved. stop/start slave got them running again; presumably the connections broke due to the net problems but it thought they were still alive, so didn't reconnect
- 15:53 brion: lag problems on enwiki -- all slaves lagged (approx 3547 sec), but no apparent reason why
- 15:00ish? mark did firmware updates on the switch and something didn't come back up right and everything was dead for a few minutes
- 00:52 river: temporarily depooled thistle to dump s2.
July 2
- 18:54 brion: set bugzilla to use srv8 as shadow database
- 18:53 brion: replication started on srv8
- 18:39 brion: srv7 db back up, putting otrs back in play. fiddling with replication
- 18:25 Tim: bootstrapping srv80
- 18:24 brion: shut down srv7 db to copy
- 17:54 brion: shutting down srv8 db and clearing space for copy from srv7. otrs mail is being queued per mark
- 17:40 Tim, rainman: installing LS2 for the remaining wikis. Splitting off a new search pool, on VIP 10.0.5.11.
July 1
- 18:38 mark: Upgraded lily to Feisty, including a new customized Mailman, and a newer PowerDNS
- 13:15 mark: Upgraded asw-c3-pmtpa and asw-c4-pmtpa firmware to 3100a
Archives
- Server admin log/Archive 1 (2004 Jun - 2004 Sep)
- Server admin log/Archive 2 (2004 Oct - 2004 Nov)
- Server admin log/Archive 3 (2004 Dec - 2005 Mar)
- Server admin log/Archive 4 (2005 Apr - 2005 Jul)
- Server admin log/Archive 5 (2005 Aug - 2005 Oct)
- Server admin log/Archive 6 (2005 Nov - 2006 Feb)
- Server admin log/Archive 7 (2006 Mar - 2006 Jun)
- Server admin log/Archive 8 (2006 Jul - 2006 Sep)
- Server admin log/Archive 9 (2006 Oct - 2007 Jan)
- Server admin log/Archive 10 (2007 Feb - 2007 Jun)