Server admin log/test

October 18

  • 00:03 Brion: trying a log thingy

October 17

  • 21:10 brion: enabled Commons foreign image repo on Wikitech
  • 18:45 brion: created Wikimedia-Boston list for SJ
  • 16:55 brion: adding nomcomwiki to special.dblist so it shows up right in sitematrix
  • 16:45 brion: deleted some junk comments from bugzilla
  • 16:31 brion: changed autoconfirm settings for 'fishbowl' wikis -- 0 age for autoconfirm, plus set upload & move for all users just in case autoconfirm doesn't kick in right (see the config sketch after this list)
  • 14:22 RobH: srv131 back up.
  • 09:03 Tim: copying srv129 and srv139 ES data directories to storage2:/export/backup
  • 02:49 Tim: excessive lag on db16, killed long-running queries and temporarily depooled. CUPS odyssey continues.
  • 01:59 Tim: removing cups on all servers where it is running
  • 00:00 RobH: restarted srv43-47
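
A minimal sketch of what the 16:31 autoconfirm change above could look like in LocalSettings-style PHP. The wiki name and the if-block mechanism are placeholders (the live config drives per-wiki settings from dblists), so treat this as illustration only:

  // Hypothetical per-wiki block; 'examplefishbowlwiki' is a placeholder name.
  if ( $wgDBname === 'examplefishbowlwiki' ) {
      // No minimum account age before autoconfirm kicks in.
      $wgAutoConfirmAge = 0;
      // Belt and braces: grant upload & move to all registered users
      // in case autoconfirm doesn't apply as expected.
      $wgGroupPermissions['user']['upload'] = true;
      $wgGroupPermissions['user']['move']   = true;
  }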

October 16

  • 20:42 brion: added 3 more dump threads on srv31... we need to find some more batch servers to work with for the time being, until the new dump system is in place :)
  • 20:20 RobH: pulled samuel from the rack, decommissioned, RIP samuel.
  • 19:35 RobH: migrated rack B4 from asw3 to asw-b4-pmtpa.
  • 18:40 RobH: rebooted scs-ext, oops!
  • 18:26 RobH: srv61 reinstalled and redeployed.
  • 18:24 RobH: Adler re-racked with rails, booted up to maintenance mode prompt.
  • 17:34 mark: the 208.80.152.0/25 NTP restriction is still not broad enough -- changed it to /22 in ntpd.conf on zwinger (see the sketch after this list)
  • 17:02 brion: thumbnails on commons are insanely slow and/or broken
  • 14:44 Tim: added a more comprehensive redirection list to squid.conf.php for storage1 images
  • 14:04 Tim: redirected images for /wikipedia/en/ to storage1, apparently they were moved a while ago. Refactored the relevant squid.conf section.
  • 13:38 Tim: disabled directory index on amane; it was generating massive amounts of NFS traffic building directory indexes for some timeline directories.
  • 12:51 Tim: increased memory limit on srv159 to 8x200MB. Still well under physical.
  • 11:38 Tim: cleaned up temporary files on srv159, had filled its disk
  • 11:25 Tim: synced upload scripts (including to ms1)
  • 10:06 Tim: removed sq50 from the squid node lists and uninstalled squid on it
  • 09:22 - 09:52 mark, Tim, JeLuF: initial attempts to bring the squids back up failed due to incorrect permissions on the recreated swap logs. Most were back up by around 09:32, except newer knams and yaseo squids which were missing from the squids_global node group. The node group was updated and the remainder of the squids brought up around 09:52.
  • 09:19 JeLuF: deployed squid.conf with an error in it. All squid instances exited.
  • 08:26 Tim: Restarted ntpd on search7, was broken
  • 06:42 Tim: ntp.conf on zwinger had the wrong netmask for the 208.x net (/26 instead of /25), so a lot of squids were excluded, and some had a clock skew of 10 minutes (as visible on ganglia). Fixed ntp.conf (sketch below); clocks not stepped yet, which will affect squid logs.
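
For the two ntpd.conf fixes above (17:34 and 06:42), a sketch of the restrict line on zwinger. Only the prefix and mask changes come from the log; the nomodify/notrap flags are assumed:

  # Too narrow: first /26 (mask 255.255.255.192), then /25 (mask 255.255.255.128).
  # restrict 208.80.152.0 mask 255.255.255.128 nomodify notrap
  # Broad enough to cover the whole server range: /22.
  restrict 208.80.152.0 mask 255.255.252.0 nomodify notrap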

October 15

  • 19:49 brion: added '<span onmouseover="_tipon' to the spam regex; some kind of weird edit submissions coming in with this stuff, like [1] (see the sketch after this list)
  • 12:00 Tim: trying to bring srv159 up as an image scaler. Limiting memory usage to 8x100 = 800MB with MediaWiki (see the sketch after this list).
  • 11:21 srv127 died just the same. Mark suggests using one with DRAC next.
  • 10:20 Tim: all image scalers (srv43 and srv100) swapped to death again. Preparing srv127 as an image scaler with swap off.
  • 08:43 Tim: reduced depool-threshold for the scalers to 0.1 since srv100 is quite capable of handling the load by itself while we're waiting for the other servers to come back up.
  • 07:45 Tim: half the scaling cluster went down again, ganglia shows high system CPU. Installing wikimedia-task-scaler on srv100.
  • 02:30 Tim: moved image scalers into their own ganglia cluster
  • 02:17 Tim: apache on srv43-47 hadn't been restarted and so was still running without -DSCALER. This partially explains the swapping. Restarted them. Took srv38-39 back out of the image scaler pool, they have different rsvg and ffmpeg binary paths and break without a MediaWiki reconfiguration.
  • 02:13 tomasz: upgraded srv9 to ubuntu 8.04
  • 02:00 tomasz: upgraded srv9 to ubuntu 7.10
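
Hedged sketches of the MediaWiki-side settings behind the 19:49 spam-regex entry and the 12:00 scaler memory limit. The production spam pattern has more alternatives than shown here, and reading "8x100" as eight concurrent workers at 100MB each is an interpretation, not something the log states:

  // Reject edits containing the popup-spam markup (fragment from the log).
  $wgSpamRegex = '/<span onmouseover="_tipon/i';

  // Cap shell (convert/rsvg) memory per scaling job: 100MB, expressed in KB.
  // Eight concurrent scaling jobs then bound the box at about 800MB.
  $wgMaxShellMemory = 102400;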

October 14

  • 19:16 brion: restarted lighty on storage1 again -- it was back in 'fastcgi overloaded' mode, possibly due to the previously broken backend, possibly not
  • 19:11 mark: Pooled old scaling servers srv38, srv39
  • 18:50 brion: at least four of the new image scalers are down -- can't reach them by SSH. thumbnailing is borked
  • 16:41 brion: fixed image scaling for now -- storage1 fastcgi backends were overloaded, so it was rejecting things. did some killall -9s to shut them all down and restarted lighty. ok so far
  • 16:20 brion: image scaling is broken in some way, investigating
  • 02:54 Tim: fixed srv43-47, this is now the image scaling cluster
  • 00:10 Tim: oops, forgot to add VIPs, switched back.
  • 00:05 Tim: switched image scaling LVS to srv43-47

October 13

  • 23:45 Tim: prepping srv43-47 as image scaling servers
  • 21:45 jeluf: moved more image directories to ms1. Now, upload/wikipedia/[abghijmnopqrstuwxy]* are on ms1
  • 21:35 jeluf: killed mwsearchd on srv39, removed both the rc3.d link and the cronjob that start mwsearchd
  • 21:30 RobH: search8 and search9 are online, awaiting configuration.
  • 21:15 brion: thumb rendering failures reported... found some runaway convert procs poking at an animated GIF, killed them.
    • rev:42058 will force GIFs over 1 megapixel to render a single frame instead of an animation, as a quick hackaround... (see the sketch after this list)
  • 20:48 domas: thistle serving as s2a server
  • 20:28 RobH: stopping mysql on adler so it can be re-racked with rails.
  • 19:53 RobH: search7 back online, awaiting addition to the search cluster.
  • 19:35 mark: Set up an Exim instance on srv9 for outgoing donation mail, as well as incoming for delivery into IMAP for CiviMail (*spit*).
  • 17:00 RobH: srv21-srv29 decommissioned and unracked.
  • 12:05 domas: put lomaria back in rotation
  • 11:50 domas: Enabled write-behind caching on db15. Restarted.
  • 10:40 domas: restarted replication on db15 and lomaria
  • 10:27 domas: loading dewiki data from SQL dump into thistle
  • 09:09 Tim: restarted logmsgbot
  • 08:27 Tim: folded s2b back into s2
  • 08:06 Tim: db13 in rotation
  • 08:02 domas: copying from db15 to lomaria
  • 07:38 Tim: started replication on db13
  • 04:51 Tim: copying
  • 03:27 Tim: Preparing for copy from db15 to db13
  • 00:00 domas: something wrong with db15 i/o performance. it is behaving way worse than it should.
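
A sketch of the GIF hackaround mentioned under the 21:15 entry above. $wgMaxAnimatedGifArea is the setting with this behaviour in later MediaWiki releases; that rev:42058 is the revision introducing it is an assumption here:

  // Animated GIFs above ~1 megapixel are thumbnailed as a single frame
  // instead of an animation, so a runaway convert can't eat the scalers.
  $wgMaxAnimatedGifArea = 1000000;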

October 12

  • 23:58 brion: updated CodeReview to add a commit so the load balancer saves our master position. playing with the serverstatus extension on yongle to find out wtf is making it get stuck
  • 22:05 brion: db15 sucks hard. putting categories back to db13
  • 22:01 brion: db15 got all laggy with the load. taking it back out of general rotation, leaving it on categories/recentchangeslinked
  • 21:58 brion: db15 seems all happy. swapping it in to replace db13, and giving it some general load on s2. we'll have to resync db13 at some point? and toolserver?
  • 19:41 Tim: shutting down db15 for restart with innodb_flush_log_at_trx_commit=2. But db8 seems to be handling the load now so I'm going to bed.
  • 19:20 Tim: depooled db15.
  • 19:09 Tim: split off some wikis into s2b and put db8 on it, to reduce I/O and hopefully stop the lag.
  • 18:51 Tim: db15 still chronically lagged. Offloading all s2 RCL and category queries to db13.
  • 18:38 Tim: offloading commons RCL queries to db13
  • 18:36 Tim: dewiki r/w with ixia (master) only
  • 18:33 Tim: offloading commons category queries to db13
  • 18:25 Tim: balancing load. Fixed ganglia on various mysql servers.
  • 18:06 Tim: going to r/w on s2. Not s2a yet because db15/db8 can't handle the load.
  • 17:46 Tim: db8->db15 copy finished, deploying
  • 17:33 Tim: installed NRPE on thistle.
  • 16:54 Tim: copied mysqld binaries from db11 to db15 and thistle. Plan for thistle is to use it for s2a.
  • 16:40 Tim: ixia/db8 can't handle the load between them with db13 out, even with s2a diverted. Restored db13 to the pool. Running out of candidates for a copy destination. Need db13 in because it's keeping the site up, can't copy to thistle because it's too small with RAID 10. Plan B: set up virgin server db15. Copying from db8.
  • 16:07 Tim: repooled ixia/db8 r/o
  • 15:53 Tim: removed ixia binlogs 290-349. 270-289 were deleted during the initial response.
  • 14:54 mark: Pooled search6 as part of search cluster 2, by request of rainman
  • 14:37 Tim: deployed r41995 as a live patch to replace the buggy temp hack.
  • 14:14 Tim: cleaned up binlogs on db2. Yes the horse has bolted, but we may as well shut the gate.
  • 14:11 Tim: copy now in progress as planned.
  • 13:48 Tim: going to try the resync option. Maybe with s2 it won't take as long as s1. Will try to sync up db8 from ixia with db13 serving read-only load for the duration of the copy.
  • 13:40 Tim: ixia (s2 master) disk full. Classic scenario: binlogs stopped first, writing continued for 10 minutes before replag was reported.
  • 13:00 jeluf: moved wikipedia/m* image directories to ms1
  • 08:00 jeluf: restarted lighttpd on ms1, directory listings are now disabled (see the sketch below).
  • 02:55 Tim: attempted to disable directory listing on ms1. Gave up after a while.
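
The lighttpd knob behind the 08:00 entry (and the October 16 amane fix), as a sketch; the rest of the ms1 config is assumed:

  # lighttpd.conf on ms1: never generate directory indexes for the upload tree.
  dir-listing.activate = "disable"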

October 11

  • 07:00 jeluf: moved wikipedia/s* image directories to ms1

October 10

  • 21:30 jeluf: moved wikipedia/[jqtuwxy]* to ms1
  • 19:20 RobH: Bayes online.
  • 19:11 brion: recreated special page update logs in /home/wikipedia/logs, hopefully fixing special page updates
  • 13:05 Tim: reverted live patch and merged properly tested fix r41928 instead.
  • 12:31 Tim: deployed a live patch to fix a regression in MessageCache::loadFromDB() concurrency limiting lock
  • 12:17 domas: killed long running threads
  • ~12:04: s2 down due to slave server overload

October 9

  • 22:52 brion: enabled Collection on de.wikibooks so they can try it out
  • 20:00 jeluf: moved wikipedia/i* images to ms1
  • 17:05 RobH: thistle RAID died due to a failed HDD; replaced the HDD, reinstalled as RAID 10.
  • 12:00 domas: switched s3 master to db1; accidentally erased a bunch of db.php stuff (don't know how :). restored from db.php~ :-)
  • 09:31 mark: pascal died yet again, revived it. Will move the htcp proxy tonight...

October 8

  • 21:05 brion: yongle still gets stuck from time to time, breaking mobile, apple search, and svn-proxy. i suspect svn-proxy but still can't easily prove it. it's using a separate svn command now (in theory), but it's not showing me stuck processes.
  •  ??:?? rob fixed srv37, and later srv133, back into the mediawiki-installation node group. he did an audit and didn't see any other problems. i ran a scap to make sure all are now up to date (see the sketch after this list)
    • Speculation: possible that rumored ongoing image disappearances have been caused by the image-destruction bug still being in place on srv133 for the last month.
  • 19:02 mark: Upgraded packages on search1 - search6 and searchidx1
  • 18:59 brion: aaron complaining of srv37 not properly updated (doesn't recognize Special:RatingHistory). flaggedrevs.php was out of date there. checking scap infrastructure, stuff seems ok so far...
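
scap pushes code to the hosts listed in the mediawiki-installation dsh node group, so a host missing from that group file silently stops receiving updates (as srv37 and srv133 did). A sketch of the group-file format; the path and the idea that these two hostnames appear on their own lines are assumptions:

  # /etc/dsh/group/mediawiki-installation -- one apache hostname per line
  srv37
  srv133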

October 7

  • 21:47 brion: started two dump threads (srv31)
  • 21:16 RobH: installed and configured gmond on all knams squids.
  • 21:00 jeluf: moved wikipedia/g* to ms1
  • 18:55 RobH: fixed private uploads issue for arbcom-en and wikimaniateam.
  • 17:26 RobH: reinstalled and redeployed knsq24 and knsq29
  • 15:00-16:00 robert: switched enwiki to lucene-search 2.1 running on new servers. Test run till tomorrow; if anything goes wrong, reroute search_pool_1 to the old searchers on lvs3. Will switch on spell checking when all of the servers are racked. Thanks RobH for tuning config files.
  • 15:54 RobH: srv101 crashed again, running tests.
  • 15:45 RobH: srv146 was powered down for no reason. Powered back up.
  • 15:42 RobH: srv138 locked up, rebooted, back online.
  • 15:32 RobH: srv110 was locked up, rebooted, synced, back online.
  • 15:31 RobH: srv101 back up and synced.
  • 15:22 RobH: rebooted srv56, was locked up, handed off to rainman to finish repair.
  • 15:21 RobH: updated lucene.php and synced.
  • 15:04 RobH: updated memcached to remove srv110 and add in spare srv137 (see the sketch after this list).
  • 15:00 RobH: removed all servers from lvs:search_pool_1 and put in search1 and search2 with rainman
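
A sketch of the kind of edit the 15:04 entry describes in the memcached server list (a PHP array in the style of mc-pmtpa.php); the IPs and ports here are placeholders, not the real ones:

  // Swap locked-up srv110 out for spare srv137.
  $wgMemCachedServers = array(
      '10.0.0.101:11000',  // srv101 (placeholder address)
      // '10.0.0.110:11000',  // srv110 -- removed, box locked up
      '10.0.0.137:11000',  // srv137 -- spare, swapped in
  );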

October 6

  • 23:55 brion: tweaked bugzilla to point rXXXX at CodeReview instead of ViewVC
  • 14:36 RobH: setup ganglia on all pmtpa squids.
  • 14:29 domas: amane lighty was closing connections immediately; it worked properly after a restart. Upgraded to 1.4.20 on the way.
  • 13:50 mark: The slow page loading on the frontend squids appears to be limited to the English main page only, for unknown reasons. Set another article as the PyBal check URL to prevent pooling/depooling oscillation by PyBal for now (see the sketch after this list).
  • 09:27 mark: yaseo squids are fully in swap, set DNS scenario yaseo-down
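
Context for the 13:50 entry: PyBal's ProxyFetch monitor fetches a check URL from each backend; if that URL is slow everywhere, servers get depooled, recover, get repooled, and oscillate. A sketch of the relevant service stanza, with option names as in later public PyBal releases and every value a placeholder:

  [apaches]
  protocol = tcp
  ip = 10.2.1.1
  port = 80
  monitors = [ "ProxyFetch", "IdleConnection" ]
  ; Fetch a cheaper article than the English main page (article chosen here is made up):
  proxyfetch.url = [ 'http://en.wikipedia.org/wiki/Perth' ]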

October 5

  • 23:14 mark: Frontend squids are not working well at the moment, sometimes serving cached objects with very high delays. I wonder if they are under (socket) memory pressure. Reduced cache_mem on the backend instance on sq25 to free up some memory for testing.
  • 20:35 jeluf: wikipedia/b* moved, too
  • 19:00 jeluf: switched squids to send requests for upload.wikimedia.org/wikipedia/a* to ms1 (see the squid.conf sketch after this list)
  • 14:30 jeluf: Moving all wikipedia/a* image directories to ms1
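
A sketch of how the 19:00 change can be expressed in squid.conf (the real config is generated from squid.conf.php; the peer names, the ms1 hostname, and the existing amane peer are assumptions):

  # Route upload paths starting with /wikipedia/a to the ms1 backend.
  acl ms1_paths urlpath_regex ^/wikipedia/a
  cache_peer ms1.wikimedia.org parent 80 0 no-query originserver name=ms1
  cache_peer_access ms1 allow ms1_paths
  cache_peer_access amane deny ms1_paths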

October 4

  • 23:17 mark: Repooled knsq16-30 frontends in LVS. Also found that mint was fighting with fuchsia over being LVS master, due to the reboot this afternoon.
  • 14:30 mark: Several servers in J-16 shut down or went down around this time. Reason unknown; possibly an automatic shutdown due to high temperature, possibly they were turned off by someone locally.
  • 14:03 mark: SARA power failure. Feed B lost power for ~ 6 seconds.
  • 00:26 mark: Depooled srv61
  • 00:07 brion: found srv37 and srv61 have broken json_decode (wtf!)
    • updating packages on srv37. srv61 seems to have internal auth breakage
    • updated packages on srv61 too. su still borked, may need LDAP fix or something?

October 3

  • 21:40 brion: transferring old upload backups from storage2 to storage3. once complete, can restart dumps!
  • 20:01 brion: running updateRestrictions on all wikis (done)
  • 17:51 RobH: srv135 & srv136 reinstalled as ubuntu.
  • 17:34 RobH: srv132 & srv133 reinstalled as ubuntu.
  • 17:13 RobH: srv130 back online.
  • 16:40 RobH: depooled srv131, srv132, srv135, srv136 for reinstall.
  • 00:25 brion: switched codereview-proxy.wikimedia.org to use local SVN command instead of PECL SVN module; it seemed to be getting bogged down with diffs, but hard to really say for sure

October 1

  • 20:02 RobH: srv63 back online.
  • 19:35 RobH: srv61 and srv133 back online.
  • 18:22 RobH: storage3 online and handed off to brion.
  • 17:35 RobH: updated mc-pmtpa.php to put srv61 as spare.
  • 17:32 RobH: srv61 faulty fan replaced, back online.
  • 09:31 Tim: srv104 (cluster18) hit max_rows, finally. Removed it from the write list (see the sketch after this list).
  • 08:36 Tim: fixed ipb_allow_usertalk default on all wikis
  • 23:46 mark: Reinstalled knsq24
  • 22:55 mark: Reenabled switchports of knsq16 - knsq30
  • 20:45 jeluf: fixed resolv.conf on srv131
  • 20:45 jeluf: mounted ms1:/export/upload as /mnt/upload5, started lighttpd on ms1
  • 19:47 brion: enabled revision deletion on test.wikipedia.org for some public testing.
  • 14:25 RobH: Cleaned out the squid cache on knsq16, knsq17, knsq18, knsq19, knsq21, knsq22, knsq23, knsq25, knsq26, knsq27, knsq28, knsq30. DRAC not responsive on knsq20, knsq24, knsq29.
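
When an external storage cluster hits MySQL's max_rows, it is dropped from the write list so new revision text goes to the remaining clusters; existing text rows still point at it and stay readable. A sketch of the MediaWiki-side list; every cluster name except cluster18 is a placeholder:

  // Clusters still accepting new text writes; cluster18 removed after
  // hitting max_rows. Existing cluster18 blobs remain readable.
  $wgDefaultExternalStore = array(
      'DB://cluster19',
      'DB://cluster20',
  );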
