Server admin log/Archive 20

November 2

  • 15:06 river: added missing /mnt/upload5 mount on several apaches: srv37 srv61 srv76 srv69 srv63 srv118 srv132 srv135 srv133 srv138 srv136
  • 14:49 domas: a few missing .frm files on db18 were causing trouble, resynced them from db19, resumed replication
  • 13:02 river: copying en from storage1 to ms1
  • 10:49 domas: replaced XFS with JFS on db18, installed ganglia on db17-db30
  • 10:36 river: completed move of commons, now being served from ms1 (except archive/)

November 1

  • 22:48 brion: fixed ContributionReporting to force a utf8 connection, now loads names in right charset
  • 22:20 brion: fixed $wgNoticeInfrastructure setting; defaults must have changed at some point
  • 22:15 domas: installed wikimedia-mysql4 on db21-23, established s1,s2,s3 replication. we now have a full database copy in sdtpa \o/
  • 20:53 brion: deploying CentralNotice editing system on meta, woo
  • 20:27 brion: scapping to update reporting and centralnotice bits internally
  • 19:38 brion: rescapping to make sure 159 is unbroken
  • 19:27 brion: svn up'ing on wikitech just for domas
  • 19:25 brion: srv159 is out of space
    • We need to clean out the damn temp files somehow, eh?
  • 19:20 brion: scapping to update ContributionReporting ext
  • 12:56 mark: uppreffed traffic from knams to pmtpa via 6908/2828, as existing peering path had slight packet loss
  • 11:25 Tim: enabled subpages in the main namespace by default for all Wikisource wikis. This appears to be a de facto standard and is used by all Wikisources with an entry in wgNamespacesWithSubpages. (see sketch below)
  • 07:55 Tim: disabled ParserDiffTest, obsolete
  • 07:06 mark: XO circuit back up:
[vl101-ve5.csw5-pmtpa.wikimedia.org] BGP peer 2610:18:10a::1 <2610:18:10a::1>, session is now up
[vl101-ve5.csw5-pmtpa.wikimedia.org] BGP peer 207.88.246.5 <w005.z207088246.xo.cnc.net>, session is now up
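  Config sketch for the 11:25 entry (illustrative only: the exact per-wiki mechanism in Wikimedia's config files isn't shown in this log, so this is plain LocalSettings-style PHP):
    $wgNamespacesWithSubpages[NS_MAIN] = true;   // enable subpages in the main (article) namespace on a Wikisource wiki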

October 31

  • 23:11 brion: set up some logs for fundraising banner campaign clicks for later mining
  • 17:44 brion: adding support for Tomas skin on wikimediafoundation.org for new fundraiser templates
  • 14:24 mark: XO circuit went down:
[vl101-ve5.csw5-pmtpa.wikimedia.org] BGP peer 207.88.246.5 <207.88.246.5>, session is now down because <Port State Down>
[vl101-ve5.csw5-pmtpa.wikimedia.org] BGP peer 2610:18:10a::1 <2610:18:10a::1>, session is now down because <Port State Down>

October 30

  • 23:11 Tim: fixed disk space on srv159, db1, srv103
  • 19:03 brion: updated triggers for donation reporting database a few minutes ago
  • 18:14 RobH: moved ms1 from pmtpa:a4 to sdtpa:a1, it's back online.
  • 17:46 RobH: db26 OS installed and online
  • 17:28 brion: added a spam filter rule for private-l messages :)
  • 04:54 river: testing sun web server on ms1
  • 03:56 brion: updating squid conf to send upload /centralnotice to storage1 for testing
  • 03:53 brion: tweaked lighttpd config on storage1 for centralnotice static file testing, since amane's configuration is too crappy to support regexes needed to set headers on a directory
  • 02:59 brion: poking experimental expires options on amane for static centralnotice tests
  • 02:44 brion: brion broke lighttpd.conf briefly

October 29

  • 22:39 brion: enabling $wgCodeReviewENotif experimentally (see sketch below)
  • 18:35 brion: disabled bitmap fonts in fontconfig on image scalers, seems to help with the "mad helvetica" problem
  • 18:02 RobH: db28 & db29 OS installed and online.
  • 17:59 brion: fixed some upload directory perms on foundationwiki
  • 17:12 RobH: db27 OS installed and online.
  • 16:54 RobH: db21 OS installed and online.
  • 16:38 RobH: db22, db23, db25, db30 were installed yesterday, forgot to admin log it, sorry ;/
  • 14:44 _mary_kate_: copying wikipedia/commons/thumb/4 from storage1 to ms1
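  Config sketch for the 22:39 entry (illustrative; the setting name comes from the entry itself, the value is just the obvious "enabled" assumption):
    $wgCodeReviewENotif = true;   // experimentally turn on CodeReview e-mail notifications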

October 28

  • 20:02 domas: re-enabled db16
  • 18:03 mark: Removed blackholes.securitysage.com from lily's spamassassin configuration
  • 17:52 domas: db16 fubar'ed by queries that built 100GB temporary tables, leading to jfs hangs, leading to an unhappy kernel.
  • 15:23 RobH: updated dsh node group ALL, added backup of frontend data for bugzilla and blogs from isidore to tridge.
  • 12:33 rainman-sr: experimentally turning on "did you mean.." on search8,9 for enwiki
  • 10:44 mark: Reverted yesterday's search changes

October 27

  • 23:24 mark: Switched to lucenesearch 2.1 for all wikis
  • 23:06 mark: pooled search8 as the only search server in search pool 3
  • 22:25 mark: rainman-sr is making me do more ugly things to lucene.php
  • 22:22 mark: Pointed search for "all other wikis" hardcoded to search7 in lucene.php
  • 22:14 mark: Added zhwiki and plwiki to lucene search 2.1 pool 2

October 26

  • 15:43 mark: Set up OpenGear serial console server scs-a1-sdtpa
  • 13:37 mark: Set up iBGP between csw1-sdtpa and csw5-pmtpa (IPv4/IPv6)
  • 13:36 mark: Prepared csw1-sdtpa for production deployment (general configuration)
  • 09:56 domas: updated db18 firmware to 2.1.1 (September 2008)
  • 04:31 Tim: fixed the "service_ips" hostgroup in nagios
  • 03:03 Tim: hardware reboot of db18
  • 02:47 Tim: mysqld on db18 apparently hit a kernel bug. It was reported as a zombie but was still using 200% CPU in top. kswapd was simultaneously using 100% CPU. Did not respond to SIGKILL. The non-zombie parent, mysqld_safe, also did not respond to SIGKILL (wchan=flush_cpu_workqueue). Attempted a reboot with shutdown -r.
  • 02:47 brion: tweaked MaxClientsPerChild on yongle to see if that helps with the mysterious hangs i sometimes see where requests seem to get backed up; it's disrupting the CodeReview proxy as well as mobile & Mac Dictionary search

October 25

  • 20:46 brion: scapped to r42573
  • 08:17 Tim: svn up to 42536 for API overload fix. Re-enabling disabled query modules.
  • 05:55 Tim: svn up/scap to 42531 (for properly tested Interwiki.php fix).
  • 05:09 Tim: DB overload on many enwiki slave servers. Long running queries attributed to ApiQueryAllpages, ApiQueryBacklinks, ApiQueryCategoryMembers and ApiQueryLogEvents. Disabled those modules and killed related running threads.
  • 05:01 Tim: Interwiki links were broken due to a totally broken and untested getInterwikiCached() function. Live patch deployed at this time.
  • 04:33 Tim: Fixed svn conflicts in two files. Scap to r42524.
  • 04:20 Tim: disabled Drafts extension on test.wikipedia.org. Trevor, please contact me for code review.
  • 04:11 Tim: synced php-1.5 to srv35 and ran "make -B" in the serialized directory. Seems to have fixed test. Will scap.
  • 01:01 ariel: preemptively up mail quota to 7GB from 1GB for cbass, dmenard
  • 00:59 brion: testwiki is borked until we figure out how to get it to load updated message files. tried disabling $wgLocalMessageCache and $wgCheckSerialized to no effect (see sketch below)
  • 00:51 brion: temporarily blocking scap during testing :) ... running serialized language file updates for test, broken by need to get magic word updates
  • 00:44 brion: preparing a svn up...
  • 00:37 ariel: up msecoquian's mail quota from 1GB to 6.9GB
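  Config sketch for the 00:59 entry (illustrative of what disabling those two settings looks like; both names come from the entry, the values are assumptions):
    $wgLocalMessageCache = false;   // don't use the local on-disk message cache
    $wgCheckSerialized   = false;   // skip the freshness check on serialized message files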

October 24

  • 23:12 brion: set up ariel (the person) on sanger to do mail administration -- quota fixes etc
  • 16:24 TimStarling: reloaded ourusers.sql on all core and ext. mysql servers, adding a nagios user
  • 15:39 mark: slacking
  • 15:36 TimStarling: added special nagios user to ES instances on clematis
  • 14:00 domas: re-enabled db5, added db18 to s3
  • 10:45 domas: taking out db5 for copy to db18
  • 10:44 domas: fixed ntpd on bart, was pointing to multicast address that doesn't work
  • 09:57 Tim: removed decommissioned servers from monitoring: dryas, alrazi, diderot, friedrich, samuel
  • 07:50 Tim: added monitoring for toolserver ES clusters 17-19
  • 07:40 Tim: regenerated trusted XFF list with extra SAIX proxies
  • 05:00 Tim: fixed nagios check script handling of MySQL connection errors
  • 01:37 brion: setting $wgLicenseURL for Collection to point at GFDL English text (see sketch below)
  • 01:01 brion: enabling Drafts on testwiki, but it seems to not be saving there... works on my local test, not sure what the issue is
  • 01:03 brion: disabling logentry, still borken?
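  Config sketch for the 01:37 entry (illustrative; $wgLicenseURL is a Collection-related setting per the entry, and the URL below is a placeholder rather than the one actually deployed):
    $wgLicenseURL = 'http://example.org/gfdl-en.html';   // placeholder URL for the English GFDL text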

October 23

  • 22:33 brion: trying to re-enable logentry ext on wikitech, now with cache disabled to avoid edittoken for now
  • 21:34 brion: updating ipblocks table definition
  • 21:25 brion: re-ran svnImport to update path listings for CodeReview
  • 20:11 mark: Set up search7 - search9
  • 17:05 mark: Pooled search4 as a s1 search server to help with dead search2
  • 16:33 brion: updated mw-serve
  • 15:38 Tim: On the image scalers, temporarily mounted /a/tmp as /tmp with --bind to stop the disk full problem while we figure out some better solution
  • 15:24 Tim: removed temporary files on image scalers again
  • 14:54 RobH: Replaced dead disk in amane, rebuilding array.
  • 11:04 Tim: Added disk space monitoring for image scalers. Also added apache monitoring which was also missing.
  • 10:53 Tim: freed up disk space on image scalers, magick-* temporary files were filling their root partitions
  • 10:50 Tim: re-added cluster19 to the default write list. Not sure who took it out or why. (see sketch below)
  • 10:32 Tim: freed up some space on srv103 (was down to 500MB)
  • 10:29 Tim: fixed monitoring for MegaRAID SAS
  • 07:10 Tim: Set up monitoring of RAID status for all Ubuntu DB servers using the wikimedia-raid-utils package that I just wrote. It doesn't do anything on the MegaRAID servers yet, but the Adaptec ones should work.
  • 05:05 Tim: running CodeReview svnImport.php
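  Config sketch for the 10:50 entry (an assumption: that the "default write list" means the external storage write list, $wgDefaultExternalStore; the second cluster below is a placeholder, not a known production value):
    $wgDefaultExternalStore = array(
        'DB://cluster19',   // re-added per this entry
        'DB://cluster20',   // placeholder for whatever else was already on the write list
    );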

October 22

  • 18:26 brion: enabling ODT output for collection
  • 18:17 brion: updating collection and codereview extensions
  • 18:13 Brion: updated mw-serve code and configured to send error emails per jojo's request
  • 17:15 Brion: Changed bugzilla's mail delivery from local sendmail (SSMTP) to direct SMTP, per Mark's recommendation

October 21

  • 19:29 RobH: Bayes upgraded from 2GB to 10GB.
  • 13:49 Tim: Did a demonstration hack of nagios from CSRF to arbitrary shell. Disabled cmd.cgi.
  • 04:13 Tim: Brought srv43-47 up as image scalers with mem limit 6 x 200MB = 1200MB (2GB physical) (see sketch below)
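  Config sketch for the 04:13 entry (illustrative; it assumes the 200MB per-request figure is MediaWiki's shell memory limit and that the factor of 6 is the Apache worker count, set separately in the Apache config):
    $wgMaxShellMemory = 204800;   // 200 MB (value is in KB) per image-scaling shell command
    // 6 workers x 200 MB = 1200 MB worst case, leaving headroom under the 2 GB of physical RAM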

October 20

  • 18:11 RobH: srv118 rebooted, back online.
  • 17:25 RobH: srv79 was in kernel panic, rebooted.
  • 05:10 Tim: increased concurrency on srv159 to 15, for mem limit 15 x 200MB = 3000MB
  • 02:40 Tim: installed NRPE on khaldun and db20
  • 02:20 Tim: moved disk space checks on the ext stores from the "apaches" service group to the relevant ext store service group
  • 01:53 Tim: installed NRPE on the new ext stores
  • 01:45 Tim: Updated /etc/ssh/ssh_known_hosts on bart (copied from zwinger).
  • 00:30-01:30 Tim: Listed down servers on DC tasks. Removed broken servers from memcached rotation. Restarted apache on srv99, srv109, srv123. Purged master binlogs on srv102.

October 18

  • 21:45 RobH's mighty index finger brought amane and the site back up.
  • 21:00 river: Ran 'nc -l -p 623' command, amane's kernel panic'ed. Rob was called.
  • 20:55 mark, river: diagnosed the NFS communication problems to be caused by NIC hardware packet interception of port 623 packets... amane wasn't receiving NFS replies from ms1.
  • 19:40 mark: Upload got unhappy, ms1 NFS mount on amane was unreachable and stalling things
  • 13:40 Tim: down again, single process allocating all memory
  • 07:35 Tim: took it down again, while recording /proc/vmstat and /proc/stat
  • 06:27 Tim: restarted srv160
  • 05:45 Tim: took srv160 into the purple for a much more convincing overload, and different oprofile results
  • 03:40 Tim: used oprofile to determine what part of the kernel is responsible for the system CPU spike. Looks like a spinlock in dnotify.
  • 03:12 Tim: simulated a memory-intensive request rate spike to srv160. Large system CPU response spike, but it didn't go down completely. Will try a bigger one.

October 17

  • 21:10 brion: enabled Commons foreign image repo on Wikitech
  • 18:45 brion: created Wikimedia-Boston list for SJ
  • 16:55 brion: adding nomcomwiki to special.dblist so it shows up right in sitematrix
  • 16:45 brion: deleted some junk comments from bugzilla
  • 16:31 brion: changed autoconfirm settings for 'fishbowl' wikis -- 0 age for autoconfirm, plus set upload & move for all users just in case autoconfirm doesn't kick in right (see sketch below)
  • 14:22 RobH: srv131 back up.
  • 09:03 Tim: copying srv129 and srv139 ES data directories to storage2:/export/backup
  • 02:49 Tim: excessive lag on db16, killed long-running queries and temporarily depooled. CUPS odyssey continues.
  • 01:59 Tim: removing cups on all servers where it is running
  • 00:00 RobH: restarted srv43-47
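  Config sketch for the 16:31 entry (illustrative LocalSettings-style PHP; the actual per-wiki mechanism isn't shown in this log):
    $wgAutoConfirmAge = 0;                           // accounts autoconfirm immediately on these fishbowl wikis
    $wgGroupPermissions['user']['upload'] = true;    // grant upload to all registered users...
    $wgGroupPermissions['user']['move']   = true;    // ...and page moves, in case autoconfirm doesn't kick in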

October 16

  • 20:42 brion: added 3 more dump threads on srv31... we need to find some more batch servers to work with for the time being until new dump system is in place :)
  • 20:20 RobH: pulled samuel from the rack, decommissioned, RIP samuel.
  • 19:35 RobH: migrated rack B4 from asw3 to asw-b4-pmtpa.
  • 18:40 RobH: rebooted scs-ext, oops!
  • 18:26 RobH: srv61 reinstalled and redeployed.
  • 18:24 RobH: Adler re-racked with rails, booted up to maintenance mode prompt.
  • 17:34 mark: 208.80.152.0/25 NTP restriction is actually also not broad enough - changed it to /22 in ntpd.conf on zwinger
  • 17:02 brion: thumbnails on commons are insanely slow and/or broken
  • 14:44 Tim: added a more comprehensive redirection list to squid.conf.php for storage1 images
  • 14:04 Tim: redirected images for /wikipedia/en/ to storage1, apparently they were moved a while ago. Refactored the relevant squid.conf section.
  • 13:38 Tim: disabled directory index on amane. It was generating massive amounts of NFS traffic building directory indexes for some timeline directories.
  • 12:51 Tim: increased memory limit on srv159 to 8x200MB. Still well under physical.
  • 11:38 Tim: cleaned up temporary files on srv159, had filled its disk
  • 11:25 Tim: synced upload scripts (including to ms1)
  • 10:06 Tim: removed sq50 from the squid node lists and uninstalled squid on it
  • 09:22 - 09:52 mark, Tim, JeLuF: initial attempts to bring the squids back up failed due to incorrect permissions on the recreated swap logs. Most were back up by around 09:32, except newer knams and yaseo squids which were missing from the squids_global node group. The node group was updated and the remainder of the squids brought up around 09:52.
  • 09:19 JeLuF: deployed squid.conf with an error in it. All squid instances exited.
  • 08:26 Tim: Restarted ntpd on search7, was broken
  • 06:42 Tim: ntp.conf on zwinger had the wrong netmask for the 208.x net, it was /26 instead of /25. So a lot of squids were out of it, and some had a clock skew of 10 minutes (as visible on ganglia). Fixed ntp.conf, not stepped yet. Will affect squid logs.

October 15

  • 19:49 brion: added '<span onmouseover="_tipon' to spam regex; some kind of weird edit submissions coming in with this stuff, like [1] (see sketch below)
  • 12:00 Tim: trying to bring srv159 up as an image scaler. Limiting memory usage to 8x100 = 800MB with MediaWiki.
  • 11:21 srv127 died just the same. Mark suggests using one with DRAC next.
  • 10:20 Tim: all image scalers (srv43 and srv100) swapped to death again. Preparing srv127 as an image scaler with swap off.
  • 08:43 Tim: reduced depool-threshold for the scalers to 0.1 since srv100 is quite capable of handling the load by itself while we're waiting for the other servers to come back up.
  • 07:45 Tim: half the scaling cluster went down again, ganglia shows high system CPU. Installing wikimedia-task-scaler on srv100.
  • 02:30 Tim: moved image scalers into their own ganglia cluster
  • 02:17 Tim: apache on srv43-47 hadn't been restarted and so was still running without -DSCALER. This partially explains the swapping. Restarted them. Took srv38-39 back out of the image scaler pool, they have different rsvg and ffmpeg binary paths and break without a MediaWiki reconfiguration.
  • 02:13 tomasz: upgraded srv9 to ubuntu 8.04
  • 02:00 tomasz: upgraded srv9 to ubuntu 7.10
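  Config sketch for the 19:49 entry (illustrative; the production filter is a larger pattern, this just shows the quoted fragment added as a PHP regex):
    $wgSpamRegex = '/<span onmouseover="_tipon/i';   // reject edits containing the tooltip-spam markup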

October 14

  • 19:16 brion: restarted lighty on storage1 again -- it was back in 'fastcgi overloaded' mode, possibly due to the previously broken backend, possibly not
  • 19:11 mark: Pooled old scaling servers srv38, srv39
  • 18:50 brion: at least four of the new image scalers are down -- can't reach by SSH. thumbnailing is borked
  • 16:41 brion: fixed image scaling for now -- storage1 fastcgi backends were overloaded, so it was rejecting things. did some killall -9s to shut them all down and restarted lighty. ok so far
  • 16:20 brion: image scaling is broken in some way, investigating
  • 02:54 Tim: fixed srv43-47, this is now the image scaling cluster
  • 00:10 Tim: oops, forgot to add VIPs, switched back.
  • 00:05 Tim: switched image scaling LVS to srv43-47

October 13

  • 23:45 Tim: prepping srv43-47 as image scaling servers
  • 21:45 jeluf: moved more image directories to ms1. Now, upload/wikipedia/[abghijmnopqrstuwxy]* are on ms1
  • 21:35 jeluf: killed mwsearchd on srv39, removed both the rc3.d link and the cronjob that start mwsearchd
  • 21:30 RobH: search8 and search9 are online, awaiting configuration.
  • 21:15 brion: thumb rendering failures reported... found some runaway convert procs poking at an animated GIF, killed them.
    • rev:42058 will force GIFs over 1 megapixel to render a single frame instead of animations as a quick hackaround...
  • 20:48 domas: thistle serving as s2a server
  • 20:28 RobH: stopping mysql on adler so it can be re-racked with rails.
  • 19:53 RobH: search7 back online, awaiting addition to the search cluster.
  • 19:35 mark: Set up an Exim instance on srv9 for outgoing donation mail, as well as incoming for delivery into IMAP for CiviMail (*spit*).
  • 17:00 RobH: srv21-srv29 decommissioned and unracked.
  • 12:05 domas: put lomaria back in rotation
  • 11:50 domas: Enabled write-behind caching on db15. Restarted.
  • 10:40 domas: restarted replication on db15 and lomaria
  • 10:27 domas: loading dewiki data from SQL dump into thistle
  • 09:09 Tim: restarted logmsgbot
  • 08:27 Tim: folded s2b back into s2
  • 08:06 Tim: db13 in rotation
  • 08:02 domas: copying from db15 to lomaria
  • 07:38 Tim: started replication on db13
  • 04:51 Tim: copying
  • 03:27 Tim: Preparing for copy from db15 to db13
  • 00:00 domas: something wrong with db15 i/o performance. it is behaving way worse than it should.

October 12

  • 23:58 brion: updated CodeReview to add a commit so loadbalancer saves our master position. playing with serverstatus extension on yongle to find out wtf it keeps getting stuck
  • 22:05 brion: db15 sucks hard. putting categories back to db13
  • 22:01 brion: db15 got all laggy with the load. taking back out of general rotation, leaving it on categories/recentchangeslinked
  • 21:58 brion: db15 seems all happy. swapping it in in place of db13, and giving it some general load on s2. we'll have to resync db13 at some point? and toolserver?
  • 19:41 Tim: shutting down db15 for restart with innodb_flush_log_at_trx_commit=2. But db8 seems to be handling the load now so I'm going to bed.
  • 19:20 Tim: depooled db15.
  • 19:09 Tim: split off some wikis into s2b and put db8 on it. To reduce I/O and hopefully stop the lag.
  • 18:51 Tim: db15 still chronically lagged. Offloading all s2 RCL and category queries to db13. (see sketch below)
  • 18:38 Tim: offloading commons RCL queries to db13
  • 18:36 Tim: dewiki r/w with ixia (master) only
  • 18:33 Tim: offloading commons category queries to db13
  • 18:25 Tim: balancing load. Fixed ganglia on various mysql servers.
  • 18:06 Tim: going to r/w on s2. Not s2a yet because db15/db8 can't handle the load.
  • 17:46 Tim: db8->db15 copy finished, deploying
  • 17:33 Tim: installed NRPE on thistle.
  • 16:54 Tim: copied mysqld binaries from db11 to db15 and thistle. Plan for thistle is to use it for s2a.
  • 16:40 Tim: ixia/db8 can't handle the load between them with db13 out, even with s2a diverted. Restored db13 to the pool. Running out of candidates for a copy destination. Need db13 in because it's keeping the site up, can't copy to thistle because it's too small with RAID 10. Plan B: set up virgin server db15. Copying from db8.
  • 16:07 Tim: repooled ixia/db8 r/o
  • 15:53 Tim: removed ixia binlogs 290-349. 270-289 were deleted during the initial response.
  • 14:54 mark: Pooled search6 as part of search cluster 2, by request of rainman
  • 14:37 Tim: deployed r41995 as a live patch to replace buggy temp hack.
  • 14:14 Tim: cleaned up binlogs on db2. Yes the horse has bolted, but we may as well shut the gate.
  • 14:11 Tim: copy now in progress as planned.
  • 13:48 Tim: going to try the resync option. Maybe with s2 it won't take as long as s1. Will try to sync up db8 from ixia with db13 serving read-only load for the duration of the copy.
  • 13:40 Tim: ixia (s2 master) disk full. Classic scenario, binlogs stopped first, writing continued for 10 minutes before replag was reported.
  • 13:00 jeluf: moved wikipedia/m* image directories to ms1
  • 08:00 jeluf: restarted lighttpd on ms1, directory listings are now disabled.
  • 02:55 Tim: attempted to disable directory listing on ms1. Gave up after a while.
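  Config sketch for the 18:51/18:38/18:33 entries (an assumption: that the offloading is done with LBFactory_Multi-style query-group weights in db.php; the group names and weights below are illustrative placeholders, not the production values):
    $groupLoadsBySection = array(
        's2' => array(
            'recentchangeslinked' => array( 'db13' => 1 ),   // send RCL queries to db13 only
            'category'            => array( 'db13' => 1 ),   // hypothetical group name for category queries
        ),
    );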

October 11

  • 7:00 jeluf: moved wikipedia/s* image directories to ms1

October 10

  • 21:30 jeluf: moved wikipedia/[jqtuwxy]* to ms1
  • 19:20 RobH: Bayes online.
  • 19:11 brion: recreated special page update logs in /home/wikipedia/logs, hopefully fixing special page updates
  • 13:05 Tim: reverted live patch and merged properly tested fix r41928 instead.
  • 12:31 Tim: deployed a live patch to fix a regression in MessageCache::loadFromDB() concurrency limiting lock
  • 12:17 domas: killed long running threads
  • ~12:04: s2 down due to slave server overload

October 9

  • 22:52 brion: enabled Collection on de.wikibooks so they can try it out
  • 20:00 jeluf: moved wikipedia/i* images to ms1
  • 17:05 RobH: thistle raid died due to a failed hdd; replaced hdd, reinstalled as raid10.
  • 12:00 domas: switched s3 master to db1, erased a bunch of db.php stuff by accident (don't know how :). restored from db.php~ :-)
  • 09:31 mark: pascal died yet again, revived it. Will move the htcp proxy tonight...

October 8

  • 21:05 brion: yongle still gets stuck from time to time, breaking mobile, apple search, and svn-proxy. i suspect svn-proxy but still can't easily prove it. using separate svn command (in theory) but it's not showing me stuck processes.
  •  ??:?? rob fixed srv37, then later srv133, into the mediawiki-installation node group. he did an audit and didn't see any other problems. i ran a scap to make sure all are now up to date
    • Speculation: possible that rumored ongoing image disappearances have been caused by the image-destruction bug still being in place on srv133 for the last month.
  • 19:02 mark: Upgraded packages on search1 - search6 and searchidx1
  • 18:59 brion: aaron complaining of srv37 not properly updated (doesn't recognize Special:RatingHistory). flaggedrevs.php was out of date there. checking scap infrastructure, stuff seems ok so far...

October 7

  • 21:47 brion: started two dump threads (srv31)
  • 21:16 RobH: installed and configured gmond on all knams squids.
  • 21:00 jeluf: moved wikipedia/g* to ms1
  • 18:55 RobH: fixed private uploads issue for arbcom-en and wikimaniateam.
  • 17:26 RobH: reinstalled and redeployed knsq24 and knsq29
  • 15:00-16:00 robert: switched enwiki to lucene-search 2.1 running on new servers. Test run until tomorrow; if anything goes wrong, reroute search_pool_1 to old searchers on lvs3. Will switch on spell checking when all of the servers are racked. Thanks RobH for tuning config files.
  • 15:54 RobH: srv101 crashed again, running tests.
  • 15:45 RobH: srv146 was powered down for no reason. Powered back up.
  • 15:42 RobH: srv138 locked up, rebooted, back online.
  • 15:32 RobH: srv110 was locked up, rebooted, synced, back online.
  • 15:31 RobH: srv101 back up and synced.
  • 15:22 RobH: rebooted srv56, was locked up, handed off to rainman to finish repair.
  • 15:21 RobH: updated lucene.php and synced.
  • 15:04 RobH: updated memcached to remove srv110 and add in spare srv137.
  • 15:00 RobH: removed all servers from lvs:search_pool_1 and put in search1 and search2 with rainman

October 6

  • 23:55 brion: tweaked bugzilla to point rXXXX at CodeReview instead of ViewVC
  • 14:29 domas: amane lighty was closing connections immediately, worked properly after restart. upgraded to 1.4.20 on the way.
  • 14:36 RobH: setup ganglia on all pmtpa squids.
  • 13:50 mark: The slow page loading on the frontend squids appears to be limited to english main page only, for unknown reasons. Set another article as pybal check URL to prevent pooling/depooling oscillation by PyBal for now.
  • 09:27 mark: yaseo squids are fully in swap, set DNS scenario yaseo-down

October 5

  • 23:14 mark: Frontend squids are not working well at the moment, sometimes serving cached objects with very high delays. I wonder if they are under (socket) memory pressure. Reduced cache_mem on the backend instance on sq25 to free up some memory for testing.
  • 20:35 jeluf: wikipedia/b* moved, too
  • 19:00 jeluf: switched squids to send requests for upload.wikimedia.org/wikipedia/a* to ms1
  • 14:30 jeluf: Moving all wikipedia/a* image directories to ms1

October 4

  • 23:17 mark: Repooled knsq16-30 frontends in LVS. Also found that mint was fighting with fuchsia about being LVS master, due to reboot this afternoon.
  • 14:30 mark: Several servers in J-16 were shutting down, or going down around this time. Reason unknown, possibly auto shutdown because of high temperature, possibly they were turned off by someone locally.
  • 14:03 mark: SARA power failure. Feed B lost power for ~ 6 seconds.
  • 00:26 mark: Depooled srv61
  • 00:07 brion: found srv37 and srv61 have broken json_decode (wtf!)
    • updating packages on srv37. srv61 seems to have internal auth breakage
    • updated packages on srv61 too. su still borked, may need LDAP fix or something?

October 3

  • 21:40 brion: transferring old upload backups from storage2 to storage3. once complete, can restart dumps!
  • 20:01 brion: running updateRestrictions on all wikis (done)
  • 17:51 RobH: srv135 & srv136 reinstalled as ubuntu.
  • 17:34 RobH: srv132 & srv133 reinstalled as ubuntu.
  • 17:13 RobH: srv130 back online.
  • 16:40 RobH: depooled srv131, srv132, srv135, srv136 for reinstall.
  • 00:25 brion: switched codereview-proxy.wikimedia.org to use local SVN command instead of PECL SVN module; it seemed to be getting bogged down with diffs, but hard to really say for sure

October 1

  • 20:02 RobH: srv63 back online.
  • 19:35 RobH: srv61 and srv133 back online.
  • 18:22 RobH: storage3 online and handed off to brion.
  • 17:35 RobH: updated mc-pmtpa.php to put srv61 as spare.
  • 17:32 RobH: srv61 faulty fan replaced, back online.
  • 09:31 Tim: srv104 (cluster18) hit max_rows, finally. Removed it from the write list.
  • 08:36 Tim: fixed ipb_allow_usertalk default on all wikis
  • 23:46 mark: Reinstalled knsq24
  • 22:55 mark: Reenabled switchports of knsq16 - knsq30
  • 20:45 jeluf: fixed resolv.conf on srv131
  • 20:45 jeluf: mounted ms1:/export/upload as /mnt/upload5, started lighttpd on ms1
  • 19:47 brion: enabled revision deletion on test.wikipedia.org for some public testing.
  • 14:25 RobH: Cleaned out the squid cache on knsq16, knsq17, knsq18, knsq19, knsq21, knsq22, knsq23, knsq25, knsq26, knsq27, knsq28, knsq30. DRAC not responsive on knsq20, knsq24, knsq29.
