Server admin log/Archive 20
From Wikitech
Revision as of 15:07, 2 November 2008
November 2
- 15:06 river: added missing /mnt/upload5 mount on several apaches: srv37 srv61 srv76 srv69 srv63 srv118 srv132 srv135 srv133 srv138 srv136
- 14:49 domas: few missing .frm files on db18 were causing trouble, resynced them from db19, resumed replication
- 13:02 river: copying en from storage1 to ms1
- 10:49 domas: replaced XFS with JFS on db18, installed ganglia on db17-db30
- 10:36 river: completed move of commons, now being served from ms1 (except archive/)
November 1
- 22:48 brion: fixed ContributionReporting to force a utf8 connection, now loads names in right charset
- 22:20 brion: fixed $wgNoticeInfrastructure setting; defaults must have changed at some point
- 22:15 domas: installed wikimedia-mysql4 on db21-23, established s1,s2,s3 replication. we now have a full database copy in sdtpa \o/
- 20:53 brion: deploying CentralNotice editing system on meta, woo
- 20:27 brion: scapping to update reporting and centralnotice bits internally
- 19:38 brion: rescapping to make sure 159 is unbroken
- 19:27 brion: svn up'ing on wikitech just for domas
- 19:25 brion: srv159 is out of space
- We need to clean out the damn temp files somehow, eh?
- 19:20 brion: scapping to update ContributionReporting ext
- 12:56 mark: uppreffed traffic from knams to pmtpa via 6908/2828, as existing peering path had slight packet loss
- 11:25 Tim: enabled subpages in the main namespace by default for all Wikisource wikis. This appears to be a de facto standard and is used by all Wikisources with an entry in wgNamespacesWithSubpages.
- 07:55 Tim: disabled ParserDiffTest, obsolete
- 07:06 mark: XO circuit back up:
[vl101-ve5.csw5-pmtpa.wikimedia.org] BGP peer 2610:18:10a::1 <2610:18:10a::1>, session is now up
[vl101-ve5.csw5-pmtpa.wikimedia.org] BGP peer 207.88.246.5 <w005.z207088246.xo.cnc.net>, session is now up
October 31
- 23:11 brion: set up some logs for fundraising banner campaign clicks for later mining
- 17:44 brion: adding support for Tomas skin on wikimediafoundation.org for new fundraiser templates
- 14:24 mark: XO circuit went down:
[vl101-ve5.csw5-pmtpa.wikimedia.org] BGP peer 207.88.246.5 <207.88.246.5>, session is now down because <Port State Down>
[vl101-ve5.csw5-pmtpa.wikimedia.org] BGP peer 2610:18:10a::1 <2610:18:10a::1>, session is now down because <Port State Down>
October 30
- 23:11 Tim: fixed disk space on srv159, db1, srv103
- 19:03 brion: updated triggers for donation reporting database a few minutes ago
- 18:14 RobH: moved ms1 from pmtpa:a4 to sdtpa:a1, it's back online.
- 17:46 RobH: db26 OS installed and online
- 17:28 brion: added a spam filter rule for private-l messages :)
- 04:54 river: testing sun web server on ms1
- 03:56 brion: updating squid conf to send upload /centralnotice to storage1 for testing
- 03:53 brion: tweaked lighttpd config on storage1 for centralnotice static file testing, since amane's configuration is too crappy to support regexes needed to set headers on a directory
- 02:59 brion: poking experimental expires options on amane for static centralnotice tests
- 02:44 brion: brion broke lighttpd.conf briefly
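The centralnotice squid change at 03:56 is, in essence, prefix routing on the upload URL path: requests whose path matches a configured prefix go to one backend, everything else falls through to amane. A minimal sketch of that selection logic in Python (the prefixes and backend names below are illustrative; the real rules live in squid.conf.php):

```python
# Hypothetical prefix-to-backend table, modeled on the moves logged
# around this date. Not the actual squid.conf.php contents.
ROUTES = [
    ("/centralnotice/", "storage1"),   # static centralnotice test files
    ("/wikipedia/en/", "storage1"),    # en images, moved earlier
    ("/wikipedia/commons/", "ms1"),    # commons now served from ms1
]
DEFAULT_BACKEND = "amane"

def pick_backend(path):
    """Return the first backend whose prefix matches the request path."""
    for prefix, backend in ROUTES:
        if path.startswith(prefix):
            return backend
    return DEFAULT_BACKEND
```

First match wins, so more specific prefixes must be listed before broader ones, which is also why the squid.conf section needed refactoring as more directories moved to ms1.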
October 29
- 22:39 brion: enabling $wgCodeReviewENotif experimentally
- 18:35 brion: disabled bitmap fonts in fontconfig on image scalers, seems to help with the "mad helvetica" problem
- 18:02 RobH: db28 & db29 OS installed and online.
- 17:59 brion: fixed some upload directory perms on foundationwiki
- 17:12 RobH: db27 OS installed and online.
- 16:54 RobH: db21 OS installed and online.
- 16:38 RobH: db22, db23, db25, db30 were installed yesterday, forgot to admin log it, sorry ;/
- 14:44 _mary_kate_: copying wikipedia/commons/thumb/4 from storage1 to ms1
October 28
- 20:02 domas: re-enabled db16
- 18:03 mark: Removed blackholes.securitysage.com from lily's spamassassin configuration
- 17:52 domas: db16 fubar'ed by queries that built 100GB temporary tables, leading to jfs hangs, leading to unhappy kernel.
- 15:23 RobH: updated dsh node group ALL, added backup of frontend data for bugzilla and blogs from isidore to tridge.
- 12:33 rainman-sr: experimentally turning on "did you mean.." on search8,9 for enwiki
- 10:44 mark: Reverted yesterday's search changes
October 27
- 23:24 mark: Switched to lucenesearch 2.1 for all wikis
- 23:06 mark: pooled search8 as the only search server in search pool 3
- 22:25 mark: rainman-sr is making me do more ugly things to lucene.php
- 22:22 mark: Pointed search for "all other wikis" hardcoded to search7 in lucene.php
- 22:14 mark: Added zhwiki and plwiki to lucene search 2.1 pool 2
October 26
- 15:43 mark: Set up OpenGear serial console server scs-a1-sdtpa
- 13:37 mark: Set up iBGP between csw1-sdtpa and csw5-pmtpa (IPv4/IPv6)
- 13:36 mark: Prepared csw1-sdtpa for production deployment (general configuration)
- 09:56 domas: updated db18 firmware to 2.1.1 (September 2008)
- 04:31 Tim: fixed the "service_ips" hostgroup in nagios
- 03:03 Tim: hardware reboot of db18
- 02:47 Tim: mysqld on db18 apparently hit a kernel bug. It was reported as a zombie but was still using 200% CPU in top. kswapd was simultaneously using 100% CPU. Did not respond to SIGKILL. The non-zombie parent, mysqld_safe, also did not respond to SIGKILL (wchan=flush_cpu_workqueue). Attempted a reboot with shutdown -r.
- 02:47 brion: tweaked MaxClientsPerChild on yongle to see if that helps with the mysterious hangs i sometimes see where requests seem to get backed up; it's disrupting the CodeReview proxy as well as mobile & Mac Dictionary search
October 25
- 20:46 brion: scapped to r42573
- 08:17 Tim: svn up to 42536 for API overload fix. Re-enabling disabled query modules.
- 05:55 Tim: svn up/scap to 42531 (for properly tested Interwiki.php fix).
- 05:09 Tim: DB overload on many enwiki slave servers. Long running queries attributed to ApiQueryAllpages, ApiQueryBacklinks, ApiQueryCategoryMembers and ApiQueryLogEvents. Disabled those modules and killed related running threads.
- 05:01 Tim: Interwiki links were broken due to a totally broken and untested getInterwikiCached() function. Live patch deployed at this time.
- 04:33 Tim: Fixed svn conflicts in two files. Scap to r42524.
- 04:20 Tim: disabled Drafts extension on test.wikipedia.org. Trevor, please contact me for code review.
- 04:11 Tim: synced php-1.5 to srv35 and ran "make -B" in the serialized directory. Seems to have fixed test. Will scap.
- 01:01 ariel: preemptively up mail quota to 7GB from 1GB for cbass, dmenard
- 00:59 brion: testwiki is borked until we figure out how to get it to load updated message files. tried disabling $wgLocalMessageCache and $wgCheckSerialized to no effect
- 00:51 brion: temporarily blocking scap during testing :) ... running serialized language file updates for test, broken by need to get magic word updates
- 00:44 brion: preparing a svn up...
- 00:37 ariel: up msecoquian's mail quota from 1GB to 6.9GB
October 24
- 23:12 brion: set up ariel (the person) on sanger to do mail administration -- quota fixes etc
- 16:24 TimStarling: reloaded ourusers.sql on all core and ext. mysql servers, adding a nagios user
- 15:39 mark: slacking
- 15:36 TimStarling: added special nagios user to ES instances on clematis
- 14:00 domas: re-enabled db5, added db18 to s3
- 10:45 domas: taking out db5 for copy to db18
- 10:44 domas: fixed ntpd on bart, was pointing to multicast address that doesn't work
- 09:57 Tim: removed decommissioned servers from monitoring: dryas, alrazi, diderot, friedrich, samuel
- 07:50 Tim: added monitoring for toolserver ES clusters 17-19
- 07:40 Tim: regenerated trusted XFF list with extra SAIX proxies
- 05:00 Tim: fixed nagios check script handling of MySQL connection errors
- 01:37 brion: setting $wgLicenseURL for Collection to point at GFDL English text
- 01:01 brion: enabling Drafts on testwiki, but it seems to not be saving there... works on my local test, not sure what the issue is
- 01:03 brion: disabling logentry, still borken?
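The trusted XFF regeneration at 07:40 supports the usual client-IP recovery scheme: walk the X-Forwarded-For chain from the connecting peer backwards, skipping addresses on the trusted-proxy list; the first untrusted hop is treated as the client. A hedged sketch of that rule (the proxy addresses in the test are invented, and this is the general technique, not MediaWiki's exact implementation):

```python
def client_ip(xff_header, remote_addr, trusted):
    """Walk X-Forwarded-For right to left starting from the connecting
    peer; skip trusted proxies; return the first untrusted address."""
    hops = [h.strip() for h in xff_header.split(",") if h.strip()]
    candidate = remote_addr
    for hop in reversed(hops):
        if candidate not in trusted:
            break
        candidate = hop
    return candidate
```

Only proxies on the trusted list are skipped, since anything further left in the header is client-supplied and spoofable.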
October 23
- 22:33 brion: trying re-enabling logentry ext on wikitech, now with cache disable to avoid edittoken for now
- 21:34 brion: updating ipblocks table definition
- 21:25 brion: re-ran svnImport to update path listings for CodeReview
- 20:11 mark: Set up search7 - search9
- 17:05 mark: Pooled search4 as a s1 search server to help with dead search2
- 16:33 brion: updated mw-serve
- 15:38 Tim: On the image scalers, temporarily mounted /a/tmp as /tmp with --bind to stop the disk full problem while we figure out some better solution
- 15:24 Tim: removed temporary files on image scalers again
- 14:54 RobH: Replaced dead disk in amane, rebuilding array.
- 11:04 Tim: Added disk space monitoring for image scalers. Also added apache monitoring which was also missing.
- 10:53 Tim: freed up disk space on image scalers, magick-* temporary files were filling their root partitions
- 10:50 Tim: re-added cluster19 to the default write list. Not sure who took it out or why.
- 10:32 Tim: freed up some space on srv103 (was down to 500MB)
- 10:29 Tim: fixed monitoring for MegaRAID SAS
- 07:10 Tim: Set up monitoring of RAID status for all Ubuntu DB servers using the wikimedia-raid-utils package that I just wrote. It doesn't do anything on the MegaRAID servers yet, but the Adaptec ones should work.
- 05:05 Tim: running CodeReview svnImport.php
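The RAID monitoring set up at 07:10 boils down to a Nagios-style check: query the controller tool and report OK only if every array is healthy. The wikimedia-raid-utils internals aren't shown in the log; as an illustration only, a check of this shape (the output format parsed here is invented; real Adaptec and MegaRAID tools each have their own):

```python
def raid_status_ok(tool_output):
    """Return True only if every array line reports Optimal.
    The 'Array N: State' format is a stand-in for illustration."""
    states = []
    for line in tool_output.splitlines():
        if line.lower().startswith("array") and ":" in line:
            states.append(line.split(":", 1)[1].strip())
    return bool(states) and all(s == "Optimal" for s in states)
```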
October 22
- 18:26 brion: enabling ODT output for collection
- 18:17 brion: updating collection and codereview extensions
- 18:13 Brion: updated mw-serve code and configured to send error emails per jojo's request
- 17:15 Brion: Changed bugzilla's mail delivery from local sendmail (SSMTP) to direct SMTP, per Mark's recommendation
October 21
- 19:29 RobH: Bayes upgraded from 2GB to 10GB.
- 13:49 Tim: Did a demonstration hack of nagios from CSRF to arbitrary shell. Disabled cmd.cgi.
- 04:13 Tim: Brought srv43-47 up as image scalers with mem limit 6 x 200MB = 1200MB (2GB physical)
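The scaler sizing arithmetic that recurs in these entries (6 x 200MB = 1200MB on 2GB boxes here, 15 x 200MB = 3000MB on srv159 later) is just concurrency times per-child memory limit, kept under physical RAM. A small sketch of that calculation (the 512MB OS headroom figure is an assumption of mine, not from the log):

```python
def max_children(physical_mb, per_child_mb, headroom_mb=512):
    """How many scaler children fit in RAM, leaving headroom for the
    OS and page cache. headroom_mb=512 is an illustrative default."""
    return max(0, (physical_mb - headroom_mb) // per_child_mb)
```

Running over this bound is what produced the "swapped to death" incidents on the earlier scaler configuration.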
October 20
- 18:11 RobH: srv118 rebooted, back online.
- 17:25 RobH: srv79 was in kernel panic, rebooted.
- 05:10 Tim: increased concurrency on srv159 to 15, for mem limit 15 x 200MB = 3000MB
- 02:40 Tim: installed NRPE on khaldun and db20
- 02:20 Tim: moved disk space checks on the ext stores from the "apaches" service group to the relevant ext store service group
- 01:53 Tim: installed NRPE on the new ext stores
- 01:45 Tim: Updated /etc/ssh/ssh_known_hosts on bart (copied from zwinger).
- 00:30-01:30 Tim: Listed down servers on DC tasks. Removed broken servers from memcached rotation. Restarted apache on srv99, srv109, srv123. Purged master binlogs on srv102.
October 18
- 21:45 RobH's mighty index finger brought amane and the site back up.
- 21:00 river: Ran 'nc -l -p 623'; amane's kernel panicked. Rob was called.
- 20:55 mark, river: diagnosed the NFS communication problems to be caused by NIC hardware packet interception of port 623 packets... amane wasn't receiving NFS replies from ms1.
- 19:40 mark: Upload got unhappy, ms1 NFS mount on amane was unreachable and stalling things
- 13:40 Tim: down again, single process allocating all memory
- 07:35 Tim: took it down again, while recording /proc/vmstat and /proc/stat
- 06:27 Tim: restarted srv160
- 05:45 Tim: took srv160 into the purple for a much more convincing overload, and different oprofile results
- 03:40 Tim: used oprofile to determine what part of the kernel is responsible for the system CPU spike. Looks like a spinlock in dnotify.
- 03:12 Tim: simulated a memory-intensive request rate spike to srv160. Large system CPU response spike, but it didn't go down completely. Will try a bigger one.
October 17
- 21:10 brion: enabled Commons foreign image repo on Wikitech
- 18:45 brion: created Wikimedia-Boston list for SJ
- 16:55 brion: adding nomcomwiki to special.dblist so it shows up right in sitematrix
- 16:45 brion: deleted some junk comments from bugzilla
- 16:31 brion: changed autoconfirm settings for 'fishbowl' wikis -- 0 age for autoconfirm, plus set upload & move for all users just in case autoconfirm doesn't kick in right
- 14:22 RobH: srv131 back up.
- 09:03 Tim: copying srv129 and srv139 ES data directories to storage2:/export/backup
- 02:49 Tim: excessive lag on db16, killed long-running queries and temporarily depooled. CUPS odyssey continues.
- 01:59 Tim: removing cups on all servers where it is running
- 00:00 RobH: restarted srv43-47
October 16
- 20:42 brion: added 3 more dump threads on srv31... we need to find some more batch servers to work with for the time being until new dump system is in place :)
- 20:20 RobH: pulled samuel from the rack, decommissioned, RIP samuel.
- 19:35 RobH: migrated rack B4 from asw3 to asw-b4-pmtpa.
- 18:40 RobH: rebooted scs-ext, oops!
- 18:26 RobH: srv61 reinstalled and redeployed.
- 18:24 RobH: Adler re-racked with rails, booted up to maintenance mode prompt.
- 17:34 mark: 208.80.152.0/25 NTP restriction is actually also not broad enough - changed it to /22 in ntpd.conf on zwinger
- 17:02 brion: thumbnails on commons are insanely slow and/or broken
- 14:44 Tim: added a more comprehensive redirection list to squid.conf.php for storage1 images
- 14:04 Tim: redirected images for /wikipedia/en/ to storage1, apparently they were moved a while ago. Refactored the relevant squid.conf section.
- 13:38 Tim: disabled directory index on amane. Was generating massive amounts of NFS traffic by generating a directory index for some timeline directories.
- 12:51 Tim: increased memory limit on srv159 to 8x200MB. Still well under physical.
- 11:38 Tim: cleaned up temporary files on srv159, had filled its disk
- 11:25 Tim: synced upload scripts (including to ms1)
- 10:06 Tim: removed sq50 from the squid node lists and uninstalled squid on it
- 09:22 - 09:52 mark, Tim, JeLuF: initial attempts to bring the squids back up failed due to incorrect permissions on the recreated swap logs. Most were back up by around 09:32, except newer knams and yaseo squids which were missing from the squids_global node group. The node group was updated and the remainder of the squids brought up around 09:52.
- 09:19 JeLuF: deployed squid.conf with an error in it. All squid instances exited.
- 08:26 Tim: Restarted ntpd on search7, was broken
- 06:42 Tim: ntp.conf on zwinger had the wrong netmask for the 208.x net, it was /26 instead of /25. So a lot of squids were out of it, and some had a clock skew of 10 minutes (as visible on ganglia). Fixed ntp.conf, not stepped yet. Will affect squid logs.
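The ntp.conf bug at 06:42 is easy to reproduce: 208.80.152.0/26 covers only hosts .0 through .63, while the intended /25 covers .0 through .127, so any squid in the upper half of the /25 was refused NTP and drifted. With Python's ipaddress module (the .100 host below is a hypothetical squid, not one named in the log):

```python
import ipaddress

deployed = ipaddress.ip_network("208.80.152.0/26")  # wrong netmask
intended = ipaddress.ip_network("208.80.152.0/25")  # correct netmask

squid = ipaddress.ip_address("208.80.152.100")  # hypothetical squid IP
# Inside the intended /25 but outside the deployed /26, so ntpd's
# restrict line silently excluded it and its clock skewed.
```

(Per the 17:34 entry on October 16, even the /25 later proved too narrow and was widened to a /22.)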
October 15
- 19:49 brion: added '<span onmouseover="_tipon' to spam regex; some kind of weird edit submissions coming with this stuff like [1]
- 12:00 Tim: trying to bring srv159 up as an image scaler. Limiting memory usage to 8x100 = 800MB with MediaWiki.
- 11:21 srv127 died just the same. Mark suggests using one with DRAC next.
- 10:20 Tim: all image scalers (srv43 and srv100) swapped to death again. Preparing srv127 as an image scaler with swap off.
- 08:43 Tim: reduced depool-threshold for the scalers to 0.1 since srv100 is quite capable of handling the load by itself while we're waiting for the other servers to come back up.
- 07:45 Tim: half the scaling cluster went down again, ganglia shows high system CPU. Installing wikimedia-task-scaler on srv100.
- 02:30 Tim: moved image scalers into their own ganglia cluster
- 02:17 Tim: apache on srv43-47 hadn't been restarted and so was still running without -DSCALER. This partially explains the swapping. Restarted them. Took srv38-39 back out of the image scaler pool, they have different rsvg and ffmpeg binary paths and break without a MediaWiki reconfiguration.
- 02:13 tomasz: upgraded srv9 to ubuntu 8.04
- 02:00 tomasz: upgraded srv9 to ubuntu 7.10
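The depool-threshold change at 08:43 relies on PyBal-style semantics: a minimum fraction of the pool is always kept serving, even if monitors report more hosts down, so lowering the threshold to 0.1 allowed LVS to keep running on srv100 alone. A sketch of my reading of that rule (not the PyBal source):

```python
import math

def pooled_count(total, healthy, depool_threshold):
    """Keep at least ceil(total * threshold) servers pooled; if fewer
    are healthy, unhealthy ones stay pooled to make up the minimum."""
    minimum = math.ceil(total * depool_threshold)
    return max(healthy, minimum)
```

With a 5-server pool and threshold 0.5, two dead servers would have stayed pooled and kept failing requests; at 0.1, the single healthy scaler carries everything.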
October 14
- 19:16 brion: restarted lighty on storage1 again -- it was back in 'fastcgi overloaded' mode, possibly due to the previously broken backend, possibly not
- 19:11 mark: Pooled old scaling servers srv38, srv39
- 18:50 brion: at least four of new image scalers are down -- can't reach by SSH. thumbnailing is borked
- 16:41 brion: fixed image scaling for now -- storage1 fastcgi backends were overloaded, so it was rejecting things. did some killall -9s to shut them all down and restarted lighty. ok so far
- 16:20 brion: image scaling is broken in some way, investigating
- 02:54 Tim: fixed srv43-47, this is now the image scaling cluster
- 00:10 Tim: oops, forgot to add VIPs, switched back.
- 00:05 Tim: switched image scaling LVS to srv43-47
October 13
- 23:45 Tim: prepping srv43-47 as image scaling servers
- 21:45 jeluf: moved more image directories to ms1. Now, upload/wikipedia/[abghijmnopqrstuwxy]* are on ms1
- 21:35 jeluf: killed mwsearchd on srv39, removed both the rc3.d link and the cronjob that start mwsearchd
- 21:30 RobH: search8 and search9 are online, awaiting configuration.
- 21:15 brion: thumb rendering failures reported... found some runaway convert procs poking at an animated GIF, killed them.
- rev:42058 will force GIFs over 1 megapixel to render a single frame instead of animations as a quick hackaround...
- 20:48 domas: thistle serving as s2a server
- 20:28 RobH: stopping mysql on adler so it can be re-racked with rails.
- 19:53 RobH: search7 back online, awaiting addition to the search cluster.
- 19:35 mark: Set up an Exim instance on srv9 for outgoing donation mail, as well as incoming for delivery into IMAP for CiviMail (*spit*).
- 17:00 RobH: srv21-srv29 decommissioned and unracked.
- 12:05 domas: put lomaria back in rotation
- 11:50 domas: Enabled write-behind caching on db15. Restarted.
- 10:40 domas: restarted replication on db15 and lomaria
- 10:27 domas: loading dewiki data from SQL dump into thistle
- 09:09 Tim: restarted logmsgbot
- 08:27 Tim: folded s2b back into s2
- 08:06 Tim: db13 in rotation
- 08:02 domas: copying from db15 to lomaria
- 07:38 Tim: started replication on db13
- 04:51 Tim: copying
- 03:27 Tim: Preparing for copy from db15 to db13
- 00:00 domas: something wrong with db15 i/o performance. it is behaving way worse than it should.
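The r42058 hackaround mentioned in the 21:15 entry on October 13 gates animated rendering on total image area: GIFs over 1 megapixel get a single frame so runaway `convert` processes can't recur. The 1-megapixel cutoff is from the log; the function shape below is my own hedged sketch of such a check:

```python
def render_animated(width, height, limit_px=1_000_000):
    """Render all frames only for GIFs at or under the area limit;
    larger images fall back to a single frame."""
    return width * height <= limit_px
```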
October 12
- 23:58 brion: updated CodeReview to add a commit so loadbalancer saves our master position. playing with serverstatus extension on yongle to find out wtf it keeps getting stuck
- 22:05 brion: db15 sucks hard. putting categories back to db13
- 22:01 brion: db15 got all laggy with the load. taking back out of general rotation, leaving it on categories/recentchangeslinked
- 21:58 brion: db15 seems all happy. swapping it in in place of db13, and giving it some general load on s2. we'll have to resync db13 at some point? and toolserver?
- 19:41 Tim: shutting down db15 for restart with innodb_flush_log_at_trx_commit=2. But db8 seems to be handling the load now so I'm going to bed.
- 19:20 Tim: depooled db15.
- 19:09 Tim: split off some wikis into s2b and put db8 on it. To reduce I/O and hopefully stop the lag.
- 18:51 Tim: db15 still chronically lagged. Offloading all s2 RCL and category queries to db13.
- 18:38 Tim: offloading commons RCL queries to db13
- 18:36 Tim: dewiki r/w with ixia (master) only
- 18:33 Tim: offloading commons category queries to db13
- 18:25 Tim: balancing load. Fixed ganglia on various mysql servers.
- 18:06 Tim: going to r/w on s2. Not s2a yet because db15/db8 can't handle the load.
- 17:46 Tim: db8->db15 copy finished, deploying
- 17:33 Tim: installed NRPE on thistle.
- 16:54 Tim: copied mysqld binaries from db11 to db15 and thistle. Plan for thistle is to use it for s2a.
- 16:40 Tim: ixia/db8 can't handle the load between them with db13 out, even with s2a diverted. Restored db13 to the pool. Running out of candidates for a copy destination. Need db13 in because it's keeping the site up, can't copy to thistle because it's too small with RAID 10. Plan B: set up virgin server db15. Copying from db8.
- 16:07 Tim: repooled ixia/db8 r/o
- 15:53 Tim: removed ixia binlogs 290-349. 270-289 were deleted during the initial response.
- 14:54 mark: Pooled search6 as part of search cluster 2, by request of rainman
- 14:37 Tim: deployed r41995 as a live patch to replace buggy temp hack.
- 14:14 Tim: cleaned up binlogs on db2. Yes the horse has bolted, but we may as well shut the gate.
- 14:11 Tim: copy now in progress as planned.
- 13:48 Tim: going to try the resync option. Maybe with s2 it won't take as long as s1. Will try to sync up db8 from ixia with db13 serving read-only load for the duration of the copy.
- 13:40 Tim: ixia (s2 master) disk full. Classic scenario, binlogs stopped first, writing continued for 10 minutes before replag was reported.
- 13:00 jeluf: moved wikipedia/m* image directories to ms1
- 08:00 jeluf: restarted lighttpd on ms1, directory listings are now disabled.
- 02:55 Tim: attempted to disable directory listing on ms1. Gave up after a while.
October 11
- 7:00 jeluf: moved wikipedia/s* image directories to ms1
October 10
- 21:30 jeluf: moved wikipedia/[jqtuwxy]* to ms1
- 19:20 RobH: Bayes online.
- 19:11 brion: recreated special page update logs in /home/wikipedia/logs, hopefully fixing special page updates
- 13:05 Tim: reverted live patch and merged properly tested fix r41928 instead.
- 12:31 Tim: deployed a live patch to fix a regression in MessageCache::loadFromDB() concurrency limiting lock
- 12:17 domas: killed long running threads
- ~12:04: s2 down due to slave server overload
October 9
- 22:52 brion: enabled Collection on de.wikibooks so they can try it out
- 20:00 jeluf: moved wikipedia/i* images to ms1
- 17:05 RobH: thistle raid died due to hdd failed, replaced hdd, reinstalled as raid10.
- 12:00 domas: switched s3 master to db1, did erase a bunch of db.php stuff by accident (don't know how :). restored from db.php~ :-)
- 09:31 mark: pascal died yet again, revived it. Will move the htcp proxy tonight...
October 8
- 21:05 brion: yongle still gets stuck from time to time, breaking mobile, apple search, and svn-proxy. i suspect svn-proxy but can't easily prove it still. using separate svn command (in theory) but it's not showing me stuck processes.
- ??:?? rob fixed srv37, then later put srv133, into the mediawiki-installation node group. he did an audit and didn't see any other problems. i ran a scap to make sure all are now up to date
- Speculation: possible that rumored ongoing image disappearances have been caused by the image-destruction bug still being in place on srv133 for the last month.
- 19:02 mark: Upgraded packages on search1 - search6 and searchidx1
- 18:59 brion: aaron complaining of srv37 not properly updated (doesn't recognize Special:RatingHistory). flaggedrevs.php was out of date there. checking scap infrastructure, stuff seems ok so far...
October 7
- 21:47 brion: started two dump threads (srv31)
- 21:16 RobH: installed and configured gmond on all knams squids.
- 21:00 jeluf: moved wikipedia/g* to ms1
- 18:55 RobH: fixed private uploads issue for arbcom-en and wikimaniateam.
- 17:26 RobH: reinstalled and redeployed knsq24 and knsq29
- 15:00-16:00 robert: switched enwiki to lucene-search 2.1 running on new servers. Test run till tomorrow; if anything goes wrong, reroute search_pool_1 to old searchers on lvs3. Will switch on spell checking when all of the servers are racked. Thanks RobH for tuning config files.
- 15:54 RobH: srv101 crashed again, running tests.
- 15:45 RobH: srv146 was powered down for no reason. Powered back up.
- 15:42 RobH: srv138 locked up, rebooted, back online.
- 15:32 RobH: srv110 was locked up, rebooted, synced, back online.
- 15:31 RobH: srv101 back up and synced.
- 15:22 RobH: rebooted srv56, was locked up, handed off to rainman to finish repair.
- 15:21 RobH: updated lucene.php and synced.
- 15:04 RobH: updated memcached to remove srv110 and add in spare srv137.
- 15:00 RobH: removed all servers from lvs:search_pool_1 and put in search1 and search2 with rainman
October 6
- 23:55 brion: tweaked bugzilla to point rXXXX at CodeReview instead of ViewVC
- 14:29 domas: amane lighty was closing connections immediately, worked properly after restart. upgraded to 1.4.20 on the way.
- 14:36 RobH: setup ganglia on all pmtpa squids.
- 13:50 mark: The slow page loading on the frontend squids appears to be limited to english main page only, for unknown reasons. Set another article as pybal check URL to prevent pooling/depooling oscillation by PyBal for now.
- 09:27 mark: yaseo squids are fully in swap, set DNS scenario yaseo-down
October 5
- 23:14 mark: Frontend squids are not working well at the moment, sometimes serving cached objects with very high delays. I wonder if they are under (socket) memory pressure. Reduced cache_mem on the backend instance on sq25 to free up some memory for testing.
- 20:35 jeluf: wikipedia/b* moved, too
- 19:00 jeluf: switched squids to send requests for upload.wikimedia.org/wikipedia/a* to ms1
- 14:30 jeluf: Moving all wikipedia/a* image directories to ms1
October 4
- 23:17 mark: Repooled knsq16-30 frontends in LVS. Also found that mint was fighting with fuchsia about being LVS master, due to reboot this afternoon.
- 14:30 mark: Several servers in rack J-16 shut down or went down around this time. Reason unknown; possibly auto shutdown because of high temperature, possibly they were turned off by someone locally.
- 14:03 mark: SARA power failure. Feed B lost power for ~ 6 seconds.
- 00:26 mark: Depooled srv61
- 00:07 brion: found srv37 and srv61 have broken json_decode (wtf!)
- updating packages on srv37. srv61 seems to have internal auth breakage
- updated packages on srv61 too. su still borked, may need LDAP fix or something?
October 3
- 21:40 brion: transferring old upload backups from storage2 to storage3. once complete, can restart dumps!
- 20:01 brion: running updateRestrictions on all wikis (done)
- 17:51 RobH: srv135 & srv136 reinstalled as ubuntu.
- 17:34 RobH: srv132 & srv133 reinstalled as ubuntu.
- 17:13 RobH: srv130 back online.
- 16:40 RobH: depooled srv131, srv132, srv135, srv136 for reinstall.
- 00:25 brion: switched codereview-proxy.wikimedia.org to use local SVN command instead of PECL SVN module; it seemed to be getting bogged down with diffs, but hard to really say for sure
October 1
- 20:02 RobH: srv63 back online.
- 19:35 RobH: srv61 and srv133 back online.
- 18:22 RobH: storage3 online and handed off to brion.
- 17:35 RobH: updated mc-pmtpa.php to put srv61 as spare.
- 17:32 RobH: srv61 faulty fan replaced, back online.
- 09:31 Tim: srv104 (cluster18) hit max_rows, finally. Removed it from the write list.
- 08:36 Tim: fixed ipb_allow_usertalk default on all wikis
- 23:46 mark: Reinstalled knsq24
- 22:55 mark: Reenabled switchports of knsq16 - knsq30
- 20:45 jeluf: fixed resolv.conf on srv131
- 20:45 jeluf: mounted ms1:/export/upload as /mnt/upload5, started lighttpd on ms1
- 19:47 brion: enabled revision deletion on test.wikipedia.org for some public testing.
- 14:25 RobH: Cleaned out the squid cache on knsq16, knsq17, knsq18, knsq19, knsq21, knsq22, knsq23, knsq25, knsq26, knsq27, knsq28, knsq30. DRAC not responsive on knsq20, knsq24, knsq29.
Archives
- Server admin log/Archive 1 (2004 Jun - 2004 Sep)
- Server admin log/Archive 2 (2004 Oct - 2004 Nov)
- Server admin log/Archive 3 (2004 Dec - 2005 Mar)
- Server admin log/Archive 4 (2005 Apr - 2005 Jul)
- Server admin log/Archive 5 (2005 Aug - 2005 Oct)
- Server admin log/Archive 6 (2005 Nov - 2006 Feb)
- Server admin log/Archive 7 (2006 Mar - 2006 Jun)
- Server admin log/Archive 8 (2006 Jul - 2006 Sep)
- Server admin log/Archive 9 (2006 Oct - 2007 Jan)
- Server admin log/Archive 10 (2007 Feb - 2007 Jun)
- Server admin log/Archive 11 (2007 Jul - 2007 Dec)
- Server admin log/Archive 12 (2008 Jan - 2008 Jul)
- Server admin log/2008-08
- Server admin log/2008-09