Server admin log/Archive 20

From Wikitech
Revision as of 03:14, 12 November 2008 by River

November 13

  • 01:20 Tim: an error in the cron job on hume caused the r43398 bug to persist until this time, delivering incorrect language text in some site notices.
  • 01:08 Tim: Fixed those 50 servers with a couple of sed commands. Many of them were attempting to send data to larousse and zwinger. Tested srv125.
  • 00:56 Tim: srv125 was spewing PHP fatal errors without reporting them to the syslog on db20. Restarted it. A quick check (ddsh -cM -g apaches -- 'grep -q @syslog /etc/syslog.conf || echo help') suggests that there are 50 apache servers in the same situation.
  • 00:27 Tim: updated ExtensionDistributor configuration to account for amane -> ms1 storage move. (bug 16308)
  • 00:13 Tim: some language issues caused by r43398, reverted at 23:50 and resynced in fixed form at 00:12.
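
The check-and-fix from the 00:56 and 01:08 entries can be sketched against a scratch copy of a syslog.conf. The hostnames come from the log; the exact sed expressions run across the fleet (via ddsh) were not recorded, so this is an assumed reconstruction.

```shell
# Scratch reproduction of the syslog misconfiguration fix (assumed file
# contents; the real run used ddsh across the "apaches" dsh group).
cat > syslog.conf.sample <<'EOF'
*.notice @larousse
EOF

# Detection, as in the 00:56 entry: no @syslog line means broken forwarding.
grep -q @syslog syslog.conf.sample || echo "needs fixing"

# Fix: repoint the stale log hosts at the current syslog aggregator.
sed -i -e 's/@larousse/@syslog/; s/@zwinger/@syslog/' syslog.conf.sample
```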

November 12

  • 03:14 river: didn't reboot ms1 as its lom is unreachable
  • 23:47 Tim: restored FlaggedRevs stats job as per Batch jobs; its removal was not documented.
  • 23:35 Tim: r43398 worked just fine, memory usage dropped from ~4GB to 90MB. Adding rebuildTemplates.php to my crontab on hume, removing it permanently from Brion's on zwinger.
  • 23:28 Tim: updated CentralNotice templates on hume (which has enough memory to do it, unlike zwinger)
  • 22:11 Tim: deleted some binlogs on db1. Remaining disk space is still only 48 GB with negligible InnoDB free space.
  • 16:20 RobH: search2 still down, drives will not detect reliably. Ticket with sun reopened.
  • 15:56 RobH: replaced backplane on search2, reinstalling.
  • 15:13 RobH: srv137 back online. apache and memcached back up.
  • 14:49 RobH: srv100 back online.
  • 10:44 river: removed centralnotice php from brion's crontab as it was breaking zwinger
    • Core dump suggests the memory usage may be dominated by the localisation cache. wfMsgExt() loads the localisation for the requested language, and all languages are requested. -- Tim 12:07, 11 November 2008 (UTC)
  • 01:19 brion: swapped Commons to use $wgNoticeProject 'wikimedia' rather than having separate 'Commons needs you' notices
  • 00:57 brion: swapped in fundraiser to all projects
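
Binlog cleanup like the 22:11 entry above is done with PURGE MASTER LOGS. The file name below is hypothetical; on a live master you list the files first and keep anything a slave still needs (check SHOW SLAVE STATUS on each replica).

```sql
-- Hypothetical binlog name: list the files, then purge up to (not
-- including) one that every slave has already read past.
SHOW MASTER LOGS;
PURGE MASTER LOGS TO 'db1-bin.000300';
```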

November 10

  • 19:18 mark: Shutdown AMS-IX route server 1 session as it's been flapping for hours

November 9

  • 16:11 river: removed nfsfind cronjob on ms1

November 7

  • 22:52 brion_: tossing 2008_meter_2b notice into partial rotation on enwiki -- has reduced collapsed version
  • 22:49 brion_: adding "_collapsed" to banner source tracking for collapsed view
  • 22:27 brion: scapping updates to ContributionReporting and CentralNotice
  • 01:43 Tim: experimentally reading the civicrm database into db10 with --master-data=1
  • 01:19 brion: db9 temporarily (hopefully) messed up. tim's fiddling with it to put it back
  • 01:05 Tim: my.cnf on db10 had an error in it, replicate-wild-do-tables instead of replicate-wild-do-table. Fixed it. The OTRS snapshot is now hopelessly out of date anyway, so I might wipe the data directory and start again. The idea is to set it up to replicate civicrm first. It's 100% InnoDB so should be easy to copy.
  • 00:09 river: upgraded ms2 to solaris 10 update 6
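
The my.cnf mistake in the 01:05 entry comes down to one option name. A sketch, with the table pattern assumed:

```ini
# my.cnf sketch -- "replicate-wild-do-tables" (plural) is not a valid
# option name; the real one is singular. The pattern is an assumption.
[mysqld]
replicate-wild-do-table = civicrm.%
```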

November 6

  • 21:03 Tim: switched GIFs to use Bitmap_ClientOnly (client-side scaling)
  • 17:23 brion: restarting apache on srv47, seems mysteriously stuck
  • 17:15 brion: setting $wgMaxAnimatedGifArea to 1 to prevent animated thumbnailing of GIFs for now, see if that helps
  • 17:10 brion: river complaining of image scaler issues -- load spikes, depooling?
  • 02:35 mark: disabled BGP, now using lvs2 only
  • 02:25 mark: restarting lvs2 with new kernel
  • 01:52 due to switch issues, load balancing to lvs2/lvs4 stopped working. Mark restarted the BGP session which fixed it temporarily.
  • 01:42 Tim: restarting squids
  • 01:42 mark: Setup lvs4 as temp LVS support for lvs2, balancing the load
  • 01:07 brion: updated ContributionReporting to add paging links to ContributionHistory (might be a little funky w/ caching, we'll work it out :)
  • 00:45 Tim: progressively clearing /a on the remaining image scalers
  • 00:37 Tim: wiping /a on srv44
  • ~00:30 lvs2 went into overload and started losing packets. Upload squid slowly went down over the next half hour.
  • 00:00 brion: scapping for update to ContributionReporting
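
The two GIF changes above ($wgMaxAnimatedGifArea and Bitmap_ClientOnly) correspond to settings along these lines. The handler mapping is an assumption about how the switch was wired, not the exact production diff:

```php
// Sketch: with the area threshold at 1, every animated GIF renders as a
// single frame; the client-only handler leaves scaling to the browser.
$wgMaxAnimatedGifArea = 1;
$wgMediaHandlers['image/gif'] = 'BitmapHandler_ClientOnly'; // assumed wiring
```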

November 5

  • 23:38 brion: set yongle to restart apache every hour since it still seems to bork up and get stuck sometimes
  • 22:01 RobH: srv100 rebooted, was down.
  • 18:28 mark: tech team is procrastinating
  • 18:16 atglenn: added dhelps to office@wikimedia.org alias, redirected office@wikipedia.org to him also
  • 18:14 brion: disabling centralnotice on private wikis, we don't need to be told to donate to ourselves ;)
  • 18:03 brion: poking sitenotices off wikibooks, on *.wikipedia
  • 18:03 brion: set up ariel on mchenry for mail admin
  • 05:38 brion_: opera users may rejoice ;)
  • 05:38 brion_: tweaked storage1 lighttpd config so centralnotice.js is served with utf-8 charset
  • 05:17 brion_: for reference -- load spikes are page rendering on enwiki and dewiki mostly :)
  • 05:16 brion_: bumping enwiki notice to 100%
  • 05:06 Tim: killed various mysqld_safe processes which were using 100% CPU on ES servers
  • 04:50 brion_: fixed morebots -- bots now allowed to edit again at wikitech
  • 04:50 brion_: enabling enwiki notice at about 10% sampling
  • 03:27 brion_: squids are... i think.... looking better :D
  • ... brion: cleaned up movepage attack, restricted editing here for convenience
  • 02:47 brion_: seems happier after restart of front-end squids
  • 02:43 brion_: tim's doing hard restarts of more squids, we're kinda offline briefly
  • 02:34 brion_: disabling centralnotices on remaining sites just for good measure while we debug
  • 02:29 brion_: current status: the squids which borked are still kind of borked, but perhaps slightly better. mark is examining squid memory reports
  • 02:14 brion: tim's attempting to restart borked squids
  • 02:01 brion: disabling enwiki centralnotice while investigating hits dropoff
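
The storage1 lighttpd tweak for Opera (05:38 entry above) boils down to serving the banner JS with an explicit charset. A sketch with an assumed URL path:

```
# lighttpd.conf sketch (path assumed); without an explicit charset Opera
# did not handle the UTF-8 banner text correctly
$HTTP["url"] =~ "^/centralnotice/" {
    mimetype.assign = ( ".js" => "text/javascript; charset=utf-8" )
}
```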

November 4

  • 21:36 Tim: added nagios monitoring of HTTP on image backends
  • 21:14 Tim: installed NRPE stuff on db19
  • 19:37 Tim: killed the broken NFS mount on db21:/mnt with umount -l. The processes that are waiting for it will probably hang until system restart
  • 18:33 brion_: enabling ja-wikipedia notice for testing :D
  • 18:32 Tim: installed nagios stuff on db21,db22,db23
  • 18:27 Tim: srv104 done, cluster18 re-added to the write list
  • 18:15 Tim: installed NRPE on srv159,srv171,srv183
  • 17:25 domas: bounced db16 after jfs deadlock
  • 17:24 brion: settin' centralnotice on wikibooks to test, should show up in a few minutes
  • 16:00 Tim: fixing max_rows on srv104
  • 15:41 Tim: switching cluster18 master from srv104 to srv105
  • 01:33 Tim: fixing max_rows on srv105 and srv106
  • 01:28 Tim: removed cluster17 from the write list, is full.
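
The max_rows fixes on srv104-106 above are routine MyISAM external-store maintenance. A sketch, with the table name and values assumed:

```sql
-- Assumed table name and sizes: raise the row limit before the MyISAM
-- default data-pointer size caps the table.
ALTER TABLE blobs MAX_ROWS = 10000000 AVG_ROW_LENGTH = 1024;
```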

November 3

  • 23:28 Tim: installed xdiff and gmp on hume. Used a source install of libxdiff since it's not packaged, and a pecl install for the PECL module. Used the stock libgmp, and a source install from the Debian sources for the PHP GMP module.
  • 22:05 brion: enabled extra file upload types for foundationwiki, since it's restricted-write-access
  • 21:42 Tim: initialising srv159/171/183 as cluster20.
  • 21:24 Tim: srv159 needs to be an ext store, and so will be moved from the disk-intensive image scaler role back to an ordinary apache.
  • 20:46 brion: Special:ContributionTracking form submission intermediary live on foundationwiki
  • 20:33 brion: scapping for ContributionTracking extension
  • 19:59 brion: enabled mp3 and aiff uploads for private wikis so jay can upload some radio PSAs for fundraiser
  • 19:46 brion: poking $wgSquidMaxage from 31 days to 1 hour on wikimediafoundation.org, since templates and funkypage URLs may do funky things and not get purged (extra parameters)
  • 19:32 brion: note there's no notice up yet ;)
  • 19:31 brion: enabling centralnotice loader on all wikis
  • 11:00 domas: mount -o remount,nobarrier /a on db15, observed 20x more performance. I am an idiot. :)
  • 02:36 brion-away: got a test centralnotice notice running on test.wikipedia.org. rock on
  • 02:18 brion: set up every-10-minute cronjob on zwinger to regen the centralnotice template JS files
  • 02:10 brion: centralnotice .js file loader up on test and meta for poking at
  • 01:12 mark: level 3 blackholing of traffic disappeared, brought BGP sessions back up
  • 00:59 mark: shutdown BGP session to AS 30217, for blackholing of traffic behind it (L3?)
  • 00:58 brion: network problems at pmtpa
  • 00:44 brion: for fun, did some load-time optimization on wikitech. trimmed out unneeded user/site .js, consolidated several .js files, and enabled mod_deflate for .css/.js. ssl setup time still sucks, and it's still a 1.7GHz Celeron. :)
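
The wikitech load-time work in the 00:44 entry enables mod_deflate for stylesheets and scripts; in Apache terms that is roughly:

```
# Apache sketch (assumes mod_deflate is loaded): compress the .css/.js
# responses mentioned in the entry above
AddOutputFilterByType DEFLATE text/css application/x-javascript
```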

November 2

  • 23:43 brion: added bot flag to domas's log bot so it doesn't get hit by the URL captcha
  • 23:29 domas: db19 jfs deadlocked: http://p.defau.lt/?hC8C7MTk9BdTKBEHFgcsqA
  • 23:28 brion: scapping for CentralNotice tweak update
  • 23:11 brion: setting up ContactFormFundraiser on wikimediafoundation.org for fundraiser templates
  • 22:52 brion: scapping for ContactPageFundraiser setup
  • 22:41 brion: poked spamregex update
  • 22:14 brion: added 403 block in checkers.php for 'speichern' GET parameter -- bug in a common dewiki user script allowing CSRF-type vandalism
  • 17:13 Tim: Unmounted /tmp, cleaned up /tmp. Deleting /a/tmp on all image scalers.
  • 16:48 Tim: set ImageMagick temporary directory to /a/magick-tmp. Will unbind the /tmp -> /a/tmp mount.
  • 15:06 river: added missing /mnt/upload5 mount on several apaches: srv37 srv61 srv76 srv69 srv63 srv118 srv132 srv135 srv133 srv138 srv136
  • 14:49 domas: a few missing .frm files on db18 were causing trouble, resynced them from db19, resumed replication
  • 13:02 river: copying en from storage1 to ms1
  • 10:49 domas: replaced XFS with JFS on db18, installed ganglia on db17-db30
  • 10:36 river: completed move of commons, now being served from ms1 (except archive/)
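
The 22:14 checkers.php block can be sketched as an early 403 issued before MediaWiki runs. The parameter name is from the log; the surrounding code is an assumption about the shape of checkers.php:

```php
// Assumed shape of the checkers.php filter: reject any request carrying
// the GET parameter abused by the broken dewiki user script.
if ( isset( $_GET['speichern'] ) ) {
    header( 'HTTP/1.0 403 Forbidden' );
    exit( 'Forbidden' );
}
```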

November 1

  • 22:48 brion: fixed ContributionReporting to force a utf8 connection, now loads names in right charset
  • 22:20 brion: fixed $wgNoticeInfrastructure setting; defaults must have changed at some point
  • 22:15 domas: installed wikimedia-mysql4 on db21-23, established s1,s2,s3 replication. we now have full database copy in sdtpa \o/
  • 20:53 brion: deploying CentralNotice editing system on meta, woo
  • 20:27 brion: scapping to update reporting and centralnotice bits internally
  • 19:38 brion: rescapping to make sure 159 is unbroken
  • 19:27 brion: svn up'ing on wikitech just for domas
  • 19:25 brion: srv159 is out of space
    • We need to clean out the damn temp files somehow, eh?
  • 19:20 brion: scapping to update ContributionReporting ext
  • 12:56 mark: uppreffed traffic from knams to pmtpa via 6908/2828, as existing peering path had slight packet loss
  • 11:25 Tim: enabled subpages in the main namespace by default for all Wikisource wikis. This appears to be a de facto standard and is used by all Wikisources with an entry in wgNamespacesWithSubpages.
  • 07:55 Tim: disabled ParserDiffTest, obsolete
  • 07:06 mark: XO circuit back up:
[vl101-ve5.csw5-pmtpa.wikimedia.org] BGP peer 2610:18:10a::1 <2610:18:10a::1>, session is now up
[vl101-ve5.csw5-pmtpa.wikimedia.org] BGP peer 207.88.246.5 <w005.z207088246.xo.cnc.net>, session is now up
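
The charset fix in the 22:48 entry amounts to forcing the connection charset; the SQL-level equivalent of what the extension change does on connect:

```sql
-- Without a utf8 connection, UTF-8 donor names pass through a latin1
-- session charset and come back garbled.
SET NAMES utf8;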

October 31

  • 23:11 brion: set up some logs for fundraising banner campaign clicks for later mining
  • 17:44 brion: adding support for Tomas skin on wikimediafoundation.org for new fundraiser templates
  • 14:24 mark: XO circuit went down:
[vl101-ve5.csw5-pmtpa.wikimedia.org] BGP peer 207.88.246.5 <207.88.246.5>, session is now down because <Port State Down>
[vl101-ve5.csw5-pmtpa.wikimedia.org] BGP peer 2610:18:10a::1 <2610:18:10a::1>, session is now down because <Port State Down>

October 30

  • 23:11 Tim: fixed disk space on srv159, db1, srv103
  • 19:03 brion: updated triggers for donation reporting database a few minutes ago
  • 18:14 RobH: moved ms1 from pmtpa:a4 to sdtpa:a1, its back online.
  • 17:46 RobH: db26 OS installed and online
  • 17:28 brion: added a spam filter rule for private-l messages :)
  • 04:54 river: testing sun web server on ms1
  • 03:56 brion: updating squid conf to send upload /centralnotice to storage1 for testing
  • 03:53 brion: tweaked lighttpd config on storage1 for centralnotice static file testing, since amane's configuration is too crappy to support regexes needed to set headers on a directory
  • 02:59 brion: poking experimental expires options on amane for static centralnotice tests
  • 02:44 brion: brion broke lighttpd.conf briefly

October 29

  • 22:39 brion: enabling $wgCodeReviewENotif experimentally
  • 18:35 brion: disabled bitmap fonts in fontconfig on image scalers, seems to help with the "mad helvetica" problem
  • 18:02 RobH: db28 & db29 OS installed and online.
  • 17:59 brion: fixed some upload directory perms on foundationwiki
  • 17:12 RobH: db27 OS installed and online.
  • 16:54 RobH: db21 OS installed and online.
  • 16:38 RobH: db22, db23, db25, db30 were installed yesterday, forgot to admin log it, sorry ;/
  • 14:44 _mary_kate_: copying wikipedia/commons/thumb/4 from storage1 to ms1

October 28

  • 20:02 domas: re-enabled db16
  • 18:03 mark: Removed blackholes.securitysage.com from lily's spamassassin configuration
  • 17:52 domas: db16 fubar'ed by queries that built 100GB temporary tables, leading to jfs hangs, leading to unhappy kernel.
  • 15:23 RobH: updated dsh node group ALL, added backup of frontend data for bugzilla and blogs from isidore to tridge.
  • 12:33 rainman-sr: experimentally turning on "did you mean.." on search8,9 for enwiki
  • 10:44 mark: Reverted yesterday's search changes

October 27

  • 23:24 mark: Switched to lucenesearch 2.1 for all wikis
  • 23:06 mark: pooled search8 as the only search server in search pool 3
  • 22:25 mark: rainman-sr is making me do more ugly things to lucene.php
  • 22:22 mark: Pointed search for "all other wikis" hardcoded to search7 in lucene.php
  • 22:14 mark: Added zhwiki and plwiki to lucene search 2.1 pool 2

October 26

  • 15:43 mark: Set up OpenGear serial console server scs-a1-sdtpa
  • 13:37 mark: Set up iBGP between csw1-sdtpa and csw5-pmtpa (IPv4/IPv6)
  • 13:36 mark: Prepared csw1-sdtpa for production deployment (general configuration)
  • 09:56 domas: updated db18 firmware to 2.1.1 (September 2008)
  • 04:31 Tim: fixed the "service_ips" hostgroup in nagios
  • 03:03 Tim: hardware reboot of db18
  • 02:47 Tim: mysqld on db18 apparently hit a kernel bug. It was reported as a zombie but was still using 200% CPU in top. kswapd was simultaneously using 100% CPU. Did not respond to SIGKILL. The non-zombie parent, mysqld_safe, also did not respond to SIGKILL (wchan=flush_cpu_workqueue). Attempted a reboot with shutdown -r.
  • 02:47 brion: tweaked MaxClientsPerChild on yongle to see if that helps with the mysterious hangs i sometimes see where requests seem to get backed up; it's disrupting the CodeReview proxy as well as mobile & Mac Dictionary search

October 25

  • 20:46 brion: scapped to r42573
  • 08:17 Tim: svn up to 42536 for API overload fix. Re-enabling disabled query modules.
  • 05:55 Tim: svn up/scap to 42531 (for properly tested Interwiki.php fix).
  • 05:09 Tim: DB overload on many enwiki slave servers. Long running queries attributed to ApiQueryAllpages, ApiQueryBacklinks, ApiQueryCategoryMembers and ApiQueryLogEvents. Disabled those modules and killed related running threads.
  • 05:01 Tim: Interwiki links were broken due to a totally broken and untested getInterwikiCached() function. Live patch deployed at this time.
  • 04:33 Tim: Fixed svn conflicts in two files. Scap to r42524.
  • 04:20 Tim: disabled Drafts extension on test.wikipedia.org. Trevor, please contact me for code review.
  • 04:11 Tim: synced php-1.5 to srv35 and ran "make -B" in the serialized directory. Seems to have fixed test. Will scap.
  • 01:01 ariel: preemptively up mail quota to 7GB from 1GB for cbass, dmenard
  • 00:59 brion: testwiki is borked until we figure out how to get it to load updated message files. tried disabling $wgLocalMessageCache and $wgCheckSerialized to no effect
  • 00:51 brion: temporarily blocking scap during testing :) ... running serialized language file updates for test, broken by need to get magic word updates
  • 00:44 brion: preparing a svn up...
  • 00:37 ariel: up msecoquian's mail quota from 1GB to 6.9GB

October 24

  • 23:12 brion: set up ariel (the person) on sanger to do mail administration -- quota fixes etc
  • 16:24 TimStarling: reloaded ourusers.sql on all core and ext. mysql servers, adding a nagios user
  • 15:39 mark: slacking
  • 15:36 TimStarling: added special nagios user to ES instances on clematis
  • 14:00 domas: re-enabled db5, added db18 to s3
  • 10:45 domas: taking out db5 for copy to db18
  • 10:44 domas: fixed ntpd on bart, was pointing to multicast address that doesn't work
  • 09:57 Tim: removed decommissioned servers from monitoring: dryas, alrazi, diderot, friedrich, samuel
  • 07:50 Tim: added monitoring for toolserver ES clusters 17-19
  • 07:40 Tim: regenerated trusted XFF list with extra SAIX proxies
  • 05:00 Tim: fixed nagios check script handling of MySQL connection errors
  • 01:37 brion: setting $wgLicenseURL for Collection to point at GFDL English text
  • 01:01 brion: enabling Drafts on testwiki, but it seems to not be saving there... works on my local test, not sure what the issue is
  • 01:03 brion: disabling logentry, still borken?

October 23

  • 22:33 brion: trying re-enabling logentry ext on wikitech, now with cache disable to avoid edittoken for now
  • 21:34 brion: updating ipblocks table definition
  • 21:25 brion: re-ran svnImport to update path listings for CodeReview
  • 20:11 mark: Set up search7 - search9
  • 17:05 mark: Pooled search4 as a s1 search server to help with dead search2
  • 16:33 brion: updated mw-serve
  • 15:38 Tim: On the image scalers, temporarily mounted /a/tmp as /tmp with --bind to stop the disk full problem while we figure out some better solution
  • 15:24 Tim: removed temporary files on image scalers again
  • 14:54 RobH: Replaced dead disk in amane, rebuilding array.
  • 11:04 Tim: Added disk space monitoring for image scalers. Also added apache monitoring which was also missing.
  • 10:53 Tim: freed up disk space on image scalers, magick-* temporary files were filling their root partitions
  • 10:50 Tim: re-added cluster19 to the default write list. Not sure who took it out or why.
  • 10:32 Tim: freed up some space on srv103 (was down to 500MB)
  • 10:29 Tim: fixed monitoring for MegaRAID SAS
  • 07:10 Tim: Set up monitoring of RAID status for all Ubuntu DB servers using the wikimedia-raid-utils package that I just wrote. It doesn't do anything on the MegaRAID servers yet, but the Adaptec ones should work.
  • 05:05 Tim: running CodeReview svnImport.php
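
The 15:38 bind mount, expressed as an fstab line (a sketch of the temporary measure, not necessarily how it was applied in practice):

```
# Bind-mount the large /a partition over /tmp so scaler temp files can't
# fill the root filesystem.
/a/tmp  /tmp  none  bind  0  0
```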

October 22

  • 18:26 brion: enabling ODT output for collection
  • 18:17 brion: updating collection and codereview extensions
  • 18:13 Brion: updated mw-serve code and configured to send error emails per jojo's request
  • 17:15 Brion: Changed bugzilla's mail delivery from local sendmail (SSMTP) to direct SMTP, per Mark's recommendation

October 21

  • 19:29 RobH: Bayes upgraded from 2GB to 10GB.
  • 13:49 Tim: Did a demonstration hack of nagios from CSRF to arbitrary shell. Disabled cmd.cgi.
  • 04:13 Tim: Brought srv43-47 up as image scalers with mem limit 6 x 200MB = 1200MB (2GB physical)
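
The scaler sizing here (and in the similar October 20 and October 16 entries) follows one rule: worker concurrency times the per-worker MediaWiki memory limit must fit within physical RAM. A sketch with the numbers from this entry:

```shell
# 6 workers at a 200MB MediaWiki memory limit against 2GB physical RAM.
workers=6
limit_mb=200
physical_mb=2048
budget_mb=$(( workers * limit_mb ))
echo "memory budget: ${budget_mb}MB of ${physical_mb}MB physical"
```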

October 20

  • 18:11 RobH: srv118 rebooted, back online.
  • 17:25 RobH: srv79 was in kernel panic, rebooted.
  • 05:10 Tim: increased concurrency on srv159 to 15, for mem limit 15 x 200MB = 3000MB
  • 02:40 Tim: installed NRPE on khaldun and db20
  • 02:20 Tim: moved disk space checks on the ext stores from the "apaches" service group to the relevant ext store service group
  • 01:53 Tim: installed NRPE on the new ext stores
  • 01:45 Tim: Updated /etc/ssh/ssh_known_hosts on bart (copied from zwinger).
  • 00:30-01:30 Tim: Listed down servers on DC tasks. Removed broken servers from memcached rotation. Restarted apache on srv99, srv109, srv123. Purged master binlogs on srv102.

October 18

  • 21:45 RobH's mighty index finger brought amane and the site back up.
  • 21:00 river: ran 'nc -l -p 623'; amane's kernel panicked. Rob was called.
  • 20:55 mark, river: diagnosed the NFS communication problems as caused by the NIC intercepting port 623 (IPMI/RMCP) packets in hardware... amane wasn't receiving NFS replies from ms1.
  • 19:40 mark: Upload got unhappy, ms1 NFS mount on amane was unreachable and stalling things
  • 13:40 Tim: down again, single process allocating all memory
  • 07:35 Tim: took it down again, while recording /proc/vmstat and /proc/stat
  • 06:27 Tim: restarted srv160
  • 05:45 Tim: took srv160 into the purple for a much more convincing overload, and different oprofile results
  • 03:40 Tim: used oprofile to determine what part of the kernel is responsible for the system CPU spike. Looks like a spinlock in dnotify.
  • 03:12 Tim: simulated a memory-intensive request rate spike to srv160. Large system CPU response spike, but it didn't go down completely. Will try a bigger one.

October 17

  • 21:10 brion: enabled Commons foreign image repo on Wikitech
  • 18:45 brion: created Wikimedia-Boston list for SJ
  • 16:55 brion: adding nomcomwiki to special.dblist so it shows up right in sitematrix
  • 16:45 brion: deleted some junk comments from bugzilla
  • 16:31 brion: changed autoconfirm settings for 'fishbowl' wikis -- 0 age for autoconfirm, plus set upload & move for all users just in case autoconfirm doesn't kick in right
  • 14:22 RobH: srv131 back up.
  • 09:03 Tim: copying srv129 and srv139 ES data directories to storage2:/export/backup
  • 02:49 Tim: excessive lag on db16, killed long-running queries and temporarily depooled. CUPS odyssey continues.
  • 01:59 Tim: removing cups on all servers where it is running
  • 00:00 RobH: restarted srv43-47

October 16

  • 20:42 brion: added 3 more dump threads on srv31... we need to find some more batch servers to work with for the time being until new dump system is in place :)
  • 20:20 RobH: pulled samuel from the rack, decommissioned, RIP samuel.
  • 19:35 RobH: migrated rack B4 from asw3 to asw-b4-pmtpa.
  • 18:40 RobH: rebooted scs-ext, oops!
  • 18:26 RobH: srv61 reinstalled and redeployed.
  • 18:24 RobH: Adler re-racked with rails, booted up to maintenance mode prompt.
  • 17:34 mark: 208.80.152.0/25 NTP restriction is actually also not broad enough - changed it to /22 in ntpd.conf on zwinger
  • 17:02 brion: thumbnails on commons are insanely slow and/or broken
  • 14:44 Tim: added a more comprehensive redirection list to squid.conf.php for storage1 images
  • 14:04 Tim: redirected images for /wikipedia/en/ to storage1, apparently they were moved a while ago. Refactored the relevant squid.conf section.
  • 13:38 Tim: disabled directory index on amane. Was generating massive amounts of NFS traffic by generating a directory index for some timeline directories.
  • 12:51 Tim: increased memory limit on srv159 to 8x200MB. Still well under physical.
  • 11:38 Tim: cleaned up temporary files on srv159, had filled its disk
  • 11:25 Tim: synced upload scripts (including to ms1)
  • 10:06 Tim: removed sq50 from the squid node lists and uninstalled squid on it
  • 09:22 - 09:52 mark, Tim, JeLuF: initial attempts to bring the squids back up failed due to incorrect permissions on the recreated swap logs. Most were back up by around 09:32, except newer knams and yaseo squids which were missing from the squids_global node group. The node group was updated and the remainder of the squids brought up around 09:52.
  • 09:19 JeLuF: deployed squid.conf with an error in it. All squid instances exited.
  • 08:26 Tim: Restarted ntpd on search7, was broken
  • 06:42 Tim: ntp.conf on zwinger had the wrong netmask for the 208.x net, it was /26 instead of /25. So a lot of squids were out of it, and some had a clock skew of 10 minutes (as visible on ganglia). Fixed ntp.conf, not stepped yet. Will affect squid logs.
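
The 06:42 netmask bug is pure CIDR arithmetic: a /26 spans only 64 addresses, so squids numbered above .63 in the 208.80.152.0/25 block fell outside the NTP restriction. A quick check:

```shell
# Number of addresses covered by a CIDR prefix length.
cidr_size() { echo $(( 1 << (32 - $1) )); }
echo "/26 covers $(cidr_size 26) addresses"
echo "/25 covers $(cidr_size 25) addresses"
```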

October 15

  • 19:49 brion: added '<span onmouseover="_tipon' to spam regex; some kind of weird edit submissions coming with this stuff like [1]
  • 12:00 Tim: trying to bring srv159 up as an image scaler. Limiting memory usage to 8x100 = 800MB with MediaWiki.
  • 11:21 srv127 died just the same. Mark suggests using one with DRAC next.
  • 10:20 Tim: all image scalers (srv43 and srv100) swapped to death again. Preparing srv127 as an image scaler with swap off.
  • 08:43 Tim: reduced depool-threshold for the scalers to 0.1 since srv100 is quite capable of handling the load by itself while we're waiting for the other servers to come back up.
  • 07:45 Tim: half the scaling cluster went down again, ganglia shows high system CPU. Installing wikimedia-task-scaler on srv100.
  • 02:30 Tim: moved image scalers into their own ganglia cluster
  • 02:17 Tim: apache on srv43-47 hadn't been restarted and so was still running without -DSCALER. This partially explains the swapping. Restarted them. Took srv38-39 back out of the image scaler pool, they have different rsvg and ffmpeg binary paths and break without a MediaWiki reconfiguration.
  • 02:13 tomasz: upgraded srv9 to ubuntu 8.04
  • 02:00 tomasz: upgraded srv9 to ubuntu 7.10

October 14

  • 19:16 brion: restarted lighty on storage1 again -- it was back in 'fastcgi overloaded' mode, possibly due to the previously broken backend, possibly not
  • 19:11 mark: Pooled old scaling servers srv38, srv39
  • 18:50 brion: at least four of new image scalers are down -- can't reach by SSH. thumbnailing is borked
  • 16:41 brion: fixed image scaling for now -- storage1 fastcgi backends were overloaded, so it was rejecting things. did some killall -9s to shut them all down and restarted lighty. ok so far
  • 16:20 brion: image scaling is broken in some way, investigating
  • 02:54 Tim: fixed srv43-47, this is now the image scaling cluster
  • 00:10 Tim: oops, forgot to add VIPs, switched back.
  • 00:05 Tim: switched image scaling LVS to srv43-47

October 13

  • 23:45 Tim: prepping srv43-47 as image scaling servers
  • 21:45 jeluf: moved more image directories to ms1. Now, upload/wikipedia/[abghijmnopqrstuwxy]* are on ms1
  • 21:35 jeluf: killed mwsearchd on srv39, removed both the rc3.d link and the cronjob that start mwsearchd
  • 21:30 RobH: search8 and search9 are online, awaiting configuration.
  • 21:15 brion: thumb rendering failures reported... found some runaway convert procs poking at an animated GIF, killed them.
    • rev:42058 will force GIFs over 1 megapixel to render a single frame instead of animations as a quick hackaround...
  • 20:48 domas: thistle serving as s2a server
  • 20:28 RobH: stopping mysql on adler so it can be re-racked with rails.
  • 19:53 RobH: search7 back online, awaiting addition to the search cluster.
  • 19:35 mark: Set up an Exim instance on srv9 for outgoing donation mail, as well as incoming for delivery into IMAP for CiviMail (*spit*).
  • 17:00 RobH: srv21-srv29 decommissioned and unracked.
  • 12:05 domas: put lomaria back in rotation
  • 11:50 domas: Enabled write-behind caching on db15. Restarted.
  • 10:40 domas: restarted replication on db15 and lomaria
  • 10:27 domas: loading dewiki data from SQL dump into thistle
  • 09:09 Tim: restarted logmsgbot
  • 08:27 Tim: folded s2b back into s2
  • 08:06 Tim: db13 in rotation
  • 08:02 domas: copying from db15 to lomaria
  • 07:38 Tim: started replication on db13
  • 04:51 Tim: copying
  • 03:27 Tim: Preparing for copy from db15 to db13
  • 00:00 domas: something wrong with db15 i/o performance. it is behaving way worse than it should.

October 12

  • 23:58 brion: updated CodeReview to add a commit so loadbalancer saves our master position. playing with serverstatus extension on yongle to find out wtf it keeps getting stuck
  • 22:05 brion: db15 sucks hard. putting categories back to db13
  • 22:01 brion: db15 got all laggy with the load. taking back out of general rotation, leaving it on categories/recentchangeslinked
  • 21:58 brion: db15 seems all happy. swapping it in in place of db13, and giving it some general load on s2. we'll have to resync db13 at some point? and toolserver?
  • 19:41 Tim: shutting down db15 for restart with innodb_flush_log_at_trx_commit=2. But db8 seems to be handling the load now so I'm going to bed.
  • 19:20 Tim: depooled db15.
  • 19:09 Tim: split off some wikis into s2b and put db8 on it. To reduce I/O and hopefully stop the lag.
  • 18:51 Tim: db15 still chronically lagged. Offloading all s2 RCL and category queries to db13.
  • 18:38 Tim: offloading commons RCL queries to db13
  • 18:36 Tim: dewiki r/w with ixia (master) only
  • 18:33 Tim: offloading commons category queries to db13
  • 18:25 Tim: balancing load. Fixed ganglia on various mysql servers.
  • 18:06 Tim: going to r/w on s2. Not s2a yet because db15/db8 can't handle the load.
  • 17:46 Tim: db8->db15 copy finished, deploying
  • 17:33 Tim: installed NRPE on thistle.
  • 16:54 Tim: copied mysqld binaries from db11 to db15 and thistle. Plan for thistle is to use it for s2a.
  • 16:40 Tim: ixia/db8 can't handle the load between them with db13 out, even with s2a diverted. Restored db13 to the pool. Running out of candidates for a copy destination. Need db13 in because it's keeping the site up, can't copy to thistle because it's too small with RAID 10. Plan B: set up virgin server db15. Copying from db8.
  • 16:07 Tim: repooled ixia/db8 r/o
  • 15:53 Tim: removed ixia binlogs 290-349. 270-289 were deleted during the initial response.
  • 14:54 mark: Pooled search6 as part of search cluster 2, by request of rainman
  • 14:37 Tim: deployed r41995 as a live patch to replace buggy temp hack.
  • 14:14 Tim: cleaned up binlogs on db2. Yes the horse has bolted, but we may as well shut the gate.
  • 14:11 Tim: copy now in progress as planned.
  • 13:48 Tim: going to try the resync option. Maybe with s2 it won't take as long as s1. Will try to sync up db8 from ixia with db13 serving read-only load for the duration of the copy.
  • 13:40 Tim: ixia (s2 master) disk full. Classic scenario, binlogs stopped first, writing continued for 10 minutes before replag was reported.
  • 13:00 jeluf: moved wikipedia/m* image directories to ms1
  • 08:00 jeluf: restarted lighttpd on ms1, directory listings are now disabled.
  • 02:55 Tim: attempted to disable directory listing on ms1. Gave up after a while.
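
The directory-listing change behind the 08:00 entry is a one-line lighttpd setting:

```
# lighttpd.conf: disable directory index generation
dir-listing.activate = "disable"
```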

October 11

  • 7:00 jeluf: moved wikipedia/s* image directories to ms1

October 10

  • 21:30 jeluf: moved wikipedia/[jqtuwxy]* to ms1
  • 19:20 RobH: Bayes online.
  • 19:11 brion: recreated special page update logs in /home/wikipedia/logs, hopefully fixing special page updates
  • 13:05 Tim: reverted live patch and merged properly tested fix r41928 instead.
  • 12:31 Tim: deployed a live patch to fix a regression in MessageCache::loadFromDB() concurrency limiting lock
  • 12:17 domas: killed long running threads
  • ~12:04: s2 down due to slave server overload

October 9

  • 22:52 brion: enabled Collection on de.wikibooks so they can try it out
  • 20:00 jeluf: moved wikipedia/i* images to ms1
  • 17:05 RobH: thistle raid died due to hdd failed, replaced hdd, reinstalled as raid10.
  • 12:00 domas: switched s3 master to db1, accidentally erased a bunch of db.php stuff (don't know how :), restored from db.php~ :-)
  • 09:31 mark: pascal died yet again, revived it. Will move the htcp proxy tonight...

October 8

  • 21:05 brion: yongle still gets stuck from time to time, breaking mobile, apple search, and svn-proxy. i suspect svn-proxy but still can't easily prove it. using a separate svn command (in theory), but it's not showing me stuck processes.
  •  ??:?? rob added srv37, then later srv133, to the mediawiki-installation node group. he did an audit and didn't see any other problems. i ran a scap to make sure all are now up to date
    • Speculation: the rumored ongoing image disappearances may have been caused by the image-destruction bug still being in place on srv133 for the last month.
  • 19:02 mark: Upgraded packages on search1 - search6 and searchidx1
  • 18:59 brion: aaron complaining of srv37 not properly updated (doesn't recognize Special:RatingHistory). flaggedrevs.php was out of date there. checking scap infrastructure, stuff seems ok so far...
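The stale flaggedrevs.php on srv37 is the kind of drift a checksum audit catches. A hedged sketch of the core comparison; in practice it would run per apache over ssh/ddsh against the deployment master, but here it is purely local with invented file paths:

```shell
# Compare a candidate file's checksum against a reference copy and report
# OK or STALE. Paths are illustrative, not the real deployment layout.
same_file() {
  if [ "$(md5sum < "$1")" = "$(md5sum < "$2")" ]; then echo OK; else echo STALE; fi
}
printf 'x\n' > /tmp/ref.php
printf 'x\n' > /tmp/cand.php
same_file /tmp/ref.php /tmp/cand.php   # -> OK
```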

October 7

  • 21:47 brion: started two dump threads (srv31)
  • 21:16 RobH: installed and configured gmond on all knams squids.
  • 21:00 jeluf: moved wikipedia/g* to ms1
  • 18:55 RobH: fixed private uploads issue for arbcom-en and wikimaniateam.
  • 17:26 RobH: reinstalled and redeployed knsq24 and knsq29
  • 15:00-16:00 robert: switched enwiki to lucene-search 2.1 running on new servers. Test run till tomorrow; if anything goes wrong, reroute search_pool_1 to the old searchers on lvs3. Will switch on spell checking when all of the servers are racked. Thanks RobH for tuning config files.
  • 15:54 RobH: srv101 crashed again, running tests.
  • 15:45 RobH: srv146 was powered down for no reason. Powered back up.
  • 15:42 RobH: srv138 locked up, rebooted, back online.
  • 15:32 RobH: srv110 was locked up, rebooted, synced, back online.
  • 15:31 RobH: srv101 back up and synced.
  • 15:22 RobH: rebooted srv56, was locked up, handed off to rainman to finish repair.
  • 15:21 RobH: updated lucene.php and synced.
  • 15:04 RobH: updated memcached to remove srv110 and add in spare srv137.
  • 15:00 RobH: removed all servers from lvs:search_pool_1 and put in search1 and search2 with rainman
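The memcached change at 15:04 swaps a dead pool member (srv110) for a spare (srv137). A hedged sketch of that substitution; the real edit is to the PHP server list in mc-pmtpa.php, while here the pool is modeled as a plain host list on stdin:

```shell
# Replace one pool member with another in a host-per-line list.
# Host names match the log; the list itself is an invented example.
swap_member() {
  sed "s/^$1\$/$2/"
}
printf 'srv109\nsrv110\nsrv111\n' | swap_member srv110 srv137
```

Order matters for memcached: replacing the entry in place (rather than removing and appending) keeps the key-to-server hashing stable for the other slots.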

October 6

  • 23:55 brion: tweaked bugzilla to point rXXXX at CodeReview instead of ViewVC
  • 14:36 RobH: setup ganglia on all pmtpa squids.
  • 14:29 domas: amane lighty was closing connections immediately, worked properly after restart. upgraded to 1.4.20 on the way.
  • 13:50 mark: The slow page loading on the frontend squids appears to be limited to english main page only, for unknown reasons. Set another article as pybal check URL to prevent pooling/depooling oscillation by PyBal for now.
  • 09:27 mark: yaseo squids are fully in swap, set DNS scenario yaseo-down
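Changing the PyBal check URL (13:50) works because PyBal pools or depools a server based on fetching that URL within a timeout. A hedged sketch of such a fetch-with-timeout check; the actual monitoring is done inside PyBal itself, and the URL here is only an illustration (a local file:// target, so the example needs no network):

```shell
# Report "up" if the check URL responds within the timeout, "down" otherwise.
# -s silent, -f fail on HTTP errors, --max-time bounds the whole transfer.
check_url() {
  if curl -sf --max-time 5 -o /dev/null "$1"; then echo up; else echo down; fi
}
check_url "file:///etc/hosts"
```

Picking a lightweight article as the check URL, as mark did, avoids the oscillation where a slow main page makes healthy servers look dead.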

October 5

  • 23:14 mark: Frontend squids are not working well at the moment, sometimes serving cached objects with very high delays. I wonder if they are under (socket) memory pressure. Reduced cache_mem on the backend instance on sq25 to free up some memory for testing.
  • 20:35 jeluf: wikipedia/b* moved, too
  • 19:00 jeluf: switched squids to send requests for upload.wikimedia.org/wikipedia/a* to ms1
  • 14:30 jeluf: Moving all wikipedia/a* image directories to ms1
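The image migration proceeds one hashed shard directory at a time (a*, b*, ...). A hedged sketch of the per-shard copy command; the actual rsync flags and mount paths jeluf used are not in the log, so these are illustrative:

```shell
# Emit the copy command for one hashed image shard. Source mount point and
# destination path are assumptions, not the verified ms1 layout.
move_cmd() {
  printf 'rsync -a /mnt/upload/wikipedia/%s/ ms1:/export/upload/wikipedia/%s/' "$1" "$1"
}
move_cmd a
```

Moving shard by shard lets the squid rules at 19:00 switch upload.wikimedia.org traffic over one prefix at a time, keeping unmigrated shards served from the old host.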

October 4

  • 23:17 mark: Repooled knsq16-30 frontends in LVS. Also found that mint was fighting with fuchsia about being LVS master, due to reboot this afternoon.
  • 14:30 mark: Several servers in J-16 shut down or went down around this time. Reason unknown; possibly auto shutdown because of high temperature, possibly they were turned off by someone locally.
  • 14:03 mark: SARA power failure. Feed B lost power for ~ 6 seconds.
  • 00:26 mark: Depooled srv61
  • 00:07 brion: found srv37 and srv61 have broken json_decode (wtf!)
    • updating packages on srv37. srv61 seems to have internal auth breakage
    • updated packages on srv61 too. su still borked, may need LDAP fix or something?
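A broken json_decode like the one found on srv37/srv61 can be spotted with a one-liner run on each apache. A hedged sketch that prints the diagnostic rather than running it (it would normally be pushed out via ddsh, and the exact check brion used isn't logged):

```shell
# Print a per-server sanity check for PHP's json extension. A healthy server
# would var_dump an array; a broken one dumps NULL or fatals.
json_check_cmd() {
  printf '%s' "php -r 'var_dump(json_decode(\"[1]\"));'"
}
json_check_cmd
```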

October 3

  • 21:40 brion: transferring old upload backups from storage2 to storage3. once complete, can restart dumps!
  • 20:01 brion: running updateRestrictions on all wikis (done)
  • 17:51 RobH: srv135 & srv136 reinstalled as ubuntu.
  • 17:34 RobH: srv132 & srv133 reinstalled as ubuntu.
  • 17:13 RobH: srv130 back online.
  • 16:40 RobH: depooled srv131, srv132, srv135, srv136 for reinstall.
  • 00:25 brion: switched codereview-proxy.wikimedia.org to use local SVN command instead of PECL SVN module; it seemed to be getting bogged down with diffs, but hard to really say for sure
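Switching CodeReview from the PECL SVN module to the local svn binary means building and shelling out to svn command lines instead of calling in-process bindings. A hedged sketch of the kind of command involved; the revision and repository URL are illustrative, not what codereview-proxy actually issues:

```shell
# Build the shell command for fetching one revision's diff via the svn CLI.
svn_diff_cmd() {
  printf 'svn diff -c %s %s' "$1" "$2"
}
svn_diff_cmd 41995 http://svn.wikimedia.org/svnroot/mediawiki
```

The upside of the CLI approach is isolation: a hung diff is a visible child process that can be killed, rather than a stuck in-process extension call that wedges the whole apache worker.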

October 1

  • 20:02 RobH: srv63 back online.
  • 19:35 RobH: srv61 and srv133 back online.
  • 18:22 RobH: storage3 online and handed off to brion.
  • 17:35 RobH: updated mc-pmtpa.php to put srv61 as spare.
  • 17:32 RobH: srv61 faulty fan replaced, back online.
  • 09:31 Tim: srv104 (cluster18) hit max_rows, finally. Removed it from the write list.
  • 08:36 Tim: fixed ipb_allow_usertalk default on all wikis
  • 23:46 mark: Reinstalled knsq24
  • 22:55 mark: Reenabled switchports of knsq16 - knsq30
  • 20:45 jeluf: fixed resolv.conf on srv131
  • 20:45 jeluf: mounted ms1:/export/upload as /mnt/upload5, started lighttpd on ms1
  • 19:47 brion: enabled revision deletion on test.wikipedia.org for some public testing.
  • 14:25 RobH: Cleaned out the squid cache on knsq16, knsq17, knsq18, knsq19, knsq21, knsq22, knsq23, knsq25, knsq26, knsq27, knsq28, knsq30. DRAC not responsive on knsq20, knsq24, knsq29.
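A fix applied "on all wikis", like the ipb_allow_usertalk default change at 08:36, typically iterates the wiki database list and runs one maintenance command per wiki. A hedged sketch of that loop; the dblist contents, script name, and SQL file below are invented for illustration:

```shell
# Print one maintenance command per wiki listed in a dblist file.
# The dblist and the fix-default.sql name are hypothetical examples.
for_all_wikis() {
  while read -r db; do
    printf 'php maintenance/sql.php --wiki=%s fix-default.sql\n' "$db"
  done < "$1"
}
printf 'enwiki\ndewiki\n' > /tmp/all.dblist
for_all_wikis /tmp/all.dblist
```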
