Server admin log/Archive 20

From Wikitech
Revision as of 21:48, 22 November 2010 by JeLuF (Talk | contribs)

November 22

  • 21:44 Ryan_Lane: err - restarting varnish on knsq2,4,5
  • 21:44 Ryan_Lane: restarting varnish on knsq2,3,4
  • 21:43 Ryan_Lane: restarting knsq2,3,4
  • 21:29 mark: Restarted varnish on knsq2,4,5
  • 21:24 mark: Restarting Varnish on knsq1
  • 21:09 mark: Set static routes 91.198.174.232/31 to 91.198.174.247 (csw1-esams) on br1-knams
  • 21:03 mark: Started pybal on amslvs1 again
  • 20:53 mark: Killed PyBal on amslvs1
  • 20:39 mark: Lowered cache_mem and cache_dir sizes on amssq31 for testing
  • 20:28 Ryan_Lane: depooling knsq24
  • 20:25 mark: Restarted pybal on amslvs1 with depool-threshold = 1 (temporarily)
  • 17:27 Ryan_Lane: restarting nagios
  • 17:27 Ryan_Lane: adding warning notifications via IRC to nagios
  • 16:40 mark: Suppressed announcements to AS16265 on csw1-esams
  • 16:30 mark: Suppressed BGP announcements to AS13030 on br1-knams

November 21

  • 08:29 logmsgbot: jeluf synchronized php-1.5/wmf-config/InitialiseSettings.php '25929 - Import sources for Spanish Wikipedia'
  • 05:49 tomasz: purging binary logs on db9 to mysql-bin.001600
  • 05:43 tomasz: syncing db9 mysql-bin.001562 - mysql-bin.001599 to tridge
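
The two db9 entries above copy a range of binary logs to tridge and then purge up to mysql-bin.001600. In MySQL, PURGE BINARY LOGS TO 'file' removes every log file that sorts before the named one, so it may be worth checking that everything about to be purged falls inside the synced range. A minimal sketch of that check, assuming file names of the form base.NNNNNN; the helper names and the list of files on disk are hypothetical and not taken from the log:

```python
import re

# Binary log names look like "mysql-bin.001562": a base name plus a numeric sequence.
_BINLOG_RE = re.compile(r"^(?P<base>.+)\.(?P<seq>\d+)$")


def binlog_seq(name):
    """Return the numeric sequence of a binary log file name."""
    match = _BINLOG_RE.match(name)
    if match is None:
        raise ValueError("not a binary log name: %r" % name)
    return int(match.group("seq"))


def purged_by(target, existing):
    """Files that PURGE BINARY LOGS TO target would remove (strictly before target)."""
    cutoff = binlog_seq(target)
    return [name for name in existing if binlog_seq(name) < cutoff]


def unsynced(to_purge, first_synced, last_synced):
    """Files slated for purging that lie outside the synced range (inclusive)."""
    low, high = binlog_seq(first_synced), binlog_seq(last_synced)
    return [name for name in to_purge if not low <= binlog_seq(name) <= high]


# Hypothetical directory listing, for illustration only.
existing = ["mysql-bin.%06d" % i for i in range(1562, 1602)]
doomed = purged_by("mysql-bin.001600", existing)
missing = unsynced(doomed, "mysql-bin.001562", "mysql-bin.001599")
print(len(doomed), "files would be purged;", len(missing), "of them not yet synced")
```

An empty result from unsynced() means the purge only touches files that were already copied.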

November 20

  • 21:12 JeLuF: added download.wikipedia.org as ServerAlias for download.wikimedia.org
  • 20:14 logmsgbot: jeluf synchronized php-1.5/wmf-config/InitialiseSettings.php '25982 - Enable subpages in the main namespace for ten.wikipedia.org'
  • 20:11 logmsgbot: jeluf synchronized php-1.5/wmf-config/InitialiseSettings.php '25923 - Namespaces on br.wikisource'
  • 20:09 logmsgbot: jeluf synchronized php-1.5/wmf-config/InitialiseSettings.php '25929 - Import sources for Spanish Wikipedia'
  • 20:06 logmsgbot: jeluf synchronized php-1.5/wmf-config/InitialiseSettings.php '25996 - Import sources for pfl.wikipedia'
  • 13:00 JeLuF: added more error patterns to puppetmon

November 19

  • 23:04 Ryan_Lane: changed https check on payments to 1 retry in nagios, via puppet
  • 20:59 atglenn: restarted torrus. guess why :-P
  • 17:52 mark: Readded notification of Service[nagios] when changing nagios types in puppet
  • 13:59 richcole: set raid 10 up on DB41
  • 08:36 logmsgbot: catrope synchronized php-1.5/wmf-config/CommonSettings.php 'bug 25850 - Hide Take me Back link on all wikis'
  • 07:32 JeLuF: started puppet on sq42 sq41 sq47 sq45 sq46 sq44 sq50 sq52 sq53 sq51 sq48 sq55 sq54 sq43 sq56 sq58 sq64 sq62 sq61 sq65 sq66 sq60 sq63 sq72 sq75 sq71 sq73 sq74 sq76 sq78 sq77 sq81 sq80 sq79 sq83 sq84 sq82 sq85 sq86
  • 05:21 apergos: on sq85 I was seeing complaints from cron about restart of puppet: unknown option -w. removed that from /etc/default/puppet and restart, but that fails: Could not parse for environment production: Could not find file /agent.pp
  • 01:00 atglenn: added monitoring mechanism in root's crontab on sq85 (don't need it everywhere) that will sms me when ms4 is acting up. I'd do it in puppet if someone told me how they would want it to be added there.
  • 00:22 tomasz: turning db9 watchdog back on. setting at 5minutes

November 18

  • 21:57 richcole: DB42 shutdown for service
  • 21:40 JeLuF: CORRECTION: started puppet manually on sq59 sq61 sq73 sq60 sq62 sq65 sq77 sq63 sq64 sq72 sq75 sq74 sq76 sq71 sq78 sq66, startup script is broken.
  • 21:40 JeLuF: started squid manually on sq59 sq61 sq73 sq60 sq62 sq65 sq77 sq63 sq64 sq72 sq75 sq74 sq76 sq71 sq78 sq66, startup script is broken.
  • 21:26 mark: Fixed puppet on formey
  • 21:24 mark: Fixed puppet on linne
  • 19:57 JeLuF: blocked UDP from srv124 on nfs1 aka syslog
  • 17:00 JeLuF: restarted puppet on srv215, srv235, srv244, srv257, srv262, srv288
  • 16:33 JeLuF: fixed puppet on srv185 and srv200
  • 15:00 logmsgbot: aaron synchronized php-1.5/wmf-config/flaggedrevs.php 'Set FR_INCLUDES_CURRENT on mediawikiwiki'

November 17

  • 20:19 JeLuF: syslog is being spammed with one week old messages from srv124
  • 20:19 RobH: owa1/2/3 online with base OS install and puppet updates
  • 17:59 RobH: updated dns for new databases servers
  • 17:15 richcole: owa1 going down for repair
  • 15:52 Ryan_Lane: moved the nagios purge stuff out of puppet, and into nagios's init script. Pulled the nagios init script into puppet
  • 10:03 tomasz_: adding single field index on converted amount under public_reporting within civirm db on db9
  • 10:03 tomasz_: adding single field indexes to utm_source, utm_medium, and utm_campaign under contribution_tracking table within drupal db on db9
  • 03:44 atglenn: restarted apache on ekrem, many processes hung in "graceful close" state for a long period of time
  • 03:06 logmsgbot: tfinc synchronized php-1.5/extensions/CentralNotice/SpecialBannerController.php
  • 03:04 Tim: in puppet, disabled nagios::purge since it breaks puppet entirely on fenari. Removed Aaron's obsolete ssh public key by adding an ensure=>absent to puppet.
  • 01:34 Tim: on ekrem: ran logrotate -f, since log rotation previously failed due to disk full
  • 01:28 Tim: on ekrem: root partition full, deleted old apache access logs
  • 00:20 logmsgbot: tfinc synchronized php-1.5/wmf-config/CommonSettings.php

November 16

  • 19:05 logmsgbot: catrope synchronizing Wikimedia installation... Revision: 76812:
  • 19:03 RoanKattouw: Running scap to deploy UploadWizard backend changes (core only)
  • 17:04 Ryan_Lane: adding run stages to puppet config; adding apt-get update to first stage, and nagios resource purging to last stage
  • 17:01 logmsgbot: catrope synchronized php-1.5/maintenance/nextJobDB.php 'Fix memcached usage for nextJobDB.php, broken since Sep 09. Should speed up job queue processing'
  • 16:38 RobH: updated dns
  • 15:50 logmsgbot: catrope synchronized php-1.5/wmf-config/InitialiseSettings.php 'Add deletedtext, deletedhistory rights to eliminator group on hiwiki'
  • 15:04 logmsgbot: catrope synchronized php-1.5/wmf-config/InitialiseSettings.php 'bug 25374 - Eliminator group for hiwiki'
  • 14:36 JeLuF: 25871 - fixed logo for pflwiki
  • 14:05 RobH: temp fixed nagios

November 15

  • 22:31 Ryan_Lane: repooling sq70
  • 21:43 Ryan_Lane: pushing change to varnish to send cache-control header for geoip lookup
  • 21:43 logmsgbot: catrope synchronized php-1.5/wmf-config/InitialiseSettings.php 'Adding categories to $wmgArticleAssessmentCategory'
  • 21:37 logmsgbot: catrope synchronized php-1.5/extensions/ArticleAssessmentPilot/ArticleAssessmentPilot.hooks.php 'r76709'
  • 21:17 Ryan_Lane: depooling sq70
  • 21:17 logmsgbot: jeluf synchronized php-1.5/wmf-config/InitialiseSettings.php '25569 - Create the Gagauz Wikipedia (wp/gag)'
  • 21:01 mark: Lowered CARP weight of esams text amssq* squids from 20 to 10, equal to the older knsq* squids
  • 20:33 Ryan_Lane: setting authdns-scenario normal
  • 20:05 RobH: current slowdowns reported for folks hitting AMS squids. Moving traffic to US datacenter should fix major slowdowns on !Wikipedia & !Wikimedia
  • 20:04 Ryan_Lane: setting authdns-scenario esams-down
  • 19:56 RobH: fixed nagios again
  • 19:51 RobH: updating dns for new owa processing nodes
  • 18:54 RobH: srv298 now online in api pool
  • 18:21 Ryan_Lane: fixing puppet manually on sq34, sq36, sq37, sq39, sq40, and knsq13
  • 18:18 RobH: gilman to secure gateway project stalled, needs network checks done
  • 18:07 Ryan_Lane: puppetizing /etc/default/puppet, since some hosts had START=no, instead of START=yes
  • 17:46 RobH: gilman needed hard reset, ilom responsive now (thx rich!)
  • 17:35 Ryan_Lane: restarting puppet again on all nodes using -M flag for ddsh to see system names (checking for errors)
  • 17:23 Ryan_Lane: restarting puppet on all nodes
  • 17:12 RobH: sq57 disk replaced, reinstalled, back in service
  • 17:09 mark: Restarted apache on sockpuppet with concurrency 4 instead of 3
  • 17:04 RobH: puppet is now failing to work properly on sq57, why did we upgrade puppet again?
  • 16:59 RobH: sq57 reinstalled and doing post installation configuration
  • 16:40 Ryan_Lane: upping configtimeout setting in puppet to 8 minutes, globally
  • 16:33 Ryan_Lane: trying to add puppet.conf to puppet again
  • 16:24 Ryan_Lane: undoing puppet.conf changes
  • 16:20 RobH: sq57 coming down for reinstallation
  • 16:19 RobH: db13 back online, restarted mysql, but it's currently commented out of db.php
  • 16:12 RobH: not sure why db13 is borked, but it's down, poking at it
  • 16:09 Ryan_Lane: added puppet.conf to puppet. pushing change out
  • 16:00 RobH: torrus is up again
  • 15:59 richcole: swapped sq57 sdb bad drive
  • 15:56 RobH: torrus is down, again, restarting and cleaning up its services
  • 15:52 RobH: manually purged spence nagios, started manually, working until puppet borks it again
  • 15:10 RobH: nagios is down, investigating

November 14

  • 23:28 mark: Fixed Nagios
  • 20:00 logmsgbot: jeluf synchronized php-1.5/wmf-config/InitialiseSettings.php '25918 - Namespaces on vec.wikisource.org'
  • 14:57 logmsgbot: jeluf synchronized php-1.5/wmf-config/InitialiseSettings.php '25904 - Create the Swedish Wikiversity (wv/sv)'
  • 14:56 logmsgbot: jeluf ran sync-common-all '25904 - Create the Swedish Wikiversity (wv/sv)'
  • 14:28 logmsgbot: jeluf synchronized php-1.5/wmf-config/InitialiseSettings.php '25918 - Namespaces on vec.wikisource.org'
  • 08:26 domas: ran purge-nagios-resources.py manually to bring up nagios
  • 07:14 domas: reduced passenger pool size to 4 on sockpuppet
  • 04:42 Ryan_Lane: moving /etc/nagios/puppet_services.cfg to .bak and rerunning puppet
  • 03:05 Ryan_Lane: modified nagios puppet manifest to purge decommissioned servers from the services configuration
  • 01:48 logmsgbot: jeluf synchronized php-1.5/cache/interwiki.cdb 'Updating interwiki cache'
  • 01:33 Ryan_Lane: temporarily upped configtimeout in /etc/puppet/puppet.conf to 8 minutes on spence so that puppet would run

November 13

  • 22:03 logmsgbot: tfinc synchronized php-1.5/extensions/ContributionReporting/ContributionHistory_body.php
  • 20:42 mark: Ran dist-upgrade on sq68
  • 20:37 mark: powercycled sq68
  • 19:29 logmsgbot: jeluf synchronized php-1.5/wmf-config/InitialiseSettings.php '25871 - Create the Palatinate German Wikipedia (wp/pfl)'
  • 19:29 logmsgbot: jeluf ran sync-common-all '25871 - Create the Palatinate German Wikipedia (wp/pfl)'
  • 18:39 mark: Fixed puppet on db16
  • 18:24 mark: Installed script reporting the last Puppet run in MOTD (Karmic and higher only)
  • 18:03 logmsgbot: jeluf synchronized php-1.5/wmf-config/InitialiseSettings.php '25774 - Create Wikinews in Esperanto'
  • 18:02 logmsgbot: jeluf ran sync-common-all '25774 - Create Wikinews in Esperanto'
  • 17:43 logmsgbot: jeluf ran sync-common-all '25773 - Create Wikibooks in Limburgish'
  • 17:37 logmsgbot: jeluf synchronized php-1.5/wmf-config/InitialiseSettings.php '25743 - Create the Breton Wikisource (ws/br)'
  • 17:27 logmsgbot: jeluf ran sync-common-all '25743 - Create the Breton Wikisource (ws/br)'
  • 17:10 mark: Installed cron job that removes puppetdlock files over a day old; these prevent puppet from doing runs forever otherwise
  • 17:01 apergos: removed "-n" from mw-tor-list on hume, otherwise it (I guess) terminates early, at any rate it produces an empty tor node list. If this turns out to be too big a burden on hume's resources we can look at some other approach
  • 16:49 mark: Upgrading puppet agent from 0.25 to 2.6 across the cluster
  • 16:24 logmsgbot: jeluf synchronized php-1.5/wmf-config/InitialiseSettings.php '25696 - Create vec.wikisource.org'
  • 16:04 logmsgbot: jeluf ran sync-common-all 'added gag.wikipedia and vec.wikisource'
  • 15:48 mark: Replaced the Wikimedia APT repository by a new per-distribution-version repository managed by 'reprepro' on brewster; the old repository is available as http://apt.wikimedia.org/wikimedia-old/
  • 14:29 logmsgbot: jeluf synchronized php-1.5/cache/interwiki.cdb 'Updating interwiki cache'
  • 10:51 logmsgbot: jeluf synchronized php-1.5/wmf-config/InitialiseSettings.php '25714 - Adding sources wikis for [[Special'
  • 09:30 logmsgbot: jeluf synchronized php-1.5/wmf-config/InitialiseSettings.php '25156 - Requesting an alias for project namespace on Persian Wikipedia'
  • 08:58 logmsgbot: jeluf synchronized php-1.5/wmf-config/InitialiseSettings.php '25859 - Enable Collection on gl.wikipedia'
  • 08:45 logmsgbot: jeluf synchronized php-1.5/wmf-config/InitialiseSettings.php '25893 - Wikimania Logo for WikimaniaTeam Wiki'
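
The 17:10 entry above describes a cron job that removes puppetdlock files more than a day old, since a leftover lock otherwise blocks puppet runs indefinitely. The actual job is not recorded in the log; the following is only a sketch of the idea, where the directory to scan and the function names are assumptions:

```python
import os
import time

ONE_DAY = 24 * 60 * 60  # seconds


def find_stale_locks(root, max_age_seconds=ONE_DAY, lock_name="puppetdlock", now=None):
    """Return lock files under root whose mtime is older than max_age_seconds."""
    now = time.time() if now is None else now
    stale = []
    for dirpath, _dirnames, filenames in os.walk(root):
        if lock_name in filenames:
            path = os.path.join(dirpath, lock_name)
            if now - os.path.getmtime(path) > max_age_seconds:
                stale.append(path)
    return sorted(stale)


def remove_stale_locks(root, **kwargs):
    """Delete stale lock files under root and return the paths that were removed."""
    removed = find_stale_locks(root, **kwargs)
    for path in removed:
        os.remove(path)
    return removed
```

Run from cron with the puppet state directory as root (the exact path depends on the installation), this would leave recent locks from in-progress runs alone and clear only the ones that have gone stale.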

November 12

  • 22:56 Ryan_Lane: added file_mover user to hume
  • 21:22 mark: Fixed torrus
  • 21:11 mark: Setup amanda backups of brewster:/srv/{wikimedia,autoinstall,tftpboot}
  • 18:31 logmsgbot: jeluf ran sync-common-all '25737 - Closure of Nauruan Wikibooks'
  • 18:06 logmsgbot: tfinc synchronized php-1.5/extensions/ContributionReporting/ContributionReporting.php
  • 17:46 Ryan_Lane: adding demon to shell accounts as mortal
  • 17:35 Ryan_Lane: rebooting ersch
  • 17:34 Ryan_Lane: rebooting alsted
  • 15:58 mark: Shutdown srv126 for decommissioning
  • 15:27 Ryan_Lane: deleting svnuser.pp manifest, and any references to it, since we are now using ldap for svn users instead.
  • 14:11 RobH: srv230 shows memory error in SEL. it reboots and sees all memory. opening a ticket to ensure its not showing the memory error on its LCD
  • 12:22 mark: Rebooting sockpuppet
  • 12:20 mark: Running apt-get dist-upgrade on sockpuppet
  • 12:05 mark: Converted puppetmaster install on sockpuppet from mongrel based to passenger based
  • 11:19 mark: Upgraded puppet and puppetmaster on sockpuppet to 2.6.1
  • 07:09 JeLuF: restarted crashed backend squids on sq41 and sq42
  • 01:32 logmsgbot: tfinc synchronized php-1.5/extensions/ContributionReporting/FundraiserStatistics.css
  • 00:44 logmsgbot: tfinc synchronized php-1.5/extensions/CentralNotice/SpecialBannerAllocation.php
  • 00:27 logmsgbot: tfinc synchronized php-1.5/extensions/LandingCheck/SpecialLandingCheck.php
  • 00:27 logmsgbot: tfinc synchronized php-1.5/extensions/LandingCheck/LandingCheck.php
  • 00:27 logmsgbot: tfinc synchronized php-1.5/wmf-config/CommonSettings.php 'Changing landing check to be wg var'

November 11

  • 23:59 logmsgbot: tfinc synchronized php-1.5/wmf-config/InitialiseSettings.php 'Adding country codes to landing check'
  • 22:09 RobH: torrus wasnt recording items, restarted
  • 18:14 apergos: restarted varnish on storage1, seemed it might have gone out to lunch
  • 17:41 mark: Powercycled sq84
  • 17:01 mark: powercycled sq80
  • 16:52 mark: Powercycled amssq50
  • 16:42 mark: Powercycled sq68
  • 16:37 mark: Powercycled sq59
  • 16:34 mark: Powercycled sq57
  • 15:38 mark: Removed ex-fedora data on ms2, after backing it up to tridge
  • 10:33 logmsgbot: catrope synchronized php-1.5/wmf-config/CommonSettings.php 'Bump style version appendix'
  • 10:32 logmsgbot: catrope synchronized php-1.5/extensions/UsabilityInitiative/Vector/Vector.combined.min.js 'r76511'
  • 09:58 Ryan_Lane: restarted apache on fenari
  • 04:02 logmsgbot: tfinc synchronized php-1.5/extensions/ContributionReporting/ContributionReporting.php 'Updating for 2010'
  • 03:50 logmsgbot: tfinc synchronized php-1.5/wmf-config/CommonSettings.php 'removing test since its in the extension config'
  • 03:46 logmsgbot: tfinc synchronizing Wikimedia installation... Revision: 76474
  • 02:30 atglenn: so another restart of torrus. seriously...
  • 00:43 domas: what Rob meant was that they went away by themselves, as it was upstream provider issue.
  • 00:39 RobH: !wikipedia and !wikimedia network issues resolved, all projects should be fine now
  • 00:35 domas: #network #failwhale #lol
  • 00:31 RobH: looking into the current slowdown/inaccessibility issues for folks on !Wikipedia and !Wikimedia
  • 00:25 domas: flapping network in pmtpa

November 10

  • 22:32 rfaulk: installed "scipy" python package on grosley.wikimedia.org with apt-get - statistical analysis in python
  • 22:23 atglenn: restarted torrus, it had deadlocked again. is it my imagination or is this happening really often lately?
  • 21:26 logmsgbot: catrope synchronized php-1.5/wmf-config/CommonSettings.php 'Bump style version appendix'
  • 21:25 logmsgbot: catrope synchronized php-1.5/extensions/UsabilityInitiative/Vector/Vector.combined.min.js 'r76474'
  • 21:21 logmsgbot: nimishg synchronized php-1.5/extensions/ContributionReporting/ContributionReporting.i18n.php 'r76472'
  • 21:16 logmsgbot: catrope synchronized php-1.5/wmf-config/CommonSettings.php 'Bump style version appendix'
  • 21:16 RoanKattouw: Removed srv124 from mediawiki-installation node group as it's slated to be decommissioned
  • 21:14 logmsgbot: catrope synchronized php-1.5/extensions/UsabilityInitiative/Vector/Vector.combined.min.js 'r76471'
  • 20:44 logmsgbot: catrope synchronized php-1.5/wmf-config/CommonSettings.php 'Bump style version appendix'
  • 20:44 logmsgbot: catrope synchronized php-1.5/extensions/UsabilityInitiative/Vector/Vector.combined.min.js 'r76469'
  • 20:32 logmsgbot: catrope synchronized php-1.5/wmf-config/CommonSettings.php 'Bump style version appendix'
  • 20:31 logmsgbot: catrope synchronized php-1.5/extensions/UsabilityInitiative/js/plugins.combined.min.js 'r76467'
  • 19:40 RobH: singer config restarted, will host download.w.o & dumps.w.o as well as a number of other things that refer to those two entries in dns
  • 19:39 RobH: changed dns for dumps.wikimedia.org to go to singer instead of dataset1 during its downtime
  • 18:59 atglenn: someone was polite and didn't name me in the above comment :-P I commented out the script that ships logs to both dammit.lt and dataset1 instead of looking at the script itself
  • 18:58 domas: unbroke pagecounts shipment (someone broke it and said "yes you can blame me, it was my f*ckup, people should know that")
  • 14:59 apergos: rebooting dataset1 so we can get web service going over there (can't be restarted in the usual way after kernel panic)
  • 10:38 RoanKattouw: Published MW 1.16 tarball on noc.wm.o because download.wm.o is still down http://noc.wikimedia.org/mediawiki-1.16.0.tar.gz
  • 06:47 logmsgbot: tfinc synchronized php-1.5/wmf-config/CommonSettings.php 'Turning cc gateway back on with the sidebar'
  • 05:20 apergos: stopped rsync of pagecount stats from locke to dataset1 for now til disk/fs issue is resolved
  • 04:59 apergos: shot all dump processes on dataset1; note a kernel panic in logs from within __destroy_inode, going to reboot and leave rsync of pagecounts and dumps off
  • 00:41 logmsgbot: tfinc synchronized php-1.5/wmf-config/CommonSettings.php 'Taking outage on cc cluster'

November 9

  • 22:42 RobH: running puppet on spence to remove all the old apaches that are no longer in any kind of service
  • 22:39 logmsgbot: jeluf synchronized php-1.5/wmf-config/InitialiseSettings.php '24539 - Transwiki import source for ml.wikisource.org'
  • 22:36 RobH: didnt log my change that I ran about 25 minutes ago to change test.w.o from srv124 to srv193 in squid settings and deployed
  • 22:32 logmsgbot: jeluf ran sync-common-all
  • 22:27 logmsgbot: jeluf synchronized closed.dblist
  • 21:31 logmsgbot: robh synchronized php-1.5/wmf-config/mc.php 'removed srv193 from potential memcached pool as it will shortly become the new test.w.o server'
  • 20:53 logmsgbot: catrope synchronized php-1.5/wmf-config/InitialiseSettings.php 'Set tenwiki logo to local Wiki.png'
  • 17:26 logmsgbot: robh synchronized php-1.5/wmf-config/abusefilter.php 'bugzilla#24394'
  • 17:17 logmsgbot: robh synchronized php-1.5/wmf-config/abusefilter.php
  • 17:15 RobH: that actually ran 15 minutes ago and was stuck on a broken server at the end, all other hosts had synced
  • 17:15 logmsgbot: robh synchronized php-1.5/wmf-config/mc.php 'srv230 unresponsive to ssh, needs to reboot, swapped it out for working spare'
  • 15:51 RobH: removed srv* under srv151 from pybal, left entry for srv124 as its test.w.o, even though its set to false
  • 15:46 RobH: srv229 puppet was hanging, manually ran apt-get update and reran puppet, now its happy
  • 15:22 RobH: srv229 rebooted, wouldnt let me ssh in, coming back up now with puppet run
  • 15:21 logmsgbot: robh synchronized php-1.5/wmf-config/mc.php 'srv281 is not behaving, swapped it out'
  • 15:17 logmsgbot: robh synchronized php-1.5/wmf-config/mc.php 'replacing a server that i am working on'
  • 15:01 mark: Reduced CARP weight of new amssq squids from 30 to 20, until they get SSDs
  • 12:16 mark: Started copy of ms2:/a/ex-fedora data to tridge
  • 12:15 mark: Included class base for server tridge in Puppet
  • 12:09 mark: Cleaned up temporary files on image scalers
  • 06:33 apergos: cleared out some old bin log files on db27 to get back some space
  • 05:20 apergos: reboot of ms4 successful, we need to monitor performance over the next several days. If people are still seeing 503's for thumbs that's a problem
  • 05:09 apergos: er, ms4!!
  • 05:09 apergos: rebooting ms5 into alternate boot environ, now with new improved patches (:-P), let's see if it works
  • 01:00 logmsgbot: tfinc synchronized php-1.5/wmf-config/CommonSettings.php 'Bumping style version for meta udp2log fix'

November 8

  • 23:50 logmsgbot: robh synchronized php-1.5/wmf-config/InitialiseSettings.php 'updating tenwiki logo'
  • 23:18 logmsgbot: tstarling ran sync-common-all
  • 22:32 logmsgbot: jeluf synchronized php-1.5/wmf-config/InitialiseSettings.php '25801 - New logo for et.wikimedia.org'
  • 22:27 logmsgbot: jeluf synchronized php-1.5/wmf-config/InitialiseSettings.php '25742 Please set the Buryat Wikipedia logo'
  • 22:25 logmsgbot: jeluf synchronized php-1.5/wmf-config/InitialiseSettings.php '25742 Please set the Buryat Wikipedia logo'
  • 22:20 logmsgbot: jeluf synchronized php-1.5/wmf-config/InitialiseSettings.php '25836 - Commons'
  • 22:13 logmsgbot: jeluf synchronized php-1.5/wmf-config/InitialiseSettings.php '25779 - Create new namespace "Institution"/"Museum" for Commons'
  • 21:54 JeLuF: Fixed typo: Replaced the second "srv290" in node group "apaches" by srv296.
  • 21:10 logmsgbot: jeluf synchronizing Wikimedia installation... Revision: 76208
  • 21:10 JeLuF: srv154 didn't receive any updates in the last few days, it was missing in the mediawiki-installation nodegroup
  • 20:19 mark: Moved uplink of asw-b-sdtpa from temporary GigE link csw1-sdtpa:3/48 to 2x 10G (LACP) links csw1-sdtpa:16/2 and 16/3, shutdown 3/48
  • 19:44 logmsgbot: nimishg synchronized php-1.5/extensions/LandingCheck/SpecialLandingCheck.php 'r76156'
  • 19:44 logmsgbot: nimishg synchronized php-1.5/extensions/LandingCheck/LandingCheck.php 'r76156'
  • 19:22 RobH: updated dns, seems it borked, but pdns is running on nescio so it should clear up
  • 17:47 logmsgbot: robh synchronized php-1.5/wmf-config/abusefilter.php 'bug'
  • 16:38 RobH: esams squid flapping on text squids is due to disk i/o use, they will be replaced with SSD soon
  • 06:04 apergos: brought ms4 back up in primary boot environment, testing concluded for tonight, results being sent to Oracle
  • 05:50 apergos: doing reboot of ms4 into alternate boot environ for testing
  • 02:34 apergos: restarted torrus, it was out to lunch again

November 7

  • 17:39 RoanKattouw: Starting makeArbcomList.php in a screen on fenari. Tim says this'll take about a day
  • 12:43 logmsgbot: catrope synchronized php-1.5/includes/Profiler.php 'r76243'
  • 12:04 logmsgbot: catrope synchronized php-1.5/wmf-config/abusefilter.php 'More perm changes for frwiktionary'
  • 11:59 logmsgbot: catrope synchronized php-1.5/wmf-config/abusefilter.php 'More perm changes for frwiktionary'
  • 11:56 logmsgbot: catrope synchronized php-1.5/wmf-config/abusefilter.php 'More perm changes for frwiktionary'
  • 11:35 logmsgbot: catrope synchronized php-1.5/includes/Profiler.php
  • 11:33 logmsgbot: catrope synchronized php-1.5/wmf-config/InitialiseSettings.php 'bug 25711 - enable AbuseFilter on frwiktionary'
  • 11:33 logmsgbot: catrope synchronized php-1.5/wmf-config/abusefilter.php 'bug 25711 - enable AbuseFilter on frwiktionary'
  • 11:17 logmsgbot: catrope synchronized php-1.5/includes/Profiler.php 'Temp hack to debug fatals in Profiler.php'
  • 10:48 logmsgbot: catrope ran sync-common-all
  • 10:47 RoanKattouw: Enabling FlaggedRevs on sqwiki per bug 25822 and disabling new page patrolling
  • 10:16 RoanKattouw: srv230 SSH is broken from fenari, Nagios disagrees. Commenting out srv230 from /etc/dsh/group/mediawiki-installation . After fixing srv230, uncomment it and resync the box
  • 10:13 logmsgbot: catrope synchronized php-1.5/wmf-config/InitialiseSettings.php 'bug 25674 - Enable $wgBlockAllowsUIEdit on frwiktionary'

November 6

  • 18:59 logmsgbot: catrope synchronized php-1.5/extensions/CodeReview/ui/CodeRevisionView.php 'r76209'
  • 18:56 logmsgbot: catrope synchronized php-1.5/extensions/CodeReview/ui/CodeReleaseNotes.php 'r76208'
  • 18:56 logmsgbot: catrope synchronized php-1.5/extensions/CodeReview/ui/CodeAuthorListView.php 'r76208'
  • 18:56 logmsgbot: catrope synchronized php-1.5/extensions/CodeReview/ui/CodeRevisionListView.php 'r76208'
  • 18:56 logmsgbot: catrope synchronized php-1.5/extensions/CodeReview/api/ApiCodeUpdate.php 'r76208'
  • 18:55 logmsgbot: catrope synchronized php-1.5/extensions/CodeReview/api/ApiCodeRevisions.php 'r76208'
  • 18:55 logmsgbot: catrope synchronized php-1.5/extensions/CodeReview/api/ApiCodeDiff.php 'r76208'
  • 18:55 logmsgbot: catrope synchronized php-1.5/extensions/CodeReview/api/ApiCodeComments.php 'r76208'
  • 18:55 logmsgbot: catrope synchronized php-1.5/extensions/CodeReview/svnImport.php 'r76208'
  • 18:55 logmsgbot: catrope synchronized php-1.5/extensions/CodeReview/CodeReview.i18n.php 'r76208'
  • 18:55 RoanKattouw: Syncing CodeReview update

November 5

  • 23:38 logmsgbot: tfinc synchronized php-1.5/wmf-config/InitialiseSettings.php 'Turning central notice back on for everyone'
  • 23:09 logmsgbot: tfinc synchronizing Wikimedia installation... Revision: 76127
  • 23:09 logmsgbot: tfinc synchronized php-1.5/wmf-config/InitialiseSettings.php 'Turnning off cn on all but testing wikis before scap'
  • 23:08 logmsgbot: tfinc synchronized php-1.5/wmf-config/CommonSettings.php 'Turnning off cn on all but testing wikis before scap'
  • 23:03 logmsgbot: tfinc synchronized php-1.5/wmf-config/CommonSettings.php 'Setting wgCentralDBname to meta'
  • 20:34 logmsgbot: tfinc synchronized php-1.5/wmf-config/CommonSettings.php 'Picking up url fix so that udp2log doesnt double count on meta'
  • 20:33 logmsgbot: tfinc synchronized php-1.5/extensions/CentralNotice/SpecialBannerController.php 'Picking up url fix so that udp2log doesnt double count on meta'
  • 20:15 RobH: There are no longer any memcached servers in the decommissioned server range. If there are any issues from the changes, the original mc.php is named mc.php.old and will be removed in 72 hours if there are no mishaps
  • 20:14 logmsgbot: robh synchronized php-1.5/wmf-config/mc.php 'of course one dies AS i sync it'
  • 20:12 logmsgbot: robh synchronized php-1.5/wmf-config/mc.php 'on secondary review, missed two old servers, removed and updated'
  • 20:08 RobH: tested new memcached config, all servers working
  • 20:08 logmsgbot: robh synchronized php-1.5/wmf-config/mc.php 'removed the older servers below srv150 and replaced with tested good new memcached servers'
  • 19:55 logmsgbot: catrope synchronized images/wikimedia-button.png 'Let's try that with an actual PNG file rather than HTML'
  • 19:50 logmsgbot: catrope synchronized images/wikimedia-button.png 'New Powered by Wikimedia button'
  • 19:11 logmsgbot: catrope synchronized php-1.5/skins/common/images/poweredby_mediawiki_88x31.png 'r76126'
  • 19:00 rfaulk: Added "httpagentparser" Python package on grosley.wikimedia.org from publicly available distutils distribution - this package assists in parsing user-agent header strings found in the 2010/11 fundraiser squid logs
  • 18:58 rfaulk: Added "setuptools" Python package on grosley.wikimedia.org with apt-get for Wikimedia 2010/11 fundraiser work - This package enables installation of python packages distributed with Python distutils
  • 18:22 RobH: srv284 is having some booting issues, seems to be harddisk related, but since drac output is slightly garbled, unable to confirm. new rt# 376
  • 17:54 RobH: srv284 unresponsive to console, rebooting and fixing it to bring it back into service
  • 17:53 RobH: srv266 back online and in service
  • 17:52 mark: Shutdown browne and srv2 for decommissioning - thereby removing the last traces of Fedora from the cluster. Goodbye!
  • 17:43 RobH: srv266 unresponsive to remote console, rebooting and updating
  • 17:42 RobH: srv206 fixed, pushed back into lvs
  • 17:25 RobH: working on srv206, disregard any errors it throws
  • 16:40 RobH: issue with the new api servers is fixed and they are now back in service
  • 16:04 RobH: some new api servers are not working right, depooled until they are fixed
  • 15:58 mark: Removed ibis IPs from Squid ACLs; invalid requests issue has been resolved
  • 15:57 mark: Fixed NFS mounts on apaches that had them missing since the wikimedia-task-appserver upgrade
  • 15:26 RobH: working on sq57, disregard flapping
  • 15:24 RobH: new api apaches srv290-srv301 are online, except srv298 which needs drac correction before installation
  • 15:22 RobH: dropping old entry for tenwiki in apache config and resyncing/restarting apaches to eliminate error message
  • 15:18 RobH: pushing srv291-srv301 into lvs
  • 15:11 RobH: doing puppet runs on srv292-srv301 before pushing them into service
  • 14:57 mark: Hacked out the 'remotemount' lines in /var/lib/dpkg/info/wikimedia-task-appserver.postrm files to prevent apaches from being without NFS mounts during/between puppet runs and package upgrades
  • 14:23 mark: Deploying new package wikimedia-task-appserver 1.46 across the cluster, which removes configuration files (now handled by Puppet)
  • 11:59 logmsgbot: catrope synchronized php-1.5/includes/api/ApiLogin.php 'Revert r76078'
  • 11:49 logmsgbot: catrope synchronized php-1.5/includes/api/ApiLogin.php 'r76078'
  • 05:57 apergos: failure booting into be3 on ms4, had to back out. so, no progress, we are back to where we were before the reboots.
  • 05:40 apergos: cleared up luactivate error, shutdown ms4 again, trying to boot into alt boot environment
  • 05:16 apergos: used shutdown on ms4, be3 showed as "active on reboot" but it booted into be0 (old boot environment) nonetheless. *grumble*
  • 05:06 apergos: rebooted ms4 into alt boot environment with current patches applied
  • 00:18 RobH: new api servers are not copying down the data correctly and not reflecting config changes in puppet, so they fail, srv290+ not online yet

November 4

  • 23:06 RobH: running puppet across the new api servers srv290-srv301 then will push them in service later when i figure out why they are not doing what I want ;P
  • 20:13 RobH: sq51 hates me
  • 20:11 RobH: new api servers srv290-301 are installed and showing in ganglia, having issues getting the first couple to pool into lvs before i push the rest into service
  • 20:09 RobH: fixed sq51
  • 19:29 RoanKattouw: Strike that, have backed out changes
  • 19:06 RoanKattouw: Until Mark's made sure they're good, that is
  • 19:06 RoanKattouw: Changing some files in wmf-deployment/includes/media . DO NOT RUN SCAP or otherwise deploy these changes!
  • 18:35 RobH: added dns entries for payments
  • 17:59 RobH: doing puppet runs and final setup for srv290-srv301
  • 16:56 rfaulk: Added numpy Python package to grosley.wikimedia.org with apt-get ... For use in the 2010/11 fundraiser to facilitate stats gathering by providing scientific computing functionality in Python
  • 16:43 rfaulk: Added MySQLdb Python package on grosley.wikimedia.org with apt-get ... This package will be used to access fundraising databases to facilitate the gathering and synthesis of relevant statistics for the 2010/11 Wikimedia fundraiser
  • 16:23 mark: Set storage1 (varnish) as upload backend on sq41-50, instead of ms4
  • 16:14 RobH: sq59 is being bitchy and won't clean the cache, possible hdd issue? will investigate later
  • 15:42 RobH: sq35 back in rotation
  • 15:34 mark: Added storage1 (varnish->ms4) as an HTTP backend to sq45's squid config
  • 15:34 RobH: commenting out sq35, trying to make it work again in pybal
  • 15:16 RobH: poking at sq59
  • 15:06 RobH: sq35 back online, pushed into lvs, partially up - may need to wait up to 5 for idleconnect timer
  • 14:46 RobH: pushed dns updates for new payments boxes and correcting owadb1/2 to db31/32
  • 14:28 RobH: sq35 set to false in pybal until i determine whats wrong with it
  • 14:09 mark: Reduced CARP weight of sq41-50 from 10 to 5
  • 13:37 RobH: sq35 may flag, disregard
  • 13:30 RoanKattouw: Removed uploadwizard test wiki on prototype, gonna set it up on the Commons prototype instead
  • 04:17 atglenn: ganglia 3.1 now running on ms4 and ms5
  • 01:44 RobH: srv217 back in cluster
  • 00:36 RobH: torrus back online
  • 00:29 RobH: fixing torrus deadlock, no touchy
  • 00:18 tomaszf: upped open file descriptors on loudon to 4096 for squid
  • 00:17 RobH: kicking srv217 for reinstall

November 3

  • 21:22 RobH: updated puppet to properly remove memcached from memcached::false entries and removed the host memcached check for servers no longer running memcached, hup'd nagios to take the change
  • 21:21 atglenn: rebooting ms5 after OS update. note that we were unable to get some of the more recent patches, they are probably from after the sun->oracle transition
  • 21:02 logmsgbot: nimishg synchronized php-1.5/extensions/LandingCheck/LandingCheck.i18n.php 'r75890'
  • 21:02 logmsgbot: nimishg synchronized php-1.5/extensions/LandingCheck/LandingCheck.alias.php 'r75890'
  • 21:01 logmsgbot: nimishg synchronized php-1.5/extensions/LandingCheck/SpecialLandingCheck.php 'r75890'
  • 21:01 logmsgbot: nimishg synchronized php-1.5/extensions/LandingCheck/LandingCheck.php 'r75890'
  • 20:31 atglenn: removed about 1.5T of stuff off of /export on ms4 (old backups, solaris isos, etc)
  • 19:41 logmsgbot: catrope synchronized php-1.5/README 'Dummy sync so I can document what the errors look like'
  • 19:32 logmsgbot: tfinc synchronized php-1.5/wmf-config/CommonSettings.php 'Backing out config change for stats fix'
  • 19:31 RobH: srv281 still down, setting to false in pybal just so it doesn't keep trying to use it
  • 18:31 RobH: reinstalling srv281, tired of lookin at it in red
  • 17:18 mark: Upgraded storage1 to Lucid
  • 16:42 mark: Removing 2010-03 snapshots on ms4
  • 16:01 mark: Fixed sshd on ms4
  • 15:46 mark: Removing 2010-02 snapshots on ms4
  • 15:45 mark: Disabled gmetric cron jobs on ms4
  • 15:43 mark: Disabled daily snapshot generation on ms4
  • 15:27 mark: Restarted gmond on ms4
  • 15:24 mark: Upgraded puppet on ms4
  • 15:13 mark: Powercycled knsq2
  • 14:52 mark: Removing daily snapshots for 2010-10 on ms4
  • 14:24 mark: Restored /etc/sudoers file on DB machines butchered by old versions of wikimedia-raid-utils
  • 05:34 logmsgbot: tstarling synchronized php-1.5/includes/Math.php 'r75909'
  • 04:52 apergos: oh btw, I notice that when / on the squids fills, we don't see it in ganglia, it must report an aggregate or something. it would sure be nice to get notified.
  • 04:18 apergos: lather rinse repeat for sq47, I hope that's all of 'em
  • 03:46 apergos: repeated on sq45...
  • 03:13 apergos: same old story on sq46... restarted syslog, reloaded squid, got back some space on /
  • 02:41 apergos: er... and deleted the log file :-P
  • 02:38 apergos: moved ginormous cache.log out of the way on sq48 and reloaded squid over there since it wasn't done earlier
  • 02:32 apergos: cleaned up / on sq41, restarted syslog, reloaded squid
  • 00:59 logmsgbot: nimishg synchronized php-1.5/wmf-config/InitialiseSettings.php
  • 00:53 logmsgbot: nimishg synchronizing Wikimedia installation... Revision: 75891
  • 00:33 apergos1: also 44 and 43
  • 00:30 apergos1: cleaning up space on other / full squids: sq42

November 2

  • 23:22 apergos: same story on sq50, cleared out some space, tried upping that to 300 but started seeing TCP connection to 208.80.152.156 (208.80.152.156:80) failed in the logs so backed off to 200
  • 23:13 apergos: trying adjusting max-conn on sq49 for conns to ms4... tried 200, it maxed out. trying 300 now...
  • 23:08 apergos: hupped squid on sq49, restarted syslog, / was full from "Failed to select source" errors, cleared out some space
  • 23:08 logmsgbot: tfinc synchronized php-1.5/wmf-config/CommonSettings.php 'Updating sidebar links'
  • 22:40 apergos: added in the amssq47 through amssq62 to /etc/squid/cachemgr.conf on fenari
  • 19:48 RobH: torrus back online
  • 19:44 RobH: following procedure on wikitech to fix torrus
  • 16:46 RobH: sq42 & sq44 behaving normally now, cleaning cache on sq48 and killing squid for restart as it is flapping and at high load, due to earlier nfs issue
  • 16:38 RobH: restarting and cleaning backend squid on sq44 and sq42 which were complaining in lvs
  • 16:35 RobH: sq43 was flapping since the nfs mount on ms4 was borked. restarted it
  • 16:07 apergos: NFSD_SERVERS=2048 in /etc/default on ms4
  • 16:06 apergos: note that the variable rpcmod:cotsmaxdupreqs has been changed to 2048 in /etc/system, and
  • 15:54 apergos: hard reset on ms4, reboot was not getting the job done
  • 15:47 apergos: rebooting ms4, nfsd hung and couldn't be restarted or killed.
  • 14:04 RobH: restarted pdns on linne due to crash from authdns update
  • 14:02 RobH: updated dns with new mgmt entries for payments, owasrvs, and owadbs
  • 03:45 domas: added srv193 back to apaches pool on lvs

November 1

  • 23:55 logmsgbot: tfinc synchronized php-1.5/extensions/CentralNotice/SpecialBannerController.php 'Picking up fixes for Bug #25564'
  • 23:54 logmsgbot: tfinc synchronized php-1.5/extensions/CentralNotice/CentralNotice.php 'Picking up fixes for Bug #25564'
  • 20:42 domas: ms4 mildly loaded (disks go to >100i/s each) throwing nfs timeouts, I bumped up NFSD_SERVERS to 2048
  • 19:05 Ryan_Lane: powercycling srv207
  • 16:18 RoanKattouw: Something weird's going on with srv207: Nagios says its SSH is up but it times out on SSH from fenari
  • 16:15 logmsgbot: catrope synchronized php-1.5/includes/api/ApiBase.php 'r75798'
