Server admin log/Archive 20

From Wikitech
Revision as of 18:44, 16 August 2010 by JeLuF (Talk | contribs)

August 16

  • 18:43 logmsgbot: root synchronized php-1.5/extensions/MWSearch/MWSearch_body.php 'Temporarily disable part of MWSearch'
  • 18:38 logmsgbot: jeluf synchronized php-1.5/wmf-config/InitialiseSettings.php '17627 - Allow autoconfirmed users to patrol on ar Wikisource'
  • 18:08 Fred: MWSearch extension is having issues...
  • 17:36 Fred: API servers all started segfaulting. Restarting apache for the time being
  • 04:50 logmsgbot: jeluf synchronized php-1.5/wmf-config/abusefilter.php '24304 - Reconfigure English Wikibooks'
  • 04:48 logmsgbot: jeluf synchronized php-1.5/wmf-config/flaggedrevs.php '24304 - Reconfigure English Wikibooks'

August 15

  • 22:42 domas: resynced db11 and db17 from db27, db33 from ixia, db19 from db1, with accompanying BIOS flashing to 3.0 and OS reinstalls. decommissioned ixia and db1
  • 22:40 logmsgbot: midom synchronized php-1.5/wmf-config/db.php 'drumrolllll'
  • 22:04 logmsgbot: midom synchronized php-1.5/wmf-config/db.php
  • 21:56 logmsgbot: midom synchronized php-1.5/wmf-config/db.php
  • 21:49 logmsgbot: midom synchronized php-1.5/wmf-config/db.php
  • 21:39 logmsgbot: jeluf synchronized php-1.5/wmf-config/InitialiseSettings.php '24520 - Transwiki import source for ml.wikiquote.org'
  • 20:55 logmsgbot: jeluf synchronized php-1.5/wmf-config/InitialiseSettings.php '24458 - Enable subpages on frwikisource'
  • 20:39 logmsgbot: jeluf synchronized php-1.5/wmf-config/InitialiseSettings.php '24623 - Enable 'eliminator' flag on ptwiki'
  • 20:36 logmsgbot: jeluf synchronized php-1.5/wmf-config/InitialiseSettings.php '24304 - Reconfigure English Wikibooks'
  • 20:36 logmsgbot: jeluf synchronized php-1.5/wmf-config/flaggedrevs.php '24304 - reconfigure enwikibooks'
  • 20:35 logmsgbot: jeluf synchronized php-1.5/wmf-config/abusefilter.php '24304 - reconfigure enwikibooks'
  • 20:32 logmsgbot: midom synchronized php-1.5/wmf-config/db.php
  • 20:14 logmsgbot: jeluf synchronized php-1.5/wmf-config/InitialiseSettings.php '24623 - Enable 'eliminator' flag on ptwiki'
  • 20:09 logmsgbot: jeluf synchronized php-1.5/wmf-config/InitialiseSettings.php '24321 - ml.wikiquote.org lost its project namespace'
  • 20:04 logmsgbot: midom synchronized php-1.5/wmf-config/db.php
  • 19:54 logmsgbot: jeluf synchronized php-1.5/wmf-config/InitialiseSettings.php '24777 - Request for a patrolling function on the Nynorsk (nn) Wikipedia'
  • 19:48 logmsgbot: jeluf synchronized php-1.5/wmf-config/InitialiseSettings.php '24535 - Enable "mark as patrolled" feature in hindi wiki'
  • 19:44 logmsgbot: jeluf synchronized php-1.5/wmf-config/InitialiseSettings.php '24394 - Install AbuseFilter on Hindi Wikipedia'
  • 19:41 logmsgbot: jeluf synchronized php-1.5/wmf-config/InitialiseSettings.php '24336 - Change Simplewiki's (EN) autoconfirm time/edit rates'
  • 19:36 logmsgbot: jeluf synchronized php-1.5/wmf-config/InitialiseSettings.php '24790 - Localize Wikipedia sitename in devanagari'
  • 19:33 JeLuF: fixed zero byte thumbnail of commons:Shkval_head.jpg
  • 19:19 logmsgbot: jeluf synchronized php-1.5/wmf-config/InitialiseSettings.php '24789 - Enable AbuseFilter for ja.wikipedia'
  • 19:18 logmsgbot: jeluf synchronized php-1.5/wmf-config/abusefilter.php '24789 - Enable AbuseFilter for ja.wikipedia'
  • 19:14 logmsgbot: midom synchronized php-1.5/wmf-config/db.php
  • 18:44 logmsgbot: midom synchronized php-1.5/wmf-config/db.php
  • 18:24 logmsgbot: midom synchronized php-1.5/wmf-config/db.php
  • 17:44 logmsgbot: midom synchronized php-1.5/wmf-config/db.php
  • 16:23 logmsgbot: midom synchronized php-1.5/wmf-config/db.php

August 14

  • 20:15 mark: Decommissioning srv150
  • 19:56 logmsgbot: jeluf synchronized php-1.5/wmf-config/InitialiseSettings.php '24789 - Enable AbuseFilter for ja.wikipedia'
  • 19:52 logmsgbot: jeluf synchronized php-1.5/wmf-config/InitialiseSettings.php '24626 - Add an "autopatrolled" status for frwiktionary'
  • 15:37 mark: dobson has failed RAID1 array member /dev/sda. Running long SMART self test on /dev/sda
  • 14:18 logmsgbot: mark synchronized php-1.5/wmf-config/db.php 'Add ms2 and ms1 to clusters rc1 and cluster22'
  • 14:06 mark: FLUSH TABLES WITH READ LOCK on ms1 for testing
  • 13:59 mark: Stopping mysql on ms1 as monitoring test
  • 13:59 mark: Granted SELECT on mysql.* to nagios on ms3
  • 10:57 mark: Removed oldest LVM snapshot on ixia
  • 09:43 mark: Fixed apparmor profile /etc/apparmor.d/usr.sbin.mysqld on ms1, restarted mysql under apparmor
  • 09:39 mark: START SLAVE on ms1, catching up with ms3
  • 09:38 mark: RESET SLAVE on db5
  • 09:37 mark: STOP SLAVE on db5
  • 09:35 mark: Stopped apparmor on ms1
  • 08:41 Andrew: Leaving as-is for now, hoping somebody with appropriate permissions can fix it later.
  • 08:40 Andrew: STOP SLAVE on db5 gives me ERROR 1045 (00000): Access denied for user: 'wikiadmin@208.80.152.%' (Using password: NO)
  • 08:34 Andrew: Slave is supposedly still running on db5. Assuming Roan didn't stop it when he switched masters a few days ago. Going to text somebody to confirm that stopping is correct course of action.
  • 08:24 Andrew: db5 can't be lagged, it's the master ;-). Obviously something wrong with wfWaitForSlaves.
  • 08:19 Andrew: db5 lagged 217904 seconds
  • 05:09 Andrew: Ran thread_pending_relationship and thread_reaction schema changes on all LiquidThreads wikis
  • 05:06 logmsgbot: andrew synchronizing Wikimedia installation... Revision: 70933
  • 05:04 Andrew: About to update LiquidThreads production version to the alpha.
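
The 08:19 lag figure for db5 can be turned into a wall-clock starting point. The sketch below is only a rough check: it assumes every timestamp in this log is in one time zone, takes the year from the page's revision date, and treats the reported lag as a plain elapsed-time counter.

```python
from datetime import datetime, timedelta

# Values copied from the 08:19 entry; the year is taken from the revision date.
reported_at = datetime(2010, 8, 14, 8, 19)
lag = timedelta(seconds=217904)

# Work backwards to the moment the lag counter would have started.
lag_origin = reported_at - lag

print(lag)         # 2 days, 12:31:44
print(lag_origin)  # 2010-08-11 19:47:16
```

Under those assumptions the counter started on the evening of August 11, shortly before the 20:07 db5 entry logged that day. That fits a leftover replication thread on db5 rather than real lag, but it does not prove it.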

August 13

  • 22:03 mark: API logins on commons (only) are reported broken
  • 21:45 mark: Set correct $cluster variable for reinstalled knsq* squids
  • 21:03 mark: Increased cache_mem from 1000 to 2500 on sq33, like the other API backend squids
  • 20:58 mark: Stopping backend squid on sq33
  • 20:50 logmsgbot: jeluf synchronized php-1.5/wmf-config/InitialiseSettings.php '24769 - Import source addition for tpi.wikipedia.org'
  • 17:46 Fred: and srv100
  • 17:45 Fred: restarted apache on srv219 and srv222
  • 15:57 logmsgbot: mark synchronized php-1.5/wmf-config/mc.php 'Remove some to-be-decommissioned from the down list'
  • 15:56 logmsgbot: mark synchronized php-1.5/wmf-config/mc.php 'Remove some to-be-decommissioned hosts from the down list'
  • 15:53 RobH: srv146 removed from puppet and nodelists, slated for wipe, decommissioned.
  • 15:47 mark: Sent srv146 to death using echo b > /proc/sysrq-trigger. It had a read-only filesystem and is therefore decommissioned.
  • 15:38 mark: Restarted backend squid on sq33
  • 15:36 logmsgbot: mark synchronized php-1.5/wmf-config/mc.php 'Remove some to-be-decommissioned hosts from the down list'
  • 15:25 mark: Reinstalled sq32 with Lucid
  • 15:01 mark: Removed sq86 and sq87 from API LVS pool
  • 14:55 mark: sq80 had been down for a long time. Brought it back up and synced it
  • 14:54 rainman-sr: all of the search cluster restored to pre-relocation configuration
  • 14:34 logmsgbot: robh synchronized php-1.5/wmf-config/lucene.php 'reverting search13 to search11'
  • 13:55 mark: /dev/sda on sq57 is busted
  • 13:54 RobH: removed search17 from search_pool_3
  • 13:50 mark: Set idleconnection.timeout = 300 (NOT idlecommand.timeout) on all LVS services on lvs3, restarting pybal
  • 13:44 mark: powercycled sq57, which was stuck in [16538652.048532] BUG: soft lockup - CPU#3 stuck for 61s! [gmond:15746]
  • 13:42 mark: sq58 was down for a long long time. Brought it back up and synced it
  • 13:37 RobH: added search7 back into search_pool_3, kept search17 in as well
  • 13:27 RobH: changed search_pool_3 back from search7 to search17 since it failed
  • 13:25 logmsgbot: robh synchronized php-1.5/wmf-config/lucene.php 'Re-enabling LucenePrefixSearch - pushed changes on lvs3 to put search back to normal use'
  • 12:45 mark: API squid cluster is too flaky for my taste. Converting sq33 into an API backend squid as well
  • 12:40 mark: Shutdown puppet and backend squid on sq32
  • 11:41 mark: Corrected changed hostname for api.svc.pmtpa.wmnet in text squid config files
  • 11:37 mark: Temporarily rejecting requests to sq31 backend to give it some breathing room while it's reading its COSS dirs
  • 11:32 mark: Reinstalled sq31 with Lucid
  • 10:25 mark: Shutting down backend squid on sq31 to see the load impact
  • 10:18 mark: Setup backend request statistics for the API on torrus
  • 09:15 rainman-sr: bringing up search1-12 and doing some initial index warmups
  • 01:53 RobH: searchidx1, search1-search12 relocated and online, not in cluster until Robert can fix in the morning. The other half will have to move on a different day, 12 hours in the datacenter is long enough.
  • 01:40 RobH: finished moving searchidx1 and search1-12, bringing them back up now

August 12

  • 23:10 RobH: shutting down searchidx1, search1-12 for move
  • 22:40 logmsgbot: robh synchronized php-1.5/wmf-config/lucene.php 'swapped search13 and search18 for migration'
  • 22:37 logmsgbot: robh synchronized php-1.5/wmf-config/lucene.php 'reverting so search13 and search18 can change roles'
  • 22:22 logmsgbot: robh synchronized php-1.5/wmf-config/lucene.php 'changes back in place to migrate searchidx1 and search1-10'
  • 22:19 RobH: puppet updated on all search servers, confirmed all have all three lvs ip addresses
  • 21:55 mark: Configured puppet to bind all LVS service IPs to all search servers
  • 21:54 RobH: reverted search_pool changes on lvs
  • 21:54 logmsgbot: robh synchronized php-1.5/wmf-config/lucene.php 'rolling it back'
  • 21:48 logmsgbot: robh synchronized php-1.5/wmf-config/lucene.php 'changing settings for migration of searchidx1 and search1-search12'
  • 21:43 RobH: changing lvs3 search pool settings for server relocations
  • 20:33 logmsgbot: robh synchronized php-1.5/wmf-config/lucene.php 'commented out wgEnableLucenePrefixSearch for search server relocation'
  • 19:30 RobH: srv281 reinstall done but not online as puppet has multiple package issues, leaving out of lvs
  • 19:09 RobH: srv230 is on, but set to false in lvs. do not push back into rotation until after new memory arrives and is installed tomorrow (rt#69)
  • 18:59 logmsgbot: robh synchronized php-1.5/wmf-config/mc.php 'updating without srv230'
  • 18:53 RobH: srv230 coming down for memory testing
  • 18:49 RobH: set srv230 to false in lvs, need to test memory
  • 18:04 RobH: reinstalling srv281
  • 17:59 RobH: nix that, srv125 was ex-es, leaving those for now.
  • 17:58 RobH: pulling srv103 & srv125 for wipe (pulling stuff with temp warnings first)
  • 17:53 logmsgbot: robh synchronized php-1.5/wmf-config/mc.php 'removed srv103, replacing it with srv244'
  • 17:47 RobH: pulling srv95 for wipe
  • 17:38 RobH: srv110 removed from lvs3 config
  • 17:36 mark: Removed all apaches up to srv150 from the appserver LVS pool on lvs3
  • 17:21 Fred: restarting apache on webservers (220,221,222,224)
  • 16:45 RobH: wipe running on adler and amane, and they have been removed from puppet and dsh node groups
  • 16:12 logmsgbot: jeluf synchronized docroot/bits/index.html
  • 15:41 mark: Setup ports ge-2/0/0 to ge-2/0/20 for search servers on asw-b-sdtpa
  • 15:03 mark: Shutdown BGP session to AS1257 130.244.6.249 on port 2/5 of br1-knams, preparing for cable move
  • 13:08 mark: Recovered backend squid on knsq11
  • 12:53 mark: Reassembling RAID arrays md0 and md1 on knsq11
  • 12:40 mark: Running apt-get upgrade && reboot on amssq31
  • 11:17 mark: Shutdown knsq1 and knsq11 for swapping drives
  • 09:34 logmsgbot: catrope synchronized php-1.5/extensions/TitleBlacklist/TitleBlacklist.hooks.php 'r70933'
  • 09:08 logmsgbot: catrope synchronized php-1.5/wmf-config/InitialiseSettings.php 'bug 24710: Enable $wgVectorShowVariantName on srwiki'
  • 06:00 logmsgbot: jeluf synchronized php-1.5/cache/interwiki.cdb 'Updating interwiki cache'
  • 05:42 JeLuF: Bug 24736 - Update wikimania.wikimedia.org to point to wm2011
  • 02:43 RobH: dataset1 back online, serving http, any jobs in process are borked (sorry ariel!)
  • 02:39 RobH: dataset1 unresponsive to physical console, serial console, had to do a hard reset
  • 02:05 RobH: dataset1 crashed while querying its raid controller about a bad disk, en route to the dc to fix.

August 11

  • 21:56 logmsgbot: jeluf synchronized php-1.5/wmf-config/InitialiseSettings.php '24561 - Quiz on Polish Wikibooks'
  • 21:53 logmsgbot: jeluf synchronized php-1.5/wmf-config/InitialiseSettings.php '21375 - new wiki as internal working space for the fiwiki arbcom'
  • 21:48 logmsgbot: jeluf synchronized php-1.5/wmf-config/InitialiseSettings.php '24486 - Create Appendix namespace on the Luxembourgish Wiktionary'
  • 21:46 logmsgbot: jeluf synchronized php-1.5/wmf-config/InitialiseSettings.php '24374 - Create new usergroups at commons'
  • 21:28 logmsgbot: jeluf synchronized php-1.5/wmf-config/InitialiseSettings.php '24460 - Set up transwiki import for lb.wiktionary'
  • 21:10 logmsgbot: jeluf synchronized php-1.5/wmf-config/InitialiseSettings.php '24626 - Add an "autopatrolled" status for frwiktionary'
  • 21:06 RobH: db16 will remain offline until replacement parts arrive from Sun. rt#54
  • 21:02 logmsgbot: jeluf synchronized php-1.5/wmf-config/InitialiseSettings.php '24570 - Request'
  • 20:42 mark: Fixed ganglia mess on ms1
  • 20:37 mark: Started rsync of ms2:/a to ms1:/a
  • 20:32 logmsgbot: jeluf synchronized php-1.5/wmf-config/InitialiseSettings.php '24685 - Author, Index, Page namespace for id.wikisource'
  • 20:30 mark: FLUSH TABLES WITH READ LOCK on ms2
  • 20:27 mark: Readded spare /dev/mdak1 to /dev/md1 on ms1. Why do spares go missing all the time...
  • 20:26 mark: Upgraded ms1 to Lucid, rebooted it
  • 20:26 RobH: working on db16
  • 20:24 logmsgbot: jeluf synchronized php-1.5/wmf-config/InitialiseSettings.php '24719 - Extension'
  • 20:07 logmsgbot: midom synchronized php-1.5/wmf-config/db.php
  • 20:07 domas: promoted db5 to slave on s4
  • 19:53 mark: Upgrading ms1 to Lucid
  • 19:44 mark: Readded missing spare drive to /dev/md1 on ms1
  • 17:25 logmsgbot: robh synchronized php-1.5/wmf-config/mc.php 'removed srv95 as it has temp warnings and is going to go away soon.'
  • 16:41 RobH_dc: pulled network on srv110 and started wipe, byebye
  • 16:35 RobH_dc: db19 eth1 disconnected per dc tasks
  • 16:24 RobH_dc: ms1's power cable was causing problems; rerouted it so it sits securely in place, and the system is now operating normally (hopefully no more sudden shutdowns). Logs showed no evidence of hardware failure, only power issues, so this should fix it.
  • 16:02 RobH_dc: working on ms1
  • 15:22 RobH_dc: bad disk replaced in db7, raid is currently rebuilding, system still online.
  • 15:10 RobH_dc: pulled hdd5 from db7 for replacement
  • 15:02 mark: Shutdown clematis for decommissioning
  • 14:18 RobH: knsq11,knsq12,knsq13 are post os reinstall, pre squid deployment config, will finish them in a bit
  • 14:07 logmsgbot: robh synchronized php-1.5/wmf-config/InitialiseSettings.php 'Bug 24441 - Enable Rollback in Quechua Wikipedia'
  • 13:33 RobH: knsq11-knsq13 coming down for reinstallation
  • 12:23 Tim: deployed non-threaded version of imagemagick on all image scalers
  • 11:43 logmsgbot: tstarling synchronized php-1.5/includes/media/Bitmap.php 'OMP_NUM_THREADS=1'
  • 11:21 mark: Reconfigured wikimedia-lvs-realserver on hume, so wikimedia-task-appserver install succeeds
  • 11:19 logmsgbot: tstarling synchronized php-1.5/includes/media/Bitmap.php 'reduced magick memory limit from 100M to 50M to stop hanging with vsize limit 300M'
  • 10:46 mark: Removed pattern check from nagios check_http
  • 09:42 logmsgbot: tstarling synchronized php-1.5/wmf-config/CommonSettings.php
  • 09:38 logmsgbot: tstarling synchronized php-1.5/wmf-config/CommonSettings.php
  • 09:35 Tim: rebooting srv223, went OOM and mostly died
  • 09:32 logmsgbot: tstarling synchronized php-1.5/includes/media/Bitmap.php 'temporary patch to stop scalers going OOM'
  • 09:19 Tim: temporarily increased memory limit on the image scalers, since the new convert tends to hang when it runs out of memory instead of crashing nicely
  • 09:17 logmsgbot: tstarling synchronized php-1.5/wmf-config/CommonSettings.php 'more memory for image scalers'
  • 08:56 Tim: upgrading imagemagick on image scalers to 6.6.2.6-1wm1, package recently committed to svn
  • 02:48 Tim: on techblog, disabled WP_DEBUG since it was messing up the admin panels with E_NOTICE messages
  • 02:42 Tim: disabled WP-SpamFree on techblog due to bug 19540

August 10

  • 23:12 Fred: upgraded Tridge to Lucid. Now rebooting.
  • 22:04 RobH: knsq10 back online
  • 20:59 RobH: knsq10 reinstalling
  • 20:44 RobH: knsq9 online
  • 19:37 RobH: handed off knsq8 to mark, reinstalling knsq9
  • 19:02 ^demon: disabled svn post-commit hook for parser tests, long-since broken
  • 18:57 mark: Stopping backend squid on amssq60 for testing
  • 15:24 RobH: knsq8 reinstalled, not yet online, will push online shortly
  • 14:56 mark: Setup RT on rt.wikimedia.org (streber)
  • 14:32 RobH: knsq30 online and in cluster, knsq8 coming down for work
  • 14:18 RobH: updated wordpress versions on blog.wikimedia.org and techblog.wikimedia.org
  • 13:35 RobH: finishing install on knsq30
  • 12:50 Tim: installed schroot on stafford, for hardy versions of uupdate etc.
  • 11:19 mark: Fixed broken hourly cron job mw-serve
  • 11:18 mark: Changed su www-data into su mwlib in cleanup cronjob on pdf1
  • 10:23 mark: Removed broken daily system health report on srv178
  • 10:22 mark: Removed broken daily system health report on db4
  • 07:13 logmsgbot: andrew synchronized php-1.5/extensions/CommunityApplications/SpecialCommunityApplications.php 'Merge r70798'
  • 07:13 logmsgbot: andrew synchronized php-1.5/extensions/CommunityApplications/CommunityApplications.i18n.php 'Merge r70798'
  • 04:02 RobH: knsq30 set to false in pybal, install half done, will finish tomorrow morning.
  • 02:44 RobH: knsq29 online and in cluster
  • 02:30 RobH: knsq30 reinstalling
  • 00:09 RobH: knsq28 back online
  • 00:03 RobH: knsq27 back online
  • 00:03 RobH: knsq29 reinstalling

August 9

  • 23:34 RobH: knsq28 reinstalling
  • 23:32 RobH: knsq26 online
  • 23:32 RobH: knsq25 online
  • 23:12 RobH: continuing reinstallation, ignore errors for knsq27, reinstalling
  • 22:33 RobH: knsq23, knsq24 back online, knsq25, knsq26 still being reinstalled, knsq27-30 still online not yet reinstalled
  • 21:51 mark: Added Nagios router interfaces check for br1-knams (using puppet)
  • 21:11 mark: Unmounted /dev/sda6 (/a) on srv171, replaced it by /dev/mapper/nonredundant-data (LV with the same data and more space)
  • 21:02 RobH: knsq24, knsq25, knsq26, knsq27 coming down for reinstall and puppetfication
  • 20:49 RobH: knsq23 reinstall done and pushed back into cluster
  • 18:40 mark: Running apt-get upgrade on db9
  • 18:29 mark: Fixed ganglia mess on sq45
  • 18:24 mark: Powercycled sq45
  • 18:18 mark: Added a new MegaCli64 to wikimedia-raid-utils, made check-raid.py use it instead (we have all 64 bit servers anyway), and deployed the new package to the repository. Puppet will upgrade it everywhere.
  • 16:26 Fred: fixed DPKG issue on transcode... another one of those conflicting gmond install
  • 16:14 logmsgbot: catrope synchronized php-1.5/wmf-config/InitialiseSettings.php 'bug 24735: Sanitize private/fishbowl config'
  • 16:03 logmsgbot: catrope synchronized php-1.5/wmf-config/InitialiseSettings.php 'bug 24732: Portal and Book namespaces for yowiki'
  • 13:59 mark: srv110 decommissioned itself
  • 13:55 RobH: knsq23 coming down for reinstallation
  • 13:31 mark: Changed broken HTTP nagios check for Squid on brewster into a TCP port check
  • 13:28 mark: Stopped MySQL on srv171, created LVM PV,VG and LV on unused drive /dev/sdb. Copying MySQL data onto it.
  • 13:09 mark: START SLAVE on srv171 to get rid of relay binlogs
  • 12:56 mark: Shutdown db3 for decommissioning
  • 12:56 RoanKattouw: Mark 12:53 Shutdown db2 for decommissioning
  • 12:56 RoanKattouw: 12:52 mark synchronized php-1.5/wmf-config/db.php 'Remove db3 from rotation, decommissioning'
  • 12:55 RoanKattouw: Mark 12:47 Power cycled pdf3, out of memory
  • 12:55 RoanKattouw: Mark 12:44 Restarted Apache on srv91
  • 12:55 RoanKattouw: Mark 12:39 Relaxed NTP peers check for dobson and linne (NTP servers)
  • 12:55 RoanKattouw: Mark 12:36 Shutdown adler for decommissioning
  • 12:54 RoanKattouw: Mark 12:18 Made disk space on mchenry by DELETING LOTS OF OLD BACKUPS
  • 12:53 RoanKattouw: Restarted morebots

August 8

August 7

  • 11:58 logmsgbot: mark synchronized php-1.5/wmf-config/db.php 'New master: db18, r/w'
  • 11:57 mark: Changed master for s3 to db18 on db11, db27, db25
  • 11:49 mark: New master db18 log position: db18.bin.001 pos 79
  • 11:32 logmsgbot: mark synchronized php-1.5/wmf-config/db.php 'Setting s3 to read-only'
  • 11:21 mark: Stopping mysql on db17
  • 11:21 mark: For reference, SHOW SLAVE STATUS on db18 before the switch:
      Master_Log_File: db17-bin.368
  Read_Master_Log_Pos: 650276717
       Relay_Log_File: db18-relay-bin.048
        Relay_Log_Pos: 650276247
Relay_Master_Log_File: db17-bin.368
     Slave_IO_Running: No
    Slave_SQL_Running: Yes
  • 11:02 RoanKattouw: All s3 slaves down, master serving all read load and getting overloaded
  • 11:00 RoanKattouw: db17 (s3 master) has full disk
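
The SHOW SLAVE STATUS excerpt recorded at 11:21 can be read mechanically. The following is a minimal sketch, not the procedure used during the switch: `parse_slave_status` and `sql_thread_on_last_fetched_file` are hypothetical helper names, and the input is only the seven fields quoted above. Because `Exec_Master_Log_Pos` is not in the excerpt, the check can confirm that the SQL thread was working in the same master binlog file the I/O thread last read, but not that every fetched event had been applied.

```python
# Fields copied verbatim from the 11:21 entry above.
STATUS_TEXT = """\
      Master_Log_File: db17-bin.368
  Read_Master_Log_Pos: 650276717
       Relay_Log_File: db18-relay-bin.048
        Relay_Log_Pos: 650276247
Relay_Master_Log_File: db17-bin.368
     Slave_IO_Running: No
    Slave_SQL_Running: Yes
"""


def parse_slave_status(text):
    """Turn 'Key: value' lines into a dict, skipping lines without a colon."""
    status = {}
    for line in text.splitlines():
        if ":" not in line:
            continue
        key, _, value = line.partition(":")
        status[key.strip()] = value.strip()
    return status


def sql_thread_on_last_fetched_file(status):
    """True if the SQL thread executes from the binlog file the I/O thread last read."""
    return status["Relay_Master_Log_File"] == status["Master_Log_File"]


status = parse_slave_status(STATUS_TEXT)
print(status["Slave_IO_Running"], sql_thread_on_last_fetched_file(status))
```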

August 6

  • 22:59 logmsgbot: catrope synchronized php-1.5/wmf-config/CommonSettings.php 'Bump style version appendix'
  • 22:59 logmsgbot: catrope synchronized php-1.5/includes/GlobalFunctions.php 'r70605'
  • 22:58 logmsgbot: catrope synchronized php-1.5/skins/vector/main-rtl.css 'r70605'
  • 22:56 logmsgbot: catrope synchronized php-1.5/skins/vector/main-ltr.css 'r70605'
  • 15:43 logmsgbot: catrope synchronized php-1.5/languages/messages/MessagesCs.php 'r70573'
  • 15:08 logmsgbot: catrope synchronized php-1.5/wmf-config/InitialiseSettings.php 'bug 24688: Namespace aliases for kowiki'
  • 10:45 logmsgbot: catrope synchronized php-1.5/extensions/WikimediaMessages/WikimediaLicenseTexts.i18n.php 'r70550'
  • 10:11 logmsgbot: catrope synchronized php-1.5/wmf-config/InitialiseSettings.php 'Add foundationwiki addgroups in correct section'

August 5

  • 21:14 logmsgbot: catrope synchronized php-1.5/wmf-config/InitialiseSettings.php 'bug 24678: Set $wgAddGroups, $wgRemoveGroups on foundationwiki'
  • 14:30 RobH: replacing certificate file on sanger

August 4

  • 22:57 rainman-sr: search1 somehow got stuck, restarting
  • 22:42 Fred: restarted Nagios bot
  • 16:28 mark: Reverted Fred's automatic security upgrades in puppet
  • 16:25 mark: base::puppet and base::apt were not being included on every Linux host, fixed the case statement in the Puppet base class
  • 11:35 logmsgbot: catrope synchronized php-1.5/wmf-config/InitialiseSettings.php 'bug 24652: Allow bureaucrats to add/remove communityapps group on officewiki'

August 3

  • 22:58 mark: Changed /etc/default/exim4 to make exim listen on SMTP
  • 22:56 mark: s/srv9/grosley/ on /etc/exim4/exim4.conf on grosley
  • 21:20 RobH: grosley exim config was overwritten when converted to puppet control (default install uses simple exim setup now). defined host specifically, adding in ganglia details and removing exim control from puppet
  • 20:21 logmsgbot: catrope synchronized php-1.5/wmf-config/InitialiseSettings.php 'bug 24652: Archive namespace for officewiki'
  • 20:10 logmsgbot: catrope synchronized php-1.5/wmf-config/CommonSettings.php 'Bump style version appendix'
  • 20:09 logmsgbot: catrope synchronized php-1.5/extensions/UsabilityInitiative/WikiEditor/WikiEditor.combined.min.js 'r70409'
  • 15:59 logmsgbot: catrope synchronized php-1.5/languages/messages/MessagesEo.php 'r70387'
  • 14:24 RobH: knsq7 coming down for reinstall
  • 04:57 Andrew: scap
  • 04:57 logmsgbot: andrew synchronizing Wikimedia installation... Revision: 70064
  • 04:55 Andrew: Preparing to update LiquidThreads alpha to trunk with r70106 and r70100 unmerged.
  • 01:30 mark: Restored exim config on williams (OTRS)
  • 00:35 RobH: otrs emails still bouncing, working on it.
  • 00:31 RobH: exim was sitting idle? on williams for otrs delivery after complaining of a failed database connection. restarted exim, it appears to be working on the delivery backlog now, will check back on it in 30 minutes or so

August 2

  • 23:55 guillom: afaict, all emails sent to OTRS are being rejected with the "retry time not reached for any host after a long failure period" message. The issue seems to have started a few hours ago.
  • 22:34 tomaszf: upgrading civicrm to 3.1.6
  • 18:20 RobH: pulling knsq6 for reinstallation and such
  • 17:47 RobH: sq33-sq40 reinstalled, online, serving requests
  • 15:54 RobH: reinstalling sq34-sq40
  • 14:55 mark: Installed gmetad on streber for collecting ganglia information in local RRDs
  • 14:34 mark: Upgraded streber to Ubuntu 10.04
  • 14:24 RobH: restarted pdns on linne
  • 14:20 RobH: updated dns for tesla host ci
  • 12:46 mark: Fixed ganglia package mess on search12-20
  • 12:10 mark: Depooled all text squids from the bits.esams LVS pool
  • 12:02 mark: Reinstalled knsq2 and knsq5 with Ubuntu 10.04, set them up as Varnish bits caches, and pooled them in LVS
  • 11:50 mark: Fixed ganglia mess on kaulen
  • 11:43 mark: Added missing Wikimedia APT repository to kaulen. Why was it not there? Was this host installed in some nonstandard way?
  • 11:35 mark: Set up exim::simple-mail-sender classes for kaulen
  • 11:35 mark: Fixed puppet for snapshot*, base classes were not included
  • 11:19 mark: Turned off knsq3; broken HBA and out of warranty
  • 10:55 mark: Started puppetd on spence inside gdb

August 1

  • 17:21 mark: Reinstalled linne.wikimedia.org with Lucid
  • 15:41 mark: Fixed the haproxy for puppetmaster on brewster, was broken by the upgrade
  • 15:19 mark: Upgraded brewster to Ubuntu 10.04 Lucid
  • 15:18 mark: Removed Wikimedia repository default pinning on brewster, as it's doing more harm than good
  • 15:07 mark: Explicitly install and deinstall gmond / ganglia-monitor packages in puppet, depending on the ubuntu version
  • 14:52 mark: Removed broken amanda backup client from brewster. Broken package install/dpkg state, no logs, no documentation

July 31

  • 17:31 mark: Setup bits.esams varnish cluster, pooled knsq4 (varnish) with all the text squids
  • 17:06 mark: Reinstalled knsq4 with lucid, redeploying it for varnish
  • 16:24 mark: Depooled knsq1-knsq4 in squid config

July 30

  • 20:00 domas: hotfixed db12 build to have faster mysqldumps
  • 19:01 logmsgbot: robh synchronized php-1.5/wmf-config/InitialiseSettings.php 'Bug 24466 reversion'
  • 18:29 RobH: gracefulled apaches a few times, had a rogue typo
  • 17:45 RobH: dns push successful, all nameservers still online and up to date
  • 17:45 RobH: updated dns for movementroleswiki
  • 17:43 logmsgbot: robh ran sync-common-all
  • 16:51 RobH: added movement roles information to apache, still setting up other stuff for it
  • 05:53 Tim: svn up/scap r70064

July 28

  • 15:26 Rob: running updates on sockpuppet
  • 13:08 logmsgbot: andrew synchronizing Wikimedia installation... Revision: 70064
  • 12:57 Andrew: about to update LiquidThreads alpha to trunk state
  • 07:54 logmsgbot: tstarling synchronized php-1.5/wmf-config/CommonSettings.php 'adding 1.16 branch to ExtensionDistributor'
  • 05:01 apergos: added bayes to list of clients for exports on dataset1; added /data to fstab on bayes for stats use
  • 04:23 logmsgbot: tstarling synchronized php-1.5/includes/api/ApiBase.php
  • 04:22 logmsgbot: tstarling synchronized php-1.5/includes/api/ApiMain.php
  • 04:22 Tim: deploying r70063 and r70064 to fix API fatals
  • 02:09 logmsgbot: root synchronizing Wikimedia installation... Revision: 70061
  • 02:04 Tim: doing svn up/scap to r70061, to get the API cache header fix

July 27

  • 23:27 logmsgbot: robh synchronized php-1.5/wmf-config/InitialiseSettings.php 'bug 24466'
  • 21:39 mark: Added wikimedia-base to the standard packages list in puppet
  • 21:34 Rob: pushed hu language file change on survey.wikimedia.org
  • 21:16 Rob: singer is being a pain, ssh isn't running, and so forth... working on it
  • 19:07 Rob: racktables physical audit of pmtpa done (ignore the bottom of rack a5, will remove them later today when I have a damned mouse)
  • 18:49 Rob: physical audit pmtpa row b complete
  • 18:40 Rob: physical audit pmtpa row c complete
  • 18:18 Rob: uplink successfully moved for asw-b5-sdtpa
  • 18:17 Rob: moving the uplink
  • 18:17 Rob: all the apaches in sdtpa-b5 are migrated from old to new asw-b5-sdtpa.
  • 18:05 Rob: bugzilla admin note: disregard the email bounces for the scireview domain user, their mail server will be fixed shortly
  • 17:58 Rob: srv284 fixed, drac online, needs setup
  • 17:54 Rob: srv281 was shut down when I came into the DC. popped case to use as base for rebuilding another system. need to investigate its initial shutdown
  • 17:17 mark: Reinstalled sq68
  • 16:41 atglenn: started up several more workers on snapshot3 doing xml dumps
  • 14:43 Tim: running svn cleanup on some ExtensionDistributor working copies
  • 14:29 Tim: cleaning up old ext-dist tarballs, removing all that were older than a month
  • 11:48 mark: Stracing puppet on spence
  • 08:01 apergos: started one thread of xml dumps from screen session on snapshot3 as root, if these look good tomorrow we'll crank up more of 'em

July 26

  • 18:29 Rob: moved up to srv268, all working
  • 18:25 Rob: moved srv258-srv261 to new asw-b5-sdtpa ports, all seems shiny
  • 17:59 Rob: hooked up additional ports for sq67/sq68
  • 17:17 mark: Installed Lucid on sq69 and sq70
  • 14:41 mark: Moved db19 back to vlan 2
  • 14:34 mark: Pooled sq68 in the bits.pmtpa varnish LVS pool
  • 14:06 mark: Installed lucid on sq68 and deployed varnish for bits.pmtpa
  • 12:22 mark: Started puppet on spence
  • 05:21 Tim: removed a log on search12 to give it a tiny bit more space

July 25

  • 21:55 mark: Pooled sq67 (Varnish) in the bits.pmtpa LVS pool along with the text squids
  • 21:54 mark: Moved bits.pmtpa.wikimedia.org DNS from 208.80.152.2 (text squids) to 208.80.152.118 (dedicated LVS service)
  • 21:21 mark: Started puppetd on spence
  • 20:50 mark: Setup Nagios monitoring for bits.pmtpa
  • 20:33 mark: Added LVS service for bits.pmtpa.wikimedia.org on lvs4
  • 19:53 mark: Added LVS service ip 208.80.152.118 (bits.pmtpa.wikimedia.org) to all text squids and sq67
  • 19:18 mark: Fixed ganglia varnish monitoring on sq67
  • 13:01 domas: thumbs get i/o errors across multiple clients
  • 10:42 domas: xmltypecheck loop filled / with error logging on srv243 ( http://p.defau.lt/?KtZMFW9x7xaEa2eDwuqMEw ) - had to do some cleanup. didn't livehack anything yet.

July 24

  • 21:31 mark: Deployed varnish with configuration for bits.wikimedia.org on sq67 - not active yet
  • 03:59 apergos: running xml dump file consistency checking script on snapshot2 and snapshot3 in screen as root, expect these to run all night, maybe through the next day as well

July 23

  • 19:17 Rob: synced for Bug 24470 - Enable NewUserMessage extension on lv.wikipedia
  • 19:17 logmsgbot: root synchronized php-1.5/wmf-config/InitialiseSettings.php
  • 18:21 tomaszf: moving otrs-783-529404994.sql to tridge to free up space
  • 18:21 kaldari: Backing up dev_civicrm database in preparation for drupal upgrade
  • 11:33 RoanKattouw: Started Apache on srv278, had died mysteriously
  • 09:30 logmsgbot: catrope synchronized php-1.5/wmf-config/InitialiseSettings.php 'bug 24505: Editprotected group for hiwiki'
  • 02:02 Rob: dig tested against all three nameservers after update, all nominal
  • 01:59 Rob: pushing dns update to fix pointer for bugs.wikimedia.org
  • 01:30 Tim: running some statistics queries on db38

July 22

  • 19:31 Rob: mail server works, bypass expiration issue, rob will fix when all of the employees are not hitting the mail server.
  • 18:13 logmsgbot: kate synchronized php-1.5/wmf-config/db.php 'put ixia back'
  • 17:50 Rob: wmf mail server cert expired, we know, working on replacing it now
  • 17:29 mark: Fixed ganglia data sources by qualifying all hostnames in gmetad_pmtpa.conf; this must have been broken by the resolv.conf change last week
  • 17:15 logmsgbot: catrope synchronized php-1.5/skins/MonoBook.php 'r69735'
  • 17:15 logmsgbot: catrope synchronized php-1.5/skins/Vector.php 'r69735'
  • 17:11 logmsgbot: catrope synchronized php-1.5/wmf-config/CommonSettings.php 'Hide Navigable TOC preference'
  • 17:10 atglenn: xml dumps stopped in preparation for removal of bad dumps and xml code fix push
  • 17:09 atglenn: snapshot3 added back to sync cluster in preparation for xml dumps fixes
  • 16:50 logmsgbot: kate synchronized php-1.5/wmf-config/db.php 'put db27 back'
  • 16:37 mark: Installed Lucid on sq67
  • 16:26 mark: Made Lucid the default distribution for new installs
  • 16:21 mark: Fixed sq73
  • 15:56 mark: Fixed ganglia cron job on spence.
  • 15:46 logmsgbot: kate synchronized php-1.5/wmf-config/db.php 'put db7 back'
  • 15:29 RoanKattouw: Looks like that resolved the zombie issue on srv86
  • 15:15 mark: Fixed puppet for app servers
  • 15:05 RoanKattouw: Defunct Apache process running on srv86, Apache won't start. Need root to kill zombie
  • 15:04 RoanKattouw: Started Apache on srv270, srv163, had died for some reason
  • 15:04 RoanKattouw: Deployed r69728, r69729 (UsabilityInitiative, Vector updates) about 30 minutes ago
  • 15:03 RoanKattouw: <RoanKattouw> !log Deploying r69728, r69729 (UsabilityInitiative, Vector updates) to test.wikipedia.org
  • 15:03 RoanKattouw: <mark> !log Upgraded Locke to Ubuntu 10.04 Lucid
  • 15:03 RoanKattouw: <logmsgbot> !log kate synchronized php-1.5/wmf-config/db.php 'removed ixia to dump s4 for TS'
  • 15:03 RoanKattouw: <logmsgbot> !log kate synchronized php-1.5/wmf-config/db.php 'removed db7 to dump s6 for TS'
  • 15:03 RoanKattouw: <logmsgbot> !log kate synchronized php-1.5/wmf-config/db.php 'removing db27 to dump s3 for TS'
  • 15:02 RoanKattouw: Started morebots as catrope. werdnum has a defunct instance running that I can't kill

July 21

  • 20:02 logmsgbot: catrope synchronized php-1.5/extensions/CodeReview/ui/CodeRevisionListView.php 'r69705'
  • 19:44 Rob: all 5 new misc servers have mgmt working, will allocate network port in a bit and get them loaded up
  • 19:43 mark: Setup logging for Special:Book on locke
  • 19:42 Rob: all dns servers back online and happy
  • 19:38 Rob: grosley back online providing services
  • 19:30 Rob: grosley was down due to my dns issue, rebooting it, sorry about that.
  • 19:29 Rob: restarted pdns on linne
  • 19:25 Rob: fixed letter reversal in dns, repushed
  • 19:13 Rob: linne pdns restarted
  • 19:06 Rob: updating dns for 5 new misc servers mgmt
  • 16:35 Rob: lots more time wasted diagnosing evident mainboard failures on srv284, hopefully replacement will soon be en route
  • 16:10 mark: Done basic setup of asw-b-sdtpa, added to RANCID and torrus
  • 14:27 Rob: ms1 is back, passing it off to someone who doesn't hate it
  • 14:06 Rob: working on ms1, no touchy
  • 09:56 logmsgbot: catrope synchronized php-1.5/includes/api/ApiLogin.php 'r69661'
  • 02:23 Tim: increasing nagios retry count for NTP from 8 to 15

July 20

  • 15:55 Rob: pushed blog and techblog updates on existing plugins, core wordpress, but not themes (because I dont feel like rehacking our css)
  • 15:41 logmsgbot: robh synchronized php-1.5/wmf-config/InitialiseSettings.php 'Bug 24374 - Create new usergroups at commons'
  • 15:35 logmsgbot: robh synchronized php-1.5/wmf-config/InitialiseSettings.php 'Bug 24364 - Install Extension:Collection for PDF export on foundation wiki'
  • 15:16 Rob: synced for bug 24449 disable file talk pages on cs wikis (since they all have upload disabled)
  • 15:15 logmsgbot: robh synchronized php-1.5/wmf-config/InitialiseSettings.php
  • 14:34 logmsgbot: robh synchronized php-1.5/wmf-config/InitialiseSettings.php 'noboard_chapterswikimedia logo'
  • 14:16 logmsgbot: catrope synchronized php-1.5/wmf-config/InitialiseSettings.php 'Change dbname noboardwiki -> noboard_chapterswikimedia'
  • 14:12 logmsgbot: robh ran sync-common-all
  • 13:54 logmsgbot: andrew synchronized php-1.5/wmf-config/InitialiseSettings.php 'communityapplications is too long for user_groups table'
  • 13:48 logmsgbot: andrew synchronizing Wikimedia installation... Revision: 69504
  • 13:48 Andrew: scapping to deploy CommunityApplications extension
  • 13:47 logmsgbot: andrew synchronized php-1.5/wmf-config/InitialiseSettings.php 'New group for viewing community applications'
  • 13:42 logmsgbot: andrew synchronized php-1.5/wmf-config/InitialiseSettings.php 'New group for viewing community applications'
  • 13:32 logmsgbot: robh ran sync-common-all
  • 12:21 logmsgbot: andrew synchronized php-1.5/extensions/CommunityHiring/SpecialCommunityHiring.php 'r69606'
  • 12:20 logmsgbot: andrew synchronized php-1.5/extensions/CommunityHiring/CommunityHiring.php
  • 12:20 logmsgbot: andrew synchronized php-1.5/wmf-config/CommonSettings.php 'Configuration changes for r69606'
  • 11:53 Andrew: Adding CommunityHiring tables to officewiki database

July 19

  • 20:05 logmsgbot: robh synchronized php-1.5/wmf-config/InitialiseSettings.php 'Bug 24448 - w:cs: upload settings'
  • 18:58 aZaFred_OSCON: started slave on db10 (sync had stopped on July 8th)
  • 16:59 Rob: upped tfinc email quota cuz i was tired of seeing the alerts to postmaster
  • 13:09 logmsgbot: catrope synchronized php-1.5/wmf-config/InitialiseSettings.php 'bug 24440: New namespaces for dawikisource'
  • 10:20 logmsgbot: midom synchronized wmf-deployment/cache/trusted-xff.cdb
  • 08:32 logmsgbot: catrope synchronized php-1.5/wmf-config/InitialiseSettings.php 'bug 24435: Enable $wgBlockAllowsUTEdit on hiwiki'
  • 07:32 Tim: changing max retry count on NTP nagios monitoring to avoid constant flapping
  • 06:45 Tim: restarted ntpd on mchenry, wasn't responding on IPv4

July 18

  • 21:26 domas: fixed opensearch caching, unbreaked API
  • 21:18 logmsgbot: midom synchronized php-1.5/includes/api/ApiMain.php 'fix broken API caching'
  • 15:10 logmsgbot: mark synchronizing Wikimedia installation... Revision: 69504
  • 14:59 mark: Making puppet upgrade wikimedia-raid-utils on all servers
  • 12:58 logmsgbot: catrope synchronized php-1.5/extensions/WikimediaMessages/WikimediaLicenseTexts.i18n.php 'r69504'
  • 12:51 logmsgbot: catrope synchronized php-1.5/wmf-config/InitialiseSettings.php 'New groups for hiwiki (bugs 24416, 24417, 24418, 24419)'
  • 08:13 logmsgbot: midom synchronized php-1.5/wmf-config/db.php 'db13 and db15 go live'

July 17

  • 22:01 mark: Apparently package wikimedia-raid-utils truncated /etc/sudoers on many hosts; having puppet put a proper sudoers file back on the application servers
  • 21:42 mark: Fixed gmond mess on search12
  • 21:03 RoanKattouw: Importing another batch of ~10,500 files (~22GB) concurrently with the first one
  • 19:38 RoanKattouw: Import is running in a screen on fenari
  • 19:38 RoanKattouw: Importing ~10,500 files (aggregate size ~21GB) with importImages.php for commons:User:OrdnanceSurveyBot
  • 18:49 logmsgbot: catrope synchronized php-1.5/languages/messages/MessagesLad.php 'r69487'
  • 18:48 mark: Having puppet install wikimedia-raid-utils on all servers
  • 18:12 mark: srv278 rebooted for no apparent reason
  • 17:21 mark: Fixed ganglia mess on search19
  • 16:19 mark: Having puppet install NRPE on all internal servers by default
  • 15:52 logmsgbot: root synchronized php-1.5/wmf-config/lucene.php 'Use fixed LVS service search-pool3 instead of search7 directly'
  • 15:31 mark: Removed fscking nscd from spence
  • 15:20 mark: Added internal LVS services to pmtpa.wmnet DNS
  • 15:11 mark: Doing reboot test of nescio (esams DNS)
  • 15:08 mark: Fixed puppet exchange of ssh hostkeys which has been broken for a while
  • 14:30 mark: Truncated all tables in the puppet db on db9
  • 12:30 mark: Implemented domain search list in resolv.conf for fenari (use sparingly!)
  • 12:24 mark: Put /etc/resolv.conf under Puppet management on all servers. Setting timeout option to 3s, to avoid PyBal depools due to 5s timeout when the primary resolver is down.
  • 11:38 mark: Removed rkhunter and chkrootkit on bayes. What is the point with 2 year old software? Just creating more cron spam? :)
  • 10:06 mark: Fixed degraded raid on nescio
  • 09:53 mark: Removed X.org and gdm from bayes. Why was it installed?
  • 09:30 domas: doing maintenance on db13 and db15 (BIOS, OS, MySQL upgrades, resync of data)

July 16

  • 21:05 mark: Removed old-style DNS monitoring from Nagios conf.php, now fully Puppet managed
  • 21:03 mark: Deployed authoritative DNS on nescio.esams, and moved the service IP
  • 19:32 mark: Deployed PowerDNS recursor on nescio, and moved the 91.198.174.6 service ip to it
  • 19:16 mark: Installed lucid on new server nescio.esams
  • 13:49 mark: Fixed entry in upload.wikimedia.org georecord, sent to text squids for a few ip ranges
  • 12:27 mark: Pointed country code 'eu' (127.0.255.1 or 65281) to esams in geodns; since geobackend uses a signed short, I had to mask it to 0x7ff / 32513 in the director map
  • 10:49 mark: Shutdown fuchsia for decommissioning
  • 07:57 logmsgbot: catrope synchronizing Wikimedia installation... Revision: 69416
  • 07:56 RoanKattouw: Running scap to deploy r69416
  • 01:48 logmsgbot: tstarling synchronized php-1.5/includes/diff/DifferenceInterface.php 'r69414'
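
The 12:27 geodns entry above gives the 'eu' code as 127.0.255.1 (65281) and a masked value of 32513. A minimal sketch of that arithmetic follows, assuming the code is the low 16 bits of the address and that the mask clears only the sign bit (0x7fff); the entry itself writes "0x7ff", so the exact mask is an assumption here, chosen because it reproduces 32513. The function names are hypothetical and not taken from geobackend.

```python
def geo_code(ip: str) -> int:
    """Return the low 16 bits of a dotted-quad address (assumed encoding)."""
    a, b, c, d = (int(part) for part in ip.split("."))
    return (c << 8) | d


def mask_for_signed_short(value: int) -> int:
    """Clear bit 15 so the value fits in a signed 16-bit integer.

    Assumption: mask is 0x7fff; the log entry writes "0x7ff".
    """
    return value & 0x7FFF


code = geo_code("127.0.255.1")
masked = mask_for_signed_short(code)
print(code, hex(code))      # 65281 0xff01
print(masked, hex(masked))  # 32513 0x7f01
```

Both printed decimal values match the numbers in the log entry, which is why 0x7fff is used in the sketch.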

July 15

  • 19:23 mark: Replaced all occurrences of 'rr.wikimedia.org' with 'text.wikimedia.org' in DNS
  • 19:14 mark: Updated IP of deprecated record rr.esams.wikimedia.org
  • 19:10 mark: Started PyBal on amslvs1 with a new config; it automatically picked up the traffic for both text.esams (91.198.174.232) and bits.esams (91.198.174.233)
  • 19:07 mark: Stopped PyBal on amslvs1, BGP and OSPF did an automatic failover of bits.esams (91.198.174.233) to amslvs3
  • 18:59 mark: Removed IP 91.198.174.2 (old text squids service ip) from amslvs1. Anyone still using the old IP after weeks will now be unable to reach our sites.
  • 18:56 mark: Depooled knsq1-knsq7 in PyBal
  • 17:38 Fred: fixed nfs mounts on Bayes.
  • 15:35 apergos: chowned /mnt/upload6/private/ExtensionDistributor/mw-snapshot/trunk/extensions tree to extdist. ExtensionDistributor apparently working now
  • 15:01 apergos: running svn cleanup on /mnt/upload6/private/ExtensionDistributor/mw-snapshot/trunk/extensions as extdist user
  • 12:34 logmsgbot: tstarling synchronizing Wikimedia installation... Revision: 69381
  • 12:18 Tim: svn up/scap to r69380
  • 05:13 logmsgbot: jeluf synchronized php-1.5/wmf-config/InitialiseSettings.php '24321 - ml.wikiquote.org lost its project namespace'

July 14

  • 23:44 Fred: re-added cron job to periodically save rrds on our ganglia server. (cron job seems to have vanished for some reason)
  • 17:59 logmsgbot: catrope synchronized php-1.5/wmf-config/InitialiseSettings.php 'Favicon for wikimaniateamwiki per Guillaume'
  • 16:06 Fred: restarted apache on mobile1 (had begun to return 500)
  • 14:07 mark: Fixed memcached on srv110
  • 12:19 mark: Fixed ganglia and puppet on stafford
  • 11:54 mark: Migrated DNS monitoring to puppet
  • 10:31 mark: Migrated ZFS RAID nagios check to puppet
  • 10:14 mark: Migrated monitoring of lucene to puppet
  • 09:37 mark: Migrated monitoring of image scalers to puppet
  • 08:49 Tim: using stafford for some pbuilder experimentation

July 13

  • 22:02 mark: Migrated monitoring of application servers to Puppet
  • 20:29 mark: Fixed puppet on ms4
  • 20:16 mark: Hacked up nagios conf.php to not create host entries for most servers (now in puppet), except special cases
  • 19:58 mark: Hacked up nagios conf.php to not create host entries
  • 16:51 mark: Migrated Squid Nagios monitoring to puppet, commented some functionality in nagios conf.php
  • 15:51 mark: Split puppet nagios config over multiple files

July 12

  • 16:54 Fred: changed LONGQUERIES check threshold
  • 16:08 Fred: restarting morebots since it had died.
  • 16:08 Fred: restarting Nagios since it was down.
  • 14:29 mark: Added "cfg_file=/etc/nagios/puppet_hosts.cfg" to nagios.cfg
  • 13:25 JeLuF: added disk space monitoring for apaches
  • 12:51 logmsgbot: jeluf synchronized php-1.5/wmf-config/InitialiseSettings.php '24306 - Create namespaces for Lithuanian Wiktionary'
  • 12:48 logmsgbot: jeluf synchronized php-1.5/wmf-config/InitialiseSettings.php '24321 - ml.wikiquote.org lost its project namespace'
  • 12:46 logmsgbot: jeluf synchronized php-1.5/wmf-config/InitialiseSettings.php '24321 - ml.wikiquote.org lost its project namespace'
  • 12:41 logmsgbot: jeluf synchronized php-1.5/wmf-config/InitialiseSettings.php '24344 - Namespace changes - si.wiktionary'
  • 11:45 JeLuF: fixed broken ganglia-metrics installation on srv146 (chown gmetric /var/log/gmetricd/gmetricd.log)
  • 11:41 JeLuF: added DPKG status monitoring for all app servers to nagios. Reports all packages that are not in state 'rc' or 'ii'.
  • 10:43 JeLuF: lots of false alerts from nagios due to missing SSL setup for NRPE. Working on it.
  • 09:53 JeLuF: changed puppet config to install nrpe on all app servers
  • 09:28 JeLuF: replacing opsview-nrpe agents by nagios-nrpe agents (image_scalers, some other apaches). Most apaches already use nagios-nrpe
  • 07:40 Tim: set up NRPE disk space monitoring on ms4, discovered that /mnt2 is full
  • 04:54 Tim: updated NFS host/service groups to monitor the actual NFS servers, not a random collection of miscellaneous ex-NFS servers
  • 04:46 Tim: installed NRPE on nfs1 and nfs2
  • 04:08 Tim: adding rendering, m, bits.esams, recursor0, recursor1, recursor0.esams to nagios
  • 04:02 Tim: added forward DNS entry for recursor0.esams, modified reverse DNS entry resolver0.esams -> recursor0.esams
  • 03:55 Tim: fixed reverse DNS entries for recursor0 and recursor1, were set incorrectly to non-existent hostnames "resolver0" and "recursor1"
  • 03:36 Tim: renamed db6.mgmt to locke.mgmt
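
The 11:41 entry above describes a DPKG status check that reports every package not in state 'rc' or 'ii'. The actual Nagios plugin is not shown in this log; the following is only a sketch of that filtering rule, assuming the input is the text produced by `dpkg -l`. The function name and the sample package names are hypothetical.

```python
def unexpected_packages(dpkg_l_output, ok_states=("ii", "rc")):
    """Return (state, package) pairs whose dpkg state is not in ok_states."""
    bad = []
    in_table = False
    for line in dpkg_l_output.splitlines():
        # The package table starts after the "+++-====" separator line.
        if line.startswith("+++-"):
            in_table = True
            continue
        if not in_table or not line.strip():
            continue
        fields = line.split()
        if len(fields) < 2:
            continue
        state, name = fields[0], fields[1]
        if state not in ok_states:
            bad.append((state, name))
    return bad


# Hypothetical sample in the layout of `dpkg -l` output.
SAMPLE = """\
Desired=Unknown/Install/Remove/Purge/Hold
| Status=Not/Inst/Conf-files/Unpacked/halF-conf/Half-inst/trig-aWait/Trig-pend
|/ Err?=(none)/Reinst-required (Status,Err: uppercase=bad)
||/ Name           Version      Description
+++-==============-============-=================================
ii  bash           4.1-2        The GNU Bourne Again SHell
rc  oldpkg         1.0          removed, config files remain
iU  halfdone       2.3          unpacked but not configured
"""

print(unexpected_packages(SAMPLE))  # [('iU', 'halfdone')]
```

A check built on this would presumably alert when the returned list is non-empty; how the real check maps results to Nagios states is not stated in the log.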

July 10

  • 14:14 rainman-sr: search7 disk was full, deleting some old unnecessary indexes
  • 12:50 Fred: applied security updates on all machines running Karmic or Lucid (per USN-959-1)

July 9

  • 18:07 domas: forgot to log, rebooted locke, put startup stuff to rc.local, maybe Tim changed it afterwards, hehe. beer is good too.
  • 15:31 Rob: wikimania2011wiki is now using vector
  • 15:31 logmsgbot: robh synchronized php-1.5/wmf-config/InitialiseSettings.php
  • 12:48 logmsgbot: robh ran sync-common-all
  • 01:06 logmsgbot: tstarling synchronized php-1.5/includes/filerepo/RepoGroup.php
  • 01:04 logmsgbot: tstarling synchronized php-1.5/includes/filerepo/RepoGroup.php
  • 01:04 logmsgbot: root synchronized php-1.5/includes/filerepo/RepoGroup.php
  • 01:03 logmsgbot: tstarling synchronized php-1.5/includes/filerepo/RepoGroup.php
  • 00:59 logmsgbot: tstarling synchronized php-1.5/includes/filerepo/RepoGroup.php

July 8

  • 22:27 apergos: powercycled db9 from drac after shutdown failed
  • 22:20 Fred: re-imaging srv225 back to normal until wikimedia-task* can be ported to lucid.
  • 22:15 apergos: rebooting db9, mysqld was defunct but the port was in use so couldn't restart it the nice way
  • 17:06 mark: Set temporary 91.198.174.0/24 null0 route on br1-knams, to investigate prefix announcement problems
  • 16:10 Rob: updated puppet to add zak to the mortals admin group and allowed access to shell on fenari as non-root
  • 04:10 Tim: starting upload of BnF images, using importImages.php in screen on fenari

July 7

  • 21:38 Fred: RIP sfoservices. (box not booting at all anymore)
  • 17:10 Fred: re-imaging srv225 to the apache cluster.
  • 16:26 mark: Fixed puppet on srv193
  • 15:56 mark: Fixed horrible gmond mess on searchidx1
  • 15:42 mark: Fixed puppet on srv255
  • 15:30 mark: Mounted /mnt/upload6 on srv255
  • 14:23 mark: Fixed /home backup on nfs1/nfs2 to tridge
  • 07:57 Rob: srv193 is refusing to take my updates, removed it from pybal so it doesn't serve out-of-date information
  • 07:53 logmsgbot: robh ran sync-common-all
  • 07:50 Rob: updated dns for wikimania wiki
  • 07:38 Rob: adding wikimaniawiki apache support, syncing lots of apaches and docroots.
  • 01:49 Tim: downloading the DjVu files via rsync/ssh for http://www.wikimedia.fr/wikim%C3%A9dia-france-signe-un-partenariat-avec-la-bnf

July 6

  • 13:43 mark: Fixed puppet on nfs1 and nfs2
  • 11:11 mark: Removed config cache on srv110
  • 10:32 mark: Fixed puppet on srv110
  • 10:26 mark: Stopped apache on srv110
  • 00:28 Tim: restarted mailman on lily
  • 00:23 Tim: killed all mailman processes on lily in an attempt to save it from swap death (swapping severely since 00:07)
  • 00:12 Tim: fixed stale /home on searchidx1 and restarted indexer
  • 00:02 Tim: codereview-proxy is up now. Pinging CR update API for all recent revisions

July 5

  • 23:55 logmsgbot: tstarling synchronized php-1.5/wmf-config/CommonSettings.php 'new URL for codereview proxy'
  • 23:33 Tim: changed CNAME for codereview-proxy to kaulen
  • 23:20 Tim: moving codereview-proxy to kaulen to replace isidore (which is down)
  • 22:53 Tim: on srv124: remounted /home to fix test.wikipedia.org
  • 18:56 logmsgbot: jeluf synchronized php-1.5/wmf-config/flaggedrevs.php '24010 - id.wikipedia requesting FlaggedRevs'
  • 17:01 mark: Did BGP soft clear outbound on all AMS-IX sessions; no prefixes were being announced as of two weeks ago
  • 16:01 mark: Made puppet ensure apache is running on the app servers; running "sync-common" upon start
  • 15:38 mark: Fixed puppet on srv145
  • 11:39 mark: Remounted /home on hume
  • 11:30 logmsgbot: root synchronizing Wikimedia installation... Revision: 68850
  • 11:29 logmsgbot: mark synchronized php-1.5/wmf-config/CommonSettings.php 'CommonSettings.php out of sync on a few apaches'
  • 09:17 logmsgbot: jeluf synchronized php-1.5/wmf-config/InitialiseSettings.php '24264 - Create a namespace aliases on zhwiki'
  • 06:26 ronabop: Kudos to the team who rebuilt a multi-hundred node system under extreme pressure.
      Was there extreme pressure? :-) --domas
  • 06:23 Tim: fixed broken ircd auth configuration, irc.wikimedia.org now working again
  • 05:13 Tim: on browne: reinstalled udprec to fix IRC server
  • 04:16 Tim: switched nagios monitoring for search to less flappy TCP connection check instead of HTTP
  • 03:58 domas: s3 pos: db17-bin.321:0
  • 03:58 logmsgbot: midom synchronized php-1.5/wmf-config/db.php 's3 rw'
  • 03:54 Tim: fixed search monitoring in nagios
  • 03:47 Tim: started lsearchd on search1-20
  • 03:44 logmsgbot: midom synchronized php-1.5/wmf-config/db.php 'rw s4 s4'
  • 03:42 Tim: fixed search1: just needed /home remounted
  • 03:38 logmsgbot: midom synchronized php-1.5/wmf-config/db.php
  • 03:32 domas: new repl positions, s2: db30-bin.000015:1227, s4: db16-bin.019:0
  • 03:17 logmsgbot: midom synchronized php-1.5/wmf-config/db.php
  • 03:13 logmsgbot: midom synchronized php-1.5/wmf-config/db.php
  • 03:04 logmsgbot: tstarling synchronized php-1.5/wmf-config/db.php 's3 fake master r/o'
  • 02:54 Tim: mysql status: s2 and s4 have replication broken with "Client requested master to start replication from impossible position". s3: still waiting for innodb recovery on master. Other clusters good.
  • 02:49 logmsgbot: tstarling synchronized php-1.5/wmf-config/db.php
  • 02:48 Tim: on db8: read_only=1 again and setting wiki to r/o
  • 02:46 Tim: on db8: read_only=0, started up r/o (s4)
  • 02:42 Tim: putting s2 into read-only mode due to replication issues
  • 02:35 RobH_: search server defaults to sitting on grub screen for search13-search20, will fix later, for now they are booting back up.
  • 02:30 Tim: fixed m.wikipedia.org on lvs4
  • 02:26 RobH_: search13 back up, working on the others
  • 02:24 mark: Moved bits.pmtpa to point to Text squids in DNS
  • 02:08 Tim: starting mysqld on a lot of DB servers
  • 02:08 RobH_: seems like a power outage, not an AC issue.
  • 02:07 RobH_: email back online
  • 01:52 Tim: on nfs1: river fixed the filesystem with fsck
  • 01:33 Tim: (about 5 minutes ago) started mysqld on db17
  • 01:26 Tim: on lvs4: removed dead squids from text list
  • 01:16 Tim: started mysql on db8
  • 01:15 Tim: started mysqld on db5
  • 01:11 Tim: power went off briefly again, lvs4 came back up properly this time, starting pybal on it again
  • 00:59 Tim: got LVS set up and working on lvs4
  • 00:56 Tim: s/nfs4/lvs4
  • 00:55 Tim: got nfs4 back online

July 3

  • 01:25 logmsgbot: midom synchronized php-1.5/wmf-config/db.php
  • 00:40 logmsgbot: midom synchronized php-1.5/wmf-config/db.php

July 2

  • 22:17 logmsgbot: andrew synchronized php-1.5/wmf-config/CommonSettings.php 'style version'
  • 22:14 logmsgbot: andrew synchronizing Wikimedia installation... Revision: 68850
  • 21:47 logmsgbot: jeluf synchronized php-1.5/wmf-config/InitialiseSettings.php '24119 - Change the logo image in Sinhala wiktionary'
  • 21:46 logmsgbot: jeluf synchronized php-1.5/wmf-config/InitialiseSettings.php '24119 - Change the logo image in Sinhala wiktionary'
  • 21:42 logmsgbot: andrew synchronized php-1.5/skins/common/shared.css
  • 21:19 logmsgbot: andrew synchronized php-1.5/wmf-config/ExtensionMessages.php
  • 21:17 logmsgbot: andrew synchronizing Wikimedia installation... Revision: 68850
  • 20:27 Fred: added replag nagios check for all slave DBs.
  • 19:56 logmsgbot: jeluf synchronized php-1.5/wmf-config/InitialiseSettings.php '24025 - Enable NewUserMessage extension on ko.wikipedia'
  • 19:50 logmsgbot: jeluf synchronized php-1.5/wmf-config/InitialiseSettings.php '24067 - New namespace for gl.wiktionary'
  • 19:40 logmsgbot: jeluf synchronized php-1.5/wmf-config/InitialiseSettings.php '24048 - Change sitename for Osetian Wikipedia.'
  • 19:28 mark: Slave SQL thread stopped on db40 due to lock wait timeout (?), restarted
  • 17:58 logmsgbot: andrew synchronized php-1.5/extensions/StrategyWiki/ActiveStrategy/ParserFunctions.php
  • 17:47 logmsgbot: jeluf synchronized php-1.5/wmf-config/InitialiseSettings.php '24157 - Enable RevisionMove on testwiki'
  • 11:30 JeLuF: mark: !log Upgraded asw-b3-sdtpa, asw-b4-sdtpa and asw-b5-sdtpa to newer JunOS
  • 00:54 Fred: starting MTA on list server again...
  • 00:51 Fred: Cause of spam: spammers using wiki@wikimedia.org as originating address. result: we are getting hit by the responses.
  • 00:46 Fred: purging mail queue for spam 'replies' on list server.
  • 00:21 Fred: mailing list server down while fixing .

July 1

  • 22:58 tomaszf: setting watchdog timer on db9 to 60sec for process kill
  • 21:08 logmsgbot: jeluf synchronized php-1.5/wmf-config/InitialiseSettings.php '17531 - Add subpage feature to the article namespace on nowikimedia'
  • 21:05 logmsgbot: jeluf synchronized php-1.5/wmf-config/InitialiseSettings.php '17510 - Give patroller-group access to suppressredirect on nowiki'
  • 20:57 logmsgbot: jeluf synchronized php-1.5/wmf-config/InitialiseSettings.php '24065 - Add some interwiki links in Special'
  • 20:55 logmsgbot: jeluf synchronized php-1.5/wmf-config/InitialiseSettings.php '24018 - Change the logo image in Sinhala wikibooks'
  • 20:53 logmsgbot: jeluf synchronized php-1.5/wmf-config/InitialiseSettings.php '24138 - Logo of the fr.wikisource'
  • 20:49 logmsgbot: jeluf synchronized php-1.5/wmf-config/InitialiseSettings.php '24205 - New namespace for de.arbcom'
  • 20:28 logmsgbot: catrope synchronized php-1.5/extensions/CodeReview/backend/CodeRevision.php 'r68850'
  • 20:27 logmsgbot: catrope synchronized php-1.5/extensions/CodeReview/CodeReview.php 'r68850'
  • 18:11 logmsgbot: andrew synchronized php-1.5/extensions/StrategyWiki/ActiveStrategy/ParserFunctions.php
  • 18:09 logmsgbot: catrope synchronized php-1.5/wmf-config/CommonSettings.php 'Bump style version appendix'
  • 18:09 Rob: every single access switch is now accessible via serial mgmt
  • 18:08 Rob: only switch in pmtpa-row a is asw-a2-pmtpa, which was not responsive to serial, fixed scs settings and properly labeled the port
  • 18:07 logmsgbot: catrope synchronized php-1.5/extensions/UsabilityInitiative/js/plugins.combined.min.js 'r68842'
  • 18:06 logmsgbot: catrope synchronized php-1.5/extensions/UsabilityInitiative/Vector/Vector.combined.min.js 'r68842'
  • 17:35 Rob: pmtpa-rowc switches connected to scs-c1-pmtpa
  • 17:03 Rob: asw-b1,asw-b2,asw-b3,asw-b4,asw-b5-pmtpa connected to scs-c1-pmtpa
  • 16:30 mark: Rob moved management fiber on pmtpa end from csw5-pmtpa:8/24 to msw1-pmtpa:0/1/1
  • 16:05 Rob: moved the primary serial mgmt interface of csw5-pmtpa from scs-ext to scs-c1-pmtpa.mgmt.pmtpa.wmnet, also ran the permanent connections to the same serial console for msw1-pmtpa and mr1
  • 14:04 mark: Configured Exim to bypass spamd for wiki@wikimedia.org recipient
  • 13:51 mark: Restarted exim4 and spamassassin on mchenry
  • 07:51 logmsgbot: catrope synchronized php-1.5/wmf-config/InitialiseSettings.php 'Second attempt at restoring thumb size on svwiki'
  • 07:48 logmsgbot: catrope synchronized php-1.5/wmf-config/InitialiseSettings.php 'Change thumb size on svwiki back to 250px'
  • 04:49 Tim: on singer: configured "php_admin_flag engine off" in all planet vhosts
  • 03:24 Tim: added a user account for myself on filesrv1, in tech group, I figure I've been here long enough to deserve one
