Server admin log/Archive 20

September 12

  • 23:30 jeluf: apache on srv37 doesn't restart, libhistory.so.4 is missing
  • 23:15 mark: NTP IP missing on zwinger, re-added
  • 15:30 Tim: set read_only=0 on srv108 (Rob rebooted it)
  • 15:00 RobH: bart crashed, rebooted.
  • 14:56 Tim: pulling out all the stops now, running migrate.php migrate-all.
  • 14:45 RobH: synced srv104, back online.
  • 14:40 RobH: synced db.php.
  • 14:32 RobH: srv105 unresponsive, rebooted.
  • 14:25 Tim: Removed the corrupted ES installations on srv151-176
  • 14:18 RobH: Installed NRPE plugins on db9-db16.
  • 09:01 Tim: reverted, blob corruption due to charset conversion observed
  • 07:58 Tim: Experimentally switched db.php to use the ubuntu servers for cluster3/4.
  • 07:50 Tim: Stopping replication on the ubuntu cluster3 and cluster4 servers, and changing the file permissions on the MyISAM files to prevent any kind of modification by the mysql daemon. This is done by the new lock/unlock commands in ~tstarling/migrateExtToUbuntu/migrate.php.
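
A minimal sketch of what the lock step above amounts to, assuming a stock datadir path; the real logic lives in ~tstarling/migrateExtToUbuntu/migrate.php and is not reproduced here:

      mysql -e 'STOP SLAVE'
      mysql -e 'FLUSH TABLES'          # make mysqld close its table file handles
      chmod -R a-w /var/lib/mysql      # strip write bits from the MyISAM files
      # "unlock" would restore the write bits and START SLAVE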

September 11

  • 05:30 Tim: Migrating cluster4. Testing new binlog deletion feature.

September 10

  • 15:40 RobH: Racktables database moved from will to db9.
  • 15:00 RobH: Reinstalled srv185, srv186, srv187 to newest ubuntu, online as apache.
  • 05:00 - 10:10 Tim: copied cluster3 to srv151, srv163 and srv175, second attempt, seems to have worked this time
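
The copy method isn't logged; a plausible sketch for the 05:00 - 10:10 entry, assuming rsync over ssh and a hypothetical per-cluster datadir:

      # run with mysqld stopped or the tables locked, or the copy is inconsistent
      for host in srv151 srv163 srv175; do
          rsync -a /var/lib/mysql/cluster3/ "$host":/var/lib/mysql/cluster3/
      done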

September 9

  • 23:25 brion: for a few minutes got some complaints about 'Can't contact the database server: Unknown error (10.0.6.22)' (db12). This box seems to be semi-down pending some data recovery, but load wasn't disabled from it. May have gotten load due to other servers being lagged at the time. Set its load to 0.
  • 18:49 RobH: Moved maurus from A4 to A2.
  • 18:05 mark: Made lvs2 a temporary LVS host for upload.pmtpa.wikimedia.org to be able to remove alrazi from its rack. Will redo this setup soon.
  • 17:50 RobH: srv61 reinstalled and setup as apache and memcached.
  • 17:50 RobH: srv144 reinstalled, needs ES setup.
  • 17:50ish brion: updated planet to 2.0, cleared en feed caches. Something was broken in them which caused updates to fail since September 5.
  • 17:42 RobH: Updated DNS to reflect new search servers.
  • 15:11 RobH: Moved isidore; upon reboot, noticed the wordpress update didn't take, reapplied it to the blog and whygive installations.
  • 14:49 RobH: zwinger and khaldun moved from A4 to A2.
  • 10:26 Tim: copying ES data from srv32 to srv151, srv163 and srv175
  • 01:30-10:20 Tim: testing and debugging the ubuntu ES migration script on srv151, srv163 and srv175
  • 02:15 Tomasz: Added bugzilla reporting cron on isidore.
  • 00:48 Tim: granted root access to zwinger on all ES servers, useful for migration

September 8

September 7

  • 15:48 mark: alrazi overloaded, switch traffic back to knams and hope it can take the load
  • 14:37 mark: knams partially back up, broken line card still down. Moved some important servers to another line card. knsq16 - knsq30 will be down for the upcoming days, as well as most management.
  • 10:20 domas: copied in mysql build from db16 to db12 - db12 was running the gcc-4.2 build and was in a crash loop. Next crash will bring up the proper build :)

September 6

  • 20:15 river: failure of many hosts at knams (including lvs), moved to authdns-scenario knams-down
  • 12:05 hashar: merged r40433 to fix &editintro
  • 5:30 JeLuF: image upload on enwiki enabled again. Slowly deleting images from amane.
  • 3:00 JeLuF: image upload on enwiki disabled, copying enwiki images to storage1

September 5

  • 22:00-00:00 Hashar: gmaxwell provided a backup of files (downloaded to ~/files/); I restored the ones that were missing.
  • 17:03 Tim: Updated trusted-xff.cdb. Fixes AOL problems.
  • 14:45 JeLuF: started to rsync enwiki images from amane to storage1 in preparation of tomorrow's final move of the image directory
  • 04:24 Tim: sync-file screwup caused thumbnails to be created in the source image directory. Will try to repair.
  • 03:13 Tim: srv151 is depooled for some reason. No indication as to why in the logs or config files. Using it to test the new wikimedia-task-appserver package. Will repool once I get it working properly.

September 4

  • 22:15 JeLuF: Switched srv179's mysql to read_only
  • 22:10 JeLuF: OTRS back online, switched to db9. Changed exim config on mchenry, too.
  • 20:00 JeLuF,RobH: Shut down OTRS, migrating its DB from srv179 to db9
  • 19:49 RobH: db10 replication slave of db9
  • 17:58 RobH: civicrm and dev civicrm database now located on db9 (was on srv10)
  • 17:19 RobH: Bugzilla database is now located on db9 (was on srv8)
  • 16:52 RobH: Both the wikimedia blog and donation blog databases are now residing on db9 (was on srv8)
  • 16:43 Tim: re-enabled thumb.php after some of the culprits came to talk to me on #wikimedia-tech and promised to reform their ways
  • 11:09 Tim: fixed APC on srv38 and srv39, was broken.
  • 10:35 Tim: srv38 and srv39 have been overloaded since 05:50. Blocked thumb.php for external clients.
  • 05:30 Tim: restarted srv138 with sysrq-trigger (see the sketch after this list). Was reporting "bus error" on sync-file.
  • 04:03 Tim: upgrading to wmerrors-1.0.2 on all mediawiki-installation
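
For reference, "restarted with sysrq-trigger" (05:30 above) comes down to the magic-sysrq interface; 'b' reboots immediately with no clean shutdown:

      echo 1 > /proc/sys/kernel/sysrq    # make sure magic sysrq is enabled
      echo s > /proc/sysrq-trigger       # best-effort emergency sync
      echo b > /proc/sysrq-trigger       # hard reboot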

September 3

  • 23:00 jeluf: moved enwiki's upload archive from amane to storage1, freeing up some 20G on amane.
  • 16:54 brion: tweaking ApiOpenSearchXml to hopefully fix the rendering-thumbs-on-text-apaches problem
  • 14:01 RobH: updated libtiff4 on all apaches
  • 04:23 Tim: svn up/scap to r40356
  • 04:13 Tim: populating ss_active_users
  • 03:21 Tim: applying patch-ss_active_users.sql

September 2

  • 19:50 mark: Repooled srv181
  • 19:31 mark: Many boxes still in inconsistent state because of OOM kills. Some background processes not running (e.g. ntpd). Rebooted srv159, srv182, srv154, srv156, srv157, srv158, srv181, srv188
  • 19:28 mark: scap
  • 19:01 mark: Killed all stuck convert processes on srv151..srv188 (but left srv189 intact for debugging)
  • 18:51 mark: Rebooted srv169, srv180
  • 18:48 mark: Remounted /mnt/upload4 on srv151..srv188 (not srv189)
  • 18:33 mark: Many application servers are running out of memory, one by one. This seems to be caused by stuck thumbnail convert processes which end up there. The thumbnail convert processes on the regular apaches are indirectly caused by the API, and are opensearch/prefixsearch/allpages related - but I get lost in that code. One sample url is http://en.wikipedia.org/w/api.php?action=opensearch&search=Gina&format=xml Another interesting and likely related question is why many apaches can no longer reach storage1 NFS...
  • 17:07 RobH: Restarted ssh process which had stalled on srv188.
  • 16:52 mark: Rebooted srv186
  • 16:00 RobH: Pushed a number of dns changes for CZ chapter redirects.
  • 15:25 RobH: Updated dns for arbcom.de.wikimedia.org. Also added wiki to the cluster.

September 1

  • 23:10 mark: Added upload.v4.wikimedia.org hostname (explicitly A-record only), and allowed it in Squid frontend.conf
  • 17:40 jeluf: unpooled apache srv138, srv181 ssh not working
  • 17:30 jeluf: re-enabled srv124 in ES cluster12
  • 17:15 jeluf: re-enabled srv86 in ES cluster7
  • 16:32 mark: Deployed the PowerDNS pipe backend with the selective-answer script on all authoritative servers (protocol skeleton after this list)
  • 09:38 Tim: srv102 done, re-added cluster17 to the write list
  • 04:09 Tim: repooled ES on srv107, schema change done
  • 03:50 Tim: depooled apache on srv105, had old MW configuration, no ssh
  • 03:45 Tim: starting max_rows change on srv102. srv107 is actually stopped due to disk full, fixing.
  • 03:37 Tim: switching masters on cluster17 to srv103.
  • 02:14 Tim: Killed job runner on srv107 to speed up schema change.
  • 02:10 Tim: Brought srv142 and srv145 into ES rotation in cluster16.
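
The selective-answer script itself isn't reproduced in this log; the skeleton below only illustrates the PowerDNS pipe-backend protocol (ABI version 1) such a script plugs into:

      #!/bin/bash
      # pdns sends "HELO<tab>1" once at startup and expects an OK line back
      read -r helo
      printf 'OK\tselective-answer skeleton\n'
      # each query then arrives as: Q <qname> <qclass> <qtype> <id> <remote-ip>
      while IFS=$'\t' read -r tag qname qclass qtype id remoteip; do
          # a real script would emit DATA lines chosen by $remoteip before END
          printf 'END\n'
      done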

August 31

  • 23:05 mark: A parser bug in the PowerDNS Bind backend caused unavailability of the wikimedia.org zone for a few minutes, ouch...
  • 22:55 mark: Deployed a PowerDNS pipebackend instance with this script on ns2.wikimedia.org (lily) only. Just one out of three nameservers for stability testing for now. Should there be major trouble, remove all "pipe" backend references from /etc/powerdns/pdns.conf.
  • 18:38 Tim: Going to bed. Status is: srv107 replicating but locked with slow alter table. Can be re-added after it catches up. cluster18 is working, for no apparent reason, and should be migrated to max_rows=20M ASAP. cluster17 needs a master switch so that srv102 can be fixed, after that it should be re-added to the write list. Once srv142 is done copying, it can be restarted and repooled, as can srv145. No need to fix the replication there since it's an old cluster.
  • 18:30 Tim: re-adding cluster19 to the write list, without srv107 which is still altering.
  • 16:22 Tim: srv141 didn't work out, out of disk space, trying copy to srv142 instead (from srv145)
  • 14:44 Tim: srv103 and srv110 done, repooling.
  • 14:02 Tim: srv108 done, changed master to srv108, started max_rows change on srv107
  • 13:51 Tim: started max_rows change on srv110. Not patient enough to do them one at a time.
  • 13:38 Tim: copy to srv110 finished. Put srv110 in, srv103 left out for now for max_rows change
  • 13:27 Tim: taking srv145 out of rotation for copy to new ext store srv141 (has same partitioning)
  • 12:45 Tim: srv109 finished, starting on srv108
  • 11:45 Tim: taking srv103 out of rotation for copy to new ext store srv110
  • 11:37 Tim: alter table blobs max_rows=10000000; on srv109.
  • 11:34 Tim: cluster is too much of a mongrel undocumented mess to set up new ext store servers, and we don't have that many candidates left anyway. Going to try saving the existing clusters.
  • 10:27 Tim: received reports that cluster19 has gone the same way. Most likely all slaves and masters set up that time are affected and will fail roughly simultaneously. Will set up new clusters.
  • 10:15 Tim: set mysql root password on external storage servers where it was blank
  • 10:07 Tim: cluster17 master srv102 has stopped being writable for enwiki due to exhausted MyISAM index table size (max_rows=1000000). Removed from write list, working on it (see the sketch after this list).
  • 07:00 Tim: On srv189: added ddebs.ubuntu.com to sources.list. Installed debug symbols for apache.
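
Background for the max_rows work above: MyISAM sizes its internal row pointers from MAX_ROWS, so a table built with too small a value stops accepting rows once the pointer space is exhausted. A sketch of the check and fix; the database name is a placeholder, and the ALTER rewrites the whole table (hence the depooling):

      mysql es_db -e "SHOW TABLE STATUS LIKE 'blobs'\G" | egrep 'Rows|Max_data_length'
      mysql es_db -e "ALTER TABLE blobs MAX_ROWS=10000000"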

August 30

  • 22:11 mark: Set up an experimental IPv6 to IPv4 proxy on iris
  • 17:13 Tim: killed long-running convert processes on srv152-189
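
One way to pick off only the long-running convert processes, assuming a ps that supports the etimes column; the exact criteria used aren't logged:

      ps -eo pid,etimes,comm |
          awk '$3 == "convert" && $2 > 3600 { print $1 }' |
          xargs -r kill -9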

August 29

  • 21:00 jeluf: checked srv104, added it back to its ES pool, added cluster18 back to wgDefaultExternalStore
  • 16:12 RobH: moved srv52 and srv56 from B2 to C4 for heat issues.
  • 15:32 RobH: srv149 reinstalled as apache core.
  • 13:08 Tim: images on kuwiki were actually broken because the move from amane to storage2 failed. The directory on amane was probably recreated by the thumbnail handler before the migration script created the symlink, resulting in a new writable image directory with no images in it. Merged the two directories and fixed the symlink.
  • 12:00 domas: did space cleanups on amaryllis, and all DBs (all <80% disk usage now :) - preparing for vacation. VACATION!!! :)

August 28

  • 22:50 mark: Set up a dirty, temporary test setup of PyBal on lvs2 doing SSH logins on all apaches for health checking.
  • 21:43 RobH: reinstalled srv134 back online as apache core.
  • 21:10 RobH: reinstalled srv130 back online as apache core.
  • 20:09 RobH: searchidx1, search1, search2, search3, search4, search5, search6, & search7 racked with remote management enabled.
  • 16:09 RobH: db9 reinstalled for misc db role.
  • 13:28 Tim: removed dkwiktionary and dkwikibooks from all.dblist. Apparently they were visible on the web even though they had previously been removed. They were created accidentally years ago due to dk being an alias for da.
    • They became visible due to Rob's changes to langlist.
  • 05:59 Tim: Following complaint about bad uploads on kuwiki, running "find -type d -not -perm 777 -exec chmod 777 {} \;" in various upload directories with various maxdepth options.

August 27

  • 22:57 RobH: srv127 reinstalled and back online as apache.
  • 22:34 RobH: srv36 reinstalled and back online as apache.
  • 22:09 RobH: srv117 reinstalled and back online as apache.
  • 22:00 mark: Commented out most LVS related checks in /home/wikipedia/bin/apache-sanity-check which are no longer relevant
  • 22:00 mark: Various changes to the Ubuntu installer, to make SM apache installs work, and for preseeding of NTP config.
  • 21:48 RobH: srv81 reinstalled and back online as apache.
  • 19:07 RobH: Purged cz.wikimedia.org redirect from all knams squids.
  • 18:10 RobH: srv147 reinstalled and deployed as apache.
  • 16:30 RobH: sq48 had a possible issue with hdc. Tested fine, cleaned and back online.
  • 15:19 RobH: srv146 was read-only. Rebooted, fsck, restarted.
  • 08:38 Tim: added FlaggedRevs stats update to crontab on hume
  • 08:03 Tim: running FlaggedRevs/maintenance/updateLinks.php on dewiki

August 26

  • 20:00 RobH: moved srv84 and srv85 from B4 to B3 rack.
  • 18:39 RobH: moved srv82 and srv83 from B4 to B3 rack.
  • 17:30 RobH: srv81 reinstalled and running apache. Needs ext store setup.
  • 16:35 RobH: srv103 restarted and synced.
  • 16:01 brion: srv103 serving pages with stale software but unreachable. needs to be shut down
  • 14:53 RobH: reinstalled db10 for misc. db tasks.
  • 13:27 Tim: disabled some user account on otrs-wiki
  • 11:15 mark: Added coronelli to search pool 3 on lvs3
  • 00:26 RobH: fixed my own typo in redirects.conf, pushed, graceful all apache.
  • 00:15 RobH: pushed some fixes on InitialiseSettings.php for a private wiki.

August 25

  • 23:07 brion: enabled write API, let's see what happens!
  • 22:41 brion: query.php disabled as scheduled.
  • 22:07 brion: a SiteConfiguration code change broke upload dirs for a bit. reverted it.
  • 20:15 brion: set wgNewUserSuppressRC to true; it was false, unsure why, and it's annoying
  • 14:30 RobH: pushed dns changes to langlist to support cz. as well as a number of other langlist redirects not added to dns.
  • 14:15 RobH: Fixed an error in my additions for the cz.wikistuff, pushed out the redirects to apaches.
  • 12:10 domas: mark stealing db10 for stuff
  • 11:00 domas: reenabled db10, added db14 to s1, db9 given away to non-core tasks, added full contributions load to db16 (as it has covering index)
  • 09:55 domas: reverted an instance where 'IndexPager' was causing filesorts... :)
  • 08:00 domas: cleaned up hume / diskspace, was full, added /a to updatedb prunepaths, apt-get clean too - 4.5G released
  • 08:00 domas: disabled db10 for db14 bootstrap
  • 07:36 domas: updating FlaggedRevs schema on ruwiki.
  • 02:26 brion: updating MW, including FlaggedRevs schema update (fp_pending_since, flaggedrevs_tracking)

August 24

  • 17:15 domas: removing db9 entirely, crashed, disk gone...
  • 07:20 Tim: deployed the TrustedXFF extension that I just wrote.
  • 02:56 Tim: removed db9 from the contributions, watchlist and recentchangeslinked query groups. Long running queries (2000 seconds) from IndexPager::reallyDoQuery and ApiQueryContributions::execute, probably needs index fixes. Removed general load from the remaining query group server, db7.

August 22

  • 21:34 RobH: will moved from A4 to A2.
  • 21:00 RobH: diderot unracked
  • 00:27 brion: FR feedback enabled on enwikinews as well
  • 00:24 brion: Deleting email record rows from cu_changes; some had slipped through before we disabled the privacy breakage

August 21

  • 23:47 brion: FlaggedRevs feedback enabled on test & labs
  • 23:35 brion: Enabled experimental HTML diff on test.wikipedia.org, en.labs.wikimedia.org, and de.labs.wikimedia.org
  • 18:17 RobH: Updated DNS entries to add a number of .cz domains. Also updated redirects.conf to support the added domains.
  • 11:43 Tim: installing GlobalBlocking
  • 02:42 Tim: returned db16 to general load, a less critical role
  • 02:30 Tim: installed mysql-client-5.0 on db11-16. Installed ganglia-metrics on thistle, db1, db4, db7, db12, db13, db14, db15, db16.
  • 02:20 Tim: offloaded query group read load from db16. System+user CPU disappeared.
    • Recovery spike in I/O shows that replication was suppressed due to read activity. Caught up in ~8 minutes.
  • 02:11 Tim: db16 is chronically lagged, probably overloaded with inflexible query group load
    • db16 shows high flat system+user CPU since ~01:05

August 20

  • 04:15 Tim: attempting to upgrade hume from Ubuntu 7.10 to 8.04 (see the sketch after this list)
  • 01:24 brion: experimentally lifting $wgExportMaxLimit from 1000 to infinity on enwiki -- testing hack to SpecialExport.php to use unbuffered query
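
The standard path for the 7.10 to 8.04 upgrade attempted above, assuming the usual Ubuntu tooling:

      apt-get update && apt-get install update-manager-core
      do-release-upgrade    # walks through the release upgrade interactively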

August 19

  • 08:38 Tim: done with lomaria
  • 07:42 Tim: taking lomaria out of rotation to drop non-s2a databases and change its replication to s2a-only.
  • 04:45 Tim: increased load on db13 to relieve db8, stressed by removal of lomaria from s2
  • 04:10 Tim: A hotlinking mirror, getting images from thumb.php, was being visited at high rate, DoSing our storage servers. Referer blocked (see the sketch after this list).
  • 03:50 Tim: ixia disk space critical, fixed
  • 03:45 Tim: Older s3 slave servers are showing signs of strain. Adding more s3 load to db11 to test its capacity.
    • db11 is fine at 47% load ratio, reporting 80-90% disk util, await 5-7ms, load ~6
    • 96% load ratio, reporting disk util ~90%, await ~6ms, load ~7.5. Wait CPU ~12%. Yawning in mock-boredom.
  • 03:37 Tim: lomaria was relatively overloaded. Adjusted loads, put it in an s2a role since we haven't had any s2a servers since holbach was decommissioned
  • 02:40 Tim: removed holbach, webster and bacon from db.php, decommissioned. Removed decommissioned servers from $wgSquidServersNoPurge.
  • 02:27 Tim: compiled udpprofile on zwinger, started collector. Firewalled port 3811 inbound, /etc/init.d/iptables save. Updated MediaWiki configuration. Updated report.py on bart.
  • 01:40 Tim: reduced apache "TimeOut" on srv38/39 from 300 to 10, to limit the impact of LVS flapping
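
A hedged sketch of the referer block from the 04:10 entry; the pattern, conf path and thumb.php match are illustrative, not the rule actually deployed:

      printf '%s\n' \
          'SetEnvIfNoCase Referer "badmirror\.example\.com" deny_hotlink' \
          '<LocationMatch "thumb\.php">' \
          '    Order Allow,Deny' \
          '    Allow from all' \
          '    Deny from env=deny_hotlink' \
          '</LocationMatch>' >> /etc/apache2/conf.d/hotlink-block.conf
      apache2ctl graceful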

August 18

  • 23:00 RobH: added the image scaling servers back into the apache node group and updated their config files. This fixes the thumbnail generation issue evident on both upload. and se.wikimedia (it may have existed elsewhere as well; in fact, it most certainly must have). All apaches restarted.

August 17

  • 22:30 jeluf: restarted apaches on srv38/39 due to user reports about broken thumbnails.

August 16

  • 13:20 mark: Reenabled ProxyFetch monitor on rendering cluster on lvs3, and set depool_threshold = .5.
  • 12:58 Tim: removed ProxyFetch monitor from rendering cluster in pybal on lvs3
  • 12:50 Tim: thumbnailing broke completely, at ~03:00 UTC. The apache processes on srv38/39 were stuck waiting for writes to the storage servers. Couldn't find the associated PHP threads on the storage servers to see if something was holding them up, so I tried restarting apache on srv38/39 instead. Suspect broken connections due to regular depooling by pybal

August 14

  • 18:55 domas: fixed db16 replication (see the sketch after this list)
  • 18:50 brion: db16 replication is broken -- contribs/watchlists/recentchangeslinked for enwiki stopped at about 4 hours ago
  •  ??? ??? db16 crashed
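
The actual fix isn't recorded; the usual first steps on a stopped MySQL 5.0 slave look like this, skipping a statement only when it is known to be safe:

      mysql -e 'SHOW SLAVE STATUS\G' | egrep 'Running|Last_Error'
      mysql -e 'SET GLOBAL SQL_SLAVE_SKIP_COUNTER = 1; START SLAVE'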

August 13

  • 17:10 Tim: Changed http://noc.wikimedia.org/conf/ to use a PHP script to highlight the source files from NFS on request, instead of them being updated periodically. Added a warning header to all affected files.
  • 06:17 Tim: Removed old ExtensionDistributor snapshots (find -mtime +1 -exec rm {} \;), synced r39273
  • 02:40 brion: fixed permissions on dewiki thumb dir -- root-owned directory not writable by apache worked for existing directories, but failed for the 'archive' directory needed for old-version thumbnails used by FlaggedRevs

August 12

  • 21:06 mark: Moved LVS load balancing of apaches to lvs3 as well, using a new service IP (10.2.1.1); see the sketch after this list
  • 18:10 brion: fixed up security config that disabled PHP execution in extension directories; several configs had this wrong and non-functional
  • 12:45 tfinc: removed /srv/org.wikimedia.dev.donate & /srv/org.wikimedia.donate on srv9 and removed the apache confs that mention them.
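
Roughly what the LVS layer ends up doing for the new apache service IP above; pybal drives this, and the real-server address and weight here are illustrative:

      ipvsadm -A -t 10.2.1.1:80 -s wrr                        # virtual service, weighted round-robin
      ipvsadm -a -t 10.2.1.1:80 -r 10.0.2.151:80 -g -w 10     # one real server, direct routing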

August 11

  • 23:53 mark: Moved traffic from Russia (iso code 643) to knams
  • 23:53 mark: Moved the rendering cluster LVS to lvs3 as well.
  • 22:45 mark: Deployed lvs3 as the first new internal LVS cluster host, and moved over the search pools to it using new service IPs (outside the subnet). The rest of the LVS cluster as well as the documentation are a work in progress - let me know if there are any problems.

August 10

  • 17:43 Tim: freed up another 100GB or so by deleting all dumps from February 2008.
  • 17:27 Tim: freed up a few GB on storage2 by deleting failed dumps: enwiki/{20080425,20080521,20080618,20080629}, dewiki/20080629.

August 8

  • 22:46 RobH: setup network access LOM for db13, db14, db15, & db16
  • 22:40 brion: set up 'inactive' group on private wikis; this is just "for show" to indicate disabled accounts, adding a user to the group doesn't actually disable them :)
  • 21:15 brion: can't seem to reach the 'oai' audit database on adler from the wiki command-line scripts. This is rather annoying; permissions wrong maybe?

August 6

August 5

  • 22:09 mark: Shutdown BGP session to XO for maintenance
  • 18:27 RobH: db14, db15, db16 installed with Ubuntu.
  • 18:24 brion: enabling flaggedrevs on ruwiki per [1]
  • 17:09 brion: enabling flaggedrevs on enwikinews per [2]
  • 6:20 jeluf: set wgEnotifUserTalk to true on all but the top wikis, see bugzilla

August 4

  • 05:58 brion: dewiki homepage broken for a few minutes due to a bogus i18n update in imagemap breaking the 'desc' alignment options

August 3

  • 14:15 robert: got reports about lots of failed searches on nl and pl.wiki, looks like diderot (again) failed to depool a dead server (rabanus), removed manually.

August 1

  • 21:05 brion: forcing display_errors on for CLI so I don't keep discovering my command-line scripts are broken _after_ I run them because they showed no errors and I thought they worked. :)
  • 06:39 Tim: wrote a PHP syntax check for scap, using parsekit, that runs about 6 times faster than the old one (comparison sketch after this list)
  • 04:58 Tim: installing PHP on suda (CLI only) for syntax check speed test
  • 01:46 Tim: removed db1 from rotation, it's stopped in gdb at a segfault.
  • 00:22 brion: aha! found the problem. MaxClients was turned down to 10 from default of 150 long ago, while the old prefix search was being tested. :) now back to 150
  • 00:19 brion: just turning off the mobile gateway on yongle for now, it just doesn't appear to be working at full load. (files moved to subdir -- in /x/ it works fine seemingly). Server doesn't appear overly loaded -- CPU and load are low -- just the requests stick.
  • 00:10 brion: installing APC on yongle, php bits are ungodly slow sometimes
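
For contrast with the parsekit check above (which compiles each file inside one PHP process via parsekit_compile_file()): an old-style check of the same kind spawns one php -l per file, something like this, with an illustrative path:

      find /home/wikipedia/common -name '*.php' -print0 |
          xargs -0 -n1 -P4 php -l 2>&1 | grep -v 'No syntax errors'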
