Server admin log/Archive 9
From Wikitech
Revision as of 05:46, 3 November 2006
November 3
- 05:42 Kyle: The apc is enabled and has anthony, bayle, isidore, and yongle on it.
- 05:35 Tim: traced unusual disk activity on srv38 back to the DeleteAlbertMailerDaemon job, in OTRS's GenericAgent. Changed the job to delete bounce messages which have arrived in the last hour, rather than doing a search of all 370,000 tickets.
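The 05:35 change amounts to bounding the search by arrival time instead of walking every ticket. A minimal Python sketch of that idea follows. It is not OTRS's GenericAgent configuration: the `recent_bounces` helper, the parallel `arrivals`/`senders` lists, and the "mailer-daemon" sender prefix are assumptions made for illustration.

```python
import bisect
from datetime import datetime, timedelta


def recent_bounces(arrivals, senders, now, window=timedelta(hours=1)):
    """Return indexes of bounce messages that arrived within `window` of `now`.

    `arrivals` must be sorted ascending, and senders[i] belongs to arrivals[i].
    Both lists are hypothetical stand-ins for a ticket table indexed by time.
    """
    cutoff = now - window
    # Binary search for the first message at or after the cutoff, so older
    # tickets are never examined.
    start = bisect.bisect_left(arrivals, cutoff)
    return [
        i
        for i in range(start, len(arrivals))
        if senders[i].lower().startswith("mailer-daemon")
    ]
```

With a sorted arrival index, the cost depends on how many messages arrived in the last hour rather than on the total ticket count.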
November 2
- 16:59 brion: disallowed all mailing list archives from robots.txt now
- 16:46 brion: got mailman-htdig working
- 15:00 mark: Created temporary channel #wikimedia-tech on irc.wikimedia.org. See you there?
- 14:23 mark: Deflecting some traffic from knams to pmtpa
- 14:20ish brion: upgrading mailman for htdig search
- 14:19 mark: Freenode is under DDoS
- 14:00 jeluf: irc.freenode.org does not resolve any longer. The CNAME points to chat.freenode.net, for which nslookup replies "*** Can't find chat.freenode.net: No answer".
- 09:43 brion: applying ipblocks schema updates
- 08:45 brion: rebuilt wikifr-l archives to suppress some messages due to a problem; unfortunately the numbering got thrown off by something much earlier in the archives, possibly the old 'from' bug. oh wells
November 1
- 21:43 river: scap broke blocking since db changes weren't applied, reverted PHP files from r17355
- 17:05 Tim: back to 3, small cluster couldn't handle it
- 16:10 Tim: back to 2 partitions. If the small servers can serve requests in ~200ms by hitting the disk, we may as well let them. srv38 and 39 will be better utilised by enwiki, which needs more CPU power allocated to it. We just have to be careful that the disk I/O on the small servers doesn't become saturated.
- 15:37 Tim: split search nodes into 3 partitions instead of 2.
- 15:22 Tim: sending dewiki search requests back to the "big" pool
- ~15:10 Tim: moved srv38 and srv39 to the search pool
- 14:20 Tim: started search index rebuild for all wikis
- 14:15 Tim: Inserted bfr (from /home/wikipedia/src/bfr) into the pipe in search index rebuilds. It seems to improve performance, by ensuring that MWSearchTool does not stall waiting for dumpBackup.php.
- 13:40 Tim: took srv37 out of apache rotation for lucene stuff
- 5:11 Kyle: srv144 has bad ram, will RMA.
October 31
- 16:55 brion: redirected sep11 to sep11memories.org
- 11:51 Tim: added "umask 002" to JeLuF's .bashrc
- 09:12 Tim: noticed that srv61 and srv67 are down, memcached instances with them. Brought in the spares.
- 04:45 Tim: set up srv145-149
- 04:26 Tim: srv144 crashed
- 03:57 Tim: setting up srv126,srv138,srv141,srv143,srv144,srv145
- 03:45 Tim: installed ganglia on srv121-145
- 03:36 Tim: set up apache on srv122
October 30
- 20:29 Kyle: srv146 - srv149 are available.
- 16:24 Tim: fixed ganglia
- 15:37 brion & mark: trying to fix ganglia, still borked
- 15:27 mark: Started Apache on zwinger
- 04:24 Tim: added bart to the trusted XFF list
October 29
- 16:24 Tim: locked sep11.wikipedia.org at Erik's request
October 28
- 14:45 Tim: removed dkwiki from all.dblist, old alias for da
October 27
- 14:36 brion: adding redirect & querycachetwo tables, not yet populated
- 05:04 Kyle: configured ipmi on srv121?? Maybe? I'm not sure how to test it.
- 04:56 Kyle: srv39 was off, I don't know why. I turned it on. Also a bunch of unreachable srv's were fixed. (Of the newest batch)
- 04:27 Kyle: sq1, and sq3 have Ubuntu Edgy. But need a password.
- 04:04 Kyle: Accidentally rebooted zwinger! Sorry!
- 03:37 Kyle: Replaced power supply in sq11, it's back up.
October 26
- 21:15 mark: Reinstalled yf1010 with Ubuntu Edgy, instead of Dapper. Install went ok, but needs a few more tweaks to the preseeding files to make it fully automatic again.
- 19:00 brion: tightened down friedrich, nfs /home no longer mounted
October 25
- 23:08 mark, kyle: Swapped sq11's mainboard, reinstalled it and brought it up as an upload squid
- 16:47 Tim: installed FSS on new apaches, added to install-modules51
- 16:40 Tim: running rebuildMessages.php
- 15:25 Tim: Set system-wide default for ssh ConnectTimeout to 5 seconds, on zwinger
- 14:30 Tim: finished user table schema changes
- ~13:30 Tim: switched masters to db2 and samuel.
- 12:52 mark: Fuzheado says PMTPA is blocked in China. Updated the GeoIP maps to make sure as many Chinese IPs resolve to yaseo
- 08:00 jeluf: Updated apaches 121-145, added to the pool, fixed startup scripts (use of eth2 instead of eth0/1). Still broken: srv122, srv126, srv136, srv138, srv141, srv145.
- 04:00 Kyle: New apaches were down because of poor power distribution. It's fixed now and they are back up.
- 03:19 Tim: starting user table schema changes
October 24
- 12:34 mark: Deployed a newer PyBal on pascal, avicenna, alrazi and yf1018
- 11:14 brion: wikibugs bot wasn't running; restarted it on goeje and added run-wikibugs to rc.local
- 06:23 brion: restarted postfix on leuksman.com; svn mails were stalled
- ~06:00 Tim: installed Dancer's dsh as ddsh on zwinger, changed scap and sync-file to use it. It shares perl dsh's node group files, via a symlink.
October 23
- 19:00 jeluf: added srv121 and srv123-srv134 to the farm. srv122 and srv135 are unreachable. srv136-145 died earlier during a "scap". I've no idea why.
- 16:13 mark: Users reporting image problems with IE in yaseo. Depooled dryas from the upload queue. What was it doing there and wtf wasn't it logged?
- 08:20 Tim: made scap faster by turning off "lazy backups" and using an rsync daemon on suda instead of cp -prfu over NFS. Set up scap to recompile and install texvc automatically.
- 07:18 Kyle: srv121-135 are available. srv143-145 fixed.
October 22
- 14:08 mark: Deployed a newer PyBal on pascal
October 21
- 18:00 jeluf, domas: installed apache&al on srv136-srv145. srv143 was already broken when we started, srv144 broke during the installation (had to reboot it, didn't come back)
- 17:00ish brion: hack-bumped the $wgStyleVersion again
- 17:55ish brion: tweaked mail servers on leuksman.com again
- 16:50ish brion: did a svn up & scap; there may be some css/js issues with the changes to section edit links. germans have broken js
- 16:15 brion: ldap is broken on srv144
- 15:24 brion: updated leuksman.com to PHP 5.2.0RC6
- 15:11 brion: disabling disused MWBlocker extension include; new boxen we're not installing the PEAR xml-rpc anymore since we don't use it anymore and the install kept breaking
- 14:58 brion: removed ganglia port and interface options from mwsearch.conf, trying to see if these get through ganglia... manual from rabanus does go through using gmetric without the specifiers on the command line
- 09:04 jeluf: created otrs-de-l, otrs-it-l
- ~05:25 Tim: synced files on srv63, was out of date. Initialised srv103 as a memcached hot spare.
October 20
- 13:30 Domas: enabled holbach, lomaria, ixia with higher loads.
- 03:48 Kyle: srv136 - srv145 are available for service.
- 00:36 Tim: noticed that srv68 is down, memcached instance included. Brought the hot spare on srv119 into rotation.
October 19
- 10:31 Tim: updating fedora mirror
- 00:09 Kyle: srv136 available. (More soon)
- 00:09 Kyle: srv54, srv55, srv63, srv66 rebooted. Bad raid controllers.
October 18
- 23:12 Kyle, Mark: csw1 uplinked to csw5.
- 21:46 brion: upload.wm.o dead in pmtpa
October 17
- 16:29 mark: Started Mailman on goeje
- 16:00 mark: goeje back up after a PM reboot request.
- 15:50 mark: Users reporting loss of session data. mctest.php reported srv55 down, which indeed doesn't reply to ping. Replaced its memcached slot by srv62.
- 15:18 mark: Because goeje went down, srv1 couldn't resolve DNS, which brought the entire cluster into dismay (fun). Made srv1 forward to zwinger,goeje (in that order). Recursing DNS really needs to be fixed.
- 15:00 mark: goeje went down
- 13:45 mark: Converted sq14 and sq15 to upload squids
- 07:01 jeluf: set up srv6 as thumb server, serving de/thumb, taking load from anthony, which is only serving en/thumb now
- 06:39 brion: enabled AntiSpoof extension for active prevention as well as logging
October 16
- 19:52 jeluf: restarted mwsearchd on coronelli
- 18:40 jeluf: moved thumbs/en/ to anthony, which is now serving thumbs/en/ and /thumbs/de/. Set up another HTCPpurger in the second page of the screen session.
- 14:38 mark: Increased swap size per COSS cache_dir from 5000 to 8000 on sq12 and sq13... After 4 days they had only a 4% i/o wait.
- 14:30 mark: Disabled Squid cache digests, as I don't believe they work well in our very dynamic environment, and may actually decrease cache efficiency.
- 12:20 mark: Squids were set to deny HTCP CLR requests from the pmtpa internal subnet, so purging didn't work in pmtpa. Fixed.
- 03:54 brion: updated viewvc on leuksman.com to 1.0.4-dev
- 02:09 Tim: installed FastStringSearch (fss)
- 01:04 Tim: installed gmetricd on sq2-10
- 00:43 Tim: ran updateArticleCount.php on the new wikis, to correct for a previous bug in the same script.
- 00:14 mark: Reinstalled yf1000 - yf1004 with Ubuntu, set them up as text Squids. Taken yf1019 out of rotation.
October 15
- 23:00ish brion: mysterious spike in apache cpu usage and segfaults, haven't figured out cause yet. reverting recent changes to mw to test
- 21:30 mark: Reinstalled yf1005 - yf1009 with Ubuntu, set them up as upload squids. Set up LVS on yf1018, pointed upload.yaseo at it...
- 18:54 mark: Changed MediaWiki's HTCP purge method from 'NONE' to 'GET' to make Squid 2.6 purge again
- 18:40 mark: Built a new squid-2.6.4-2wm1 .deb with debug symbols and --enable-stacktrace, and installed it on sq15
- 17:00 mark: Lots of Ubuntu Squids (with COSS) crashed around the same time. Restarted them.
- 16:42 brion: added charset header on 404 page to fix utf-7 silliness
- 16:15 mark: Fixed NTP on amaryllis. Y! has blocked UDP port 123, so SNAT to a high port...
- 14:20 mark: Creating two separate Squid groups with distinct default origin servers and "special destinations": text for MediaWiki content from the Apaches, and upload for static content from Amane and the thumb servers. This allows us to tweak the two very different Squid groups much better. Each group has its own subdir under /h/w/conf/squid, along with a separate subdir with a backup of the old setup. Yaseo doesn't have its own upload group yet, but I hope to rectify that today.
October 14
- 23:30 mark: Installed Ubuntu on clematis, it's back up as a Squid
- 11:30 jeluf: migrated upload.wm.o/wikipedia/de/thumbs/ to anthony, migration of /wikipedia/en/thumbs/ still running.
- 08:24 brion: [1] was somehow stuck in sq21's cache as a 301 to wikimediafoundation.org. UDP multicast packets to purge it could be seen when using ?action=purge, but had no effect. manually sending a PURGE over port 80 cleared it successfully
- 07:35 brion: adjusted 'missing wiki' screen to send a 404 response instead of 200; should keep some transient errors out of caches more nicely
- 07:29 brion: adding wikimania2007.wm.o to dns, preparing for wiki setup
- 07:03 brion: recompiled utfnormal extension on benet against proper ICU headers *cough*, restarted dump thread 4
- 06:48 brion: recompiled utfnormal extension on benet w/o -fPIC, restarted dump thread 4
- 06:12 brion: started pmtpa data dumps
- 05:00 Kyle: New RAM in srv74, let's see how it does.
- 04:48 brion: migrating some old dump data from benet to amane to make room for next dump run
- 04:50ish brion: unmounted broken khaldun mount from benet
October 12
- 18:00 mark, jeluf: added thumb server bacon. Serves upload.wikimedia.org/wikipedia/commons/thumb/[0-3]/*. Currently, the squid.conf is a live hack. The next deployment will break this again, unless squid.conf.php is fixed.
- 17:05 mark: Set originserver on all parent cache_peers in squid.conf. This makes Squid treat parents as origin content servers instead of proxy caches, and therefore enables Connection: keepalive and non-proxy GET requests.
- 15:10 mark: amane overloaded, tweaked its TCP settings a little more
- 07:39 Tim: secure.wikimedia.org back up, courtesy of mod_proxy.
- 07:00 jeluf: installed lighty on bacon, changed thumb handler to save images it got from the apaches to the FS. 0/* has been copied from bacon, 1/* currently running. Todo: HTCP listener to delete thumbs
- 05:57 Tim: disabled wiki stuff on secure.wikimedia.org temporarily, bart was overloaded. Will try to find a permanent solution involving proxying.
- 03:50 brion: started apache on leuksman.com, died again. :(
- Set somaxconn = 1024 and tcp_max_syn_backlog = 4096 on the old image squids, and on amane.
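The 18:00 entry describes sending requests for commons/thumb/[0-3]/* to bacon. A rough Python sketch of that path-based split is below. It is not the squid.conf rule itself: the `pick_backend` function is hypothetical, and treating amane as the default origin for everything else is an assumption based on the neighbouring entries.

```python
import re

# First hash directory under commons/thumb, e.g. /wikipedia/commons/thumb/2/2a/...
THUMB_RE = re.compile(r"^/wikipedia/commons/thumb/([0-9a-f])/")


def pick_backend(path, thumb_host="bacon", default_host="amane"):
    """Choose an origin server for an upload.wikimedia.org request path."""
    match = THUMB_RE.match(path)
    # Only hash directories 0 through 3 are routed to the thumb server.
    if match and match.group(1) in "0123":
        return thumb_host
    return default_host
```

Keeping the rule in one function (or one generated config template) is what avoids the "live hack" problem noted in the entry, where the next deployment would overwrite a hand-edited file.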
October 11
- 23:40 mark: Made sq12 and sq13 image squids
- 22:30ish brion: a recently committed bug in ObjectCache caused the db to be used instead of memcached, grinding everything to a halt
- 19:30 jeluf: copying amane's wikipedia/commons/thumb/* to bacon:/export/upload/wikipedia/commons/thumb using rsync on bacon, bwlimit 500
October 10
- 23:00-* mark: Upgrading the new Squids sq12..sq30 to squid-2.6.4-1wm4 to enable COSS
- 19:40 mark: Set connect-timeout=5 on Squid backend requests
- 17:40 mark: Reduced amane's PHP processes from 64 to 32
- 17:30 mark: Upgraded amane's lighttpd to 1.4.13.
- 11:25 mark: Set up sq29 with COSS as well, though different settings than sq30, to compare.
- 11:00 mark: Started Squid on several of the new servers. Squid had disappeared...
- 11:00 mark: Set up sq30 with COSS filesystems, using devices /dev/sda6, /dev/sdb, /dev/sdc, /dev/sdd.
- mark: Set up an Ubuntu Dapper mirror on khaldun
- 07:54 brion: took stats.wikimedia.org offline; contains private info, needs scrubbing
October 9
- 21:25 mark: Set 'refresh-pattern ignore-reload' on upload squids
- 21:03 brion: removed anthony from mediawiki-installation group
- 20:35ish brion: disabled FancyCaptcha; now using SimpleCaptcha. seems to be lighter on amane's NFS for now
- 20:15ish brion: restarted many pmtpa upload squids with high InActConn backed up in lvs
- 18:00 mark,kyle: Reinstalled khaldun as dedicated install server / archive mirror
- 18:00 kyle,jeluf: Rebooted holbach. After reboot, mysqld's error log shows duplicate key errors while replicating. Shut down mysqld.
- 03:27 brion: disabled obsolete firewall rules on maurus; was preventing rsyncing of search index updates, stopping the ex-yaseo wikis from being searchable
October 8
- 15:02 Tim: Doubled the memcached instance count. srv104-118 brought into service with srv119 spare.
- 08:41 Tim: Stepped clocks on sq1-8, which were off by 8 hours. This was messing up ganglia. In the process of fixing NTP.
- 03:45 Tim: zwinger's disks were very overloaded due to the PMTPA gmetad. The data size is only 120MB, but apparently it was syncing very often. I moved the rrds to a tmpfs with an hourly rsync to disk.
- 02:38 Tim: holbach is down, took it out of rotation
- 02:34 Tim: removing old static HTML dump backup on srv35
- 02:12 Tim: Fixed disk space exhaustion on coronelli. MWDaemon.log was to blame.
- ~02:00 Tim: installed gmetricd in various places. diskio_* metrics should now be available.
October 7
- 22:00 jeluf: restarted db's on ixia and db1, with help of domas. Running 4.0.27 on db1
- 19:30 jeluf: Shut down mysql on ixia, copying DB to db1
- 17:30 jeluf: rebooted sq1, disabled squid. Mark depooled it from the LB
- 15:30 (Squid on) sq1 is down and being odd again
- 13:00 jeluf: rebooted sq1
- 03:45 Tim: removed sq11 from LVS on avicenna manually, it was down again and pybal didn't remove it.
- 03:35ish - timeouts connecting to rr.pmtpa
- 03:20 Kyle: db1 is now up and ready to be setup.
October 6
- 20:15 mark: Brought sq1 back up. The reason PyBal didn't depool it last night, not even during a restart, was that PyBal was in dry run mode so that it prints ipvsadm commands but never actually executes them. Apparently it has been inactive for weeks. Sorry!
- 19:00 jeluf: unmounted khaldun:/usr/local/upload on all apaches, removed from fstab
- 17:15 mark: Set up imap.wikimedia.org (which points to my private colocated server) as a temporary solution. Various @wikimedia.org aliases will be redirected.
- 17:15 jeluf: restarted apache on bart. Nagios and OTRS were not responding
- 01:33 brion: sq1 switchport reenabled; still hasn't fully shut down.
- 01:20ish tim manually removed sq1 from lvs; pybal wasn't removing it for an unknown reason
- 01:07 brion: rebooting sq1, still haven't figured out wtf is wrong
- 00:55 brion: removed sq1 from pybal list while trying to kill its mad squid
- 00:50 brion: restarting squid on sq1; insane load (30+), not responding
- 00:44 brion: something wrong with upload.wikimedia.org; investigating. trouble connecting to pybal on alrazi; is it a problem with pybal or backends?
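The 20:15 entry explains that PyBal printed ipvsadm commands without ever executing them because it was left in dry run mode. The sketch below shows the general pattern in Python. It is not PyBal's code: the `CommandRunner` class is hypothetical, and the addresses in the usage example are placeholders.

```python
import subprocess


class CommandRunner:
    """Run external commands, or only print them when dry_run is set."""

    def __init__(self, dry_run=False):
        self.dry_run = dry_run
        self.log = []  # every command requested, executed or not

    def run(self, argv):
        line = " ".join(argv)
        self.log.append(line)
        if self.dry_run:
            # Nothing is executed here, so the real system state never changes.
            print("DRY RUN:", line)
            return None
        return subprocess.run(argv, check=True)


if __name__ == "__main__":
    runner = CommandRunner(dry_run=True)
    # Placeholder service and real-server addresses, for illustration only.
    runner.run(["ipvsadm", "-d", "-t", "192.0.2.1:80", "-r", "192.0.2.10:80"])
```

Because a dry run looks healthy in the logs, one reasonable safeguard is to emit a prominent warning at startup whenever the flag is set.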
October 5
- 21:32 brion: resyncing srv11 common files; all were missing!
- 21:27 brion: wiped old copy of fundraising report scripts w/ redirect to new location
- 19:30 mark: Set up ingress filtering on port e8/1 and e8/2 of csw5-pmtpa
- Tim: Set up ganglia 3.0.3, more or less starting from scratch with the configuration. We now have a hierarchical arrangement of grids, with knams and pmtpa in the system at present, yaseo will perhaps follow later if we can get the ACLs set up.
- 01:23 Tim: fixed replication on srv75. It's a read-only cluster so it's not critical. Had to skip some deleted binlogs, they were probably empty anyway. MAX(blob_id) looks fine.
October 4
- 23:43 brion: starting search rebuilds for ex-yaseo wikis on maurus
- 22:30 mark: Moved the console server to csw5-pmtpa and Wikia's network, so we have out of band access. Also moved the last bunch of machines off csw1-pmtpa.
- 22:00 jeluf, kyle: hot-replaced amane's faulty drives, started rebuilding the RAID.
- 21:26 jeluf: gzipped binlogs 1 and 2 on adler.
- 17:40 jeluf: rebooted srv98,srv93,srv87,srv109 since their apaches locked up a few minutes after being restarted
- ~13:15 Tim: albert was hanging, smtpd down. Mark's reboot -f attempts weren't working, so I did echo b > /proc/sysrq-trigger, which did the trick. Came up without the right VIPs; I fixed it temporarily, Mark will fix it permanently.
- 09:20 Tim: postfix on albert had been broken since 23:56, restarted.
October 3
- 22:30 mark: Installed Ubuntu on yf1005 (used it as a testing host)
- 15:25 Tim: Deployed new external storage: srv87-89 as cluster8 and srv90-92 as cluster9
- 12:14 mark: Deployed sq21..30 as text squids to see if brute power solves the TCP open problem.
- 11:53 mark: zwinger is not letting me log in. Stalls after "entering interactive session."
October 2
- 18:50ish brion: set up www.wikibooks.org portal
- 15:42 brion: disabling writes to cluster6; it's overloaded
- 15:15ish overload on ES
- 14:40 Tim: srv54 went down, replaced its memcached instance with srv68
- 12:40 mark: Made zwinger external only by disabling eth1 and changing the default gateway to 66.230.200.10.
- 03:00-07:00 kyle, tim, jeluf, river: suda broke, zwinger broke. rebooted suda, moved zwinger's dns resolver to goeje (temporary only)
October 1
- 13:00-20:00 mark, bw, river: moved uplinks over to csw5, set up BGP and began advertising our network. brief downtime due to router breaking.
- 13:51 Tim: Fixed uploads on new wikipedias. I also fixed the absence of a spoofuser table, earlier today.
Archives
- Server admin log/Archive 1 (2004 Jun - 2004 Sep)
- Server admin log/Archive 2 (2004 Oct - 2004 Nov)
- Server admin log/Archive 3 (2004 Dec - 2005 Mar)
- Server admin log/Archive 4 (2005 Apr - 2005 Jul)
- Server admin log/Archive 5 (2005 Aug - 2005 Oct)
- Server admin log/Archive 6 (2005 Nov - 2006 Feb)
- Server admin log/Archive 7 (2006 Mar - 2006 Jun)
- Server admin log/Archive 8 (2006 Jul - 2006 Sep)