Server admin log/Archive 9
From Wikitech
Revision as of 07:10, 10 February 2006
February 10
- 07:10 kate: upgraded zwinger to nfs-utils 1.0.8-rc2 so the Solaris NFS client doesn't crash it
- 05:30 Solar: bart back up at 207.142.131.227 (With FC3, let me know if you wanted FC4)
- 04:23 Solar: larousse is gone, no warranty. I might scavenge a hard drive from a bomis server to replace its drive.
- 04:23 Solar: Taken anthony for RMA
February 9
- 17:25 brion: enabled emergency captcha and blocked some IP. robot or something.
- 06:30 Solar: hydra the new server is up at 10.0.0.201
- 04:54 Solar: ixia back up.
- 02:13 brion: rebuilt and reenabled interwiki cache
- 01:55 brion: disabled interwiki cache; it doesn't seem to handle removal of the cache file, and there's no obvious way to clear the cache.
- 01:45 brion: interwiki map now protected; for some reason somebody left this unprotected even though it gets updated on an unattended basis, and somebody decided to add javascript: to it. nice. updated cache epoch to ensure things are cleared
- 00:12 brion: restarted apaches; odd 'bad title' and failed load errors reported on srv12, restart cleared it
February 8
- 19:20 brion: larousse dead, doesn't come up on boot.
- 19:00 brion: benet / briefly filled, but nothing seems to have gone awry with the dump. cleaned some space.
- 18:30 brion: had larousse rebooted since its root filesystem doesn't work, worth a shot. may not be coming back up
- 18:00 brion: larousse is down since yesterday, nobody logged it.
February 7
- 23:57 brion: running cleanupWatchlist on pmtpa
- 07:23 brion: namespaceDupes on ta wikis for bugzilla:4889
- 05:40 Tim: Recompiled ImageMagick from the source RPM, with --with-quantum-depth=8. Installed on all apaches.
February 6
- 23:00 jeluf: Added new "Urgent-en" queue to OTRS
- 22:30 jeluf: Restarted sage and iris. Changed /a to ext3 to reduce fsck time.
- 21:00 jeluf: Restarted lily, purged cache
- 19:00 jeluf: Restarted load balancer on pascal, rebooted mayflower
- 18:50 mark: Revived clematis
- 00:05 mark: Created mailinglist chaptercommittee-l by request of Delphine.
February 5
- 02:43 brion: nowikinews was duplicated in all.dblist; cleared
February 4
- 21:38 brion: started data dumps in pmtpa, now including progress/ETA for xml dumps
February 3
- 01:50 brion: started fill-in dumps on srv31 again; now using local temp dir for stub dumps in the hope it won't mysteriously fail
- 01:30 brion: started dumps on yaseo
February 2
- 19:09 brion: compiling php 5.1.2 on srv31
- 19:00 brion: mark rebooted pascal for reasons unknown
- 07:30 brion: started makeup dump runs on pmtpa databases which had dump failures. unsure of cause still...
- 03:02 brion: testing fixes to yahoo dump gen
- 01:00 brion: squids in yaseo are way into swap, slow. trying some restarts
February 1
- 23:48 brion: trimmed a message from wikimediafr-l logs for privacy by request
- 20:35 brion: srv10 back up and ips put in service
- 19:35 brion: srv10 down; squid errors
- 01:04 brion: adding cfp.wikimania.wikimedia.org redirect for those wikimaniacs
January 31
- 22:00 brion: hewiki, huwiki, iawiktionary dumps report failure in full-history dump. checking log for iawiktionary showed an XML error in the stub load partway through, but rerunning the command to a test dump was successful. cause unknown
January 30
- 23:45 brion: disabled blank passwords on wikis
- 23:00 mark: Upgraded pybal to a newer version on pascal
- 22:20 brion: started a refreshLinks for itwiki; some major category was broken by a bogus template
- 19:30 brion: installed APC for srv13-30. had to reduce apc shm size to 30 on i386 boxen. temporarily used a cvs checkout of apc, in /h/w/src/apc visibly
- 19:15 brion: trying to get APC installed on the machines recently upgraded to php 5.1
- 18:30 brion: disabled accesslog on amane's lighty
January 29
- 11:10 brion: fixed externallinks table on leuksman.com wikis :P
- 11:06 brion: enabled captcha on remaining non-wikipedias, so all small sites covered. large sites still off while the smaller ones collect live test data. (added captcha to new user form a couple hours ago)
- 07:15 Tim: started upgrading srv11-30 to PHP 5.1.2
- 06:33 Tim: fixed secure.wikimedia.org
- ~05:00 Tim: Upgraded srv12 to PHP 5.1.1. Working on srv11.
January 28
- 10:55 brion: enabled experimental captcha on small wikipedias (all except the top 20 most edited and yaseo) to get some more test data
- 05:52 brion: added VfD/AfD entries to robots.txt, bugzilla:4776
January 27
- 22:50 brion: running captcha generation test on amane
- 22:45 brion: amane's root partition filled with 41 gigs of lighty logs. :) cleared out, restarted lighty.
- 22:17 brion: got srv63 updated php modules. Note: it's using dba as built-in, not .so module. A warning on Apache start about missing the .so is normal until we get the rest updated this way.
- 21:48 brion: added '--with-cdb --with-gdbm=/usr' to install-php51 script
- 21:43 brion: trying to fix srv63. why do we have these things turn on apache on boot? it's incredibly stupid; they end up broken
- 21:06 Solar: srv63 back up
- 02:29 brion: started refreshLinks.php on yaseo, running on amaryllis
- 02:28 brion: ran update.php to update schema on yaseo wikis, which were forgotten
- 01:58 Tim: fixed spam blacklist and re-enabled it
- 01:36 Tim: started refreshLinks.php, running on srv31
- 00:59 Tim: Updated schema, enabled externallinks table
January 26
- 09:29 brion: disabled spam blacklist; more reports of all kinds of things triggering blacklist for no apparent reason
- 01:09 brion: got ImageMagick 6.2.6 installed everywhere. bleh.
January 25
- 15:38 ævar: Added a portal namespace & portal talk namespace to svwiki and ran php maintenance/namespaceDupes.php svwiki --fix to fix the one resulting conflict:
Checking namespace 100: "Portal"
... 1 conflicts detected:
... 209565 (0,"Portal:Musik") -> (100,"Musik") Portal:Musik
... resolving on page... ok.
- 09:59 brion: postfix was stuck; killed (zombies, kill -9 needed), restarting
- 01:18 brion: added FollowSymLinks and mime type for .7z on download-yaseo
- 01:00 brion: enabled indexes on download-yaseo
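The conflict in the 15:38 entry arises because a page stored in the main namespace (0) under the title "Portal:Musik" collides with the newly added namespace prefix. A minimal Python sketch of that kind of check follows. It is not the code of namespaceDupes.php; `find_prefix_conflicts` and the tuple layout are hypothetical, chosen only for illustration.

```python
def find_prefix_conflicts(pages, ns_id, ns_name):
    """Find main-namespace pages whose title is shadowed by a new namespace.

    pages: iterable of (page_id, namespace, title) tuples (hypothetical layout).
    Returns a list of (page_id, (old_ns, old_title), (new_ns, new_title)).
    """
    prefix = ns_name + ":"
    conflicts = []
    for page_id, namespace, title in pages:
        # Only pages in the main namespace (0) whose title begins with
        # the new prefix are affected.
        if namespace == 0 and title.startswith(prefix):
            new_title = title[len(prefix):]
            conflicts.append((page_id, (0, title), (ns_id, new_title)))
    return conflicts


# Sample data: the first row mirrors the log entry, the others are made up.
pages = [
    (209565, 0, "Portal:Musik"),
    (1, 0, "Musik"),
    (2, 4, "Portal:Test"),
]
for page_id, old, new in find_prefix_conflicts(pages, 100, "Portal"):
    print(page_id, old, "->", new)
```

Resolving a conflict then means moving the page to the new (namespace, title) pair, which is what the `--fix` run in the entry reports doing.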
January 24
- 22:55 brion: restarted squid on srv8; it was serving lots of error pages to people for unknown reason, seems happier after
- 06:24 Tim: Updated /h/w/b/foreachwiki. Started running cleanup.php on all wikis.
January 23
- 19:55 brion: disabled digests option for all users on daily-article-l by request (list admins disabled digests)
- 09:13 brion: enabled APC (from HEAD) on leuksman.com
- 03:40 brion: dba module needs to be enabled on secure.wm.o
January 22
- 23:30 brion: syncing fedora-extras from a mirror in .jp; added to sync-fedora-mirror.sh script
- 23:10 brion: fedora-extras seems to be missing from fedora mirror in yaseo; fedora-extas.repo points to the local main fedora repo mirror which doesn't help
- 22:30 brion: restarted dump run in pmtpa; PHP utfnormal extension enabled to speed up non-Latin dumps
- prefetch was actually working ok once i got into the debug log to watch. slowness was from not loading utfnormal from dumpTextPass. now controlled by WIKIDEBUG env var at CommonSettings level
- 21:30 brion: ragweed down, no OTRS (mark rebooted it shortly after)
- 12:08 brion: aborted dumps on pmtpa and yaseo pending investigation
- setting WIKIDEBUG env var causes segfault in php on srv31. what the hell
- 11:12 brion: prefetch didn't work due to broken symlinks. restarting on pmtpa
- 10:00 brion: running dumps on srv31 in pmtpa
- 06:24 brion: running another test dump on yaseo; will go ahead and run one on pmtpa soon. setting up to use srv31 as the dump runner
January 21
- 21:25 mark: Half the knams servers were down, at which point PyBal decided not to depool any more servers. Consequence is that most traffic is attracted by the down servers in LVS, and the site is more or less down. Fixed it by commenting out the down servers in /etc/pybal/squids. (PyBal will reload that file every minute)
- 19:15 brion: disabled interwiki cdb cache on yaseo wikis. domas forgot to install the required php module
- 18:10 domas: enabled interwiki cdb cache, cleaned logs on apaches
- 15:00 domas: installed dba extension (--with-cdb --with-gdbm=/usr) all around, will make use of it soon.
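The 21:25 entry describes a depool threshold: once too many servers have failed, the monitor stops depooling, so dead servers stay in rotation. A rough Python sketch of that behaviour follows. It is not PyBal's code; the `Pool` class, the `min_pooled_fraction` parameter and its value of 0.5 are assumptions for illustration.

```python
class Pool:
    """Toy server pool that refuses to depool below a minimum size."""

    def __init__(self, servers, min_pooled_fraction=0.5):
        self.servers = list(servers)
        self.pooled = set(self.servers)
        self.min_pooled_fraction = min_pooled_fraction  # assumed value

    def report_down(self, server):
        """Depool a failed server; return True if it was actually depooled."""
        if server not in self.pooled:
            return False
        remaining = len(self.pooled) - 1
        # Refuse to depool if too few servers would stay in rotation.
        if remaining < self.min_pooled_fraction * len(self.servers):
            return False  # server stays pooled although it is down
        self.pooled.discard(server)
        return True


pool = Pool(["a", "b", "c", "d"])
results = [pool.report_down(name) for name in ("a", "b", "c")]
print(results, sorted(pool.pooled))
```

In this sketch the third failure is not depooled, so the dead server keeps receiving traffic, which matches the symptom in the entry. Removing it from the configured server list by hand, as done there, shrinks the denominator.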
January 20
- 01:45 brion: syncookies all around.
- 01:33 brion: ah, the old tcp_syncookies. resolved.
- 01:11 brion: hella slow squids on .246/.247/.248
- srv8. lots of suppressed messages in syslog, on the order of 5k/second. BUT WHAT ARE THEY
January 19
- 21:25 mark: Resurrected hawthorn
January 18
- 08:18 brion: running another dump test on yaseo
- 01:58 brion: saw some breakage with LanguageZh_hk; its deps file was missing one dep (Zh_cn) which I've now added. at least it's consistent with theory so far :D
- 00:20 brion: added dependency-loading stubs for language and skin classes that need them. hopefully will help with http://pecl.php.net/bugs/bug.php?id=6503
January 17
- 20:00 brion: reenabled DoubleWiki extension on wikisource, it seems to work now
- 17:35 Tim: someone commented on our APC bug report that the problem seemed very similar to this bug, which was apparently fixed in CVS last October. There hasn't been a release since then. So I upgraded APC to CVS on the cluster and switched off the initEncoding hack. Fingers crossed.
- 11:20 brion: running an experimental job of the new backup script on yaseo. output and some control features need some more work, but it should at least pump out some files.
- tried to set up download-yaseo.wikimedia.org vhost on amaryllis rooted right at the public backups dir, but it's not working right for some reason. *shrug*
- 06:10 brion: set up a log of Mozilla and Google Accelerator prefetch requests in /h/w/logs/x-moz.
- Sasa^Stefanovic reported unexpected reverts happening when visiting user contribs pages, had Google Accelerator installed and turned it off on my request.
- Haven't yet found confirmation of such a bug w/ google accel, but I am seeing requests made from their proxies which include the fragment identifier which is odd. Emailed google about it.
- 01:20 brion: reverted $wgMimeDetectorCommand back to default. Setting to 'file -bi' broke SVGs on the site.
- 00:35 brion: installed hack for bugzilla:4635, safari breakage on pages with '.gz'. these pages are sent without gzip encoding to avoid triggering
- 00:29 brion: testing, seems that test.wikipedia.org does NOT use the local nfs version of CommonSettings.php
- 00:10ish brion: set mime detector to 'file -bi' on duesentrieb's advice
January 16
- 16:15 mark: Deployed my new lvsmon-like LVS script PyBal on pascal, in /usr/local/pybal/
- 07:16 Tim: added a hack to Setup.php to automatically clear the APC cache if a language class is missing its parent.
January 15
- 21:51 brion: several instances of an odd error reported:
- Jan 15 21:04:32 srv38.pmtpa.wmnet httpd: PHP Fatal error: Internal error: Failed to retrieve the reflection object in /usr/local/apache/common-local/php-1.5/includes/ProxyTools.php on line 133
- This is some kind of PHP5 constructor error [1] which should never occur. The referenced line is a 'global' declaration in a top-level function. Did an apache restart to try clearing caches...
- 21:27 brion: enabled semi-protection on jawikinews (bugzilla:4608) and eswiki ([2])
- 20:24 brion: default logo on all wikiquotes was the locally-uploaded copy, but many don't have one. changed to the en.wiki copy as default
- 20:22 brion: moved en.wikiquote uploads to where they belong, working.
- 20:19 brion: found that en.wikiquote uploads aren't working properly; the dir is a symlink into /home which is no longer mounted on amane. need to rearrange the files...
January 14
- 16:18 brion: did extra sync; some old files stuck on servers or something (for instance SpecialUserlogin, with the broken password button)
- 15:15 brion: parser cache bug :( updated wgCacheEpoch and did a few manual squid purges of main pages
- 00:30 brion: trimmed obsolete funddrive dir from docroot/foundation (didn't work anymore, superseded by fundraising.wikimedia.org)
January 13
- 20:10 brion: broke some stuff for a couple minutes trying to make clone() work; now have a wfClone() for PHP4 compat until we finish killing PHP4
- 12:35 Tim: started refreshLinks.php with template redirect fix in place
- 08:15 brion: reverted experimental anti-bot hacks to login page; it was breaking 'mail new password'
- 07:30 jeluf: added info-es alias on zwinger, forwarding to OTRS
- 07:00 jeluf: hawthorn rebooted
- 06:30– Tim: second attempt at upgrading to PHP 5. Watching CPU stats closely this time.
- 06:17 Tim: added access_log to logrotate.conf, up to 10 GB will be stored
- 06:03 Tim: Amane's root partition was full due to 40 GB access_log. Deleted it and restarted lighttpd.
- 05:50 Tim: Put copy of skeleton /home on amane
- 05:15 Tim: unmounted /home on amane. Amane's network out shot up.
- 04:25 Tim: killed updatedb on amane and removed it from cron.daily
- 01:05 brion: briefly broke redirects when upgrading; forgot index.php had been reverted temporarily during yesterday's excitement.
January 12
- 20:00 jeluf: changed Dutch wikis to allow patrolling by users, not only by sysops.
- 19:30 jeluf: rebooted fuchsia, purged squid cache.
- 19:14 ævar: I discovered that viwiki has made an extension to the software in JavaScript. I did a quick security review of it and it doesn't appear to be evil(TM) in any way. It's basically an input method written in JavaScript (docs in Vietnamese); for example, try going to their sandbox, select "Tự động" or "Telex" and type "aw" in the input box, and it'll be converted to "ă". Still, more eyes on the source code, Monobook.js and Him.js (main program), couldn't hurt given some of the evil javascript we've been removing recently.
- 11:00- Tim: Upgrading PHP on srv31-70 to PHP 5.1.1
- 04:35 brion: installing xdebug on all apaches so it's available
- 04:20 brion: with xdebug extension was able to limit recursion within php and got a stack trace pointing to Image.php svg thumb rendering. bug in tim's recent changes was found to be the culprit. reverted Image.php while working
- 03:42 brion: monitoring very high apache load situation with logs of segfaults.
- Appears to be some recursion -> segfault in PHP (PHP crash backtrace); may be recursion in user-level function
January 11
- 22:30 jeluf: rebooted ragweed after crash
- 22:00 mark: Added srv71-160 to DNS
- 21:00 jeluf: installed srv71...78. 79 and 80 need to be rebooted, probably using the old kernel.
January 10
- 21:19 brion: put paypal donation form back onto fundraising.wikimedia.org/ongoing top/year/month-level pages
- 21:15 brion: added no.wikinews
January 9
- 22:39 hashar: fix project namespace for bn: and csb: languages
- 21:56 hashar: ocwiktionary is now case sensitive.
- 21:56 brion: switching php error logs from local files to syslog, which should go to zwinger and include the hostnames
- 21:48 hashar & nikerabbit: fixed ga: project namespaces (now use genitive)
- 19:09 brion: where the hell is the documentation on external storage servers? srv34/srv33/srv32 aren't even documented -- they're listed as apaches.
- They are apaches, all external storage servers are dual-purpose. A list of external storage servers can be found in db.php. -- Tim 03:40, 10 January 2006 (PST)
- 17:45 brion: noticed secure.wikimedia.org is broken:
- Error in numRows(): SELECT command denied to user: 'wikiuser@goeje.wikimedia.org' for table 'blobs'
- 17:00 brion: we have mysterious huge load on apaches, started about 5 hours ago. restarted all apaches to see...
- 11:10 Domas: parser cache set to 2 weeks
- 05:48 Tim: enabled direct external storage on enwiki
- 04:01 Tim: enabled direct external storage on meta, as a pilot.
January 8
- 17:37 brion: odd db access error about 40 minutes ago:
- Sun Jan 8 16:56:02 UTC 2006 srv68 RecentChange::markPatrolled 10.0.0.101 1146 Table 'nlwiki.recemtchanges' doesn't exist (10.0.0.101) UPDATE `recemtchanges` SET rc_patrolled = '1' WHERE rc_id = '2429501'
- the source for this file on this server looks ok; no memory errors in /var/log/mcelog
- very odd
- 12:16 hashar: un hardcoded languageNV ns_project
- 12:05 hashar: un hardcoded languageOC ns_project (bug 4526)
- 05:30 brion: adding CNAME download-yaseo.wikimedia.org to amaryllis; yaseo dumps will be there ...
January 7
- 18:50 brion: running test dump for yahoo's abstract thingy for enwiki on benet (from samuel)
- 14:37 hashar: recached special:disambiguation for all pmtpa databases.
- 14:30 hashar: WARNING ran cvs up. That raised a lot of conflicts. Attempting to solve them.
- 01:30 brion: added user throttle to en in response to registration flood
- deleted some of the crud accounts
- extra live hacks pending captcha later
January 6
- 14:42 Tim: Restarted squid on ragweed, was refusing connections. Four knams squids are currently down.
- 13:00 jeluf: destroyed attachments of a posting on wikimediach-l (personal CV) upon Delphine's request
January 5
- 22:30 mark: Added email alias for Monica in wikimedia.org
- 21:35 brion: put a CACert-issued SSL cert on secure.wikimedia.org
- 21:25 brion: added tr.wikisource (bugzilla:4333) and is.wikisource (bugzilla:4471)
- 21:00 mark: sq1 seems broken:
scsi3 (0:0): rejecting I/O to dead device
sde : READ CAPACITY failed.
sde : status=0, message=00, host=0, driver=04
sde : sense not available.
scsi3 (0:0): rejecting I/O to dead device
sde: Write Protect is off
sde: Mode Sense: 00 00 00 00
sde: assuming drive cache: write through
sde:<3>scsi3 (0:0): rejecting I/O to dead device
Buffer I/O error on device sde, logical block 0
- mark: Fixed Fedora mirror on Albert, FC4 mirror is ok now
- 18:20 brion: unblocked faleg.org from squids leech list, swears to do good (and moved stuff to tools server)
- 17:05 brion: changed SSL cert for wikitech to one signed by CACert
- 16:40 brion: added redirection for *.wikipedia.info
- 05:47 Tim: Enabled plus signs in titles
- 05:35 ævar: "Gordon Lyon" => "Fyodor Vaskovich" at http://fundraising.wikimedia.org/2005q4/index.php/2006-01-04/detail/
January 4
- 21:30 jeluf: power cycled ragweed once again
January 3
- 04:34 brion: set amane's hardware clock to UTC, was on US mountain time
- 04:31 brion: load seems to have stabilized, things seem to be working
- 04:28 brion: colo rebooted amane. it seems to be working now, but getting lots of hangs and things on main web
- still waiting to fudge stuff
- 03:09 amane seems to be dead; hangs on http, ssh; pings though
- was able to root-ssh in, /home mount was hung. unmounted, remounted, restarted lighty; now dead again, but no longer accepting ssh and NFS server is also dead. site is down
- 02:33 Tim: Thousands of pdns_control processes started by crond were running on zwinger, stuck waiting for a hung pdns_server process. Fixed.
January 2
- 19:50 Domas: putting holbach into dewiki only operation
January 1
- 20:22 Tim: starting refreshLinks.php
- 19:10 Tim: putting templatelinks code live. Schema update finished a few hours ago.
- 16:30 ævar: Got into tingxi, it's mysql gone wild, ~50 load, trying to contact the mysql server which isn't working with all the load..
- ~15:55 ævar: tingxi is superloaded (or something) and isidore might be as well (might be using that for the SQL, not sure), as a result fundraising.wikimedia.org is down.
$ ssh tingxi "uptime"
16:03:18 up 124 days, 23:18, 0 users, load average: 78.62, 75.26, 72.74
..I haven't been able to open a normal shell...
- 12:00 Tim: gave Datrio steward access on jawiki
- 10:11 Solar: failed drive repaired on sq7, up on /d
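For the tingxi overload in the ~15:55 entry, the three load averages can be pulled out of the `uptime` line mechanically. A small Python sketch, assuming the Linux `uptime` output format shown in that entry; `parse_load_averages` is a hypothetical helper name.

```python
import re


def parse_load_averages(uptime_line):
    """Return the 1, 5 and 15 minute load averages from an uptime line."""
    match = re.search(
        r"load averages?:\s*([\d.]+),?\s+([\d.]+),?\s+([\d.]+)", uptime_line
    )
    if match is None:
        raise ValueError("no load averages found in: %r" % uptime_line)
    return tuple(float(value) for value in match.groups())


# The line quoted in the log entry.
line = "16:03:18 up 124 days, 23:18, 0 users, load average: 78.62, 75.26, 72.74"
one, five, fifteen = parse_load_averages(line)
print(one, five, fifteen)
```

All three figures being in the 70s means the machine had been saturated for at least a quarter of an hour, not just hit by a short spike.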
December 31
- 06:40 brion: adding en2 to dns aliases
December 30
- 18:55 brion: added the one-time donation form to [3] by mav's request
- 18:30 brion: set up private board wiki
- 15:00 brion: banned another leech
December 29
- 16:00 mark: Resized RRDs (of Cricket/ http://noc.wikimedia.org/stats.php ) to store more than 4 months of data...Rolled back. As always with rrdtool, there were "issues"... Sigh.
- 15:55 brion: added https://bugzilla.wikimedia.org/ SSL alias
December 28
- 22:30 jeluf: replaced ssl certificate on https://tickets.wikimedia.org/
- 21:19 brion: running cleanupCaps on zhwiktionary; changed the caps setting a few hours ago (bugzilla:4351)
- 10:10 brion: killed a leech
- srv5 is alive, but not yet configured. host key has changed, local login keys not installed. this may be an annoyance during squid updates.
- Jamesday: gzipped binary logs on Adler; still need to be moved to long term storage. Now has 57GB/18 days of space for binary logs.
December 26
- 11:49 brion: hacked RecentChange.php so the IRC output uses getInternalUrl(), so https urls don't go into the irc stream and confuse things
- correct solution may be to add another level, 'getPrimaryUrl' or something. or else declare 'internal' to mean 'external' ;)
- 08:30 brion: added goeje's new ip to ourusers for adler
- 07:22 brion: set 'reupload' permission off for regular users, on for autoconfirmed users (older accounts) in response to persistent upload vandalism
December 25
- 23:12 ævar: Installed a new special page extension, Special:Filepath, redirects user agents to the full path of a file like this.
- 22:57 ævar: Ran scap
- 14:27 mark: Partially set up sq1, but got frustrated by all the yum / Fedora crap.
- 13:11 brion: left goeje waiting for a reboot on kernel upgrade; shutdown is hanging on an nfs unmount which should eventually time out.
- apache 2.2 is installed, need to set up php and then start experimenting with stuff
- need to take goeje out of apache nodegroup, but *not* mediawiki_installation
- 12:13 brion: moved goeje to vlan 1, on 207.142.131.221
- 11:30 brion: I'm going to reassign webster's external IP to one of the old 512 mb apaches and use it as an experimental https server for secure logins to the wikis
- 04:33 Solar: sq7 is up at 10.0.3.7, but it is missing one drive at /d, RMA requested
- 04:15 Solar: srv5 is back up with a fresh FC3
- 04:00 Solar: srv71-80 are up at 10.0.2.71-80
- 02:26 Solar: srv57 and srv61 are back up
- 02:14 Solar: sq3 is up at 10.0.3.3
- 01:54 ævar: Installed a debuglog for Cite on enwiki to debug a whitespace generating problem I can't reproduce locally, even with tidy.
- ~00:45 ævar: Installed extensions/Cite/Cite.php site-wide, doesn't appear to be working on yaseo, as in <ref> & <references> just shows up as if the extension wasn't defined, even though it's required_once in CommonSettings.php on amaryllis, is it using some other system now?
- ...It's because I don't have permission to do anything at yaseo except on amaryllis...
$ ssh zwinger.wikimedia.org
Last login: Sun Dec 25 00:46:21 2005 from adsl6-56.simnet.is
**** Documentation wiki at http://wikitech.leuksman.com/ ****
[0113][avar@zwinger:~]$ ssh amaryllis
Last login: Sun Dec 25 00:55:57 2005 from zwinger.wikimedia.org
Fedora Core linux kickstart-installed on Sun Sep 11 03:22:28 UTC 2005
[avar@amaryllis ~]$ dsh -f -N mediawiki-installation "hostname"
executing 'hostname'
avar@211.115.107.145's password:
avar@211.115.107.143's password:
avar@211.115.107.144's password:
avar@211.115.107.149's password:
avar@211.115.107.148's password:
avar@211.115.107.146's password:
avar@211.115.107.153's password:
avar@211.115.107.155's password:
avar@211.115.107.150's password:
avar@211.115.107.152's password:
avar@211.115.107.147's password:
avar@211.115.107.154's password:
December 24
- 09:49 brion: switched wikitech.leuksman.com to HTTPS
- Out of interest, why? cause brion prefer yellow in URL bar.
- Would like to move more of our infrastructure stuff to be behind encrypted connections so passwords won't be exposed on insecure wireless networks when we're at conferences (ccc, wikimania, etc). While it would only be annoying if someone gets into wikitech or bugzilla accounts, sysop accounts on the main wikis or access to the internal wikis might be even more dangerous. Starting small, moving up.
- 09:30 brion: upgraded leuksman.com to Apache 2.2.0 and PHP 5.1.2RC1.
- Had to set 'EnableSendfile Off' to fix zero-length responses for static files. Probably something funny with the virtual server's kernel or filesystem.
- 03:35 brion: turned on autoconfirm protection level on dewiki by elian's request
- 02:59 brion: set local logo for hiwiki
December 23
- 05:39 brion: removed leftover Amethyst.php from servers in pmtpa
December 22
- 07:00 brion: installed new protection interface. set newbies time to 4 days
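Assuming the "newbies time" in the entry above is the minimum account age before a user counts as autoconfirmed, the check amounts to a date comparison. A minimal Python sketch under that assumption; `is_autoconfirmed` and `NEWBIE_PERIOD` are hypothetical names, not MediaWiki's own.

```python
from datetime import datetime, timedelta

# Four days, as set in the log entry (the variable name is made up).
NEWBIE_PERIOD = timedelta(days=4)


def is_autoconfirmed(registered_at, now):
    """True once the account is at least NEWBIE_PERIOD old."""
    return now - registered_at >= NEWBIE_PERIOD


now = datetime(2005, 12, 22, 7, 0)
print(is_autoconfirmed(datetime(2005, 12, 20, 7, 0), now))  # two days old
print(is_autoconfirmed(datetime(2005, 12, 18, 7, 0), now))  # four days old
```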
December 21
- 10:00 mark: Raised bandwidth limit of csw1-pmtpa's port gi0/33 (Bomis/Wikicities) to 100 Mbit/s
- 09:40 brion: updated the squid error page
- 05:31 ævar: ran maintenance/updateSpecialPages.php --only=Unwatchedpages on all pmtpa wikis.
- 05:24 ævar: Enabled Special:Unwatchedpages for users with protect permission and modified the querycache to cache 5000 pages for that instead of the default 1000. Jimbo made me!
December 20
- 17:42 ævar: People were still reporting problems with $wgOut & sitenotice, cvs up'ed & ran scap
- 14:59 ævar: Brion recently changed the sitenotice to use $wgOut->parse() instead of $p = new Parser; $notice = $p->parse(...); Apparently $wgOut is not always an object at that point. Nikerabbit reported a fatal call on a non object on that line. Inserted a live hack that tells people to report to #wikimedia-tech if it isn't an object while we hunt down why it doesn't get initialized properly sometimes.
- 14:27 ævar: Turned on rcpatrol on fiwiki much to the enjoyment of domas
December 19
- 20:00 Domas: srv57 and srv61 down, used srv70 and srv55 as Tugela replacements.
- 15:00 mark: Resurrected mint and lily.
December 18
- 14:20 Tim: attempted to restart lily, it crashed 20 hours ago.
- 00:00 Domas: holbach resurrected and is working as db slave...
December 17
- 13:05 Solar: Holbach is available at 10.0.0.24
- 11:35 Solar: sq1-10 minus 3 and 7 ( hardware errors ) are up with 10.0.3.x ip's
- 06:40 brion: installed <fundraising/> extension (FixedImage) for the fundraising progress bar
- 01:55 brion: reinstalled php 5.1.1 on tingxi with gd enabled
- 01:50 brion: briefly locked new registrations on zh.wikipedia while adding a range block;
- 0:10 brion: rebuilt interwikis (bugzilla:1586)
December 16
- 04:45 brion: installed apache 2.2 and php 5.1.1 on tingxi for fundraising info server (with SSL)
December 15
- 23:35 ævar: Removed evil privacy invading javascript counting thing from http://wikimedia.org/nl-portal/ and http://wikimedia.org/be-portal/, the javascript pointed to a counter at http://e0.extreme-dm.com/
- 07:00 jeluf: Power cycled ragweed. Again.
December 14
- 20:30 hashar: added stylesheet for http://static.wikipedia.org/
- 12:50 mark: Built a new squid RPM (2.5.STABLE12-2wm) that sets a maximum resident memory size (default: 2 GB, specifiable in /etc/sysconfig/squid), and tested it on fuchsia
- 11:20 mark: Decreased the Squid timeout value of lvsmon on pascal to 10 seconds, and restarted iris which was thrashing heavily.
December 13
- 22:31 brion: benet ran out of disk space, looking at where it went
- 19:22 brion: review of dump status shows that srv30 broke during the dump circa 04:22 yesterday, crashing enwiki and eswiki. restarting those two dumps
- 01:40 Tim: Restarted python IRC client on browne, on reports that no more channels were being created
December 12
- 22:40 brion: reinstalled turck-mmcache on tingxi; had not been upgraded after PHP recompile and was whining about version mismatch
- 14:30 mark: Resurrected mint which apparently had crashed two days ago.
- 03:00-5:00 Tim: restarted some apaches with hung processes waiting for NFS
December 11
- 13:17 hashar: BUG zwinger:/tmp/mediawiki/ should probably be in /var/cache/mediawiki/confs/ and wikitech group writable.
- This is not the place to report bugs. Please use the IRC channel. -- Tim 20:55, 11 December 2005 (PST)
- 13:16 hashar: created namespaces for itwiki & itwikisource (#bug 4247).
- 09:33 brion: dumps running in pmtpa on benet/srv35/srv36; in yaseo on amaryllis
December 10
- 05:20 brion: leuksman.com mysql & apache went wacko, memory limits killing things... restarted mysqld and apache
- ~05:00 Tim: dsh -N mediawiki-installation -f chmod -R 777 /tmp/mediawiki . And changed MessageCache.php so that it will stay that way.
- 01:40 brion: segfaults on leuksman.com reappeared; got backtrace, posted additional details on similar-looking php bug 35140. I have disabled APC on this server to try to reproduce the bug without it.
- 00:10 brion: set up cywikisource and copied in some pages (bugzilla:4228)
December 9
- 14:40 mark: Shutdown Tunnel0 on csw2-knams as an attempt to solve weird routing problems
- 06:30 Solar: new squids are racked, but only sq1 and sq2 are up at 10.0.3.1-2
- 04:47 Tim: set up www.wikimedia.org as a portal editable via meta, like the others
December 8
- 17:43 brion: ns1 and ns2.wikimedia.org don't have updated DNS. what's wrong??
- 11:40 Solar: sq1 is connected to the SCS port 9.
- 11:29 Solar: asw3-pmtpa is racked and connected to the scs
- 11:00 Solar: connected equ1's eth1 interface to csw4-pmtpa's port 34
- 10:16 ævar: Turned allowemailchange on in bugzilla, users can now change their email
- 10:12 Solar: fixed srv66's grub.conf to boot to correct kernel
- 08:21 Domas: used srv70 as emergency tugela as srv66 down
- 07:42 brion: updating tingxi in forward/reverse DNS and adding 'fundraising' CNAME
- 06:30 brion: taking tingxi out of apache groups, giving it an external setup for fundraising utilities
- 03:54 kate: stopped lomaria to dump for import to zedler
- 02:10 brion: cleaning up after bogus CVS updates in common dir owned by hashar
December 7
- 13:30 mark: Rerouted traffic back to knams
- 12:00 mark: Rerouted knams traffic to pmtpa because of networking problems near knams
- 11:14 brion: recompiled apache/php/apc on leuksman.com, hoping to debug intermittent segfaults if they continue
December 6
- 19:00 jeluf: upgraded OTRS to 2.0.4
- 05:47 ævar: cvs up'ed includes/SpecialVersion.php, there was a conflict, I removed the following code (the top part) since I presume it's not an issue anymore and the offending site has been blocked:
<<<<<<< SpecialVersion.php
$ip = str_replace( '--', ' - - ', htmlspecialchars( wfGetIP() ) );
#return "<!-- visited from $ip -->\n";
# hacked to a hidden span since one nasty was stripping comments
return "<span style='display:none'>visited from $ip</span>\n";
=======
$ip = str_replace( '--', '-', htmlspecialchars( wfGetIP() ) );
return "<!-- visited from $ip -->\n";
>>>>>>> 1.32
- 04:23 ævar: De-installed Special:Cite on commons, meta, sources, species, foundation, nostalgia and mediawikiwiki. We really should have a $site variable that can be counted on (doesn't return wikipedia for non-wikipedia sites)
- 01:48 Tim: installed Folding@Home on the yaseo apaches
- 00:50-01:30 Tim: Started Folding@Home on knams squids
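Both sides of the SpecialVersion.php conflict in the 05:47 entry rewrite "--" in the visitor IP because that sequence may not appear inside an HTML comment. A Python sketch of the same idea follows. It is an illustration, not a port of the MediaWiki code: `comment_safe` is a hypothetical helper, and the loop that also collapses longer runs of dashes is an addition, since a single replace pass turns "---" into "--".

```python
import html


def comment_safe(text):
    """Escape text so it can be embedded inside an HTML comment."""
    safe = html.escape(text)
    # One replace pass would leave "--" behind for inputs like "---",
    # so repeat until no double dash remains.
    while "--" in safe:
        safe = safe.replace("--", "-")
    return safe


def visited_comment(ip):
    """Build the 'visited from' comment line for a client address."""
    return "<!-- visited from %s -->\n" % comment_safe(ip)


print(visited_comment("207.142.131.227"), end="")
print(visited_comment("1.2.3.4--><b>"), end="")
```

The hidden-span variant in the removed half of the conflict avoids the comment problem altogether, at the cost of putting the address into the rendered DOM.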
December 5
- 22:13 Domas: Did bring back srv9 (not sure if it is a good idea). Removed bayle/will from service. All squids are null-storage now.
- 20:35 hashar: apache-(restart|gracefull)-all(hard)? now use dologmsg instead of wikibugs
- 20:26 Hashar: http://www.mediawiki.org/FAQ now redirects to meta: page (rewrite rule for virtual host mediawiki.org).
- 18:00 Domas: noticed packetloss, talking to PM support
- 17:30 Domas: did put srv6 squid into i/o-less operation, as srv10 had same hitrate ;-)
December 4
- 20:30 jeluf: Several people report problems with Linker.php:504, thumbnail linking code. As a workaround, submitted and deployed Linker.php,rev-1.56
- 13:27 hashar: changed kawiki & kawiktionary namespaces (bugs 2103 & 3905)
- 10:10 brion: fixed tingxi's sudoers, fixed tingxi's /usr/local/apache/conf, synced its mediawiki, trying to start it. working? maybe
- 10:00 brion: stopped apache on tingxi, has damaged copy of mediawiki
- 04:05 brion: pascal root partition is full, needs cleanup
- deleted ~350 megs of old kernel modules from /lib/modules, leaving those for 2.6.12-1.1381_FC3
- 03:02 Tim: srv55 has reported no more MCE errors, re-added to the apache pool
- 01:45 Tim: fixed Turck on humboldt
- 00:01 Domas, Tim: amane up and running, site back to normal
December 3
- 21:20 mark: zwinger's /home mount on amane is broken, all fs calls block
- 08:40 Tim: restarted squid on srv8, it had crashed
- 07:47 Tim: Fixed ntpd on coronelli, harris, larousse and adler.
- 07:14 Tim: Fixed ntpd on vincent and maurus. Stepped their clocks.
- 06:41 brion: fixed sync-file so that message is optional again, like its help message claims
- 06:10 Solar: Replaced what I believe to be the bad stick of ram for srv55. It's up.
- 03:01 brion: added info-pl mail alias to OTRS
- 02:09 brion: hopefully fixed the lucene restart problem; new mono installation in /usr/local wasn't in the PATH in crontab. hacked the init script to add it back
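The 02:09 fix works because cron starts jobs with a minimal PATH, so anything installed under /usr/local is not found unless the script adds it. A minimal sketch of that kind of hack, assuming the binaries live in /usr/local/bin; `prepend_path` is a hypothetical helper, not the actual init script change:

```sh
# prepend_path DIR
# Puts DIR at the front of PATH unless it is already listed, then exports PATH.
prepend_path() {
    case ":$PATH:" in
        *":$1:"*) ;;                 # already present, leave PATH alone
        *) PATH="$1:$PATH" ;;
    esac
    export PATH
}

# Usage, near the top of an init script:
#   prepend_path /usr/local/bin
```

Wrapping PATH in colons before matching avoids false hits on directories that merely share a prefix, such as /usr/local/bin2.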
December 2
- 22:00 brion: watching search daemons more closely; i think they're not properly restarting on the hourly restart cronjob
- run logs in /var/log/mwdaemon-run.log
- also clocks are very bad on maurus and vincent, need to ntp them
- 21:19 hashar: sync-file now accepts comments after the file name.
- 21:23 brion: restarted lucene daemons; for some reason they had all died
- 09:17 brion: got lucene daemons back up and hopefully running.
- there was an extra restart script in my personal crontab on maurus which seemed to be messing things up there
- added a 'ulimit -n 8192' on the init script
- 05:45 brion: yum mirrors appear to be broken (missing repo files), trying to re-sync
- 03:44 Tim: srv43 didn't come back into the apache pool after restart, fixed
- 02:28 brion: restarted search daemons, stuck
December 1
- 23:53 brion: installed joe 3.3 on zwinger (in /usr/local/bin), handles utf-8 files properly
- 23:01 hashar: knams cluster was unreachable for roughly 2 minutes, probably maintenance on kennisnet's side.
- 22:39 hashar: created Portal namespaces on ptwiki #3385
- 22:35 hashar: renamed namespaces on huwikibooks. 2 conflicts. #3783
- 19:54 brion: moved old .conf files from /h/w/conf to /h/w/conf/httpd-old to reduce confusion
- 19:54 hashar: gracefulled all pmtpa apaches to fix bug #4131
- 19:49 hashar: fixed apache-sanity-check , calls to 'ip' missed '/sbin/'
- 18:00 mark: Set up failover LVS on avicenna and alrazi. Still needs lvsmon, and isn't active yet. Uses CARP for failover.
- 14:30 mark: Removed avicenna and alrazi from Apache duty, as I am going to use them as LVS load balancers.
- 06:30 brion: removed /tmp/mediawiki/* caches on srv36; the backup run had saved a bunch by root and apache screamed about being unable to write them
- 06:25 brion: restarted apache on yf1005; odd PHP error, possibly APC cache breakage.
- Fatal error: main(): Failed opening required '' (include_path='/usr/local/apache/common/php-1.5:/usr/local/apache/common/php-1.5/includes: /usr/local/apache/common/php-1.5/languages:/usr/local/apache/common/php-1.5/templates: /usr/local/apache/common/php-1.5/extensions/wikihiero:/usr/local/lib/php:/usr/share/pear') in ½Íÿ on line 14
- 05:47 Solar: Ariel's raid has "failed", but no real disk failures. It put the array back online and rebooted. We'll see how it does.
- 05:00 Solar: Moved srv35-43 to second cage. Racked new sq1-sq10.
- 04:30 brion: deleted 20051127 enwiki pages_full dumps, since srv36 was turned off before they finished
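The 19:49 apache-sanity-check fix is a reminder that scripts run from init or cron may not have /sbin in PATH. A minimal sketch of a "is the LVS address bound here?" test that calls `ip` by full path, assuming iproute2 is installed; the function name `vip_is_bound` and the address 10.0.5.3 are hypothetical, and the real apache-sanity-check may work differently:

```sh
# vip_is_bound ADDRESS [SAVED_OUTPUT]
# Exit status 0 if ADDRESS is configured on any interface.
# SAVED_OUTPUT (a file holding "ip -o -4 addr show" output) is for testing only.
vip_is_bound() (
    if [ -n "${2:-}" ]; then
        cat "$2"
    else
        /sbin/ip -o -4 addr show     # full path: PATH may lack /sbin under cron or init
    fi | awk -v vip="$1" '
        {
            for (i = 1; i < NF; i++)
                if ($i == "inet") {
                    split($(i + 1), a, "/")   # strip the /prefix length
                    if (a[1] == vip) found = 1
                }
        }
        END { exit (found ? 0 : 1) }'
)

# Usage: vip_is_bound 10.0.5.3 || { echo "LVS VIP missing, not starting apache"; exit 1; }
```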
November 30
- 21:49 brion: fixed upload dirs for wikimediafoundation.org
- 05:43 Solar: Racked donated load balancer in core cage on csw1-pmtpa port 34
- 02:25 brion: removed a privacy-violation in a revision comment via database edit (enwiki rev_id 29652015)
November 29
Note: ZX will be rebooting and upgrading most knams machines tonight, to help fix our problems. They will be taking machines down one by one, so this shouldn't give downtime - in theory. If it does, check whether the LVS ip is bound to the machines when they come up.
- 23:30 mark: Apparently service ips often were not added because /etc/rc.d/rc.local wasn't run... because it did not have eXecute permissions on some machines. Fixed.
- 21:56 hashar: rebuildMessages.php finished.
- 21:45 brion: bugzilla:4115 setting up latex on latest srv batch, adding to setup-apache
- 21:30 brion: found and fixed upload files for meta
- 21:15 brion: investigating broken upload files on meta
- 20:49 hashar: fixed bug 4048 and running 'rebuildMessages.php --update' on all wikis.
- 16:35 mark: Squid wasn't running on srv6, started
- 16:15 mark: Ran yum upgrade on all knams machines
- 15:15 mark: Reversed the change as it didn't work anyway: Squid simply ignores failure on binding IPs.
- 14:00 mark: Adapted the Squid configurator / squid.conf.php to explicitly bind to the Squid's main IP address and the LVS IP, if applicable. Meant to ensure that Squid will not start if the LVS IP is not bound to the machine, so lvsmon can detect that.
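The 23:30 finding (rc.local silently skipped because it lacked execute permission) is easy to test for. A minimal sketch, assuming the default path /etc/rc.d/rc.local from that entry; `check_rc_local` is a hypothetical name:

```sh
# check_rc_local [PATH]
# Exit status 0 if the file exists and is executable, 1 if it is not
# executable, 2 if it is missing. The body runs in a subshell, so "exit"
# only leaves the function, not the calling shell.
check_rc_local() (
    f=${1:-/etc/rc.d/rc.local}
    [ -e "$f" ] || { echo "missing: $f"; exit 2; }
    [ -x "$f" ] || { echo "not executable: $f (try: chmod +x $f)"; exit 1; }
    echo "ok: $f"
)

# Usage: check_rc_local || echo "service IPs will not be added at boot"
```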
November 28
- 23:34 Hashar: uploaded a picture of clusters, please post comment on image talk page so I can modify / update it.
- 21:50 Domas: restarted rogue failing (bytecode cache issues?) apaches: srv47, srv4, srv37, srv63, srv58, srv67, srv68, srv53, srv39
- 20:40 Domas: ragweed booted up, started squid, then started something else (for a minute or two), then ran rc.local with LVS IP adding... site down for several minutes
- 19:58 brion: ragweed is down (no ping), OTRS dead
- 14:35 ævar: Site crashed because of insufficient sanity checks, my bad.
- 14:30 Domas: srv62 tugela crashed, no core dump yet, if crashes persist will need some poking, either code, or srv62. mcelog empty.
tugela-fc3-x64[3634]: segfault at 00000000010e1000 rip 0000003a781716e0 rsp 0000007fbffff668 error 6
- 00:44 ævar: Changed the project name and metanamespace for iswikibooks to Wikiorðabók
- 00:00 Domas: oops, ran tugela on srv51-srv54,srv56-srv69 instead of memcached, will see how it performs/scales/...
November 27
- 23:21 hashar: thanks to palica : updated Server inventory bot to add a link to ganglia.
- 19:09 hashar: added two scripts to check database : 'mysql-list' & 'replication'
- 18:49 hashar: BUG rose got 4 memcached instances but they are not listed in mc-pmtpa.php
- 18:47 hashar: commented 10.0.2.43:10000 from mc-pmtpa.php
- 13:13 ævar: Installed Special:Cite on all the wikipedias
- 10:15 brion: blocked wikipedia-l, wikien-l, and helpdesk-l list archives in mail.wikipedia.org's robots.txt to discourage future complaints about embarrassing newbie posts becoming #1 google hits. Search patches for mailman archives should be integrated at some point...
- 08:55 JeLuF: added http://www.spy-sweeper-webroot.de/wiki/?/ to squid's leecher blocklist
- 07:38 Solar: smellie is ready for service. Turned off seLinux.
- 07:30 Solar: srv5 is out with a bad case of bad blocks
- 07:00 Solar: Crossed over to the new switch, csw4-pmtpa
- 03:09:56 ævar: Installed Special:Cite on enwiki as an experiment.
- 01:39 Tim: took srv55 out of service, likely dud RAM. MCE errors reported.
- 01:10 Tim: squid on will had crashed. Restarted.
- 01:05 Domas: fixed default route on tingxi
November 26
- 17:30 jeluf: changed password of wikipl-l admin account. Gave new PW to Datrio. Documented PW at the usual place.
- 14:36 Tim: put srv52-70 into apache service. I broke srv51 with a restart test.
- 12:00 Tim: wrote /h/w/b/apache-sanity-check, set up scripts such as apache-start to run it and refuse to start apache if the necessary LVS-friendly conditions are not met.
- ~11:00 Tim: broke site temporarily due to LVS-related misconfiguration
- 10:53 Tim: rose, tingxi and srv2 had apache running but no LVS VIP. This would explain the random hanging behaviour with ab -X apaches:80. Fixed temporarily, will look into a permanent solution.
- 08:45 Tim: LVS wasn't decommissioned properly on iris. LVS on pascal was forwarding packets to LVS on iris, and iris, with no lvsmon running, forwarded most of those packets to sage, which is down. Thus users were seeing connection timeouts. Fixed with ipvsadm -D -t rrvs.knams.wikimedia.org:80.
- 07:41 Tim: srv5 still not up. Moved its virtual IPs, one to srv6, one to srv8 and one to srv10.
- 07:15 Tim: did a fsck of srv5 then a system reboot
- 03:32 srv5's root partition spontaneously declared "read-only filesystem". Logs stopped moving. Mount reported that it was still rw, but it couldn't be written to.
- mount uses the contents of /etc/mtab to display mounts. These are not updated when the file system is r/o. Use /proc/mounts instead.
- 05:50 Tim: introduced time and memory limit for rsvg and convert
- 01:45 Tim: started image backup using updated scripts in /h/w/b
- 00:14 ævar: changed the logo for iswiktionary.
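The note under the 03:32 entry (trust /proc/mounts, not /etc/mtab) can be turned into a quick check. A minimal sketch, assuming a Linux host; `is_readonly` is a hypothetical name, not one of the scripts in /h/w/b:

```sh
# is_readonly MOUNTPOINT [MOUNTS_FILE]
# Exit status 0 if the mount table lists MOUNTPOINT with the "ro" option.
# MOUNTS_FILE defaults to /proc/mounts; the override exists only for testing.
is_readonly() (
    awk -v mp="$1" '
        $2 == mp { opts = $4 }      # remember options of the last matching entry
        END {
            n = split(opts, o, ",")
            for (i = 1; i <= n; i++)
                if (o[i] == "ro") exit 0
            exit 1
        }' "${2:-/proc/mounts}"
)

# Usage: is_readonly / && echo "root filesystem is read-only"
```

Reading the kernel's table directly matters here because /etc/mtab is an ordinary file on the root filesystem and cannot be updated once that filesystem has gone read-only.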
November 25
- 21:45 Hashar: killed some rsvg processes on various apaches. It seems they were trying to render a 120px thumb of /commons/7/70/Interstate_Highways.svg (possible DoS? :( ).
- 04:40 Tim: experimentally enabled keepalive on apache.
- 03:35 Tim: testing lvsmon failover by stopping squid on clematis
- 03:05 Jamesday: Adler had 11GB disk free. gzipped first 80 binlogs to raise it to 48GB or so. gzipped version still need to be moved to wherever we're keeping them these days.
November 24
- 06:30 kate: setting up l3 failover.. see that page for details
- 02:55 brion: took coronelli out of search rotation while kyle moves it around
November 23
- 21:18 mark: Routing problems from 38.0.0.0/8 (cogent ip space) to florida. Altered the countries.nerd.dk file to reroute that prefix via knams.
- 20:44 mark: Reinstated the normal epoll RPM on mint, as epoll wasn't the problem
- 16:44 brion: fixed arrangement of upload directories for several sites (non-wikipedia :P)
- 00:35 kate: "ntp source vlan1" fixed NTP problem on csw1, but need to work out why traffic to 64.156.25.242 is being dropped
- 00:04 kate: upgraded csw4-pmtpa to 12.2(25)SED, enabled ssh and configured vlan 2 properly
November 22
- 22:33 brion: amane still seems to work. YAY \o/
- 21:49 brion: restarted apache on zwinger, wasn't loading
- 21:45 brion: increased php fastcgi workers on amane to absurd levels for thumbs to run
- 21:30 brion: mostly working now! had to set server.max-workers to 8 in lighty to get it running smoothly
- 19:28 brion: mounted /mnt/upload3 (amane) on zwinger, was missing mountpoint
- 19:22 brion: mounted /mnt/upload3 (amane) on srv2, was missing mountpoint
- 19:11 brion: restarted albert's http temporarily to cover the work period
- 19:02 brion: khaldun copy finally finished, rearranging bits on amane
- 15:21 brion: turned albert's http back off (hope you're done) so khaldun can finish its copy without the extra load
- 07:44 brion: started albert's http so kate can set things up requiring the local fedora yum mirror
- 05:08 kate: configured asw2-pmtpa. has the new srvs and the equ device on it (equ is 10.0.1.3)
- 00:55 brion: started copying commons files from bacon -> amane. disabled albert's apache
- 00:45 brion: started copying enwiki files from khaldun -> amane, non-wikipedia non-wiktionary files from albert -> amane
- 00:35 brion: started copying files bacon -> amane
- 00:20 brion: disabled uploads sitewide
November 21
- 23:10 brion: setting up to move uploads to amane, will disable all uploads and upload.wikimedia.org for a while to make this damn thing happen
- 21:15 brion: started lucene index rebuild on maurus
- 21:05 brion: restarted squid on will, was not responding (stuck) on port 80
- 20:49 brion: restarted apache on ragweed; https was down so otrs inaccessible
- 20:30 mark: Brought sage and mayflower back up.
- 20:00 mayflower went down.
- 20:00 mark: Moved LVS back to pascal to allow iris to be a squid again.
- 19:45 mark: Modified lvsmon on iris because it was always sending curl requests with Pragma: no-cache, and was therefore testing the whole chain to florida.
- 19:45: sage went down.
- 18:00 mark: Installed non-epoll RPM on mint to compare.
- 17:56:40-17:57:31 ævar: Invalid argument notices were being generated in this time period due to me syncing three files and them depending on each other, ok now.
- 17:30 mark: udpmcast wasn't running on pascal. No idea since when... started.
- 17:30 jeluf: Restarted ragweed. Came back after powercycling and fsck.
- 16:30 ragweed broken.
- 12:14 erik: Updated logo of nap.wikipedia.org and sync'd InitialiseSettings.php
November 20
- 23:30 mark: Upcoming maintenance of knams tomorrow (ZX will do some firmware upgrades, rebooting at least pascal and vandale). Moved LVS to iris because of that.
- 20:00 JeLuF: All wikipedia.org upload directories moved off of albert and to amane.
- 18:03 Hashar: fixed #4022 'Asia/Seoul' timezone for kowiki.
- 17:50 Hashar: switched some logos to /b/bc/Wiki.png
- 14:24 JeLuF: chown -R apache:apache amane:/export/upload/wikipedia.org/
- 14:19 Hashar: in amane:/export/upload/wikipedia.org/ some directories can't be written by apache (af de es & fr). dewiki upload page reports an error.
- 09:26 Tim: Fixed NTP broadcast, documented
- 03:21 Tim: Fixed perl upgrade on srv51-70 as per [4]
November 19
- 17:05 Tim: same on fuchsia
- 16:50 Tim: restarted squid on clematis, disabled swap.
- 16:05 Tim: upgraded otrs on ragweed to version 2.0.3, after Anthere complained about this bug: [5]. Minor upgrades within the 2.0.x series weren't documented (just an unanswered question on the ML), so I just untarred over the top of the old directory, with a backup in /opt/otrs-2.0.1. Treat any problem symptomatically, some chmodding might be required.
- 15:40 Tim: restarted squid on bayle
November 18
- 23:30 brion: installing ploticus 2.32 on mediawiki-installation, set to use gd & truetype fonts (bugzilla:3965)
- truetype fonts in common/fonts
- 07:00 jeluf: migration of dewiki's image and thumbnail directories done. archive and shared will be moved when albert has more headroom. Some 30 small to medium wikis moved. Currently running frwiki thumbnail migration.
- 00:27 brion: blocked another leech [6]
November 17
- 14:30 mark: ragweed was missing the LVS ip, fixed. Also readded iris as squid.
- 06:30 Tim: Added root key to srv51-70. The following machines didn't want to cooperate: 56, 64, 66, 67, 69
- 06:05 Tim: added srv51-70 to DNS, created a node group. Configured albert's BIND as a slave for the 10/8 reverse DNS zone.
- 05:46 Solar: srv2 is back up.
- 05:38 Solar: srv56 is up too.
- 05:26 Solar: srv51-srv70 are ready for Rock & Roll! (Except srv56 has some hardware issue)
- 04:34 Solar: holbach is rebuilt and ready
- 03:47 Tim: added tingxi and rose to the apaches node group. Left harris out, it sucks.
- 03:30 Tim: after moving some more hosts to the misc2 cluster, restarted gmond on the apache cluster to remove hosts which have been moved out
- 02:24 Tim: fixed amane's date, started ntpd
- 01:49 Tim: Created "Misc VLAN2" cluster on ganglia, for miscellaneous hosts which, due to being in the wrong VLAN, couldn't be in Miscellaneous.
November 16
- 8:25 brion: srv50 error_log flooded disk; removed and restarted apache
- 6:30 jeluf: moved es upload area to amane:/export/upload
- 5:30 jeluf: moved eo, ang, an upload areas to amane:/export/upload. Backups are still on albert in .../remove.
- 04:14 Tim: attempted to restart squid on will. It didn't work. I hacked /etc/init.d/squid to send errors to a file instead of /dev/null, and found it was giving error messages like "parseConfigFile: line 17 unrecognized: 'htcp_port 4827'". I started the squid copy in /usr/local/ instead.
- 01:20 brion: reenabled special:renameuser with the 'archive' bit disabled. it's possible that some undeleted pages will have incorrect rev_user_text data
November 15
- 23:00 jeluf: moved aa, ab, af, ak, als, am, ar, ast, zh image uploads to amane:/export/upload
- 20:32 hashar: updated http://wikimedia.org/stats/live/ with a message redirecting to the "new" system ( http://noc.wikimedia.org/stats.php ).
- 16:13 Tim: running batch imagemagick convert job on bacon, converting 1911 EB scans to PNG.
- ~12:30 Tim: Deployed diff cache and parser cache push features. Reduced cache expiry for RC feeds on en from 60 to 20 seconds. The performance impact of this should be monitored -- the diff cache should reduce it but it might not be enough.
- 03:46 Tim: Re-enabled tidy, trimmed error logs. The huge error logs did indeed have a few tidy errors towards the end, once every few minutes, interspersed with lots of "file not found" errors. Preceding this lack of activity was gigabytes of either:
- [Mon Nov 7 04:33:33 2005] [error] PHP Parse error: parse error, unexpected $ in /usr/local/apache/common-local/php-1.5/checkers.php on line 101
- OR
- *** attempt to put segment in horiz list twice
- Neither of which have anything to do with tidy. The other noticeable thing at the very end of the error logs was that apache was segfaulting regularly, but it was doing that just as much after tidy was disabled.
- 01:22 ævar: resolved bug 3968
- 00:50 brion: cleaned giant error_log files from srv44 and srv47, which had run out of space during sync
- 00:41 brion: adding some signature-nazi features, so new sigs with unbalanced html tags will not be inserted
November 14
- 22:30 mark: Many apaches have error_logs of 100G in size and more! Partly due to tidy, but how is logrotation supposed to be set up? See bug #3966
- 22:00 - 22:12 hashar: $wgUseTidy = false; it's filling error logs on all apaches and seems to stall. Restarted all apaches too. Wikipedians need to FIX their HTML.
- 14:00 mark: Rebooted srv10, and started Squid on it with no cachedirs (1 null cachedir). Assigned IP .214 to it.
- 08:28 Tim: restarted squid on srv6. Slow hit service times (~100ms), it wasn't swapping but it had very little spare memory for kernel cache and buffers.
- 03:05 Tim: bayle was swapping heavily, very slow service times for both hits and misses. Restarted squid, added it to the ganglia squid cluster.
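Until log rotation is sorted out (the 22:30 entry), oversized error logs can at least be spotted before they fill a disk. A minimal sketch using only POSIX find; `big_logs` is a hypothetical name, and the directory to scan is left to the caller because the log location is not stated here:

```sh
# big_logs DIR [MIN_MB]
# Lists files named error_log* under DIR that are larger than MIN_MB
# megabytes (default 1024).
big_logs() (
    min_mb=${2:-1024}
    # POSIX find counts -size in 512-byte blocks, so 1 MB = 2048 blocks.
    find "$1" -type f -name 'error_log*' -size +"$((min_mb * 2048))"
)

# Usage: big_logs /path/to/apache/logs 10240    # anything over 10 GB
```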
November 13
- 22:50 jeluf: mounted amane:/export/math to all mediawiki-installation servers for storage of math images.
- 20:00 midom: srv10 squid hung, reiserfs issues?
- 16:57 brion: running data dumps on benet/srv35/srv36
November 12
- 19:49 ævar: tingxi had languages/LanguageCs.php (and probably something else) out of date, IIRC it has been down for some time, ran scap to bring it and others up to date.
November 11
- 00:16 brion: changed sitename on eswikinews (meta-namespace was already set)
November 10
- 14:28 ævar: changed the logo on trwiki
- 09:06 ævar: Changed the upload url of the wikis that had uploading disabled to point to the commons
- 09:09 brion: gave up trying to upgrade bugzilla due to bugzilla upgrade failure
- 08:40 brion: running yum update on pascal; got some glibc double-free bug during bugzilla update, and thought it was time to upgrade some damn packages
- 08:25 brion: shutting down bugzilla for upgrade to 2.20
- 07:18 brion: removed check_policy_service from /etc/postfix/main.cf on kate's advice, to see if it's more stable with that off
- 07:02 brion: restarting postfix on zwinger, mail stopped again
- 05:59 ævar: Removed harris from /usr/local/dsh/node_groups/mediawiki-installation, responded to ping, had port 22 open, but hung forever on ssh harris
- 04:27 Tim: set up ftp server on bacon, to accept uploads of scanned page images
November 9
- 14:18 Tim: fuchsia was swapping, regularly timing out on lvsmon health checks. Restarted squid.
- 11:09 brion: modified parser cache behavior to do cache with redirect targets. should increase hit rate; if troubles experienced, revert Article.php back to rev 1.396
- 10:13 brion: reenabled search text extracts for active sessions only
- 07:32 brion: updating live search indexes
- 00:54 brion: no mail in last eight hours... restarting postfix
November 8
- 23:30 jeluf: After intensive fsck, ragweed is back.
- 19:00 ragweed pings, but doesn't allow SSH login
- 13:10 holbach crashed
- 12:05 Tim: deployed local message cache, causing a 60% drop in network traffic on the apache cluster according to ganglia. We had noticed probable network saturation on the 100 Mbps switch asw1, this was the obvious solution. A content hash is stored in memcached and checked on each request. The local cache is stored in files, one file per wiki in /tmp/mediawiki/
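The idea in the 12:05 entry (a small hash in the shared store decides whether the large local file can be reused) can be sketched as follows. This is an illustration only, not the MediaWiki implementation: `content_hash` and `cache_is_fresh` are hypothetical names, the hash is a plain POSIX cksum of the local file, and the expected value is passed in as an argument where the real code would fetch it from memcached.

```sh
# content_hash FILE: print a checksum-and-length token for FILE (POSIX cksum).
content_hash() {
    cksum < "$1" | awk '{ print $1 "-" $2 }'
}

# cache_is_fresh FILE EXPECTED_HASH
# Exit status 0 if FILE exists and its hash equals EXPECTED_HASH,
# the value a shared store such as memcached would hand out.
cache_is_fresh() {
    [ -f "$1" ] && [ "$(content_hash "$1")" = "$2" ]
}

# Usage: cache_is_fresh /tmp/mediawiki/somewiki "$hash_from_shared_store" || echo "refetch messages"
```

Only the short hash crosses the network on each request; the full message data is transferred again only when the comparison fails, which is consistent with the traffic drop described above.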
November 7
- 20:51 kate: stopped replication on lomaria. please don't start it without asking me unless it's extremely important.
- 20:45 brion: trying to get tidy going again
- 20:30 brion: rebuilding search indexes on maurus.
- 20:00 brion: set search daemons to restart hourly. *sigh*
- 14:05 Tim: brought holbach back into service. Tweaked some load ratios.
- 13:55 Tim: started slave on lomaria. It was idle, the site was slow.
- 05:45 brion: switched lucene search to default to AND matches
- 02:50 brion: set up init script for MWDaemon (/etc/init.d/mwdaemon), added a daily cronjob to restart them
November 6
- 21:04 brion: several servers had disks filled from apache error_log; libart in rsvg apparently spewing out gigs of "*** attempt to put segment in horiz list twice"
- 20:10 brion: site unusually loaded; giving a kick to the apaches for luck
- 11:08 jeluf: srv22 was overheated. killed svg renderer (240 cpu minutes)
- 11:00 jeluf: added Category:Broken_servers for better keeping track of todos
- 10:40 jeluf: added portal namespace for nowiki upon Jhs' request
- 08:20 kate: copying from lomaria again... whee!
- 05:20 brion: added id.wikisource.org by request
- 04:59 Tim: started lvsmon-ksquid on pascal
- 04:39 kate: iris crashed... moved lvs to pascal.
- 02:40 Tim: Made MW check $cluster.dblist instead of all.dblist. This will generate appropriate error conditions for improper access to foreign databases via commandLine.inc, Special:Makesysop or squid misconfiguration.
- 01:40 Tim: installed memcached on srv41-50, moved instances from various other machines to there, including offloading browne completely. Restarted memcached on srv22, it had a dead instance.
November 5
- 22:02 kate: restarted replication on lomaria. set up replication on zedler.
- 11:00 brion: chgrp'd common files on humboldt
- 09:15 solar: installed new image filer, amane, into the rack.
- 04:55 kate: stopped replication lomaria again to re-dump. don't start it please. (server is still running)
- 03:41 Tim: tried to restart dumpHTML on srv31, the machine crashed almost immediately
- 03:39 brion: starting dumps on yaseo on amaryllis/henbane
- 03:32 kate: copy finished, restarted replication on lomaria
- 03:00 brion: refresh-dblist now also creates pmtpa.dblist and yaseo.dblist, based on assignment overrides from clusters.dblist
- 00:45 brion: started pmtpa dumps on benet, srv35, srv36
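The 03:00 refresh-dblist change splits the wiki list per cluster. A rough sketch of that kind of split, under loud assumptions: the override file format used here ("dbname cluster" per line) is invented for illustration, the default cluster is passed explicitly, and `split_dblist` is a hypothetical name rather than the real script.

```sh
# split_dblist ALL_DBLIST OVERRIDES DEFAULT_CLUSTER OUTDIR
# Writes OUTDIR/<cluster>.dblist files. Wikis named in OVERRIDES
# (assumed format: "dbname cluster") go to that cluster, the rest to DEFAULT_CLUSTER.
split_dblist() (
    all=$1 overrides=$2 default=$3 outdir=$4
    awk -v def="$default" -v out="$outdir" '
        FILENAME == ARGV[1] { cluster[$1] = $2; next }   # first file: overrides
        NF {                                             # second file: one wiki per line
            c = ($1 in cluster) ? cluster[$1] : def
            print $1 > (out "/" c ".dblist")
        }' "$overrides" "$all"
)

# Usage: split_dblist all.dblist clusters.dblist pmtpa /tmp/dblists
```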
November 4
- 21:45 jeluf: moved lighttpd on benet to /usr/local/lighttpd. Added startup to /etc/rc.local
- 21:00 jeluf: mounted benet:/var/backup to zwinger:/mnt/backup_benet
- 06:25 brion: restarted search servers; memory usage up to 650-1000mb range, and very slow response on vincent
November 3
- 11:03 kate: copying lomaria's db to zedler, don't start it
- 21:45 erik: fixed he.wikinews site name and meta namespace (hopefully), sync'd InitialiseSettings.php and ran update.php accordingly
- 20:44 brion: investigating connection errors (hacked wfLogDBerror to include hostname); seems to be on the new opteron boxen only
- 20:30 hashar: started apache on srv35.
- 20:22 hashar: started apache on avicenna.
- 20:10 mark: Will was running with only 1024 FDs. As it's the only non-RPM squid around (will is FC1) and I added bayle, I have taken it out, reassigned IPs to srv5 and srv7.
- 19:55 hashar: some apaches need a reboot. load is incorrectly high on them because of state=D processes (see bug #3869)
- 15:10 mark: Moved bayle (previously broken, inactive memcached) to the external vlan, made it a temporary squid. I cannot get it to mount izwinger:/home though. Any ideas?
- 5:30 Tim: copied ~tstarling/.ssh/known_hosts to /etc/ssh/ssh_known_hosts on all pmtpa machines
- ~5:00 Tim & kate: syslogd stopped working on zwinger, causing DNS to stop working. Kate restarted syslogd.
- ~5:00 created hewikinews using addwiki.php, sync-common-all
- 04:07 kate: made amaryllis ns3.wikimedia.org. needs magic stuff so it can be added as auth ns
- 01:58 Tim: restarted search daemon on vincent, the usual problem
November 2
- mark: Apparently the restart squid cron job in the squid RPM is broken in a weird way: at some point in time /sbin/pidof /usr/sbin/squid will stop working. I will fix it and roll out a new RPM tomorrow. Sorry for the trouble!
- 23:20 JeLuF: Found 2 squids on srv8. Killed both, started a new one.
- 22:20 Tim: adapted lvsmon for knams squid service, started it on iris. See /usr/local/bin/lvsmon-ksquid . There's also a copy in ~tstarling/lvs on zwinger in case iris goes down.
- 21:30 mark: Installed the new squid RPM on clematis. Not using epoll didn't change memory leaking behaviour.
- 19:17 kate: LDAP on pascal was broken after reboot.
Nov 2 19:11:26 pascal slapd[29793]: bdb_db_init: Initializing BDB database
Nov 2 19:11:26 pascal slapd[29794]: bdb(dc=knams,dc=wikimedia,dc=org): Lock table is out of available locks
Nov 2 19:11:26 pascal slapd[29794]: bdb_db_open: db_open(/var/lib/ldap) failed: Cannot allocate memory (12)
Nov 2 19:11:26 pascal slapd[29794]: backend_startup: bi_db_open(0) failed! (12)
- Did a db_recover and restarted slapd.
- 04:38 kate, kyle: csw4 is installed. nothing on it yet.
- 01:08 kate: pascal broke again, moved LVS to iris
- 00:10 kate: colo allocated us 84.40.25.224/27, wikicities will move into this network
November 1
- 23:39 brion: created car-fr-l list for french arbcom
- 22:25 brion: heavy packet loss between pmtpa and lopar; kate is moving dns off lopar for now
- 21:10 UTC erik: created ru.wikinews.org using addwiki.php
- 18:26 mark: Dropped 207.142.131.225 as gateway IP, as it doesn't seem to be in use anymore
- 18:15 mark: Made csw1-pmtpa act as a DHCP relay agent for rabanus, 10.0.0.15
- 04:20 kate: replaced mormo.org with pascal & amaryllis as backup MX, using postgrey + other anti-spam stuff
- 05:48 Solar: anthony, suda, isidore and bayle are back up.
- 05:10 Tim: Cleaned up the squid list in CommonSettings.php. The need to have variables for the IP addresses of each squid passed long ago, it was just clutter, doubling the length of the section. Added the external IP address of will, which was missing, causing edits to be wrongly attributed in the yaseo wikis.
Archives
- Server admin log/Archive 1 (2004 Jun - 2004 Sep)
- Server admin log/Archive 2 (2004 Oct - 2004 Nov)
- Server admin log/Archive 3 (2004 Dec - 2005 Mar)
- Server admin log/Archive 4 (2005 Apr - 2005 Jul)
- Server admin log/Archive 5 (2005 Aug - 2005 Oct)