Server admin log/Archive 5
From Wikitech
October 6
- yesterday kate: installed solaris on vandale because mysql wanted to test something. finished with it now, should have linux put back.
October 5
- 18:57 brion: image server has been very slow lately. fixed a broken thumb file or two which had a subdirectory in the way (one on the wikipedia portal was being requested *very* often, producing extra redirect load)
- 05:25 Solar: Rebooted srv26, bumped temp. threshold to 80C. Will investigate further.
- 05:20 Solar: webster is back up for now, but will fail again. Will call SM to get replacement drives.
- 02:56 Tim: srv24 in rotation as part of cluster2. Restarted compressOld.
- 01:00 Tim: Setting up srv24 as an external storage server, to replace srv26 which is down again. Stopped compressOld and stopped slave on srv25 for data directory copy.
October 4
- 23:50 Tim: Started compressOld.php, started mysqld on srv26.
- 23:30 Tim: restarted evil resource-eating program (with kate's permission)
- 20:02 kate: stopped evil resource-eating tim program on albert started ~ 06:20.
- 14:40 mark: Increased DB load on samuel in an attempt to solve DB availability problems
- 13:13 Webster broke.
- 06:22 Tim: HTML dump post-process running on albert. It'll spend most of its time in sed, with a perl controlling script.
- 05:13 Tim: static HTML dump of English Wikipedia is pretty much finished. I'm currently running a huge find command on albert, to get a list of files to post-process.
- 00:20 Solar: Uploaded pictures. Take a look at User:Solar
October 3
- 23:07 brion: srv28 shutdown broke dewiki and enwiki dumps, have to restart them. non-wikipedias finished before this.
- 19:45 Solar: srv11 and srv28 moved to new racks for power distribution requirements.
- 03:30 jeronim: pmtpa squids were mostly running with max FDs of 1024 and starving, so rebuilt them with limit of 8192 and restarted
October 2
- 21:41 brion: taking srv35 out of apache loop to run additional dump processing
- 13:10 brion: running wikipedia backups from bacon via srv36, nonwikipedia backups from bacon via benet
- 12:57 brion: replication halted on bacon due to missing tables on the new wikis (napwiki, warwiki etc) -- this will need to get fixed. in the meantime doing dumps from other wikis ...
- 09:30 brion: srv31-35 in apache service (in perlbal list)
- 08:45 jeronim: srv31-35 ready for apache deployment
- 07:40 Tim: fixed exif bug (http://bugs.php.net/bug.php?id=34704) and deployed the updated tree on all florida apaches
- 06:30 brion: running cleanupTitles.php on various wikis
- 00:10 Tim: Running fixSlaveDesync.php on en.
October 1
- 21:07 Tim: Told dalembert to stop echoing its syslog spam to zwinger and larousse. Apparently temperature warnings were appearing in terminals on larousse.
- 19:40 Tim: Added Internode proxies to the trusted XFF list
- 11:00 brion: bacon and adler catching up last couple hours' data
- 08:30 brion: stopping bacon, adler to copy current data over to bacon
- 08:15 brion: continued replication catchup on bacon
- 08:10 brion: stopped backups; benet's out of space (going to do cleanup) and I'm testing an improved backup dump script that eliminates the overhead of mwdumper on the initial dump-split-compress job.
- 08:07 Tim: re-enabled Special:Makesysop, minus steward interface
September 30
- 19:00-20:00 mark: Deployed the fixed HTCP-CLR patch to all squids, and restarted them
- 19:18 ævar: disabled PageCSS because of potential XSS issues.
- 16:06 ævar: Installed the PageCSS extension on the cluster for per-page CSS.
- 13:12 Tim: installed apache, php etc. on dalembert, by modifying /home/wikipedia/deployment/apache/prepare-host until it kind of worked. Not sure if it's all set up right, but it's probably good enough for dumpHTML, which is what I'm using it for.
- 12:30 Tim: installed gmond on various reinstalled machines
- 07:22 midom: adler in service
- 01:25 brion: did some scripted despamming crosswiki (some deleted pages by '127.0.0.1'...)
- Solar: Replaced ram in srv42
September 29
- 19:50 mark: Fixed a memleak in my HTCP CLR squid patch, and testing it on clematis. If it works well, I will deploy it to all other squids...
- 17:52 Tim: made some more tweaks to http://mail.wikipedia.org/index.html . Now it displays properly in IE, and it works with small screens
- 17:12 Tim: Returned text on http://mail.wikipedia.org/ to a comfortably readable size. Apologies to optometrists everywhere for the reduced pay cheque.
- 07:25 brion: Ran initStats on warwiki, napwiki, ladwiki.
- 05:15ish brion: ntp setup on ariel
- 05:00 jeronim: clean fc3 on ariel; it has had a drive swapped and is hopefully not faulty now
- 04:30 Solar: srv33, srv34, and srv35 have ip's and are ready for service. srv32 and srv31 are pending a bomis server move
September 28
- - jeronim: srv49, alrazi, diderot, hypatia, avicenna, goeje, harris, dalembert, humboldt, kluge, friedrich all freshly set up with fc3 - but no ntp setup, and no apache. alrazi's old host keys lost.
- 19:00 jeronim: on zwinger, moved squid errors directory and sync-errors back into /h/w/conf/squid from /h/w/conf/old-squid, and updated sync-errors to also sync to lopar, yaseo, and knams. Updated all squids to use shiny new error page from mark_ryan.
- 10:00 mark: Added ragweed back to the knams squid pool because of overload on the other squids
- 09:10 brion: dewiki backup running on benet while others continue (from bacon 20050921)
- 07:20 brion: backups switched to use bzip2 for xml dumps; 'articles' instead of 'public' name change; image dumps disabled
- 06:52 brion: starting bzip2 filter/output of 20050924 enwiki dump on srv36
- 01:00 Solar: alrazi avicenna diderot friedrich goeje harris hypatia humboldt kluge srv42 srv49 are back on the netgear switch
September 27
- - jeronim: dhcp still not working so I've asked Kyle to put most fc2 boxes on a different switch
- 23:53 jeronim: commented out icpagent in /etc/rc.local on dalembert in case it's rebooted
- 22:10 mark: The new switch appears to be Fast Ethernet only! It's accessible on 10.0.1.1. I configured some parts of it to make it somewhat usable: all ports in access mode, vlan 2.
- 20:15 midom: disabled steward interface, needs rewriting to select databases instead of specifying their names directly in queries -- breaks replication
- 18:00 midom: ariel gone down:
LSI MegaRAID SCSI BIOS Version G112 May 20, 2003
Copyright(c) LSI Logic Corp.
HA -0 (Bus 3 Dev 1) MegaRAID SCSI 320-2
Standard FW 1L26 DRAM=64MB (SDRAM)
Battery module is present on adapter
Following SCSI ID's are not responding
Channel-2: 0, 1, 2
1 Logical Drives found on the host adapter.
1 Logical Drive(s) Failed
1 Logical Drive(s) handled by BIOS
Press <Ctrl><M> or <Enter> to Run Configuration Utility
- 08:53 Tim: Stopped icpagent on dalembert for now, pending examination of pound's problems
- 07:11 brion: added wikipedia.nl alias in powerdns in prep for changing master servers for that domain (jason has that info)
- 02:54 brion: new8 machines (except srv50) don't have ntp working, still. punching at it again (were up to about 15 seconds slow)
- copied /etc/ntp.conf and /etc/ntp/step-tickers from srv50 to the others in the group, ran /etc/init.d/ntpd start
- 02:40 brion: starting test cur-only dewiki dump to double-check dump processing bugs while other backups continue
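The 20:15 entry above says the steward interface breaks replication because it names databases directly in queries instead of selecting them. A minimal sketch of why that can happen, assuming the partial slaves filter on each statement's default database (as MySQL's replicate-do-db option does under statement-based replication; the log does not say how the slaves were actually configured). The function names and the database sets are hypothetical, chosen only for illustration:

```python
def slave_applies(default_db, replicated_dbs):
    """Filter rule (assumed): look only at the statement's default database."""
    return default_db in replicated_dbs


def replay(statement, slave_dbs):
    """Return 'skipped', 'applied', or 'error' for one replicated write.

    statement: dict with 'default_db' (the database selected when the write ran)
               and 'target_db' (the database the write actually touches).
    slave_dbs: databases that exist on, and are replicated to, this slave.
    """
    if not slave_applies(statement["default_db"], slave_dbs):
        return "skipped"   # filtered out before execution
    if statement["target_db"] not in slave_dbs:
        return "error"     # passes the filter, then hits tables the slave lacks
    return "applied"


# Hypothetical partial slave that carries metawiki but not enwiki.
partial_slave = {"metawiki"}

# Write issued while connected to metawiki, naming enwiki in the query.
cross_db_write = {"default_db": "metawiki", "target_db": "enwiki"}

# Same write issued after selecting enwiki first.
selected_write = {"default_db": "enwiki", "target_db": "enwiki"}

print(replay(cross_db_write, partial_slave))   # error
print(replay(selected_write, partial_slave))   # skipped
```

Under that assumption, selecting the target database first lets a partial slave filter the write out cleanly, whereas the cross-database form slips past the filter and stops the slave.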
September 26
- 21:59 brion: killed commons image dump again; too slow, too big. need to rework that...
- 19:30 jeronim: turned off swap and commented it out in /etc/fstab on all pmtpa squids after kate noticed srv7 was swapping and restarted its squid
- 17:30 jeluf: skipped some insert statements to enwiki on the slaves not replicating enwiki. Steward tool running on metawiki tries to write to enwiki and mysql replicates these transactions.
- 09:30 brion: stopped bacon again, running backup of everything but enwiki/dewiki (backdated to 20050921) from bacon
- 09:22 brion: added refresh-dblist script to update the split .dblist files in /h/w/c
- 08:46 brion: started replication catchup on bacon (about 5 days behind)
- 08:40 brion: restarted mwdumper on the enwiki dump, which had broken with a funky file locking problem
- 07:37 Tim: deleted srv27 binlogs 020-026, the rest are needed for srv26 when it starts working again.
- 07:37 brion: locking fiwiktionary for case conversion
- 04:49 brion: turned off srv41, srv26 apaches due to segfaults; turned off exif log for commons due to giant >2gb log file
- 01:04 brion: created wikiro-l, wikimediaro-l lists, iulianu as list admin
September 25
- 13:45 hashar: made nap language inherit from italian language instead of english (rebuildMessages.php nap --update).
- 13:10 hashar: created nap, war & lad wikipedia using the updated howto Add a language. Thanks Tim for the technical assistance.
- 06:17 kate: copying bacon's mysql data to zedler
September 24
- 23:44 brion: changed squid config to use bacon and holbach's .wikimedia.org names instead of .pmtpa.wmnet on kate's advice
- 23:31 brion: pound didn't seem to be working; 503 errors, other problems, was unkillable without -9. unable to run site on holbach's perlbal; squids couldn't find it? restarted pound and icpagent on dalembert, working now
- 23:23 brion: tried restarting pound. (there's also weird cyclic load between rose and anthony every few minutes)
- 23:15 brion: slow site performance reported; ganglia showed unusually high load on srv50, srv37, dalembert. Stopped dumpHTML on dalembert (pound machine), restarted apache on 50 & 37
- 22:12 brion: srv8 failed on squid restart due to broken symlink to config file. added srv8 to pmtpa squid lists for new config list and relinked its config file
- 20:40 midom: webster is up with non-enwiki dbset, ariel is up with enwiki only.
- 19:36 kate: zedler is up with mysql installed; waiting for replication to be sorted out somehow
- 18:20 Tim: deployed new squid configuration generator
- 16:11 jeronim: diderot, harris, alrazi, avicenna out
- 13:30 jeronim: kluge & friedrich out too, for reinstall
- 12:12 jeronim: took goeje out of mediawiki-installation dsh group; putting fc3 on it
- Tim: stopped icpagent on bacon. Load balancers are now holbach (perlbal) and dalembert (pound)
- 07:55 brion: started enwiki xml dump with five parallel readers; experimental (on srv36 pulling from samuel)
- 07:04 brion: trying to fix ntp again on humboldt and new8 machines
- 02:14 brion: disabled Special:Undelete toplevel list; code needs rewriting or just dump it for Special:Log (added link as temp hack)
September 23
- 21:17 brion: added /^Lynx/ to unicode browser blacklist
- 15:37 Tim: Deployed pound/icpagent on dalembert. It is currently running alongside perlbal instances on bacon and holbach.
September 22
- 23:18 brion: turning off capitallinks on tawiktionary
- 18:48 brion: updated pmtpa squid error messages to remove obsolete openfacts and wikisearch references. master copies now in /h/w/conf/squid/errors
- 18:21 brion: wikinews backup done. enwiki backup halted due to some nfs/large file problem. investigating
- 11:58 Tim: brought srv26 back into service
- 11:28 Tim: started deleting thumbnails still in their obsolete locations, 180,000 to delete.
- 09:10 brion: starting *wikinews backups on srv36 pulling from bacon. [installed mwdumper]
- 08:25 brion: running enwiki backup on srv36 pulling from a halted bacon, saving on benet
- 08:08 brion: taking srv36 off perlbal nodelist to try running backups with it
- 07:26 brion: adding new machines to perlbal; ready for service... hopefully
- 07:00 Tim: restarted dumpHTML.php, I had stopped it for a while due to high DB load. I'll stop it again when we get closer to peak time.
- 06:45 brion: running setup-apache script on remaining new8 machines (srv36-41, srv42-48)
September 21
- 23:54 brion: recompiling librsvg with correction to security fix; it had accidentally disabled data: urls as well
- 20:35 brion: set european tz for nlwikimedia
- 18:46 Solar: webster and ariel have rebuilt raid and FC3 installed although they do not have IP's. They are accessible via console.
- 17:00 midom: disabled all bloat in albert's http configuration (mod_perl, php, jk, ssl, ...), which freed lots of memory and allows more effective caching of directory trees and file metadata. And yes, it eased the performance issues a bit (uh oh, yet another image server overload).
- 08:59 brion: disabled wikidiff PHP extension sitewide; there are numerous reports of bad diff output in some cases, and dammit alleges it may be crashy or futex-y. InitialiseSettings.php is set to enable it in the wiki if it's on in php.ini and ignore it if not.
- 07:50 brion: tim is doing ongoing debugging on srv50 trying to identify source of segfaults
- 07:00 brion: installed patch for apache rewrite bug on amd64, but still getting segfaults on srv50
- 06:08 brion: clocks are wrong on new8 boxen; working on correcting
- 06:00 brion: setting up APC instead of Turck on srv50 experimentally
- 00:35 brion: srv50 back out; some apache child process segfaults, which don't look too good
- 00:34 brion: srv50 back in
- 00:11 brion: srv50 out for further adjustments (tidy, proctitle)
- 00:09 brion: putting srv50 into apache rotation to test it out before installing all others
September 20
- 23:20 ævar: changed wgSitename to Vichipedie on furwiki
- 23:14 ævar: ran php namespaceDupes.php --fix --suffix=/broken furwiki to fix namespaces on furry wiki
- 22:39 ævar: changed $wgMetaNamespace on furwiki from Wikipedia to Vichipedie.
- 20:14 brion: reverted Parser.php change temporarily due to reports of massive template breakage
- 19:45 brion: fixed internal wiki (whoops, typo in config change last night)
- 19:13 brion: removed bogus entries from master robots.txt ("/?", "/wiki?", "/wiki/?")
- 14:16 Tim: Disabled context display for full text search results as an emergency optimisation measure. It was taking more than its fair share of our precious DB time. $wgDisableSearchContext in CommonSettings.php.
- Note: This caused a large reduction in CPU usage on the master DB server, from 100% down to 70%. In the future, it might be worthwhile to ensure text for context display is loaded from the slaves.
- 10:30 brion: doing experimental software installs on srv50 [amd64]
- 10:08 brion: Added sync-apache script to rsync the apache config files from zwinger to pmtpa apaches. Don't forget to use it after making changes and before restarting apaches!
- 09:30 brion: moving apache configs a) into /h/w/conf/httpd subdir, and b) into local copies on each server which will be rsync'd
- 08:05 brion: new apache configs on all
- 07:23 brion: fixed up apache configs on *.wikimedia.org
- 07:00 jeronim: added acpi=off panic=5 to adler's kernel params and rebooted, because apparently there are some ACPI problems, and so that it reboots on kernel panic instead of freezing
- 06:53 brion: cleaning up apache config files; replacing ampescape rewrite usage with aliases to remove our patch dependency (tested on wikimediafoundation.org)
- 06:40 jeronim: installed same kernel on adler as is on samuel and set it as default; also samuel's default kernel was changed to a newer one (by yum?), so changed it back to match the current kernel
- 05:30 brion: put suda back in rotation; toned down its share of enwiki hits a bit
- 05:02 brion: adler crashed again at some point
- 02:36 brion: adler was rebooted by colo; running innodb recovery
- 01:58 brion: adler is down, seems to have crashed (panic bits on scs output). taking out of rotation too
- 01:45 brion: lots of delays trying to open suda from wiki; taking out of db rotation
- 01:11 brion: halted backup; benet ran out of space. en_text_table.gz is much larger than expected (49gb), perhaps external storage has not been used correctly as expected? will remove file and continue.
September 19
- 22:10 ævar: uninstalled nogomatch on enwiki, who's going to sort through all that gibberish data? Not me!
- 21:07 brion: rebooting new8 machines to make sure they're running current kernel
- 21:02 brion: new8 group status: srv47 online but borked; 31-35 and 49 offline. others to be set up as apaches
- 20:46 brion: running special pages update on frwiki by request... will update others on cronjob if there's not already one?
- 19:40 mark: Replaced udpmcast.py by a properly daemonized version. Set it up at knams to forward to a multicast group instead of all unicast IPs forwarded by larousse...
- 18:45 mark: Removed miss_access line from knams squids to solve the cache peer errors. Repeat at yaseo if it works...
- 13:49 ævar: Installed the nogomatch extension experimentally on enwiki.
- 08:00 Tim: Removed all NFS mounts from srv1's fstab. Set up a simple /home directory on its local hard drive.
- 06:06 kate: reverted root prompt on zwinger so it's not invisible on a white background
- 04:47 James: stop slave on bacon while dumper is running. Slave will restart when done.
- 02:45 Tim: changed root prompt on zwinger. Started sync-to-seoul, with -u option this time so we don't accidentally overwrite stuff
- 01:50 brion: seems to be mostly back up at this point. boot seemed to be aided by disabling named and letting it lookup from albert
- 01:36 brion: zwinger boot still going on; nfs init is *very* slow doing the exportfs -r; seems to be slow dns lookups
- 00:38 brion: jeronim did this: [root@zwinger srv38]# reboot - unfortunately it was not srv38, but zwinger.
- 00:05 brion: mounted /home on srv1; couldn't login, caused sync-file failures
- 00:05 brion: enabled Nuke extension on meta & mediawiki.org
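The 02:45 entry above runs sync-to-seoul with -u "so we don't accidentally overwrite stuff". A minimal sketch of that rule, assuming the script wraps rsync, whose -u (--update) flag skips files that are newer on the receiver. The function names and file names below are hypothetical and only illustrate the comparison, not the real script:

```python
import os
import shutil
import tempfile


def update_copy(src, dst):
    """Copy src over dst unless dst exists and has a newer modification time."""
    if os.path.exists(dst) and os.path.getmtime(dst) > os.path.getmtime(src):
        return False          # receiver's copy is newer: leave it alone
    shutil.copy2(src, dst)    # copy2 also carries the source mtime across
    return True


def demo():
    """Exercise both outcomes in a scratch directory."""
    with tempfile.TemporaryDirectory() as scratch:
        src = os.path.join(scratch, "src.txt")
        dst = os.path.join(scratch, "dst.txt")
        with open(src, "w") as f:
            f.write("from source")
        with open(dst, "w") as f:
            f.write("edited on receiver")
        # Make the receiver's copy newer than the source.
        os.utime(src, (1000, 1000))
        os.utime(dst, (2000, 2000))
        first = update_copy(src, dst)
        with open(dst) as f:
            kept = f.read()
        # Now make the source newer, so the copy goes ahead.
        os.utime(src, (3000, 3000))
        second = update_copy(src, dst)
        with open(dst) as f:
            final = f.read()
    return first, kept, second, final


print(demo())
```

The first call leaves the receiver's edit in place, which is the protection the entry is after; the second call overwrites only once the source is the newer copy.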
September 18
- 14:00 jeronim: rebooted zwinger by mistake and it needed a manual reset by colo staff to come back up. Site was offline for about an hour.
- 04:34 brion: vandale kernel panic, frozen
- 04:30 Solar: srv36-srv50 are racked, have ip's, and are ready for production
- 03:10 Tim: moved compressOld.php to dalembert (where dumpHTML.php has been running), on complaints that it was causing problems on zwinger.
September 17
- 22:17 brion: running unique-ip counter on fuchsia with saved logs (into uniqueip table on vandale)
- 22:02 brion: disabled disused info-de-l list by request of list admins
- 11:05 brion: ran initStats on all wikisources to initialise those not already set
- 07:06 brion: canceled upload dump for commons backup due to size and slowness; too big to fit
- 06:30 jeronim: on larousse, removed fedora netcat and installed from source into /usr/local
- 04:30 Tim: used ntpdate -u pool.ntp.org to set the times on all the yaseo machines, some were a long way out. Then set all their timezones to UTC. This apparently caused ganglia to think yf1000 and yf1002 were down, fixed by restarting the local gmond.
- 04:10 Tim: Started replication on henbane
- 01:10 brion: enabled wikidiff on all wikis. (can be disabled selectively w/ wgUseExternalDiffEngine in InitialiseSettings)
- Tim: Set up mysql on henbane, made a consistent dump of kowiki and commonswiki using bacon, copied dump to henbane ready to start replication
September 16
- 22:20 Tim: started mysqld on srv26, it had been off for 12 hours or so. The compression script had been running all that time, srv26 caught up to the master without incident.
- Colo (Solar):
- supposedly bart is brought back up
- borrowed HP switch connected to gi0/4 on the cisco
- moreri was moved, and is trying to netboot (fails)
- 10 of the 20 new servers have been racked and wired to the borrowed HP switch, but don't have IPs yet
- 11:37 brion: updating sitenames on he, el, ru wikisource
- 11:30 brion: started backup run
- 03:17 brion: frwiki reimport done
- 02:47 brion: frwiki reimport started
- 02:35 brion: jawiki reimport done
- 01:49 brion: started jawiki reimport
- 01:33 brion: bacon catching up; suda is fine as it is partial mirror
- 01:29 brion: took bacon, suda out of rotation for further investigation
- 01:23 brion: nlwiki open for editing
- 01:03 brion: reimporting nlwiki on samuel
- 00:41 brion: nl/fr/ja dumps done (in /var/backup/private/recovery). going to try reimporting soon
- 00:16 brion: running attachLatest on *wikisource
September 15
- 23:14 brion: 3 dumps from adler done; doing extra backups from samuel too. setting adler to read-only
- 22:37 dumping nlwiki, frwiki, jawiki databases from adler onto sql files on benet
- 22:18 put load back on samuel for enwiki with adler disabled. fr, nl, ja wikipedias are locked while we work this out
- 22:09 commented out adler from db.php; adler appears to be misconfigured and all kinds of breakage is going on. it's not read-only, and has some revisions that others don't have
- 21:56 brion: took load off bacon (was 100 load on fr, nl, ja; nl and fr reporting weird editing problems possibly freak lag problems, and it was consistently lagging a few seconds at least)
- 17:25 mark: Setup IPsec between bacon and vandale. Who wants to setup replication?
- 16:50 mark: Altered geodns: pointed Malaysia at yaseo, and Israel, Turkey, Cyprus at knams
- 13:04 Tim: Shutting down apache on dalembert temporarily so that I can use it for HTML dump testing and generation
- 12:35 Tim: Restarted compressOld.php, it stopped when I shut down bacon to do the copy to adler.
- 11:30 mark: Restarted some knams squids to increase FDs, changed /etc/rc.local startup script
- 11:15 mark: Deployed squid on yf1003 and yf1004, and added them to the DNS pool
- 11:10 mark: Recompiled squid on yaseo to increase filedescriptors to 8192 and restarted all squids with 4096
- 07:37 brion: running importDumpFixPages.php on wikisources to fix bogus rev_page items
- 02:30 kate: ariel's down
- 02:29 brion: recompiling mono 1.1.9 on benet for xml bugfix
- 00:15 brion: removed humboldt and hypatia from mediawiki-installation node group, neither has port 80 on:
- humboldt prompts for password, not configured correctly?
- hypatia shows host key changed; was reinstalled?
- 00:10 brion: disabled MWSearchUpdater plugin as the daemon is broken; briefly broke the wiki due to bad include_path; need to fix config for MWBlockerHook to make sure the path is right even w/o the lucene include
September 14
- 21:30 mark: Setup log rotation at yaseo to knams, routed japanese and chinese clients to yaseo squids.
- 20:30 midom: adler online, bacon catching up
- 20:15 mark: Deployed squid on yf1001, and routed Korean clients to the Florida squid cluster.
- 18:15 mark: Deployed squid on yf1000.
- 18:10 mark: Wrote a YASEO squid deploy script /home/wikipedia/deployment/yaseo-squid/prepare-host (yahoo cluster only, should I put it at florida?) after Tim's apache prepare-host script
- 17:48 ævar: de-opped myself on ruwiki and stopped my revert bot, the russians hate me even more now.
- 16:30 mark: Set up a squid on yf1001. Same setup as knams, except it's in /usr/local/squid as in florida. Adapted florida's squid and mediawiki configs accordingly.
- 13:19 ævar: ran INSERT INTO user_groups VALUES (1165, "sysop"); on ruwiki to make myself temp. sysop to fix the MediaWiki: fsckup.
- 11:15 brion: halted nlwiki partial temp backup as enough was run to test problem
- (identified problem as [1])
- 10:41 brion: running another nlwiki backup to get raw dumpBackup.php output for testing
- 10:39 brion: halted old backup sequence (at nlwiki, with a mystery breakage in output that needs examining)
- 10:33 brion: hacking dumpBackup.php to load php_utfnormal.so extension (not yet enabled sitewide)
- 10:05 brion: running kowikisource and zhwikisource imports on formerly broken parts
- 08:55 brion: updated messages on jawikisource
- 08:30ish brion: updated messages on *wikisource
- 01:30 jeronim: access to yaseo console server should be back hopefully within a few hours - eam is dealing with it
September 14
- 13:32 Tim: Shut down mysql on bacon, started copying data directory to adler
September 13
- 23:23 brion: set logo on dewikiquote to commons version
- 23:ish brion: installing mono 1.1.9 with xml patch on benet to fix future dumps ([2])
- 17:23 ævar: Logging Exif debug information to /home/wikipedia/logs/exif.log using wgDebugLogGroups.
- 16:40 jeronim: yf1000 - yf1004 are all set up with reiserfs now. The only yaseo machine not working is yf1013 which is in an unknown state as the console server (konsoler04.krs.yahoo.com (10.11.1.186)) is unreachable.
- 16:18 Tim: Started moving some text to cluster2, starting with frwiki.
September 12
- 11:59 brion: killed search update daemon; going to replace this (again) with a more robust queuing system
- 15:00 or so kate: upgraded perlbal to 1.37
- 13:24 jeronim/kyle: lots of machines connected to SCS, port labels corrected. The APC has apparently vanished - Kyle couldn't find it.
- 09:40 brion: installed ICU 3.4 on zwinger and mediawiki-installation from RPMs built from the ICU-provided spec file. Source and binary rpms in /home/wikipedia/src/icu
- 09:34 brion: fixed misnamed krwikisource -> kowikisource db
- 8:50 Tim: rebuilt interwiki tables
- 02:15 brion: replaced old php.ini on zwinger with symlink to the common one. added /usr/local/lib/php back into the default include_path (for PEAR stuff sometimes used)
- 01:04 brion: blocked leech enciclopedia.ipg.com.br
September 11
- 22:05 brion: trying batch clears in parallel overloaded zwinger; canceled, running in serial again
- 21:35 brion: running batch operation to remove bad cached messages
- 21:00 brion: reconfigured blocker daemon to log to samuel. had to set up permission grant again on samuel
- 18:19 Tim: finally managed to fix the message problem, except for some erroneous values stored in cache
- ~18:00 ævar: To get interwiki links working on hrwikisource: sourced the output of maintenance/rebuildInterwiki.php and sourced maintenance/interwiki.sql on all wikis. Some interwiki prefixes appear to have been lost in the process, e.g. bugzilla: (only mediazilla: exists in interwiki.sql). Looks like we need better interwiki update scripts...
- Don't run interwiki.sql, under any circumstances. Add new prefixes to m:Interwiki map. -- Tim 08:52, 12 Sep 2005 (UTC)
- 16:05 Tim: switched master to samuel. Adler asks for root pw after reboot due to failed fsck.
- 15:10 Adler crashed. Tim and JeLuF on the scene, wiki switched to read-only mode
- 14:59 Tim: Non-default language message caching completely f****d up. Blank messages everywhere
- 07:10 brion: now using blocker list
- 07:00 brion: installed limited librsvg on apache cluster, svg back on
- 15:40 Tim: Installed apache, php, turck and mediawiki on yf1005. Put all required commands in /home/wikipedia/deployment/yaseo-apache/prepare-host. Still needs database, memcached and mediawiki configuration.
- 05:05 brion: restarted MWUpdateDaemon, hung again at 1gb used memory
- 02:38 brion: disabled svg for further security work
- 01:20 brion: reconfiguring wikisource to allow en.wikisource.org to work (hr ja kr sv zh en now imported)
- 01:09 brion: installed librsvg 2.11.1 on the apaches; it's in /usr/local. (old librsvg versions seemed to muck up text pretty bad)
September 10
- 22:49 brion: importing wikisource nl ro ru
- 22:34 ævar: deinstalled the wgDebugLogFile on commonswiki, got enough debug output to see if anything was wrong.
- --:-- jeronim: yaseo stuff:
- reinstalled FC4 on yf1000, yf1001, yf1003, yf1004 with reiserfs
- reinstalled FC4 on dryas & henbane with 10GB ext3 root partition and the bulk of the disk as jfs on /a
- rsyncing /home, /tftpboot, /root, /var/www, /usr/local, and /etc from amaryllis to dryas in preparation for reinstalling amaryllis with reiserfs. It's a script, /root/amaryllis-rsync.sh, running in a screen on dryas.
- 14:14 ævar: installed a wgDebugLogFile for commonswiki in /home/wikipedia/logs/commonswiki.log to monitor Exif debug output.
- 13:26 ævar: ran maintenance/deleteImageMemcached.php on all wikis fixing bug 3410
- 10:44 brion: cleaning out old mysql data from benet to free up space for current backups (40 days+ out of date, not too useful)
- 10:00 brion: restored working frame-breakout code (pending cached wikibits.js)
- 07:58 Tim: moved some ancient rubbish from /home/wikipedia/htdocs to /var/backup/home/wikipedia/htdocs
- 07:10 brion: running data split for additional wikisource languages
- 02:40 Tim: Changed names of Seoul machines
- 02:15 brion: set edit rate limit for new accounts to same as ip rate limit
- 01:40 brion: installed rsvg (librsvg2) on mediawiki-installation machines, enabled SVG uploads
September 9
- 06:30 brion: restarted stalled de,en dumps
September 8
- 19:18 brion: checker daemon running
- 10:50 brion: setting up vandal checker daemon on larousse
- 10:42 hashar: enabled subpages for portal (100) and portal discussion (101) on dewiki.
- 7:45 hashar: added two namespaces for frwiki : 100=>Portail, 101=>Discussion_Portail .
September 7
- 22:00 jeronim: fixed avar's login problem on servers in the mediawiki-installation group -
- nscd -i passwd did not work
- /etc/init.d/nscd restart ; /etc/init.d/sshd restart did solve the problem on each machine except for benet; for benet, problem was finally solved after doing the restarts twice more, then nscd -i passwd, then doing the 2 restarts with a pause in the middle
- 21:30 jeronim: killed everyone's ssh sessions and sshd on zwinger (sorry)
- 10:25 midom: After Tim put the memcached patch live, the site's sessions were switched from NFS to memcached.
- 06:54 brion: killed stalled backup -- memcached send hang for the last day or so. It's continuing w/ dkwiki; will rerun stalled dewiki and enwiki
September 6
- 19:55 brion: tgwiktionary to lowercase
- 05:30 brion: set up experimental upload verification hook
- 04:02 koko: removed firewall
September 5
- 12:40 brion: set up to shut down search builder daemon every hour (at 47 minutes) to protect against memory leaks in builder; search-update-daemon wrapper script set to auto-restart 5 seconds after shutdown/crash of the daemon
- 09:05 brion: rebuildMessages.php --update on all wikis to add various new messages
- 06:09 brion: starting mass lucene updates of pages edited in august
- 05:18 brion: lucene back-deletions done, reoptimizing build index
- 01:10 brion: search updater up; running queued deletions
- 00:45 brion: vincent back in active search rotation
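The 12:40 entry above describes a wrapper that restarts the search update daemon 5 seconds after it exits, whether it crashed or was shut down by the hourly job. A minimal sketch of such a restart loop follows. It is not the actual search-update-daemon wrapper: the function name, the max_runs limit and the stand-in command are assumptions added so the loop can be exercised safely.

```python
import subprocess
import sys
import time

# Hypothetical stand-in for the daemon: a child process that exits at once with status 1.
EXAMPLE_COMMAND = [sys.executable, "-c", "raise SystemExit(1)"]


def supervise(command, restart_delay=5.0, max_runs=None, sleep=time.sleep):
    """Run `command` repeatedly, waiting `restart_delay` seconds between runs.

    Returns the list of exit codes seen. With max_runs=None the loop never
    ends, which is what a real wrapper would want; the limit exists only so
    the sketch can be tried out.
    """
    exit_codes = []
    while True:
        # Blocks until the child exits, by crash or by scheduled shutdown.
        result = subprocess.run(command)
        exit_codes.append(result.returncode)
        if max_runs is not None and len(exit_codes) >= max_runs:
            return exit_codes
        sleep(restart_delay)


# Two runs with the delay recorded instead of slept, to keep the example quick.
recorded_delays = []
codes = supervise(EXAMPLE_COMMAND, max_runs=2, sleep=recorded_delays.append)
print(codes, recorded_delays)
```

Pairing a loop like this with a scheduled shutdown bounds how much memory a leaking process can accumulate, at the cost of a few seconds of downtime per cycle.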
September 4
- 23:55 brion: splitting lucene config to lucene.php. putting coronelli on search, with optimized index
- 19:30 jeronim: created helpdesk-l
- 17:20 jeronim: fuchsia does not boot on the latest kernel (see below), but it does boot on the 2.6.11-1.33_FC3smp kernel, so switched it to boot that kernel by default
- 16:27 mark: Because of cascading incidents in knams, we moved all traffic to florida and lopar via DNS.
- 14:30 jeronim: fuchsia was dead or very close, so power-cycled it using the IPMI. It is broken:
Copyright (c) 1999-2004 LSI Logic Corporation
insmod: error inKernel panic - not syncing: Attempted to kill init!
serting '/lib/mpACPI: PCI Interrupt 0000:02:04.0[A] -> GSI 27 (level, low) -> IRQ 177
tmscsih.ko': -1 ptbase: Initiating ioc0 bringup
niknown symbol ioc0: 53C1030: Capabilities={Initiator}
module
Call Trace:/sbin/udevstart <ffffffff80138164>{panic+196}e xited abnormaly!
Creating roo<ffffffff8034f811>{__down_read+49}t
device
dev: label /1 n t found
Mountin<ffffffff80207ef1>{__up_read+33}g root filesyste m
mount: error <ffffffff8013ae53>{do_exit+99}2
mounting ext2
<mount: error 2ffffffff80207db1>{__up_write+49}mounting none
S witching to new <ffffffff8013ba8f>{do_group_exit+239}r
oot
: mount failed: 22
umount /init<ffffffff8010eaa6>{system_call+126}r d/dev failed:
- 13:16 Tim: made /home/wikipedia/lib/install.sh ignore x86_64 machines, added a part to clean up rubbish left in /usr/lib, then ran it everywhere with dsh -a -f
- 04:20 Tim: reinstalling PHP 4.4.0 with exif support. Using php-upgrade-440, which calls the new script /home/wikipedia/lib/install.sh to set up shared libraries in /usr/local/lib.
September 3
- 18:40 jeronim: removed the bodies of two mailman archive messages on yannf's request
- 06:40 brion: relaunch updated backup script with some of the broken bits fixed.
- 04:50 Tim: Finished benchmarking PHP 4.4.0, see GCC benchmarking. Now deploying the new binaries, from source tree /home/wikipedia/src/php/php-4.4.0-gcc4
- sometime brion: added .log to text/plain on benet's lighty
September 2
- 12:00 brion: ran backup test on aawiki using the new dump splitter and partial new backup script. (script is in ~brion/run-backup.sh if anyone wants to examine it)
- 07:19 Tim: compiling GCC 4.0.1 on zwinger. It will be installed with a program suffix, so gcc is still the old compiler, and gcc-4.0.1 is the new one. Source directory is /home/wikipedia/src/gcc/gcc-4.0.1, build directory is /home/wikipedia/src/gcc/gcc-4.0.1-build.
- 06:21 Tim: removing hypatia from perlbal nodelist for an hour or so, for some benchmarking
September 1
- 07:45 brion: set sitename/meta namespace on mtwiki
- 07:00 brion: running cleanupTitles.php to rename broken pages. Will be at Special:Prefixindex/Broken/ at each wiki.
August 30
- 17:30 jeronim: made a robots.txt on larousse (noc/kohl) to disallow some dynamic pages and a few others
- 16:40 jeronim: created wikimediapl-l
August 29
- 21:30 brion: blocked wissens-schatz.de for remote loading
- 17:30 jeluf: anonymized a name in the archive of wikide-l
- 11:30 brion: running a batch job checking for invalid titles on various wikis (cleanupTitles). shouldn't interfere with anything, making no changes.
August 28
- 22:15 brion: locking plwiktionary for capitalization change
- 15:18 hashar: created wikimk-l mailing list.
- 15:15 mark: Brought mayflower back up. Repaired the filesystems, and rebooted it. It was reporting lines like
Aug 28 04:22:34 mayflower kernel: swap_free: Bad swap file entry 7800007ffffff00f
- 14:30 mark: Another Kennisnet V-20 went down, this time it was mayflower dying sometime this morning. Depooled it... As it's not critical and we still have SP access, I will have a look at it first.
August 27
- 00:45 brion: turned on wegge's experimental watchlist bot thingy on dawiki
August 26
- sometime: lots of data imported on wikisources
August 25
- 16:02 jeronim: added fc-mirror.wikimedia.org DNS entry for fedora mirror
- fc-mirror 1H IN CNAME albert
- 15:40 hashar: created wikials-l mailing list. TODO: delete /h/w/htdocs/mail/.index.html.sw(o|p) (swap files by fire).
- 19:00 mark: PowerDNS on pascal appeared corrupted. Most probably because of an overlapping zones problem in bindbackend (not bindbackend2). I integrated rev.wikimedia.org into the wikimedia.org to evade that.
- 16:09 hashar: blacklisted www . izynews . com on florida squids (using acl badbadip src 62.75.174.182/32). Needs to be done on kennisnet and paris cluster too.
- 11:00 brion: set up https on kohl. (old ssl key files backed up; wasn't using the established password, nobody knew what it might have been)
- 07:05 brion: rebuilt interwiki tables; using correct interwikis for the new wikisources.
- 06:51 brion: added sr.wikisource.org
- 02:02 hashar: updated in HEAD LanguagePt.php from meta. Watch out when synchronising.
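The 16:09 entry above blocks a single source address with a squid `src` ACL. As a rough illustration of what such a source-address match does, here is a minimal Python sketch using the standard `ipaddress` module. It is not how squid implements ACLs; the names `BADBADIP` and `is_blocked` are made up for the example, and only the network 62.75.174.182/32 comes from the log.

```python
import ipaddress

# Source network taken from the 16:09 log entry. The variable name mirrors
# the squid acl name but is otherwise just an illustrative choice.
BADBADIP = [ipaddress.ip_network("62.75.174.182/32")]


def is_blocked(client_ip, networks=BADBADIP):
    """Return True if client_ip falls inside any of the listed networks."""
    addr = ipaddress.ip_address(client_ip)
    return any(addr in net for net in networks)


if __name__ == "__main__":
    for ip in ("62.75.174.182", "62.75.174.183"):
        print(ip, "blocked" if is_blocked(ip) else "allowed")
```

A /32 network matches exactly one address, so only the listed host is affected. Widening the prefix would block the neighbouring addresses as well.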
August 24
- 14:04 hashar: disabled lucene search. Daemon runs on maurus but times out / doesn't give any output.
- 04:00 Jamesday: started nice bzip2 for slow query log and first 72 binary logs on adler to free 40GB of disk. Can archive them on another box later.
- use avicenna for binlog archives -- Tim 05:53, 25 Aug 2005 (UTC)
- 00:43 brion: trying out an older version of MWDaemon on vincent to see if memory leak is a new code problem
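The 04:00 entry above frees disk space by compressing old logs with bzip2. A minimal sketch of the same idea in Python follows, assuming the logs are ordinary files that nothing is still writing to. The function name `compress_log` is hypothetical, and the sketch does not reproduce the `nice` priority used in the entry.

```python
import bz2
import os
import shutil


def compress_log(path):
    """Compress `path` to `path + '.bz2'`, remove the original,
    and return the number of bytes freed on disk."""
    original_size = os.path.getsize(path)
    target = path + ".bz2"
    # Stream the file through bz2 so large logs are never held in memory.
    with open(path, "rb") as src, bz2.open(target, "wb") as dst:
        shutil.copyfileobj(src, dst)
    # Only remove the original after the compressed copy is fully written.
    os.remove(path)
    return original_size - os.path.getsize(target)
```

Moving the resulting .bz2 files to another host afterwards, as the entry and the follow-up note suggest, would be a separate copy step.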
August 23
- 16:17 jeluf: removed 10.0.0.17 (vincent) from MWDaemon pool. Was always reporting errors.
- 09:39 brion: added http://ar.wikisource.org http://da.wikisource.org http://de.wikisource.org http://el.wikisource.org http://es.wikisource.org http://fr.wikisource.org http://gl.wikisource.org http://it.wikisource.org http://la.wikisource.org http://nl.wikisource.org http://pl.wikisource.org http://pt.wikisource.org http://ro.wikisource.org http://ru.wikisource.org
- 05:23 Tim: Reports from users of frequent "connection refused" errors reported by the browser. Investigated, found squid was crashing once every 10 minutes or so, on 4 out of 6 squids. The two that weren't crashing were running a newer version of squid, so I upgraded them all to that version.
August 22
- 22:12 brion: upped max post size to 75mb on squids; were problems posting large videos to commons (or something)
- 21:50 brion: renamed presswiki to internalwiki
August 21
- 22:53 brion: bugzilla up; removed ssl-ticket.wikimedia.org from pascal's apache conf.d dir
- 22:48 brion: bugzilla.wikimedia.org appears to be offline.
- 13:30 Tim: reduced lucene load on vincent to 1/4, maybe that will stop it from locking up (which it did again)
- 13:00 Tim: restarted lucene on vincent, it was closing connections as soon as they were established
- 06:27 brion: otrs now accessible again on https://ticket.wikimedia.org/ ; now with redirect for the index page! For reference: Apache is in /usr/local/otrs
- 06:00 brion: trying to start otrs on ragweed. apache configuration appears to be borked.
August 20
- 10:00 jeluf: finished OTRS transition to ragweed. SpamAssassin setup finished.
- 09:53 Tim: Switched site to 1.6alpha
- 08:16 Tim: Applying schema update for 1.6alpha, basically an ALTER TABLE watchlist
- 01:00 Tim: ran update-special-pages
August 19
- 23:30 brion: changed postfix 'myhostname' setting from zwinger.wikimedia.org to mail.wikimedia.org, should prevent the mail loop errors reported sending to the full addr
- 23:00 brion: ran namespace conflict checks for updates on tawiki and gawiki
- 21:40 brion: updated rebuildInterwiki
August 18
- 23:30 jeluf: OTRS status: Installed apache/php/perl/postfix/mysql client on ragweed. Using pascal as DB server. Problems with sessions, sessions seem to be mixed up, sometimes I get logged in as presroi, sometimes as JeLuF :-/ Stopped apache for now. Postfix still accepting new tickets.
- 22:30 mark: Changed DNS CNAME ticket.wikimedia.org to point to ragweed
- 22:17 brion: disabled account creation throttle on press wiki; this is closed wiki and all accounts are created by an admin
- 10:00 midom: suda is back again, with enwiki and commonswiki databases
- 05:00 jeluf: copied OTRS tables to pascal, copied otrs binaries to pascal, configured pascal to serve https. Can access old tickets again. Currently can't send new tickets to otrs. DNS change needs to be done.
- 00:55 brion: recreated wikimediasr-l list on zwinger
August 17
- 19:27 brion: fixed bug in db.php that set all database load factors to NULL
August 16
- 20:15 jeluf: renamed project namespace on cswikibooks to Wikiknihy.
- 15:30 midom: resumed idle bacon's mysql replication, we might need to do external store migration soon, and bring back suda with smaller dataset.
August 15
- 21:46 kate: always_bcc on zwinger was set to "quagga" and its mbox was full, so it generated lots of bounce messages. i removed the setting.
- 12:30 mark: Mint seems to have at least a bad disk, possibly other problems. Sun will look at it. In the meantime, we can *try* to network boot it and recover data.
- 10:30 jeronim: had a look at mint via the IPMI - tried to power cycle it but it wouldn't switch off. Mark will tell the kennisnet guys about it. There's a dump of the OTRS DB from before the transfer to mint in albert:/root. If mailman is to be put back to zwinger, chapter-l and the new Serbian list will need to be re-created (and maybe some other lists?).
- 09:00 mark: Mint apparently is fucked, RAID and SP settings were reverted to factory defaults. Trying to do data recovery now. Possibly a power problem?
August 14
- 19:51 brion: mail config on zwinger broken or funky or otherwise annoying; just leaving it off for now. moved dns for mail back to mint (which is still dead) sighhhh
- 19:26 brion: moved mail.wikimedia.org back to zwinger due to extended outage on mint. With our limited support contract on knams we can't afford to have this critical service there.
- 14:30 midom: srv27,srv26,srv25 joined external storage service, waiting for payload
- 09:30 brion: mint is offline, no ping
- 00:20 brion: stopped bacon to run backup dump
- 01:00 jeluf: enabled spamassassin for OTRS on mint (~otrs/.procmailrc)
August 13
- sometime kate: moved otrs to mint
- 23:25 brion: added wikimediasr-l aliases to mailman on mint
- sometime someone: Apparently mail.wikimedia.org has been moved to mint.
- 10:42 jeronim: set ticket.wikimedia.org to CNAME mint.knams.wikimedia.org. (move of OTRS to mint is in progress)
- 00:58 Tim: started update-special-pages
- 00:19 Tim: it happened again so I disabled otrs's crontab. Original crontab is in /opt/otrs/crontab
August 12
- 23:18-23:30 Tim: An OTRS process on albert (PostMaster.pl) developed a runaway memory leak, causing heavy swapping. This slowed down albert sufficiently to cause the entire apache cluster to lock up with high load. Killed the process at 23:30 and the site soon returned to normal.
- 09:30 brion: took srv1 out of 'apaches' node group and shut off apache on it. DON'T RUN APACHE ON SRV1
August 11
- 21:26 Tim: TICK TICK TICK, that's the sound of 58 servers with their clocks ticking in synchrony, maximum offset 80ms.
- 20:30 Tim: Added the missing restrict line for 10.0.0.200 to ntp.conf on (almost) all machines
- 19:30 Tim: Synchronised ntp.conf on hypatia, humboldt, rose, anthony, rabanus, diderot and srv1 with /home/config/others/etc/ntp.conf.vlan2 . This made them remotely queryable, for easier debugging in the future, and also switched their preferred server from zwinger to the cisco (in broadcastclient mode).
- 18:35 Tim: Fixed tingxi's resolv.conf
- 17:45 mark: Fixed inconsistent favicons on apaches. Older apaches had symlinks to a common (wikipedia) favicon, which got overwritten with the new wikinews favicon by brion. Removed the symlinks, and put the correct favicons in place.
- 12:20 brion: set up pl.wikimedia.org and press.wikimedia.org (press is locked, and currently has no user accounts. a sysop/bureaucrat will need to be added for it to be used)
- 07:28 brion: updated wikinews.org favicon
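The 21:26 entry above quotes a maximum clock offset across the cluster. How that figure was gathered is not recorded here; the sketch below only shows how a maximum absolute offset could be computed once per-host offsets (in milliseconds) have been collected by some other means. The function names and the sample host names and values are hypothetical, not measurements from the log.

```python
def max_abs_offset_ms(offsets_ms):
    """Return the largest absolute clock offset in a {host: offset_ms} mapping."""
    return max(abs(value) for value in offsets_ms.values())


def hosts_over(offsets_ms, limit_ms):
    """Return the sorted host names whose absolute offset exceeds limit_ms."""
    return sorted(
        host for host, value in offsets_ms.items() if abs(value) > limit_ms
    )


if __name__ == "__main__":
    # Made-up sample values, for illustration only.
    sample = {"host-a": -80.0, "host-b": 12.5, "host-c": 95.0}
    print(max_abs_offset_ms(sample))
    print(hosts_over(sample, 80))
```

Offsets can be negative (a clock running behind), which is why the comparison uses the absolute value.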
August 9
- 23:20 mark: Rerouted Europe back to knams, because all sorts of weird problems were occurring. Fixed a typo (pmpta) in DNS. Some nameservers report TTL 0 for some of our DNS records - need to investigate that.
- 22:20 mark: Moved Squid service IP 207.142.131.246 from overloaded srv10 to srv5. Cleared the ARP entry on the l3 switch.
- 22:00 mark: Reroute everything from knams to pmtpa directly, because of routing problems
- 13:35 mark: changed biruni's hostname from biruni.wikimedia.org to biruni
- 13:30 mark: added avicenna and biruni to node_groups/apaches
- 13:00 mark: Restarted apaches on avicenna, alrazi and biruni with -DSLOW, and changed startup scripts
- 08:52 jeronim: blocked 61.48.105.65 spammer IP from all wikis using block-ip-all - so ipblocklist message will speak of "vandalism" instead of "spam"
- 08:25 jeronim: created chapter-l for mailman on mint
August 8
- 09:22 kate: enabled greylisting on mail.wm.org
- 20:54 hashar: readded srv2 (with ip x.x.0.1 ) to the apache pool
- 18:25 hashar: avicenna & biruni readded. Monitoring error log, #wikipedia and memory.
- 17:43 brion: added /mnt/upload mounts on avicenna and biruni
- 17:32 hashar: forgot sync-common on avicenna and biruni :/ I thought scap would do the job ... They are both missing the upload directory.
- 15:45 brion: stopped apache on avicenna and biruni pending more information on reported errors
- 15:36 hashar: TODO: biruni hostname seems wrong: /etc/sysconfig/network lists HOSTNAME=biruni.wikimedia.org whereas other servers just get HOSTNAME=zwinger or HOSTNAME=srv30 ...
- 15:36 hashar: removed srv1 from mediawiki-installation dsh file (as apache is not meant to run on it).
- 15:24 hashar: brought biruni back into mediawiki-installation pool
- 15:12 hashar: brought avicenna back into mediawiki-installation pool
- 14:30 hashar: started apache on srv11.
- 06:30 kate: moved mailing lists to mint. let's see if it starts sucking less.
August 7
- 20:50 brion: postfix hung zombified on zwinger, wouldn't restart automatically. had to remove master.pid and restart.
- 16:25 brion: installed DynamicPageList on wikiquote per [3]
- 15:50 brion: locked tlhwiki
- 07:47 brion: added application/ogg as mime type for ogg files on albert
- 00:59 brion: set localized logo for ptwiktionary
August 3
- 14:15 mark: Switched over upload.wikimedia.org to lighttpd instead of apache on albert
- 12:00 brion: added frankfurt city map to wikimania whitelist. whoops!
August 2
- 15:45 mark: Bound albert's apache to a single IP, instead of INADDR_ANY
- 09:40 brion: added wildcard subdomains for wiktionary.com redirection
August 1
- 22:30 all: samuel's disk filled up. Switched master to adler. Re-syncing samuel from suda.
- 14:50 mark: Put all kennisnet squids back into DNS, updated DNS on pascal
Archives
- Server admin log/Archive 1 (2004 Jun - 2004 Sep)
- Server admin log/Archive 2 (2004 Oct - 2004 Nov)
- Server admin log/Archive 3 (2004 Dec - 2005 Mar)
- Server admin log/Archive 4 (2005 Apr - 2005 Jul)