Server admin log/Archive 9

Revision as of 21:23, 19 December 2005

December 19

  • 20:00 Domas: srv57 and srv61 down, used srv70 and srv55 as Tugela replacements.
  • 15:00 mark: Resurrected mint and lily.

December 18

  • 14:20 Tim: attempted to restart lily, it crashed 20 hours ago.
  • 00:00 Domas: holbach resurrected and is working as db slave...

December 17

  • 13:05 Solar: Holbach is available at 10.0.0.24
  • 11:35 Solar: sq1-10, minus 3 and 7 (hardware errors), are up with 10.0.3.x IPs
  • 06:40 brion: installed <fundraising/> extension (FixedImage) for the fundraising progress bar
  • 01:55 brion: reinstalled php 5.1.1 on tingxi with gd enabled
  • 01:50 brion: briefly locked new registrations on zh.wikipedia while adding a range block
  • 0:10 brion: rebuilt interwikis (bugzilla:1586)

December 16

  • 04:45 brion: installed apache 2.2 and php 5.1.1 on tingxi for fundraising info server (with SSL)

December 15

December 14

  • 20:30 hashar: added stylesheet for http://static.wikipedia.org/
  • 12:50 mark: Built a new squid RPM (2.5.STABLE12-2wm) that sets a maximum resident memory size (default: 2 GB, specifiable in /etc/sysconfig/squid), and tested it on fuchsia
  • 11:20 mark: Decreased the Squid timeout value of lvsmon on pascal to 10 seconds, and restarted iris which was thrashing heavily.

December 13

  • 22:31 brion: benet ran out of disk space, looking at where it went
  • 19:22 brion: review of dump status shows that srv30 broke during the dump circa 04:22 yesterday, crashing enwiki and eswiki. restarting those two dumps
  • 01:40 Tim: Restarted python IRC client on browne, on reports that no more channels were being created

December 12

  • 22:40 brion: reinstalled turck-mmcache on tingxi; had not been upgraded after PHP recompile and was whining about version mismatch
  • 14:30 mark: Resurrected mint which apparently had crashed two days ago.
  • 03:00-5:00 Tim: restarted some apaches with hung processes waiting for NFS

December 11

  • 13:17 hashar: BUG zwinger:/tmp/mediawiki/ should probably be in /var/cache/mediawiki/confs/ and wikitech group writable.
    • This is not the place to report bugs. Please use the IRC channel. -- Tim 20:55, 11 December 2005 (PST)
  • 13:16 hashar: created namespaces for itwiki & itwikisource (#bug 4247).
  • 09:33 brion: dumps running in pmtpa on benet/srv35/srv36; in yaseo on amaryllis

December 10

  • 05:20 brion: leuksman.com mysql & apache went wacko, memory limits killing things... restarted mysqld and apache
  • ~05:00 Tim: dsh -N mediawiki-installation -f chmod -R 777 /tmp/mediawiki . And changed MessageCache.php so that it will stay that way.
  • 01:40 brion: segfaults on leuksman.com reappeared; got backtrace, posted additional details on similar-looking php bug 35140. I have disabled APC on this server to try to reproduce the bug without it.
  • 00:10 brion: set up cywikisource and copied in some pages (bugzilla:4228)

December 9

  • 14:40 mark: Shutdown Tunnel0 on csw2-knams as an attempt to solve weird routing problems
  • 06:30 Solar: new squids are racked, but only sq1 and sq2 are up at 10.0.3.1-2
  • 04:47 Tim: set up www.wikimedia.org as a portal editable via meta, like the others

December 8

  • 17:43 brion: ns1 and ns2.wikimedia.org don't have updated DNS. what's wrong??
  • 11:40 Solar: sq1 is connected to the SCS port 9.
  • 11:29 Solar: asw3-pmtpa is racked and connected to the scs
  • 11:00 Solar: connected equ1's eth1 interface to csw4-pmtpa's port 34
  • 10:16 ævar: Turned allowemailchange on in bugzilla, users can now change their email
  • 10:12 Solar: fixed srv66's grub.conf to boot to correct kernel
  • 08:21 Domas: used srv70 as emergency tugela as srv66 down
  • 07:42 brion: updating tingxi in forward/reverse DNS and adding 'fundraising' CNAME
  • 06:30 brion: taking tingxi out of apache groups, giving it an external setup for fundraising utilities
  • 03:54 kate: stopped lomaria to dump for import to zedler
  • 02:10 brion: cleaning up after bogus CVS updates in common dir owned by hashar

December 7

  • 13:30 mark: Rerouted traffic back to knams
  • 12:00 mark: Rerouted knams traffic to pmtpa because of networking problems near knams
  • 11:14 brion: recompiled apache/php/apc on leuksman.com, hoping to debug intermittent segfaults if they continue

December 6

  • 19:00 jeluf: upgraded OTRS to 2.0.4
  • 05:47 ævar: cvs up'ed includes/SpecialVersion.php, there was a conflict, I removed the following code (the top part) since I presume it's not an issue anymore and the offending site has been blocked:
<<<<<<< SpecialVersion.php
                $ip =  str_replace( '--', ' - - ', htmlspecialchars( wfGetIP() ) );
                #return "<!-- visited from $ip -->\n";
                # hacked to a hidden span since one nasty was stripping comments
                return "<span style='display:none'>visited from $ip</span>\n";
=======
                $ip =  str_replace( '--', '-', htmlspecialchars( wfGetIP() ) );
                return "<!-- visited from $ip -->\n";
>>>>>>> 1.32
  • 04:23 ævar: De-installed Special:Cite on commons, meta, sources, species, foundation, nostalgia and mediawikiwiki. We really should have a $site variable that can be counted on (doesn't return wikipedia for non-wikipedia sites)
  • 01:48 Tim: installed Folding@Home on the yaseo apaches
  • 00:50-01:30 Tim: Started Folding@Home on knams squids

December 5

  • 22:13 Domas: Did bring back srv9 (not sure if it is a good idea). Removed bayle/will from service. All squids are null-storage now.
  • 20:35 hashar: apache-(restart|gracefull)-all(hard)? now use dologmsg instead of wikibugs
  • 20:26 Hashar: http://www.mediawiki.org/FAQ now redirects to meta: page (rewrite rule for virtual host mediawiki.org).
  • 18:00 Domas: noticed packetloss, talking to PM support
  • 17:30 Domas: did put srv6 squid into i/o-less operation, as srv10 had same hitrate ;-)

December 4

  • 20:30 jeluf: Several people report problems with Linker.php:504, thumbnail linking code. As a workaround, submitted and deployed Linker.php,rev-1.56
  • 13:27 hashar: changed kawiki & kawiktionary namespaces (bugs 2103 & 3905)
  • 10:10 brion: fixed tingxi's sudoers, fixed tingxi's /usr/local/apache/conf, synced its mediawiki, trying to start it. working? maybe
  • 10:00 brion: stopped apache on tingxi, has damaged copy of mediawiki
  • 04:05 brion: pascal root partition is full, needs cleanup
    • deleted ~350 megs of old kernel modules from /lib/module, leaving those for 2.6.12-1.1381_FC3
  • 03:02 Tim: srv55 has reported no more MCE errors, re-added to the apache pool
  • 01:45 Tim: fixed Turck on humboldt
  • 00:01 Domas, Tim: amane up and running, site back to normal

December 3

  • 21:20 mark: zwinger's /home mount on amane is broken, all fs calls block
  • 08:40 Tim: restarted squid on srv8, it had crashed
  • 07:47 Tim: Fixed ntpd on coronelli, harris, larousse and adler.
  • 07:14 Tim: Fixed ntpd on vincent and maurus. Stepped their clocks.
  • 06:41 brion: fixed sync-file so that message is optional again, like its help message claims
  • 06:10 Solar: Replaced what I believe to be the bad stick of RAM for srv55. It's up.
  • 03:01 brion: added info-pl mail alias to OTRS
  • 02:09 brion: hopefully fixed the lucene restart problem; new mono installation in /usr/local wasn't in the PATH in crontab. hacked the init script to add it back

December 2

  • 22:00 brion: watching search daemons more closely; i think they're not properly restarting on the hourly restart cronjob
    • run logs in /var/log/mwdaemon-run.log
    • also clocks are very bad on maurus and vincent, need to ntp them
  • 21:19 hashar: sync-file now accepts comments after the file name.
  • 21:23 brion: restarted lucene daemons; for some reason they had all died
  • 09:17 brion: got lucene daemons back up and hopefully running.
    • there was an extra restart script in my personal crontab on maurus which seemed to be messing things up there
    • added a 'ulimit -n 8192' on the init script
  • 05:45 brion: yum mirrors appear to be broken (missing repo files), trying to re-sync
  • 03:44 Tim: srv43 didn't come back into the apache pool after restart, fixed
  • 02:28 brion: restarted search daemons, stuck
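
The 09:17 entry above raises the open-file limit with 'ulimit -n 8192' in the init script. For a daemon that wants to do the same from inside the process, a minimal Python sketch might look like this. The function name raise_nofile_limit is hypothetical, and the sketch only ever raises the soft limit, never lowers it, and never exceeds the hard limit.

```python
import resource

def raise_nofile_limit(wanted=8192):
    """Raise the soft open-file limit towards `wanted`; return the resulting soft limit."""
    soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    # The soft limit may not exceed the hard limit unless the hard limit is unlimited.
    target = wanted if hard == resource.RLIM_INFINITY else min(wanted, hard)
    already_enough = soft == resource.RLIM_INFINITY or soft >= target
    if not already_enough:
        resource.setrlimit(resource.RLIMIT_NOFILE, (target, hard))
    return resource.getrlimit(resource.RLIMIT_NOFILE)[0]
```

Calling raise_nofile_limit() early in startup has a similar effect to the ulimit line, with the difference that it applies only to the calling process and its children.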

December 1

  • 23:53 brion: installed joe 3.3 on zwinger (in /usr/local/bin), handles utf-8 files properly
  • 23:01 hashar: knams cluster was unreachable for roughly 2 minutes, probably maintenance on kennisnet's side.
  • 22:39 hashar: created Portal namespaces on ptwiki #3385
  • 22:35 hashar: renamed namespaces on huwikibooks. 2 conflicts. #3783
  • 19:54 brion: moved old .conf files from /h/w/conf to /h/w/conf/httpd-old to reduce confusion
  • 19:54 hashar: gracefulled all pmtpa apaches to fix bug #4131
  • 19:49 hashar: fixed apache-sanity-check , calls to 'ip' missed '/sbin/'
  • 18:00 mark: Setup failover LVS on avicenna and alrazi. Still needs lvsmon, and isn't active yet. Uses CARP for failover.
  • 14:30 mark: Removed avicenna and alrazi from Apache duty, as I am going to use them as LVS load balancers.
  • 06:30 brion: removed /tmp/mediawiki/* caches on srv36; the backup run had saved a bunch by root and apache screamed about being unable to write them
  • 06:25 brion: restarted apache on yf1005; odd PHP error, possibly APC cache breakage.
    • Fatal error: main(): Failed opening required '' (include_path='/usr/local/apache/common/php-1.5:/usr/local/apache/common/php-1.5/includes: /usr/local/apache/common/php-1.5/languages:/usr/local/apache/common/php-1.5/templates: /usr/local/apache/common/php-1.5/extensions/wikihiero:/usr/local/lib/php:/usr/share/pear') in �Íÿ on line 14
  • 05:47 Solar: Ariel's RAID has "failed", but no real disk failures. I put the array back online and rebooted. We'll see how it does.
  • 05:00 Solar: Moved srv35-43 to second cage. Racked new sq1-sq10.
  • 04:30 brion: deleted 20051127 enwiki pages_full dumps, since srv36 was turned off before they finished

November 30

  • 21:49 brion: fixed upload dirs for wikimediafoundation.org
  • 05:43 Solar: Racked donated load balancer in core cage on csw1-pmtpa port 34
  • 02:25 brion: removed a privacy-violation in a revision comment via database edit (enwiki rev_id 29652015)

November 29

Note: ZX will be rebooting and upgrading most knams machines tonight, to help fix our problems. They will be taking machines down one by one, so this shouldn't give downtime - in theory. If it does, check whether the LVS ip is bound to the machines when they come up.

  • 23:30 mark: Apparently service ips often were not added because /etc/rc.d/rc.local wasn't run... because it did not have eXecute permissions on some machines. Fixed.
  • 21:56 hashar: rebuildMessages.php finished.
  • 21:45 brion: bugzilla:4115 setting up latex on latest srv batch, adding to setup-apache
  • 21:30 brion: found and fixed upload files for meta
  • 21:15 brion: investigating broken upload files on meta
  • 20:49 hashar: fixed bug 4048 and running 'rebuildMessages.php --update' on all wikis.
  • 16:35 mark: Squid wasn't running on srv6, started
  • 16:15 mark: Ran yum upgrade on all knams machines
  • 15:15 mark: Reversed the change as it didn't work anyway: Squid simply ignores failure on binding IPs.
  • 14:00 mark: Adapted the Squid configurator / squid.conf.php to explicitly bind to the Squid's main IP address and the LVS IP, if applicable. Meant to ensure that Squid will not start if the LVS IP is not bound to the machine, so lvsmon can detect that.
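
The 23:30 entry above traces the missing service IPs to /etc/rc.d/rc.local lacking execute permission on some machines. A check for that condition could be sketched in Python as below. The helper name missing_exec_bit is hypothetical, and this is only an illustration of the check, not the fix that was actually applied.

```python
import os
import stat

def missing_exec_bit(path):
    """Return True if `path` is a regular file with no execute bit set for anyone."""
    try:
        mode = os.stat(path).st_mode
    except FileNotFoundError:
        return False  # nothing to run, so nothing to flag
    if not stat.S_ISREG(mode):
        return False
    any_exec = stat.S_IXUSR | stat.S_IXGRP | stat.S_IXOTH
    return (mode & any_exec) == 0
```

Run against /etc/rc.d/rc.local on each host, a True result would point at the machines where the script is silently skipped at boot.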

November 28

  • 23:34 Hashar: uploaded a picture of clusters, please post comment on image talk page so I can modify / update it.
  • 21:50 Domas: restarted rogue failing (bytecode cache issues?) apaches: srv47, srv4, srv37, srv63, srv58, srv67, srv68, srv53, srv39
  • 20:40 Domas: ragweed booted up, started squid, then started something else (for a minute or two), then ran rc.local with LVS IP adding... site down for several minutes
  • 19:58 brion: ragweed is down (no ping), OTRS dead
  • 14:35 ævar: Site crashed because of insufficient sanity checks, my bad.
  • 14:30 Domas: srv62 tugela crashed, no core dump yet, if crashes persist will need some poking, either code, or srv62. mcelog empty.
tugela-fc3-x64[3634]: segfault at 00000000010e1000 rip 0000003a781716e0 rsp 0000007fbffff668 error 6
  • 00:44 ævar: Changed the project name and metanamespace for iswikibooks to Wikiorðabók
  • 00:00 Domas: oops, ran tugela on srv51-srv54,srv56-srv69 instead of memcached, will see how it performs/scales/...

November 27

  • 23:21 hashar: thanks to palica : updated Server inventory bot to add a link to ganglia.
  • 19:09 hashar: added two scripts to check database : 'mysql-list' & 'replication'
  • 18:49 hashar: BUG rose got 4 memcached instances but they are not listed in mc-pmtpa.php
  • 18:47 hashar: commented 10.0.2.43:10000 from mc-pmtpa.php
  • 13:13 ævar: Installed Special:Cite on all the wikipedias
  • 10:15 brion: blocked wikipedia-l, wikien-l, and helpdesk-l list archives in mail.wikipedia.org's robots.txt to discourage future complaints about embarrassing newbie posts becoming #1 google hits. Search patches for mailman archives should be integrated at some point...
  • 08:55 JeLuF: added http://www.spy-sweeper-webroot.de/wiki/?/ to squid's leecher blocklist
  • 07:38 Solar: smellie is ready for service. Turned off seLinux.
  • 07:30 Solar: srv5 is out with a bad case of bad blocks
  • 07:00 Solar: Crossed over to the new switch, csw4-pmtpa
  • 03:09:56 ævar: Installed Special:Cite on enwiki as an experiment.
  • 01:39 Tim: took srv55 out of service, likely dud RAM. MCE errors reported.
  • 01:10 Tim: squid on will had crashed. Restarted.
  • 01:05 Domas: fixed default route on tingxi

November 26

  • 17:30 jeluf: changed password of wikipl-l admin account. Gave new PW to Datrio. Documented PW at the usual place.
  • 14:36 Tim: put srv52-70 into apache service. I broke srv51 with a restart test.
  • 12:00 Tim: wrote /h/w/b/apache-sanity-check, set up scripts such as apache-start to run it and refuse to start apache if the necessary LVS-friendly conditions are not met.
  • ~11:00 Tim: broke site temporarily due to LVS-related misconfiguration
  • 10:53 Tim: rose, tingxi and srv2 had apache running but no LVS VIP. This would explain the random hanging behaviour with ab -X apaches:80. Fixed temporarily, will look into a permanent solution.
  • 08:45 Tim: LVS wasn't decommissioned properly on iris. LVS on pascal was forwarding packets to LVS on iris, and iris, with no lvsmon running, forwarded most of those packets to sage, which is down. Thus users were seeing connection timeouts. Fixed with ipvsadm -D -t rrvs.knams.wikimedia.org:80.
  • 07:41 Tim: srv5 still not up. Moved its virtual IPs, one to srv6, one to srv8 and one to srv10.
  • 07:15 Tim: did a fsck of srv5 then a system reboot
  • 03:32 srv5's root partition spontaneously declared "read-only filesystem". Logs stopped moving. Mount reported that it was still rw, but it couldn't be written to.
mount uses the contents of /etc/mtab to display mounts. These are not updated when the file system is r/o. Use /proc/mounts instead.
  • 05:50 Tim: introduced time and memory limit for rsvg and convert
  • 01:45 Tim: started image backup using updated scripts in /h/w/b
  • 00:14 ævar: changed the logo for iswiktionary.
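
The note under the 03:32 entry above points out that mount reads /etc/mtab, which is not updated when the root filesystem goes read-only, and recommends /proc/mounts instead. A minimal Python sketch of that check follows. The function names are hypothetical; the field layout (device, mount point, type, options, dump, pass) is the standard /proc/mounts format.

```python
def mount_options(proc_mounts_text, mount_point):
    """Return the option list for `mount_point`, or None if it is not mounted."""
    found = None
    for line in proc_mounts_text.splitlines():
        fields = line.split()
        if len(fields) >= 4 and fields[1] == mount_point:
            # Keep the last match: later mounts shadow earlier ones.
            found = fields[3].split(",")
    return found

def is_read_only(proc_mounts_text, mount_point="/"):
    """Return True if the kernel reports `mount_point` as mounted read-only."""
    options = mount_options(proc_mounts_text, mount_point)
    return options is not None and "ro" in options
```

On a live Linux host the text would come from reading /proc/mounts, for example is_read_only(open("/proc/mounts").read()).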

November 25

  • 21:45 Hashar: killed some rsvg processes on various apaches. Seems they tried to render a 120px thumb of /commons/7/70/Interstate_Highways.svg (possible DOS ? :( ).
  • 04:40 Tim: experimentally enabled keepalive on apache.
  • 03:35 Tim: testing lvsmon failover by stopping squid on clematis
  • 03:05 Jamesday: Adler had 11GB disk free. gzipped first 80 binlogs to raise it to 48GB or so. gzipped versions still need to be moved to wherever we're keeping them these days.

November 24

  • 06:30 kate: setting up l3 failover.. see that page for details
  • 02:55 brion: took cornelli out of search rotation while kyle moves it around

November 23

  • 21:18 mark: Routing problems from 38.0.0.0/8 (cogent ip space) to florida. Altered the countries.nerd.dk file to reroute that prefix via knams.
  • 20:44 mark: Reinstated the normal epoll RPM on mint, as epoll wasn't the problem
  • 16:44 brion: fixed arrangement of upload directories for several sites (non-wikipedia :P)
  • 00:35 kate: "ntp source vlan1" fixed NTP problem on csw1, but need to work out why traffic to 64.156.25.242 is being dropped
  • 00:04 kate: upgraded csw4-pmtpa to 12.2(25)SED, enabled ssh and configured vlan 2 properly

November 22

  • 22:33 brion: amane still seems to work. YAY \o/
  • 21:49 brion: restarted apache on zwinger, wasn't loading
  • 21:45 brion: increased php fastcgi workers on amane to absurd levels for thumbs to run
  • 21:30 brion: mostly working now! had to set server.max-workers to 8 in lighty to get it running smoothly
  • 19:28 brion: mounted /mnt/upload3 (amane) on zwinger, was missing mountpoint
  • 19:22 brion: mounted /mnt/upload3 (amane) on srv2, was missing mountpoint
  • 19:11 brion: restarted albert's http temporarily to cover the work period
  • 19:02 brion: khaldun copy finally finished, rearranging bits on amane
  • 15:21 brion: turned albert's http back off (hope you're done) so khaldun can finish its copy without the extra load
  • 07:44 brion: started albert's http so kate can set things up requiring the local fedora yum mirror
  • 05:08 kate: configured asw2-pmtpa. has the new srvs and the equ device on it (equ is 10.0.1.3)
  • 00:55 brion: started copying commons files from bacon -> amane. disabled albert's apache
  • 00:45 brion: started copying enwiki files from khaldun -> amane, non-wikipedia non-wiktionary files from albert -> amane
  • 00:35 brion: started copying files bacon -> amane
  • 00:20 brion: disabled uploads sitewide

November 21

  • 23:10 brion: setting up to move uploads to amane, will disable all uploads and upload.wikimedia.org for a while to make this damn thing happen
  • 21:15 brion: started lucene index rebuild on maurus
  • 21:05 brion: restarted squid on will, was not responding (stuck) on port 80
  • 20:49 brion: restarted apache on ragweed; https was down so otrs inaccessible
  • 20:30 mark: Brought sage and mayflower back up.
  • 20:00 mayflower went down.
  • 20:00 mark: Moved LVS back to pascal to allow iris to be a squid again.
  • 19:45 mark: Modified lvsmon on iris because it was always sending curl requests with Pragma: no-cache! And therefore testing the whole chain to florida.
  • 19:45: sage went down.
  • 18:00 mark: Installed non-epoll RPM on mint to compare.
  • 17:56:40-17:57:31 ævar: Invalid argument notices were being generated in this time period due to me syncing three files and them depending on each other, ok now.
  • 17:30 mark: udpmcast wasn't running on pascal. No idea since when... started.
  • 17:30 jeluf: Restarted ragweed. Came back after powercycling and fsck.
  • 16:30 ragweed broken.
  • 12:14 erik: Updated logo of nap.wikipedia.org and sync'd InitialiseSettings.php

November 20

  • 23:30 mark: Upcoming maintenance of knams tomorrow (ZX will do some firmware upgrades, rebooting at least pascal and vandale). Moved LVS to iris because of that.
  • 20:00 JeLuF: All wikipedia.org upload directories moved off of albert and to amane.
  • 18:03 Hashar: fixed #4022 'Asia/Seoul' timezone for kowiki.
  • 17:50 Hashar: switched some logos to /b/bc/Wiki.png
  • 14:24 JeLuF: chown -R apache:apache amane:/export/upload/wikipedia.org/
  • 14:19 Hashar: in amane:/export/upload/wikipedia.org/ some directories can't be written by apache (af de es & fr). dewiki upload page reports an error.
  • 09:26 Tim: Fixed NTP broadcast, documented
  • 03:21 Tim: Fixed perl upgrade on srv51-70 as per [1]

November 19

  • 17:05 Tim: same on fuchsia
  • 16:50 Tim: restarted squid on clematis, disabled swap.
  • 16:05 Tim: upgraded otrs on ragweed to version 2.0.3, after Anthere complained about this bug: [2]. Minor upgrades within the 2.0.x series weren't documented (just an unanswered question on the ML), so I just untarred over the top of the old directory, with a backup in /opt/otrs-2.0.1. Treat any problem symptomatically, some chmodding might be required.
  • 15:40 Tim: restarted squid on bayle

November 18

  • 23:30 brion: installing ploticus 2.32 on mediawiki-installation, set to use gd & truetype fonts (bugzilla:3965)
    • truetype fonts in common/fonts
  • 07:00 jeluf: migration of dewiki's image and thumbnail directories done. archive and shared will be moved when albert has more headroom. Some 30 small to medium wikis moved. Currently running frwiki thumbnail migration.
  • 00:27 brion: blocked another leech [3]

November 17

  • 14:30 mark: ragweed was missing the LVS ip, fixed. Also readded iris as squid.
  • 06:30 Tim: Added root key to srv51-70. The following machines didn't want to cooperate: 56, 64, 66, 67, 69
  • 06:05 Tim: added srv51-70 to DNS, created a node group. Configured albert's BIND as a slave for the 10/8 reverse DNS zone.
  • 05:46 Solar: srv2 is back up.
  • 05:38 Solar: srv56 is up too.
  • 05:26 Solar: srv51-srv70 are ready for Rock & Roll! (Except srv56 has some hardware issue)
  • 04:34 Solar: holbach is rebuilt and ready
  • 03:47 Tim: added tingxi and rose to the apaches node group. Left harris out, it sucks.
  • 03:30 Tim: after moving some more hosts to the misc2 cluster, restarted gmond on the apache cluster to remove hosts which have been moved out
  • 02:24 Tim: fixed amane's date, started ntpd
  • 01:49 Tim: Created "Misc VLAN2" cluster on ganglia, for miscellaneous hosts which, due to being in the wrong VLAN, couldn't be in Miscellaneous.

November 16

  • 8:25 brion: srv50 error_log flooded disk; removed and restarted apache
  • 6:30 jeluf: moved es upload area to amane:/export/upload
  • 5:30 jeluf: moved eo, ang, an upload areas to amane:/export/upload. Backups are still on albert in .../remove.
  • 04:14 Tim: attempted to restart squid on will. It didn't work. I hacked /etc/init.d/squid to send errors to a file instead of /dev/null, and found it was giving error messages like "parseConfigFile: line 17 unrecognized: 'htcp_port 4827'". I started the squid copy in /usr/local/ instead.
  • 01:20 brion: reenabled special:renameuser with the 'archive' bit disabled. it's possible that some undeleted pages will have incorrect rev_user_text data

November 15

  • 23:00 jeluf: moved aa, ab, af, ak, als, am, ar, ast, zh image uploads to amane:/export/upload
  • 20:32 hashar: updated http://wikimedia.org/stats/live/ with a message redirecting to the "new" system ( http://noc.wikimedia.org/stats.php ).
  • 16:13 Tim: running batch imagemagick convert job on bacon, converting 1911 EB scans to PNG.
  • ~12:30 Tim: Deployed diff cache and parser cache push features. Reduced cache expiry for RC feeds on en from 60 to 20 seconds. The performance impact of this should be monitored -- the diff cache should reduce it but it might not be enough.
  • 03:46 Tim: Re-enabled tidy, trimmed error logs. The huge error logs did indeed have a few tidy errors towards the end, once every few minutes, interspersed with lots of "file not found" errors. Preceding this lack of activity was gigabytes of either:
    [Mon Nov 7 04:33:33 2005] [error] PHP Parse error: parse error, unexpected $ in /usr/local/apache/common-local/php-1.5/checkers.php on line 101
    OR
    *** attempt to put segment in horiz list twice
    Neither of which have anything to do with tidy. The other noticeable thing at the very end of the error logs was that apache was segfaulting regularly, but it was doing that just as much after tidy was disabled.
  • 01:22 ævar: resolved bug 3968
  • 00:50 brion: cleaned giant error_log files from srv44 and srv47, which had run out of space during sync
  • 00:41 brion: adding some signature-nazi features, so new sigs with unbalanced html tags will not be inserted

November 14

  • 22:30 mark: Many apaches have error_log's of 100G in size and more! Partly due to tidy, but how is logrotation supposed to be setup? See bug #3966
  • 22:00 - 22:12 hashar: $wgUseTidy = false; it's filling error logs on all apaches and seems to stall. Restarted all apaches too. Wikipedians need to FIX their HTML.
  • 14:00 mark: Rebooted srv10, and started Squid on it with no cachedirs (1 null cachedir). Assigned IP .214 to it.
  • 08:28 Tim: restarted squid on srv6. Slow hit service times (~100ms), it wasn't swapping but it had very little spare memory for kernel cache and buffers.
  • 03:05 Tim: bayle was swapping heavily, very slow service times for both hits and misses. Restarted squid, added it to the ganglia squid cluster.

November 13

  • 22:50 jeluf: mounted amane:/export/math to all mediawiki-installation servers for storage of math images.
  • 20:00 midom: srv10 squid hung, reiserfs issues?
  • 16:57 brion: running data dumps on benet/srv35/srv36

November 12

  • 19:49 ævar: tingxi had languages/LanguageCs.php (and probably something else) out of date, IIRC it has been down for some time, ran scap to bring it and others up to date.

November 11

  • 00:16 brion: changed sitename on eswikinews (meta-namespace was already set)

November 10

  • 14:28 ævar: changed the logo on trwiki
  • 09:06 ævar: Changed the upload url of the wikis that had uploading disabled to point to the commons
  • 09:09 brion: gave up trying to upgrade bugzilla due to bugzilla upgrade failure
  • 08:40 brion: running yum update on pascal; got some glibc double-free bug during bugzilla update, and thought it was time to upgrade some damn packages
  • 08:25 brion: shutting down bugzilla for upgrade to 2.20
  • 07:18 brion: removed check_policy_service from /etc/postfix/main.cf on kate's advice, to see if it's more stable with that off
  • 07:02 brion: restarting postfix on zwinger, mail stopped again
  • 05:59 ævar: Removed harris from /usr/local/dsh/node_groups/mediawiki-installation, responded to ping, had port 22 open, but hung forever on ssh harris
  • 04:27 Tim: set up ftp server on bacon, to accept uploads of scanned page images

November 9

  • 14:18 Tim: fuchsia was swapping, regularly timing out on lvsmon health checks. Restarted squid.
  • 11:09 brion: modified parser cache behavior to do cache with redirect targets. should increase hit rate; if troubles experienced, revert Article.php back to rev 1.396
  • 10:13 brion: reenabled search text extracts for active sessions only
  • 07:32 brion: updating live search indexes
  • 00:54 brion: no mail in last eight hours... restarting postfix

November 8

  • 23:30 jeluf: After intensive fsck, ragweed is back.
  • 19:00 ragweed pings, but doesn't allow SSH login
  • 13:10 holbach crashed
  • 12:05 Tim: deployed local message cache, causing a 60% drop in network traffic on the apache cluster according to ganglia. We had noticed probable network saturation on the 100 Mbps switch asw1, this was the obvious solution. A content hash is stored in memcached and checked on each request. The local cache is stored in files, one file per wiki in /tmp/mediawiki/
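
The local message cache in the 12:05 entry above can be sketched roughly as follows. This is a minimal Python illustration under stated assumptions, not the MediaWiki code: a plain dict stands in for memcached, and the names load_messages and fetch_full, the key format, and the JSON file layout are all hypothetical. The point it shows is that only a short content hash is checked on each request, while the full message table is read from a per-wiki local file unless the hash has changed.

```python
import hashlib
import json
import os

def content_hash(messages):
    """Return a stable hash of the message table."""
    blob = json.dumps(messages, sort_keys=True).encode("utf-8")
    return hashlib.md5(blob).hexdigest()

def load_messages(wiki, shared, cache_dir, fetch_full):
    """Load messages for `wiki`, preferring the local file cache.

    shared     -- dict standing in for memcached (assumption for this sketch)
    cache_dir  -- directory holding one cache file per wiki
    fetch_full -- callable returning the full message table (the expensive path)
    Returns (messages, source), where source is "local" or "fetched".
    """
    key = wiki + ":messages:hash"
    path = os.path.join(cache_dir, wiki + ".json")
    expected = shared.get(key)
    if expected is not None and os.path.exists(path):
        with open(path, encoding="utf-8") as f:
            cached = json.load(f)
        if cached["hash"] == expected:
            return cached["messages"], "local"
    # Local copy missing or stale: fetch the full table, then refresh
    # both the shared hash and the local file.
    messages = fetch_full()
    digest = content_hash(messages)
    shared[key] = digest
    with open(path, "w", encoding="utf-8") as f:
        json.dump({"hash": digest, "messages": messages}, f)
    return messages, "fetched"
```

With cache_dir set to something like /tmp/mediawiki/, repeated requests on one host touch the network only for the hash lookup, which is consistent with the traffic drop the entry reports.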

November 7

  • 20:51 kate: stopped replication on lomaria. please don't start it without asking me unless it's extremely important.
  • 20:45 brion: trying to get tidy going again
  • 20:30 brion: rebuilding search indexes on maurus.
  • 20:00 brion: set search daemons to restart hourly. *sigh*
  • 14:05 Tim: brought holbach back into service. Tweaked some load ratios.
  • 13:55 Tim: started slave on lomaria. It was idle, the site was slow.
  • 05:45 brion: switched lucene search to default to AND matches
  • 02:50 brion: set up init script for MWDaemon (/etc/init.d/mwdaemon), added a daily cronjob to restart them

November 6

  • 21:04 brion: several servers had disks filled from apache error_log; libart in rsvg apparently spewing out gigs of "*** attempt to put segment in horiz list twice"
  • 20:10 brion: site unusually loaded; giving a kick to the apaches for luck
  • 11:08 jeluf: srv22 was overheated. killed svg renderer (240 cpu minutes)
  • 11:00 jeluf: added Category:Broken_servers for better keeping track of todos
  • 10:40 jeluf: added portal namespace for nowiki upon Jhs' request
  • 08:20 kate: copying from lomaria again... whee!
  • 05:20 brion: added id.wikisource.org by request
  • 04:59 Tim: started lvsmon-ksquid on pascal
  • 04:39 kate: iris crashed... moved lvs to pascal.
  • 02:40 Tim: Made MW check $cluster.dblist instead of all.dblist. This will generate appropriate error conditions for improper access to foreign databases via commandLine.inc, Special:Makesysop or squid misconfiguration.
  • 01:40 Tim: installed memcached on srv41-50, moved instances from various other machines to there, including offloading browne completely. Restarted memcached on srv22, it had a dead instance.

November 5

  • 22:02 kate: restarted replication on lomaria. set up replication on zedler.
  • 11:00 brion: chgrp'd common files on humboldt
  • 09:15 solar: installed new image filer, amane, into the rack.
  • 04:55 kate: stopped replication lomaria again to re-dump. don't start it please. (server is still running)
  • 03:41 Tim: tried to restart dumpHTML on srv31, the machine crashed almost immediately
  • 03:39 brion: starting dumps on yaseo on amaryllis/henbane
  • 03:32 kate: copy finished, restarted replication on lomaria
  • 03:00 brion: refresh-dblist now also creates pmtpa.dblist and yaseo.dblist, based on assignment overrides from clusters.dblist
  • 00:45 brion: started pmtpa dumps on benet, srv35, srv36

November 4

  • 21:45 jeluf: moved lighty on benet to /usr/local/lighttpd. Added startup to /etc/rc.local
  • 21:00 jeluf: mounted benet:/var/backup to zwinger:/mnt/backup_benet
  • 06:25 brion: restarted search servers; memory usage up to 650-1000mb range, and very slow response on vincent

November 3

  • 11:03 kate: copying lomaria's db to zedler, don't start it
  • 21:45 erik: fixed he.wikinews site name and meta namespace (hopefully), sync'd InitialiseSettings.php and ran update.php accordingly
  • 20:44 brion: investigating connection errors (hacked wfLogDBerror to include hostname); seems to be on the new opteron boxen only
  • 20:30 hashar: started apache on srv35.
  • 20:22 hashar: started apache on avicenna.
  • 20:10 mark: Will was running with only 1024 FDs. As it's the only non-RPM squid around (will is FC1) and I added bayle, I have taken it out, reassigned IPs to srv5 and srv7.
  • 19:55 hashar: some apaches need a reboot. load is incorrectly high on them because of state=D processes (see bug #3869)
  • 15:10 mark: Moved bayle (previously broken, inactive memcached) to the external vlan, made it a temporary squid. I cannot get it to mount izwinger:/home though. Any ideas?
  • 5:30 Tim: copied ~tstarling/.ssh/known_hosts to /etc/ssh/ssh_known_hosts on all pmtpa machines
  • ~5:00 Tim & kate: syslogd stopped working on zwinger, causing DNS to stop working. Kate restarted syslogd.
  • ~5:00 created hewikinews using addwiki.php, sync-common-all
  • 04:07 kate: made amaryllis ns3.wikimedia.org. needs magic stuff so it can be added as auth ns
  • 01:58 Tim: restarted search daemon on vincent, the usual problem

November 2

  • mark: Apparently the restart squid cron job in the squid RPM is broken in a weird way: at some point in time /sbin/pidof /usr/sbin/squid will stop working. I will fix it and roll out a new RPM tomorrow. Sorry for the trouble!
  • 23:20 JeLuF: Found 2 squids on srv8. Killed both, started a new one.
  • 22:20 Tim: adapted lvsmon for knams squid service, started it on iris. See /usr/local/bin/lvsmon-ksquid . There's also a copy in ~tstarling/lvs on zwinger in case iris goes down.
  • 21:30 mark: Installed the new squid RPM on clematis. Not using epoll didn't change memory leaking behaviour.
  • 19:17 kate: LDAP on pascal was broken after reboot.
Nov  2 19:11:26 pascal slapd[29793]: bdb_db_init: Initializing BDB database
Nov  2 19:11:26 pascal slapd[29794]: bdb(dc=knams,dc=wikimedia,dc=org): Lock table is out of available locks
Nov  2 19:11:26 pascal slapd[29794]: bdb_db_open: db_open(/var/lib/ldap) failed: Cannot allocate memory (12)
Nov  2 19:11:26 pascal slapd[29794]: backend_startup: bi_db_open(0) failed! (12)
Did a db_recover and restarted slapd.
  • 04:38 kate, kyle: csw4 is installed. nothing on it yet.
  • 01:08 kate: pascal broke again, moved LVS to iris
  • 00:10 kate: colo allocated us 84.40.25.224/27, wikicities will move into this network

November 1

  • 23:39 brion: created car-fr-l list for french arbcom
  • 22:25 brion: heavy packet loss between pmtpa and lopar; kate is moving dns off lopar for now
  • 21:10 UTC erik: created ru.wikinews.org using addwiki.php
  • 18:26 mark: Dropped 207.142.131.225 as gateway IP, as it doesn't seem to be in use anymore
  • 18:15 mark: Made csw1-pmtpa act as a DHCP relay agent for rabanus, 10.0.0.15
  • 04:20 kate: replaced mormo.org with pascal & amaryllis as backup MX, using postgrey + other anti-spam stuff
  • 05:48 Solar: anthony, suda, isidore and bayle are back up.
  • 05:10 Tim: Cleaned up the squid list in CommonSettings.php. The need to have variables for the IP addresses of each squid passed long ago, it was just clutter, doubling the length of the section. Added the external IP address of will, which was missing, causing edits to be wrongly attributed in the yaseo wikis.
