Server admin log/Archive 5
19 May
- 02:45 kate, brion: fixed ldap/firewall for external servers
18 May
- 18:11 Tim: Categorised the servers by interface and vlan at Interfaces. Fixed routing tables on a few hosts that were non-standard for their category.
- 16:35 Tim: removed isidore and vincent from dsh ALL node group, non-standard configuration. Also removed bart and moreri, permanently down
- 14:54 jeluf: flushed firewall on bayle. Back in apache service.
- 14:38 brion: readded vincent in search group
17 May
- 19:45 JeLuF: restored lost history on dewiki
- 10:29 Tim: Updated DNS to get closer to this, and hence reality.
- 09:45 Tim: Fixing sources in gmetad.conf fixed it
- 09:28 Tim: Moved ganglia configuration to /h/w/conf/gmond, symlinked config on new apaches, changed cluster name, restarted ganglia. It doesn't seem to have fixed the recording problem.
16 May
- 16:50 Tim: Moving ariel binlogs 130-139 to avicenna. Don't ask me where 106-129 went.
- as I already said: khaldun.
- 05:50 brion: added a bunch of nazi spam subjects to wikipedia-l spam filters, hoping to reduce admin load
15 May
- .... dammit is moving memcacheds around to work around browne problem ...
- 12:05 brion: installed libtidy-devel and patched tidy PECL extension on srv11-srv30
- 11:59 brion: browne is having some funky problems; can't talk to the srv machines, which is Bad for memcached work
- 10:10 brion: installed updated LanguageEl.php; had to fix permissions on file
14 May
- 23:40 brion: disabled catfeed extension for security review
- 23:25 brion: lucene search now up for all en, de, eo, and ru sites. In theory.
- 10:30 brion: running enwiki index update again
- 08:00 brion: vincent back online; eth0 had not initialized properly
13 May
- 21:45 brion: ran checksetup.pl on bugzilla to apply stealth database updates which broke login
- 19:05 brion: upgraded bugzilla to 2.18.1
- 18:37 brion: wikibugs back on irc
- 18:13 brion: hacked Image.php to ignore metadata with negative fileExists, and updated wgCacheEpoch to force rerendering. broken images should be mostly fixed now
- 17:57 brion: grants wiki fixed (wrong directory was synced in docroot)
- 16:20 jeronim: bugzilla bot not running, problems with images ("Missing image" on wiki pages) not fixed
- 12:10-14:00 and beyond: jer/kate/midom/tim: power loss @ florida colo. most servers lost power; albert, ariel, bacon, suda, khaldun, webster, holbach, srv2, srv3, srv4, srv6, srv7 did not
- 06:00 brion: running lucene builds for all remaining en, eo, ru, de wikis
12 May
- 22:23 brion: hacked language name for 'no' to 'Norsk (bokmål)' per request.
- 21:00 JeLuF: Test installation of mysql cluster on srv29 and srv30. Management server running on srv0. Installation done according to howto.
- 8:54 Tim: offloaded ariel to correct for load caused by compressOld.php and the pending deletion script
- 08:10 Tim: deleting articles on en marked "pending deletion", see Wikipedia:User:Pending deletion script
11 May
- 19:46 Tim: started compressOld.php, running on a screen on zwinger
- 00:05 brion: corrected year on a fr.wikinews revision from 2025 to 2005. Assumed a very badly set clock yesterday morning -- does anybody know about this? I can find no trace of it now, though there were several complaints about affected articles, other examples of which now show correct years. Did someone correct them? Who, and when?
- midom: System clock wasn't synced to hardware clock before new server crash, so servers came up with bad timers. Fixed bad entries in ~15 wikis (wikipedias only), so frwn remained uncorrected.
10 May
- 23:59 brion: synched hardware clocks on all apaches to current system time. (some were hours off, a few were in 2003)
- 21:33 brion: resynched clock on srv14 to zwinger with ntpdate; was about a minute off.
- 14:00 midom: restarted all memcacheds
- 13:15 midom: chain reaction of slow image server maxed out fds on memcached, which caused even more image server load. temporary workaround: remove some old apaches from service, so that memcached would function a bit better.
- 12:20 midom: ldap server reached maxfiles. fixed in /etc/sysconfig/openldap && restarted
- 12:00 midom: recovered broken new apaches
- 07:55 brion: disabled curl extension loading in case it makes a difference if/when mysteriously killed machines are raised from dead
- 07:31 midom: srv11-srv30 all died
- 07:30 brion: installed curl PHP module on apaches
9 May
- 22:00 chaper, jeluf, midom: srv11-srv30 joined apache service.
- 01:30 brion: removed three invalid image records from commons (from 1.3 era before some name validation fixes)
- 01:00 brion: Somebody (gwicke?) checked out an entire phase3 source tree inside the 1.4 live installation directory. That's a very bad place for it -- it would get replicated to all servers if a full sync is run. I moved it to /tmp.
8 May
- 22:09 Tim: discontinued freenode enwiki RC->IRC feed
- 21:45 JeLuF: removed khaldun from dsh group mysqlslaves
- 21:25 JeLuF: fixed replication on holbach, otrs.ticket_history was broken. Holbach back in service.
- 15:00 JeLuF: fixed replication on bacon, otrs.ticket_history was broken. Bacon back in service.
- 11:08 Tim: added CNAME for irc.wikimedia.org, still working its way through the caches. Opened up port 6667 on browne. Switched on RC->UDP for all wikis, the whole thing is now fully operational.
- 11:00 brion: cleared image metadata cache entries for commonswiki due to latin1-wiki bug inserting bogus entries
- 8:40 Tim: installed patched ircd-hybrid on browne
7 May
- 11:00 brion: replaced wikimania's favicon with the WMF thang. running some lucene updates in background on vincent
- 02:18 brion: started squid on isidore; had been down for some time. cause unknown.
- (some time earlier) tim: made unspecified changes to squid configuration for another external squid
6 May
- 03:00 brion: updated Latin language file; this changed namespaces on those wikis.
- 01:05 brion: suda caught up. back in rotation.
- 00:44 brion: restarted replication on suda (bugzilla's votes table had some kind of index error)
- 00:37 brion: took suda out of rotation; replication is broken
5 May
- 21:39 brion: starting rc bots on browne. Configuration has changed, they are not using a proxy and must be run from a machine with an external route.
- 16:38 Tim: dumping, dropping and reimporting bgwiktionary.brokenlinks seems to have worked, gradually reapplying load
- 15:55 Tim: Trying standard recovery procedures
- 15:08: Suda crashed due to corrupt InnoDB page
4 May
- 22:15 brion: hacked in os interwiki defs for wikipedias (not other wikis, not sure if they're even set up)
- 18:52 Tim: installed RC->UDP->IRC system. The UDP->IRC component is udprec piped into mxircecho.py running in a screen on zwinger. This removes the high system requirements previously needed for RC bots.
- 10:30 Tim: Bots K-lined. Removed enwiki and dewiki to avoid further offence, and left them in a reconnect loop. If someone wants to approach Geert yet again, be my guest.
- 10:20 Tim: moved RC bots to browne, which is mostly idle, has plenty of RAM, and has an external IP address, allowing it to connect to freenode without going through the apparently undocumented and non-working port forwarder on zwinger.
- 6:45 jeluf: started squid on will, was down.
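Note on the 18:52 entry: the UDP->IRC half is described as udprec piped into mxircecho.py. Neither tool's source is shown here, so the following is only a rough Python sketch of what a udprec-style receiver could look like: it reads datagrams and writes each one as a line, which a downstream process could then echo to IRC. All function names and the bind address are hypothetical.

```python
import socket
import sys


def datagram_to_line(data: bytes) -> str:
    """Decode one UDP payload into a single line of text."""
    return data.decode("utf-8", errors="replace").rstrip("\r\n")


def open_receiver(host: str = "127.0.0.1", port: int = 0) -> socket.socket:
    """Bind a UDP socket; port 0 lets the OS pick a free port."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.bind((host, port))
    return sock


def relay(sock: socket.socket, out=sys.stdout, limit=None) -> None:
    """Write each received datagram to `out` as one line.

    `limit` stops after that many datagrams; None means run forever.
    """
    count = 0
    while limit is None or count < limit:
        data, _addr = sock.recvfrom(65535)
        out.write(datagram_to_line(data) + "\n")
        out.flush()
        count += 1
```

Run with no limit and with stdout piped into another process, this gives the same shape as the pipeline in the log entry: the wikis only send cheap UDP packets, and a single long-lived process holds the IRC connection.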
3 May
- 22:25 kate: changed liwiki tz to Europe/Berlin
- 4:40 jeluf: added webster to DB pool again.
2 May
- 14:15 midom: after second consecutive webster crash, took it out from rotation, trying forced innodb recovery, planning resync:
050502 14:11:15InnoDB: Assertion failure in thread 1207892320 in file btr0cur.c line 3558
InnoDB: Failing assertion: copied_len == local_len + extern_len
InnoDB: We intentionally generate a memory trap.
InnoDB: Submit a detailed bug report to http://bugs.mysql.com.
InnoDB: If you get repeated assertion failures or crashes, even
InnoDB: immediately after the mysqld startup, there may be
InnoDB: corruption in the InnoDB tablespace. Please refer to
InnoDB: http://dev.mysql.com/doc/mysql/en/Forcing_recovery.html
InnoDB: about forcing recovery.
- 14:00 midom: webster's mysql crashed with some assertions, did come up later and continued to serve requests after some load management
- 11:00 brion: started squid on srv7, which had been down for unexplained reasons and its IP addresses had not been reassigned
- 07:20 brion: rebuilt foundation-l list archives after removing some personal info by request
1 May
- 00:05 brion: changed $wgUploadDirectory settings so they won't break in maintenance scripts. hopefully didn't get them wrong.
30 April
- 23:30 brion: cleared image cache for all wikis. bogus entries probably added during links refresh; maintenance scripts have wrong $wgUploadDirectory
- 23:00 brion: cleared image cache entries in memcached for commonswiki due to spurious entries marked as not existing.
- 04:25 Tim: Setting up for perlbal throughput test on tingxi
29 April
- 22:18 Tim: resumed refreshLinks.php
- 15:57 Tim: stopped refreshLinks.php at the end of enwiki, before the delete queries
- 15:28 Tim: Restarted avicenna, which caused the site to crash due to a large number of threads waiting for Lucene
<TimStarling> what is avicenna's role?
<dammit> was: search server
<dammit> dunno now
<TimStarling> avicenna is reporting 20% user CPU usage
<dammit> every host that runs lucene
<TimStarling> but nothing is showing up in top
<dammit> has broken top output
<dammit> and broken ps output
<TimStarling> nothing important shows up in netstat, I'll just reboot it
<TimStarling> ok?
<dammit> 'k
*site explodes*
- 09:10 brion: took vincent out of lucene search rotation while it's building; changed default_socket_timeout in php.ini to 3 seconds from 60
- 04:00 brion: started incremental index update for lucene search indexes
- 03:38 Tim: resumed refreshLinks.php after having stopped it for a while during peak period
28 April
- 05:12 Tim: Shutting down apache on srv1 to dedicate it to refreshLinks.php
- 02:10 brion: set up logrotate on isidore to rotate squid log, in hourly cron
- 01:40 brion: manually rotated squid log on isidore due to reaching 2gb, restarted squid.
27 April
- 07:30 brion: installed patched Tidy extension on apaches to fix binary-safe string bug.
26 April
- 06:00ish brion: copied updated lucene indexes to avicenna and maurus, put vincent back in search rotation
- 05:40-05:55: Severe external network problems
- 05:25 Tim: deleted obsolete binlogs, moved the remainder (77-87) from ariel to avicenna. 33 GB of disk space remaining on ariel.
- (yesterday) jeronim: installed python 2.4.1 from source on alrazi, using make altinstall instead of make install, so that the current python 2.3 installation is not interfered with -- the 2.4.1 binary is at /usr/local/bin/python2.4
25 April
- 23:30 jeronim: clocks were wrong on 5 machines; fixed 4 of them (installed ntpdate on vincent). isidore still needs to be done (dammit? :)
- 07:55 brion: started a second active search daemon on maurus (vincent is still rebuilding indexes)
- 05:00 jeluf: enabled LuceneSearch.
- 01:20 brion: had to restart srv7 squid again. moved logrotate from cron.daily to cron.hourly, where it should have been before but wasn't
24 April
- 21:30 jeluf: disabled LuceneSearch. All apache processes were in state LuceneSearch::newFromQuery
- 11:15 jeluf: set wgCountCategorizedImagesAsUsed for commons.
- 02:55 brion: manually rotate squid log on srv7 again when it reached 2gb and crashed. logrotate needs to be fixed...
- 02:15 brion: installed GCC 4.0 final on vincent, avicenna for GCJ. Taking vincent out of search rotation for index rebuild.
23 April
- 13:15 Tim: recaching special pages, with wget script running in a screen on zwinger, which requests recache pages from bayle, which sends the expensive queries to benet.
- 02:25 brion: manually rotated logs and restarted squid on srv7. had been down for 2.5 hours, but nobody noticed the alarm from servmon.
22 April
- 10:20 brion: as a temporary hack, bumped rc_namespace on metawiki from tinyint to int. somebody added a russian help namespace at 128/129 which is outside of the signed tinyint range, so pages were recorded with the wrong namespace.
- 01:30 brion: removed 'wrap' option from tidy.conf to work around weird corruption problem (may be bug in tidy; investigating)
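Note on the 10:20 entry: a signed 8-bit column holds -128 to 127, so namespace numbers 128 and 129 cannot be stored as-is. A small Python sketch of the range check follows. The clamping helper assumes the database clips out-of-range values to the nearest bound, which is how non-strict MySQL is generally described; the log does not say which wrong value was actually recorded.

```python
TINYINT_MIN, TINYINT_MAX = -128, 127  # signed 8-bit range


def fits_signed_tinyint(value: int) -> bool:
    """True if `value` can be stored in a signed 8-bit column unchanged."""
    return TINYINT_MIN <= value <= TINYINT_MAX


def clamp_to_tinyint(value: int) -> int:
    """Clip to the column range (assumed behaviour, see note above)."""
    return max(TINYINT_MIN, min(TINYINT_MAX, value))


for ns in (14, 127, 128, 129):
    print(ns, fits_signed_tinyint(ns), clamp_to_tinyint(ns))
```

Under that assumption both 128 and 129 would collapse onto the same stored value, which is consistent with pages showing up under the wrong namespace until the column was widened to int.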
21 April
- 18:00 midom: started backup run on benet
20 April
- 11:25 brion: tidy extension installed on apaches, now active. To go back to external, set $wgTidyInternal = false; or remove extension=tidy.so from php.ini and restart apaches
- 10:50 brion: added node groups fc3, freebsd, debian
- 10:06 brion: removed isidore and vincent from fc2-i386 node group, as they're running FreeBSD and Debian
- 10:00 brion: working on installing tidy extension for php...
- 03:00 brion: re-enabled search
19 April
- 16:50 Tim: Pope-related flash crowd, peaking at 2100/s. Apaches were hard hit by searches (about 50% of profile time) so I disabled them temporarily.
- 16:00 Tim: we were getting reports of gzuncompress errors in memcached-client.php, on every page view on en. I put in an error suppression operator and instead logged all such errors to /home/wikipedia/logs/memcached_errors, to determine which server was the problem. It turned out to be not a server but a key, enwiki:messages to be precise. Deleting it and letting it reload fixed the problem.
- 07:30 midom: sad notice, smellie down, memory or other hardware troubles, lots of segmentation faults and other signals before reboot, didn't come up after.
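Note on the 16:00 entry: the approach was to suppress the gzuncompress error, log every failure, and then see whether the failures clustered on one server or one key. A minimal sketch of that idea, in Python rather than the PHP of memcached-client.php; the function names and the in-memory log are hypothetical stand-ins for the real log file.

```python
import zlib
from collections import Counter


def safe_decompress(key, server, payload, error_log):
    """Return the decompressed payload, or None after logging the failure."""
    try:
        return zlib.decompress(payload)
    except zlib.error:
        # Record which server and key produced the bad payload.
        error_log.append((server, key))
        return None


def tally(error_log):
    """Count failures per server and per key to see where they cluster."""
    by_server = Counter(server for server, _ in error_log)
    by_key = Counter(key for _, key in error_log)
    return by_server, by_key
```

If the per-server counts are spread evenly while a single key dominates the per-key counts, the problem is a corrupt cached value rather than a bad memcached host, which matches what was found for enwiki:messages.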
17 April
- 09:00 midom: fixed broken webster replication, caused by table bugs at database bugs
- 06:45 brion: fixed symlinked php.ini on srv2, srv3
- 00:00 midom: reformatted suda data area from xfs to ext2, brought into MySQL service for enwiki only
14 April
- 03:20 brion: eowiki lucene search live! others building...
- 02:45 brion: started lucene index builds for eowiki, ruwiki, dewiki
- 02:15 brion: lucene search live for meta
- 01:45 brion: restarted meta search build, as it was pulling from wrong db. whoops!
13 April
- 23:51 brion: noticed some spam coming in on bugzilla. hacked rel="nofollow" into comment processing, removed the comment, and disabled the account used to post it.
- 22:40 brion: starting lucene index builds for metawiki and some other wikipedias
- 00:08 brion: removed Apache-Midnight-Job from avicenna crontab
12 April
- 23:50 brion: vincent and avicenna are sharing LuceneSearch burden.
- 20:00 brion: Chad fixed vincent, which is now running lucene. Isidore lucene stopped, it's going to be squid soon. Will take over an apache for additional search capacity.
- 13:30 brion: lucene search turned on for en with slightly old index file, daemon running on isidore
- 10:30 brion: gcj on isidore seems horked; index rebuild is much too slow (eta 18 hours) so stopped it. uploading an index from home, and building mono for further testing.
- 10:00 midom: holbach restored.
- 08:55 holbach seems to be deadish
- 08:50 brion: started lucene index build on isidore
- 05:50 brion: vincent doesn't seem to be coming up again, will need to be kicked.
- 05:20 brion: upgrading vincent to 2.6 kernel hoping to resolve threading/memory issues w/ MWDaemon
- 02:10 brion: rebooting srv6 due to zombie squid eating port 80
11 April
- 23:05 kate: experimenting with making an en.wp image dump using trickle (cvs: /tools/trickle/)
- 08:00 midom: broken replication (by Chinese scammer) on bacon, fixed by "use otrs; repair table article" - myisam tables are evil, aren't they?
10 April
- ~23:00: kate: upgraded squid to STABLE9+patches (see squid builds) + restarted all squids.
- mark: All squids are running with too few FDs (1024), and if no one replaces all daemons by the new one Kate just built, we may have a problem tomorrow during peak hours...
- 19:15 midom: srv7 is now in squid service
- 19:07 brion: MWDaemon's memory usage got high enough it started swapping. Hung connections ate up apaches and hung the site until it was restarted.
- 5:30 brion: lucene search server active for en.wikipedia.org, running on vincent.
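Note on mark's comment about squids running with only 1024 FDs: a process can check its own file descriptor limit before peak load hits. A short sketch using Python's standard resource module (Unix only); the helper names and the example threshold are hypothetical, and this says nothing about how squid itself is built or configured.

```python
import resource


def fd_limits():
    """Return (soft, hard) limits on open file descriptors for this process."""
    return resource.getrlimit(resource.RLIMIT_NOFILE)


def has_enough_fds(required: int, soft: int) -> bool:
    """True if the soft limit is unlimited or at least `required`."""
    return soft == resource.RLIM_INFINITY or soft >= required


soft, hard = fd_limits()
print("soft:", soft, "hard:", hard, "ok for 8192:", has_enough_fds(8192, soft))
```

A daemon that holds one descriptor per client connection plus one per backend connection will hit a 1024 soft limit quickly under load, which is the failure mode the comment warns about.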
9 April
- 15:45 midom: dropped thttpd (as it was using 32bit mmaps) on dumps in favor of lighttpd. It has superb performance, serves 3500 hits/s under ab and served 70MB/s from benet in small reqs... Extreme recommendations for using lighttpd for image uploads.
- 10:15 brion: running lucene search indexer on vincent (pulling enwiki from benet).
- 05:25 brion: added additional is rcbots to #is.wikipedia for tionary/books/quote
8 April
- 16:00 midom: redirected http://download.wikimedia.org/ to benet, misses tomeraider and uploads...
- 13:00 Tim: switched to Mark's squid binary on the French squids
7 April
- Mark, Tim: implemented Multicast HTCP purging on all FL apaches/squids. French Squids still need a binary replacement.
6 April
- 21:44 mark: Put port gi0/26 on csw1-pmtpa into trunking mode: vlans 1-2 only, with vlan 2 being the native vlan, no LACP negotiation
- 11:30 midom: benet put into dump operation
- 10:55 brion: reinstalled PHP on zwinger and apaches, compiled with memory limit and mbstring options enabled. This was left out when upgrading to 4.3.11.
- 2:40 brion: added NetCabo proxies to trusted proxy list (inconveniently shared by Jorge and a Nazi vandal on pt.wikipedia.org)
4 April
- 15:30 jeluf: disabled logging of upload.wikimedia.org
- 15:15 midom: yet another image server overload. rotated 30G upload.wikimedia logfile, could be fragmentation overhead.
- 12:00 midom: moved log_bin.0[0123]? (40G worth of binlogs) from ariel to khaldun/avicenna backup/arielbinlog, reclaimed some master disk space.
- Do we need those binlogs for anything?
- Yes, we need binlogs back to the last full backup -- TS
- 07:48 Tim: Started memcached on browne, it was in the list but not running. Fixed startup scripts. Noticed that browne can't contact albert on 10/8, modified yum.conf accordingly.
3 April
- 18:25 midom: extended public IP address range (now: 12 addresses)
- 17:50 midom: srv5 joined service as squid.
1 April
- 22:30 midom: Enabled recentchanges-based watchlist hack. Servers go faaaast.
- 23:15 brion: set default block expiry to 1h on dewiki by request of various admins
Archives
- Server admin log/Archive 1
- Server admin log/Archive 2 (2004 Oct - 2004 Nov)
- Server admin log/Archive 3 (2004 Dec - 2005 Mar)