Server admin log/Archive 5

From Wikitech

Revision as of 19:13, 7 June 2005


7 June

  • 19:00 jeluf: moved binlogs 198 and 199 from ariel to khaldun
  • 18:48 brion: reactivated search
  • 9:00-19:00 all: Moved to new Tampa data center
  • 10:00 brion: replaced lighttpd on fuchsia with apache because the errordocument stopped working for no reason
  • 08:00 or so, brion: added fuchsia to wikimedia.org dns, using an alias from dammit because of the crappy verio interface. Still not on wikipedia.org because we can't get into it.
  • 07:00 or somewhat: horrible things begin

6 June

  • 13:40 kate: started copying dumps to vandale
  • 11:30 kate: made a small db change for wikimania registration to implement a change in the form. left a backup of the old one at zwinger:/root/wikimania.prekate.sql
  • 10:05 kate: set up logrotate on knams
  • 01:43 Tim: moved binlogs 194-197
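Moving binlogs off the master recurs throughout this log. As a hedged sketch of what an entry like "moved binlogs 194-197" amounts to (the filename pattern and purge threshold are assumptions; the destination path is the one given in the 21 May entry):

```shell
#!/bin/sh
# Copy a range of master binlogs to the backup host, then let the master
# drop them. Filenames follow the log_bin.NNNN pattern seen elsewhere in
# this log; the exact format is an assumption.
for n in 0194 0195 0196 0197; do
    scp "/var/lib/mysql/log_bin.$n" \
        khaldun:/usr/backup/arielbinlog/ || exit 1
done
# Only once the copies are verified should the master purge them:
mysql -e "PURGE MASTER LOGS TO 'log_bin.0198';"
```

Binlogs back to the last full backup must be kept (see the note under 4 April), so the purge threshold matters.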

5 June

  • 22:40 kate: reinstalled mint with better partition layout, added it to squid pool
  • 21:00 gwicke: fixed mysql error messages in this wiki after config tweak to index words from 3 chars. You should now be able to search for things like 'DNS'.
  • 14:55 mark: bound bind to 145.97.39.130 (pascal's main ip) only, adapted firewallinit to allow incoming DNS zone transfers
  • 14:19 kate: added lily to squid pool
  • 13:25 mark: Added ip 145.97.39.158 to pascal, adapted /sbin/ifup-local.
  • 10:02 kate: iris -> squid pool
  • 09:03 kate: clematis -> squid pool
  • 08:46 kate: sv,dk,no.wp -> knams
  • 08:09 kate: de.wp -> knams
  • 07:56 kate: put mayflower into the knams squid pool. Fixed a typo in commonsettings that was breaking squid caching.

4 June

  • 18:27 kate: added hawthorn to squid pool
  • 18:10 kate: created rr.knams pool, put UK, NL, DE and LT on it.
  • 16:28 kate, jer, dammit: started squid on ragweed, put it in lopar pool for now
  • 15:30 jeluf: moved binlogs 190-193 to khaldun
  • 13:54 jeronim: built new squid for will as old one had file descriptor limit of 1024 instead of 8192 so it was running out of FDs. In /home/wikipedia/src/squid/squid-2.5.STABLE9.wp20050604.S9plus.no2GB[icpfix,nortt,htcpclr]

3 June

  • 23:30 brion: fixed salting on user_newpassword for accounts not touched since the change.
  • 20:40 mark: Wrote /sbin/ifup-local script on pascal, to handle post-ifup tasks. Currently adds 10.21.0.2/24 IP to eth1 for accessing the LOMs.
  • 20:00 mark: Set up permanent source routing on pascal for Kennisnet out of band access using /etc/sysconfig/network-scripts/route-eth1 and rc.local
  • 19:05 mark: Rebooted csw2-knams with newer crypto image, setup SSH, changed DNS resolver
  • 09:40 kate: created 400GB LV at /sqldata on vandale, ext3. installed mysql. copied ariel's my.cnf over (can someone look at what needs to be changed there?). did not populate any sql data yet.
  • 05:50 kate: REMOVED WILDCARD NS RECORD under *.wikimedia.org. this means you will need to add NS records for new wikis in that domain or they won't work.
  • 05:48 kate: set up recursing NS on pascal and mayflower; tested pdns slave for wikimedia.org on fuchsia, seems to work (but not authoritative yet).
  • 00:05 Tim: moving binlogs 186-189 from ariel to khaldun
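The vandale storage setup in the 09:40 entry can be sketched as follows; the volume group and device names are hypothetical, not taken from the log:

```shell
# Create a 400 GB logical volume, format it ext3, and mount it at /sqldata.
# "vg0" is a placeholder volume group name.
lvcreate -L 400G -n sqldata vg0
mkfs.ext3 /dev/vg0/sqldata
mkdir -p /sqldata
mount /dev/vg0/sqldata /sqldata
echo '/dev/vg0/sqldata /sqldata ext3 defaults 1 2' >> /etc/fstab
```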

2 June

  • 06:15 brion: clearing user records from memcached. two instances of can't-log-in reported might have been caused by stale cache records re-saving bogus unsalted passwords, but that's sheer speculation.
  • 06:00 JeLuF: fixed mail on dalembert and goeje to use smtp.pmtpa.wmnet as smarthost
  • 05:45 JeLuF: removed moreri and bart from "apaches" nodegroup

1 June

  • 19:10 JeLuF: moved binlogs 184 and 185 from ariel to khaldun
  • 15:04 Tim: fixed timezone on coronelli
  • 14:35 Tim: had a go at fixing ntpd on various servers. It was not installed on coronelli and not running on srv5, fixed both fairly easily. Synchronised configuration files on srv11-30, they're still reporting "synchronization failed" as ntpd starts up, although I was able to synchronise their clocks manually with ntpdate. "ntpdc -p" seems to indicate that they are working properly.
  • 5:10 jeluf: Added index, set site to read/write
  • 04:10 brion: updated user tables for password hash salting.
  • 3:00 jeluf: set farm to read only
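The ntpd triage in the 14:35 entry roughly reduces to the following; the choice of zwinger as the time source is an assumption based on later entries (e.g. 10 May):

```shell
# Stop the daemon, step the clock once, restart; then check peer status.
service ntpd stop
ntpdate zwinger
service ntpd start
# "ntpdc -p" lists peers; an asterisk in the first column marks the peer
# the daemon is actually synchronised to.
ntpdc -p
```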

31 May

  • 16:12 Tim: switched profiling from user time to real time
  • 13:45 brion: experimentally disabled MakeSpelFix in lucene search results to compare load / response time
  • 5:00 jeluf: CREATE INDEX id_title_ns ON cur (cur_id, cur_title, cur_namespace); on all wikiquote, wikinews, wiktionary, wikibooks, dewiki and all wikis with 10'000 to 100'000 articles. To be done tomorrow: enwiki, frwiki, jawiki, wikipedias with <10'000 articles

30 May

  • 18:02 kate: started copying khaldun:/usr/etc/images/enwiki/enwiki_upload.tar to srv11:/usr/etc/backup/images/
  • 11:55 brion: enwiki image archives and thumbnails have by now been copied to khaldun. all should be right with the world.
  • 07:49 brion: increased bacon's share of load, but not quite up to previous levels
  • 05:20 jeluf: moved binlogs 175-179 from ariel to khaldun
  • 03:20 brion: took khaldun out of apaches group, added to images group. en.wikipedia.org images are moved to khaldun, thumbnails still copying.

29 May

  • 23:30 brion: working on moving en.wikipedia.org's uploads from albert to khaldun
  • 21:18 brion: reduced load on bacon to keep it from lagging
  • 11:15 brion: added bugzilla stats collection to cron.daily

28 May

  • 20:07 kate: started a full image dump on khaldun using modified backup scripts
  • 08:00-ongoing jeluf: Migrating enwiki to external storage
  • 07:30 jeluf: moved binlogs 170-174 from ariel to khaldun

27 May

  • 22:59 brion: lucene search on wikimedia-wide
  • 11:57 brion: servmonii seems to be offline; not on irc, and smlogmsg fails when doing syncs
  • 11:38 brion: installed simple experimental edit/move rate limiter with fairly conservative settings for now
  • 07:35 brion: changed default search namespaces from NS_TEMPLATE_TALK to NS_HELP (whoops!)

26 May

  • 06:30 jeluf: migrated dawiki
  • 05:30 jeluf: migrated concatZippedHistoryBlobs of eowiki,glwiki,bgwiki to external storage cluster srv28/29/30

25 May

  • 21:50 brion: vincent has been reinstalled with FC3. Running a full Lucene index build for all wikis now...
  • 20:00 dmonniaux: on bleuenn/chloe/ennael: disabled DNS through Wikimedia servers through PPP (didn't work, prevented squid from restarting); used Lost-Oasis servers instead (cf /etc/resolv.conf); inserted iptables -I INPUT -j ACCEPT so as to allow DNS etc. in (please remove once you know what you're doing)
  • 06:30 jeluf: moved binlogs 166 and 167 from ariel to khaldun
  • 04:17 Tim: noticed that webster had stopped replicating 4.5 hours ago. Offloaded it and ran "REPAIR TABLE bugs.bugs" to fix the problem.
  • 01:47 kate: albert's eth1 died for unknown reasons, site broke. configured eth0 as a trunk port to keep site operational.
  • 00:35 brion: running lucene updates on vincent; out of search rotation during build

24 May

  • 21:15 jeluf: moved binlogs 163-165 from ariel to khaldun
  • 15:16 Tim: started update-special-pages-loop, in a screen on zwinger. Using benet for DB.

23 May

  • 20:00 jeluf: added "-A" to /etc/sysconfig/ntpd, synched clocks
  • 15:30 jeluf: installing MySQL 4.0.24 to srv28-30, srv30 will be master, srv28 and 29 will be slaves
  • 08:53 brion: vincent is serving searches from the Mono-based server experimentally
  • 06:15 jeluf: moved binlogs 160-162 from ariel to khaldun
  • 02:35 brion: page moves back on
  • 02:14 brion: temporarily disabled page moves while cleaning up aftermath of a move vandal
  • 01:34 brion: running Lucene index updates and tests

22 May

  • 22:45 brion: fixed another hidden year 2025 entry on eswiki which screwed up recent changes caching
  • 20:15 jeluf: restarted slave. That was faster than I expected.
  • 20:00 jeluf: stopped slave on benet, doing some dumps.
  • 3:00 erik: ran /home/erik/commonsupdate.pl (logged to commonscategoryupdate*.txt) to change category sort keys "Special:Upload" and "Upload" to proper page titles (bug 73); this fixes paging on categories with more than 200 images. Bug 73 is now fixed, so this should not reoccur, but other wikis will have the same problem and can be quickly fixed with this script if necessary.

21 May

  • 23:10 jeluf: moved binlogs 153-159 from ariel to khaldun:/usr/backup/arielbinlog/
  • 3:00 erik: setting up sr.wikinews.org. Not announced yet until language files are fixed.

20 May

  • 19:00 Chad: put zwinger, holbach and webster on the scs. Took moreri, smellie and anthony off. Tim changed software labels.
  • 15:15 midom: killed suse firewall and kernel security stuff. it freaked out all sysadmins, shouldn't be allowed to live :)
  • 12:50 brion & many: all hell breaks loose with ldap oddness on albert and dns and... stuff
  • 4:30-5:40 Tim & onlookers: DNS failure on zwinger. Took us an hour to fix it instead of 2 minutes, and caused problems site-wide, because we're using non-redundant DNS instead of /etc/hosts. Logins were timing out because commands in the login scripts were waiting for a DNS response. Managed to get root on albert first, and set about modifying resolv.conf on all machines to use albert as well as zwinger. Eventually got root on zwinger, had to kill -9 named. Restarted it, everything is back to normal.
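The resolv.conf change made during the outage above would look something like this; the IP addresses and search domain are placeholders, not from the log:

```shell
# Point every machine at both resolvers so a single DNS host failing
# no longer hangs logins. IPs below are illustrative.
cat > /etc/resolv.conf <<'EOF'
search pmtpa.wmnet
nameserver 10.0.0.1
nameserver 10.0.0.2
options timeout:2 attempts:2
EOF
```

The `timeout`/`attempts` options shorten the stall when the first resolver is down, which was the failure mode described.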

19 May

  • 22:00 jeluf: Upon GerardM's request, and due to ongoing vandalism on li.wikipedia.org, promoted user "cicero" to sysop on liwiki
  • 21:48 jeronim: removed isidore from squids dsh group (and condolences for the eurovision tragedy)
  • 21:30 midom: after surviving ddos aimed at my dsl and lithuania's failure in eurovision I finally moved some ariel binlogs to alrazi/khaldun (raid1 :)
  • 02:45 kate, brion: fixed ldap/firewall for external servers

18 May

  • 18:11 Tim: Categorised the servers by interface and vlan at Interfaces. Fixed routing tables on a few hosts that were non-standard for their category.
  • 16:35 Tim: removed isidore and vincent from dsh ALL node group, non-standard configuration. Also removed bart and moreri, permanently down
  • 14:54 jeluf: flushed firewall on bayle. Back in apache service.
  • 14:38 brion: readded vincent in search group

17 May

  • 19:45 JeLuF: restored lost history on dewiki
  • 10:29 Tim: Updated DNS to get closer to this, and hence reality.
  • 09:45 Tim: Fixing sources in gmetad.conf fixed it
  • 09:28 Tim: Moved ganglia configuration to /h/w/conf/gmond, symlinked config on new apaches, changed cluster name, restarted ganglia. It doesn't seem to have fixed the recording problem.

16 May

  • 16:50 Tim: Moving ariel binlogs 130-139 to avicenna. Don't ask me where 106-129 went.
    • as I already said: khaldun.
  • 05:50 brion: added a bunch of nazi spam subjects to wikipedia-l spam filters, hoping to reduce admin load

15 May

  • .... dammit is moving memcacheds around to work around browne problem ...
  • 12:05 brion: installed libtidy-devel and patched tidy PECL extension on srv11-srv30
  • 11:59 brion: browne is having some funky problems; can't talk to the srv machines, which is Bad for memcached work
  • 10:10 brion: installed updated LanguageEl.php; had to fix permissions on file

14 May

  • 23:40 brion: disabled catfeed extension for security review
  • 23:25 brion: lucene search now up for all en, de, eo, and ru sites. In theory.
  • 10:30 brion: running enwiki index update again
  • 08:00 brion: vincent back online; eth0 had not initialized properly

13 May

  • 21:45 brion: ran checksetup.pl on bugzilla to apply stealth database updates which broke login
  • 19:05 brion: upgraded bugzilla to 2.18.1
  • 18:37 brion: wikibugs back on irc
  • 18:13 brion: hacked Image.php to ignore metadata with negative fileExists, and updated wgCacheEpoch to force rerendering. broken images should be mostly fixed now
  • 17:57 brion: grants wiki fixed (wrong directory was synced in docroot)
  • 16:20 jeronim: bugzilla bot not running, problems with images ("Missing image" on wiki pages) not fixed
  • 12:10 -14:00 and beyond: jer/kate/midom/tim: power loss @ florida colo. most servers lost power; albert, ariel, bacon, suda, khaldun, webster, holbach, srv2, srv3, srv4, srv6, srv7 did not
  • 06:00 brion: running lucene builds for all remaining en, eo, ru, de wikis

12 May

  • 22:23 brion: hacked language name for 'no' to 'Norsk (bokmål)' per request.
  • 21:00 JeLuF: Test installation of mysql cluster on srv29 and srv30. Management server running on srv0. Installation done according to howto.
  • 8:54 Tim: offloaded ariel to correct for load caused by compressOld.php and the pending deletion script
  • 08:10 Tim: deleting articles on en marked "pending deletion", see Wikipedia:User:Pending deletion script

11 May

  • 19:46 Tim: started compressOld.php, running on a screen on zwinger
  • 00:05 brion: corrected year on a fr.wikinews revision from 2025 to 2005. Assumed a very badly set clock yesterday morning -- does anybody know about this? I can find no trace of it now, though there were several complaints about affected articles, other examples of which now show correct years. Did someone correct them? Who, and when?
midom: System clock wasn't synced to hardware clock before the new server crash - servers came up with bad timers. Fixed bad entries in ~15 wikis (wikipedias only), hence frwn remained.

10 May

  • 23:59 brion: synched hardware clocks on all apaches to current system time. (some were hours off, a few were in 2003)
  • 21:33 brion: resynched clock on srv14 to zwinger with ntpdate; was about a minute off.
  • 14:00 midom: restarted all memcacheds
  • 13:15 midom: chain reaction of slow image server maxed out fds on memcached, which caused even more image server load. temporary workaround: remove some old apaches from service, so that memcached would function a bit better.
  • 12:20 midom: ldap server reached maxfiles. fixed in /etc/sysconfig/openldap && restarted
  • 12:00 midom: recovered broken new apaches
  • 07:55 brion: disabled curl extension loading in case it makes a difference if/when mysteriously killed machines are raised from dead
  • 07:31 midom: srv11-srv30 all died
CURL extension in effect
  • 07:30 brion: installed curl PHP module on apaches
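Syncing the hardware clocks as in the 23:59 entry above is, in sketch form (dsh node groups appear elsewhere in this log, but this exact invocation is an assumption):

```shell
# Step the system clock from zwinger, then copy it into the hardware
# clock so a machine doesn't come back up in 2003 after a power loss.
dsh -g apaches -- 'ntpdate zwinger && hwclock --systohc'
```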

9 May

  • 22:00 chaper, jeluf, midom: srv11-srv30 joined apache service.
  • 01:30 brion: removed three invalid image records from commons (from 1.3 era before some name validation fixes)
  • 01:00 brion: Somebody (gwicke?) checked out an entire phase3 source tree inside the 1.4 live installation directory. That's a very bad place for it -- it would get replicated to all servers if a full sync is run. I moved it to /tmp.

8 May

  • 22:09 Tim: discontinued freenode enwiki RC->IRC feed
  • 21:45 JeLuF: removed khaldun from dsh group mysqlslaves
  • 21:25 JeLuF: fixed replication on holbach, otrs.ticket_history was broken. Holbach back in service.
  • 15:00 JeLuF: fixed replication on bacon, otrs.ticket_history was broken. Bacon back in service.
  • 11:08 Tim: added CNAME for irc.wikimedia.org, still working its way through the caches. Opened up port 6667 on browne. Switched on RC->UDP for all wikis, the whole thing is now fully operational.
  • 11:00 brion: cleared image metadata cache entries for commonswiki due to latin1-wiki bug inserting bogus entries
  • 8:40 Tim: installed patched ircd-hybrid on browne

7 May

  • 11:00 brion: replaced wikimania's favicon with the WMF thang. running some lucene updates in background on vincent
  • 02:18 brion: started squid on isidore; had been down for some time. cause unknown.
  • (some time earlier) tim: made unspecified changes to squid configuration for another external squid

6 May

  • 03:00 brion: updated Latin language file; changed namespaces on those wikis.
  • 01:05 brion: suda caught up. back in rotation.
  • 00:44 brion: restarted replication on suda (bugzilla's votes table had some kind of index error)
  • 00:37 brion: took suda out of rotation; replication is broken

5 May

  • 21:39 brion: starting rc bots on browne. Configuration has changed, they are not using a proxy and must be run from a machine with an external route.
  • 16:38 Tim: dumping, dropping and reimporting bgwiktionary.brokenlinks seems to have worked, gradually reapplying load
  • 15:55 Tim: Trying standard recovery procedures
  • 15:08 Suda crashed due to a corrupt InnoDB page

4 May

  • 22:15 brion: hacked in os interwiki defs for wikipedias (not other wikis, not sure if they're even set up)
  • 18:52 Tim: installed RC->UDP->IRC system. The UDP->IRC component is udprec piped into mxircecho.py running in a screen on zwinger. This removes the high system requirements previously needed for RC bots.
  • 10:30 Tim: Bots K-lined. Removed enwiki and dewiki to avoid further offence, and left them in a reconnect loop. If someone wants to approach Geert yet again, be my guest.
  • 10:20 Tim: moved RC bots to browne, which is mostly idle, has plenty of RAM, and has an external IP address, allowing it to connect to freenode without going through the apparently undocumented and non-working port forwarder on zwinger.
  • 6:45 jeluf: started squid on will, was down.

3 May

  • 22:25 kate: changed liwiki tz to Europe/Berlin
  • 4:40 jeluf: added webster to DB pool again.

2 May

  • 14:15 midom: after second consecutive webster crash, took it out from rotation, trying forced innodb recovery, planning resync:
050502 14:11:15 InnoDB: Assertion failure in thread 1207892320 in file btr0cur.c line 3558
InnoDB: Failing assertion: copied_len == local_len + extern_len
InnoDB: We intentionally generate a memory trap.
InnoDB: Submit a detailed bug report to http://bugs.mysql.com.
InnoDB: If you get repeated assertion failures or crashes, even
InnoDB: immediately after the mysqld startup, there may be
InnoDB: corruption in the InnoDB tablespace. Please refer to
InnoDB: http://dev.mysql.com/doc/mysql/en/Forcing_recovery.html
InnoDB: about forcing recovery.
  • 14:00 midom: webster's mysql crashed with some assertions, did come up later and continued to serve requests after some load management
  • 11:00 brion: started squid on srv7, which had been down for unexplained reasons and its IP addresses had not been reassigned
  • 07:20 brion: rebuilt foundation-l list archives after removing some personal info by request
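The "forced innodb recovery" mentioned in the 14:15 entry above normally means starting mysqld with `innodb_force_recovery` set and salvaging a dump; the paths below are illustrative:

```shell
# Levels 1-6 get progressively more destructive; start at 1 and raise
# only if mysqld still won't stay up. Resync from a good copy afterwards.
mysqld_safe --innodb_force_recovery=1 &
mysqldump --all-databases > /root/webster-salvage.sql
```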

1 May

  • 00:05 brion: changed $wgUploadDirectory settings so they won't break in maintenance scripts. hopefully didn't get them wrong.

30 April

  • 23:30 brion: cleared image cache for all wikis. bogus entries probably added during links refresh; maintenance scripts have wrong $wgUploadDirectory
  • 23:00 brion: cleared image cache entries in memcached for commonswiki due to spurious entries marked as not existing.
  • 04:25 Tim: Setting up for perlbal throughput test on tingxi

29 April

  • 22:18 Tim: resumed refreshLinks.php
  • 15:57 Tim: stopped refreshLinks.php at the end of enwiki, before the delete queries
  • 15:28 Tim: Restarted avicenna, which caused the site to crash due to a large number of threads waiting for Lucene
<TimStarling> what is avicenna's role?
<dammit> was: search server
<dammit> dunno now
<TimStarling> avicenna is reporting 20% user CPU usage
<dammit> every host that runs lucene
<TimStarling> but nothing is showing up in top
<dammit> has broken top output
<dammit> and broken ps output
<TimStarling> nothing important shows up in netstat, I'll just reboot it
<TimStarling> ok?
<dammit> 'k
*site explodes*
  • 09:10 brion: took vincent out of lucene search rotation while it's building; changed default_socket_timeout in php.ini to 3 seconds from 60
  • 04:00 brion: started incremental index update for lucene search indexes
  • 03:38 Tim: resumed refreshLinks.php after having stopped it for a while during peak period

28 April

  • 05:12 Tim: Shutting down apache on srv1 to dedicate it to refreshLinks.php
  • 02:10 brion: set up logrotate on isidore to rotate squid log, in hourly cron
  • 01:40 brion: manually rotated squid log on isidore due to reaching 2gb, restarted squid.
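The hourly rotation set up on isidore can be sketched with a stanza like the following; the log path and the use of `squid -k rotate` are assumptions, not taken from the log:

```shell
cat > /etc/logrotate.d/squid <<'EOF'
/var/log/squid/access.log {
    rotate 8
    compress
    missingok
    notifempty
    postrotate
        /usr/sbin/squid -k rotate
    endscript
}
EOF
# Invoke logrotate from cron.hourly rather than cron.daily so the log
# can no longer hit the 2 GB limit between runs.
```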

27 April

26 April

  • 06:00ish brion: copied updated lucene indexes to avicenna and maurus, put vincent back in search rotation
  • 05:40-05:55: Severe external network problems
  • 05:25 Tim: deleted obsolete binlogs, moved the remainder (77-87) from ariel to avicenna. 33 GB of disk space remaining on ariel.
  • (yesterday) jeronim: installed python 2.4.1 from source on alrazi, using make altinstall instead of make install, so that the current python 2.3 installation is not interfered with -- the 2.4.1 binary is at /usr/local/bin/python2.4

25 April

  • 23:30 jeronim: clocks were wrong on 5 machines; fixed 4 of them (installed ntpdate on vincent). isidore still needs to be done (dammit? :)
  • 07:55 brion: started a second active search daemon on maurus (vincent is still rebuilding indexes)
  • 05:00 jeluf: enabled LuceneSearch.
  • 01:20 brion: had to restart srv7 squid again. moved logrotate from cron.daily to cron.hourly, where it should have been before but wasn't

24 April

  • 21:30 jeluf: disabled LuceneSearch. All apache processes were in state LuceneSearch::newFromQuery
  • 11:15 jeluf: set wgCountCategorizedImagesAsUsed for commons.
  • 02:55 brion: manually rotated squid log on srv7 again when it reached 2gb and crashed. logrotate needs to be fixed...
  • 02:15 brion: installed GCC 4.0 final on vincent, avicenna for GCJ. Taking vincent out of search rotation for index rebuild.

23 April

  • 13:15 Tim: recaching special pages, with wget script running in a screen on zwinger, which requests recache pages from bayle, which sends the expensive queries to benet.
  • 02:25 brion: manually rotated logs and restarted squid on srv7. had been down for 2.5 hours, but nobody noticed the alarm from servmon.

22 April

  • 10:20 brion: as a temporary hack, bumped rc_namespace on metawiki from tinyint to int. somebody added a Russian help namespace at 128/129 which is outside of the signed tinyint range, so pages were recorded with the wrong namespace.
  • 01:30 brion: removed 'wrap' option from tidy.conf to work around weird corruption problem (may be bug in tidy; investigating)
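The tinyint-to-int bump in the 10:20 entry corresponds to something like this (the column attributes are assumptions; `recentchanges`/`rc_namespace` are the standard MediaWiki names):

```shell
# Signed tinyint tops out at 127, so namespaces 128/129 wrapped around.
mysql metawiki -e \
  'ALTER TABLE recentchanges MODIFY rc_namespace INT NOT NULL DEFAULT 0;'
```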

21 April

  • 18:00 midom: started backup run on benet

20 April

  • 11:25 brion: tidy extension installed on apaches, now active. To go back to external, set $wgTidyInternal = false; or remove extension=tidy.so from php.ini and restart apaches
  • 10:50 brion: added node groups fc3, freebsd, debian
  • 10:06 brion: removed isidore and vincent from fc2-i386 node group, as they're running FreeBSD and Debian
  • 10:00 brion: working on installing tidy extension for php...
  • 03:00 brion: re-enabled search

19 April

  • 16:50 Tim: Pope-related flash crowd, peaking at 2100/s. Apaches were hard hit by searches (about 50% of profile time) so I disabled them temporarily.
  • 16:00 Tim: we were getting reports of gzuncompress errors in memcached-client.php, on every page view on en. I put in an error suppression operator and instead logged all such errors to /home/wikipedia/logs/memcached_errors, to determine which server was the problem. It turned out to be not a server but a key, enwiki:messages to be precise. Deleting it and letting it reload fixed the problem.
  • 07:30 midom: sad notice, smellie down, memory or other hardware troubles, lots of segmentation faults and other signals before reboot, didn't come up after.

17 April

  • 09:00 midom: fixed broken webster replication, caused by the bugs table in the bugs database
  • 06:45 brion: fixed symlinked php.ini on srv2, srv3
  • 00:00 midom: reformatted suda data area from xfs to ext2, brought into MySQL service for enwiki only

14 April

  • 03:20 brion: eowiki lucene search live! others building...
  • 02:45 brion: started lucene index builds for eowiki, ruwiki, dewiki
  • 02:15 brion: lucene search live for meta
  • 01:45 brion: restarted meta search build, as it was pulling from wrong db. whoops!

13 April

  • 23:51 brion: noticed some spam coming in on bugzilla. hacked rel="nofollow" into comment processing, removed the comment, and disabled the account used to post it.
  • 22:40 brion: starting lucene index builds for metawiki and some other wikipedias
  • 00:08 brion: removed Apache-Midnight-Job from avicenna crontab

12 April

  • 23:50 brion: vincent and avicenna are sharing LuceneSearch burden.
  • 20:00 brion: Chad fixed vincent, which is now running lucene. Isidore lucene stopped, it's going to be squid soon. Will take over an apache for additional search capacity.
  • 13:30 brion: lucene search turned on for en with slightly old index file, daemon running on isidore
  • 10:30 brion: gcj on isidore seems horked; index rebuild is much too slow (eta 18 hours) so stopped it. uploading an index from home, and building mono for further testing.
  • 10:00 midom: holbach restored.
  • 08:55 holbach seems to be deadish
  • 08:50 brion: started lucene index build on isidore
  • 05:50 brion: vincent doesn't seem to be coming up again, will need to be kicked.
  • 05:20 brion: upgrading vincent to 2.6 kernel hoping to resolve threading/memory issues w/ MWDaemon
  • 02:10 brion: rebooting srv6 due to zombie squid eating port 80

11 April

  • 23:05 kate: experimenting with making an en.wp image dump using trickle (cvs: /tools/trickle/)
  • 08:00 midom: broken replication (by Chinese scammer) on bacon, fixed by "use otrs; repair table article" - myisam tables are evil, aren't they?

10 April

  • ~23:00: kate: upgraded squid to STABLE9+patches (see squid builds) + restarted all squids.
  • mark: All squids are running with too few FDs (1024), and if no one replaces all daemons with the new one Kate just built, we may have a problem tomorrow during peak hours...
  • 19:15 midom: srv7 is now in squid service
  • 19:07 brion: MWDaemon's memory usage got high enough it started swapping. Hung connections ate up apaches and hung the site until it was restarted.
  • 5:30 brion: lucene search server active for en.wikipedia.org, running on vincent.

9 April

  • 15:45 midom: dropped thttpd (as it was using 32bit mmaps) on dumps in favor of lighttpd. It has superb performance, serves 3500hits/s under ab and served 70MB/s from benet in small reqs... Extreme recommendations for using lighttpd for image uploads.
  • 10:15 brion: running lucene search indexer on vincent (pulling enwiki from benet).
  • 05:25 brion: added additional is rcbots to #is.wikipedia for wiktionary/wikibooks/wikiquote

8 April

7 April

  • Mark, Tim: implemented Multicast HTCP purging on all FL apaches/squids. French Squids still need a binary replacement.

6 April

  • 21:44 mark: Put port gi0/26 on csw1-pmtpa into trunking mode: vlans 1-2 only, with vlan 2 being the native vlan, no LACP negotiation
  • 11:30 midom: benet put into dump operation
  • 10:55 brion: reinstalled PHP on zwinger and apaches, compiled with memory limit and mbstring options enabled. This was left out when upgrading to 4.3.11.
  • 2:40 brion: added NetCabo proxies to trusted proxy list (inconveniently shared by Jorge and a Nazi vandal on pt.wikipedia.org)

4 April

  • 15:30 jeluf: disabled logging of upload.wikimedia.org
  • 15:15 midom: yet another image server overload. rotated 30G upload.wikimedia logfile, could be fragmentation overhead.
  • 12:00 midom: moved log_bin.0[0123]? (40G worth of binlogs) from ariel to khaldun/avicenna backup/arielbinlog, reclaimed some master disk space.
    • Do we need those binlogs for anything?
      • Yes, we need binlogs back to the last full backup -- TS
  • 07:48 Tim: Started memcached on browne, it was in the list but not running. Fixed startup scripts. Noticed that browne can't contact albert on 10/8, modified yum.conf accordingly.

3 April

  • 18:25 midom: extended public IP address range (now: 12 addresses)
  • 17:50 midom: srv5 joined service as squid.

1 April

  • 22:30 midom: Enabled recentchanges-based watchlist hack. Servers go faaaast.
  • 23:15 brion: set default block expiry to 1h on dewiki by request of various admins
