Server admin log/Archive 20
From Wikitech
April 26
- 17:50 jeronim: on db2 in ntp.conf, changed "restrict 66.230.200.234 nomodify notrap" to "restrict 208.80.152.189 nomodify notrap" to match the "server 208.80.152.189" line below it. Output from ntpq -p looks much better now, showing an IP address in the refid column instead of ".RSTR."
- 16:53 mark: Installed lvs2, lvs3 and lvs4 for testing
- 15:58 mark: Installed Ubuntu 8.04 on lvs1 for testing
- 15:58 mark: Ubuntu 8.04 Hardy Heron installs are now possible on all VLANs
- 12:09 jeronim: did /etc/init.d/ntpd restart on db2 which fixed clock offset of about 6 seconds; underlying problem not fixed
April 25
- 20:09 mark: lily under extreme load, investigating
- 19:28 brion: added 'Vary: Cookie' HTTP header to blogs... don't know if it'll do a damn thing, I can't even clear things from these squids using the normal methods
- 18:33 brion: upgraded blog.wikimedia.org and whygive.wikimedia.org to wordpress 2.5.1
- 17:13 brion: fixed MWSearch extension to use Http::get() instead of file() to hit the backend. This should resolve the load spikes we've been seeing around 7:30-8:00 UTC daily; the servers slow down while indexes are being resynced, and the long default timeouts caused things to back up on the front end instead of failing out gracefully.
- 16:55 brion: upgraded utfnormal extension on srv42 so dumps will work again. (note that dumpBackup.php no longer works when autoselecting database connections, probably a bug due to the new load balancer. works in live use as a server is explicitly passed on command line.)
- 07:56 river: /var/lock on lily became full from the spamd bayes database; moved it to /var/spamd. expired the old bayes database because its size was causing spamd to be very slow (30+ seconds per mail).
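A minimal sketch of the idea behind the 17:13 MWSearch entry above, in Python rather than PHP: hit the search backend with a short explicit timeout and give up cleanly, so slow index resyncs do not back requests up on the front end. The function name, the 2-second timeout and the injectable opener are assumptions for illustration; this is not the MWSearch code.

```python
import urllib.request


def fetch_search_results(url, timeout=2.0, opener=urllib.request.urlopen):
    """Return the response body, or None if the backend is slow or down."""
    try:
        # An explicit timeout bounds how long a front-end request can wait.
        with opener(url, timeout=timeout) as response:
            return response.read()
    except OSError:
        # URLError and socket timeouts are both OSError subclasses;
        # fail out gracefully instead of hanging.
        return None
```

A caller can then show a "search temporarily unavailable" message when the result is None, rather than holding an Apache worker until a long default timeout expires.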
April 24
- 05:50 Tim: fixed wikidiff2 on fedora apaches, was missing since 5.2.5 upgrade.
April 23
- 19:54 brion: restarted apache on bart (secure proxy), seems happier
- 19:50 brion: secure.wikimedia.org connections hanging
- 00:25 brion: resynced db2's clock; was 7 seconds slow, causing all s1 slaves to think they were lagged, causing all enwiki job runners to sit waiting for them to catch up
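A simplified model of why a slow clock looks like replication lag, assuming db2 was the master whose timestamps the slaves were compared against and that lag is estimated purely from timestamps. Both function names are hypothetical, and the real lag check may well differ.

```python
def stamp_on_master(true_time, master_clock_offset):
    """Timestamp the master writes; a negative offset means its clock is slow."""
    return true_time + master_clock_offset


def apparent_lag(master_timestamp, slave_clock_now):
    """Lag in seconds as estimated from timestamps alone."""
    return max(0.0, slave_clock_now - master_timestamp)


# A slave that applies an event instantly still appears 7 seconds behind
# when the master's clock is 7 seconds slow.
lag = apparent_lag(stamp_on_master(1000.0, -7.0), 1000.0)
```

In this model the slaves are fully caught up, yet anything waiting for lag to drop below a few seconds would wait indefinitely until the clock is corrected.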
April 22
- 00:42 brion: enabling wgEnableMWSuggest globally for a few minutes to evaluate DB impact
April 21
- 23:18 brion: enabled $wgCookieHttpOnly -- new session & user token/name/id cookies should be sent HttpOnly, so supporting browsers won't expose them to JavaScript as an additional protection against some categories of XSS
- 23:10 brion: upgrading php on srv141, was down during 5.2 updates
- 22:26 brion: got a report of a commons image with missing archive versions. Files are present on upload4 but not on upload3... which is odd because as far as I can tell, only the thumbs are used on upload4 for commons. Why is there a full copy of commons, and why don't they match?
- 22:13 brion: getting lots of complaints from scap about time sync. clock offsets mostly <1s but some >3s
- 20:55 RobH: lvs3 and lvs4 racked and remote access enabled.
- 19:44 RobH: db4 reinstalled.
- 19:44 RobH: lvs1 and lvs2 racked and remote access enabled.
- 19:26 RobH: thistle reinstalled.
- 17:30 RobH: db1 unresponsive, rebooted.
- 17:30 RobH: racked srv141 and brought back online
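For the 23:18 $wgCookieHttpOnly entry above, a small Python sketch of what an HttpOnly session cookie header looks like. The cookie name and value are made up for illustration, and this uses Python's standard library, not MediaWiki's cookie code.

```python
from http.cookies import SimpleCookie


def session_cookie_header(name, value):
    """Build a Set-Cookie value flagged HttpOnly."""
    cookie = SimpleCookie()
    cookie[name] = value
    cookie[name]["path"] = "/"
    # HttpOnly asks supporting browsers not to expose the cookie to JavaScript.
    cookie[name]["httponly"] = True
    return cookie[name].OutputString()


header = session_cookie_header("example_session", "abc123")
```

Browsers that honour the flag still send the cookie with requests, but scripts injected through an XSS hole cannot read it from document.cookie.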
April 20
- 15:15 mark: squid on khaldun had disappeared due to an upgrade a few days ago, and dependency conflict with the Wikimedia packages
- 13:00 mark: Depooled srv2 and srv4, the only remaining 32 bit apaches in rotation.
April 19
- 18:00 mark: srv133's time was off, corrected
April 18
- 22:00 RobH: relocated srv127
- 21:16 RobH: relocated srv129 and srv128
- 20:37 RobH: relocated srv130
- 20:09 RobH: relocated from B3 to B1 srv134, srv133, srv132, srv131
- 12:50 Tim: upgrading PHP to 5.2.5 on all Fedora Core apaches
April 17
- 21:50 brion: lowering db4 priority from 150 to 50; still loaded
- 21:10 brion: lowering db4 priority from 200 to 150; seems very highly loaded compared to db3 with same priority
- 20:25 RobH: Relocated srv136 & srv135.
- 19:40 RobH: Relocated srv137
- 19:25 RobH: Relocated srv138. Put ext store cluster 14 back in service.
- 19:19 brion: applying pt_title encoding fixes
- 18:59 RobH: Relocated srv141, srv140, srv139, srv138.
- 18:50 RobH: Removed ext store cluster 14 from active use.
- 18:44 mark: Removed AAAA record on khaldun.wikimedia.org, apparently apt doesn't even try v4 when it has a proxy hostname with an AAAA record and a v6 route is not available.
- 18:44 mark: Fixed httpd on pascal
- 18:20 brion: fixed ganglia reporting knams -> pmtpa (old zwinger IP in trusted list on pascal); detail reporting still down due to broken httpd on pascal
- 18:10 mark: Fixed MySQL group in ganglia by making ixia an aggregator again
- 18:11 RobH: srv143 and srv142 relocated.
- 17:58 brion: enabled search suggestion drop-down on testwiki
- 17:13 RobH: srv144 relocated.
- 17:00 RobH: srv145 relocated.
April 16
- 22:46 brion: enabled TitleKey extension, search suggestions, and HttpOnly cookies on wikitech
- 21:40ish brion: hopefully fixed the php5.1 bug with global sessions on secure.wikimedia.org
- 21:21 RobH: srv150 relocated.
- 21:11 RobH: srv149 relocated.
- 21:06 brion: enabling global sessions on secure.wikimedia.org
- 20:57 srv148 relocated.
- 20:47 brion: restarted data dumps on srv31 and srv42
- 20:45 srv147 relocated.
- 20:31 brion: cluster16 back in rotation; tim restarted mysql
- 20:25 brion: rash of complaints of db errors due to srv146 being out (cluster16 ES master). took cluster16 out of $wgDefaultExternalStore while it's being fixed
- 20:11 RobH: srv146 relocated.
- 16:52 brion: fixed ticket.wikimedia.org redirect to otrs
- 10:50 brion: got a mystery SMS complaining of 5-minute lag on dewiki
April 15
- 23:55 brion: giving planet its own little user account :)
- 22:24 brion: PMTPA databases, all KNAMS, and all YASEO are missing from ganglia and have been for a while. What's going on?
- 19:00 mark: Cleaned up csw5-pmtpa's config, added BGP inbound filtering on prefix lists and known bogons
- 17:35 brion: rc_user_text index is missing from frwiki, nlwiki, plwiki, and svwiki. Special:newpages was using it in some cases; have disabled the index and the username lookup feature for it pending fixes.
- 00:25 brion: updated SpecialNewpages.php to tweak index forces per domas's request; new pages was causing some sort of problem
April 14
- 23:55 brion: gettin' ready to svn up! applied flaggedrevs_promote table on test & labs, and the centralauth gu_token field
- 19:10 brion: restarted IRCD, was hanging mysteriously
- 16:25 RobH: srv130 synced and apache restarted.
- 16:00 RobH: srv0 and benet powered down pending drive wipe for decommissioning.
April 13
- 10:46 Tim: pybal on diderot was depooling servers due to name lookup failure (timeout). Traced the problem back to nscd and restarted it, that fixed it.
April 12
- 00:15 brion: robots.txt may or may not be fixed for blog.wikimedia.org; some kind of freakish default, probably from wordpress 404 handling, redirected it to robots.txt/ (with final slash) which disallowed all by default apparently (?!). added a plain file... but caching is still taking the redir that i can see
- 00:10 brion: sql script doesn't work for non-wiki dbs such as 'centralauth' and 'oai' at the moment; lookup fails
- 00:02 brion: setting up sr.planet.wikimedia.org
April 11
- 12:48 mark: Discovered that lighttpd does not allow caching of unknown content-type responses. amane was serving quite a lot of unknown content types, which were consequently not cached by the Squid clusters. Fixed this by adding a lot of content types to lighttpd.conf, as well as a default content-type in case any are missed.
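The "known types plus a default" approach from the 12:48 entry can be sketched in Python as follows. This is only an analogy using Python's mimetypes module, not lighttpd configuration, and the fallback type chosen here is an assumption.

```python
import mimetypes


def content_type_for(path, default="application/octet-stream"):
    """Return a content type for the file, falling back to a default."""
    guessed, _encoding = mimetypes.guess_type(path)
    # Without the fallback, unknown extensions would have no content type.
    return guessed or default
```

With a fallback in place every response carries some content type, which is the condition the entry describes for lighttpd allowing the response to be cached.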
April 10
- 21:30 jeluf: fixed nagios' conf.php, to reflect the latest db.php changes.
- 16:50 brion: restricted wfNoDeleteMainPage to enwiki which I presume it was added for. It's a huge nuisance for other wikis which quite legitimately are rearranging their content.
April 9
- (all day) mark: Restarted various daemons on lots of servers to get DNS resolver libs to use the new DNS IPs (mostly nscd, apache, some mysql)
April 8
- 21:35 brion: fixed (?) bad nsswitch.conf on bart (nis -> ldap)
- 16:48 brion: adjusted new $wgExpensiveParserFunctionLimit to match old $wgMaxIfExistCount
- 16:38 Tim reenabled search
- ?????? Tim disabled search sitewide
- 7:40-8:40 Tim: the lack of a FORCE INDEX caused LogPager queries to be extremely slow. The site eventually went down when the cumulative query load built up sufficiently. Took a bit of time to disable the queries properly, kill the MySQL threads, and get the site back up.
- 07:40 Tim: updated to r32943
- 07:30 jeluf: restored .procmailrc for OTRS. We've lost all mails coming in between 0:38 and 7:30 UTC. I can't find them in /var/spool/mail, and they didn't go to OTRS. Any idea where postfix has put them?
- 07:19 Tim: deleted 100GB of binlogs on ixia
- 04:30 jeluf: migrated some of the changes that I've made to our OTRS. Installed a big red MOTD message on the login screen.
- 01:10 brion: reinstalled OTRS FAQ module, fixing broken ticket zoom.
- 00:40 brion: upgraded OTRS to 2.1.8. If you have information about the patches that were previously applied, please provide them! They have not been copied over since it's unclear what's what.
April 7
- 18:29 RobH: srv117 shutdown due to failed HDD. RMA placed.
- 18:18 RobH: db1 rebooted due to hard lockup.
- 17:25 Tim: running maintenance/archives/upgradeLogging.php on various (eventually all) wikis
- 00:10 brion: running a bzip2 integrity check on enwiki-20080312-pages-meta-history.xml.bz2; .7z is cut off
April 6
- 11:24 mark: Changing resolver IPs on all servers
- 05:10 Tim: cleaned up binlogs on srv139 and srv146
April 5
- 17:42 mark: lighttpd on storage2 had run out of FDs and crashed. Increased the limit.
- 16:52 mark: Stopped announcing prefix 66.230.200.0/24 in BGP.
- 16:00 mark: Removed old IPs from various servers.
April 4
- 19:52 brion: srv117 is borked; logins hanging
April 3
- 21:18 brion: moved dump monitor thread to srv31; stale ruwiki dump marked correctly as aborted now. NOTE: IPs for storage NFS mounts should be changed when enwiki and dewiki dumps finish..........
- 21:15 brion: killed dump & sitemap processes on benet. we're retiring it...
- 15:59 RobH: Removed vincent, biruni, kludge, humboldt, & hypatia from all dsh groups and apache pool for decommissioning.
April 2
- 22:01 RobH: isidore updated with newest wordpress installation for blog and donation blog.
- 17:55 RobH: db1 rebooted.
- 17:45 brion: added bart's new ip to known proxy list
- 17:32 mark: Renumbered friedrich
- 16:07 mark: Renumbered srv8, bayle
- 15:57 mark: Renumbered srv9 and srv10
- 15:43 mark: Renumbered yongle
- 15:34 mark: Renumbered isidore
- 15:26 mark: Renumbered browne
- 15:10 mark: Renumbered storage1, anthony
- 14:16 mark: Renumbered storage2, will
- 14:00 mark: Restored symlinks in /etc/powerdns/templates/, be careful when working on/copying those files, they are heavily symlinked!
- 13:15 mark: Renumbering bart to new IP range
- 11:00 - 11:30 mark: Reloaded csw1-knams with new firmware; temporarily moved traffic to florida
April 1
- 08:00 domas: db1 didn't like oracle migration, crashed
March 31
- 4:30 JeLuF: Added srv145 back to external storage pool 'cluster16'. Added srv130 back to external storage pool 'cluster13'.
- 4:00 JeLuF: Fixed mysql on srv81 and srv145. On srv138, resolved "out of diskspace" situation. The second disk was not mounted and both mysql datafiles were on one disk only.
March 28
- 18:57 RobH: sq12 back online from lockup.
- 18:46 RobH: Replaced DIMM4 in srv166
- 18:09 RobH: srv51 back online from kernel panic.
- 17:59 RobH: srv78 & srv81 back online from kernel panic.
- 17:55 RobH: srv130 & srv131 back online from kernel panic.
- 17:46 RobH: srv145 back online, was powered down?
March 26
- 19:00 brion: previous fix had a bug which broke wikis with language variants. fixed.
- 18:20 brion: Worked around mystery segfaults with voodoo fix (r32477)
- 17:26 brion: mysterious crashes on private wiki root redirects, still trying to diagnose. (backtrace)
- 15:26 mark: Set up sq50 as temporary LVS balancer instead of avicenna, so it's not a squid atm.
- 15:00 mark: PyBal's configuration file had a syntax error, causing LVS to go down. Avicenna completely swamped and unreachable.
- 14:08 mark: Rendering cluster down due to OOM kills on all 3 servers. Killed apaches and restarted them.
March 25
- 22:31 brion: disabled CentralAuth debug log; found the bug i was looking for :)
- 22:22 brion: enabled CentralAuth debug log
March 24
- 23:11 brion: set default perms for upload to autoconfirmed except on commonswiki... this may be rolled back or changed if unpopular
- 17:50 brion: restarting category builds on commons and enwiki
- 17:45 brion: poked around old paypal post urls
March 21
- 22:00ish brion: switched search front-end to core UI wikimedia-wide. Note some site JS needs fixing like this
- 03:00 brion: mailman paranoia
March 20
- 19:25 brion: restarted lighty on storage2; was down mysteriously
- 16:53 storage2's lighty appears to have died... had lots of errors about too many open files etc
- 12:53 RobH: srv150 back online.
- 12:46 RobH: srv81 rebooted from kernel panic.
March 19
- 23:55 brion: starting batch category table population...
- 23:27 brion: updating code; stub updatelog and category tables applied. will populate tables after gone live...
March 18
- 17:49 brion: benet crashed again. moving DNS for dumps.wikimedia.org over to storage2. it had a lighty pointing to a now-empty backups directory; pointed it at the currently-used dir for dump storage instead.
- 17:00 and earlier -- some network issue with PowerMedium? large packets dying on routes through HGTN. mark did something to the network to cut our PowerMedium route? can't reach 66.230.200.* network from outside now; secure.wikimedia.org and planet.wikimedia.org at least using these addrs publicly still
- 08:45 mark, JeLuF: Routing knams-pmtpa switched to another provider, dns switched to "normal". Everything looks fine. During the "knams-down" time, request rate in pmtpa dropped, needs further investigation.
- 08:30 JeLuF: Lost connection pmtpa-knams, switched DNS to scenario "knams-down".
- 07:23 Tim: hume's v1 partition is 92% full, set up a symlink farm to start filling v2.
- 01:18 brion: secondary problem was some kind of overload on avicenna (pmtpa text LVS). river managed to tweak it into submission by taking it off net for a couple minutes. things appear up for now
- 01:06 brion: packet loss down from 33%+ to about 4%... can reach ganglia consistently, still some outage issues
- 00:18ish brion: major net issues in tampa? lots of packet loss; cpu down dramatically
March 17
- 19:48 brion: fixed upload dir on wikimania2008wiki
- 18:00 jeluf: srv51 is down. Replaced by memcached on srv65.
March 16
- 15:28 mark: Renumbered mchenry to the new v4 IP range
- 14:47 mark: Renumbered sanger to the new v4 IP range
- 14:18 mark: Bound IPv6 IPs on csw5-pmtpa's vlan routing interfaces - so most if not all servers will have acquired one or more IPv6 addresses. Renumbered khaldun to the new IP range and published its IPv6 record as AAAA record in DNS (for apt.wikimedia.org)
March 13
- 21:19 mark: Shutdown srv150's switchport, it has a ro fs and doesn't react to IPMI.
- 19:55 brion: reenabled search result context for anons on LuceneSearch wikis
- 04:28 Tim: enabled CentralAuth in dry-run mode on all wikis
March 12
- 21:26 brion: de.labs thumbs mysteriously broken again. who knows...
- 21:05 brion: poked at thumb-handler.php ... it was apparently pointing to the wrong backend URL for de-labs (de.labs) etc. Hacked in a special case for non-wikipedias.... which may well be even more broken. Look at this again... :P
- 17:10 brion: dissolved mediawiki-ng-l list. Too much forced moderation and no mission meant it was never seriously used.
March 11
- 18:57 brion: swapped LuceneSearch for MWSearch plugin on test.wikipedia.org and commons.wikimedia.org. Search front-end now includes thumbnails for image page results, which is kind of handy. :) Will do a little more testing before swapping wholesale; there are still UI differences and things which should be improved.
March 10
- 20:25 brion: arbcom_enwiki was missing from dblist files (except private.dblist). Added it back to all.dblist and special.dblist, works again.
- 19:07 brion: installed svn 1.4.6 on zwinger in /usr/local/svn; use this to svn up if the old version keeps whining
- 18:36 brion: zwinger's old copy of svn (1.2.3) has decided that it can't deal with something in our repository (extensions/DumpHTML/wm-scripts). :(
- 18:02 brion: removed the evil transclusion at Server admin log/All which caused updates of this log page to be insanely slow, by forcing links refresh of 12 huge log pages all combined into a giant page of death
- 17:47 brion: set chapcom lang to 'en' instead of defaulting to 'chapcom'. special: page links now working instead of ':Userlogin' etc. not sure why it did that; seemed fine in command-line tests
- 17:32 brion: reported language config issues on chapcom; examining
- 16:54 brion: fixed spider blocks. :P
- 16:37 brion: blocked an evil spider IP from mayflower; SVN http back up
- 16:28 brion: mayflower overloaded in some way; load avg 147 o_O
March 9
- 17:13 brion: en.labs.wikimedia.org and de.labs.wikimedia.org have FlaggedRevs testing configurations enabled. Still doing imports from en.wikibooks on en.labs, though. (Internal names are de_labswikimedia and en_labswikimedia.)
March 8
- 08:45 Tim: cluster14 was inexplicably missing mywiki. No data loss, it's been missing since the cluster was created, apparently. Added it.
- 11:09 Tim: srv81 is down. Removed it from external storage rotation.
- 11:00 brion: updated hawhaw; WAP portal now looks nice in Mobile Safari on the iPhone SDK simulator app
March 7
- 23:34 brion: importing de.wikibooks to de.labs.wikimedia.org....
- 21:59 brion: setting up stub en.labs.wikimedia.org and de.labs.wikimedia.org for flaggedrevisions testing
- 12:05 domas: srv25 has 40GB of lucene logs. disk full.
- 12:00 domas: resynced samuel from db1, db5 remaining
- 11:46 Tim: running dumpHTML on hume with 16 threads
- 08:00 domas: s3 master switch, samuel_bin_log.171:224349875 to adler-bin.002:3522
- 00:28 Tim: Updated zwinger:/etc/ntp.conf
- 00:19 Tim: updated MySQL grants for new subnet
March 6
- 23:26 Tim: added 208.80.152.128/26 to suda:/etc/exports and srv1:/var/yp/securenets. Created checklist at IP addresses
- 06:49 brion: noticed zwinger can't access database servers since the IP renumbering. :P
- 00:48 RobH: hume installation complete.
March 5
- 23:57 brion: leuksman.com was offline for a while (net problems at sago)
- 14:12 RobH: srv65 back online.
- 13:59 RobH: srv150 back online from kernel panic.
- 13:38 RobH: upgraded kernel in storage2
- 13:28 RobH: srv127 back online from kernel panic.
- 13:27 RobH: upgraded kernel in storage1
March 4
- 22:30 mark: Changed dhcpd.conf on zwinger, firewall setup on khaldun and dhcp forwarding on csw5-pmtpa to make installs work from the new IP ranges.
- 22:00 mark: Migrated zwinger onto the new IP range, changed its DNS entry to 208.80.152.189.
- 19:08 brion: took out read-only
- 19:05ish brion: put in temporary limit of Special:Newpages to 200; lots of reads with limit 5000 on dewiki were bogging down holbach. DB overload cleared up.
- 18:53 brion: taking s2 and s2a to read-only temporarily while we work out this overload issue
- 18:40 jeluf: DB servers for s2a cluster (dewiki) overloaded. ixia logs
[5100027.207458] Machine check events logged
- 18:25 (large CPU spike up on mysql and apaches; continuing...)
- 11:00 domas: db1 and adler are running compacted/fixed schema/tablespaces - next targets are db5 and samuel, master switch imminent
March 3
- 21:18 brion: removed the special-case in lucene configuration for testwiki to use srv79. That seems to have an experimental version of the lucene server which is currently broken. search now works on testwiki
- 18:57 mark: srv65 went offline, taking its memcached instance with it. Replaced the memcached slot by the last spare one.
- 16:00 RobH: yf1019 kernel upgraded.
- 16:00 RobH: yf1018 kernel upgraded.
- 15:36 RobH: yf1016 kernel upgraded.
- 15:36 RobH: yf1015 kernel upgraded.
- 15:27 RobH: henbane kernel upgraded.
- 14:59 RobH: sage kernel upgraded.
- 14:51 RobH: mayflower kernel upgraded.
- 14:41 RobH: hawthorn kernel upgraded.
- 14:35 RobH: lily kernel upgraded.
March 2
- 11:30 Tim: Not sure what the deal was. Cleaned up the mount options a bit: reduced timeout, switched from TCP to UDP mode (lost TCP connections cause temporary hangs), removed "intr" (useless when in soft mode). Remounted.
- 11:17 Tim: amane immediately locked up again due to hang on NFS read of storage1. Unmounted /mnt/upload4 temporarily to restore service.
- 11:09 Tim: restarted lighttpd on amane, was broken
February 29
- 21:15 RobH: restarted ssh and put srv61 back in pool.
- 21:15 RobH: brought srv130 back from kernel panic.
- 19:56 RobH: Racked hume, new static-dump server. DRAC: 10.1.252.190 DHCPD needs modification to netboot this subnet.
- 14:26 Tim: Removed /etc/cron.daily/find from all ubuntu apache servers that had it. Killed all long-running sort commands.
February 28
February 27
- 22:22 RobH: Shutdown srv11-srv20 + srv6. (Old, warranty expiring, causing heat issues in that rack, per mark)
- 18:34 RobH: upgraded kernel on will
- 18:23 RobH: upgraded kernel on mchenry & sanger
- 18:05 RobH: upgraded kernel on bayle
- 18:00 RobH: upgraded kernel on khaldun
- 17:45 RobH: upgraded kernel on srv9 & srv10
- 17:37 RobH: upgraded kernel on yongle
February 26
- 23:59 RobH: upgraded kernel on yf1009
- 22:48 RobH: upgraded kernel on yf1005 to yf1008
- 22:14 brion: rebuilding enwiki-20080103-pages-meta-current.xml.bz2 (as -2 for now) on srv31
- 21:30 to 22:10 RobH: upgraded kernel on yf1002 to yf1004
- 19:45 RobH: fixed replication on srv77 to srv8
- 14:12 Tim: started lighttpd on benet, had crashed again
February 25
- 23:51 brion: someone mucked up wgRemoveGroups on srwiki, listing pretty much every permission they could think of. pared it down to array( 'bot', 'patroller', 'rollbacker', 'autopatrolled')
- 20:00 RobH: yf1001 security updates.
- 19:58 RobH: yf1000 security updates.
- 19:45 brion: maurus disk space filled up for a bit; there's a 39gb log file in /usr/local/search/log. Freed up some space from old index data; recommend adding some log rotation to search servers!
February 22
- 21:33 RobH: srv171-srv175 kernel and security updates.
- 20:32 RobH: srv161-srv170 kernel and security updates.
- 20:00 RobH: srv151-srv160 kernel and security updates.
- 16:53 RobH: sq33-sq40 kernel and security updates.
- 16:34 RobH: sq24-sq32 kernel and security updates.
- 16:09 RobH: sq16-sq23 kernel and security updates.
- 15:52 RobH: sq41-sq50 kernel and security updates.
- 05:15 Tim: Applying schema updates patch-page_props.sql and patch-ipb_by_text.sql
- 02:00 - 04:45 mark: Migration of office DSL connections to Cisco 2841 - server is policy routed over the lower speed connection.
February 21
- 22:42 RobH: sq10 - sq15 updated (kernel and security updates.)
- 21:45 RobH: sq2 - sq9 updated (kernel and security updates.)
- 20:08 RobH: sq1 updated (kernel and security updates.)
February 20
- 23:53 RobH: knsq28 seems to not be rebuilding. Letting mark know.
- 23:45 RobH: Upgraded kernel and such on knsq16 through knsq22 (apt-get upgrade). Not distro upgrade.
- 23:21 RobH: Upgraded kernel and such on knsq8 through knsq15 (apt-get upgrade). Not distro upgrade.
- 22:15 RobH: fuchsia back up by mark. All traffic remains routed to PMTPA (while rob finishes squid upgrades.)
- 22:15 RobH: fuchsia down. All traffic routed to PMTPA.
- 21:56 RobH: Upgraded kernel and such on knsq23 through knsq26 (apt-get upgrade). Not distro upgrade.
- 21:30 RobH: Upgraded kernel and such on knsq1 through knsq7 (apt-get upgrade). Not distro upgrade.
February 18
- 21:15 brion: manually mounted upload4 on srv189. Was not created in /mnt or listed in fstab.
February 17
- 7:30 jeluf: suda's root FS was 100% full. Changed logrotate.conf to rotate logs daily instead of weekly, added switch.log to the log rotation.
February 13
- After 18:44 RobH: Reinstalled db1 OS.
- 18:44 RobH: rebooted srv37 from crash, back online.
- 18:35 RobH: Restarted apache on srv166 per domas.
- 15:03 RobH: storage2 disk 12 replaced and is rebuilding.
February 11
- 03:38 Tim: srv61 is refusing ssh connections, still serving HTTP. Depooled.
February 10
- 10:40 domas: db1 still needs fixing..
- 07:30 Tim: upgrading the remaining squids with ~tstarling/squid/squid-upgrade.php. The script will upgrade one squid every two hours, in random order. This mitigates the effect of the cache clear for items with a Vary header (i.e. text). sq17 and sq18 were done during script testing.
- 06:18 Tim: upgraded squid on sq16, including XVO feature
- 05:40 Tim: srv150 accepts connections on SSH or HTTP and then hangs for a long time. Removed it from mediawiki_installation and apaches and depooled it.
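The rollout described in the 07:30 entry (one squid every two hours, in random order) can be sketched like this. It is a Python illustration of the scheduling idea only, not the contents of squid-upgrade.php; the function name and the injectable random generator are assumptions.

```python
import random
from datetime import datetime, timedelta


def upgrade_schedule(hosts, start, interval=timedelta(hours=2), rng=None):
    """Return (time, host) pairs: one host per interval, in random order."""
    rng = rng or random.Random()
    order = list(hosts)
    rng.shuffle(order)
    # Spacing the upgrades spreads out the cache clears for Vary'd objects.
    return [(start + i * interval, host) for i, host in enumerate(order)]


plan = upgrade_schedule(
    ["sq1", "sq2", "sq3", "sq4"], datetime(2008, 2, 10, 7, 30)
)
```

Because only one cache is emptied at a time, the remaining squids keep serving text hits while the upgraded one refills.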
February 8
- 01:40 Tim: added "hidden" table (oversight) on wikis that didn't have it. Added it to addwiki.php.
February 7
- 17:43 mark: Wrote a Mailman withlist script to change the embedded web_page_url variable to use https, as this is not possible using config_list.
- 15:00ish to 16:30ish RobH: lily lighttpd.conf changed to support/redirect mailman with SSL certificate.
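A rough sketch of what a withlist-style function for the 17:43 entry might look like. This is not mark's script: the function name is hypothetical, and it assumes the list object exposes web_page_url and Save() as Mailman 2 list objects do, and that the caller has already locked the list (for example by running it under withlist with its lock option).

```python
def use_https(mlist):
    """Rewrite the list's web_page_url to https and save if it changed."""
    url = mlist.web_page_url
    if url.startswith("http://"):
        mlist.web_page_url = "https://" + url[len("http://"):]
        # Save only when something changed; the list is assumed to be locked.
        mlist.Save()
    return mlist.web_page_url
```

The function is idempotent, so running it twice against the same list leaves an https URL alone.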
February 6
- 17:45 brion: updated bugzilla to 3.0.3
- 16:13 Tim: MW configuration changes:
- Renamed some wikimedia-specific globals from $wgXxxx to $wmgXxxx. Some of them had rather obvious names that could potentially conflict with extension configuration in the future.
- Moved passwords and private keys out to PrivateSettings.php
- Changed SiteConfiguration.php to allow "tags" such as "fishbowl" and "private" to be applied to wikis. These tags can be used to specify settings in InitialiseSettings.php.
- Used these tags to full effect by using fishbowl.dblist and private.dblist to set the fishbowl and private tags, and then removing all the fishbowl/private wiki lists from InitialiseSettings.php. This will make adding new private wikis easier.
- Fixed some whitespace and removed some old commented-out code
- Moved various ancient subdirectories of /h/w/common to /h/w/junk/common
- 14:43 RobH: srv166 had a memory error, reseated memory, and restarted server.
- 14:22 RobH: storage2 disk 2 replaced. Not rebuilding? (please show rob how to force this.)
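The tag mechanism from the 16:13 configuration entry can be sketched roughly as below. This is a Python illustration, not SiteConfiguration.php: the wiki names, the setting and the precedence order (exact wiki, then tag, then default) are all assumptions.

```python
def tags_for(wiki, dblists):
    """Return the tags whose dblist contains this wiki."""
    return [tag for tag, members in dblists.items() if wiki in members]


def resolve_setting(values, wiki, tags):
    """Pick a value: exact wiki first, then a matching tag, then the default."""
    if wiki in values:
        return values[wiki]
    for tag in tags:
        if tag in values:
            return values[tag]
    return values.get("default")


# Hypothetical data, for illustration only.
dblists = {
    "private": {"exampleprivatewiki"},
    "fishbowl": {"examplefishbowlwiki"},
}
anonymous_edit = {"default": True, "private": False, "fishbowl": False}

private_value = resolve_setting(
    anonymous_edit, "exampleprivatewiki", tags_for("exampleprivatewiki", dblists)
)
```

Adding a new private wiki then only means adding its name to the private dblist; no per-setting wiki lists need touching.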
February 4
- 21:11 RobH: isidore now running bugzilla.wikimedia.org with a SSL Cert.
February 3
- 11:47 mark: lighttpd disappeared on storage1 and was also inaccessible from the new IP range due to an old and broken firewall. Why was it there? Removed it.
- 11:25 mark: Move traffic back to pmtpa
February 2
- 20:30 mark: Added new service IPs to bayle and mchenry, the pmtpa DNS resolvers, and a new service IP for ns0.wikimedia.org on bayle.
- 20:15 mark: Forgot that we have some DNS records pointing at 66.230.200.100 directly, so those were down for a while until I updated DNS.
- 17:52 mark: Moved all text.* traffic to knams as well
- 17:04 mark: Put Canadian traffic on pmtpa, to seed those caches a bit
- 14:40 jeluf: storage1 overloaded. Killed static dump processes on srv136, srv135, srv134, srv133, srv132, srv131, srv42
- 13:15 mark: Updated upload Squid configs to use the new pmtpa IP range, causing immediate pmtpa CARP cache clear, but mitigated by the knams squids.
- 11:37 mark: Moved all upload.* traffic to knams, to prevent an effective CARP cache clear due to IP address changes swamping amane.
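Why renumbering can act as an effective CARP cache clear, sketched in Python under stated assumptions: each URL goes to whichever peer scores highest on a hash of the peer's address and the URL, so changing the addresses reshuffles the mapping even though the machines are the same. This uses MD5-based highest-score selection as a stand-in and is not Squid's actual CARP hash; the addresses and URLs are made up.

```python
import hashlib


def pick_peer_index(url, peers):
    """Return the index of the peer with the highest hash score for this URL."""
    def score(index):
        key = (peers[index] + "|" + url).encode("utf-8")
        return hashlib.md5(key).hexdigest()
    return max(range(len(peers)), key=score)


def remapped_fraction(urls, old_peers, new_peers):
    """Fraction of URLs that land on a different machine after renumbering."""
    moved = sum(
        1 for url in urls
        if pick_peer_index(url, old_peers) != pick_peer_index(url, new_peers)
    )
    return moved / len(urls)


# Hypothetical addresses: the same four machines, listed in the same order.
old_peers = ["66.230.200.1", "66.230.200.2", "66.230.200.3", "66.230.200.4"]
new_peers = ["208.80.152.1", "208.80.152.2", "208.80.152.3", "208.80.152.4"]
urls = ["/wikipedia/commons/thumb/%d.png" % n for n in range(1000)]
fraction = remapped_fraction(urls, old_peers, new_peers)
```

With four peers, roughly three quarters of the URLs would be expected to move to a machine that has never cached them, which matches the concern about swamping amane with misses.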
February 1
- 20:19 brion: reverted r30405 which broke boardvote and re-enabled the ext
- 20:10 brion: broken boardvote extension... was breaking all special pages; temporarily disabled the ext
- Feb 1 20:08:18 kluge httpd[12208]: PHP Fatal error: Call to undefined function wfBoardVoteInitMessages() in /usr/local/apache/common-local/php-1.5/extensions/BoardVote/GoToBoardVote_body.php on line 3
- 11:15 domas: restarted lighty on benet, did run away?
January 31
- 10:53 Tim: deleted binlogs on srv146
- 00:12 brion: svn.wikimedia.org resolved to old 145.* addy from anthony... since that doesn't work anymore, this is making svn access a pain for seeing about updating the wap interface. Tried to update resolv.conf with current values from zwinger, but still no dice.
- have temporarily resorted to /etc/hosts hack
January 30
- 22:25 brion: various reports of "blank pages" and/or 503 errors from Peru. Nothing narrowed down yet on our end.
- 20:35 brion: switched Apple Dictionary app backend to OpenSearch. bumped MaxClients on yongle up to 20, may resolve the 'gets really slow for no reason' issue
- 20:10 brion: enabling TitleKey sitewide. (Indexes should be rebuilt overnight to ensure they're up to date for changes in the last 15 hours.)
- 05:54 brion: building TitleKey indexes generally (not fully enabled yet so opensearch isn't useless until done; want them built first)
- 05:25 brion: experimenting with TitleKey ext on testwiki
- 04:50 Tim: Fixed thumb-handler to not attempt to "cache" files locally on storage1. Removed bacon from /h/w/upload-scripts/sync.
January 29
- 21:58 mark: Raised persistent_request_timeout on the backend squids from the default 2 minutes to 10 minutes, to make existing connection reuse even more likely between all communicating pairs of squids
- 10:30 Tim: Setting up storage1 as a static HTML dump storage server. Installed ganglia on it.
- 09:10 Tim: updatedb was running on storage1, attempting to index millions of files. Killed it, added /export to PRUNEPATHS, and re-ran it. Seems to work.
January 28
- 22:30 brion: csw5-pmtpa has been spewing alarms about 5/3 and 5/4 optical connections for a while. :(
- domas says this is harmless -- an unused port
- 18:50 brion: svn revert'd some live hack in Parser.php which apparently added a $clearState parameter to Parser::internalParse() which never gets passed to it, thus spewing error logs with billions of lines of PHP warnings
January 24
- 21:00 jeluf: installed lighty on storage1, configured squid so that all dewiki image requests and all commons thumb image requests go to storage1. Images fast again, backend request rate down to normal level.
- 18:40 brion: images still very slow :(
- 14:00 mark: Assigned new, extra IP addresses to Florida Squids, and added the new IP range to all squid.conf's. Also removed the old knams IP range, which has been unused over 2 months. This seems to have caused a massive cache clear in knams upload squids, causing a huge increase of image requests and overload of Amane. A real explanation is as of yet unknown... speculation is that old objects in knams caches have been invalidated somehow because they had the (now removed) old IP prefix in their caching info.
January 23
- 02:09 Tim: reverted refresh_pattern changes in squid (ignore-reload) to fix user JS/CSS problems. With Brion's blessing.
January 22
- 20:46 mark: Set $wgUserEmailReplyTo back to false, as mchenry will now rewrite envelope sender addresses from MediaWiki to wiki@wikimedia.org
- 16:12 Rob: srv11 back online
- 15:55 Rob: srv130,srv132,srv134 back online, see detailed server pages for crash information.
January 21
- 12:30 jeluf: mark reports twice as many backend requests as usual. live-patched opensearch_desc.php to send proper Cache-Control headers. Needs to be updated in SVN. Backend request rate back to normal levels.
- 07:10 brion: set $wgUserEmailUseReplyTo to protect against SPF failures and privacy leakage due to bounce messages in user-to-user emails. (Caused by sSMTP, which forces the envelope sender and From: address to be the same.) This uglifies user-to-user emails but keeps the same. In the long term I recommend replacing sSMTP with a minimal postfix or something like we used to use, which should work in a safe manner.
- 03:24 brion: taking srv184 out of apache rotation to test ssmtp config issues
January 20
- 21:45 jeluf: unpooled srv183, investigated why NFS mounts were missing after a reboot. Seems to be related to https://bugs.launchpad.net/ubuntu/+source/sysvinit/+bug/44836 . The fix suggested in that bug seems to help. Have to package it tomorrow.
- 21:40 brion: mounted NFS shares on srv183
- 21:39 brion: srv183 was rebooted 2h55m ago. its apaches are running, but NFS shares aren't mounted. nothing works properly. led to several reports of captcha failures, and might have led to some upload-related issues
- 18:30 jeluf: rebooted srv183, un-killable convert jobs were blocking port 80
- 18:29 brion: apache not restarting on srv164, srv176, srv183, srv184 -- "(98)Address already in use: make_sock: could not bind to address 0.0.0.0:80"
- 18:25 brion: killed job runner jobs on srv90-99, they were the error-spewers. syslog is clean. :D
- 18:18 brion: several apaches in srv90-99 range still spewing errors, but seem to have the right file. stuck apc?
- 18:11 brion: removed the random '$key' parameter from MessageCache::transform
- 18:06 brion: space was filled by /var/log/messages and /var/log/syslog; runaway PHP warnings from some live hack extra parameter. truncating the log files and resyncing
- 17:56 brion: turned off their apaches. looking for the space culprit.... they have most of their space wasted in a /a partition and a tiny / where all the stuff is
- 17:53 brion: lots of srv's in 150-190 range out of disk space; broken (LocalRepo.php update failed)
- 11:12 brion: file histories were broken for a few minutes (bad commit got through)
- 07:08 brion: enabling $wgFileRedirects on test.wikipedia
January 19
- 06:29 and a bit before - brion: some brief segfaulting due to a bad recursion in my SiteConfiguration update. Note: non-string values in InitialiseSettings.php (false, null, ints, etc) will now work.
January 18
- 22:46 brion: wikibugs was idle for an hour or so due to being autoblocked for bounces again...
- 22:40 brion: srv11 is hung; no ssh, HTTP opens but doesn't respond
- 18:40 brion: created wikimedia-sf mailing list
January 16
- 22:30ish brion: someone tried to delete sandbox on en.wikipedia, leading to various DB error warnings (transactions full) and breakage of most editing for nearly an hour. Have hacked in a 5000-revision limit on deletions, will prettify it shortly.
- 21:39 brion: Added a default "Cache-control: no-cache" header on output in CommonSettings.php. This will protect PHP Fatal Error blank pages and such from getting cached due to a 200 result code and lack of cache-control headers. Actual cache-control output will override the default one. (Had to manually purge a Special:Random on en.wikipedia... various issues with editing etc)
- 07:32 brion: fixed IRC recentchanges name for wikimania2008.wikimedia (was sending to the 2007 channel)
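The 21:39 entry above describes a default "Cache-control: no-cache" header that any real cache-control output overrides. A minimal Python sketch of that default-then-override idea follows. It is an illustration only, not the actual CommonSettings.php code; the function name `build_headers` and the dict-based header model are assumptions.

```python
def build_headers(page_headers=None):
    """Return response headers, starting from a safe default.

    Hypothetical helper: header names are compared case-insensitively
    by storing them lowercased.
    """
    # Default: never cache, so a blank fatal-error page sent with a
    # 200 status and no other headers is not stored by the squids.
    headers = {"cache-control": "no-cache"}
    # Anything the page sets itself replaces the default.
    for name, value in (page_headers or {}).items():
        headers[name.lower()] = value
    return headers
```

A page that emits nothing keeps the no-cache default, while a normal page view that sends its own Cache-Control value replaces it.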
January 15
- 21:00 jeluf: removed memcached on srv56,57,58 on rainman-sr's request. Memcached was causing problems with the indexer.
January 14
- 21:33 brion: clearing a giant watchlist on users' request; may cause some s1 replag
- 21:00ish brion: we seem to be getting blank PHP fatal error pages stuck in squid caches. :( latest php should mark these as 500...
- 20:00 Rob: All yaseo upload squids upgraded.
- 19:45 Rob: All yaseo text squids upgraded.
- 18:45 Rob: Upgraded squid on sq41-sq50
- 17:45 Rob: Upgraded squid on sq11-sq15
- 17:00 Rob: Upgraded squid on sq6-sq10
- 17:00 Rob: Upgraded squid on sq1-sq4
- 16:20 Rob: Upgraded squid on sq32-sq40
- 16:20 Rob: Upgraded squid on sq24-sq31
- 16:03 Rob: Upgraded squid on sq16-sq23
- 15:26 Rob: Upgraded squid on knsq16,knsq17, knsq18, knsq20, knsq21, knsq22.
- 15:00 Rob: Upgraded squid on knsq8,knsq8, knsq9, knsq10, knsq11, knsq12, knsq13, knsq14, knsq15
January 13
- 20:34 mark: Enabled access log on mayflower's apache (why was it disabled?)
- 18:12 mark: Upgraded all knams text squids to new squid version
- 17:30 mark: Set refresh_pattern . 60 50% 3600 ignore-reload on all text squids to override reload headers
- 17:00 mark: Upgraded knsq1 to the new Squid
- 16:15 mark: Brought up knsq19, and installed a new squid 2.6.18-1wm1 on it, including Domas' Accept-Encoding normalization patches. If you notice anything weird, notify Mark or Domas...
- 04:25 Tim: Updated MW from r29455 to r29682.
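For reference, the refresh_pattern line in the 17:30 entry above takes a minimum age, a last-modified percentage, and a maximum age (both ages in minutes). The Python sketch below is a simplified model of how such a rule classifies an object that carries no explicit expiry information. The function name `is_fresh` and its parameters are assumptions made for illustration; real Squid handles more cases (explicit Expires headers, client reload requests, and the ignore-reload option itself) that are not modeled here.

```python
def is_fresh(age_s, lm_age_s=None, min_minutes=60, percent=50, max_minutes=3600):
    """Simplified freshness check modeled on 'refresh_pattern . 60 50% 3600'.

    age_s:    seconds since the object was fetched into the cache
    lm_age_s: seconds between Last-Modified and the fetch time,
              or None if the object has no Last-Modified header
    """
    # Older than the maximum age: always stale.
    if age_s > max_minutes * 60:
        return False
    # LM-factor rule: fresh while the cached age is a small enough
    # fraction of how old the object already was when fetched.
    if lm_age_s is not None and lm_age_s > 0:
        if age_s / lm_age_s < percent / 100:
            return True
    # Younger than the minimum age: fresh.
    if age_s <= min_minutes * 60:
        return True
    return False
```

Under this model, an object cached 30 minutes ago with no Last-Modified header counts as fresh, while the same object two hours after the fetch counts as stale.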
January 12
- 11:00 domas: removing titleblacklist. there's a certain level of crap beyond which I won't fix stuff.
- 03:10 brion: importing checkuser logs
- 02:59 brion: upgrading to current CheckUser code (per-wiki logs for now)
January 11
- 12:00 domas: installed lighty on zwinger for ganglia use
January 10
- 17:00 domas: disabled CentralNotice
January 9
- 21:00 domas: increased revtext ttl to 1w, fixed parser cache ttl problem, where magicwords were causing most of enwiki (and other template-aware wiki) pages to be cached for 1h only (r29511)
- 09:00 domas: memcached arena increased to 158GB, 79 active nodes, ES instances getting lower buffer pools on servers running memcached (1000M to 100M), full cache drop
- 00:14 brion: now that we've expanded storage2's size and removed a bunch of useless thumb and temp files from the amane backup, there's room again; have restarted dump runs, including a continuation run of enwiki (which should start up from meta-current)
January 8
- 22:33 jeluf: extended storage2:/export by 650 GB
- 22:03 brion: uploads broken for several minutes by r29361 (reverted)
- 21:48 brion: srv17 and srv18 are whining about high temperatures
- 21:00 Rob: srv17 segfaults in httpd, resynced and restarted apache.
- 17:10 Rob: srv78 Kernel Panic, rebooted and back online.
- 16:45 Rob: srv177 cpu overheating, pulled, replaced thermal paste, back online.
- 16:20 Rob: srv15 cpu overheating, pulled, replaced thermal paste, back online.
- 16:15 Rob: srv189 back in rotation.
- 14:59 Rob: srv189 reinstalled, needs apache setup.
- 14:54 Rob: srv130 rebooted and back online.
- 07:50 domas: added db8 and db10 to ganglia
January 7
- 08:34 Tim: mounted upload4 on albert for static.wikipedia.org symlinks
January 6
- 21:33 mark: Enabled TCP ECN on lily and mayflower
- 21:03 mark: Added mayflower's EUI-64 address to DNS - svn may use it.
- 20:06 mark: Added a v6 service IP to lily (lists.wikimedia.org) and put it in DNS.
January 4
- 00:34 brion: restarting backup syncs from amane to storage2; was broken by bad script... trimming more thumbnails out of storage2 to clear up space
January 3
- 19:29 brion: starting enwiki dump on srv42, will continue with general worker thread
- 19:13 brion: Setting up srv42 to run dump worker threads as well as general batches, since it seems idle.
- 15:05 mark: Rebooted fuchsia with an LVS optimized kernel, moved all LVS services back onto it
- 13:45 mark: LVS on fuchsia overloaded, moved LVS for upload to mint
- 00:26 brion: http://download.wikimedia.org/ now running off storage2. will restart dump runs aiming at it until we have a better place to put the backend (with benet still not checked for its disk issues)
Archives
- Server admin log/Archive 1 (2004 Jun - 2004 Sep)
- Server admin log/Archive 2 (2004 Oct - 2004 Nov)
- Server admin log/Archive 3 (2004 Dec - 2005 Mar)
- Server admin log/Archive 4 (2005 Apr - 2005 Jul)
- Server admin log/Archive 5 (2005 Aug - 2005 Oct)
- Server admin log/Archive 6 (2005 Nov - 2006 Feb)
- Server admin log/Archive 7 (2006 Mar - 2006 Jun)
- Server admin log/Archive 8 (2006 Jul - 2006 Sep)
- Server admin log/Archive 9 (2006 Oct - 2007 Jan)
- Server admin log/Archive 10 (2007 Feb - 2007 Jun)
- Server admin log/Archive 11 (2007 Jul - 2007 Dec)