Server admin log/Archive 20


Revision as of 17:27, 22 February 2008

February 22

  • 16:53 RobH: sq33-sq40 kernel and security updates.
  • 16:34 RobH: sq24-sq32 kernel and security updates.
  • 16:09 RobH: sq16-sq23 kernel and security updates.
  • 15:52 RobH: sq41-sq50 kernel and security updates.
  • 05:15 Tim: Applying schema updates patch-page_props.sql and patch-ipb_by_text.sql
  • 02:00 - 04:45 mark: Migration of office DSL connections to Cisco 2841 - server is policy routed over the lower speed connection.

February 21

  • 22:42 RobH: sq10 - sq15 updated (kernel and security updates.)
  • 21:45 RobH: sq2 - sq9 updated (kernel and security updates.)
  • 20:08 RobH: sq1 updated (kernel and security updates.)

February 20

  • 23:53 RobH: knsq28 seems to not be rebuilding. Letting mark know.
  • 23:45 RobH: Upgraded kernel and such on knsq16 through knsq22 (apt-get upgrade). Not distro upgrade.
  • 23:21 RobH: Upgraded kernel and such on knsq8 through knsq15 (apt-get upgrade). Not distro upgrade.
  • 22:15 RobH: fuchsia back up by mark. All traffic remains routed to PMTPA (while rob finishes squid upgrades.)
  • 22:15 RobH: fuchsia down. All traffic routed to PMTPA.
  • 21:56 RobH: Upgraded kernel and such on knsq23 through knsq26 (apt-get upgrade). Not distro upgrade.
  • 21:30 RobH: Upgraded kernel and such on knsq1 through knsq7 (apt-get upgrade). Not distro upgrade.

February 18

  • 21:15 brion: manually mounted upload4 on srv189. Was not created in /mnt or listed in fstab.

February 17

  • 07:30 jeluf: suda's root FS was 100% full. Changed logrotate.conf to rotate logs daily instead of weekly, added switch.log to the log rotation.

February 13

  • After 18:44 RobH: Reinstalled db1 OS.
  • 18:44 RobH: rebooted srv37 from crash, back online.
  • 18:35 RobH: Restarted apache on srv166 per domas.
  • 15:03 RobH: storage2 disk 12 replaced and is rebuilding.

February 11

  • 03:38 Tim: srv61 is refusing ssh connections, still serving HTTP. Depooled.

February 10

  • 10:40 domas: db1 still needs fixing..
  • 07:30 Tim: upgrading the remaining squids with ~tstarling/squid/squid-upgrade.php. The script will upgrade one squid every two hours, in random order. This mitigates the effect of the cache clear for items with a Vary header (i.e. text). sq17 and sq18 were done during script testing.
  • 06:18 Tim: upgraded squid on sq16, including XVO feature
  • 05:40 Tim: srv150 accepts connections on SSH or HTTP and then hangs for a long time. Removed it from mediawiki_installation and apaches and depooled it.
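
The staggered upgrade described in the 07:30 entry can be sketched roughly as follows. This is not the actual squid-upgrade.php (which is PHP and not shown in this log); it is a minimal Python illustration of "one host every two hours, in random order, skipping hosts already done". The host range and the `upgrade_schedule` helper are assumptions for the example.

```python
import random
from datetime import datetime, timedelta


def upgrade_schedule(hosts, already_done, start, interval_hours=2, seed=None):
    """Return (time, host) pairs: one remaining host per interval, in random order."""
    remaining = [h for h in hosts if h not in already_done]
    rng = random.Random(seed)  # seed only makes the example reproducible
    rng.shuffle(remaining)
    step = timedelta(hours=interval_hours)
    return [(start + i * step, host) for i, host in enumerate(remaining)]


# Hypothetical host list; sq16, sq17 and sq18 were already upgraded per the log.
hosts = [f"sq{n}" for n in range(16, 24)]
done = {"sq16", "sq17", "sq18"}
schedule = upgrade_schedule(hosts, done, datetime(2008, 2, 10, 7, 30), seed=1)

for when, host in schedule:
    print(when.strftime("%Y-%m-%d %H:%M"), host)
```

Spacing the upgrades out means only one cache is cold at any time, so the remaining squids keep absorbing most requests while each upgraded one refills.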

February 8

  • 01:40 Tim: added "hidden" table (oversight) on wikis that didn't have it. Added it to addwiki.php.

February 7

  • 17:43 mark: Wrote a Mailman withlist script to change the embedded web_page_url variable to use https, as this is not possible using config_list.
  • 15:00ish to 16:30ish RobH: lily lighttpd.conf changed to support/redirect mailman with SSL certificate.

February 6

  • 17:45 brion: updated bugzilla to 3.0.3
  • 16:13 Tim: MW configuration changes:
    • Renamed some wikimedia-specific globals from $wgXxxx to $wmgXxxx. Some of them had rather obvious names that could potentially conflict with extension configuration in the future.
    • Moved passwords and private keys out to PrivateSettings.php
    • Changed SiteConfiguration.php to allow "tags" such as "fishbowl" and "private" to be applied to wikis. These tags can be used to specify settings in InitialiseSettings.php.
    • Used these tags to full effect by using fishbowl.dblist and private.dblist to set the fishbowl and private tags, and then removing all the fishbowl/private wiki lists from InitialiseSettings.php. This will make adding new private wikis easier.
    • Fixed some whitespace and removed some old commented-out code
    • Moved various ancient subdirectories of /h/w/common to /h/w/junk/common
  • 14:43 RobH: srv166 had a memory error, reseated memory, and restarted server.
  • 14:22 RobH: storage2 disk 2 replaced. Not rebuilding? (please show rob how to force this.)
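
The tag mechanism from the 16:13 entry can be illustrated with a small sketch. The real logic lives in SiteConfiguration.php and InitialiseSettings.php (PHP) and is not reproduced in this log; the Python below only shows the idea of deriving tags from list files and letting a setting be keyed by tag instead of by an explicit wiki list. The wiki names, the setting name, and the lookup order (wiki, then tag, then default) are all assumptions.

```python
def load_tags(dblists):
    """Map each wiki DB name to the set of tags whose list contains it."""
    tags = {}
    for tag, wikis in dblists.items():
        for wiki in wikis:
            tags.setdefault(wiki, set()).add(tag)
    return tags


def get_setting(settings, name, wiki, tags):
    """Resolve a setting: exact wiki name first, then a matching tag, then 'default'."""
    values = settings[name]
    if wiki in values:
        return values[wiki]
    for tag in sorted(tags.get(wiki, ())):
        if tag in values:
            return values[tag]
    return values.get("default")


# Hypothetical contents of fishbowl.dblist and private.dblist.
dblists = {
    "fishbowl": ["examplefishbowlwiki"],
    "private": ["exampleprivatewiki"],
}
tags = load_tags(dblists)

# Hypothetical setting keyed by tag rather than by a list of wikis.
settings = {
    "anonymous_read": {"default": True, "private": False},
}

print(get_setting(settings, "anonymous_read", "exampleprivatewiki", tags))
print(get_setting(settings, "anonymous_read", "examplefishbowlwiki", tags))
```

With this shape, adding a new private wiki means adding one line to the list file rather than touching every per-setting wiki list.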

February 4

  • 21:11 RobH: isidore now running bugzilla.wikimedia.org with an SSL cert.

February 3

  • 11:47 mark: lighttpd disappeared on storage1 and was also inaccessible from the new IP range due to an old and broken firewall. Why was it there? Removed it.
  • 11:25 mark: Move traffic back to pmtpa

February 2

  • 20:30 mark: Added new service IPs to bayle and mchenry being the pmtpa DNS resolvers, and a new service IP for ns0.wikimedia.org on bayle.
  • 20:15 mark: Forgot that we have some DNS records pointing at 66.230.200.100 directly, so those were down for a while until I updated DNS.
  • 17:52 mark: Moved all text.* traffic to knams as well
  • 17:04 mark: Put Canadian traffic on pmtpa, to seed those caches a bit
  • 14:40 jeluf: storage1 overloaded. Killed static dump processes on srv136, srv135, srv134, srv133, srv132, srv131, srv42
  • 13:15 mark: Updated upload Squid configs to use the new pmtpa IP range, causing immediate pmtpa CARP cache clear, but mitigated by the knams squids.
  • 11:37 mark: Moved all upload.* traffic to knams, to prevent an effective CARP cache clear due to IP address changes swamping amane.
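
Why renumbering the caches amounts to a cache clear can be shown with a toy model. CARP picks a backend by combining a hash of the URL with a hash of each member's identity, so changing the identities changes most assignments. The sketch below uses plain highest-random-weight hashing with MD5, which is a simplification and not Squid's actual CARP hash or load-factor handling; the addresses are documentation ranges, not the real pmtpa ones.

```python
import hashlib


def pick_cache(url, members):
    """Choose the member with the highest hash score for this URL."""
    def score(member):
        return hashlib.md5(f"{member}|{url}".encode()).hexdigest()
    return max(members, key=score)


# The same three machines, before and after renumbering (hypothetical IPs).
old = ["192.0.2.1", "192.0.2.2", "192.0.2.3"]
new = ["198.51.100.1", "198.51.100.2", "198.51.100.3"]

urls = [f"/thumb/{i}.png" for i in range(1000)]

# A URL "moves" if it now maps to a different physical machine (list position).
moved = sum(
    old.index(pick_cache(u, old)) != new.index(pick_cache(u, new))
    for u in urls
)
print(f"{moved} of {len(urls)} URLs map to a different machine after renumbering")

# By contrast, removing one member only remaps the URLs that member owned.
survivors = old[:2]
stable = all(
    pick_cache(u, survivors) == pick_cache(u, old)
    for u in urls
    if pick_cache(u, old) != old[2]
)
print("URLs on surviving members stay put:", stable)
```

Objects are still on disk after the renumbering, but requests for them land on a machine that never cached them, which is why the log moves traffic to knams first.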

February 1

  • 20:19 brion: reverted r30405 which broke boardvote and re-enabled the ext
  • 20:10 brion: broken boardvote extension... was breaking all special pages; temporarily disabled the ext
    • Feb 1 20:08:18 kluge httpd[12208]: PHP Fatal error: Call to undefined function wfBoardVoteInitMessages() in /usr/local/apache/common-local/php-1.5/extensions/BoardVote/GoToBoardVote_body.php on line 3
  • 11:15 domas: restarted lighty on benet, did it run away?

January 31

  • 10:53 Tim: deleted binlogs on srv146
  • 00:12 brion: svn.wikimedia.org resolved to old 145.* addy from anthony... since that doesn't work anymore, this is making svn access a pain for seeing about updating the wap interface. Tried to update resolv.conf with current values from zwinger, but still no dice.
    • have temporarily resorted to /etc/hosts hack

January 30

  • 22:25 brion: various reports of "blank pages" and/or 503 errors from Peru. Nothing narrowed down yet on our end.
  • 20:35 brion: switched Apple Dictionary app backend to OpenSearch. bumped MaxClients on yongle up to 20, may resolve the 'gets really slow for no reason' issue
  • 20:10 brion: enabling TitleKey sitewide. (Indexes should be rebuilt overnight to ensure they're up to date for changes in the last 15 hours.)
  • 05:54 brion: building TitleKey indexes generally (not fully enabled yet so opensearch isn't useless until done; want them built first)
  • 05:25 brion: experimenting with TitleKey ext on testwiki
  • 04:50 Tim: Fixed thumb-handler to not attempt to "cache" files locally on storage1. Removed bacon from /h/w/upload-scripts/sync.

January 29

  • 21:58 mark: Raised persistent_request_timeout on the backend squids from the default 2 minutes to 10 minutes, to make existing connection reuse even more likely between all communicating pairs of squids
  • 10:30 Tim: Setting up storage1 as a static HTML dump storage server. Installed ganglia on it.
  • 09:10 Tim: updatedb was running on storage1, attempting to index millions of files. Killed it, added /export to PRUNEPATHS, and re-ran it. Seems to work.

January 28

  • 22:30 brion: csw5-pmtpa has been spewing alarms about 5/3 and 5/4 optical connections for a while. :(
    • domas says this is harmless -- an unused port
  • 18:50 brion: svn revert'd some live hack in Parser.php which apparently added a $clearState parameter to Parser::internalParse() which never gets passed to it, thus spewing error logs with billions of lines of PHP warnings

January 24

  • 21:00 jeluf: installed lighty on storage1, configured squid so that all dewiki image requests and all commons thumb image requests go to storage1. Images fast again, backend request rate down to normal level.
  • 18:40 brion: images still very slow :(
  • 14:00 mark: Assigned new, extra IP addresses to Florida Squids, and added the new IP range to all squid.conf's. Also removed the old knams IP range, which has been unused over 2 months. This seems to have caused a massive cache clear in knams upload squids, causing a huge increase of image requests and overload of Amane. A real explanation is as of yet unknown... speculation is that old objects in knams caches have been invalidated somehow because they had the (now removed) old IP prefix in their caching info.

January 23

  • 02:09 Tim: reverted refresh_pattern changes in squid (ignore-reload) to fix user JS/CSS problems. With Brion's blessing.

January 22

  • 20:46 mark: Set $wgUserEmailReplyTo back to false, as mchenry will now rewrite envelope sender addresses from MediaWiki to wiki@wikimedia.org
  • 16:12 Rob: srv11 back online
  • 15:55 Rob: srv130,srv132,srv134 back online, see detailed server pages for crash information.

January 21

  • 12:30 jeluf: mark reports twice as many backend requests as usual. live-patched opensearch_desc.php to send proper Cache-Control headers. Needs to be updated in SVN. Backend request rate back to normal levels.
  • 07:10 brion: set $wgUserEmailUseReplyTo to protect against SPF failures and privacy leakage due to bounce messages in user-to-user emails. (Caused by sSMTP, which forces the envelope sender and From: address to be the same.) This uglifies user-to-user emails but keeps the same. In the long term I recommend replacing sSMTP with a minimal postfix or something like we used to use, which should work in a safe manner.
  • 03:24 brion: taking srv184 out of apache rotation to test ssmtp config issues

January 20

  • 21:45 jeluf: unpooled srv183, investigated why NFS mounts were missing after a reboot. Seems to be related to https://bugs.launchpad.net/ubuntu/+source/sysvinit/+bug/44836 . The fix suggested in that bug seems to help. Have to package it tomorrow.
  • 21:40 brion: mounted NFS shares on srv183
  • 21:39 brion: srv183 was rebooted 2h55m ago. its apaches are running, but NFS shares aren't mounted. nothing works properly. led to several reports of captcha failures, and might have led to some upload-related issues
  • 18:30 jeluf: rebooted srv183, un-killable convert jobs were blocking port 80
  • 18:29 brion: apache not restarting on srv164, srv176, srv183, srv184 -- "(98)Address already in use: make_sock: could not bind to address 0.0.0.0:80"
  • 18:25 brion: killed job runner jobs on srv90-99, they were the error-spewers. syslog is clean. :D
  • 18:18 brion: several apaches in srv90-99 range still spewing errors, but seem to have the right file. stuck apc?
  • 18:11 brion: removed the random '$key' parameter from MessageCache::transform
  • 18:06 brion: space was filled by /var/log/messages and /var/log/syslog; runaway PHP warnings from some live hack extra parameter. truncating the log files and resyncing
  • 17:56 brion: turned off their apaches. looking for the space culprit.... they have most of their space wasted in a /a partition and a tiny / where all the stuff is
  • 17:53 brion: lots of srv's in 150-190 range out of disk space; broken (LocalRepo.php update failed)
  • 11:12 brion: file histories were broken for a few minutes (bad commit got through)
  • 07:08 brion: enabling $wgFileRedirects on test.wikipedia

January 19

  • 06:29 and a bit before - brion: some brief segfaulting due to a bad recursion in my SiteConfiguration update. Note: non-string values in InitialiseSettings.php (false, null, ints, etc) will now work.

January 18

  • 22:46 brion: wikibugs was idle for an hour or so due to being autoblocked for bounces again...
  • 22:40 brion: srv11 is hung; no ssh, HTTP opens but doesn't respond
  • 18:40 brion: created wikimedia-sf mailing list

January 16

  • 22:30ish brion: someone tried to delete sandbox on en.wikipedia, leading to various DB error warnings (transactions full) and breakage of most editing for nearly an hour. Have hacked in a 5000-revision limit on deletions, will prettify it shortly.
  • 21:39 brion: Added a default "Cache-control: no-cache" header on output in CommonSettings.php. This will protect PHP Fatal Error blank pages and such from getting cached due to a 200 result code and lack of cache-control headers. Actual cache-control output will override the default one. (Had to manually purge a Special:Random on en.wikipedia... various issues with editing etc)
  • 07:32 brion: fixed IRC recentchanges name for wikimania2008.wikimedia (was sending to the 2007 channel)

January 15

  • 21:00 jeluf: removed memcached on srv56,57,58 on rainman-sr's request. Memcached was causing problems with the indexer.

January 14

  • 21:33 brion: clearing a giant watchlist on users' request; may cause some s1 replag
  • 21:00ish brion: we seem to be getting blank PHP fatal error pages stuck in squid caches. :( latest php should mark these as 500...
  • 20:00 Rob: All yaseo upload squids upgraded.
  • 19:45 Rob: All yaseo text squids upgraded.
  • 18:45 Rob: Upgraded squid on sq41-sq50
  • 17:45 Rob: Upgraded squid on sq11-sq15
  • 17:00 Rob: Upgraded squid on sq6-sq10
  • 17:00 Rob: Upgraded squid on sq1-sq4
  • 16:20 Rob: Upgraded squid on sq32-sq40
  • 16:20 Rob: Upgraded squid on sq24-sq31
  • 16:03 Rob: Upgraded squid on sq16-sq23
  • 15:26 Rob: Upgraded squid on knsq16,knsq17, knsq18, knsq20, knsq21, knsq22.
  • 15:00 Rob: Upgraded squid on knsq8,knsq8, knsq9, knsq10, knsq11, knsq12, knsq13, knsq14, knsq15

January 13

  • 20:34 mark: Enabled access log on mayflower's apache (why was it disabled?)
  • 18:12 mark: Upgraded all knams text squids to new squid version
  • 17:30 mark: Set refresh_pattern . 60 50% 3600 ignore-reload on all text squids to override reload headers
  • 17:00 mark: Upgraded knsq1 to the new Squid
  • 16:15 mark: Brought up knsq19, and installed a new squid 2.6.18-1wm1 on it, including Domas' Accept-Encoding normalization patches. If you notice anything weird, notify Mark or Domas...
  • 04:25 Tim: Updated MW from r29455 to r29682.
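
The point of Accept-Encoding normalization (16:15 entry) is that pages served with a Vary: Accept-Encoding header are cached once per distinct header value, so collapsing the many spellings browsers send into a couple of canonical values improves hit rates. The log does not describe what Domas' patches do exactly; the sketch below is one plausible normalization (either "gzip" or nothing), written in Python rather than as a Squid patch.

```python
def normalize_accept_encoding(value):
    """Collapse an Accept-Encoding header to 'gzip' or '' (no compression)."""
    if not value:
        return ""
    for item in value.split(","):
        parts = [p.strip().lower() for p in item.split(";")]
        coding = parts[0]
        quality = 1.0  # a coding without a q-value is fully acceptable
        for param in parts[1:]:
            if param.startswith("q="):
                try:
                    quality = float(param[2:])
                except ValueError:
                    quality = 0.0  # treat a malformed q-value as "not acceptable"
        if coding in ("gzip", "x-gzip") and quality > 0:
            return "gzip"
    return ""


for header in ["gzip, deflate", "gzip,deflate", "GZIP;q=0.5, identity", "deflate", "gzip;q=0", None]:
    print(repr(header), "->", repr(normalize_accept_encoding(header)))
```

Every client then falls into one of two cache variants instead of one per header spelling.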

January 12

  • 11:00 domas: removing titleblacklist. there's a certain level of crap beyond which I won't fix stuff.
  • 03:10 brion: importing checkuser logs
  • 02:59 brion: upgrading to current CheckUser code (per-wiki logs for now)

January 11

  • 12:00 domas: installed lighty on zwinger for ganglia use

January 10

  • 17:00 domas: disabled CentralNotice

January 9

  • 21:00 domas: increased revtext ttl to 1w, fixed parser cache ttl problem, where magicwords were causing most of enwiki (and other template-aware wiki) pages to be cached for 1h only (r29511)
  • 09:00 domas: memcached arena increased to 158GB, 79 active nodes, ES instances getting lower buffer pools on servers running memcached (1000M to 100M), full cache drop
  • 00:14 brion: now that we've expanded storage2's size and removed a bunch of useless thumb and temp files from the amane backup, there's room again; have restarted dump runs, including a continuation run of enwiki (which should start up from meta-current)

January 8

  • 22:33 jeluf: extended storage2:/export by 650 GB
  • 22:03 brion: uploads broken for several minutes by r29361 (reverted)
  • 21:48 brion: srv17 and srv18 are whining about high temperatures
  • 21:00 Rob: srv17 segfaults in httpd, resynced and restarted apache.
  • 17:10 Rob: srv78 Kernel Panic, rebooted and back online.
  • 16:45 Rob: srv177 cpu overheating, pulled, replaced thermal paste, back online.
  • 16:20 Rob: srv15 cpu overheating, pulled, replaced thermal paste, back online.
  • 16:15 Rob: srv189 back in rotation.
  • 14:59 Rob: srv189 reinstalled, needs apache setup.
  • 14:54 Rob: srv130 rebooted and back online.
  • 07:50 domas: added db8 and db10 to ganglia

January 7

  • 08:34 Tim: mounted upload4 on albert for static.wikipedia.org symlinks

January 6

  • 21:33 mark: Enabled TCP ECN on lily and mayflower
  • 21:03 mark: Added mayflower's EUI-64 address to DNS - svn may use it.
  • 20:06 mark: Added a v6 service IP to lily (lists.wikimedia.org) and put it in DNS.

January 4

  • 00:34 brion: restarting backup syncs from amane to storage2; was broken by bad script... trimming more thumbnails out of storage2 to clear up space

January 3

  • 19:29 brion: starting enwiki dump on srv42, will continue with general worker thread
  • 19:13 brion: Setting up srv42 to run dump worker threads as well as general batches, since it seems idle.
  • 15:05 mark: Rebooted fuchsia with an LVS optimized kernel, moved all LVS services back onto it
  • 13:45 mark: LVS on fuchsia overloaded, moved LVS for upload to mint
  • 00:26 brion: http://download.wikimedia.org/ now running off storage2. will restart dump runs aiming at it until we have a better place to put the backend (with benet still not checked for its disk issues)

Archives

the whole kaboodle

