Server admin log/Archive 20
October 12
- 13:00 jeluf: moved wikipedia/m* image directories to ms1
- 08:00 jeluf: restarted lighttpd on ms1, directory listings are now disabled.
- 02:55 Tim: attempted to disable directory listing on ms1. Gave up after a while.
October 11
- 7:00 jeluf: moved wikipedia/s* image directories to ms1
October 10
- 21:30 jeluf: moved wikipedia/[jqtuwxy]* to ms1
- 19:20 RobH: Bayes online.
- 19:11 brion: recreated special page update logs in /home/wikipedia/logs, hopefully fixing special page updates
- 13:05 Tim: reverted live patch and merged properly tested fix r41928 instead.
- 12:31 Tim: deployed a live patch to fix a regression in MessageCache::loadFromDB() concurrency limiting lock
- 12:17 domas: killed long running threads
- ~12:04: s2 down due to slave server overload
October 9
- 22:52 brion: enabled Collection on de.wikibooks so they can try it out
- 20:00 jeluf: moved wikipedia/i* images to ms1
- 17:05 RobH: thistle raid died due to hdd failure, replaced hdd, reinstalled as raid10.
- 12:00 domas: switched s3 master to db1, erased a bunch of db.php stuff by accident (don't know how :). restored from db.php~ :-)
- 09:31 mark: pascal died yet again, revived it. Will move the htcp proxy tonight...
October 8
- 21:05 brion: yongle still gets stuck from time to time, breaking mobile, apple search, and svn-proxy. i still suspect svn-proxy but can't easily prove it. it's using a separate svn command now (in theory), but it's not showing me stuck processes.
- ??:?? rob fixed srv37, then later srv133, putting them back into the mediawiki-installation node group. he did an audit and didn't see any other problems. i ran a scap to make sure all are now up to date
- Speculation: possible that rumored ongoing image disappearances have been caused by the image-destruction bug still being in place on srv133 for the last month.
- 19:02 mark: Upgraded packages on search1 - search6 and searchidx1
- 18:59 brion: aaron complaining of srv37 not properly updated (doesn't recognize Special:RatingHistory). flaggedrevs.php was out of date there. checking scap infrastructure, stuff seems ok so far...
October 7
- 21:47 brion: started two dump threads (srv31)
- 21:16 RobH: installed and configured gmond on all knams squids.
- 21:00 jeluf: moved wikipedia/g* to ms1
- 18:55 RobH: fixed private uploads issue for arbcom-en and wikimaniateam.
- 17:26 RobH: reinstalled and redeployed knsq24 and knsq29
- 15:00-16:00 robert: switched enwiki to lucene-search 2.1 running on new servers. Test run till tomorrow; if anything goes wrong, reroute search_pool_1 to old searchers on lvs3. Will switch on spell checking when all of the servers are racked. Thanks RobH for tuning config files.
- 15:54 RobH: srv101 crashed again, running tests.
- 15:45 RobH: srv146 was powered down for no reason. Powered back up.
- 15:42 RobH: srv138 locked up, rebooted, back online.
- 15:32 RobH: srv110 was locked up, rebooted, synced, back online.
- 15:31 RobH: srv101 back up and synced.
- 15:22 RobH: rebooted srv56, was locked up, handed off to rainman to finish repair.
- 15:21 RobH: updated lucene.php and synced.
- 15:04 RobH: updated memcached to remove srv110 and add in spare srv137.
- 15:00 RobH: removed all servers from lvs:search_pool_1 and put in search1 and search2 with rainman
October 6
- 23:55 brion: tweaked bugzilla to point rXXXX at CodeReview instead of ViewVC
- 14:29 domas: amane lighty was closing connections immediately, worked properly after restart. upgraded to 1.4.20 on the way.
- 14:36 RobH: setup ganglia on all pmtpa squids.
- 13:50 mark: The slow page loading on the frontend squids appears to be limited to english main page only, for unknown reasons. Set another article as pybal check URL to prevent pooling/depooling oscillation by PyBal for now.
- 09:27 mark: yaseo squids are fully in swap, set DNS scenario yaseo-down
October 5
- 23:14 mark: Frontend squids are not working well at the moment, sometimes serving cached objects with very high delays. I wonder if they are under (socket) memory pressure. Reduced cache_mem on the backend instance on sq25 to free up some memory for testing.
- 20:35 jeluf: wikipedia/b* moved, too
- 19:00 jeluf: switched squids to send requests for upload.wikimedia.org/wikipedia/a* to ms1
- 14:30 jeluf: Moving all wikipedia/a* image directories to ms1
October 4
- 23:17 mark: Repooled knsq16-30 frontends in LVS. Also found that mint was fighting with fuchsia about being LVS master, due to reboot this afternoon.
- 14:30 mark: Several servers in J-16 were shutting down, or going down around this time. Reason unknown, possibly auto shutdown because of high temperature, possibly they were turned off by someone locally.
- 14:03 mark: SARA power failure. Feed B lost power for ~ 6 seconds.
- 00:26 mark: Depooled srv61
- 00:07 brion: found srv37 and srv61 have broken json_decode (wtf!)
- updating packages on srv37. srv61 seems to have internal auth breakage
- updated packages on srv61 too. su still borked, may need LDAP fix or something?
October 3
- 21:40 brion: transferring old upload backups from storage2 to storage3. once complete, can restart dumps!
- 20:01 brion: running updateRestrictions on all wikis (done)
- 17:51 RobH: srv135 & srv136 reinstalled as ubuntu.
- 17:34 RobH: srv132 & srv133 reinstalled as ubuntu.
- 17:13 RobH: srv130 back online.
- 16:40 RobH: depooled srv131, srv132, srv135, srv136 for reinstall.
- 00:25 brion: switched codereview-proxy.wikimedia.org to use local SVN command instead of PECL SVN module; it seemed to be getting bogged down with diffs, but hard to really say for sure
October 1
- 20:02 RobH: srv63 back online.
- 19:35 RobH: srv61 and srv133 back online.
- 18:22 RobH: storage3 online and handed off to brion.
- 17:35 RobH: updated mc-pmtpa.php to put srv61 as spare.
- 17:32 RobH: srv61 faulty fan replaced, back online.
- 09:31 Tim: srv104 (cluster18) hit max_rows, finally. Removed it from the write list.
- 08:36 Tim: fixed ipb_allow_usertalk default on all wikis
- 23:46 mark: Reinstalled knsq24
- 22:55 mark: Reenabled switchports of knsq16 - knsq30
- 20:45 jeluf: fixed resolv.conf on srv131
- 20:45 jeluf: mounted ms1:/export/upload as /mnt/upload5, started lighttpd on ms1
- 19:47 brion: enabled revision deletion on test.wikipedia.org for some public testing.
- 14:25 RobH: Cleaned out the squid cache on knsq16, knsq17, knsq18, knsq19, knsq21, knsq22, knsq23, knsq25, knsq26, knsq27, knsq28, knsq30. DRAC not responsive on knsq20, knsq24, knsq29.
September 30
- 23:45 brion: apt is borked on mayflower due to smartmontools refusing to load
- 20:46 tomasz: cp'd phpmyadmin from dev.civicrm to prod civicrm for david
- 20:06 brion: replaced 'wikimedia.org' with 'meta.wikimedia.org' in local VHosts list in wgConf.php. The general 'wikimedia.org' was causing CodeReview's diff loads (via codereview-proxy.wikimedia.org) to fail as they were hitting localhost instead of the proxy. Do we need to add more vhosts to this list, or redo how it works?
- 19:45 brion: test-deploying CodeReview on mediawiki.org
- 19:?? brion: set up temporary limited SVN JSON proxy as codereview-proxy.wikimedia.org
- 19:17 RobH: updated DNS to add something for Brion.
- 18:21 mark: cache cleaning complete
- 15:05 Tim: doing some manual purges of URLs requested on #wikimedia-tech
- 15:00 mark: Cleaning caches of all backend text squids one by one, starting with pmtpa
- 14:20 mark: pooled all squids manually to fix the issues.
- 14:10 RobH: Site back up, slow as squids play catchup.
- 14:06 RobH: Pushed out old redirects.conf and restarted apaches.
- 14:01 RobH: Site is down, go me =[
- 14:00 RobH: updated redirects.conf and pushed change for orphaned domains.
- 13:38 RobH: updated dns for more orphaned domains.
- 13:11 Tim: cluster13 and cluster14 both have only one server left in rotation. Shut down apache on srv129 and srv139 out of fear that it might hasten their doom.
- 10:12 Tim: Switched ES cluster 3-10 to use Ubuntu servers (again)
- 10:03 Tim: depooled ES on srv127, has been wiped
- 10:00 Tim: depooled thistle, is down
- 09:20 Tim: Set up MediaWiki UDP logging
- 08:05 Tim: removed the ORDER BY clauses from the ApiQueryCategoryMembers queries, to work around MySQL bug, probably involving truncated indexes
- 07:08 Tim: re-enabled the API
- 06:56 Tim: ixia (s2 master) overloaded due to ApiQueryCategoryMembers queries. Disabled the API and killed the offending queries
September 29
- 22:20 brion: reenabled history export ($wgExportAllowHistory), but put $wgExportMaxHistory back to 1000 instead of experimental 0 for enwiki. (sorry enwiki)
- 21:27 RobH: fixed the mounts on srv163 and started apache back up.
- 20:20 brion: srv163 has bad NFS config, missing upload and math mounts. I've shut off its apache so it stops polluting the parser cache with math errors.
- 17:01 RobH: updated apache redirects.conf for orphaned domains, restarted all apaches.
- 15:06 RobH: updated DNS to reflect a number of orphaned domains.
- 08:48 Tim: put db7 back into watchlist rotation (99%)
- 08:08 domas: enabled ipblocks replication on db7, resynced from db16
- 08:00 domas: Replaced gcc-4.2 build on db7 with gcc-4.1 one, from /home/wikipedia/src/mysql-4.0.40-r9-hardy-x86_64-gcc4.1.tar.gz
September 28
- 17:52 mark: Upgraded mchenry to Hardy.
- 17:15 mark: Upgraded sanger to Hardy.
- 13:43 mark: Repooled srv150
- 13:25 mark: Upgraded php5 and APC on all ubuntu apaches... got tired of restarting them. ;)
- 12:06 Tim: on db7: replicate-ignore-table=enwiki.ipblocks. Good enough for now (see the sketch after this day's entries).
- 11:51 Tim: schema update at 04:44 made db7 segfault. Replication stopped, watchlists stopped working after code referencing the new schema was synced. Switched to db16 for watchlist and RCL. Tried INSERT SELECT, that segfaulted too.
- 09:37 mark: Made syslog-ng on db20 filter the flood of 404s in /var/log/remote
- 09:15 mark: Restarted all (and only) segfaulting apaches
- 05:38 Tim: svn up/scap to r41337.
- 04:44 Tim: applying patch-ipb_allow_usertalk.sql on all DBs. No master switches.
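A minimal sketch of the 12:06 replicate-ignore-table change on db7 (config path and init script name are assumptions; MySQL merges repeated [mysqld] groups):

    printf '[mysqld]\nreplicate-ignore-table=enwiki.ipblocks\n' >> /etc/my.cnf   # skip the broken table on this slave
    /etc/init.d/mysql restart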
September 27
- 20:41 mark: Packaged a newer PHP5 (5.2.4 from Ubuntu Hardy, with CDB support) and a new APC (3.0.19). Deployed it on srv153 for testing.
- 18:15 brion: srv100 looks particularly crashy.
- 18:09 brion: got some complaints about ERROR_ZERO_SIZED_OBJECT on saves, seeing a lot of segfaults in log. Restarting all apaches to see what they do.
September 26
- 22:49 RobH: repooled sq49.
- 22:00 RobH: depooled sq49 for power testing.
- 21:50 RobH: pulled search7 for power testing and left it off, as the power circuit would trip if it was left on there.
- 21:18 RobH: put srv189 back into rotation.
- 19:51 RobH: Pulled srv189 for power testing.
September 25
- 21:41 RobH: had to recreate /home/wikipedia/logs/jobqueue/error as it was lost and job queue runners failed due to it not being there. Restarted runners.
- 19:08 domas: fixed clear-profile by replacing 'zwinger' with 'zwinger.wikimedia.org' - apparently datagrams to 127.1 used to fail.
- 18:44 brion: manually applied r41264 to MimeMagic.php to fix uploads of OpenDocument files to private/internal wikis
- 15:25 RobH: bayes minimally installed.
- 15:23 RobH: reverted statistics1 to bayes in dns, pushed dns change.
- 14:04 RobH: bayes racked and ready for install.
- 05:00 mark: Flapped BGP session to HGTN, to resolve blackholing of traffic
- 03:20 Tim: stopped apache on srv167, was segfaulting again. I suspect binary version mismatch between compile and deployment, e.g. APC was compiled for libc 2.5-0ubuntu1, deployed on libc 2.7-10ubuntu3.
- 03:03 Tim: restarted segfaulting apaches srv111,srv168,srv154,srv167,srv46
- 02:28 Tim: srv35 was segfaulting again, probably because it was in both the test.wikipedia.org pool and the main apache pool. Having two copies of everything tends to make the APC cache overflow, which triggers bugs in APC and leads to segfaulting. Removed it from the main apache pool.
September 24
- 20:23 RobH: restarted srv186 apache due to segfault.
- 20:21 RobH: restarted srv179 apache due to segfault.
- 20:05 brion: restarted srv35's apache (test.wikipedia.org) was segfaulting
- 19:25 tomasz: restricted grant for 'exim'@'208.80.152.186' to 150 MAX_USER_CONNECTIONS (see the sketch after this day's entries)
- 18:40 mark: Increased TCP backlog setting on mchenry from 20 to 128.
- 18:19 brion: restoring ApiQueryDeletedrevs and Special:Export since they're not at issue. Domas thinks some of the hangs may be caused by mails getting stuck via ssmtp when the mail server is overloaded; auto mails on account creation etc may hold funny transactions open
- 17:52 brion: disabling SiteStats::update() actual update query since it's blocking for reasons we can't identify and generally breaking shit
- 17:50 RobH: updated nagios files/node groups for raid checking on hosts without 3ware present
- 17:37 brion: domas thinks the problem is some kind of lock contention on site_stats, causing all the edit updates to hang -- as a result the ES connections stack up while waiting on the core master. I'm disabling ss_active_users update for now, that sounds slow...
- 17:34 RobH: srv131 apache setup is borked, removing from lvs.
- 17:33 RobH: added proper ip info for lo device on srv131
- 17:24 brion: temporarily disabling special:export
- 17:22 brion: the revert got us back to being able to read the site most of the time, but still lots of problems saving -- ES master on cluster18 still has lots of sleeper connections and refuses new saves
- 17:10 brion: trying a set of reverts to recent ES changes
- 16:43 brion: temporarily disabling includes/api/ApiQueryDeletedrevs.php, it may or may not be hitting too much ES or something?
- 16:38 brion: seeing lots of long-delayed sleeping connections on ES masters, not running queries. trying to figure out w/ Aaron what could cause these
- 16:36 mark: Set up a syslog server on db20, logging messages from other servers to /var/log/remote.
- 16:31 brion: confirmed PHP fatal error during connection error (backend connection error "too many connections"). Manually merging r41230 to live copy to skip around the frontend PHP error
- 16:20 brion: we're getting reports of eg "(Can't contact the database server: Unknown error (10.0.2.104))" on save. Trying to investigate, but MediaWiki was borked by the previous reversions of core DB-related files to a 6-month-old version with incompatible paths. Trying to re-sanitize MW to r41097 straight
- 15:45 Rob: setup wikimedia-task-appserver on srv141.
- 15:09 mark: The problem reappeared, looks like a bug in MediaWiki, possibly triggered by some issue in ES. Reverted the files includes/ExternalStore.php includes/ExternalStoreDB.php includes/Revision.php includes/db/Database.php includes/db/LoadBalancer.php to r35098 and ran scap.
- 14:50 mark: Reports of most/all saves failing with PHP fatal error in /usr/local/apache/common-local/php-1.5/includes/ExternalStoreDB.php line 127: Call to a member function nextSequenceValue() on a non-object. Suspected APC cache corruption, did a hard restart of all apaches which appeared to resolve the problem.
- 07:15 Tim: installed wikimedia-nis-client on db20
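The 19:25 grant restriction amounts to a single statement; a sketch assuming MySQL 5.0+, where per-account connection caps are set via GRANT (USAGE leaves existing privileges untouched):

    mysql -e "GRANT USAGE ON *.* TO 'exim'@'208.80.152.186' WITH MAX_USER_CONNECTIONS 150;"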
September 23
- 20:03 RobH: srv170 reporting apache down, synced, restarted.
- 20:02 RobH: srv188 was not running apache, synced and started.
- 19:59 RobH: Installed memcached on srv183, updated mc-pmtpa.php.
- 19:57 RobH: Installed memcached on srv66, updated mc-pmtpa.php.
- 19:54 RobH: Installed memcached on srv141, updated mc-pmtpa.php.
- 19:52 RobH: srv106 back up, apache synced and memcached running.
- 19:45 RobH: srv127 complained of port in use starting apache, rebooted, all is fine.
- 19:27 RobH: removed srv106 from active memcached, replaced with srv127, sync-file mc-pmtpa.php
- 18:00 RobH: srv127 had booting issues into the OS, reinstalled and redeployed.
- 17:08 RobH: srv138 was locked up, restarted.
- 16:53 RobH: srv136 was locked up, restarted, synced, added correct lvs ip info.
- 16:45 RobH: srv126 was locked up, restarted, synced, added correct lvs ip info.
- 16:29 RobH: rebooted srv106, was locked up.
- 16:25 RobH: reinstalled srv101, was old ubuntu with no ES data.
- 16:13 RobH: reinstalled srv143 and srv148 from FC to Ubuntu, redeployed as apache
- 15:57 RobH: reinstalled srv128 and srv140 from FC to Ubuntu, redeployed as apache.
- 14:00-14:50 Tim: cleaned up /home/wikipedia somewhat, put various things in /home/wikipedia/junk or /home/wikipedia/backup, moved some lock files to lockfiles, deleted ancient /h/w/c/*.png symlinks, etc.
- 14:50 Tim: Made sync-common-file use rsync instead of NFS since some mediawiki-installation servers still have a stale NFS handle for /home (see the sketch after this day's entries)
- 14:31 RobH: srv189 back in apache rotation
- 14:20 RobH: srv130 back in apache rotation
- 13:56 Tim: started rsync daemon on db20
- 13:49 Tim: restored dsh node groups on zwinger
- 13:40 Tim: installed udplog 1.3 on henbane
- 00:05 - 01:20 Tim: copying everything from the recovered suda image except /home/kate/xx, /home/from-zwinger and /home/wikipedia/logs. Will copy /home/wikipedia/logs selectively.
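A minimal sketch of the 14:50 sync-common-file change: push the file to each host over rsync instead of copying via the shared /home mount. The node-list path and file paths are illustrative; the target directory follows the common-local layout mentioned elsewhere in this log:

    # loop over the mediawiki-installation node group (dsh-style list; path is an assumption)
    for host in $(cat /etc/dsh/group/mediawiki-installation); do
        rsync -a /home/wikipedia/common/wmf-config/mc-pmtpa.php \
            "$host":/usr/local/apache/common-local/wmf-config/
    done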
September 22
- 21:30 brion: noting that ExtensionDistributor extension is disabled for now due to the NFS problem
- 18:59 RobH: srv131 offline due to kernel panic. Cannot bring back until /home issue is resolved.
- 18:00 brion: things seem at least semi-working.
- everything hung
- suda had some kind of kernel crash
- after reboot, it was found to have a couple flaky disks
- brion hacked up MW config files to skip the NFS logging
- mark set up an alternate /home NFS server
- 17:50 mark: Set up db20 as an (empty) temporary suda replacement. Set up NFS server for /home.
- 17:20 mark: suda died.
- 17:25 RobH: srv130 not working right, removed from pool.
- 16:32 RobH: removed srv8 and srv10 from nagios, resynced.
- 15:00 mark: Site down completely. Post-mortem:
- Rob is untangling power cables in rack B2, and both asw-b2-pmtpa and asw3-pmtpa (in B4) lose power
- Two racks unreachable, PyBal sees too many hosts down and won't depool more
- Rob brings power to asw-b2-pmtpa back up, but connectivity loss to B4 is not noticed
- Mark investigates why LVS isn't working and adjusts PyBal parameters, until PyBal pools no servers at all
- Apaches are unhappy about completely missing ES clusters
- Connectivity loss to B4 discovered, restored
- Site back online
September 21
- 10:10 Tim: disabled srv106's switch port. Was running the job queue with old configuration, inaccessible by ssh.
September 20
- 14:45 Tim: re-enabled Special:Export with $wgExportAllowHistory=false. Please find some way of doing transwiki requests which doesn't involve crashing the site.
- 14:30 Tim: People were reporting ES current master overload, no ability to save pages at all. This was apparently due to the small number of max connections on srv103/srv104. Most threads were sleeping. The real culprit was apparently db2 being slow due to a long-running (1 hour) Special:Export request. Disabled Special:Export entirely.
- 12:00 mark: Restored zwinger's IPv6 connectivity; removed svn.wikimedia.org from /etc/hosts
- 11:40 mark: Found an IP conflict; 208.80.152.136 was assigned to srv9 but not listed in DNS
- 10:09 Tim: removed srv69 and srv118 from the memcached list, down
- 09:02 Tim: ES on srv84 had new passwords, was not accepting connections from 3.23 clients on srv32-34. Fixed.
- 08:45 Tim: depooled ES srv110, reformatted by Rob while it was still a current ES slave. Depooled srv137, mysqld was shut down on it for some reason. One server left in cluster14.
- srv137 has a corrupt read-only file system on /usr/local/mysql/data2
- 05:34 Tim: svn.wikimedia.org not reachable from zwinger via IPv6, causing very slow operation due to timeouts. Hacked /etc/hosts.
- 04:58 Tim: svn up/scap to r41053
- 01:06 Tim: ES migration failed on all clusters except cluster3 (the cluster I used to test the script), due to MySQL 4.0-4.1 version differences. Restarting with mysqldump --default-character-set=latin1 (see the sketch after this day's entries).
- 00:14 Tim: restarted segfaulting apaches: srv167,srv152,srv172,srv171,srv153,srv151,srv176,srv155,srv112,srv119,srv111,srv113
- 00:10 Tomasz: upgraded public and private depots to svn 1.5 data format.
- 00:00 Tomasz: svn installed ubuntu 8.04 along with svn 1.5.
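The restarted 01:06 ES migration, roughly: force latin1 on both ends of the copy so the 4.1 server stores the 4.0 bytes untouched. Host names follow the copies noted elsewhere in this log; the database name is a placeholder:

    mysqldump --default-character-set=latin1 -h srv32 some_es_db blobs \
      | mysql --default-character-set=latin1 -h srv151 some_es_db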
September 19
- 23:00 Tomasz: svn installed ubuntu 7.10, ready
- 22:55 RobH: db20 installed, ready for next upgrade.
- 22:38 RobH: db19 installed, ready for setup.
- 22:26 RobH: db18 installed, ready for setup.
- 18:00 brion: updated mwlib on bindery.wikimedia.org and Collection extension
- 15:59 RobH: reinstalled srv70, srv100, srv110-srv119 from FC to ubuntu, redeployed.
- 07:30 Tim: srv38 was hanging while attempting to write to log files on /home. Fixed permissions on /mnt/upload4/en/thumb which was causing a high log write rate, restarted apache, disabled search-restart cron job, restarted pybal. Seems to be fixed.
- 01:55 Tim: the issue with ES was the lack of a master pos wait between transfer and slave shutdown. Fixing.
- 01:00 Tim: restarting possibly segfaulting apaches on srv158,srv177,srv178,srv173,srv51,srv187,srv182,srv44,srv117. Keeping srv139 for debugging, it has kindly depooled itself by segfaulting on pybal health checks.
September 18
- 17:39 RobH: srv35, srv37, srv55 & srv59 bootstrapped with ganglia.
- 17:37 RobH: srv40, srv41, srv43-srv53 bootstrapped with ganglia.
- 17:36 RobH: srv60-srv68 bootstrapped with ganglia.
- 17:31 RobH: srv151-srv188 bootstrapped with ganglia.
- 11:45 Tim: reverted db.php change, still has issues.
- 11:18 Tim: removed apaches_yaseo from nagios config, changed apaches_pmtpa to apaches.
- 11:09 Tim: in db.php, switched ES clusters 3-10 to use the ubuntu servers
September 17
- 23:57 brion: set $wgLogo to $stdpath for wikinews -- old local /upload path failed to redirect properly on secure.wikimedia.org interface
- 22:19 mark: Deployed the rest of the new search servers, search2 - search7.
- 19:25 JeLuF: changed robots.php to send both Mediawiki:robots.txt and /apache/common/robots.txt
- 19:23 RobH: Removed srv63 from memcache list, put in spare memcache and synced file.
- 19:14 RobH: restarted memcached on srv74
- 19:00 RobH: reinstalled srv62, srv64, srv65, srv66, srv67, & srv68 from FC to Ubuntu.
- 18:26 RobH: srv63 shutdown due to hdd failure.
- 18:25 RobH: srv61 shutdown due to overheating issue.
- 18:16 RobH: Reinstalled srv51, srv52, srv53, srv54, srv55, srv56, srv57, srv58, srv59, srv60, srv61 as ubuntu apache servers.
- 16:56 RobH: Reinstalled srv44, srv45, srv46, srv47, srv48, srv49, & srv50 as ubuntu apache servers.
- 16:00 RobH: Reinstalled srv35, srv37, srv40, srv41, srv43 as ubuntu apache servers.
- 16:00 RobH: moved srv37 from pybal render group to apache group
- 01:50 brion: killed obsolete juriwiki-l list per delphine
September 16
- 22:59 mark: srv133 is giving Bus errors, read-only file systems, and was therefore automatically depooled by PyBal. Good times.
- 22:59 mark: Installed memcached on srv182 (was missing?), restarted memcached on srv70, srv169 and replaced instance of srv141 by srv142.
- 22:36 mark: Prepared searchidx1 and search1 for production, if things work sufficiently well I'll deploy the others tomorrow
- 21:30 brion: found a bunch of memcache machines down or not running memcached: 170, 141, 70, 169, 182
- 21:01 mark: Building search deployment with rainman, with search1 as test host
- 20:33 brion: fixed secure.wikimedia.org for Wikimania wikis -- wikimedia-ssl-backend.conf rewrite rules were mistakenly excluding digits from the wiki pseudodir
- 18:00 JeLuF: made the main page of https://secure.wikimedia.org/ editable via http://meta.wikimedia.org/wiki/Secure.wikimedia.org_template using extract2.php
September 15
- 22:45 Tim: rebooted srv151: shut down mysqld, then a sync and sysrq b (see the sketch after this day's entries).
- 21:11 RobH: Installed Ubuntu on searchidx1, search1, search2, search3, search4, search5, search6, search7.
- 19:00 RobH: searchidx1 installed.
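The 22:45 "sync; sysrq b" recipe in shell form, run as root on the box being bounced:

    sync                           # flush dirty buffers to disk first
    echo b > /proc/sysrq-trigger   # 'b': immediate reboot, no clean shutdown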
September 14
- 18:45 mark: Upgraded PyBal on lvs3 to a newer version, and set up SSH checking (once a minute) of all apaches, see LVS.
- 18:42 mark: srv170 is doing OOM kills
- 18:28 mark: Upgraded wikimedia-task-appserver on all Ubuntu app servers, which creates a limited ssh account pybal-check for use by PyBal. Create the account manually on all Fedora apaches
- 17:01 mark: Apache on srv151 is stuck on an NFS mountpoint and cannot be restarted. I'm not rebooting the box as I'm not sure what's going on with ES atm.
September 12
- 23:30 jeluf: apache on srv37 doesn't restart, libhistory.so.4 is missing
- 23:15 mark: NTP ip missing on zwinger, readded
- 23:00 jeluf: proxy robots.txt requests through live-1.5/robots.php, which delivers Mediawiki:robots.txt if it exists, and /apache/common/robots.txt otherwise.
- 15:30 Tim: set read_only=0 on srv108 (Rob rebooted it)
- 15:00 RobH: bart crashed, rebooted.
- 14:56 Tim: pulling out all the stops now, running migrate.php migrate-all.
- 14:45 RobH: synced srv104, back online.
- 14:40 RobH: synced db.php.
- 14:32 RobH: srv105 unresponsive, rebooted.
- 14:25 Tim: Removed the corrupted ES installations on srv151-176
- 14:18 RobH: Installed NRPE plugins on db9-db16.
- 09:01 Tim: reverted, blob corruption due to charset conversion observed
- 07:58 Tim: Experimentally switched db.php to use the ubuntu servers for cluster3/4.
- 07:50 Tim: Stopping replication on the ubuntu cluster3 and cluster4 servers, and changing the file permissions on the MyISAM files to prevent any kind of modification by the mysql daemon. This is done by the new lock/unlock commands in ~tstarling/migrateExtToUbuntu/migrate.php.
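Roughly what the 07:50 lock step does, per the description above (the data directory path is an assumption; srv137 is noted elsewhere in this log as using /usr/local/mysql/data2):

    mysql -e 'STOP SLAVE;'
    chmod -R a-w /usr/local/mysql/data   # deny mysqld any writes to the MyISAM files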
September 11
- 05:30 Tim: Migrating cluster4. Testing new binlog deletion feature.
September 10
- 15:40 RobH: Racktables database moved from will to db9.
- 15:00 RobH: Reinstalled srv185, srv186, srv187 to newest ubuntu, online as apache.
- 05:00 - 10:10 Tim: copied cluster3 to srv151, srv163 and srv175, second attempt, seems to have worked this time
September 9
- 23:25 brion: for a few minutes got some complaints about 'Can't contact the database server: Unknown error (10.0.6.22)' (db12). This box seems to be semi-down pending some data recovery, but load wasn't disabled from it. May have gotten load due to other servers being lagged at the time. Set its load to 0.
- 18:49 RobH: Moved maurus from A4 to A2.
- 18:05 mark: Made lvs2 a temporary LVS host for upload.pmtpa.wikimedia.org to be able to remove alrazi from its rack. Will redo this setup soon.
- 17:50 RobH: srv61 reinstalled and setup as apache and memcached.
- 17:50 RobH: srv144 reinstalled, needs ES setup.
- 17:50ish brion: updated planet to 2.0, cleared en feed caches. Something was broken in them which had caused updates to fail since September 5.
- 17:42 RobH: Updated DNS to reflect new search servers.
- 15:11 RobH: Moved isidore; upon reboot, noticed the wordpress update didn't take, reapplied it to the blog and whygive installations.
- 14:49 RobH: zwinger and khaldun moved from A4 to A2.
- 10:26 Tim: copying ES data from srv32 to srv151, srv163 and srv175
- 01:30-10:20 Tim: testing and debugging the ubuntu ES migration script on srv151, srv163 and srv175
- 02:15 Tomasz: Added bugzilla reporting cron on isidore.
- 00:48 Tim: granted root access to zwinger on all ES servers, useful for migration
September 8
- 22:20 RobH: reinstalled srv178, srv179, srv180, srv181, srv182, srv184.
- 21:20 RobH: reinstalled srv175, srv176, & srv177.
- 20:30 RobH: reinstalled srv172, srv173, & srv174.
- 19:23 RobH: reinstalled srv169, srv170, & srv171.
- 18:23 RobH: reinstalled srv166, srv167, & srv168.
- 18:00 RobH: reinstalled srv163, srv164, & srv165.
- 16:40 RobH: reinstalled srv160, srv161, & srv162.
- 15:40 RobH: reinstalled srv157, srv158, & srv159.
- 15:05 RobH: reinstalled srv154, srv155, & srv156.
- 14:36 mark: Exchanged down srv126 for srv140, and down srv137 for srv141 in mc-test.php
- 14:12 RobH: reinstalled srv151, srv152, & srv153.
- 06:16 Tim: Gave myself a RackTables account
- 05:33 Tim: srv146 down, removed from ES rotation
- 05:08 Tim: accidentally crashed srv37. Needs restart.
September 7
- 15:48 mark: alrazi overloaded, switched traffic back to knams in the hope it can take the load
- 14:37 mark: knams partially back up, broken line card still down. Moved some important servers to another line card. knsq16 - knsq30 will be down for the upcoming days, as well as most management.
- 10:20 domas: copied in mysql build from db16 to db12 - db12 was running the gcc-4.2 one, and in a crashloop. next crash will bring up the proper build :)
September 6
- 20:15 river: failure of many hosts at knams (including lvs), moved to authdns-scenario knams-down
- 12:05 hashar: merged r40433 to fix &editintro
- 5:30 JeLuF: image upload on enwiki enabled again. Slowly deleting images from amane.
- 3:00 JeLuF: image upload on enwiki disabled, copying enwiki images to storage1
September 5
- 22:00-00:00 Hashar: gmaxwell provided a backup of files (downloaded to ~/files/); I recovered the missing ones.
- run ~/check_missing_pics.pl for hints (output example)
- 17:03 Tim: Updated trusted-xff.cdb. Fixes AOL problems.
- 14:45 JeLuF: started to rsync enwiki images from amane to storage1 in preparation of tomorrow's final move of the image directory
- 04:24 Tim: sync-file screwup caused thumbnails to be created in the source image directory. Will try to repair.
- 03:13 Tim: srv151 is depooled for some reason. No indication as to why in the logs or config files. Using it to test the new wikimedia-task-appserver package. Will repool once I get it working properly.
September 4
- 22:15 JeLuF: Switched srv179's mysql to read_only
- 22:10 JeLuF: OTRS back online, switched to db9. Changed exim config on mchenry, too.
- 20:00 JeLuF,RobH: Shut down OTRS, migrating its DB from srv179 to db9
- 19:49 RobH: db10 replication slave of db9
- 17:58 RobH: civicrm and dev civicrm database now located on db9 (was on srv10)
- 17:19 RobH: Bugzilla database is now located on db9 (was on srv8)
- 16:52 RobH: Both the wikimedia blog and donation blog databases are now residing on db9 (was on srv8)
- 16:43 Tim: re-enabled thumb.php after some of the culprits came to talk to me on #wikimedia-tech and promised to reform their ways
- 11:09 Tim: fixed APC on srv38 and srv39, was broken.
- 10:35 Tim: srv38 and srv39 have been overloaded since 05:50. Blocked thumb.php for external clients.
- 05:30 Tim: restarted srv138 with sysrq-trigger. Was reporting "bus error" on sync-file.
- 04:03 Tim: upgrading to wmerrors-1.0.2 on all mediawiki-installation
September 3
- 23:00 jeluf: moved enwiki's upload archive from amane to storage1, freeing up some 20G on amane.
- 16:54 brion: tweaking ApiOpenSearchXml to hopefully fix the rendering-thumbs-on-text-apaches problem
- 14:01 RobH: updated libtiff4 on all apaches
- 04:23 Tim: svn up/scap to r40356
- 04:13 Tim: populating ss_active_users
- 03:21 Tim: applying patch-ss_active_users.sql
September 2
- 19:50 mark: Repooled srv181
- 19:31 mark: Many boxes still in inconsistent state because of OOM kills. Some background processes not running (e.g. ntpd). Rebooted srv159, srv182, srv154, srv156, srv157, srv158, srv181, srv188
- 19:28 mark: scap
- 19:01 mark: Killed all stuck convert processes on srv151..srv188 (but left srv189 intact for debugging)
- 18:51 mark: Rebooted srv169, srv180
- 18:48 mark: Remounted /mnt/upload4 on srv151..srv188 (not srv189)
- 18:33 mark: Many application servers are running out of memory, one by one. This seems to be caused by stuck thumbnail convert processes which end up there. The thumbnail convert processes on the regular apaches are indirectly caused by the API, and are opensearch/prefixsearch/allpages related - but I get lost in that code. One sample url is http://en.wikipedia.org/w/api.php?action=opensearch&search=Gina&format=xml Another interesting and likely related question is why many apaches can no longer reach storage1 NFS...
- 17:07 RobH: Restarted ssh process which had stalled on srv188.
- 16:52 mark: Rebooted srv186
- 16:00 RobH: Pushed a number of dns changes for CZ chapter redirects.
- 15:25 RobH: Updated dns for arbcom.de.wikimedia.org. Also added wiki to the cluster.
September 1
- 23:10 mark: Added upload.v4.wikimedia.org hostname (explicitly A-record only), and allowed it in Squid frontend.conf
- 17:40 jeluf: unpooled apaches srv138 and srv181, ssh not working
- 17:30 jeluf: re-enabled srv124 in ES cluster12
- 17:15 jeluf: re-enabled srv86 in ES cluster7
- 16:32 mark: Deployed the PowerDNS pipe backend with the selective-answer script on all authoritative servers
- 09:38 Tim: srv102 done, re-added cluster17 to the write list
- 04:09 Tim: repooled ES on srv107, schema change done
- 03:50 Tim: depooled apache on srv105, had old MW configuration, no ssh
- 03:45 Tim: starting max_rows change on srv102. srv107 is actually stopped due to disk full, fixing.
- 03:37 Tim: switching masters on cluster17 to srv103.
- 02:14 Tim: Killed job runner on srv107 to speed up schema change.
- 02:10 Tim: Brought srv142 and srv145 into ES rotation in cluster16.
August 31
- 23:05 mark: A parser bug in the PowerDNS Bind backend caused unavailability of the wikimedia.org zone for a few minutes, ouch...
- 22:55 mark: Deployed a PowerDNS pipebackend instance with this script on ns2.wikimedia.org (lily) only. Just one out of three nameservers for stability testing for now. Should there be major trouble, remove all "pipe" backend references from /etc/powerdns/pdns.conf.
- 18:38 Tim: Going to bed. Status is: srv107 replicating but locked with slow alter table. Can be re-added after it catches up. cluster18 is working, for no apparent reason, and should be migrated to max_rows=20M ASAP. cluster17 needs a master switch so that srv102 can be fixed, after that it should be re-added to the write list. Once srv142 is done copying, it can be restarted and repooled, as can srv145. No need to fix the replication there since it's an old cluster.
- 18:30 Tim: re-adding cluster19 to the write list, without srv107 which is still altering.
- 16:22 Tim: srv141 didn't work out, out of disk space, trying copy to srv142 instead (from srv145)
- 14:44 Tim: srv103 and srv110 done, repooling.
- 14:02 Tim: srv108 done, changed master to srv108, started max_rows change on srv107
- 13:51 Tim: started max_rows change on srv110. Not patient enough to do them one at a time.
- 13:38 Tim: copy to srv110 finished. Put srv110 in, srv103 left out for now for max_rows change
- 13:27 Tim: taking srv145 out of rotation for copy to new ext store srv141 (has same partitioning)
- 12:45 Tim: srv109 finished, starting on srv108
- 11:45 Tim: taking srv103 out of rotation for copy to new ext store srv110
- 11:37 Tim: alter table blobs max_rows=10000000; on srv109.
- 11:34 Tim: cluster is too much of a mongrel undocumented mess to set up new ext store servers, and we don't have that many candidates left anyway. Going to try saving the existing clusters.
- 10:27 Tim: received reports that cluster19 has gone the same way. Most likely all slaves and masters set up that time are affected and will fail roughly simultaneously. Will set up new clusters.
- 10:15 Tim: set mysql root password on external storage servers where it was blank
- 10:07 Tim: cluster17 master srv102 has stopped being writable for enwiki due to exhausted MyISAM index table size (max_rows=1000000). Removed from write list, working on it (the fix is sketched after this day's entries).
- 07:00 Tim: On srv189: added ddebs.ubuntu.com to sources.list. Installed debug symbols for apache.
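The recurring fix in the entries above, as a sketch (database name is a placeholder; raising MAX_ROWS rebuilds the MyISAM table with wider row pointers, which is why each server is depooled while it runs):

    mysql some_es_db -e "SHOW TABLE STATUS LIKE 'blobs'\G"     # Max_data_length shows the current cap
    mysql some_es_db -e "ALTER TABLE blobs MAX_ROWS=10000000;"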
August 30
- 22:11 mark: Set up an experimental IPv6 to IPv4 proxy on iris
- 17:13 Tim: killed long-running convert processes on srv152-189
August 29
- 21:00 jeluf: checked srv104, added it back to its ES pool, added cluster18 back to wgDefaultExternalStore
- 16:12 RobH: moved srv52 and srv56 from B2 to C4 for heat issues.
- 15:32 RobH: srv149 reinstalled as apache core.
- 13:08 Tim: images on kuwiki were actually broken because the move from amane to storage2 failed. The directory on amane was probably recreated by the thumbnail handler before the migration script created the symlink, resulting in a new writable image directory with no images in it. Merged the two directories and fixed the symlink.
- 12:00 domas: did space cleanups on amaryllis, and all DBs (all <80% disk usage now :) - preparing for vacation. VACATION!!! :)
August 28
- 22:50 mark: Set up a dirty, temporary test setup of PyBal on lvs2 doing SSH logins on all apaches for health checking.
- 21:43 RobH: reinstalled srv134 back online as apache core.
- 21:10 RobH: reinstalled srv130 back online as apache core.
- 20:09 RobH: searchidx1, search1, search2, search3, search4, search5, search6, & search7 racked with remote management enabled.
- 16:09 RobH: db9 reinstalled for misc db role.
- 13:28 Tim: removed dkwiktionary and dkwikibooks from all.dblist. Apparently they're visible on the web when they were previously removed. They were created accidentally years ago due to dk being an alias for da.
- They became visible due to Rob's changes to langlist.
- 05:59 Tim: Following complaint about bad uploads on kuwiki, running "find -type d -not -perm 777 -exec chmod 777 {} \;" in various upload directories with various maxdepth options.
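One concrete form of the 05:59 command (the mount path and depth are illustrative; both were varied per directory):

    cd /mnt/upload4/wikipedia/ku
    find . -maxdepth 2 -type d -not -perm 777 -exec chmod 777 {} \;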
August 27
- 22:57 RobH: srv127 reinstalled and back online as apache.
- 22:34 RobH: srv36 reinstalled and back online as apache.
- 22:09 RobH: srv117 reinstalled and back online as apache.
- 22:00 mark: Commented out most LVS related checks in /home/wikipedia/bin/apache-sanity-check which are no longer relevant
- 22:00 mark: Various changes to the Ubuntu installer, to make SM apache installs work, and for preseeding of NTP config.
- 21:48 RobH: srv81 reinstalled and back online as apache.
- 19:07 RobH: Purged cz.wikimedia.org redirect from all knams squids.
- 18:10 RobH: srv147 reinstalled and deployed as apache.
- 16:30 RobH: sq48 had a possible issue with hdc. Tested fine, cleaned and back online.
- 15:19 RobH: srv146 was read-only. Rebooted, fsck, restarted.
- 08:38 Tim: added FlaggedRevs stats update to crontab on hume
- 08:03 Tim: running FlaggedRevs/maintenance/updateLinks.php on dewiki
August 26
- 20:00 RobH: moved srv84 and srv85 from B4 to B3 rack.
- 18:39 RobH: moved srv82 and srv83 from B4 to B3 rack.
- 17:30 RobH: srv81 reinstalled and running apache. Needs ext store setup.
- 16:35 RobH: srv103 restarted and synced.
- 16:01 brion: srv103 serving pages with stale software but unreachable. needs to be shut down
- 14:53 RobH: reinstalled db10 for misc. db tasks.
- 13:27 Tim: disabled some user account on otrs-wiki
- 11:15 mark: Added coronelli to search pool 3 on lvs3
- 00:26 RobH: fixed my own typo in redirects.conf, pushed, gracefulled all apaches.
- 00:15 RobH: pushed some fixes on InitialiseSettings.php for a private wiki.
August 25
- 23:07 brion: enabled write API, let's see what happens!
- 22:41 brion: query.php disabled as scheduled.
- 22:07 brion: a SiteConfiguration code change broke upload dirs for a bit. reverted it.
- 20:15 brion: set wgNewUserSuppressRC to true; it was false, unsure why, and it's annoying
- 14:30 RobH: pushed dns changes to langlist to support cz. as well as a number of other langlist redirects not added to dns.
- 14:15 RobH: Fixed an error in my additions for the cz. wiki stuff, pushed out the redirects to apaches.
- 12:10 domas: mark stealing db10 for stuff
- 11:00 domas: reenabled db10, added db14 to s1, db9 given away to non-core tasks, added full contributions load to db16 (as it has covering index)
- 09:55 domas: reverted an instance where 'IndexPager' was causing filesorts... :)
- 08:00 domas: cleaned up hume / diskspace, was full, added /a to updatedb prunepaths, apt-get clean too - 4.5G released
- 08:00 domas: disabled db10 for db14 bootstrap
- 07:36 domas: updating FlaggedRevs schema on ruwiki.
- 02:26 brion: updating MW, including FlaggedRevs schema update (fp_pending_since, flaggedrevs_tracking)
August 24
- 17:15 domas: removing db9 entirely, crashed, disk gone...
- 07:20 Tim: deployed the TrustedXFF extension that I just wrote.
- 02:56 Tim: removed db9 from the contributions, watchlist and recentchangeslinked query groups. Long running queries (2000 seconds) from IndexPager::reallyDoQuery and ApiQueryContributions::execute, probably needs index fixes. Removed general load from the remaining query group server, db7.
August 22
- 21:34 RobH: will moved from A4 to A2.
- 21:00 RobH: diderot unracked
- 00:27 brion: FR feedback enabled on enwikinews as well
- 00:24 brion: Deleting email record rows from cu_changes; some had slipped through before we disabled the privacy breakage
August 21
- 23:47 brion: FlaggedRevs feedback enabled on test & labs
- 23:35 brion: Enabled experimental HTML diff on test.wikipedia.org, en.labs.wikimedia.org, and de.labs.wikimedia.org
- 18:17 RobH: Updated DNS entries to add a number of .cz domains. Also updated redirects.conf to support the added domains.
- 11:43 Tim: installing GlobalBlocking
- 02:42 Tim: returned db16 to general load, a less critical role
- 02:30 Tim: installed mysql-client-5.0 on db11-16. Installed ganglia-metrics on thistle, db1, db4, db7, db12, db13, db14, db15, db16.
- 02:20 Tim: offloaded query group read load from db16. System+user CPU disappeared.
- Recovery spike in I/O shows that replication was suppressed due to read activity. Caught up in ~8 minutes.
- 02:11 Tim: db16 is chronically lagged, probably overloaded with inflexible query group load
- db16 shows high flat system+user CPU since ~01:05
August 20
- 04:15 Tim: attempting to upgrade hume from Ubuntu 7.10 to 8.04
- 01:24 brion: experimentally lifting $wgExportMaxLimit from 1000 to infinity on enwiki -- testing hack to SpecialExport.php to use unbuffered query
August 19
- 08:38 Tim: done with lomaria
- 07:42 Tim: taking lomaria out of rotation to drop non-s2a databases and change its replication to s2a-only.
- 04:45 Tim: increased load on db13 to relieve db8, stressed by removal of lomaria from s2
- 04:10 Tim: A hotlinking mirror, getting images from thumb.php, was being visited at high rate, DoSing our storage servers. Referer blocked.
- 03:50 Tim: ixia disk space critical, fixed
- 03:45 Tim: Older s3 slave servers are showing signs of strain. Adding more s3 load to db11 to test its capacity.
- db11 is fine at 47% load ratio, reporting 80-90% disk util, await 5-7ms, load ~6
- 96% load ratio, reporting disk util ~90%, await ~6ms, load ~7.5. Wait CPU ~12%. Yawning in mock-boredom.
- 03:37 Tim: lomaria was relatively overloaded. Adjusted loads, put it in an s2a role since we haven't had any s2a servers since holbach was decommissioned
- 02:40 Tim: removed holbach, webster and bacon from db.php, decommissioned. Removed decommissioned servers from $wgSquidServersNoPurge.
- 02:27 Tim: compiled udpprofile on zwinger, started collector. Firewalled port 3811 inbound, /etc/init.d/iptables save. Updated MediaWiki configuration. Updated report.py on bart.
- 01:40 Tim: reduced apache "TimeOut" on srv38/39 from 300 to 10, to limit the impact of LVS flapping
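The 01:40 change as a sketch (config path is an assumption; the directive is spelled as in the entry above):

    sed -i 's/^TimeOut .*/TimeOut 10/' /usr/local/apache/conf/httpd.conf
    apachectl graceful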
August 18
- 23:00 RobH: added the image scaling servers back into the apache node group and updated their config files. This fixes the thumbnail generation issue evident on both uploads. and se.wikimedia (may have existed elsewhere as well, in fact, it most certainly must have.) All apaches restarted.
August 17
- 22:30 jeluf: restarted apaches on srv38/39 due to user reports about broken thumbnails.
August 16
- 13:20 mark: Reenabled ProxyFetch monitor on rendering cluster on lvs3, and set depool_threshold = .5.
- 12:58 Tim: removed ProxyFetch monitor from rendering cluster in pybal on lvs3
- 12:50 Tim: thumbnailing broke completely at ~03:00 UTC. The apache processes on srv38/39 were stuck waiting for writes to the storage servers. Couldn't find the associated PHP threads on the storage servers to see if something was holding them up, so I tried restarting apache on srv38/39 instead. Suspect broken connections due to regular depooling by pybal
August 14
- 18:55 domas: fixed db16 replication
- 18:50 brion: db16 replication is broken -- contribs/watchlists/recentchangeslinked for enwiki stopped at about 4 hours ago
- ??? ??? db16 crashed
August 13
- 17:10 Tim: Changed http://noc.wikimedia.org/conf/ to use a PHP script to highlight the source files from NFS on request, instead of them being updated periodically. Added a warning header to all affected files.
- 06:17 Tim: Removed old ExtensionDistributor snapshots (find -mtime +1 -exec rm {} \;), synced r39273
- 02:40 brion: fixed permissions on dewiki thumb dir -- root-owned directory not writable by apache worked for existing directories, but failed for the 'archive' directory needed for old-version thumbnails used by FlaggedRevs
August 12
- 21:06 mark: Moved LVS load balancing of apaches to lvs3 as well, using a new service IP (10.2.1.1)
- 18:10 brion: fixed up security config that disabled PHP execution in extension directories; several configs had this wrong and non-functional
- 12:45 tfinc: removed /srv/org.wikimedia.dev.donate & /srv/org.wikimedia.donate on srv9 and removed the apache confs that mention them.
August 11
- 23:53 mark: Moved traffic from Russia (iso code 643) to knams
- 23:53 mark: Moved the rendering cluster LVS to lvs3 as well.
- 22:45 mark: Deployed lvs3 as the first new internal LVS cluster host, and moved over the search pools to it using new service IPs (outside the subnet). The rest of the LVS cluster as well as the documentation are a work in progress - let me know if there are any problems.
August 10
- 17:43 Tim: freed up another 100GB or so by deleting all dumps from February 2008.
- 17:27 Tim: freed up a few GB on storage2 by deleting failed dumps: enwiki/{20080425,20080521,20080618,20080629}, dewiki/20080629.
August 8
- 22:46 RobH: setup network access LOM for db13, db14, db15, & db16
- 22:40 brion: set up 'inactive' group on private wikis; this is just "for show" to indicate disabled accounts, adding a user to the group doesn't actually disable them :)
- 21:15 brion: can't seem to reach the 'oai' audit database on adler from the wiki command-line scripts. This is rather annoying; permissions wrong maybe?
August 6
- 17:25 brion: updated dump index page to indicate dumps are halted atm
August 5
- 22:09 mark: Shutdown BGP session to XO for maintenance
- 18:27 RobH: db14, db15, db16 installed with Ubuntu.
- 18:24 brion: enabling flaggedrevs on ruwiki per [1]
- 17:09 brion: enabling flaggedrevs on enwikinews per [2]
- 6:20 jeluf: set wgEnotifUserTalk to true on all but the top wikis, see bugzilla
August 4
- 05:58 brion: dewiki homepage broken for a few minutes due to a bogus i18n update in imagemap breaking the 'desc' alignment options
August 3
- 14:15 robert: got reports about lots of failed searches on nl and pl.wiki, looks like diderot (again) failed to depool a dead server (rabanus), removed manually.
August 1
- 21:05 brion: forcing display_errors on for CLI so I don't keep discovering my command-line scripts are broken _after_ I run them (they showed no errors, so I thought they worked :). See the sketch after this day's entries.
- 06:39 Tim: wrote a PHP syntax check for scap, using parsekit, that runs about 6 times faster than the old one
- 04:58 Tim: installing PHP on suda (CLI only) for syntax check speed test
- 01:46 Tim: removed db1 from rotation, it's stopped in gdb at a segfault.
- 00:22 brion: aha! found the problem. MaxClients was turned down to 10 from default of 150 long ago, while the old prefix search was being tested. :) now back to 150
- 00:19 brion: just turning off the mobile gateway on yongle for now, it just doesn't appear to be working at full load. (files moved to subdir -- in /x/ it works fine seemingly). Server doesn't appear overly loaded -- CPU and load are low -- just the requests stick.
- 00:10 brion: installing APC on yongle, php bits are ungodly slow sometimes
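A per-invocation equivalent of the 21:05 display_errors change; the log does not say where the setting was forced, and the script name here is hypothetical:

    php -d display_errors=1 maintenance/rebuildSomething.php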
Archives
- Server admin log/Archive 1 (2004 Jun - 2004 Sep)
- Server admin log/Archive 2 (2004 Oct - 2004 Nov)
- Server admin log/Archive 3 (2004 Dec - 2005 Mar)
- Server admin log/Archive 4 (2005 Apr - 2005 Jul)
- Server admin log/Archive 5 (2005 Aug - 2005 Oct)
- Server admin log/Archive 6 (2005 Nov - 2006 Feb)
- Server admin log/Archive 7 (2006 Mar - 2006 Jun)
- Server admin log/Archive 8 (2006 Jul - 2006 Sep)
- Server admin log/Archive 9 (2006 Oct - 2007 Jan)
- Server admin log/Archive 10 (2007 Feb - 2007 Jun)
- Server admin log/Archive 11 (2007 Jul - 2007 Dec)
- Server admin log/Archive 12 (2008 Jan - 2008 Jul)