Server admin log/Archive 5
From Wikitech
(→3 february: numerous nfs hits were caused by parser code)
Revision as of 16:19, 3 February 2005
3 february
- 16:00 gabriel,jeluf: WIKISLOW!! tar job running on albert killed NFS performance. Killed tar job. mysqldump still running, might be killed when causing trouble.
- midom: The biggest impact was on Parser::replaceInternalLinks-image hitting NFS. Needs refactoring?
1 february
- 23:55 Jamesday: en and all non-big wikis: "alter table cur drop index cur_namespace". Others to follow during off peak read only time. The cur_namespace index was being used too many times where namespace_title should be used, including sometimes for replaceLinkHolders.
- Now complete for all wikis.
- 22:00 jeluf: basic OS setup of moreri,bart,bayle,browne. Coronelli died during the process.
30 january
- 17:26 hashar: fixed sv.wikinews.org and zhminman.wikibooks.org (incorrect entry in all.dblist) and updated wikinews.dblist.
- 17:22 hashar: s/ariel/suda/ in dsh group mysqlslaves.
29 January
- erik: Spanish, French and Swedish Wikinews created
28 January
- 17:34 hashar: updated mrtg graphs to show benet server.
- erik: Dutch Wikinews created
27 January
- 01:00 gwicke: experimented with different load balancing settings, but didn't get a smooth state. Reverted the changes.
25 January
- 07:00 Jamesday: khaldun apache stopped and InnoDB buffer size raised to 700 MB to give it some hope of staying current with the very high update rate.
- 06:27 brion: added fur.wikipedia.org (Friulian)
- 03:30 brion: upgraded Mailman from 2.1.5c2 to 2.1.5 final release
24 January
- 22:58 gwicke: moved 207.142.131.203 from rabanus to maurus
- 22:55 gwicke: moved 207.142.131.204 from rabanus to benet, rabanus was doing 100% cpu
23 January
- 16:44 gwicke: wprc is restarted twice daily from cron at 4:00UTC and 15:00UTC from now on
- 15:00 Mark:
- 14:05 Jamesday: changed ICPAgent delay on khaldun from 4.9 to 10 and will adjust more because it was reporting modulated clock mode. Idea is to give it low load unless the power is really needed.
- 13:17 gwicke: Added /usr/local/bin/icpagent to rc.local on apaches. That's a script that has the host-specific timings and starts /home/wikipedia/bin/icpagent. ToDo: adapt the wikidev-sudo /h/w/bin/squid-stop script to do the same for icpagent
22 January
- 19:24 gwicke: did some fine-tuning on apache weights using the new 0.1ms-adjustment feature in icpagent, cpu usage now pretty even.
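A rough sketch of why small per-host reply delays can even out load, as in the 19:24 entry above. It assumes Squid sends the request to whichever apache answers its ICP query first; the host names, delays and penalties below are hypothetical, not the values used on the cluster.

```python
def pick_peer(reply_delay_ms, busy_penalty_ms):
    """Return the host whose ICP reply would arrive first.

    reply_delay_ms: configured reply delay per host (hypothetical values).
    busy_penalty_ms: extra latency a loaded host adds on top (assumed model).
    """
    arrival = {
        host: delay + busy_penalty_ms.get(host, 0.0)
        for host, delay in reply_delay_ms.items()
    }
    # The earliest reply wins, so a busy host is picked less often.
    return min(arrival, key=arrival.get)


# Hypothetical hosts: the idle one wins despite a slightly larger base delay.
delays = {"apache-a": 5.0, "apache-b": 5.0, "apache-c": 5.1}
load = {"apache-a": 0.4, "apache-b": 0.3}
print(pick_peer(delays, load))
```

Under this model a 0.1 ms change in a host's delay is enough to shift traffic between otherwise similar hosts.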
21 January
- 20:30 gwicke: tweaked the 'old' purge function to read only once every 200 purges, should be faster for >1 squids. Enabled it in CommonSettings.php, $wgMaxSquidPurgeTitles at 500. Deferred updates don't seem to be deferred at all currently.
- 07:02 Tim: implemented pfsockopen()-based squid purging. Together with the lock time reduction implemented last night, saves have now been brought down to 675ms in low to moderate traffic. Increased $wgMaxSquidPurgeTitles to 5000 since that will now only take 5 seconds or so.
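A minimal sketch of the purge batching described in the 07:02 entry above. It only builds the raw PURGE requests and estimates the time; it is not the MediaWiki implementation, the function names are made up for illustration, and the 1 ms per purge figure is simply derived from "5000 in 5 seconds or so".

```python
def build_purge_requests(urls, max_titles=5000):
    """Build one raw HTTP PURGE request per URL, capped at max_titles."""
    return [
        f"PURGE {url} HTTP/1.0\r\nConnection: Keep-Alive\r\n\r\n"
        for url in urls[:max_titles]
    ]


def estimated_seconds(purge_count, ms_per_purge=1.0):
    """Estimate wall time for a batch (ms_per_purge is an assumption)."""
    return purge_count * ms_per_purge / 1000.0


# Hypothetical URLs; anything beyond the cap is dropped.
urls = [f"http://en.wikipedia.org/wiki/Page_{i}" for i in range(6000)]
requests = build_purge_requests(urls)
print(len(requests), estimated_seconds(len(requests)))
```

The requests would be written to an already open connection to each squid, which is where a persistent socket saves the per-purge connection setup.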
20 January
- 23:29 kate: properly fixed 404.php by actually copying it into the live htdocs directory.
- 21:00 Jamesday: suda taken out of service after innodb had problems opening some tables (ERROR: 1016 Can't open file: 'querycache.InnoDB'. (errno: 1)). May be because khaldun was on 4.0.22 and suda is 4.0.20 or some copying error - the tables seem OK on khaldun.
- 19:30 gwicke: ICP agent running on all apaches, with the default delay of 5ms on most, and with the new default nice of 10. Tingxi and Dalembert look a bit spiky cpu-wise, all others seem to be very even now. ToDo: replace squid in rc.local with "/home/wikipedia/bin/icpagent -d -t 5"
- 19:20 gwicke: enabled tidy again, Anthere had problems on meta with broken markup. Monitoring performance.
- 19:00 Jamesday: test running /home/wikipedia/bin/apache-restart-loop on zwinger. Does a restart of one apache, graceful restart of all with 20 second delay, in an infinite loop. If loop works a cron version can be used - we'll see how it goes. With current apache count this gives each a graceful restart once every 6 minutes and normal restart about once every 90 minutes.
- 14:30 Jamesday: khaldun syslogd reporting temperature above threshold and running in modulated clock mode.
- 14:25 Jamesday: khaldun and suda catching up in replication. Suda will enter service when caught up.
- 11:45 Jamesday: khaldun mysqld shut down for copy to suda. exec/Relay_Master_Log_File: ariel-bin.009 Exec_master_log_pos 194151813
- 06:22 kate: split 'apaches' dsh group into 'apaches' and 'mediawiki-installation'
- 02:00 brion: upgraded yongle's memcached to 1.1.12rc1 on brad's advice (was ancient 1.1.10). will do others shortly if nothing is broken horribly
- 01:20 gwicke: testing icpagent with static delay of 2ms and different nice levels on different hosts. Disabled perlbal, tingxi and dalembert are running apaches. No delay in icpagent doesn't seem to work, 10ms appeared relatively slow. Guess the scheduler needs a time > 0 to engage, should tick every ms on 2.6. Not yet tested: weights on squid, connection limits on squid per peer.
- 00:49 brion: returned tingxi and dalembert to the apaches group in dsh so that their files get updated and they're no longer COMMITTING EDITS TO THE WRONG MASTER SERVER. set suda's mysqld to read-only and SHUT IT DOWN ENTIRELY.
19 January
- at some point brion: disabled tidy to see what effect it would have on cpu usage
- 08:53 Tim: doubled number of memcached instances to 20, adding some to the 4 internal apaches: rose, smellie, anthony and biruni
- 0:30 jeluf: disabled webshop upon request of the board.
18 January
- switched master to ariel after disk holding logs on suda filled.
- 21:30 midom: spotted strange apache behaviour (child crashes). actually it was just logs on new boxes not rotating and 2GB file size limit was hit. Kate fixed it.
- some time, someone: something broke. Ariel now mysql master.
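The 21:30 entry above traces the child crashes to unrotated logs reaching the 2GB file size limit. A small check along these lines could warn before that happens; the 90% threshold and the function names are assumptions for illustration.

```python
import os

# 2 GiB - 1, the largest size a signed 32-bit file offset can represent.
LIMIT_BYTES = 2**31 - 1


def near_limit(sizes, limit=LIMIT_BYTES, threshold=0.9):
    """Return the names whose size is at least threshold * limit, sorted."""
    return sorted(name for name, size in sizes.items() if size >= threshold * limit)


def scan(paths):
    """Stat each existing path and report the ones close to the limit."""
    sizes = {p: os.path.getsize(p) for p in paths if os.path.exists(p)}
    return near_limit(sizes)
```

For example, `scan(["/var/log/httpd/error_log"])` (a hypothetical path) returns the paths that need rotating soon.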
17 January
- 18:23 hashar: apache-stop && apache-start on avicenna to kill a 156MB defunct convert process :(
- 14:06 Tim: Fixed ganglia on the Paris squids and on the internal apaches. This mainly involved synchronising gmond versions -- all are now 2.5.6. Also, gmetad really needs to get its data directly from a gmond, not from a gmetad intermediary. Various artifacts due to stopping or restarting gmetad and gmond are visible.
- some time kate: removed yongle and isidore from apache because it doesn't work with memcached. want to put perlbal on them instead.
- 00:37 brion: upgraded Bugzilla to 2.18 stable
16 January
- 18:30 Jamesday: emergency switch from Paris squids - all traffic was going via rate-limited links, including France. Instructions at Squids.
- 12:40 gwicke: reconfigured French squids to use HTCP sibling communication, which is superior to ICP because it sends more detailed questions. Seems to work fine (it has been used in Florida for a year now), to verify do
tail -f /var/log/squid/access.log | grep SIBLING
on the french squids, you should see sibling hits from the other squids.
- 12:00 midom: as I installed the proctitle module on apaches yesterday, and Setup.php includes hooks for it, mediawiki process status can be simply verified with ps or top -c
- 02:35 brion: reinstalling PHP 4.3.10 with mbstring module
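The tail/grep check in the 12:40 entry above can be extended to count sibling hits per peer. This is a sketch that assumes Squid's native access.log layout, where the hierarchy field looks like SIBLING_HIT/peer; the sample lines are invented.

```python
def count_sibling_hits(lines):
    """Count access.log entries per peer whose hierarchy field mentions SIBLING."""
    counts = {}
    for line in lines:
        for field in line.split():
            # Hierarchy field is assumed to look like "SIBLING_HIT/<peer>".
            if field.startswith("SIBLING") and "/" in field:
                peer = field.split("/", 1)[1]
                counts[peer] = counts.get(peer, 0) + 1
    return counts


# Invented sample lines in the assumed native log format.
sample = [
    "1105876800.123 12 192.0.2.7 TCP_MISS/200 5120 GET http://fr.wikipedia.org/wiki/Accueil - SIBLING_HIT/192.0.2.21 text/html",
    "1105876801.456 80 192.0.2.8 TCP_MISS/200 7301 GET http://fr.wikipedia.org/wiki/Paris - DIRECT/192.0.2.99 text/html",
]
print(count_sibling_hits(sample))
```

A peer that never shows up in the counts is a hint that sibling queries to it are not being answered.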
15 January
- 23:20 gwicke: Mystery non-bug with no-cache headers when doing test requests from the fr squids resolved. Solution: Forwarded-For was empty, so the ip strip function in Setup.php had nothing to chew on. Empty ip string matched all user's newtalk flags further down the road which caused User::newTalk to return true which disabled caching in SkinTemplate.php... See also MediaWiki caching for some background on how MW uses http headers.
- 17:32 hashar: put back isidore in apache pool
- 18:13 gwicke: made wprc more reliable, now uses vtun in persist mode and a cron script checks if it's running ok every minute.
- 13:31 Jamesday: enwikibooks is read only while I recover the most recent 73 records. Ariel is out of service while I work out why I can neither create nor drop enwikibooks.old on it.
- 14:00 Tim: Fixed horribly broken 404.php, which was producing infinite redirect loops in response to almost any 404 error
- 06:00 (or so) kate: removed pen and moved the site to use perlbal instead. seems better.
14 January
- 1200EST Baylink: innocence added the squid name to the error pages; benet seemed the trouble spot; Steps Were Taken.
- 15:30 Jamesday: converted interwiki table to MyISAM - should make for faster truncate/update operations when it's updated.
13 January
- 03:15 brion: took webster's mysql offline to make a copy for an experimental public replication server
12 January
- 23:45 kate: installed pen on dalembert and moved load balancing to use it instead of squid
- 20:00 mark: Installed Cricket on larousse.
- 09:35 hashar: JeLuF fixed benet ganglia, which was using the apache configuration instead of the squid one. benet shows as down in the ganglia apache view, probably because we have to move (and merge) rrds.
11 January
- 20:08 kate: put powerdns on zwinger using mail.wm's IP. it's 2ndary server for gdns.wikimedia.org zone. seems to be working fine.
10 January
- 23:20 gwicke: All css should be cached again. User-specific css is moved to /User:Somebody/-?action=raw, Vary and maxage=0 params removed.
- 07:27 hashar: installed libpng-dev on larousse. Nagios compiled with gd support.
9 January
- 23:30 jeluf: made benet a squid.
- 22:02 kate: ennael crashed again. removed it from DNS.
- 21:22 kate: moved several wikis (fr de nl it lb ch wa) to geodns and fr squids for users within those regions. seems to be working, except for unexplained reboot on ennael.
- 21:17 hashar: changed gu and hi wiktionaries sitenames & metanamespaces
- 16:45 hashar: launched nrpe (nagios) on squids (!browne)
- earlier kate: dshroot -a yum upgrade
8 January
- 22:50 brion: restarted replication on bacon. somehow it was trying to pull suda_relay_bin.012 instead of suda_log_bin.012; about 12 hours behind, now catching up...
- 14:48 hashar: launched the "Nagios Remote Plugin Executor" as a daemon on all apaches. Need to create an init.d script later :o)
- 00:10 hashar: installing bunch of perl cpan modules on larousse. Installing nagios as well in /usr/local/nagios/
6 January
- 09:53 gwicke: Configured French Squids, enabled purging. Log rotation is enabled, log transfer isn't. Mem settings are very conservative (32 MB) for now. All three seem to work fine, but more testing can't harm of course.
5 January 2005
- 00:54 hashar: s/suda/ariel/ in "mysqlslaves" dsh group.
4 January 2005
- 23:17 hashar: added "paris" dsh group. Servers are not in other groups.
- 22:26 gwicke: Updated this wiki to 1.4. Happy new year.
3 January 2005
- 21:30 shaihulud: stopped apache on yongle, it was killing the wiki... Can we do something about the memcached problems?
- 20:22 shaihulud: after asking most devs (sorry to those I forgot or could not find), phe now has shell access on the cluster
- 19:23 brion: fixed database config for bugzilla, shop
- 16:45 JeLuF: Stopped postfix on albert: OTRS seems to use ariel as master. Need to fix it tonight.
- 09:45 Jamesday: Suda now master.
29 December
- 21:05 Jamesday: added and enabled $wgUseLuceneSearch = false in CommonSettings.php to re-enable search for all wikis which were set up to use Lucene - all had apparently had no search at all since yesterday.
- 10:20 jeluf: Added hotfix in Title::legalChars() to disallow character %AD in titles and usernames on Latin1 wikis.
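For context on the 10:20 entry above: byte 0xAD in Latin-1 is the soft hyphen, which is normally invisible when rendered. The check below is an illustration of the idea only, not the actual hotfix; the function name is hypothetical.

```python
SOFT_HYPHEN = b"\xad"  # Latin-1 encoding of U+00AD SOFT HYPHEN


def is_legal_latin1_title(title):
    """Return False if the title contains a soft hyphen or is not Latin-1."""
    try:
        raw = title.encode("latin-1")
    except UnicodeEncodeError:
        # Characters outside Latin-1 cannot appear in a Latin1 wiki title.
        return False
    return SOFT_HYPHEN not in raw


print(is_legal_latin1_title("Main Page"))
print(is_legal_latin1_title("co\u00adoperate"))
```

Two titles that differ only by a soft hyphen look identical on screen, which is the usual reason to reject the character.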
28 December
- 16:20 brion: Reduced MaxClients to 24 to keep the 512MB apaches from overflowing memory too badly if many threads hit PHP's maximum memory use at once. Should vary this across the larger machines to allow for some breathing room
- 15:30 Some kind of GC/timeout problem with the Lucene search ate up apache threads. Search disabled for now.
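The MaxClients figure in the 16:20 entry above can be sanity-checked with simple arithmetic. This sketch assumes the 20MB PHP memory_limit set on 19 December is still the worst case per child, and the 32 MB reserve is an illustrative guess, not a measured number.

```python
def max_clients(ram_mb, per_child_mb, reserve_mb=0):
    """Largest child count whose worst-case memory fits in RAM minus a reserve."""
    return (ram_mb - reserve_mb) // per_child_mb


# 512 MB box, 20 MB worst case per child (assumed from the 19 December entry).
print(max_clients(512, 20))
print(max_clients(512, 20, reserve_mb=32))
```

With no reserve the bound is 25 children; setting aside 32 MB for everything else gives 24, that is 480 MB of worst-case PHP memory.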
27 December
- 14:35 shaihulud: removed phe (renamed public_key), if somebody is against phe, please tell me
- 14:25 shaihulud: added the french user phe, as wikidev, he helped me a lot and we need sysadmins :)
- 13:15 shaihulud: webster is down, removed from load balancing
- (various) kate: moved en*, fi*, de* and fr* to lucenesearch, now running on Rose. See the end of CommonSettings.php. Also fixed load balancing by defining $sqle* and $sqli* as well as $sql*.
- 2:00 jeluf: Moved OTRS on albert to HTTPS
26 December
- 00:05 kate: returned briefly to move enwiki to the LuceneSearch extension running on Kluge. The rest will follow shortly.
25 December
- 01:12 brion: bacon caught up, put back on duty
- 00:48 brion: restarted replication on bacon
- 00:30 brion: bacon was 4822 seconds behind and weird history & diff inconsistencies were seen on at least en and ja wikis. took bacon out of 1.4 load balancing rotation for now.
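The 00:30 entry above removes a lagging slave from rotation by hand. A sketch of the same decision as a filter follows; the 30 second threshold and the lag value for suda are hypothetical, only bacon's 4822 seconds comes from the log.

```python
def in_rotation(slave_lag_s, max_lag_s=30):
    """Return the slaves whose replication lag is known and within max_lag_s."""
    return sorted(
        host
        for host, lag in slave_lag_s.items()
        if lag is not None and lag <= max_lag_s
    )


# bacon's lag is from the entry above; suda's value is made up.
print(in_rotation({"bacon": 4822, "suda": 0}))
```

A lag of None stands for a slave whose replication is stopped, which is also kept out of rotation.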
23 December
- 11:22 brion: installed ICU and php_normal on apaches & zwinger, but left it disabled. having a problem where all page views end up as the main page; can't reproduce it in isolation, but regularly reproduce it when turning it on for the whole farm.
- brion: converted *.wikipedia.org to 1.4
22 December
- 23:00 shaihulud: restarted mysql on suda, benet is now another db slave
- 19:50 shaihulud: stopped mysql on suda, time to copy db to benet
- jeronim: set up an offsite backup machine with an 80GB drive at my house - will have backups of uploads and some other stuff - details on how to log in etc are in /home/wikipedia/doc/backup_locations
- 8:00 brion: after much fuss, think i've got the wiktionary messages sorted out.
21 December
- 21:30 JeLuF: installed TeX and ImageMagick, as listed on Apaches
- 21:00 JeLuF: installed tidy on biruni, rose, smellie, anthony and benet. Are all the other tools installed?
- 14:10 Tim: put biruni, rose, smellie, anthony and benet into service as apaches. Benet required a change to CommonSettings.php since it can't contact the mysql servers on the 10.* addresses.
- 11:30 brion: Upgraded wikisource and wiktionary to 1.4. Gave squids a master configuration file in /h/w/conf/squid. Installed clamav on the apaches; master conf at /h/w/conf/clam
- 05:30 Tim: added restrict lines for NTP servers in ntp.conf on the French squids. TICK TICK TICK TICK... ahhh that's the sound of 35 servers ticking in synchrony
- 03:48 Tim: added 10.0.* to zwinger's allowed ntp client list in ntp.conf. This allows the 4 internal servers to synchronise.
- 01:44 Tim: routes on benet were set up incorrectly, especially 10.0.* which was sent through eth1, which is not connected. Fixed this problem and set the default gateway to izwinger.
- 01:25 Tim: synchronised /etc/profile across machines except albert, all now use /etc/profile.local for our customisations. Albert's /etc/profile was the only one that was different from the start, it used /etc/profile.local by default.
19 December
- 23:45 brion: reinstalled PHP with 20MB memory_limit. (can raise limit in php.ini or from CommonSettings.php if necessary)
- 21:20 brion: friedrich having weird errors, took out of rotation
- 13:24 Tim: set up default gateways and proper hostnames on biruni, rose, smellie and anthony.
- installing new servers, work in progress: http://wp.wikidev.net/User_talk:Shaihulud. When running the setup-new-fc2-servers, ntp does not sync, have to check.
18 December
- 23:40 brion: dalembert was showing errors on 1.4 wikis executing 'SHOW SLAVE STATUS' on suda. Executed 'FLUSH PRIVILEGES' on suda and it seems to have stopped for now. Added hack to 1.4 SQL error reporting to include the db server IP to make these easier to track down.
- 16:45 shaihulud: copied /usr/local on biruni,rose,smellie,anthony from another apache. All other things still to do
- 15:20 shaihulud: added benet,biruni,rose,smellie,anthony. benet is on public .210 others are on private .0.25 -> .28
- public root passkey is on french squids, name added in our /etc/hosts
17 December
- 23:00 jwales: Installed new servers, IPs 10.0.0.25 to .28. Allow root ssh login. No names yet.
- 22:35 brion: upgraded all apaches to PHP 4.3.10 (minor security updates in .9 and .10)
- 21:00 jwales: Installed a new server, IP .210
- 07:00 brion: upgraded all Wikiquotes to 1.4 (1.4 upgrade)
15 December
- 22:00 jeluf: Reworked spamassassin on albert, teaching bayesian filter, adjusting weights.
- nn:nn Jamesday: For a two step switch to suda, read only, search off, remove the # from #$wgEmergencyMasterSwitch = true; in CommonSettings.php and sync it.
- nn:nn Jamesday: holbach and webster are now handling en and ja instead of en and zh.
14 December
- 23:30 jeluf: moved all image files that were on albert but not on zwinger AND weren't from today to /var/zwinger/htdocs/wikipedia.org/upload/ZOMBIES/... Those were images that have been deleted between the last rsync and today's copy of files from zwinger to albert.
- 12:20 shaihulud: set cronjob on khaldun to recache special pages on all wikis every night at 0:00
- 10:11 brion: moved uploads onto albert. They're in /var/zwinger/htdocs/uploads [still an ugly symlink tree for now], accessible from zwinger+apaches as /mnt/wikipedia/htdocs/uploads. Zwinger is very happy to have the load off. There may still be some images that need to be totally re-synced; files missing on albert will be pulled transparently from zwinger in the meantime.
- 06:20 brion: starting rsync fun, moving uploads to albert permanently
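The ZOMBIES selection in the 23:30 entry above is a set difference with a date filter: files present on albert, absent on zwinger, and not modified today. A minimal sketch, with hypothetical file names and dates:

```python
from datetime import date


def find_zombies(albert_files, zwinger_files, today):
    """Return paths on albert that zwinger lacks and that are not from today.

    albert_files: dict mapping path -> modification date on albert.
    zwinger_files: set of paths present on zwinger.
    """
    return sorted(
        path
        for path, modified in albert_files.items()
        if path not in zwinger_files and modified != today
    )


# Hypothetical listing: only old.png is old and missing from zwinger.
albert = {
    "en/a/old.png": date(2004, 12, 10),
    "en/b/new.png": date(2004, 12, 14),
    "en/c/kept.png": date(2004, 12, 1),
}
print(find_zombies(albert, {"en/c/kept.png"}, today=date(2004, 12, 14)))
```

Excluding today's files matters because fresh uploads land on albert only and would otherwise be mistaken for deleted images.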
12 December
- 22:00 jeluf: restarted squid on rabanus.
- 05:20 kate: removed my ssh public key. mail me if something I did needs documenting. bye - it's been fun.
11 December
- 12:00 jamesday: en and zh now sharing holbach and webster (new DBs), rest sharing suda and bacon.
- 10:00 brion: 1.4 upgrade on meta, wikinews, and wikisource
- Tim: Meta:spam blacklist can be edited to change site-wide spam regex blocks
10 December
- 20:20 kate: holbach & webster set up as mysql slaves. holbach is in production, webster is still catching up
- 18:37 kate: ariel was taking 200+ seconds to process updates and the site was mostly down. restarted mysqld and it fixed itself.
- 14:13 kate: started mysqld on webster by accident. killed it.
- 13:16 Tim: installed dsh on albert. Put a cron job on albert to copy node_groups from zwinger on a daily basis. This is for redundancy, so we can use dsh when zwinger is slow or down
- 0:20 jeluf: fixed mrtg config in /home/wikipedia/htdocs/wikimedia/live/conf: Added the fourth squid server.
8 December
- 17:45 brion: somebody changed the memcached configuration but didn't update php-1.4/CommonSettings.php. updated it to match 1.3.
7 December
- 14:45 brion: switched commons and wikipedia.org from NFS-based to locally-based docroots (symlink farms themselves, but still)
- 08:59 Tim: installed the CVS client on ariel
6 December
- 15:00: brion: switched some 'w' bits from links to NFS to links to local copy. not sure if it makes much of a difference
- 04:00: brion: Upgraded commons to 1.4beta (1.4 upgrade)
3 December
- 08:10: Tim: moved demo.wikinews.org to en.wikinews.org
- 06:40: jeluf: added dalembert to squid again.
- 05:50: jeluf: removed dalembert from squid configs. Server was heavily loaded, trying to reduce load.
- 04:50: brion: Noticed someone had enabled Special:Asksql; disabled it again. If there was a consensus to turn on this highly dangerous feature again, nobody told me.
1 December
- 10:53: Jamesday: bacon and suda out of service to load missing stop words and switch to a new stop word list. Search is off for all wikis while this runs.