Server admin log/Archive 5

From Wikitech
28 March 04:57 (UTC)

Ganglia

Squid stats

21 July

  • 14:21 jeronim: updated DNS for wikibooks.org to include all 6 squid IPs
  • 07:45 tstarling: moved Wikisource from sources.wikipedia.org to wikisource.org

20 July

  • 20:00 jeluf: moved articles in de.wikiquote from names starting with Wikiquote: to namespace 4

19 July

  • 10:00 Jamesday: copying the no-query-cache setting for Ariel's MySQL to my.cnf. Data since Tuesday 12:00 showed an immediate response time improvement, apparently sustained until miser mode was turned on. Will try a small cache eventually to see if that beats none.
  • 08:00 jeluf: set squid back to 10 and 20 GB storage. Performance with 2MB is not acceptable.
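
The no-query-cache setting mentioned above would look roughly like this in my.cnf (a sketch; the variables are standard MySQL, but the exact layout of ariel's config is assumed):

```ini
[mysqld]
# disable the query cache entirely, per the 19 July entry
query_cache_type = 0
query_cache_size = 0
```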

18 July

  • 20:30 gwicke: squid disk cache is set to only 2 MB temporarily to clear old es pages from the cache; it needs to be raised back to the commented-out values (20 GB on maurus and coro, 10 GB on browne) when the interface is fixed. Run squidhup afterwards. Also disabled the parser cache for es in InitialiseSettings; it needs to be cleared as well. Please write some docs on how.
  • 20:00 shaihulud: es.wikipedia converted to UTF-8; the MediaWiki cache needs to be cleared if somebody knows how.
  • 09:00 shaihulud: dropped the fulltext index on en, de, ja and fr on will, to stop the lag.
  • 15:18 tstarling: converted the wikipedias to use the shared document root layout, like the wikiquotes. Obsolete directories moved to old_wiki_dirs. Declared wiki.conf obsolete, replaced by the far smaller remnant.conf. Set up rewrite rules for /stats, redirecting to the appropriate stats directory in wikimedia.org.
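
A sketch of what the commented-out squid cache sizes from the 18 July entry could look like in squid.conf (the cache directory path and cache_dir type are assumptions; only the sizes come from the log):

```
# squid.conf sketch; size is in MB (20 GB on maurus/coro, 10 GB on browne)
cache_dir ufs /usr/local/squid/cache 20480 16 256
```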
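
The /stats rewrite rules mentioned above might look something like this (a hypothetical sketch; the actual rule and the stats directory layout on wikimedia.org are not shown in the log):

```
RewriteEngine On
# redirect e.g. en.wikipedia.org/stats to the matching stats dir on wikimedia.org
RewriteRule ^/stats(/.*)?$ http://wikimedia.org/stats/%{HTTP_HOST}$1 [R,L]
```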

16 July

  • 23:00 gwicke: looking into formal grammar and parsers, exploring bisongen and swig

15 July

  • 10:00 shaihulud: Enabled Miser mode until we get a dedicated server for slow queries.
  • 00:30 brion: Lots of Wantedpages queries on en bogged down ariel for about 10 minutes. Killed the threads and it's fine now. Ganglia shows a long load bump on suda before it switched to ariel; please check the replication load setup, and secure the special pages better.

14 July

  • 19:00 shaihulud: added a crontab entry on zwinger to recache special pages at 08:00 and 21:00
  • 18:00 gwicke: added a script in my crontab on zwinger that creates a nightly snapshot tar of the stable branch (currently REL1_3); the URL is http://download.wikimedia.org/phase3_stable.tar.bz2. Changing the sf.net page to link to this.
  • 15:30 gwicke: collected meta:Apache hardware quotes from my local shop; the cheapest (Athlon XP 2.8, 1 GB RAM, 80 GB HD) is €370/$444. Similar prices at http://www.pricewatch.com, possibly others or local Tampa shops
  • 10:15 gwicke: colorized wprc
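
The two cron jobs above could be sketched like this (the script path and maintenance script name are assumptions; only the schedule and the tarball target come from the log):

```
# zwinger crontab sketch
# recache special pages at 08:00 and 21:00 (hypothetical script path)
0 8,21 * * *  php /home/wikipedia/common/maintenance/updateSpecialPages.php
# nightly snapshot tarball of the stable branch (REL1_3)
0 3 * * *     tar -cjf /home/wikipedia/htdocs/download/phase3_stable.tar.bz2 -C /home/wikipedia phase3_stable
```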

13 July

  • 23:30 gwicke: wprc installed, finally working.
  • 23:00 gwicke: Made miser mode time-dependent in CommonSettings, enabled automatically between 13:00 and 19:00 UTC. Did more GFS testing with multiple-hour bonnie++ runs. Some discussion about moving CommonSettings, and at some stage also images, off NFS. Possible alternatives: a proxy rewrite to a central image server plus some scripting to get the images there, AFS as an interim solution, or waiting for the GFS failover code release (likely on Friday).
  • 10:00 shaihulud: enabled miser mode. Really need a second db slave so we can add more apaches :)
  • 09:36 shaihulud: restarted apache and squid on suda after adding a missing file and php.ini from another apache. What is the problem with webshop?
  • 06:14 jeronim: killed squid on suda as there are apparently webshop problems (register globals?) due to misconfiguration of suda's apache.

12 July

  • 21:00 or so, gwicke: Added suda again as apache
  • 18:16 jeluf: killed squid on suda as apache produces <p> where none should be
  • 15:00 gwicke: suda runs as apache now
  • 15:00 shaihulud: disabled suda in load-balancing, still a db slave

11 July

  • 14:30 shai, tim, james: ariel is the master db; will and suda are slaves
  • 08:00 shaihulud: ariel is back, disabled miser mode and enabled full text search

10 July

  • jeronim: IRC proxy and a simple TCP forwarder to freenode running on zwinger (access restricted in iptables) - more information: IRC forwarding
  • 06:00 TimStarling, jeluf: Updated postfix. chown'ed /home/mailman/aliases.mailman to root:nobody, so that mailman scripts are called as nobody, not as mailman. Before this, mailman complained:
Jul 10 06:11:12 zwinger postfix/local[26093]: CBEDB1AC0004: to=<info-de-l@wikipedia.org>,
relay=local, delay=1, status=bounced (Command died with status 2: 
"/home/mailman/mail/mailman post info-de-l". 
Command output: Group mismatch error.  Mailman expected the mail wrapper script to be
executed as group "nobody", but the system's mail server executed the mail script as
group "mailman".  Try tweaking the mail server to run the script as group "nobody",
or re-run configure,  providing the command line option `--with-mail-gid=mailman'. )
  • 04:30 shaihulud: enabled miser mode, disabled text search, disabled ariel in load-balancing. Time to rebuild db on ariel

9 July

  • 16:20: shaihulud : disabled miser mode
  • 06:20: tim: use maintenance/fix_message_cache.php to fix complaints about non-local messages. memcached is breaking regularly; to avoid hitting the database and making the site very slow, the web servers have been set to use local messages when memcached fails.
  • 01:33: jeronim: re-enabled text search, as ariel has caught up. Miser mode still on.
  • 01:01: jeronim: set miser mode on too to try to move things along more quickly
  • 00:55: jeronim: disabled text search in an attempt to unload ariel to let it sync to suda (replication lag 15 minutes)

8 July

  • 15:00: shaihulud: removed will. Too slow for heavy special queries: Lonelypages, etc.
  • 14:30: shaihulud: added will to the load-balancing; ariel replication was lagging a lot because of the load
  • gwicke did some benchmarks of MySQL/MyISAM vs. BerkeleyDB/Python API, results at meta:Database benchmark. Working on a concept for a wiki based on SubWiki (using Subversion/BerkeleyDB and GFS). Got GFS working now, needs testing.

4 July

  • jeronim: htdocs backup to vincent is now done from root's crontab at 2 a.m. each day, using zwinger's rsyncd, like this:
nice -n 19 rsync --stats --verbose --whole-file --archive --delete --delete-excluded \
  --exclude=upload/timeline/ --exclude=upload/thumb/ --exclude=upload/**/timeline/ \
  --exclude=upload/**/thumb/ --exclude=**/dns_cache.db zwinger::htdocs /var/backup/htdocs
    • not sure if --delete (which removes files on the destination if they have been removed from the source) is such a good idea, as if somebody accidentally deletes things from the source and doesn't notice, then they will soon be gone from the backup too.
    • OTOH, image tarballs are planned to be generated from the backup, and having too much extraneous junk in them is no good
      • could solve this by keeping a few weekly tarballs somewhere
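
The "few weekly tarballs" idea above could be sketched as a small shell helper (the function name and paths are hypothetical):

```shell
# snapshot_backup SRCDIR NAME DESTDIR
# creates DESTDIR/NAME-<ISO week>.tar.gz from SRCDIR/NAME, so dated
# copies accumulate and survive an accidental --delete on the rsync side
snapshot_backup() {
  week=$(date +%G-W%V)
  tar -czf "$3/$2-$week.tar.gz" -C "$1" "$2"
}
# e.g. weekly from cron: snapshot_backup /var/backup htdocs /var/backup/weekly
```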

3 July

  • 21:03 gwicke: enabled $wgPutIPinRC; grepping the logs isn't practical anymore with ~200 MB of logs per hour. The query is something like
select rc_ip, rc_timestamp from recentchanges where rc_user_text like 'Gwicke' and rc_ip != '';
  • shaihulud: As it seems to load zwinger too much, using isidore as mysql server now. Loading in progress
  • 07:20: jeronim: killed rsync (over NFS) backup of htdocs to vincent, in favour of continuing yesterday's rsync to yongle from zwinger's rsyncd. Will duplicate it to vincent when it finishes.
  • jeronim: (yesterday) made image tarballs for all wikis, from the backups on vincent, and put them in the relevant directories with the dumps.
  • jeronim: (yesterday) installed rsyncd on zwinger, just for htdocs at the moment. World-readable, but restricted to cluster's subnet by iptables and rsyncd.conf.

2 July

  • shaihulud: Creating a slave with 3 InnoDB files split across 3 apaches. Main mysql server is zwinger. Dump loading in progress.
  • 13:14: Jamesday: Suda connections=11,844,357 threads_created=239,870. Change since 30 June: connections=6,892,087 threads_created=61,327 (112:1, 20.5 new threads per minute over 2979 minutes. Estimated prior value at 6.5:1 ratio 353/minute).
  • 08:09: jeronim: Moved /home/wikipedia/htdocs/en/upload-old to /home/wikipedia/en.wikipedia-upload-old so that it won't be rsynced to vincent anymore. Does anybody need this directory or can it be deleted?
apache   wikidev     36864 Jun 17 00:23 upload
root     wikidev     36864 Jan 23 22:29 upload-old

30 June

  • 11:35: Jamesday: set global thread_cache_size=90 on Suda. Not in my.cnf. connections=4,952,270 threads_created=178,543 (28:1) at this time.
  • 07:00: brion: tracking down PHP errors. Squid debug statements broke image thumbnails and various other things; removing them. /home/wikipedia/sessions was broken for whatever was using it (bad permissions). Fixed an Article::getContent interface error in the foundation page's extract.php which produced annoying warnings. Still tracking down others.

29 June

  • 19:30: gwicke: updated test from cvs, just cvs up in /home/wikipedia/htdocs/test/w with zwinger's pass
  • 18:00: shaihulud: doing some nightly dump of database on will, stopping slave. Dumps are in /home2 on will
  • 16:34: shaihulud: miser mode disabled
  • 16:08: shaihulud: enabled miser mode, time to resync ariel.
  • 07:22 set global thread_cache_size=40 on Suda. Not in my.cnf. connections=842,104 threads_created=69,105 (12:1) at this time. Jamesday 08:13, 29 Jun 2004 (UTC)
  • 00:39 MySQL on Suda reported receiving signal 11. Brion restarted it and it recovered successfully.

28 June

  • 17:15 After discussion set global thread_cache_size=90 on Suda and Ariel. Within each hour at moderately busy times Suda routinely cycles within a 90-100 connection range, so there's no need to make it do more work creating new threads while something is slowing it down and causing the number to increase. Set only interactively, not in my.cnf, for observation over the next few days. Connections were 28,419,175 and threads_created 4,373,734 (6.5:1) an hour after setting. Jamesday 18:42, 28 Jun 2004 (UTC)
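
The connections-to-threads_created ratio tracked in these entries can be reproduced with a quick awk one-liner (the figures are the ones from the 28 June entry):

```shell
# ratio of Connections to Threads_created; a higher ratio means
# better thread reuse from the thread cache
conns=28419175
created=4373734
awk -v c="$conns" -v t="$created" 'BEGIN { printf "%.1f:1\n", c/t }'
# prints 6.5:1
```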

27 June

  • 21:00 - high load on zwinger, site frozen since NFS was too slow. Brion used the APC to power-cycle zwinger. syslog showed that the machine was short on memory before crashing. Activated swap to prevent future crashes.
  • 20:30 - jeluf: installed a BigSister server on mormo. Doing some remote tests for maurus, browne, coronelli (port 80 connection), zwinger (DNS, smtp, ssh), gunther.bomis.com (DNS). Installed local agents on ariel and will. Agents are running as user bs (id 200); the installation is in /home/bs/. The monitoring console is at http://mormo.org/bs/ . Other agents will be installed in the next few days.
  • 9:20 - shaihulud : added an innodb file on /mnt/raid1 on Ariel
    • Doesn't seem to work; I'll have to copy the db from will to ariel.

26 June

  • 18:30 - shaihulud: moved data in /mnt/raid1 on ariel to zwinger on /mnt/olddrive (the old 80G drive)
  • 8:00 - jeluf: investigated 100% CPU-issue on Ariel. CPU load was in user context. Neither top nor ps displayed a process using more than 1% of CPU. Shifting load to suda, CPU went down to 0%, while increase on suda was not visible. Finally restarted mysql on ariel, now for a short time CPU fingerprint looks normal (80% idle, 20% io-wait), back to 100% user :-(
  • Tim is doing searchindex updates: http://mail.wikipedia.org/pipermail/wikitech-l/2004-June/010885.html
  • 7:30 - jeluf: on will, configured ntp and syslog remote logging to zwinger.
  • 03:01 - jeronim: spamblocked www1 dot com dot cn at request of Guanaco on irc. example diff: http://wikibooks.org/w/wiki.phtml?title=Wikibooks:Sandbox&diff=38922&oldid=38633

25 June

  • 20:30 - shaihulud: installed mytop on will
  • 20:00 - shaihulud: Tried to move the searchindex table on Suda to the first raid5 to improve speed; same problem.
  • 11:05 - gwicke: Re-enabled misermode and disabled search on all wikis as suda was under heavy load from the jawiki search index rebuild. Many timeouts. Coda didn't like bonnie benchmarks. Compiling Arla now. OpenAFS seems to be slower than Arla and doesn't work on my machine so far.

24 June

  • 23:01 - gwicke: Images can now be protected. Added a request section to meta:Requests for logos for wikis that have the logo uploaded as Wiki.png; those can be changed (see wgLogo in InitialiseSettings.php, there's already one row with the /b/bc/Wiki.png bit)
  • 19:10 - jeluf: suwiktionary repair: stopped slaves on both ariel and will, so that Read_Master_Log_Pos was the same. mysqldump suwiktionary on ariel. Started the slave on ariel. mysqladmin drop suwiktionary on will didn't succeed, so after talking to the guys on #mysql, removed the categorylinks.frm file manually. The drop then succeeded. Created the db and populated it using the suwiktionary dump from ariel. Started the slave on will.
  • 10:00 - jeluf: suwiktionary replica on will is broken. needs fixing when replication is in sync.
  • 9:20 - hashar: gracefulled all apaches so they know about webshop. Notified TomK32.
  • 8:56 - hashar: edited /home/wikipedia/conf/webshop.conf and added a "ServerAlias webshop.wikipedia.org" by request of TomK32. Apache not reloaded.
  • 0:00 - jeluf: set up new mysql replica on will. Is doing much better than zwinger, which was not fast enough.

23 June

  • zwinger mysql is down. Did someone kill it, or did it shut down by itself?
  • jeluf killing many "failed replications" on ariel
  • 922 threads on mysql on suda; mostly UPDATE LOW_PRIORITY searchindex
  • TimStarling granted a non-root / wikidev shell account to Hashar. Acknowledged by brion at least.
  • jeronim placed squids in offline mode for a while in an attempt to unload the database so that many killed threads will stop taking up slots
  • 11:32 - still 587 Killed queries which won't go away
  • 11:48 - number holds steady at 587
  • Gwicke is researching network filesystem alternatives, current candidates: Coda, OpenAFS, Lustre. Coda already installed on Zwinger. Will start to summarize things at http://wikidev.net/Network_file_systems
  • 15:37 approx - the 587 Killed queries have gone, finally
  • 17:24 Added LoadBalancer in PageHistory.php
  • malicious-looking bot on de:, hitting revert/delete links on image pages but not actually logged-in, so it shouldn't be able to see the links... blocked in squid for the meantime (217.85.228.77)
    • relevant log bits on ariel in /tmp/squid2/
    • last responding router in traceroute: dd-ea1.DD.DE.net.DTAG.DE (62.154.87.58) (Dresden)
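
Blocking an address in squid, as done for the bot above, could be sketched like this in squid.conf (the acl name is hypothetical; the address is from the log):

```
acl badbot src 217.85.228.77
http_access deny badbot
```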

---

  • 18:36 jeronim still trying to get PHP working on ariel, just to do some PHP CLI stuff. Some details at PHP on ariel. Help!
    • gave up installing from source, went back to the yum/rpm. There is junk lying around from the source installs, and there is no uninstall target.

yum --download-only install php-mysql
look for the rpm in /var/cache/yum
rpm -i --nodeps php-mysql-4.3.6-5.x86_64.rpm

--($:~/incoming)-- php gmetric_repl_lag.php
PHP Warning:  Unknown(): Unable to load dynamic library '/usr/lib64/php4/mysql.so' - libmysqlclient.so.10: cannot open shared object file: No such file or directory in 
Unknown on line 0
PHP Fatal error:  Call to undefined function:  mysql_connect() in 
/home/jeronim/incoming/gmetric_repl_lag.php on line 15

So, no replication lag metric for ariel until this mess is sorted out. It's working on will though: http://download.wikimedia.org/ganglia/?m=repl_lag&r=hour&s=descending&c=Florida+cluster&h=&sh=1&hc=4

---

  • /etc/yum.conf on ariel altered to use some mirrors. Original is in yum.conf.original -- Jeronim 22:35, 23 Jun 2004 (UTC)

before 2004-06-23

  • will is back; tested with mprime, it rapidly reaches 60°C and goes into throttled mode
  • several new ganglia metrics in the last few days:
    • mysql_qps - queries per sec (average over 14 second period)
    • mysql_in - incoming connections ("established" in netstat) to mysql port
    • mysql_out - outgoing to mysql
    • http_in - incoming to port 80 on apaches
    • squ_in - incoming to port 80 on squids
      • the *_in and *_out metrics use netstat/awk/grep and use a fair amount of CPU time, at least on the squids, which tend to have close to 1000 established connections at once
  • KILL -9 for MYSQLD IS BAD!! :)
  • jeronim removed xfs from startup for servers with chkconfig --del xfs
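
A sketch of the kind of netstat/awk pipeline behind the *_in metrics (the function name is hypothetical; the real gmetric scripts are not shown in the log):

```shell
# count ESTABLISHED connections to a given local port,
# reading `netstat -tn`-style output on stdin
count_established() {
  awk -v p=":$1" '$4 ~ p"$" && $6 == "ESTABLISHED"' | wc -l | tr -d ' '
}
# e.g. mysql_in would be: netstat -tn | count_established 3306
```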

