Server admin log/Archive 9

November 28

  • 19:58 brion: ragweed is down (no ping), OTRS dead
  • 14:35 ævar: Site crashed because of insufficient sanity checks, my bad.
  • 14:30 Domas: srv62 tugela crashed, no core dump yet, if crashes persist will need some poking, either code, or srv62. mcelog empty.
tugela-fc3-x64[3634]: segfault at 00000000010e1000 rip 0000003a781716e0 rsp 0000007fbffff668 error 6
  • 00:44 ævar: Changed the project name and metanamespace for iswikibooks to Wikiorðabók
  • 00:00 Domas: oops, ran tugela on srv51-srv54,srv56-srv69 instead of memcached, will see how it performs/scales/...

November 27

  • 23:21 hashar: thanks to palica : updated Server inventory bot to add a link to ganglia.
  • 19:09 hashar: added two scripts to check database : 'mysql-list' & 'replication'
  • 18:49 hashar: BUG rose has 4 memcached instances but they are not listed in mc-pmtpa.php
  • 18:47 hashar: commented 10.0.2.43:10000 from mc-pmtpa.php
  • 13:13 ævar: Installed Special:Cite on all the wikipedias
  • 10:15 brion: blocked wikipedia-l, wikien-l, and helpdesk-l list archives in mail.wikipedia.org's robots.txt to discourage future complaints about embarrassing newbie posts becoming #1 google hits. Search patches for mailman archives should be integrated at some point...
  • 08:55 JeLuF: added http://www.spy-sweeper-webroot.de/wiki/?/ to squid's leecher blocklist
  • 07:38 Solar: smellie is ready for service. Turned off seLinux.
  • 07:30 Solar: srv5 is out with a bad case of bad blocks
  • 07:00 Solar: Crossed over to the new switch, csw4-pmtpa
  • 03:09:56 ævar: Installed Special:Cite on enwiki as an experiment.
  • 01:39 Tim: took srv55 out of service, likely dud RAM. MCE errors reported.
  • 01:10 Tim: squid on will had crashed. Restarted.
  • 01:05 Domas: fixed default route on tingxi
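
The effect of the 10:15 robots.txt change can be checked with a minimal sketch using Python's standard urllib.robotparser. The /pipermail/ paths are an assumption based on Mailman's usual archive layout, not a copy of the real robots.txt on mail.wikipedia.org, and is_crawlable is a hypothetical helper name.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt; the /pipermail/ paths are assumed, not copied
# from the real file on mail.wikipedia.org.
ROBOTS_TXT = """\
User-agent: *
Disallow: /pipermail/wikipedia-l/
Disallow: /pipermail/wikien-l/
Disallow: /pipermail/helpdesk-l/
"""


def is_crawlable(url, robots_txt=ROBOTS_TXT, agent="*"):
    """Return True if a well-behaved crawler may fetch the URL."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, url)
```

Note that robots.txt only discourages compliant crawlers; it does not remove pages that are already indexed.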

November 26

  • 17:30 jeluf: changed password of wikipl-l admin account. Gave new PW to Datrio. Documented PW at the usual place.
  • 14:36 Tim: put srv52-70 into apache service. I broke srv51 with a restart test.
  • 12:00 Tim: wrote /h/w/b/apache-sanity-check, set up scripts such as apache-start to run it and refuse to start apache if the necessary LVS-friendly conditions are not met.
  • ~11:00 Tim: broke site temporarily due to LVS-related misconfiguration
  • 10:53 Tim: rose, tingxi and srv2 had apache running but no LVS VIP. This would explain the random hanging behaviour with ab -X apaches:80. Fixed temporarily, will look into a permanent solution.
  • 08:45 Tim: LVS wasn't decommissioned properly on iris. LVS on pascal was forwarding packets to LVS on iris, and iris, with no lvsmon running, forwarded most of those packets to sage, which is down. Thus users were seeing connection timeouts. Fixed with ipvsadm -D -t rrvs.knams.wikimedia.org:80.
  • 07:41 Tim: srv5 still not up. Moved its virtual IPs, one to srv6, one to srv8 and one to srv10.
  • 07:15 Tim: did a fsck of srv5 then a system reboot
  • 03:32 srv5's root partition spontaneously declared "read-only filesystem". Logs stopped moving. Mount reported that it was still rw, but it couldn't be written to.
mount uses the contents of /etc/mtab to display mounts. These are not updated when the file system is r/o. Use /proc/mounts instead.
  • 05:50 Tim: introduced time and memory limit for rsvg and convert
  • 01:45 Tim: started image backup using updated scripts in /h/w/b
  • 00:14 ævar: changed the logo for iswiktionary.
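
The note under the 03:32 entry (trust /proc/mounts, not /etc/mtab) can be turned into a quick check. This is a minimal sketch, assuming a Linux host; parse_mounts and is_read_only are hypothetical helper names, not existing scripts in /h/w/b.

```python
def parse_mounts(text):
    """Map each mount point to its set of mount options.

    Expects text in the /proc/mounts format:
    device mountpoint fstype options dump pass
    """
    mounts = {}
    for line in text.splitlines():
        fields = line.split()
        if len(fields) < 4:
            continue  # skip blank or malformed lines
        mount_point, options = fields[1], fields[3]
        # Later entries override earlier ones for the same mount point.
        mounts[mount_point] = set(options.split(","))
    return mounts


def is_read_only(mount_point, mounts_text=None):
    """Return True if the kernel reports the mount point as read-only."""
    if mounts_text is None:
        # Read the kernel's own view rather than the cached /etc/mtab.
        with open("/proc/mounts") as handle:
            mounts_text = handle.read()
    options = parse_mounts(mounts_text).get(mount_point)
    if options is None:
        raise KeyError(f"{mount_point} is not mounted")
    return "ro" in options
```

Calling is_read_only("/") on a host in the state srv5 was in should return True even while mount still prints rw.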

November 25

  • 21:45 Hashar: killed some rsvg processes on various apaches. Seems they tried to render a 120px thumb of /commons/7/70/Interstate_Highways.svg (possible DoS? :( ).
  • 04:40 Tim: experimentally enabled keepalive on apache.
  • 03:35 Tim: testing lvsmon failover by stopping squid on clematis
  • 03:05 Jamesday: Adler had 11GB disk free. gzipped first 80 binlogs to raise it to 48GB or so. The gzipped versions still need to be moved to wherever we're keeping them these days.
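
Runaway renderers like the rsvg processes in the 21:45 entry are what per-process limits guard against. The following is a minimal sketch of one way to cap CPU time and memory for an external command, assuming a Unix host and Python's resource module. It is not the mechanism actually deployed for rsvg and convert, and run_limited is a hypothetical name.

```python
import resource
import subprocess


def _clamp(limit, requested):
    """Build a (soft, hard) pair that never exceeds the existing hard limit."""
    _, hard = resource.getrlimit(limit)
    if hard != resource.RLIM_INFINITY:
        requested = min(requested, hard)
    return (requested, hard)


def run_limited(cmd, cpu_seconds, memory_bytes, wall_timeout):
    """Run cmd with CPU-time and address-space limits (Unix only)."""

    def apply_limits():
        # Runs in the child process just before exec.
        resource.setrlimit(
            resource.RLIMIT_CPU, _clamp(resource.RLIMIT_CPU, cpu_seconds)
        )
        resource.setrlimit(
            resource.RLIMIT_AS, _clamp(resource.RLIMIT_AS, memory_bytes)
        )

    return subprocess.run(
        cmd,
        preexec_fn=apply_limits,
        capture_output=True,
        timeout=wall_timeout,  # wall-clock guard, raises TimeoutExpired
    )
```

A child that exceeds the CPU limit is terminated by the kernel, so the caller sees a non-zero return code instead of a process that has to be killed by hand.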

November 24

  • 06:30 kate: setting up l3 failover.. see that page for details
  • 02:55 brion: took cornelli out of search rotation while kyle moves it around

November 23

  • 21:18 mark: Routing problems from 38.0.0.0/8 (cogent ip space) to florida. Altered the countries.nerd.dk file to reroute that prefix via knams.
  • 20:44 mark: Reinstated the normal epoll RPM on mint, as epoll wasn't the problem
  • 16:44 brion: fixed arrangement of upload directories for several sites (non-wikipedia :P)
  • 00:35 kate: "ntp source vlan1" fixed NTP problem on csw1, but need to work out why traffic to 64.156.25.242 is being dropped
  • 00:04 kate: upgraded csw4-pmtpa to 12.2(25)SED, enabled ssh and configured vlan 2 properly

November 22

  • 22:33 brion: amane still seems to work. YAY \o/
  • 21:49 brion: restarted apache on zwinger, wasn't loading
  • 21:45 brion: increased php fastcgi workers on amane to absurd levels for thumbs to run
  • 21:30 brion: mostly working now! had to set server.max-workers to 8 in lighty to get it running smoothly
  • 19:28 brion: mounted /mnt/upload3 (amane) on zwinger, was missing mountpoint
  • 19:22 brion: mounted /mnt/upload3 (amane) on srv2, was missing mountpoint
  • 19:11 brion: restarted albert's http temporarily to cover the work period
  • 19:02 brion: khaldun copy finally finished, rearranging bits on amane
  • 15:21 brion: turned albert's http back off (hope you're done) so khaldun can finish its copy without the extra load
  • 07:44 brion: started albert's http so kate can set things up requiring the local fedora yum mirror
  • 05:08 kate: configured asw2-pmtpa. has the new srvs and the equ device on it (equ is 10.0.1.3)
  • 00:55 brion: started copying commons files from bacon -> amane. disabled albert's apache
  • 00:45 brion: started copying enwiki files from khaldun -> amane, non-wikipedia non-wiktionary files from albert -> amane
  • 00:35 brion: started copying files bacon -> amane
  • 00:20 brion: disabled uploads sitewide

November 21

  • 23:10 brion: setting up to move uploads to amane, will disable all uploads and upload.wikimedia.org for a while to make this damn thing happen
  • 21:15 brion: started lucene index rebuild on maurus
  • 21:05 brion: restarted squid on will, was not responding (stuck) on port 80
  • 20:49 brion: restarted apache on ragweed; https was down so otrs inaccessible
  • 20:30 mark: Brought sage and mayflower back up.
  • 20:00 mayflower went down.
  • 20:00 mark: Moved LVS back to pascal to allow iris to be a squid again.
  • 19:45 mark: Modified lvsmon on iris because it was always sending curl requests with Pragma: no-cache! And therefore testing the whole chain to florida.
  • 19:45: sage went down.
  • 18:00 mark: Installed non-epoll RPM on mint to compare.
  • 17:56:40-17:57:31 ævar: Invalid argument notices were being generated in this time period due to me syncing three files and them depending on each other, ok now.
  • 17:30 mark: udpmcast wasn't running on pascal. No idea since when... started.
  • 17:30 jeluf: Restarted ragweed. Came back after powercycling and fsck.
  • 16:30 ragweed broken.
  • 12:14 erik: Updated logo of nap.wikipedia.org and sync'd InitialiseSettings.php

November 20

  • 23:30 mark: Upcoming maintenance of knams tomorrow (ZX will do some firmware upgrades, rebooting at least pascal and vandale). Moved LVS to iris because of that.
  • 20:00 JeLuF: All wikipedia.org upload directories moved off of albert and to amane.
  • 18:03 Hashar: fixed #4022 'Asia/Seoul' timezone for kowiki.
  • 17:50 Hashar: switched some logos to /b/bc/Wiki.png
  • 14:24 JeLuF: chown -R apache:apache amane:/export/upload/wikipedia.org/
  • 14:19 Hashar: in amane:/export/upload/wikipedia.org/ some directories can't be written by apache (af de es & fr). dewiki upload page reports an error.
  • 09:26 Tim: Fixed NTP broadcast, documented
  • 03:21 Tim: Fixed perl upgrade on srv51-70 as per [1]

November 19

  • 17:05 Tim: same on fuchsia
  • 16:50 Tim: restarted squid on clematis, disabled swap.
  • 16:05 Tim: upgraded otrs on ragweed to version 2.0.3, after Anthere complained about this bug: [2]. Minor upgrades within the 2.0.x series weren't documented (just an unanswered question on the ML), so I just untarred over the top of the old directory, with a backup in /opt/otrs-2.0.1. Treat any problem symptomatically, some chmodding might be required.
  • 15:40 Tim: restarted squid on bayle

November 18

  • 23:30 brion: installing ploticus 2.32 on mediawiki-installation, set to use gd & truetype fonts (bugzilla:3965)
    • truetype fonts in common/fonts
  • 07:00 jeluf: migration of dewiki's image and thumbnail directories done. archive and shared will be moved when albert has more headroom. Some 30 small to medium wikis moved. Currently running frwiki thumbnail migration.
  • 00:27 brion: blocked another leech [3]

November 17

  • 14:30 mark: ragweed was missing the LVS ip, fixed. Also readded iris as squid.
  • 06:30 Tim: Added root key to srv51-70. The following machines didn't want to cooperate: 56, 64, 66, 67, 69
  • 06:05 Tim: added srv51-70 to DNS, created a node group. Configured albert's BIND as a slave for the 10/8 reverse DNS zone.
  • 05:46 Solar: srv2 is back up.
  • 05:38 Solar: srv56 is up too.
  • 05:26 Solar: srv51-srv70 are ready for Rock & Roll! (Except srv56 has some hardware issue)
  • 04:34 Solar: holbach is rebuilt and ready
  • 03:47 Tim: added tingxi and rose to the apaches node group. Left harris out, it sucks.
  • 03:30 Tim: after moving some more hosts to the misc2 cluster, restarted gmond on the apache cluster to remove hosts which have been moved out
  • 02:24 Tim: fixed amane's date, started ntpd
  • 01:49 Tim: Created "Misc VLAN2" cluster on ganglia, for miscellaneous hosts which, due to being in the wrong VLAN, couldn't be in Miscellaneous.

November 16

  • 8:25 brion: srv50 error_log flooded disk; removed and restarted apache
  • 6:30 jeluf: moved es upload area to amane:/export/upload
  • 5:30 jeluf: moved eo, ang, an upload areas to amane:/export/upload. Backups are still on albert in .../remove.
  • 04:14 Tim: attempted to restart squid on will. It didn't work. I hacked /etc/init.d/squid to send errors to a file instead of /dev/null, and found it was giving error messages like "parseConfigFile: line 17 unrecognized: 'htcp_port 4827'". I started the squid copy in /usr/local/ instead.
  • 01:20 brion: reenabled special:renameuser with the 'archive' bit disabled. it's possible that some undeleted pages will have incorrect rev_user_text data

November 15

  • 23:00 jeluf: moved aa, ab, af, ak, als, am, ar, ast, zh image uploads to amane:/export/upload
  • 20:32 hashar: updated http://wikimedia.org/stats/live/ with a message redirecting to the "new" system ( http://noc.wikimedia.org/stats.php ).
  • 16:13 Tim: running batch imagemagick convert job on bacon, converting 1911 EB scans to PNG.
  • ~12:30 Tim: Deployed diff cache and parser cache push features. Reduced cache expiry for RC feeds on en from 60 to 20 seconds. The performance impact of this should be monitored -- the diff cache should reduce it but it might not be enough.
  • 03:46 Tim: Re-enabled tidy, trimmed error logs. The huge error logs did indeed have a few tidy errors towards the end, once every few minutes, interspersed with lots of "file not found" errors. Preceding this lack of activity was gigabytes of either:
    [Mon Nov 7 04:33:33 2005] [error] PHP Parse error: parse error, unexpected $ in /usr/local/apache/common-local/php-1.5/checkers.php on line 101
    OR
    *** attempt to put segment in horiz list twice
    Neither of which have anything to do with tidy. The other noticeable thing at the very end of the error logs was that apache was segfaulting regularly, but it was doing that just as much after tidy was disabled.
  • 01:22 ævar: resolved bug 3968
  • 00:50 brion: cleaned giant error_log files from srv44 and srv47, which had run out of space during sync
  • 00:41 brion: adding some signature-nazi features, so new sigs with unbalanced html tags will not be inserted
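
The 00:41 signature check amounts to rejecting text whose HTML tags do not nest and close properly. A minimal sketch of that idea, using Python's standard html.parser rather than MediaWiki's own code; has_balanced_tags and the VOID_TAGS list are illustrative assumptions.

```python
from html.parser import HTMLParser

# Elements that never take a closing tag.
VOID_TAGS = {"br", "hr", "img", "input", "meta", "link", "wbr"}


class _BalanceChecker(HTMLParser):
    """Tracks open tags on a stack and flags any mismatch."""

    def __init__(self):
        super().__init__()
        self.stack = []
        self.balanced = True

    def handle_starttag(self, tag, attrs):
        if tag not in VOID_TAGS:
            self.stack.append(tag)

    def handle_startendtag(self, tag, attrs):
        pass  # self-closing form such as <br/> opens nothing

    def handle_endtag(self, tag):
        if tag in VOID_TAGS:
            return
        # A close tag must match the most recently opened tag.
        if not self.stack or self.stack.pop() != tag:
            self.balanced = False


def has_balanced_tags(text):
    """Return True if every opened tag is closed in the right order."""
    checker = _BalanceChecker()
    checker.feed(text)
    checker.close()
    return checker.balanced and not checker.stack
```

For example, a signature ending in an unclosed font tag would be rejected, while plain text or properly closed markup passes.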

November 14

  • 22:30 mark: Many apaches have error_logs of 100G in size and more! Partly due to tidy, but how is logrotation supposed to be set up? See bug #3966
  • 22:00 - 22:12 hashar: $wgUseTidy = false; it's filling error logs on all apaches and seems to stall. Restarted all apaches too. Wikipedians need to FIX their HTML.
  • 14:00 mark: Rebooted srv10, and started Squid on it with no cachedirs (1 null cachedir). Assigned IP .214 to it.
  • 08:28 Tim: restarted squid on srv6. Slow hit service times (~100ms), it wasn't swapping but it had very little spare memory for kernel cache and buffers.
  • 03:05 Tim: bayle was swapping heavily, very slow service times for both hits and misses. Restarted squid, added it to the ganglia squid cluster.
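
Until log rotation is sorted out, oversized logs like those in the 22:30 entry can at least be spotted early. A minimal sketch, assuming only the Python standard library; find_large_files is a hypothetical helper and the directory and threshold are up to the caller.

```python
import os


def find_large_files(root, threshold_bytes):
    """Return (path, size) pairs for files under root larger than threshold."""
    large = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                size = os.path.getsize(path)
            except OSError:
                continue  # file vanished or is unreadable
            if size > threshold_bytes:
                large.append((path, size))
    # Largest first, so the worst offenders are at the top.
    return sorted(large, key=lambda item: item[1], reverse=True)
```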

November 13

  • 22:50 jeluf: mounted amane:/export/math to all mediawiki-installation servers for storage of math images.
  • 20:00 midom: srv10 squid hung, reiserfs issues?
  • 16:57 brion: running data dumps on benet/srv35/srv36

November 12

  • 19:49 ævar: tingxi had languages/LanguageCs.php (and probably something else) out of date, IIRC it has been down for some time, ran scap to bring it and others up to date.

November 11

  • 00:16 brion: changed sitename on eswikinews (meta-namespace was already set)

November 10

  • 14:28 ævar: changed the logo on trwiki
  • 09:06 ævar: Changed the upload url of the wikis that had uploading disabled to point to the commons
  • 09:09 brion: gave up on the bugzilla upgrade after it failed
  • 08:40 brion: running yum update on pascal; got some glibc double-free bug during bugzilla update, and thought it was time to upgrade some damn packages
  • 08:25 brion: shutting down bugzilla for upgrade to 2.20
  • 07:18 brion: removed check_policy_service from /etc/postfix/main.cf on kate's advice, to see if it's more stable with that off
  • 07:02 brion: restarting postfix on zwinger, mail stopped again
  • 05:59 ævar: Removed harris from /usr/local/dsh/node_groups/mediawiki-installation, responded to ping, had port 22 open, but hung forever on ssh harris
  • 04:27 Tim: set up ftp server on bacon, to accept uploads of scanned page images

November 9

  • 14:18 Tim: fuchsia was swapping, regularly timing out on lvsmon health checks. Restarted squid.
  • 11:09 brion: modified parser cache behavior to do cache with redirect targets. should increase hit rate; if troubles experienced, revert Article.php back to rev 1.396
  • 10:13 brion: reenabled search text extracts for active sessions only
  • 07:32 brion: updating live search indexes
  • 00:54 brion: no mail in last eight hours... restarting postfix

November 8

  • 23:30 jeluf: After intensive fsck, ragweed is back.
  • 19:00 ragweed pings, but doesn't allow SSH login
  • 13:10 holbach crashed
  • 12:05 Tim: deployed local message cache, causing a 60% drop in network traffic on the apache cluster according to ganglia. We had noticed probable network saturation on the 100 Mbps switch asw1, this was the obvious solution. A content hash is stored in memcached and checked on each request. The local cache is stored in files, one file per wiki in /tmp/mediawiki/
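
The 12:05 local message cache design (a small content hash in memcached, the bulky data in a local file per wiki) can be sketched as follows. This is an illustration of the idea only, not the MediaWiki implementation: a plain dict stands in for memcached, the file format is JSON, and get_messages, content_hash and the "<wiki>:hash" key are hypothetical names.

```python
import hashlib
import json
import os


def content_hash(messages):
    """Stable hash of a message dictionary."""
    payload = json.dumps(messages, sort_keys=True).encode("utf-8")
    return hashlib.sha1(payload).hexdigest()


def get_messages(wiki, shared, cache_dir, load_messages):
    """Return the messages for wiki, preferring the local file cache.

    shared         dict-like stand-in for memcached, holding "<wiki>:hash"
    load_messages  callable that fetches the full message set (expensive)
    """
    path = os.path.join(cache_dir, f"{wiki}.json")
    expected = shared.get(f"{wiki}:hash")

    # Cheap path: one small hash lookup, then a local file read.
    if expected is not None and os.path.exists(path):
        with open(path, encoding="utf-8") as handle:
            cached = json.load(handle)
        if cached["hash"] == expected:
            return cached["messages"]

    # Expensive path: fetch everything, then refresh both caches.
    messages = load_messages(wiki)
    digest = content_hash(messages)
    shared[f"{wiki}:hash"] = digest
    os.makedirs(cache_dir, exist_ok=True)
    with open(path, "w", encoding="utf-8") as handle:
        json.dump({"hash": digest, "messages": messages}, handle)
    return messages
```

Only the hash crosses the network on a cache hit, which is consistent with the traffic drop described in the entry.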

November 7

  • 20:51 kate: stopped replication on lomaria. please don't start it without asking me unless it's extremely important.
  • 20:45 brion: trying to get tidy going again
  • 20:30 brion: rebuilding search indexes on maurus.
  • 20:00 brion: set search daemons to restart hourly. *sigh*
  • 14:05 Tim: brought holbach back into service. Tweaked some load ratios.
  • 13:55 Tim: started slave on lomaria. It was idle, the site was slow.
  • 05:45 brion: switched lucene search to default to AND matches
  • 02:50 brion: set up init script for MWDaemon (/etc/init.d/mwdaemon), added a daily cronjob to restart them

November 6

  • 21:04 brion: several servers had disks filled from apache error_log; libart in rsvg apparently spewing out gigs of "*** attempt to put segment in horiz list twice"
  • 20:10 brion: site unusually loaded; giving a kick to the apaches for luck
  • 11:08 jeluf: srv22 was overheated. killed svg renderer (240 cpu minutes)
  • 11:00 jeluf: added Category:Broken_servers for better keeping track of todos
  • 10:40 jeluf: added portal namespace for nowiki upon Jhs' request
  • 08:20 kate: copying from lomaria again... whee!
  • 05:20 brion: added id.wikisource.org by request
  • 04:59 Tim: started lvsmon-ksquid on pascal
  • 04:39 kate: iris crashed... moved lvs to pascal.
  • 02:40 Tim: Made MW check $cluster.dblist instead of all.dblist. This will generate appropriate error conditions for improper access to foreign databases via commandLine.inc, Special:Makesysop or squid misconfiguration.
  • 01:40 Tim: installed memcached on srv41-50, moved instances from various other machines to there, including offloading browne completely. Restarted memcached on srv22, it had a dead instance.

November 5

  • 22:02 kate: restarted replication on lomaria. set up replication on zedler.
  • 11:00 brion: chgrp'd common files on humboldt
  • 09:15 solar: installed new image filer, amane, into the rack.
  • 04:55 kate: stopped replication lomaria again to re-dump. don't start it please. (server is still running)
  • 03:41 Tim: tried to restart dumpHTML on srv31, the machine crashed almost immediately
  • 03:39 brion: starting dumps on yaseo on amaryllis/henbane
  • 03:32 kate: copy finished, restarted replication on lomaria
  • 03:00 brion: refresh-dblist now also creates pmtpa.dblist and yaseo.dblist, based on assignment overrides from clusters.dblist
  • 00:45 brion: started pmtpa dumps on benet, srv35, srv36
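
The 03:00 refresh-dblist change splits one database list into per-cluster lists using overrides. A minimal sketch of that grouping step; the "dbname cluster" line format for clusters.dblist, the pmtpa default, and both function names are assumptions for illustration, not the real script.

```python
def parse_overrides(text):
    """Parse hypothetical 'dbname cluster' lines into a dict."""
    overrides = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blanks and comments
        db, cluster = line.split()
        overrides[db] = cluster
    return overrides


def split_by_cluster(all_dbs, overrides, default_cluster="pmtpa"):
    """Group database names by cluster.

    all_dbs    iterable of database names (the contents of all.dblist)
    overrides  dict mapping a database name to a non-default cluster
    """
    clusters = {}
    for db in all_dbs:
        cluster = overrides.get(db, default_cluster)
        clusters.setdefault(cluster, []).append(db)
    return clusters
```

Each resulting list would then be written out as <cluster>.dblist, so a per-cluster membership check can reject access to databases hosted elsewhere.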

November 4

  • 21:45 jeluf: moved lighttpd on benet to /usr/local/lighttpd. Added startup to /etc/rc.local
  • 21:00 jeluf: mounted benet:/var/backup to zwinger:/mnt/backup_benet
  • 06:25 brion: restarted search servers; memory usage up to 650-1000mb range, and very slow response on vincent

November 3

  • 11:03 kate: copying lomaria's db to zedler, don't start it
  • 21:45 erik: fixed he.wikinews site name and meta namespace (hopefully), sync'd InitialiseSettings.php and ran update.php accordingly
  • 20:44 brion: investigating connection errors (hacked wfLogDBerror to include hostname); seems to be on the new opteron boxen only
  • 20:30 hashar: started apache on srv35.
  • 20:22 hashar: started apache on avicenna.
  • 20:10 mark: Will was running with only 1024 FDs. As it's the only non-RPM squid around (will is FC1) and I added bayle, I have taken it out, reassigned IPs to srv5 and srv7.
  • 19:55 hashar: some apaches need a reboot. load is incorrectly high on them because of state=D processes (see bug #3869)
  • 15:10 mark: Moved bayle (previously broken, inactive memcached) to the external vlan, made it a temporary squid. I cannot get it to mount izwinger:/home though. Any ideas?
  • 5:30 Tim: copied ~tstarling/.ssh/known_hosts to /etc/ssh/ssh_known_hosts on all pmtpa machines
  • ~5:00 Tim & kate: syslogd stopped working on zwinger, causing DNS to stop working. Kate restarted syslogd.
  • ~5:00 created hewikinews using addwiki.php, sync-common-all
  • 04:07 kate: made amaryllis ns3.wikimedia.org. needs magic stuff so it can be added as auth ns
  • 01:58 Tim: restarted search daemon on vincent, the usual problem

November 2

  • mark: Apparently the restart squid cron job in the squid RPM is broken in a weird way: at some point in time /sbin/pidof /usr/sbin/squid will stop working. I will fix it and roll out a new RPM tomorrow. Sorry for the trouble!
  • 23:20 JeLuF: Found 2 squids on srv8. Killed both, started a new one.
  • 22:20 Tim: adapted lvsmon for knams squid service, started it on iris. See /usr/local/bin/lvsmon-ksquid . There's also a copy in ~tstarling/lvs on zwinger in case iris goes down.
  • 21:30 mark: Installed the new squid RPM on clematis. Not using epoll didn't change memory leaking behaviour.
  • 19:17 kate: LDAP on pascal was broken after reboot.
Nov  2 19:11:26 pascal slapd[29793]: bdb_db_init: Initializing BDB database
Nov  2 19:11:26 pascal slapd[29794]: bdb(dc=knams,dc=wikimedia,dc=org): Lock table is out of available locks
Nov  2 19:11:26 pascal slapd[29794]: bdb_db_open: db_open(/var/lib/ldap) failed: Cannot allocate memory (12)
Nov  2 19:11:26 pascal slapd[29794]: backend_startup: bi_db_open(0) failed! (12)
Did a db_recover and restarted slapd.
  • 04:38 kate, kyle: csw4 is installed. nothing on it yet.
  • 01:08 kate: pascal broke again, moved LVS to iris
  • 00:10 kate: colo allocated us 84.40.25.224/27, wikicities will move into this network

November 1

  • 23:39 brion: created car-fr-l list for french arbcom
  • 22:25 brion: heavy packet loss between pmtpa and lopar; kate is moving dns off lopar for now
  • 21:10 UTC erik: created ru.wikinews.org using addwiki.php
  • 18:26 mark: Dropped 207.142.131.225 as gateway IP, as it doesn't seem to be in use anymore
  • 18:15 mark: Made csw1-pmtpa act as a DHCP relay agent for rabanus, 10.0.0.15
  • 04:20 kate: replaced mormo.org with pascal & amaryllis as backup MX, using postgrey + other anti-spam stuff
  • 05:48 Solar: anthony, suda, isidore and bayle are back up.
  • 05:10 Tim: Cleaned up the squid list in CommonSettings.php. The need to have variables for the IP addresses of each squid passed long ago, it was just clutter, doubling the length of the section. Added the external IP address of will, which was missing, causing edits to be wrongly attributed in the yaseo wikis.
