Swift/Open Issues Aug - Sept 2012


Revision as of 21:13, 28 August 2012

Open Issues after upgrade and media originals reads deployment

  • Put 1.5 .debs into the apt repo
  • Re-enable puppet for all of pmtpa swift
    • diff /usr/share/ganglia-logtailer/SwiftHTTPLogtailer.py (frontend)
      +1 allow puppet to write its version of the file. the change is used on the backends.
    • diff /usr/bin/swift-drive-audit (backend)
  • move other things off of ms7 one at a time
  • Upgrade Swift to 1.5 for all of eqiad
  • Finish building eqiad cluster
    • ms-be1004, ms-be1005 had hardware issues; RobH can give an update
  • Pending MW bugs (Aaron) (could someone file these in BZ?)
    • multiple HEADs (7?) for the same file for moves/deletes
      patch pending to reduce this to ~5
      https://gerrit.wikimedia.org/r/#/c/21485/
      more HEAD requests will go away when we stop doing multibackend write (no more MW consistency check across file backends)
    • HEAD/GET for a thumb instead of just a GET
    • excessively long filenames can fail on thumb requests (MW has a 255-char limit, but 255 unicode chars double-URL-encoded can exceed 1024 chars, which swift cannot handle)
      https://bugzilla.wikimedia.org/show_bug.cgi?id=39697
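
To make the filename-length problem above concrete, here is a minimal sketch of how double URL encoding inflates a name. It is an illustration, not MediaWiki's or swift's actual code path: the helper name double_encoded_length is hypothetical, and the 1024 figure is simply the one quoted in the bullet above.

```python
from urllib.parse import quote

# Limit quoted in the notes above; treated here as an assumption.
SWIFT_NAME_LIMIT = 1024


def double_encoded_length(name: str) -> int:
    """Length of `name` after two rounds of percent-encoding.

    The first pass turns each non-ASCII byte into %XX; the second pass
    re-encodes every '%' as '%25', so each such byte ends up as 5 chars.
    """
    once = quote(name, safe="")
    twice = quote(once, safe="")
    return len(twice)


if __name__ == "__main__":
    # Hypothetical worst-ish case: 255 two-byte characters.
    name = "é" * 255
    length = double_encoded_length(name)
    print(length, "exceeds limit" if length > SWIFT_NAME_LIMIT else "fits")
```

A name made only of unreserved ASCII characters keeps its length, so the failure only shows up for titles heavy in non-ASCII or reserved characters.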
  • statsd
    • updated packages by swiftstack
    • verify everything works in labs and/or in eqiad
    • update the ganglia views to use the new statistics
  • start running swift-recon, incorporate into stats and/or monitoring
  • eqiad container sync
  • ms-be6, ms-be10
    • ms-be10 needs diagnosis. maybe a new controller?
    • ms-be6 has multiple disks failing. controller?
    • too many hardware problems, escalate to Dell
  • upgrade to precise
    • eqiad already is
  • redo zones in pmtpa so that a zone represents a rack not a server
    • to move a host to a new zone, remove all devices from the ring and re-add them to the ring in the new zone. Format all the disks on the moved host before the new ring takes effect. Only move one host per week, preferably every other week.
    • Look up hosts in racktables to determine what zone they should be in.
    • move ms-be12 to zone 14
    • move ms-be7 to zone 5
    • move ms-be8 to zone 5
    • move ms-be6 to zone 8
  • audit and replace disks across all backends
    • ms-be6 and ms-be10 recently suffered many disk failures, but they're not alone. Throughout the cluster there are other disks that are having issues. All the backends need an audit to find out which disks have failed, and all those failed disks need to be replaced. Don't replace disks in multiple zones simultaneously.
    • symptoms of dead disks:
      • ls -l /srv/swift-storage/ shows ????s where it should show disk permissions etc. example ms-be7:sdg1
      • errors in /var/log/kern.log, example Aug 27 20:27:12 ms-be7 kernel: [4136747.742211] Filesystem "sdg1": xfs_log_force: error 5 returned.
      • unmounted disks (ls -l /srv/swift-storage shows ownership as root instead of swift)
      • error messages in puppet about it failing to format and mount a disk
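
Some of the symptoms above could be checked by a script rather than by eye. The following is a rough sketch under assumptions, not an existing tool: the function names are hypothetical, the expected owner is looked up as a "swift" user, and the os.path.ismount test is an extra check not mentioned in the list. A failing stat() is what produces the ????s in ls -l output.

```python
import os
import pwd


def check_device_dir(path, expected_uid):
    """Classify one device directory using the symptoms listed above."""
    try:
        st = os.stat(path)
    except OSError as exc:
        # ls -l prints ???? for entries it cannot stat (e.g. I/O error).
        return "stat-failed: %s" % exc.strerror
    if st.st_uid != expected_uid:
        # An unmounted disk leaves the bare mount point, owned by root.
        return "wrong-owner"
    if not os.path.ismount(path):
        # Additional check (assumption): nothing is mounted here.
        return "not-mounted"
    return "ok"


def audit(root, expected_uid):
    """Return {device_dir: status} for every entry under `root`."""
    return {
        name: check_device_dir(os.path.join(root, name), expected_uid)
        for name in sorted(os.listdir(root))
    }


if __name__ == "__main__":
    # Assumes a local "swift" user and the storage root named in the notes.
    swift_uid = pwd.getpwnam("swift").pw_uid
    for device, status in audit("/srv/swift-storage", swift_uid).items():
        print(device, status)
```

Output in this shape could later feed a nagios check, though the kern.log and puppet symptoms would still need separate handling.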
  • hook up disk failure detection to nagios in a useful way so that we are alerted when a disk needs to be swapped out (rather than having to proactively check).
  • tinker with the rsync (or object replicator?) concurrency setting (lots of connection errors in logs)
  • investigate 507s in swift logs
    • maybe correlated with dead disks?
  • thumb space general discussion on wikitech-l
    • (currently we generate any thumb of any size on demand and keep it forever. this is a sure way to fill all available disk space; we need an actual plan for cleaning up regularly.)
  • cluster in esams and retirement of ms6?
  • (eventually) turn off writes on ms7 and reclaim it for other uses
  • average proxy query duration has risen since moving originals to swift. This average is driven up by a small number of very slow transactions (the 90th and 50th percentiles are not as affected). The object server logs a fast transmission; only the proxy-server logs a long transaction time. What is the proxy server doing that makes it slow? Is it a slow client read? Investigate (the extra statsd timers will be useful here).
    • example 1: object server took 0.0234s, proxy-server took 7.1610s:
Aug 28 20:53:52 10.0.6.202 object-server xx.xx.xx.xx - - [28/Aug/2012:20:53:52 +0000] "GET /sdl1/37405/AUTH_43651bxxxxxxxxxdfe/wikipedia-commons-local-public.4e/4/4e/Martigny,_ville_romaine_et_moderne,_vestiges_de_canalisations_romaines.jpg" 200 2582570 "http://www.google.it/search?hlxxxxxxxcs" "tx14xxxxxxxxe6a1b8" "Mozilla/5.0 (iPad; CPU OS 5_1_1 like Mac OS X) AppleWebKit/534.46 (KHTML, like Gecko) Version/5.1 Mobile/9B206 Safari/7534.48.3" 0.0234
Aug 28 20:53:59 10.0.6.214 proxy-server xx.xx.xx.xx xx.xx.xx.xx 28/Aug/2012/20/53/59 GET /v1/AUTH_436xxxxxxxxx8dfe/wikipedia-commons-local-public.4e/4/4e/Martigny%252C_ville_romaine_et_moderne%252C_vestiges_de_canalisations_romaines.jpg HTTP/1.0 200 http%3A//www.google.it/search%3Fhl%3Dixxxxx;bbbbfCLcs Mozilla/5.0%20%28iPad%3B%20CPU%20OS%205_1_1%20like%20Mac%20OS%20X%29%20AppleWebKit/534.46%20%28KHTML%2C%20like%20Gecko%29%20Version/5.1%20Mobile/9B206%20Safari/7534.48.3 - - 2582570 - tx14xxxxxxa1b8 - 7.1610 -
    • example 2: object server took 0.0529s, proxy-server took 11.4569s:
Aug 28 20:53:58 10.0.6.204 object-server xx.xx.xx.xx - - [28/Aug/2012:20:53:58 +0000] "GET /sde1/27370/AUTH_4365xxxxxxxfe/wikipedia-commons-local-public.34/3/34/Martigny,_ville_romaine_et_moderne,_Martigny-Bourg.jpg" 200 2985520 "http://www.google.it/search?hlxxxxxxcs" "txf53fxxxxxxxx973ac" "Mozilla/5.0 (iPad; CPU OS 5_1_1 like Mac OS X) AppleWebKit/534.46 (KHTML, like Gecko) Version/5.1 Mobile/9B206 Safari/7534.48.3" 0.0529
Aug 28 20:54:09 10.0.6.214 proxy-server xx.xx.xx.xx xx.xx.xx.xx 28/Aug/2012/20/54/09 GET /v1/AUTH_43xxxxxxxdfe/wikipedia-commons-local-public.34/3/34/Martigny%252C_ville_romaine_et_moderne%252C_Martigny-Bourg.jpg HTTP/1.0 200 http%3A//www.google.it/search%3xxxxxxxLcs Mozilla/5.0%20%28iPad%3B%20CPU%20OS%205_1_1%20like%20Mac%20OS%20X%29%20AppleWebKit/534.46%20%28KHTML%2C%20like%20Gecko%29%20Version/5.1%20Mobile/9B206%20Safari/7534.48.3 - - 2985520 - txf53fxxxxxxxxxc973ac - 11.4569 -
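
The point about the average versus the percentiles can be illustrated with a small sketch. The two slow durations are the proxy-server times from the examples above; the 98 fast requests at 0.05s are invented purely for illustration, and the nearest-rank percentile helper is a hypothetical stand-in for whatever the statsd/ganglia pipeline actually computes.

```python
import math


def percentile(values, pct):
    """Nearest-rank percentile: smallest value with at least pct% at or below it."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


# 98 hypothetical fast requests plus the two slow proxy times logged above.
durations = [0.05] * 98 + [7.1610, 11.4569]

mean = sum(durations) / len(durations)
p50 = percentile(durations, 50)
p90 = percentile(durations, 90)
p99 = percentile(durations, 99)

if __name__ == "__main__":
    print("mean=%.4f p50=%.4f p90=%.4f p99=%.4f" % (mean, p50, p90, p99))
```

With this shape of data the mean is several times the median while the 50th and 90th percentiles stay at the fast value, which matches the pattern described; only a high percentile such as the 99th exposes the slow tail.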


Pointers
