Swift/Server issues Aug-Sept 2012
From Wikitech
Revision by ArielGlenn
Revision as of 14:59, 25 September 2012
We've seen some issues with the ms-be boxes (all C2100s) in Tampa.
Current status as of this writing:
Hardware Issues and Troubleshooting.
- Per conversation with Dell, we checked the jumpers on the backplane to ensure that j15 was indeed empty. After examining the backplanes on ms-be6, ms-be7 and ms-be8, each is confirmed to have only the 3 jumpers, and j15 is empty.
- ms-be6 has had all 12 HDDs replaced with drives from a different manufacturer.
- ms-be6 has had its main board, backplane and sas2008 controller card replaced.
These boxes are powered on, have had puppet disabled in /etc/default/puppet, and have had the swift processes shut off via swift-init stop all.
- ms-be6 has had its SSDs uncabled and their power pulled, and has been reinstalled with the latest LSI driver (mpt2sas0) and the latest controller firmware. After recreating all xfs filesystems on /dev/sdc and up by hand and remounting them manually, it boots and continues to see and mount all drives. It reported a degraded RAID array on /dev/md0 (= /dev/sda and /dev/sdb; /dev/sdb is the one that fell out of the array). I repaired the array, and a couple of hours later it reported degraded again. So we can use this box for testing the other drives, but we should not put it back into the rings.
- ms-be7 has SSDs, which are cabled up. I followed the same procedure: recreating all xfs filesystems except for the OS, remounting manually, and rebooting. It sees all drives. We can put this box back into the swift rings next Monday, after re-enabling puppet and restarting the swift processes on it.
- ms-be10 has SSDs, which are cabled up. After recreating all xfs filesystems except for the OS and remounting manually, on reboot it reports a few disks (not always the same ones) as not ready/not present, but if you wait it out they eventually mount. Obviously this is a problem. I don't think this box should go back into production.
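The two symptoms above (a mirror that keeps going degraded, and data disks that are slow to show up after reboot) could be checked with a small script rather than by eye. The following is only a rough sketch in Python, not something that was run on these boxes. It assumes the usual /proc/mdstat layout, where a healthy two-disk mirror shows "[2/2] [UU]" and a degraded one shows something like "[2/1] [U_]", and the usual /proc/mounts layout with the device in the first column. The function names, the sample text and the device list are made up for illustration.

```python
import re


def degraded_md_arrays(mdstat_text):
    """Return the names of md arrays that are missing a member.

    Expects text in /proc/mdstat format. An array counts as degraded
    when its "[expected/active]" counts differ or its member map
    (e.g. "[U_]") contains an underscore.
    """
    degraded = []
    current = None
    for line in mdstat_text.splitlines():
        header = re.match(r"^(md\d+)\s*:", line)
        if header:
            # Start of a new array stanza, e.g. "md0 : active raid1 ..."
            current = header.group(1)
            continue
        status = re.search(r"\[(\d+)/(\d+)\]\s+\[([U_]+)\]", line)
        if current and status:
            expected = int(status.group(1))
            active = int(status.group(2))
            if active < expected or "_" in status.group(3):
                degraded.append(current)
            current = None
    return degraded


def missing_devices(mounts_text, expected_devices):
    """Return the expected devices that do not appear in /proc/mounts text."""
    mounted = set()
    for line in mounts_text.splitlines():
        fields = line.split()
        if fields:
            mounted.add(fields[0])
    return [dev for dev in expected_devices if dev not in mounted]


# Hypothetical sample, shaped like the ms-be6 case (sdb dropped out of md0).
SAMPLE_MDSTAT = """Personalities : [raid1]
md0 : active raid1 sda[0] sdb[1](F)
      976630336 blocks [2/1] [U_]

unused devices: <none>
"""
```

On a real host the inputs would come from reading /proc/mdstat and /proc/mounts, and the expected device list would have to match however the data disks are actually named on that box (whole disks or partitions). For a box like ms-be10, where disks eventually mount, the missing-device check would need to be repeated after a wait before drawing any conclusion.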
These boxes are powered off.
- ms-be8 is powered off, and we can leave it for Dell for examination.
These boxes have one problem disk.
- ms-be5 has one disk reporting errors; it needs to be replaced once the other servers are stable.
- ms-be11 has had one disk replaced, but it shows the wrong logical id, so the box needs a reboot once the other servers are stable.