This data was collected on capone.
The system appears to be getting stuck in iowait, largely on the homedir partitions (though others are affected too):
(bloody prettified webification eats perfectly good plain-text formatting that would otherwise make this information readable)
avg-cpu:  %user   %nice %system %iowait  %steal   %idle
          32.26    0.00    8.68   55.33    0.00    3.72
Thing to take away here: for 55% of this brief sample, the CPU sat idle waiting for I/O to complete.
Device:    rrqm/s   wrqm/s      r/s      w/s    rkB/s    wkB/s avgrq-sz avgqu-sz    await    svctm    %util
sda          0.00   222.00     0.00   248.00     0.00  1968.00    15.87    76.64   341.24     2.58    64.00
sda1         0.00   222.00     0.00   247.00     0.00  1964.00    15.90    76.37   341.26     2.59    64.00
sda2         0.00     0.00     0.00     1.00     0.00     4.00     8.00     0.26   336.00   264.00    26.40
sda3         0.00     0.00     0.00     0.00     0.00     0.00     0.00     0.00     0.00     0.00     0.00
sda5         0.00     0.00     0.00     0.00     0.00     0.00     0.00     0.00     0.00     0.00     0.00
sda6         0.00     0.00     0.00     0.00     0.00     0.00     0.00     0.00     0.00     0.00     0.00
sdb          0.00   217.00    81.00    44.00  1584.00  1044.00    42.05    24.12   213.57     7.71    96.40
sdb1         0.00   217.00    81.00    44.00  1584.00  1044.00    42.05    24.12   213.57     7.71    96.40
Thing to take away here: the device sdb and its partition sdb1 (which I believe sit on a RAID-1 mirror, based on their behavior) contain the home directories where everyone's data lives. iostat considers them 96% utilized, and the system is waiting an average of over 200ms for requests to be serviced by this storage. (Humans start to notice performance sucking at 20ms and above on average.)
The sda device and its sda1 partition are pretty dang slow too, with an average wait of about 340ms. I don't have enough visibility to see where these mount, but they're likely another RAID-1 mirrored rootvol.
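For anyone who wants to sanity-check the iostat figures, here is a minimal sketch (my own arithmetic, not anything from the host). It assumes the usual relationships between the columns: %util is roughly (r/s + w/s) times svctm, and the number of requests in flight is roughly (r/s + w/s) times await (Little's law). The names SAMPLE and derived are made up for illustration, and the numbers are copied from the sample above. Note that the iostat man page itself warns that svctm is not a trustworthy field, so treat this as a consistency check rather than a measurement.

```python
# Figures copied from the iostat sample in this post:
# device -> (r/s, w/s, await in ms, svctm in ms, reported avgqu-sz, reported %util)
SAMPLE = {
    "sda": (0.00, 248.00, 341.24, 2.58, 76.64, 64.00),
    "sdb": (81.00, 44.00, 213.57, 7.71, 24.12, 96.40),
}


def derived(r_s, w_s, await_ms, svctm_ms):
    """Return (estimated %util, estimated requests in flight).

    Assumes %util ~= IOPS * service time, and queue depth ~= IOPS * await
    (Little's law). Both are approximations.
    """
    iops = r_s + w_s
    util_pct = iops * svctm_ms / 1000.0 * 100.0
    in_flight = iops * await_ms / 1000.0
    return util_pct, in_flight


if __name__ == "__main__":
    for dev, (r_s, w_s, await_ms, svctm_ms, avgqu, util) in SAMPLE.items():
        est_util, est_queue = derived(r_s, w_s, await_ms, svctm_ms)
        print(f"{dev}: est util {est_util:.1f}% (reported {util:.1f}%), "
              f"est in flight {est_queue:.1f} (reported avgqu-sz {avgqu:.2f})")
```

The estimated utilization lands very close to what iostat printed for both disks, and the in-flight estimate is in the same ballpark as avgqu-sz. That fits a picture of requests piling up in the queue rather than each individual request being slow at the disk.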
If this is real hardware directly attached to the server, I'd suspect a backplane or disk controller saturation problem, but again I don't have enough visibility or background on the system to diagnose with authority. I can only point to the smoke...
The load average is quite high:
capone:~> uptime
12:09:06 up 21:34, 11 users, load average: 384.80, 369.12, 398.88
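On Linux the load average counts tasks in uninterruptible sleep (state D, usually blocked on disk) as well as runnable ones, so a load near 385 alongside heavy iowait would fit a large pile of processes stuck waiting on storage. A rough way to check, assuming a Linux /proc filesystem, is to tally process states. This is only a sketch: the function names are mine, and it counts processes rather than individual threads, so it will not match the load average exactly.

```python
import os
from collections import Counter


def state_from_stat(stat_line):
    """Extract the one-letter state from the contents of /proc/<pid>/stat.

    The command name sits in parentheses and may contain spaces, so split
    on the last ')' and take the first field after it.
    """
    return stat_line.rsplit(")", 1)[1].split()[0]


def count_states(proc_root="/proc"):
    """Tally process states (R = running, D = uninterruptible sleep, ...)."""
    counts = Counter()
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue  # not a PID directory
        try:
            with open(os.path.join(proc_root, entry, "stat")) as fh:
                counts[state_from_stat(fh.read())] += 1
        except (OSError, IndexError):
            continue  # process exited or stat was unreadable
    return counts


if __name__ == "__main__":
    states = count_states()
    print("runnable (R):", states.get("R", 0))
    print("uninterruptible (D):", states.get("D", 0))
```

If the D count is in the hundreds, the load average is mostly processes queued behind the disks rather than CPU contention.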
When it was rebooted yesterday, the system was usable for a while before the load average climbed above 100. Now most scripts are unrunnable, and loading a plain subdirectory listing takes 30 seconds (it usually takes a second or two at most).
My support history shows some signs of life (though no response to my queries posted yesterday):
5 hours 46 mins ago: Outage verified: We are actively looking into resolving it.
6 hours 0 mins ago: Outage first reported.
A posting to http://www.dreamhoststatus.com/ would be nice...
Crossing fingers...