Monday, July 12, 2004

Tips from my lousy hard drive experience.

I have two 40G drives that I bought at the same time. I consider them identical. About a week ago, one of them failed.

Tip one: Back up early and often! I wrote a script that takes the most important files, and packages them into nice CD-sized chunks. It doesn't back up my music, or a handful of other things, but it's the minimum of what I'd call "full." Another script takes just files that have been changed or added since the last "full" backup. The first one I run when I think of it, and the second one runs nightly. Long story short, I have my most important files safe every day.
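
The two scripts aren't reproduced here, but a minimal sketch of the idea might look like the following. Everything in it is an assumption for illustration: the function names, the stamp file, the file naming, and the 650 MB chunk size are hypothetical, and it relies on GNU tar for reading a null-separated file list.

```shell
#!/bin/sh
# Sketch only: hypothetical helpers, not the original scripts.
# DEST should live outside SRC, or the backups will back themselves up.

# full_backup SRC DEST : archive SRC into CD-sized pieces under DEST
full_backup() {
    src=$1
    dest=$2
    mkdir -p "$dest" || return 1
    # Take the timestamp before archiving, so files changed during the
    # run are picked up by the next incremental.
    touch "$dest/last-full.stamp.new" || return 1
    tar -cf - -C "$src" . |
        split -b 650m - "$dest/full-$(date +%Y%m%d).tar." || return 1
    mv "$dest/last-full.stamp.new" "$dest/last-full.stamp"
}

# incremental_backup SRC DEST : archive files modified since the last full
incremental_backup() {
    src=$1
    dest=$(cd "$2" && pwd) || return 1   # absolute path, since we cd below
    stamp="$dest/last-full.stamp"
    [ -f "$stamp" ] || { echo "no full backup found in $dest" >&2; return 1; }
    (cd "$src" && find . -type f -newer "$stamp" -print0 |
        tar -cf "$dest/incr-$(date +%Y%m%d).tar" --null -T -)
}
```

The full pieces can be put back together with cat before extracting; the incremental is the one to hang on a nightly cron job.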

Given that, I was able to stay calm in the face of failure. Ten years ago, I could "drop everything" and spend as long as necessary addressing some random computer problem. I can't do that anymore. Recent backups make that easier for me.

I tried some repair tactics, but the drive wasn't having it: it had trouble reading certain parts of the disk. I used the backups I had to work out which files were not in my "full" copy, and I set tar to work copying as much of the rest as it could.

Tip two: Pay attention to error messages. When tar would hit a file it couldn't read, it would say it was padding it with zeroes. I wrote down filenames thinking that I might look at them individually later. I gave up after a while; I'd already turned my piece of paper over once. In any case, that error message was important later.
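
In hindsight, paper is the wrong medium for this. A rough sketch of letting the shell keep the list instead, assuming a hypothetical helper called rescue_copy and an archive that lives somewhere other than the failing drive:

```shell
#!/bin/sh
# Sketch only: rescue_copy is a made-up name, not part of tar.

# rescue_copy SRC_DIR ARCHIVE LOG
# Copy whatever can be read from SRC_DIR into ARCHIVE, and keep tar's
# complaints about the files it could not read in LOG.
rescue_copy() {
    tar -cf "$2" -C "$1" . 2> "$3"
    status=$?
    if [ -s "$3" ]; then
        echo "tar reported problems; see $3" >&2
    fi
    # Pass tar's own exit status back to the caller.
    return "$status"
}
```

The log then serves as the list of suspect files when it comes time to restore.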

For fun, once the evacuation was done, I reformatted the disk and ran a bad blocks check on it. It came up clean. Clean? Yes, clean. I checked again. I did a destructive check. Burned to the ground, it rose from the ashes.

I figured this might be a good time to reformat the other drive also. I'd put them under ext3 ages ago, when I was still suspicious of this whole "journaling" thing, and I wanted an easy way back to my tried-and-true ext2. It turns out, ext3 doesn't always perform as well as the alternatives.

Tip three: Trust, but verify. To copy the data from the good drive to the suspiciously born-again drive, I did this:

tar -cf - /good-drive | gpg -z 0 -so /bad-drive/good-drive.tar.gpg

This means the resulting archive was signed, but with no compression and no encryption. Turning off compression probably helped it go a little faster. What I really wanted was a way to get a good checksum of the file without having to reread it. In retrospect, I could have done this:

tar -cf - /good-drive | tee /bad-drive/good-drive.tar | sha1sum - > /bad-drive/good-drive.tar.sha1

In any case, once the good drive's data were all on the suspicious drive, and signed, I did a verify with GnuPG something like four times, and it always came up valid. Each one took about four hours.
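
For the tee-and-checksum route, the verify step would be a re-read and a compare. A minimal sketch, assuming two hypothetical helpers and a digest stored next to the archive as ARCHIVE.sha1 (my naming, nothing standard):

```shell
#!/bin/sh
# Sketch only: helper names and the .sha1 suffix are assumptions.

# archive_with_digest SRC_DIR ARCHIVE
# Write the archive and its SHA-1 in a single pass over the source.
# ARCHIVE must not be inside SRC_DIR.
archive_with_digest() {
    tar -cf - -C "$1" . | tee "$2" | sha1sum | cut -d' ' -f1 > "$2.sha1"
}

# verify_archive ARCHIVE
# Re-read the archive from disk and compare against the stored digest.
verify_archive() {
    want=$(cat "$1.sha1") || return 1
    got=$(sha1sum < "$1" | cut -d' ' -f1)
    [ "$want" = "$got" ]
}
```

Running verify_archive a few times in a row is the same kind of test as the repeated GnuPG verifies: every pass forces the drive to read the whole file back.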

Note that since the failure last week, that drive has been working almost constantly.

Satisfied that my copy of the good drive could be read reliably off the suspicious drive, I reformatted the good drive and then restored its contents.

When it came time to restore those non-essential files I'd copied, I noticed that tar would extract right up to a file that had a problem and then halt. It was still reading the archive, but it didn't extract anything. That list of bad files came in handy; I could tell immediately that tar had hit one of the bad files just before it stopped extracting.

I searched the documentation for an option that might say "keep working even if you hit a bump." It turns out the '-i' option does this. Strictly speaking, it ignores a block of zeroes that normally indicates the end of the archive, y'know, like the zeroes that tar wrote there in the first place when it encountered a truncated input file.
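
The effect of '-i' is easy to see on a small scale. This sketch doesn't reproduce a failing disk; it just glues two archives together, so that the first one's end-of-archive zeroes sit in the middle of the file, and lists the result both ways. The file names are made up, and the behavior shown is what I'd expect from GNU tar.

```shell
#!/bin/sh
# Sketch only: a stand-in for an archive with zero blocks in the middle.
work=$(mktemp -d)
echo one > "$work/first.txt"
echo two > "$work/second.txt"

# Two separate archives, each ending in its own run of zero blocks.
tar -cf "$work/a.tar" -C "$work" first.txt
tar -cf "$work/b.tar" -C "$work" second.txt
cat "$work/a.tar" "$work/b.tar" > "$work/joined.tar"

# Without -i, tar treats the first run of zeroes as the end and stops.
without_i=$(tar -tf "$work/joined.tar")

# With -i, tar skips the zeroes and keeps reading.
with_i=$(tar -tif "$work/joined.tar")

echo "without -i: $without_i"
echo "with -i:    $with_i"
```

The same flag works for extraction, as in 'tar -xif'.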

Since I had no place of my own to put all this data, I relied on my wife's new Mac PowerBook running OS X. It was really handy to use netcat for the restore.

Tip four: When using someone else's computer, be nice. Any time I ran anything over there, I'd 'nice -19' the thing. I couldn't help but notice that her Microsoft programs were taking 10% of the processor even when idle, but I deferred to them anyway. It's not my computer.
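
If typing it every time gets old, a tiny wrapper does the job. This is only a sketch, and the name politely is my own invention; 'nice -n 19' is the spelled-out form of 'nice -19'.

```shell
#!/bin/sh
# Sketch only: politely is a made-up name.

# politely CMD [ARGS...]
# Run a command at a very low scheduling priority, so whatever the
# machine's owner is doing gets the processor first.
politely() {
    nice -n 19 "$@"
}
```

Then it's just 'politely' in front of whatever was going to run anyway, and the command's exit status comes back unchanged.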

Tip five: Keep more than the most recent backups. The drive probably failed days before I noticed it. I may be able to consult those earlier backups to find problems that aren't obvious right away.
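
Keeping more than one also means deciding how many. A rough sketch of pruning, assuming nightly archives named incr-YYYYMMDD.tar (a hypothetical naming scheme, chosen so that sorting by name is sorting by date) and paths without newlines in them:

```shell
#!/bin/sh
# Sketch only: the naming scheme and the helper are assumptions.

# prune_backups DIR KEEP
# Delete all but the KEEP newest incr-YYYYMMDD.tar files in DIR.
prune_backups() {
    ls "$1"/incr-*.tar 2>/dev/null |
        sort -r |                      # newest date first
        tail -n +"$(($2 + 1))" |       # everything after the first KEEP
        while IFS= read -r old; do
            rm -- "$old"
        done
}
```

Something like 'prune_backups /backups 14' would keep two weeks of nightlies, which is comfortably longer than the time it took me to notice a dying drive.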

I'm still not sure what all I've lost, but I'm sure that it's nothing important or irreplaceable, and I'm sure I can find out what it is. I'm not sure the suspicious drive will continue to work, but I know better than to trust it with anything important. All in all, this lousy failing hard drive experience has been the best I can recall.
