On or around 5 January 2011, the file system on the big RAID array—where I keep all my stuff—failed. The controller and disks, which I bought for the express purpose of avoiding such a situation, were all safe; it appears that the failure was at the software level. The extent of the damage had an interesting pattern: many files that had been created between about May and September were either corrupt or missing. My suspicion is that one of the times I fatfooted the power strip that year, the filesystem became inconsistent, and the inconsistency lay undetected until the filesystem hit its maximum mount count on a reboot earlier this month.
This, folks, is why the nerds say that fault tolerance is no replacement for a backup. About that backup: July 9th. Ouch. Luckily, due to a remarkably tumultuous back half of 2010, the very same that apparently prevented me from taking a more recent backup, I didn't really have anything important in the period between where the backup left off and where the file corruption ended. Gone for good are:
Effectively what this little excursion bought me was a whole lot of annoyance but no major catastrophe. After a couple of weeks of fussing and limping along with my remaining gear, I decided to man up and Just Fix It™, which, at the time of this writing, is what I have been doing virtually non-stop for the past four days.
I probably would have taken a lot less time if I hadn't done this:
find /broken -mtime -183 -type f -print0 | xargs -0 tar cv | gzip -c > possibly-corrupt-files.tar.gz
What this says is to find all files that had been modified in the past 183 days (since July 9), feed their names into tar, and compress the result, which I would store somewhere safe while I recreated the file system. Of course, xargs has an (ostensibly arbitrary) internal limit on how many values it takes as input before it runs its subprogram again. I knew about that. What I didn't know was that you can't just concatenate tar files end to end and have them work as one. The ensuing surgery took me about 10 hours, and once I figured out what to do it went like this:
dd if=woops.tar bs=512 skip=$WHATEVER | tar xv -R
It turns out tar plugs a chunk of nulls at the end of each run to signal end-of-archive, which means I couldn't just unpack the archive I had made and instead had to go rummaging through it like a bin diver. Luckily I figured out that the -R flag dumps out the block number of each file, and that all I had to do when the command hit the end was round that number up to the next multiple of 20 512-byte blocks (10240 bytes) and run dd again with the new offset. 33 times. If there is a better way to address this, I'd love to know, for when I hopefully never have to do this again.
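For what it's worth, GNU tar documents an --ignore-zeros option that is meant for reading archives that were glued together end to end. The following is only a sketch under assumptions: the scratch directory, the file names f1.txt through f5.txt and joined.tar are made up for illustration, and -n 2 is there purely to force xargs into several tar runs the way a long file list would.

```shell
# Hypothetical scratch area; none of these names come from the real incident.
work=$(mktemp -d)
cd "$work" || exit 1
for n in 1 2 3 4 5; do echo "file $n" > "f$n.txt"; done

# -n 2 makes xargs run tar three times; each run writes a complete
# archive (with its own end-of-archive padding) to the same stdout.
printf '%s\n' f*.txt | xargs -n 2 tar -c -f - > joined.tar

# A plain listing stops at the first end-of-archive marker.
plain=$(tar -t -f joined.tar | wc -l | tr -d ' ')

# With --ignore-zeros, tar skips the zeroed blocks and keeps reading.
all=$(tar -t --ignore-zeros -f joined.tar | wc -l | tr -d ' ')

echo "listed without --ignore-zeros: $plain"
echo "listed with --ignore-zeros:    $all"
```

If this behaves the same way on a real multi-gigabyte archive, a single tar x --ignore-zeros pass would stand in for the repeated dd runs, though I have not tried it against the original woops.tar.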
This is not the first time I have been screwed by the vagaries of xargs. It is an example of a leaky abstraction. A semantically identical (in terms of what I meant to do) command would go like this:
find /broken -mtime -183 -type f | tar cv -T - | gzip -c > possibly-corrupt-files.tar.gz
Here the -T flag specifies to look for the list of files in a file, which is conveniently being piped in from find.
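One thing the xargs version had that this one lacks is NUL-separated names, which matters if a file name ever contains a newline. A hedged sketch of how the two could be combined, assuming GNU tar's --null option (which has to appear before -T) and using a made-up directory called broken and an output name files.tar.gz:

```shell
# Hypothetical test tree standing in for /broken.
work=$(mktemp -d)
mkdir "$work/broken"
echo data > "$work/broken/plain.txt"
echo data > "$work/broken/name with spaces.txt"

# find emits NUL-terminated names; --null tells tar that the list read
# by -T - is NUL-terminated too. tar runs exactly once, so the result
# is a single archive, compressed here by tar's own -z.
(cd "$work" && find broken -mtime -183 -type f -print0 \
    | tar --null -T - -c -z -f files.tar.gz)

count=$(tar -t -z -f "$work/files.tar.gz" | wc -l | tr -d ' ')
echo "archived $count files in one tar run"
```

Letting tar do the compression with -z is only a convenience; piping to gzip -c as in the command above should amount to the same thing.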
The upshot of this little incursion is that it motivated me to do a few things that I had been putting off for years. For instance, I have really wanted to try AFS for some time for a confluence of reasons. Number one of those is to have exactly one logical place where all the stuff I care about lives, exactly one target to back up. Since I use more than one computer, putting it all on a huge mirrored disk array on my server makes sense (when it isn't flaking out), and then shunting it from there. Reason number two is that Samba, which is the easy brain-dead way to achieve this result, was routinely inconveniencing me, so it inevitably had to go. Reason the third is that installing AFS gives me an excuse to finally install and get some serious play with the byzantine authentication and authorization cadre of LDAP, PAM, SASL and Kerberos.
As I write, the last of the backup is being ever so gingerly copied into its new home on a fresh AFS cell. It has some peculiarities that I can already tell will take some getting used to.
I will no doubt have something to say about this configuration in the future. I will almost certainly appreciate the centralized authorization and single sign-on I get as a bonus. Who knows though, it might anger me and I'll turf it next week.