RAID is not backup – or, how I learned to appreciate S3 storage

TL;DR: Amazon S3 is dirt cheap and easy to use. Unless you’re the kind of person who is still prioritizing privacy over convenience – and I love you guys, but I just don’t want to live that way anymore* – you might as well take advantage of its services.

RAID is not backup. I knew this. Of course I knew it. But backing up data is a lot like eating kale – we all know it’s good for us, but most of us would rather eat ice cream and hope for the best. And anyway, I had mirrored drives to hold anything that might possibly be valuable. There were only two ways to lose data: both drives had to fail at once, or I had to mistakenly delete something critical. The likelihood of both drives failing at the same time is minuscule, and obviously, I would never accidentally delete critical data. I’m a pro.

For a while, I haphazardly addressed the backup issue with an external drive to which I could rsync all my important data. But I kept it unplugged when not in use, because a plugged-in drive could be destroyed by a power surge, defeating the purpose. So I had to go down to the basement and dig around for power strips to back up my data. And I mean, really, the backup drive wasn’t foolproof anyway, since a flood or fire would still destroy it. And eventually, I had more data than I could fit on the backup drive, and at some point I just stopped even pretending I was using it as a backup drive. It was a snapshot – a very, very old snapshot.

Backing up to an external server never really occurred to me. It was a lot of data, and when I built the server, it was at a time when the cloud wasn’t even a twinkle in anyone’s eye. Hosted server space was expensive. And anyway, I was allergic to having all of my data stored in a server farm where it could be hacked or swept up in a government seizure.

Years passed. In many ways, I gave up on fighting the battle to keep my data local. I got a gmail account, and I started using it more and more often. I started using google docs. I misplaced my privacy zealot card. But I never thought of how that might change my (non)approach to backing up my personal data.

I started having less and less interest in fiddling with computers just to keep them running. Eventually, my home server just hosted shared files and an email server. And finally, just shared files. And since it wasn’t doing a whole lot, I didn’t worry too much about keeping it up to date. Debian is nice that way – it just kind of keeps going, tolerant of my benign neglect. The kernel became geriatric. Aptitude refused to update glibc. And finally I got some sort of warning about having to fix … something … before I could boot safely. It was still running fine, though. I decided I needed to start from scratch. Most of the configuration on my server was for systems I hadn’t even run in a decade. I didn’t want to keep dragging all of that obsolete history with me. I’d only keep my home directory and, of course, that mirrored drive.

And then I had a bad fall while skiing and hurt my knee. After the initial few days of agony, painkillers did their job, and I was mostly bored. I wouldn’t dare go back to work on painkillers … but what could go wrong on a reinstall?

Long story short: things can go wrong. Especially when your memory is hazy and you feel fine, but you’re actually kind of dumb compared to your usual self.

We’re still not sure exactly why, but the mirrored drives didn’t mount properly. The drives claimed to have no data. My heart stopped. These drives held all of my photos from at least the last decade. My husband, who had set them up originally and who knows “all the things” about storage, was flummoxed for a while. I didn’t push him. As long as the drives hadn’t officially been declared dead, my data might still be there. Schrödinger’s photos.

Eventually, he got one of the drives to mount individually, finding the data in an abnormal configuration. The first thing I decided to do was to copy everything to S3. A year ago, this would have seemed daunting to me – but I’ve been using it daily at my new job, so I knew the basics.

I ran into some snags. The main thing I’d advise others: if you want versioning for audit purposes, don’t enable it until after your initial upload. You’ll probably want to revise some things, and you may end up with versioned binaries that are a pain to identify and permanently delete. Once you’ve enabled versioning, you can’t fully turn it off – only suspend it. And while S3 is cheap, you probably don’t want to be paying monthly fees for data you will never, ever want to look at.
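For reference, permanently deleting an old version means dropping down to the lower-level s3api commands – a minimal sketch, where BUCKET, the key, and VERSION_ID are placeholders for your own bucket, file, and version:

# list every version of a single object, then permanently delete one version by id
aws s3api list-object-versions --bucket BUCKET --prefix path/to/file.bin
aws s3api delete-object --bucket BUCKET --key path/to/file.bin --version-id VERSION_ID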

The S3 command line tools are … not perfect. They are finicky, and you’ll probably want to create an alias or script to do exactly the same thing every time. There are some things you just can’t do, or can’t do without pulling down the metadata and writing a custom script – for example, searching for any and all files that have multiple versions.
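To give a flavor of the kind of custom script I mean, here’s a rough sketch that pulls the version listing for a bucket and prints any key with more than one version – it assumes you have jq installed, and BUCKET is a placeholder for your bucket name:

# fetch version metadata for the bucket and print keys that have more than one version
aws s3api list-object-versions --bucket BUCKET --output json \
  | jq -r '.Versions | group_by(.Key) | map(select(length > 1) | .[0].Key) | .[]'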

The sync command is made for backup scenarios, but when I first started uploading data, I hit a lot of HTTP error codes. After a few starts and stops, sync started working. I saw this pattern repeat over the next few days – if I hadn’t been uploading in a while, the first attempt at a large sync would fail. After a few minutes, I would run it again and it would work as expected. My best guess is that a server needed to spin up to respond to multiple simultaneous requests.
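If you hit the same pattern, the dumb workaround is a retry loop around the sync – a sketch, with a made-up source directory, a placeholder bucket name, and a sleep you should adjust to taste:

# try the sync a few times, waiting between attempts
for i in 1 2 3; do
  aws s3 sync ~/photos s3://BUCKET/photos && break
  echo "sync attempt $i failed; retrying in five minutes" >&2
  sleep 300
done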

Be careful with following symbolic links, which is the default for sync. It turned out I had a recursive symbolic link … and the sync happily slurped it all up until I finally got suspicious, stopped the sync, and deleted the symbolic link.
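If you want to check for stray symlinks before a big upload, something as simple as this will do (point it at whatever directory you’re about to sync – the path here is just an example):

# list every symbolic link under the directory, so nothing surprises you mid-sync
find ~/photos -type l -ls

Otherwise, just pass --no-follow-symlinks, as in the sample commands at the end of this post.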

Sync doesn’t default to deleting, only copying and updating. This may or may not be the behavior you want. (I don’t typically want delete, but I do if I’m cleaning house.)
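When I do want deletion, I run the sync with --dryrun first so I can see what would be removed before committing to it (BUCKET and the local path are placeholders):

# preview what --delete would remove, without changing anything
aws s3 sync ~/photos s3://BUCKET/photos --delete --dryrun
# then run it for real
aws s3 sync ~/photos s3://BUCKET/photos --delete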

Ultimately I’m spending about $10/month to store about 300 GB of data on an external server with instantaneous access. I can likely trim that down a good bit because I don’t really need to be backing up 15-year-old CD rips, but at the time I was just focused on copying everything and worrying about nuance later.
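As a rough sanity check on that bill: standard S3 storage runs somewhere around two to three cents per GB per month (pricing varies by region and keeps changing), so 300 GB at roughly $0.03/GB works out to about $9/month, plus a little for requests – which lines up with what I’m paying.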

Useful Links:

AWS cost calculator:
http://calculator.s3.amazonaws.com/index.html

(For storage only, click “Amazon S3” on the left – you don’t care about EC2 instances, etc)

In order to use AWS command line tools, you need to set up a user and give it an access key and secret access key.
http://docs.aws.amazon.com/general/latest/gr/managing-aws-access-keys.html

Use the command ‘aws configure’ to set up your credentials locally.
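That command just writes a couple of small files under ~/.aws – roughly like this, with obviously fake placeholder keys and an example region:

# ~/.aws/credentials
[default]
aws_access_key_id = AKIAEXAMPLEKEYID
aws_secret_access_key = exampleSecretAccessKey

# ~/.aws/config
[default]
region = us-east-1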

Once you’ve set up your credentials and they’re in your ~/.aws directory, you can use the aws s3 sync command:
http://docs.aws.amazon.com/cli/latest/reference/s3/sync.html

Note: you probably want to use --exclude to make sure you don’t upload your aws credentials to S3! (This may not really matter when you’re using your account purely for backup and will never create another user account in S3.)
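In practice that just means tacking an exclude pattern onto the sync – for example, if you were backing up your whole home directory (BUCKET again being a placeholder):

# skip the credentials directory when shipping the home directory to S3
aws s3 sync ~ s3://BUCKET/home --exclude ".aws/*"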

Removing versions from audited S3 buckets:
http://boulderapps.co/post/remove-all-versions-from-s3-bucket-using-aws-tools

Allowing public access to an S3 bucket (if you want to be able to allow public web access to certain files from S3):

How to Allow Public Access to an Amazon S3 Bucket

Sample command line calls


# sync local directory foo into the bucket (BUCKET is a placeholder for your bucket name)
aws s3 sync foo s3://BUCKET/foo --no-follow-symlinks
# same, but also delete remote files that no longer exist locally
aws s3 sync foo s3://BUCKET/foo --no-follow-symlinks --delete
# human-readable usage for a prefix, via the separate s3cmd tool
s3cmd du -H s3://BUCKET/some/arbitrary/directory

*I realize that by putting my data on a well-known service like S3, various nefarious interests (like governments, both foreign and domestic) are more likely to get access to my data than they would be if I kept my data in a basement server or some small mom-and-pop hosted service. But I’m already using facebook, gmail, and google drive – not to mention super creepy services like mint. I’ve clearly already made my choice in the battle of convenience vs. privacy.
