Managing a researcher's digital life with git-annex
I'm a researcher working mostly on the computational side of computational biology and bioinformatics, but recently I've had to get out of my comfort zone a little more than usual, and I keep running into the same damn problem. My digital life is necessarily spread across multiple machines (my laptop, my desktop, the local cluster in our lab, the university HPC cluster, even my phone), and it's hard to keep track of where everything is.
Git helps organise my code, and it's useful for a number of reasons:
- Any changes I make locally are non-destructive and can be reverted
- I have a complete history of every change I've ever made to a document (with useful messages explaining each change!)
- It doesn't matter which computer I'm on - as long as my latest changes are on the remote, I can synchronise my work
- It's frictionless. I never need to worry about [expand on this...?]
Since I started using Git, I can't help but wonder why I don't organise the rest of my life like this. Each of the above benefits applies to research data (raw data, images, spreadsheets, secondary analysis data, writeups, notes, bibliographies) just as much as it does to code. So why don't we use Git to manage research?
The simple answer
In biological research right now, sequencing is the workhorse of the lab. Enormous data files in .fasta, .fastq or .bam format comprise much of our key data, and these files are BIG. Usually you don't even bother copying them to your laptop at any stage. They go straight onto the cluster and you run your pipeline (generating an even more enormous quantity of data in the form of intermediate files) and finally get your results. Usually you actually end up doing this multiple times, saving each result just in case.
It is utterly impossible to work with something like this with plain old git, right? You'd have to store terabytes of data in the repository itself, and that's just to get git's versioning. By design, every clone stores the entire history: each file and all of its changes since the start. There are a multitude of approaches to handling this problem for the end user [examples?]
git-annex
The Data
git-annex solves this problem (yes, really solves it) by splitting the problem into two parts - the data, and a map of where it is. You can keep your data wherever the hell you like - on a USB stick (not recommended!), on a hard drive (or ten), on your cluster, in an AWS S3 bucket, on another computer, on OneDrive (boo! hiss!), even over BitTorrent if you like it hot. Even a simple URL is a valid remote, opening the door to including the Sequence Read Archive (SRA) or any other online data source in your repository.
The Map
The map is an ordinary git repository, with your real data files represented by files as you'd expect. The catch is that each of these files is just one thing - a symbolic link to the real data. You can add, commit, push/pull - perform all the usual git operations, but the contents of the files are abstracted away by git-annex so that you don't have to worry about this part. Whether the file is stored on disk, on a remote server or on the web is irrelevant.
The upshot of this is that you can have the entire corpus of your digital life ready and available to browse in the directory structure of your choosing, without needing to keep it all on disk.
The basic idea
git-annex replaces large files with symbolic links and records the file content in a separate object store. When you run:
git annex add big-file.fastq.gz
This stages a symlink whose target encodes a hash of the content, ready to be committed like any other change, while the actual bytes move into .git/annex/objects/. From there you can copy the content to any number of remotes (an external drive, a network share, a cloud bucket) and drop it from your local machine when you no longer need it. Getting it back is:
git annex get big-file.fastq.gz
That's largely it. The rest is just configuring remotes and deciding what lives where.
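To make the whole cycle concrete, here is a rough sketch of archiving files to a remote and then freeing the local space. The helper name archive_to_remote and the remote name in the usage line are made up for illustration; the git annex subcommands themselves (add, copy --to, drop) are the standard ones.

```shell
# archive_to_remote REMOTE FILE...
# Hypothetical helper: annex the files, commit the symlinks, copy the
# content to REMOTE, then drop the local copy. Stops at the first failure.
archive_to_remote() {
    local remote=$1
    shift
    git annex add "$@" || return 1
    git commit -m "Add $# annexed file(s)" || return 1
    git annex copy --to="$remote" "$@" || return 1
    # drop is refused if git-annex cannot verify enough other copies
    git annex drop "$@"
}

# Usage (the remote name "ssd" is an assumption):
#   archive_to_remote ssd big-file.fastq.gz
```

After this, the symlink is still in the working tree and in history, so git annex get brings the content back whenever it's needed.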
My setup
I run a single repository that acts as the canonical record of everything I want to track: raw data, processed outputs, figures, notes. The remotes are:
- HPC scratch — where jobs actually run, synced before each analysis batch
- External SSD — a full local backup, also the thing I grab if I'm working offline
- rsync.net — off-site cold storage via the rsync special remote
I've set numcopies to 2 for the whole repository (it is a repository-wide setting rather than a property of any one remote), so git-annex won't let me drop a file locally unless it can verify that at least two other copies exist. That one guard has probably saved me from myself more than once.
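For anyone wanting to try something similar, the numcopies guard and two of the remotes above could be set up roughly like this. It's a sketch under assumptions: the function name, the remote names ssd and offsite, and the path and URL in the usage line are placeholders, and the SSD is treated as a directory special remote, which may not match every setup. The HPC remote is left out because its type depends on how the cluster is reached.

```shell
# setup_annex_remotes SSD_DIR RSYNC_URL
# Hypothetical one-time setup, run inside an existing git-annex repository.
setup_annex_remotes() {
    local ssd_dir=$1 rsync_url=$2
    # Repository-wide: drop must be able to verify 2 other copies.
    git annex numcopies 2 || return 1
    # External SSD as a directory special remote (an assumption).
    git annex initremote ssd type=directory \
        directory="$ssd_dir" encryption=none || return 1
    # Off-site storage through the rsync special remote.
    # encryption=none keeps the sketch simple; git-annex also supports
    # encrypted special remotes.
    git annex initremote offsite type=rsync \
        rsyncurl="$rsync_url" encryption=none
}

# Usage (path and URL are placeholders):
#   setup_annex_remotes /mnt/ssd/annex user@host.example:annex
```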
Why it works for research
A few properties make git-annex particularly well suited to research workflows:
Locality. On the cluster I only need the input files for the current job, not three months of old raw data. git annex get pulls exactly what I need, the job runs, and git annex drop clears the scratch space afterward. The git history still records that those files exist and where they came from.
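The get, run, drop pattern is easy to wrap up. Here's a minimal sketch; the function name run_job_with_inputs and the script name in the usage line are hypothetical, and it assumes the job takes its input files as arguments.

```shell
# run_job_with_inputs COMMAND FILE...
# Hypothetical wrapper: fetch the annexed inputs, run COMMAND on them,
# then free the scratch space. Returns COMMAND's exit status.
run_job_with_inputs() {
    local job=$1 status
    shift
    git annex get "$@" || return 1      # fetch only these inputs
    "$job" "$@"
    status=$?
    # git-annex refuses to drop if too few other copies can be verified
    git annex drop "$@" || echo "warning: some inputs were not dropped" >&2
    return "$status"
}

# Usage (script and file names are made up):
#   run_job_with_inputs ./align.sh sample1.fastq.gz sample2.fastq.gz
```

Dropping even when the job fails is a choice made for this sketch; on a cluster with tight scratch quotas it seemed the safer default, since the content can always be fetched again.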
Reproducibility. Because git-annex stores a content hash for every file, you can verify that what you have now is byte-for-byte identical to what you had when you first ran an analysis. For a field where reproducibility is taken seriously, that matters.
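In practice git annex fsck is the built-in way to run that check. To show what it amounts to, here's a sketch that recomputes the hash by hand and compares it with the one embedded in the symlink target. It assumes the default SHA256E backend, a locked (symlinked) file whose content is present, and sha256sum on the PATH; the function name verify_annexed is made up.

```shell
# verify_annexed FILE
# Hypothetical check: succeed only if FILE's content matches the SHA-256
# recorded in its git-annex key.
verify_annexed() {
    local file=$1 key expected actual
    # The symlink target ends in the key: SHA256E-s<size>--<hash>.<ext>
    key=$(basename "$(readlink "$file")")
    expected=${key##*--}       # drop everything up to the "--" separator
    expected=${expected%%.*}   # drop the extension that SHA256E keeps
    actual=$(sha256sum "$file" | cut -d ' ' -f 1)
    [ "$expected" = "$actual" ]
}

# Usage:
#   verify_annexed big-file.fastq.gz && echo "content matches its key"
```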
No special infrastructure. Any machine you can ssh into can be a remote. I added the HPC in about five minutes with git annex initremote.
The rough edges
It is not without friction. The symlink approach confuses some tools: certain bioinformatics pipelines won't follow symlinks and need a git annex unlock before they'll touch the files. The learning curve is real: git-annex has a lot of commands, and the documentation, while thorough, assumes you already think in git terms.
But after a year of using it, I wouldn't go back to ad-hoc rsync scripts and hoping for the best. My data is in one place, I know exactly which copies exist where, and I've stopped losing files to drive failures.
If you're a researcher drowning in large files and want to talk through a setup, feel free to get in touch.