Whitepaper: Git-based stacked journalling filesystem

Title: gmfs, an experimental stacked journalling filesystem built on Git
Author: James Stallings II aka Hiro Protagonist (Freenode IRC)
Abstract: Technical Survey, Implementation Instructions and Materials
Summary:
		A brief discussion of the nature of Journalled Filesystems; a documented exposition of the prototypical
	implementation; a manifest of requirements for proper operation

Date: Wed Sep 26, 2012

This document and anything in it that is deemed worthy and valuable by or to anyone is made freely available to such
persons, out of the sheer benevolence of my right intent, without encumbrance or other hindrance, insofar as they
do not usurp credit for the work, nor prevent other persons from the enjoyment of its equally unencumbered employment.

--

Introduction

Having been involved over the years with a couple of different complex opensource projects, I've come to be rather
familiar with source code management systems, and have learned why they're useful and how they empower users and
insulate them from various potential problems that arise in the course of managing large, dynamic projects.

More modern revision control systems generalize over the software development model, becoming more like content-management
systems; enter Git.

Git was first conceived as an opensource tool for managing the source code of the Linux kernel, perhaps one of the largest,
most complex, and dare I say most important source code bases in existence. Its design requirements remain rigorous, its
feature-set progressively broad in scope, and being designed largely for and by the Linux kernel development community,
it had to meet or exceed all requirements before being deemed acceptable.

I first encountered git when a project I have long-standing involvement with (the opensimulator project) switched to it
as the source-control tool of choice a couple of years ago. I've since struggled with it some, cursed at it some, learned
a lot, and come to appreciate it for the amazingly useful tool it is.

The particular application of it as backend for a journalled filesystem occurred to me recently while reading Git
documentation, as I was preparing to teach it. The documentation mentions, almost as a footnote, that git is perfectly
suitable for managing all of a project's assets, not just its source code; and it occurred to me at once that I should
attempt initializing a smallish filesystem and seeing how it went.

Presented here is what evolved from that experiment.

--

Journalling Filesystems: Disaster recovery management strategy?

I guess this depends a lot on whether you take the developer's view or the manager's view; in any case, it all boils
down to change management, whether or not such change is anticipated.

A journalling filesystem allows one to travel in time, after a fashion, within the filesystem. You can treat it like
any other filesystem, and at any time, roll it back to a previous point in time, and see the filesystem as it was at
that time. The benefits of this capability are manifold and relatively obvious.
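
As a rough illustration of that 'time travel', here's a minimal sketch using plain git in a throwaway directory. The
temporary directory, the file name notes.txt and the placeholder identity are all made up for the example; none of it is
part of gmfs itself.

```shell
#!/bin/bash
# Sketch only: a throwaway repository standing in for the managed filesystem.
set -e
demo=$(mktemp -d)
cd "$demo"
git init -q .
# placeholder identity so the commits succeed on a machine with no git config
git config user.name "demo"
git config user.email "demo@example.invalid"

echo "first draft" > notes.txt
git add notes.txt
git commit -q -m "snapshot 1"
first=$(git rev-parse HEAD)

echo "second draft" > notes.txt
git commit -q -a -m "snapshot 2"

# look at one file as it was at the earlier snapshot, without touching the working copy
old_text=$(git show "$first:notes.txt")
echo "then: $old_text"
echo "now:  $(cat notes.txt)"

# or roll the whole tree back in place (detached HEAD), then return to the present
git checkout -q "$first"
rolled_back=$(cat notes.txt)
git checkout -q -
```

In a real setup the snapshot of interest would be picked out of `git log` by its date-stamped commit message rather than
remembered in a variable.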

--

Backup Strategy or Version Control?

Both, at least in this instance. Git can do some pretty useful things with branches, and these things are equally
useful when the whole filesystem is under git management. So not only can git be used for speculative work, it can also
be used as a point-in-time backup control system with some pretty unique capabilities: backups (re)generated on demand,
ready *remote replication and synchronization (file load balancing potential here?).

* note that this refers to the entire filesystem repository, which is not just a current copy of the filesystem, but also
its entire history and that of every file it contains

The possibilities are intriguing to say the least.
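
To make the replication idea a bit more concrete, here's a minimal sketch, assuming a second location (another drive,
say) holds a bare copy of the repository. Both directories are temporary ones invented for the demo, and the remote name
'backup' is just an illustrative choice.

```shell
#!/bin/bash
# Sketch only: replicate a repository to a second location, then rebuild from it.
set -e
work=$(mktemp -d)      # stands in for the managed filesystem
mirror=$(mktemp -d)    # stands in for the second drive

cd "$work"
git init -q .
# placeholder identity so the commit succeeds on a machine with no git config
git config user.name "demo"
git config user.email "demo@example.invalid"
echo "hello" > a.txt
git add a.txt
git commit -q -m "snapshot 1"

# create an empty bare repository on the 'second drive' and register it as a remote
git init -q --bare "$mirror/fs.git"
git remote add backup "$mirror/fs.git"

# push every branch; the bare copy now holds the same history
git push -q backup --all

# regenerate a working copy on demand from the replica
restore=$(mktemp -d)
git clone -q "$mirror/fs.git" "$restore/fs"
restored_text=$(cat "$restore/fs/a.txt")
echo "restored: $restored_text"
```

Because the replica carries the full history, any earlier snapshot can be checked out from the restored copy as well,
not just the latest one.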

--

So, enough with the blather, here's what I did:

Parts list:

1. Two 32 GB USB 2.0 flash drives
2. My P4D desktop box running Ubuntu Linux 12.04 LTS


Software components employed:

1. Git           SCM/RCS
2. cron          clock-event software scheduler
3. bash          shell scripting language
4. automounter   automagically mount my flash drives in userland with fstab
5. your preferred text editor for working with the files


The custom files:

1. fstab
2. crontab
3. roll.sh

--

Setting it all up

The first thing that must be accomplished is getting the underlying volumes and filesystems viable and operable - in my
case, this means getting the volumes (two usb flash drives) and their filesystems (vfat) mounted in a consistent
location and in a consistent fashion. As they are removable media and managed by my user, I need them mounted in my
user's file space, and they need to automount when inserted in the ports. Here's the relevant portion of my fstab:

#/dev/sdc1 USB Thumbdrive
UUID=A974-15B6	/home/twitch/flashdrv0	vfat	rw,user,noauto,nofail

#/dev/sdd1 USB Thumbdrive
UUID=40E5-93BB	/home/twitch/flashdrv1	vfat	rw,user,noauto,nofail

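Note that the drives are mounted by UUID, which is what lets the system tell two otherwise identical drives apart; device
names like /dev/sdc1 can change depending on the order in which the drives are plugged in.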

The next thing is the crontab. Cron is a unix program that runs software based on what time it is. I won't go any further
into it than that. The short story is, we need to do certain processing over the filesystem periodically with git to make
the journalling magic happen, and that processing is encapsulated within the roll.sh shell script; that script is run by
cron (every 15 mins, all day long, every day, every week, every month in my case).

The cron entries for my installation are as follows:

0,15,30,45 * * * * /bin/bash /home/twitch/flashdrv1/shbin/roll.sh
0,15,30,45 * * * * /bin/bash /home/twitch/flashdrv0/shbin/roll.sh

Now to the meat of it -- the roll.sh script. Sounds like major mojo, but it really isn't; it's just some basic
automation of git, which does all of the heavy lifting.

Here's roll.sh:

#!/bin/bash

# every step of the run appends to this log file
LOG=~/flashdrv0/logs/gmfs.log

# for each directory among the arguments: stage its contents, then descend into it
function Recurse
{
   local f
   for f in "$@"
     do
       if [[ -d "${f}" ]]; then
         echo "/usr/bin/git add ${f}/*" >>"$LOG"
         /usr/bin/git add "${f}/*" >>"$LOG"
         # the subshell leaves the caller's working directory alone, even if cd fails
         ( cd "${f}" && Recurse * )
       fi
     done
}

# process cwd as a git-managed filesystem
#
# this is experimental and is just what it sounds like
#
echo "=====================================================================================================" >>"$LOG"
cd /home/twitch/flashdrv0/ || exit 1
echo "$(date) - staging changes" >>"$LOG"
Recurse *
/usr/bin/git add ~/flashdrv0/. >>"$LOG"
/usr/bin/git add -u ~/flashdrv0/. >>"$LOG"
echo "$(date) - the following changes were staged for commit:" >>"$LOG"
/usr/bin/git status >>"$LOG"
echo "$(date) - making commit" >>"$LOG"
/usr/bin/git commit -a -m "$(date)" >>"$LOG"
echo "$(date) - commit completed" >>"$LOG"


Note the hard-coded paths in both the crontab and the roll.sh shell script. This is an area that could potentially
benefit from some configuration points. Note also that the script fully logs all its activities.
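
One way such configuration points might look, as a sketch only: the same staging and commit steps wrapped in a function
that takes the repository path (and optionally the log path) as arguments. The function name 'roll' and both of its
parameters are hypothetical, and unlike the original script this version also sends git's error output to the log.

```shell
#!/bin/bash
# Sketch: the roll.sh steps with the repository and log file passed in, not hard-coded.
# usage: roll REPO [LOG]   (a relative LOG is taken relative to REPO)
roll() (
    # the body runs in a subshell, so the caller's working directory is untouched
    cd "${1:?usage: roll REPO [LOG]}" || exit 1
    log=${2:-$PWD/logs/gmfs.log}
    mkdir -p "$(dirname "$log")"
    {
        echo "$(date) - staging changes"
        git add .
        git add -u .
        echo "$(date) - the following changes were staged for commit:"
        git status
        echo "$(date) - making commit"
        git commit -a -m "$(date)"
        echo "$(date) - commit completed"
    } >>"$log" 2>&1
)
```

A script built around this would end with something like `roll "$@"`, so that each cron entry could pass its own drive,
for example /home/twitch/flashdrv0 in one entry and /home/twitch/flashdrv1 in the other.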

The intention of the git commands employed by the script is as follows:

- (recursively) add any untracked directories and files within them to the repository
- add the root of the filesystem to the repository (this might well be redundant)
- update the repository with any files or folders that have been removed
- make log entries summarizing repository staging conducted thus far (not git but relevant)
- commit the staged changes to the repository, updating the repository to the state of the working directory
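
For what it's worth, git can cover the first three of those steps in a single command: `git add -A` stages new,
modified and removed files together. The sketch below tries that out in a throwaway repository; the directory layout
and file names are invented for the demo.

```shell
#!/bin/bash
# Sketch only: stage additions, edits and removals in one step with 'git add -A'.
set -e
demo=$(mktemp -d)
cd "$demo"
git init -q .
# placeholder identity so the commits succeed on a machine with no git config
git config user.name "demo"
git config user.email "demo@example.invalid"

mkdir -p docs/old
echo "keep" > docs/keep.txt
echo "gone soon" > docs/old/gone.txt
git add .
git commit -q -m "snapshot 1"

# change the tree three ways: remove a file, edit a file, add a new nested file
rm docs/old/gone.txt
echo "edited" >> docs/keep.txt
mkdir -p new/deep
echo "fresh" > new/deep/fresh.txt

# one command stages all three kinds of change under the current directory
git add -A .
staged=$(git status --porcelain)
echo "$staged"
git commit -q -m "snapshot 2"
```

Whether to replace the recursive walk in roll.sh with this is a judgment call; the walk does have the side effect of
logging each directory as it is visited.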

--

Idiosyncrasies

The script writes to its log so often that the log file is modified again immediately after the staged changes have been
committed to the repository; so it always shows as modified and ready for staging. This and some other similar
circumstances make for some 'interesting' issues when working with branches. Among these is that any uncommitted changes
will have to be resolved in the new branch before one can return to the 'master' branch; so it is probably best practice
to manually commit any staged changes before working with a new branch.
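
One possible way around the always-modified log, which is not something the setup above does, is to stop tracking the
log and have git ignore it. The sketch below shows the idea in a throwaway repository, assuming the log lives under a
logs/ directory as it does here; the trade-off is that the log's own history is then no longer kept.

```shell
#!/bin/bash
# Sketch only: untrack the log file and ignore it, so logging stops dirtying the tree.
set -e
demo=$(mktemp -d)
cd "$demo"
git init -q .
# placeholder identity so the commits succeed on a machine with no git config
git config user.name "demo"
git config user.email "demo@example.invalid"

mkdir logs
echo "run 1" > logs/gmfs.log
echo "data" > file.txt
git add .
git commit -q -m "snapshot 1"

# stop tracking the log (the file itself stays on disk) and ignore it from now on
git rm -q --cached logs/gmfs.log
echo "logs/" > .gitignore
git add .gitignore
git commit -q -m "stop tracking logs"

# further logging no longer shows up as a pending change
echo "run 2" >> logs/gmfs.log
dirty=$(git status --porcelain)
echo "pending changes: '${dirty}'"
```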

--

Conclusions

This experiment is still very much in progress, and the jury is still out on whether it's a viable journalling filesystem
solution or just a technically interesting curiosity. I'm publishing about it in the interest of sharing my experiment,
and encouraging others to join in, not in the interest of presenting or announcing a finished product with deliverables.
In short, use at your own risk.