- On Thu, 9 Feb 2012 01:20:04 PM Matthew Dillon wrote:
- > This is the current design document for HAMMER2. It lists every
- > feature I intend to implement for HAMMER2. Everything except the freemap
- > and cluster protocols (which are both big ticket items) has been
- > completely speced out.
- >
- > There are many additional features versus the original document,
- > including hardlinks.
- >
- > HAMMER2 is all I am working on this year so I expect to make good
- > progress, but it will probably still be July before we have anything
- > usable, and well into 2013 before the whole mess is implemented and
- > even later before the clustering is 100% stable.
- >
- > However, I expect to be able to stabilize all non-cluster related
- > features in fairly short order. Even though HAMMER2 has a lot more
- > features than HAMMER1, the actual design is simpler than HAMMER1, with
- > virtually no edge cases to worry about (I spent 12+ months working edge
- > cases out in HAMMER1's B-Tree, for example... that won't be an issue for
- > HAMMER2 development).
- >
- > The work is being done in the 'hammer2' branch off the main dragonfly
- > repo in appropriate subdirs. Right now it is just vsrinivas and me, but
- > hopefully enough will get fleshed out in a few months that other people
- > can help too.
- >
- > Ok, here's what I have got.
- >
- >
- > HAMMER2 DESIGN DOCUMENT
- >
- > Matthew Dillon
- > 08-Feb-2012
- >
- > * These features have been speced in the media structures.
- >
- > * Implementation work has begun.
- >
- > * A working filesystem with some features implemented is expected by July
- > 2012.
- >
- > * A fully functional filesystem with most (but not all) features is
- > expected by the end of 2012.
- >
- > * All elements of the filesystem have been designed except for the freemap
- > (which isn't needed for initial work). 8MB per 2GB of filesystem
- > storage has been reserved for the freemap. The design of the freemap
- > is expected to be completely speced by mid-year.
- >
- > * This is my only project this year. I'm not going to be doing any major
- > kernel bug hunting this year.
- >
- > Feature List
- >
- > * Multiple roots (allowing snapshots to be mounted). This is implemented
- > via the super-root concept. When mounting a HAMMER2 filesystem you
- > specify a device path and a directory name in the super-root.
- >
- > * HAMMER1 had PFS's. HAMMER2 does not. Instead, in HAMMER2 any directory
- > in the tree can be configured as a PFS, causing all elements recursively
- > underneath that directory to become a part of that PFS.
- >
- > * Writable snapshots. Any subdirectory tree can be snapshotted. Snapshots
- > show up in the super-root. It is possible to snapshot a subdirectory
- > and then later snapshot a parent of that subdirectory... really there are
- > no limitations here.
- >
- > * Directory sub-hierarchy based quotas and space and inode usage tracking.
- > Any directory sub-tree, whether at a mount point or not, tracks aggregate
- > inode use and data space use. This is stored in the directory inode all
- > the way up the chain.
- >
- > * Incremental queueless mirroring / mirroring-streams. Because HAMMER2 is
- > block-oriented and copy-on-write, each blockref tracks both direct
- > modifications to the referenced data via (modify_tid) and indirect
- > modifications to the referenced data or any sub-tree via (mirror_tid).
- > This makes it possible to do an incremental scan of meta-data that covers
- > only changes made since the mirror_tid recorded in a prior run.
- >
- > This feature is also intended to be used to locate recently allocated
- > blocks and thus be able to fixup the freemap after a crash.
- >
- > HAMMER2 mirroring works a bit differently than HAMMER1 mirroring in
- > that HAMMER2 does not keep track of 'deleted' records. Instead any
- > recursion by the mirroring code which finds that (modify_tid) has
- > been updated must also send the direct block table or indirect block
- > table state it winds up recursing through so the target can check
- > similar key ranges and locate elements to be deleted. This can be
- > avoided if the mirroring stream is mostly caught up, in that very recent
- > deletions will be cached in memory and can be queried, allowing shorter
- > record deletions to be passed in the stream instead.
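- >
- > As a rough sketch of that incremental scan (the structure and function
- > names here are hypothetical, not the actual HAMMER2 media layout):
- >
- >     #include <stdint.h>
- >     #include <stddef.h>
- >
- >     /*
- >      * Simplified, hypothetical blockref.  modify_tid covers direct
- >      * changes to the referenced data; mirror_tid covers any change
- >      * anywhere in the sub-tree below it.
- >      */
- >     struct h2_blockref {
- >         uint64_t modify_tid;
- >         uint64_t mirror_tid;
- >         struct h2_blockref *children;   /* in-memory stand-in for */
- >         size_t nchildren;               /* the block table */
- >     };
- >
- >     /* Stub: send one record to the mirroring target. */
- >     static void emit_record(const struct h2_blockref *bref) { (void)bref; }
- >
- >     /*
- >      * Incremental scan: prune any sub-tree whose mirror_tid shows
- >      * no change since the prior run's transaction id.
- >      */
- >     static void
- >     mirror_scan(const struct h2_blockref *bref, uint64_t last_tid)
- >     {
- >         if (bref->mirror_tid <= last_tid)
- >             return;                 /* nothing changed below, prune */
- >         if (bref->modify_tid > last_tid)
- >             emit_record(bref);      /* directly modified element */
- >         for (size_t i = 0; i < bref->nchildren; ++i)
- >             mirror_scan(&bref->children[i], last_tid);
- >     }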
- >
- > * Will support multiple compression algorithms configured on a subdirectory
- > tree basis and on a file basis. Up to 64K block compression will be
- > used. Only compression ratios near powers of 2 that are at least 2:1 (e.g.
- > 2:1, 4:1, 8:1, etc) will work in this scheme because physical block
- > allocations in HAMMER2 are always power-of-2.
- >
- > Compression algorithm #0 will mean no compression and no zero-checking.
- > Compression algorithm #1 will mean zero-checking but no other
- > compression. Real compression will be supported starting with algorithm #2.
- >
- > * Zero detection on write (writing all-zeros), which requires the data
- > buffer to be scanned, will be supported as compression algorithm #1.
- > This allows the writing of 0's to create holes and will be the default
- > compression algorithm for HAMMER2.
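- >
- > A minimal sketch of that zero-check (function name hypothetical; a real
- > implementation would likely compare machine words, not bytes):
- >
- >     #include <stdint.h>
- >     #include <stddef.h>
- >
- >     /*
- >      * Compression algorithm #1: if the logical block is all zeros,
- >      * record a hole instead of allocating media storage.
- >      */
- >     static int
- >     h2_block_is_zero(const uint8_t *data, size_t bytes)
- >     {
- >         for (size_t i = 0; i < bytes; ++i) {
- >             if (data[i] != 0)
- >                 return 0;       /* real data, store/compress it */
- >         }
- >         return 1;               /* all zeros, write a hole */
- >     }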
- >
- > * Copies support for redundancy. The media blockref structure would
- > have become too bloated but I found a clean way to do copies using the
- > blockset structure (which is a set of 8 fully associative blockref's).
- >
- > The design is such that the filesystem should be able to function at
- > full speed even if disks are pulled or inserted, as long as at least one
- > good copy is present. A background task will be needed to resynchronize
- > missing copies (or remove excessive copies in the case where the copies
- > value is reduced on a live filesystem).
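- >
- > Schematically (hypothetical names, reusing the blockref sketched in the
- > mirroring section above), a blockset is nothing more than a small fixed
- > array of blockrefs, any synchronized member of which can satisfy a read:
- >
- >     /*
- >      * A blockset: 8 fully associative blockrefs.  For copies, the
- >      * entries reference the same logical data on different devices.
- >      */
- >     #define H2_SET_COUNT 8
- >
- >     struct h2_blockset {
- >         struct h2_blockref blockref[H2_SET_COUNT];
- >     };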
- >
- > * Intended to be clusterable, with a multi-master protocol under design
- > but not expected to be fully operational until mid-2013. The media
- > format for HAMMER1 was less conducive to logical clustering than I had
- > hoped so I was never able to get that aspect of my personal goals
- > working with HAMMER1. HAMMER2 effectively solves the issues that cropped
- > up with HAMMER1 (mainly that HAMMER1's B-Tree did not reflect the logical
- > file/directory hierarchy, making cache coherency very difficult).
- >
- > * Hardlinks will be supported. All other standard features will be
- > supported too of course. Hardlinks in this sort of filesystem require
- > significant work.
- >
- > * The media blockref structure is now large enough to support up to a
- > 192-bit check value, which would typically be a cryptographic hash of some
- > sort. Multiple check value algorithms will be supported with the default
- > being a simple 32-bit iSCSI CRC.
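- >
- > As a sketch of the layout (names hypothetical): a fixed 24-byte check
- > area, interpreted according to a per-blockref check method:
- >
- >     #include <stdint.h>
- >
- >     /*
- >      * 192-bit (24-byte) check area.  Interpretation depends on the
- >      * configured check algorithm; the default only uses the first
- >      * 32 bits for an iSCSI CRC.
- >      */
- >     union h2_check {
- >         uint32_t crc32;          /* default: 32-bit iSCSI CRC */
- >         uint8_t  hash192[24];    /* up to a 192-bit crypto hash */
- >     };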
- >
- > * Fully verified deduplication will be supported and automatic (and
- > necessary in many respects).
- >
- > * Non-verified deduplication will be supported as a configurable option on
- > a file or subdirectory tree. Non-verified deduplication would use the
- > largest available check code (192 bits) and not bother to verify data
- > matches during the dedup pass, which is necessary on extremely large
- > filesystems with a great deal of deduplicable data (as otherwise a large
- > chunk of the media would have to be read to implement the dedup).
- >
- > This feature is intended only for those files where occasional
- > corruption is ok, such as in a large data store of farmed web content.
- >
- > GENERAL DESIGN
- >
- > HAMMER2 generally implements a copy-on-write block design for the
- > filesystem, which is very different from HAMMER1's B-Tree design. Because
- > the design is copy-on-write it can be trivially snapshotted simply by
- > referencing an existing block, and because the media structures logically
- > match a standard filesystem directory/file hierarchy, snapshots and other
- > similar operations can be trivially performed on an entire subdirectory
- > tree at any level in the filesystem.
- >
- > The copy-on-write nature of the filesystem implies that any modification
- > whatsoever will have to eventually synchronize new disk blocks all the way
- > to the super-root of the filesystem and the volume header itself. This
- > forms the basis for crash recovery. All disk writes are to new blocks
- > except for the volume header, thus allowing all writes to run concurrently
- > except for the volume header update at the end.
- >
- > Clearly this method requires intermediate modifications to the chain to be
- > cached so multiple modifications can be aggregated prior to being
- > synchronized. One advantage, however, is that the cache can be flushed at
- > any time WITHOUT having to allocate yet another new block when further
- > modifications are made as long as the volume header has not yet been
- > flushed. This means that buffer cache overhead is very well bounded and
- > can handle filesystem operations of any complexity even on boxes with very
- > small amounts of physical memory.
- >
- > I intend to implement a shortcut to make fsync()'s run fast, and that is to
- > allow deep updates to blockrefs to shortcut to auxiliary space in the
- > volume header to satisfy the fsync requirement. The related blockref is
- > then recorded when the filesystem is mounted after a crash and the update
- > chain is reconstituted when a matching blockref is encountered again during
- > normal operation of the filesystem.
- >
- > Basically this means that no real work needs to be done at mount-time
- > even after a crash.
- >
- > Directories are hashed, and another major design element is that directory
- > entries ARE INODES. They are one and the same. In addition to directory
- > entries being inodes the data for very small files (512 bytes or smaller)
- > can be directly embedded in the inode (overloaded onto the same space that
- > the direct blockref array uses). This should result in very high
- > performance.
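- >
- > Conceptually (hypothetical layout, reusing the blockset sketch above),
- > the tail of the inode is a union: either the blockref set for normal
- > files and directories, or the file data itself for tiny files:
- >
- >     #define H2_EMBEDDED_BYTES 512
- >
- >     struct h2_inode {
- >         uint64_t inum;          /* inode number lives in the inode */
- >         uint64_t size;
- >         /* ... ownership, times, flags ... */
- >         union {
- >             struct h2_blockset blockset;        /* normal case */
- >             uint8_t data[H2_EMBEDDED_BYTES];    /* files <= 512 bytes */
- >         } u;
- >     };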
- >
- > Inode numbers are not spatially referenced, which complicates NFS servers
- > but doesn't complicate anything else. The inode number is stored in the
- > inode itself, an absolutely necessary feature in order to support the
- > hugely flexible snapshots that we want to have in HAMMER2.
- >
- > HARDLINKS
- >
- > Hardlinks are a particularly sticky problem for HAMMER2 due to the lack of
- > a spatial reference to the inode number. We do not want to require
- > an index of inode numbers for any basic HAMMER2 feature if we can help it.
- >
- > Hardlinks are handled by placing the inode for a multiply-hardlinked file
- > in the closest common parent directory. If "a/x" and "a/y" are hardlinked
- > the inode for the hardlinked file will be placed in directory "a", e.g.
- > "a/3239944", but it will be invisible and will be in an out-of-band
- > namespace. The directory entries "a/x" and "a/y" will be given the same
- > inode number but in fact just be placemarks that cause HAMMER2 to recurse
- > upwards through the directory tree to find the invisible inode number.
- >
- > Because directories are hashed and a different namespace (hash key range)
- > is used for hardlinked inodes, standard directory scans are able to
- > trivially skip this invisible namespace and inode-specific lookups can
- > restrict their lookup to within this space.
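- >
- > In sketch form (the lookup helper and the in-memory parent pointer are
- > both hypothetical), resolving a placemark walks up the directory chain,
- > probing only the out-of-band hash range at each level:
- >
- >     /*
- >      * Hypothetical lookup restricted to the hidden hash-key range
- >      * that normal directory scans skip.
- >      */
- >     struct h2_inode *h2_lookup_hidden(struct h2_inode *dir,
- >                                       uint64_t inum);
- >
- >     static struct h2_inode *
- >     h2_resolve_hardlink(struct h2_inode *dir, uint64_t inum)
- >     {
- >         while (dir != NULL) {
- >             struct h2_inode *ip = h2_lookup_hidden(dir, inum);
- >             if (ip != NULL)
- >                 return ip;          /* found the real inode */
- >             dir = dir->parent;      /* keep walking upward */
- >         }
- >         return NULL;                /* placemark without an inode */
- >     }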
- >
- > The nature of snapshotting makes handling link-count 2->1 and 1->2 cases
- > trivial. Basically the inode media structure is copied as needed to
- > break-up or re-form the standard directory entry/inode. There are no
- > backpointers in HAMMER2 and no reference counts on the blocks (see FREEMAP
- > NOTES below), so it is an utterly trivial operation.
- >
- > FREEMAP NOTES
- >
- > In order to implement fast snapshots (and writable snapshots for that
- > matter), HAMMER2 does NOT ref-count allocations. The freemap, which
- > is still under design, just won't do that. All the freemap does is
- > keep track of 100% free blocks.
- >
- > This not only trivializes all the snapshot features, it also trivializes
- > hardlink handling and solves the problem of keeping the freemap synchronized
- > in the event of a crash. Now all we have to do after a crash is make
- > sure blocks allocated since the freemap was last flushed are properly
- > marked as allocated in the allocmap. This is a trivial exercise using the
- > same algorithm the mirror streaming code uses (which is very similar to
- > HAMMER1)... an incremental meta-data scan that covers only the blocks that
- > might have been allocated between the last allocation map sync and now.
- >
- > Thus the freemap does not have to be synchronized during a fsync().
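- >
- > A sketch of that post-crash fixup, assuming the same hypothetical
- > blockref sketched in the mirroring section: walk only sub-trees whose
- > mirror_tid postdates the last freemap sync and mark newly written
- > blocks as allocated:
- >
- >     /* Hypothetical helper: mark the referenced block allocated. */
- >     void allocmap_mark_allocated(const struct h2_blockref *bref);
- >
- >     static void
- >     freemap_recover(const struct h2_blockref *bref, uint64_t last_sync)
- >     {
- >         if (bref->mirror_tid <= last_sync)
- >             return;                 /* sub-tree predates last sync */
- >         if (bref->modify_tid > last_sync)
- >             allocmap_mark_allocated(bref);  /* allocated after sync */
- >         for (size_t i = 0; i < bref->nchildren; ++i)
- >             freemap_recover(&bref->children[i], last_sync);
- >     }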
- >
- > The complexity is in figuring out what can be freed... that is, when one
- > can mark blocks in the freemap as being free. HAMMER2 implements this as
- > a background task which essentially must scan available meta-data to
- > determine which blocks are not being referenced.
- >
- > Part of the ongoing design work is finding ways to reduce the scope of this
- > meta-data scan so the entire filesystem's meta-data does not need to be
- > scanned (though in tests with HAMMER1, even full meta-data scans have
- > turned out to be fairly low cost). In other words, it's an area that we
- > can continue to improve on as the filesystem matures. Not only that, but
- > we can completely change the freemap algorithms without creating
- > incompatibilities (at worst simply having to require that an R+W mount do
- > a full meta-data scan when upgrading or downgrading the freemap algorithm).
- >
- > CLUSTERING
- >
- > Clustering, as always, is the most difficult bit but we have some
- > advantages with HAMMER2 that we did not have with HAMMER1. First,
- > HAMMER2's media structures generally follow the kernel's filesystem
- > hierarchy. Second, HAMMER2's writable snapshots make it possible to
- > implement several forms of multi-master clustering.
- >
- > The general mechanics for most of the multi-master clustering
- > implementations will be as follows:
- >
- > (a) Use the copies mechanism to specify all elements of the cluster,
- > both local and remote (networked).
- >
- > (b) The core synchronization state operates just as it does for copies,
- > simply requiring a fully-flushed ack from the remote in order to
- > mark the blocks as having been fully synchronized.
- >
- > The mirror_tid may be used to locate these blocks, allowing the
- > synchronization state to be updated on the fly at a much later
- > time without requiring the state to be maintained in-memory.
- > (also for crash recovery resynchronization purposes).
- >
- > (c) Data/meta-data can be retrieved from those copies which are marked
- > as being synchronized, with priority given to the local storage
- > relative to any given physical machine.
- >
- > This means that e.g. even in a master-slave orientation the slave
- > may be able to satisfy a request from a program when the slave
- > happens to be the local storage.
- >
- > (d) Transaction id synchronization between all elements of the cluster,
- > typically through masking (assigning a cluster number using the low
- > 3 bits of the transaction id); see the sketch after this list.
- >
- > (e) General access (synchronized or otherwise) may require cache
- > coherency mechanisms to run over the network.
- >
- > Implementing cache coherency is a major complexity issue.
- >
- > (f) General access (synchronized or otherwise) may require quorum
- > agreement, using the synchronization flags in the blockrefs
- > to determine whether agreement has been reached.
- >
- > Implementing quorum voting is a major complexity issue.
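- >
- > As a sketch of the masking mentioned in (d), assuming the low 3 bits
- > carry a node number 0-7 (names hypothetical):
- >
- >     #include <stdint.h>
- >
- >     #define H2_TID_NODE_BITS 3
- >     #define H2_TID_NODE_MASK ((1ULL << H2_TID_NODE_BITS) - 1)
- >
- >     /*
- >      * Each node advances the counter in steps of 8 and stamps its
- >      * own id into the low bits, so concurrently allocated ids from
- >      * different masters never collide and the originator of any
- >      * transaction id is recoverable.
- >      */
- >     static uint64_t
- >     h2_next_tid(uint64_t *counter, uint64_t node_id)
- >     {
- >         *counter += (1ULL << H2_TID_NODE_BITS);
- >         return (*counter & ~H2_TID_NODE_MASK) | node_id;
- >     }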
- >
- > There are lots of ways to implement multi-master environments using the
- > above core features but the implementation is going to be fairly complex
- > even with HAMMER2's feature set.
- >
- > Keep in mind that modifications propagate all the way to the super-root
- > and volume header, so in any clustered arrangement the use of (modify_tid)
- > and (mirror_tid) is critical in determining the synchronization state of
- > portion(s) of the filesystem.
- >
- > Specifically, since any modification propagates to the root the
- > (mirror_tid) in higher level directories is going to be in a constant
- > state of flux. This state of flux DOES NOT invalidate the cache state for
- > these higher levels of directories. Instead, the (modify_tid) is used on
- > a node-by-node basis to determine cache state at any given level, and
- > (mirror_tid) is used to determine whether any recursively underlying state
- > is desynchronized.
- >
- > * Simple semi-synchronized multi-master environment.
- >
- > In this environment all nodes are considered masters and modifications
- > can be made on any of them, and then propagate to the others
- > asynchronously via HAMMER2 mirror streams. One difference here is
- > that the kernel can activate these userland-managed streams automatically
- > when the copies configuration is used to specify the cluster.
- >
- > The only type of conflict which isn't readily resolvable by comparing
- > the (modify_tid) is when file data is updated. In this case user
- > intervention might be required but, theoretically, it should be
- > possible to automate most merges using a multi-way patch and, if not,
- > choosing one and creating backup copies of the others to allow the
- > user or sysop to resolve the conflict later.
- >
- > * Simple fully synchronized fail-over environment.
- >
- > In this environment there is one designated master and the remaining
- > nodes are slaves. If the master fails all remaining nodes agree on a
- > new master, possibly with the requirement that a quorum be achieved
- > (if you don't want to allow the cluster to split).
- >
- > If network splits are allowed, each sub-cluster operates in this
- > mode but recombining the clusters reverts to the first algorithm.
- > If splits are not allowed, whichever side no longer has a quorum will
- > be forced to stall.
- >
- > In this environment the current designated master is responsible for
- > managing locks for modifying operations. The designated master will
- > proactively tell the other nodes to mark the blocks related to the
- > modifying operation as no longer being synchronized while any local
- > data at the node that acquired the lock (master or slave) remains
- > marked as being synchronized.
- >
- > The node that successfully gets the lock then issues the modifying
- > operation to both its local copy and to the master, marking the
- > master as being desynchronized until the master acknowledges receipt.
- >
- > In this environment any node can access data from local storage if
- > the designated master copy is marked synchronized AND its (modify_tid)
- > matches the slave copy's (modify_tid).
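- >
- > Reduced to a sketch (the flag and the flags field are hypothetical,
- > layered on the earlier blockref sketch), the local-read test is just:
- >
- >     #define H2_BREF_SYNCED 0x01     /* hypothetical sync flag */
- >
- >     static int
- >     h2_can_read_local(const struct h2_blockref *master_copy,
- >                       const struct h2_blockref *local_copy)
- >     {
- >         if ((master_copy->flags & H2_BREF_SYNCED) == 0)
- >             return 0;       /* master copy not synchronized */
- >         return (master_copy->modify_tid == local_copy->modify_tid);
- >     }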
- >
- > However, if a slave disconnects from the master then reconnects the
- > slave will have lost the master's desynchronization stream and must
- > mark its root blockref for the master copy with HAMMER2_BREF_DESYNCCHLD
- > as well as clear the SYNC1/SYNC2 bits. Setting DESYNCCHLD forces
- > on-demand recursive reverification that the master and slave are (or are
- > not) in sync in order to reestablish on the slave the synchronization
- > state of the master.
- >
- > That might be a bit confusing but the whole point here is to allow
- > read accesses to the filesystem to be satisfied by any node in a
- > multi-master cluster, not just by the current designated master.
- >
- > * Fully cache coherent and synchronized multi-master environment.
- >
- > In this environment a quorum is required to perform any modifying
- > action. All nodes are masters (there is no 'designated' master)
- > and all nodes connect to all other nodes in a cross-bar.
- >
- > The quorum is specified by copies setup in the root volume
- > configuration. A quorum of nodes in the cluster must agree on the copies
- > configuration. If they do not, the cluster cannot proceed to mount. Any
- > other nodes in the cluster which are not in the quorum and disagree with
- > the configuration will inherit the copies configuration from the quorum.
- >
- > Any modifying action will initiate a lock request locally to all nodes
- > in the cluster. The modifying action is allowed to proceed the instant
- > a quorum of nodes respond in the affirmative (even if some have not
- > yet responded or are down). The modifying action is considered
- > complete once the two-phase commit protocol succeeds. The modifying
- > action typically creates and commits a temporary snapshot on at least a
- > quorum of masters as phase-1 and then ties the snapshot back into the main
- > mount as phase-2.
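- >
- > The quorum test itself is trivial; as a sketch (name hypothetical), a
- > lock request may proceed as soon as:
- >
- >     /*
- >      * True once a strict majority of the cluster has acked the lock
- >      * request, even if the remaining nodes are slow or down.
- >      */
- >     static int
- >     h2_quorum_reached(int acks, int total_nodes)
- >     {
- >         return (acks >= total_nodes / 2 + 1);
- >     }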
- >
- > These locks are cache-coherency locks and may be passively maintained
- > in order to aggregate multiple operations under the same lock and thus
- > under the same transaction from the point of view of the rest of the
- > quorum.
- >
- > A lock request which interferes with a passively maintained lock will
- > force the two-phase commit protocol to complete and then transfer
- > ownership to the requesting entity, thus avoiding having to deal with
- > deadlock protocols at this point in the state machine.
- >
- > Since any node can initiate concurrent lock requests to many other
- > nodes it is possible to deadlock. When two nodes initiate conflicting
- > lock requests to the cluster the one achieving the quorum basically wins
- > and the other is forced to retry (go back one paragraph). In this
- > situation no deadlock will occur.
- >
- > If three or more nodes initiate conflicting lock requests to the
- > cluster a deadlock can occur whereby none of the nodes achieve a
- > quorum. In this case every node will know which of the other nodes was
- > granted the lock(s). Deadlock resolution then proceeds simultaneously on
- > those nodes (since they have the same information), whereby the lock
- > holders on the losing end of the algorithm transfer their locks to one of
- > the other nodes. The lock state and knowledge of the lock state is
- > updated in real time on all nodes until a quorum is achieved.
- >
- > * Fully cache coherent and synchronized multi-master environment with
- > passive read locking.
- >
- > This is a more complex form of clustering than the previous form.
- > Take the previous form and add the ability to passively hold SHARED
- > locks in addition to the EXCLUSIVE locks the previous form is able
- > to hold.
- >
- > The advantage of being able to passively hold a shared lock on a
- > sub-tree (locks can be held on single nodes or entire sub-trees) is that
- > it is then possible for all nodes to validate a node (modify_tid) or
- > entire sub-tree (mirror_tid) with a very short network transaction and
- > then satisfy a large number of requests from local storage.
- >
- > * Fully cache coherent and synchronized multi-master environment with
- > passive read locking and slave-only nodes.
- >
- > This is the MOST complex form of clustering we intend to support.
- > In a multi-master environment requiring a quorum of masters to operate
- > we implement all of the above plus ALSO allow additional nodes to be
- > added to the cluster as slave-only nodes.
- >
- > The difference between a slave-only node and setting up a manual
- > mirror-stream from the cluster to a read-only snapshot on another
- > HAMMER2 filesystem is that the slave-only node will be fully
- > cache coherent with either the cluster proper (if connected to a quorum
- > of masters), or to one or more other nodes in the cluster (if not
- > connected to a quorum of masters), EVEN if the slave itself is not
- > completely caught up.
- >
- > So if the slave-only cluster node is connected to the rest of the
- > cluster over a slow connection you basically get a combination of local
- > disk speeds for any data that is locally in sync and network-limited
- > speeds for any data that is not locally in sync.
- >
- > Slave-only cluster nodes run a standard mirror-stream in the background
- > to pull in the data as quickly as possible.
- >
- > This is in contrast to a manual mirror-stream to a read-only
- > snapshot (basically a simple slave), which has no ability to bypass
- > the local storage to handle out-of-date requests (in fact has no
- > ability to detect that the local storage is out-of-date anyway).
- >
- > -Matt