- On Thu, 9 Feb 2012 01:20:04 PM Matthew Dillon wrote:
- > This is the current design document for HAMMER2. It lists every
- > feature I intend to implement for HAMMER2. Everything except the freemap
- > and cluster protocols (which are both big ticket items) has been
- > completely speced out.
- >
- > There are many additional features versus the original document,
- > including hardlinks.
- >
- > HAMMER2 is all I am working on this year so I expect to make good
- > progress, but it will probably still be July before we have anything
- > usable, and well into 2013 before the whole mess is implemented and
- > even later before the clustering is 100% stable.
- >
- > However, I expect to be able to stabilize all non-cluster related
- > features in fairly short order. Even though HAMMER2 has a lot more
- > features than HAMMER1, the actual design is simpler than HAMMER1, with
- > virtually no edge cases to worry about (I spent 12+ months working edge
- > cases out in HAMMER1's B-Tree, for example... that won't be an issue for
- > HAMMER2 development).
- >
- > The work is being done in the 'hammer2' branch off the main dragonfly
- > repo in appropriate subdirs. Right now it is just vsrinivas and me, but
- > hopefully enough will get fleshed out in a few months that other people
- > can help too.
- >
- > Ok, here's what I have got.
- >
- >
- > HAMMER2 DESIGN DOCUMENT
- >
- > Matthew Dillon
- > 08-Feb-2012
- >
- > * These features have been speced in the media structures.
- >
- > * Implementation work has begun.
- >
- > * A working filesystem with some features implemented is expected by July
- > 2012.
- >
- > * A fully functional filesystem with most (but not all) features is
- > expected by the end of 2012.
- >
- > * All elements of the filesystem have been designed except for the freemap
- > (which isn't needed for initial work). 8MB per 2GB of filesystem
- > storage has been reserved for the freemap. The design of the freemap
- > is expected to be completely speced by mid-year.
- >
- > * This is my only project this year. I'm not going to be doing any major
- > kernel bug hunting this year.
- >
- > Feature List
- >
- > * Multiple roots (allowing snapshots to be mounted). This is implemented
- > via the super-root concept. When mounting a HAMMER2 filesystem you
- > specify a device path and a directory name in the super-root.
- >
- > * HAMMER1 had PFS's. HAMMER2 does not. Instead, in HAMMER2 any directory
- > in the tree can be configured as a PFS, causing all elements recursively
- > underneath that directory to become a part of that PFS.
- >
- > * Writable snapshots. Any subdirectory tree can be snapshotted. Snapshots
- > show up in the super-root. It is possible to snapshot a subdirectory
- > and then later snapshot a parent of that subdirectory... really there are
- > no limitations here.
- >
- > * Directory sub-hierarchy based quotas and space and inode usage tracking.
- > Any directory sub-tree, whether at a mount point or not, tracks aggregate
- > inode use and data space use. This is stored in the directory inode all
- > the way up the chain.
- >
- > * Incremental queueless mirroring / mirroring-streams. Because HAMMER2 is
- > block-oriented and copy-on-write, each blockref tracks both direct
- > modifications to the referenced data via (modify_tid) and indirect
- > modifications to the referenced data or any sub-tree via (mirror_tid).
- > This makes it possible to do an incremental scan of meta-data that covers
- > only changes made since the mirror_tid recorded in a prior run.
- >
- > This feature is also intended to be used to locate recently allocated
- > blocks and thus be able to fixup the freemap after a crash.
- >
- > HAMMER2 mirroring works a bit differently than HAMMER1 mirroring in
- > that HAMMER2 does not keep track of 'deleted' records. Instead any
- > recursion by the mirroring code which finds that (modify_tid) has
- > been updated must also send the direct block table or indirect block
- > table state it winds up recursing through so the target can check
- > similar key ranges and locate elements to be deleted. This can be
- > avoided if the mirroring stream is mostly caught up, in that very recent
- > deletions will be cached in memory and can be queried, allowing shorter
- > record deletions to be passed in the stream instead.
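- >
- > As a rough sketch of that incremental scan (the structure and function
- > names here are hypothetical, not the actual HAMMER2 media layout):
- >
- >     #include <stdint.h>
- >     #include <stddef.h>
- >
- >     /*
- >      * Simplified, hypothetical blockref.  modify_tid covers direct
- >      * changes to the referenced data; mirror_tid covers any change
- >      * anywhere in the sub-tree below it.
- >      */
- >     struct h2_blockref {
- >         uint64_t modify_tid;
- >         uint64_t mirror_tid;
- >         struct h2_blockref *children;   /* in-memory stand-in for */
- >         size_t nchildren;               /* the block table */
- >     };
- >
- >     /* Stub: send one record to the mirroring target. */
- >     static void emit_record(const struct h2_blockref *bref) { (void)bref; }
- >
- >     /*
- >      * Incremental scan: prune any sub-tree whose mirror_tid shows
- >      * no change since the prior run's transaction id.
- >      */
- >     static void
- >     mirror_scan(const struct h2_blockref *bref, uint64_t last_tid)
- >     {
- >         if (bref->mirror_tid <= last_tid)
- >             return;                 /* nothing changed below, prune */
- >         if (bref->modify_tid > last_tid)
- >             emit_record(bref);      /* directly modified element */
- >         for (size_t i = 0; i < bref->nchildren; ++i)
- >             mirror_scan(&bref->children[i], last_tid);
- >     }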
- >
- > * Will support multiple compression algorithms configured on a subdirectory
- > tree basis and on a file basis. Up to 64K block compression will be
- > used. Only compression ratios near powers of 2 that are at least 2:1 (e.g.
- > 2:1, 4:1, 8:1, etc) will work in this scheme because physical block
- > allocations in HAMMER2 are always power-of-2.
- >
- > Compression algorithm #0 will mean no compression and no zero-checking.
- > Compression algorithm #1 will mean zero-checking but no other
- > compression. Real compression will be supported starting with algorithm #2.
- >
- > * Zero detection on write (writing all-zeros), which requires the data
- > buffer to be scanned, will be supported as compression algorithm #1.
- > This allows the writing of 0's to create holes and will be the default
- > compression algorithm for HAMMER2.
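- >
- > A minimal sketch of that zero-check (function name hypothetical; a real
- > implementation would likely compare machine words, not bytes):
- >
- >     #include <stdint.h>
- >     #include <stddef.h>
- >
- >     /*
- >      * Compression algorithm #1: if the logical block is all zeros,
- >      * record a hole instead of allocating media storage.
- >      */
- >     static int
- >     h2_block_is_zero(const uint8_t *data, size_t bytes)
- >     {
- >         for (size_t i = 0; i < bytes; ++i) {
- >             if (data[i] != 0)
- >                 return 0;       /* real data, store/compress it */
- >         }
- >         return 1;               /* all zeros, write a hole */
- >     }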
- >
- > * Copies support for redundancy. The media blockref structure would
- > have become too bloated but I found a clean way to do copies using the
- > blockset structure (which is a set of 8 fully associative blockref's).
- >
- > The design is such that the filesystem should be able to function at
- > full speed even if disks are pulled or inserted, as long as at least one
- > good copy is present. A background task will be needed to resynchronize
- > missing copies (or remove excessive copies in the case where the copies
- > value is reduced on a live filesystem).
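- >
- > Schematically (hypothetical names, reusing the blockref sketched in the
- > mirroring section above), a blockset is nothing more than a small fixed
- > array of blockrefs, any synchronized member of which can satisfy a read:
- >
- >     /*
- >      * A blockset: 8 fully associative blockrefs.  For copies, the
- >      * entries reference the same logical data on different devices.
- >      */
- >     #define H2_SET_COUNT 8
- >
- >     struct h2_blockset {
- >         struct h2_blockref blockref[H2_SET_COUNT];
- >     };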
- >
- > * Intended to be clusterable, with a multi-master protocol under design
- > but not expected to be fully operational until mid-2013. The media
- > format for HAMMER1 was less conducive to logical clustering than I had
- > hoped so I was never able to get that aspect of my personal goals
- > working with HAMMER1. HAMMER2 effectively solves the issues that cropped
- > up with HAMMER1 (mainly that HAMMER1's B-Tree did not reflect the logical
- > file/directory hierarchy, making cache coherency very difficult).
- >
- > * Hardlinks will be supported. All other standard features will be
- > supported too of course. Hardlinks in this sort of filesystem require
- > significant work.
- >
- > * The media blockref structure is now large enough to support up to a
- > 192-bit check value, which would typically be a cryptographic hash of some
- > sort. Multiple check value algorithms will be supported with the default
- > being a simple 32-bit iSCSI CRC.
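- >
- > As a sketch of the layout (names hypothetical): a fixed 24-byte check
- > area, interpreted according to a per-blockref check method:
- >
- >     #include <stdint.h>
- >
- >     /*
- >      * 192-bit (24-byte) check area.  Interpretation depends on the
- >      * configured check algorithm; the default only uses the first
- >      * 32 bits for an iSCSI CRC.
- >      */
- >     union h2_check {
- >         uint32_t crc32;          /* default: 32-bit iSCSI CRC */
- >         uint8_t  hash192[24];    /* up to a 192-bit crypto hash */
- >     };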
- >
- > * Fully verified deduplication will be supported and automatic (and
- > necessary in many respects).
- >
- > * Non-verified deduplication will be supported as a configurable option on
- > a file or subdirectory tree. Non-verified deduplication would use the
- > largest available check code (192 bits) and not bother to verify data
- > matches during the dedup pass, which is necessary on extremely large
- > filesystems with a great deal of deduplicable data (as otherwise a large
- > chunk of the media would have to be read to implement the dedup).
- >
- > This feature is intended only for those files where occasional
- > corruption is ok, such as in a large data store of farmed web content.
- >
- > GENERAL DESIGN
- >
- > HAMMER2 generally implements a copy-on-write block design for the
- > filesystem, which is very different from HAMMER1's B-Tree design. Because
- > the design is copy-on-write it can be trivially snapshotted simply by
- > referencing an existing block, and because the media structures logically
- > match a standard filesystem directory/file hierarchy, snapshots and other
- > similar operations can be trivially performed on an entire subdirectory
- > tree at any level in the filesystem.
- >
- > The copy-on-write nature of the filesystem implies that any modification
- > whatsoever will have to eventually synchronize new disk blocks all the way
- > to the super-root of the filesystem and the volume header itself. This
- > forms the basis for crash recovery. All disk writes are to new blocks
- > except for the volume header, thus allowing all writes to run concurrently
- > except for the volume header update at the end.
- >
- > Clearly this method requires intermediate modifications to the chain to be
- > cached so multiple modifications can be aggregated prior to being
- > synchronized. One advantage, however, is that the cache can be flushed at
- > any time WITHOUT having to allocate yet another new block when further
- > modifications are made as long as the volume header has not yet been
- > flushed. This means that buffer cache overhead is very well bounded and
- > can handle filesystem operations of any complexity even on boxes with very
- > small amounts of physical memory.
- >
- > I intend to implement a shortcut to make fsync()'s run fast, and that is to
- > allow deep updates to blockrefs to shortcut to auxiliary space in the
- > volume header to satisfy the fsync requirement. The related blockref is
- > then recorded when the filesystem is mounted after a crash and the update
- > chain is reconstituted when a matching blockref is encountered again during
- > normal operation of the filesystem.
- >
- > Basically this means that no real work needs to be done at mount-time
- > even after a crash.
- >
- > Directories are hashed, and another major design element is that directory
- > entries ARE INODES. They are one and the same. In addition to directory
- > entries being inodes the data for very small files (512 bytes or smaller)
- > can be directly embedded in the inode (overloaded onto the same space that
- > the direct blockref array uses). This should result in very high
- > performance.
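- >
- > Conceptually (hypothetical layout, reusing the blockset sketch above),
- > the tail of the inode is a union: either the blockref set for normal
- > files and directories, or the file data itself for tiny files:
- >
- >     #define H2_EMBEDDED_BYTES 512
- >
- >     struct h2_inode {
- >         uint64_t inum;          /* inode number lives in the inode */
- >         uint64_t size;
- >         /* ... ownership, times, flags ... */
- >         union {
- >             struct h2_blockset blockset;        /* normal case */
- >             uint8_t data[H2_EMBEDDED_BYTES];    /* files <= 512 bytes */
- >         } u;
- >     };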
- >
- > Inode numbers are not spatially referenced, which complicates NFS servers
- > but doesn't complicate anything else. The inode number is stored in the
- > inode itself, an absolutely necessary feature in order to support the
- > hugely flexible snapshots that we want to have in HAMMER2.
- >
- > HARDLINKS
- >
- > Hardlinks are a particularly sticky problem for HAMMER2 due to the lack of
- > a spatial reference to the inode number. We do not want to require
- > an index of inode numbers for any basic HAMMER2 feature if we can help it.
- >
- > Hardlinks are handled by placing the inode for a multiply-hardlinked file
- > in the closest common parent directory. If "a/x" and "a/y" are hardlinked
- > the inode for the hardlinked file will be placed in directory "a", e.g.
- > "a/3239944", but it will be invisible and will be in an out-of-band
- > namespace. The directory entries "a/x" and "a/y" will be given the same
- > inode number but in fact just be placemarks that cause HAMMER2 to recurse
- > upwards through the directory tree to find the invisible inode number.
- >
- > Because directories are hashed and a different namespace (hash key range)
- > is used for hardlinked inodes, standard directory scans are able to
- > trivially skip this invisible namespace and inode-specific lookups can
- > restrict their lookup to within this space.
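- >
- > In sketch form (the lookup helper and the in-memory parent pointer are
- > both hypothetical), resolving a placemark walks up the directory chain,
- > probing only the out-of-band hash range at each level:
- >
- >     /*
- >      * Hypothetical lookup restricted to the hidden hash-key range
- >      * that normal directory scans skip.
- >      */
- >     struct h2_inode *h2_lookup_hidden(struct h2_inode *dir,
- >                                       uint64_t inum);
- >
- >     static struct h2_inode *
- >     h2_resolve_hardlink(struct h2_inode *dir, uint64_t inum)
- >     {
- >         while (dir != NULL) {
- >             struct h2_inode *ip = h2_lookup_hidden(dir, inum);
- >             if (ip != NULL)
- >                 return ip;          /* found the real inode */
- >             dir = dir->parent;      /* keep walking upward */
- >         }
- >         return NULL;                /* placemark without an inode */
- >     }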
- >
- > The nature of snapshotting makes handling link-count 2->1 and 1->2 cases
- > trivial. Basically the inode media structure is copied as needed to
- > break-up or re-form the standard directory entry/inode. There are no
- > backpointers in HAMMER2 and no reference counts on the blocks (see FREEMAP
- > NOTES below), so it is an utterly trivial operation.
- >
- > FREEMAP NOTES
- >
- > In order to implement fast snapshots (and writable snapshots for that
- > matter), HAMMER2 does NOT ref-count allocations. The freemap, which
- > is still under design, just won't do that. All the freemap does is
- > keep track of 100% free blocks.
- >
- > This not only trivializes all the snapshot features, it also trivializes
- > hardlink handling and solves the problem of keeping the freemap synchronized
- > in the event of a crash. Now all we have to do after a crash is make
- > sure blocks allocated since the freemap was last flushed are properly
- > marked as allocated in the allocmap. This is a trivial exercise using the
- > same algorithm the mirror streaming code uses (which is very similar to
- > HAMMER1)... an incremental meta-data scan that covers only the blocks that
- > might have been allocated between the last allocation map sync and now.
- >
- > Thus the freemap does not have to be synchronized during a fsync().
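- >
- > A sketch of that post-crash fixup, assuming the same hypothetical
- > blockref sketched in the mirroring section: walk only sub-trees whose
- > mirror_tid postdates the last freemap sync and mark newly written
- > blocks as allocated:
- >
- >     /* Hypothetical helper: mark the referenced block allocated. */
- >     void allocmap_mark_allocated(const struct h2_blockref *bref);
- >
- >     static void
- >     freemap_recover(const struct h2_blockref *bref, uint64_t last_sync)
- >     {
- >         if (bref->mirror_tid <= last_sync)
- >             return;                 /* sub-tree predates last sync */
- >         if (bref->modify_tid > last_sync)
- >             allocmap_mark_allocated(bref);  /* allocated after sync */
- >         for (size_t i = 0; i < bref->nchildren; ++i)
- >             freemap_recover(&bref->children[i], last_sync);
- >     }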
- >
- > The complexity is in figuring out what can be freed... that is, when one
- > can mark blocks in the freemap as being free. HAMMER2 implements this as
- > a background task which essentially must scan available meta-data to
- > determine which blocks are not being referenced.
- >
- > Part of the ongoing design work is finding ways to reduce the scope of this
- > meta-data scan so the entire filesystem's meta-data does not need to be
- > scanned (though in tests with HAMMER1, even full meta-data scans have
- > turned out to be fairly low cost). In other words, it's an area that we
- > can continue to improve on as the filesystem matures. Not only that, but
- > we can completely change the freemap algorithms without creating
- > incompatibilities (at worst simply having to require that an R+W mount do
- > a full meta-data scan when upgrading or downgrading the freemap algorithm).
- >
- > CLUSTERING
- >
- > Clustering, as always, is the most difficult bit but we have some
- > advantages with HAMMER2 that we did not have with HAMMER1. First,
- > HAMMER2's media structures generally follow the kernel's filesystem
- > hierarchy. Second, HAMMER2's writable snapshots make it possible to
- > implement several forms of multi-master clustering.
- >
- > The general mechanics for most of the multi-master clustering
- > implementations will be as follows:
- >
- > (a) Use the copies mechanism to specify all elements of the cluster,
- > both local and remote (networked).
- >
- > (b) The core synchronization state operates just as it does for copies,
- > simply requiring a fully-flushed ack from the remote in order to
- > mark the blocks as having been fully synchronized.
- >
- > The mirror_tid may be used to locate these blocks, allowing the
- > synchronization state to be updated on the fly at a much later
- > time without requiring the state to be maintained in-memory.
- > (also for crash recovery resynchronization purposes).
- >
- > (c) Data/meta-data can be retrieved from those copies which are marked
- > as being synchronized, with priority given to the local storage
- > relative to any given physical machine.
- >
- > This means that e.g. even in a master-slave orientation the slave
- > may be able to satisfy a request from a program when the slave
- > happens to be the local storage.
- >
- > (d) Transaction id synchronization between all elements of the cluster,
- > typically through masking (assigning a cluster number using the low
- > 3 bits of the transaction id); see the sketch after this list.
- >
- > (e) General access (synchronized or otherwise) may require cache
- > coherency mechanisms to run over the network.
- >
- > Implementing cache coherency is a major complexity issue.
- >
- > (f) General access (synchronized or otherwise) may require quorum
- > agreement, using the synchronization flags in the blockrefs
- > to determine whether agreement has been reached.
- >
- > Implementing quorum voting is a major complexity issue.
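- >
- > As a sketch of the masking mentioned in (d), assuming the low 3 bits
- > carry a node number 0-7 (names hypothetical):
- >
- >     #include <stdint.h>
- >
- >     #define H2_TID_NODE_BITS 3
- >     #define H2_TID_NODE_MASK ((1ULL << H2_TID_NODE_BITS) - 1)
- >
- >     /*
- >      * Each node advances the counter in steps of 8 and stamps its
- >      * own id into the low bits, so concurrently allocated ids from
- >      * different masters never collide and the originator of any
- >      * transaction id is recoverable.
- >      */
- >     static uint64_t
- >     h2_next_tid(uint64_t *counter, uint64_t node_id)
- >     {
- >         *counter += (1ULL << H2_TID_NODE_BITS);
- >         return (*counter & ~H2_TID_NODE_MASK) | node_id;
- >     }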
- >
- > There are lots of ways to implement multi-master environments using the
- > above core features but the implementation is going to be fairly complex
- > even with HAMMER2's feature set.
- >
- > Keep in mind that modifications propagate all the way to the super-root
- > and volume header, so in any clustered arrangement the use of (modify_tid)
- > and (mirror_tid) is critical in determining the synchronization state of
- > portion(s) of the filesystem.
- >
- > Specifically, since any modification propagates to the root the
- > (mirror_tid) in higher level directories is going to be in a constant
- > state of flux. This state of flux DOES NOT invalidate the cache state for
- > these higher levels of directories. Instead, the (modify_tid) is used on
- > a node-by-node basis to determine cache state at any given level, and
- > (mirror_tid) is used to determine whether any recursively underlying state
- > is desynchronized.
- >
- > * Simple semi-synchronized multi-master environment.
- >
- > In this environment all nodes are considered masters and modifications
- > can be made on any of them, and then propagate to the others
- > asynchronously via HAMMER2 mirror streams. One difference here is
- > that the kernel can activate these userland-managed streams automatically
- > when the copies configuration is used to specify the cluster.
- >
- > The only type of conflict which isn't readily resolvable by comparing
- > the (modify_tid) is when file data is updated. In this case user
- > intervention might be required but, theoretically, it should be
- > possible to automate most merges using a multi-way patch and, if not,
- > choosing one and creating backup copies of the others to allow the
- > user or sysop to resolve the conflict later.
- >
- > * Simple fully synchronized fail-over environment.
- >
- > In this environment there is one designated master and the remaining
- > nodes are slaves. If the master fails all remaining nodes agree on a
- > new master, possibly with the requirement that a quorum be achieved
- > (if you don't want to allow the cluster to split).
- >
- > If network splits are allowed, each sub-cluster operates in this
- > mode but recombining the clusters reverts to the first algorithm.
- > If splits are not allowed, whichever side no longer has a quorum will
- > be forced to stall.
- >
- > In this environment the current designated master is responsible for
- > managing locks for modifying operations. The designated master will
- > proactively tell the other nodes to mark the blocks related to the
- > modifying operation as no longer being synchronized while any local
- > data at the node that acquired the lock (master or slave) remains
- > marked as being synchronized.
- >
- > The node that successfully gets the lock then issues the modifying
- > operation to both its local copy and to the master, marking the
- > master as being desynchronized until the master acknowledges receipt.
- >
- > In this environment any node can access data from local storage if
- > the designated master copy is marked synchronized AND its (modify_tid)
- > matches the slave copy's (modify_tid).
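- >
- > Reduced to a sketch (the flag and the flags field are hypothetical,
- > layered on the earlier blockref sketch), the local-read test is just:
- >
- >     #define H2_BREF_SYNCED 0x01     /* hypothetical sync flag */
- >
- >     static int
- >     h2_can_read_local(const struct h2_blockref *master_copy,
- >                       const struct h2_blockref *local_copy)
- >     {
- >         if ((master_copy->flags & H2_BREF_SYNCED) == 0)
- >             return 0;       /* master copy not synchronized */
- >         return (master_copy->modify_tid == local_copy->modify_tid);
- >     }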
- >
- > However, if a slave disconnects from the master then reconnects the
- > slave will have lost the master's desynchronization stream and must
- > mark its root blockref for the master copy with HAMMER2_BREF_DESYNCCHLD
- > as well as clear the SYNC1/SYNC2 bits. Setting DESYNCCHLD forces
- > on-demand recursive reverification that the master and slave are (or are
- > not) in sync in order to reestablish on the slave the synchronization
- > state of the master.
- >
- > That might be a bit confusing but the whole point here is to allow
- > read accesses to the filesystem to be satisfied by any node in a
- > multi-master cluster, not just by the current designated master.
- >
- > * Fully cache coherent and synchronized multi-master environment.
- >
- > In this environment a quorum is required to perform any modifying
- > action. All nodes are masters (there is no 'designated' master)
- > and all nodes connect to all other nodes in a cross-bar.
- >
- > The quorum is specified by copies setup in the root volume
- > configuration. A quorum of nodes in the cluster must agree on the copies
- > configuration. If they do not, the cluster cannot proceed to mount. Any
- > other nodes in the cluster which are not in the quorum and disagree with
- > the configuration will inherit the copies configuration from the quorum.
- >
- > Any modifying action will initiate a lock request locally to all nodes
- > in the cluster. The modifying action is allowed to proceed the instant
- > a quorum of nodes respond in the affirmative (even if some have not
- > yet responded or are down). The modifying action is considered
- > complete once the two-phase commit protocol succeeds. The modifying
- > action typically creates and commits a temporary snapshot on at least a
- > quorum of masters as phase-1 and then ties the snapshot back into the main
- > mount as phase-2.
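- >
- > The quorum test itself is trivial; as a sketch (name hypothetical), a
- > lock request may proceed as soon as:
- >
- >     /*
- >      * True once a strict majority of the cluster has acked the lock
- >      * request, even if the remaining nodes are slow or down.
- >      */
- >     static int
- >     h2_quorum_reached(int acks, int total_nodes)
- >     {
- >         return (acks >= total_nodes / 2 + 1);
- >     }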
- >
- > These locks are cache-coherency locks and may be passively maintained
- > in order to aggregate multiple operations under the same lock and thus
- > under the same transaction from the point of view of the rest of the
- > quorum.
- >
- > A lock request which interferes with a passively maintained lock will
- > force the two-phase commit protocol to complete and then transfer
- > ownership to the requesting entity, thus avoiding having to deal with
- > deadlock protocols at this point in the state machine.
- >
- > Since any node can initiate concurrent lock requests to many other
- > nodes it is possible to deadlock. When two nodes initiate conflicting
- > lock requests to the cluster the one achieving the quorum basically wins
- > and the other is forced to retry (go back one paragraph). In this
- > situation no deadlock will occur.
- >
- > If three or more nodes initiate conflicting lock requests to the
- > cluster a deadlock can occur whereby none of the nodes achieve a
- > quorum. In this case every node will know which of the other nodes was
- > granted the lock(s). Deadlock resolution then proceeds simultaneously on
- > those nodes (since they have the same information), whereby the lock
- > holders on the losing end of the algorithm transfer their locks to one of
- > the other nodes. The lock state and knowledge of the lock state is
- > updated in real time on all nodes until a quorum is achieved.
- >
- > * Fully cache coherent and synchronized multi-master environment with
- > passive read locking.
- >
- > This is a more complex form of clustering than the previous form.
- > Take the previous form and add the ability to passively hold SHARED
- > locks in addition to the EXCLUSIVE locks the previous form is able
- > to hold.
- >
- > The advantage of being able to passively hold a shared lock on a
- > sub-tree (locks can be held on single nodes or entire sub-trees) is that
- > it is then possible for all nodes to validate a node (modify_tid) or
- > entire sub-tree (mirror_tid) with a very short network transaction and
- > then satisfy a large number of requests from local storage.
- >
- > * Fully cache coherent and synchronized multi-master environment with
- > passive read locking and slave-only nodes.
- >
- > This is the MOST complex form of clustering we intend to support.
- > In a multi-master environment requiring a quorum of masters to operate
- > we implement all of the above plus ALSO allow additional nodes to be
- > added to the cluster as slave-only nodes.
- >
- > The difference between a slave-only node and setting up a manual
- > mirror-stream from the cluster to a read-only snapshot on another
- > HAMMER2 filesystem is that the slave-only node will be fully
- > cache coherent with either the cluster proper (if connected to a quorum
- > of masters), or to one or more other nodes in the cluster (if not
- > connected to a quorum of masters), EVEN if the slave itself is not
- > completely caught up.
- >
- > So if the slave-only cluster node is connected to the rest of the
- > cluster over a slow connection you basically get a combination of local
- > disk speeds for any data that is locally in sync and network-limited
- > speeds for any data that is not locally in sync.
- >
- > Slave-only cluster nodes run a standard mirror-stream in the background
- > to pull in the data as quickly as possible.
- >
- > This is in contrast to a manual mirror-stream to a read-only
- > snapshot (basically a simple slave), which has no ability to bypass
- > the local storage to handle out-of-date requests (in fact has no
- > ability to detect that the local storage is out-of-date anyway).
- >
- > -Matt