On Thu, 9 Feb 2012 01:20:04 PM Matthew Dillon wrote:

This is the current design document for HAMMER2. It lists every
feature I intend to implement for HAMMER2. Everything except the freemap
and cluster protocols (which are both big ticket items) has been
completely speced out.

There are many additional features versus the original document,
including hardlinks.

HAMMER2 is all I am working on this year so I expect to make good
progress, but it will probably still be July before we have anything
usable, and well into 2013 before the whole mess is implemented and
even later before the clustering is 100% stable.

However, I expect to be able to stabilize all non-cluster related
features in fairly short order. Even though HAMMER2 has a lot more
features than HAMMER1 the actual design is simpler than HAMMER1, with
virtually no edge cases to worry about (I spent 12+ months working edge
cases out in HAMMER1's B-Tree, for example... that won't be an issue for
HAMMER2 development).

The work is being done in the 'hammer2' branch off the main dragonfly
repo in appropriate subdirs. Right now it is just vsrinivas and me, but
hopefully enough will get fleshed out in a few months that other people
can help too.

Ok, here's what I have got.


HAMMER2 DESIGN DOCUMENT

Matthew Dillon
08-Feb-2012

* These features have been speced in the media structures.

* Implementation work has begun.

* A working filesystem with some features implemented is expected by July
2012.

* A fully functional filesystem with most (but not all) features is
expected by the end of 2012.

* All elements of the filesystem have been designed except for the freemap
(which isn't needed for initial work). 8MB per 2GB of filesystem
storage has been reserved for the freemap. The design of the freemap
is expected to be completely speced by mid-year.

* This is my only project this year. I'm not going to be doing any major
kernel bug hunting this year.

Feature List

* Multiple roots (allowing snapshots to be mounted). This is implemented
via the super-root concept. When mounting a HAMMER2 filesystem you
specify a device path and a directory name in the super-root.

* HAMMER1 had PFS's. HAMMER2 does not. Instead, in HAMMER2 any directory
in the tree can be configured as a PFS, causing all elements recursively
underneath that directory to become a part of that PFS.

* Writable snapshots. Any subdirectory tree can be snapshotted. Snapshots
show up in the super-root. It is possible to snapshot a subdirectory
and then later snapshot a parent of that subdirectory... really there are
no limitations here.

* Directory sub-hierarchy based quotas and space and inode usage tracking.
Any directory sub-tree, whether at a mount point or not, tracks aggregate
inode use and data space use. This is stored in the directory inode all
the way up the chain.

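To make the bookkeeping concrete, here is a minimal sketch of how such
per-directory aggregation could look; the structure and function names are
hypothetical and not HAMMER2's actual media or in-memory layout:

    #include <stddef.h>

    /*
     * Sketch only: a hypothetical in-memory inode with a parent pointer
     * and aggregate usage counters.
     */
    struct h2_inode_sketch {
            struct h2_inode_sketch *parent;     /* NULL at the PFS/super-root */
            long long               inode_count;
            long long               data_bytes;
    };

    /* Propagate an allocation delta from a directory all the way up. */
    static void
    h2_account(struct h2_inode_sketch *dir, long long dinodes, long long dbytes)
    {
            for (; dir != NULL; dir = dir->parent) {
                    dir->inode_count += dinodes;
                    dir->data_bytes  += dbytes;
            }
    }
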
* Incremental queueless mirroring / mirroring-streams. Because HAMMER2 is
block-oriented and copy-on-write each blockref tracks both direct
modifications to the referenced data via (modify_tid) and indirect
modifications to the referenced data or any sub-tree via (mirror_tid).
This makes it possible to do an incremental scan of meta-data that covers
only changes made since the mirror_tid recorded in a prior run.

This feature is also intended to be used to locate recently allocated
blocks and thus be able to fixup the freemap after a crash.

HAMMER2 mirroring works a bit differently than HAMMER1 mirroring in
that HAMMER2 does not keep track of 'deleted' records. Instead any
recursion by the mirroring code which finds that (modify_tid) has
been updated must also send the direct block table or indirect block
table state it winds up recursing through so the target can check
similar key ranges and locate elements to be deleted. This can be
avoided if the mirroring stream is mostly caught up in that very recent
deletions will be cached in memory and can be queried, allowing shorter
record deletions to be passed in the stream instead.

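As an illustration of the incremental scan described above, a minimal sketch
in C follows; the types, fields, and callback are hypothetical stand-ins,
not the real HAMMER2 structures:

    /*
     * Sketch: prune any sub-tree whose mirror_tid is not newer than the
     * last run's transaction id, and emit directly modified blockrefs.
     */
    struct h2_blockref_sketch {
            unsigned long long modify_tid;   /* direct modification        */
            unsigned long long mirror_tid;   /* any change in the sub-tree */
            /* ... key range, block pointer, check code, etc. ... */
    };

    static void
    h2_mirror_scan(struct h2_blockref_sketch *bref,
                   unsigned long long since_tid,
                   void (*emit)(struct h2_blockref_sketch *))
    {
            if (bref->mirror_tid <= since_tid)
                    return;                 /* nothing below here changed */
            if (bref->modify_tid > since_tid)
                    emit(bref);             /* directly modified: send it */
            /* recurse into the referenced block table (omitted) */
    }
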
* Will support multiple compression algorithms configured on a subdirectory
tree basis and on a file basis. Up to 64K block compression will be
used. Only compression ratios near powers of 2 that are at least 2:1 (e.g.
2:1, 4:1, 8:1, etc.) will work in this scheme because physical block
allocations in HAMMER2 are always power-of-2.

Compression algorithm #0 will mean no compression and no zero-checking.
Compression algorithm #1 will mean zero-checking but no other
compression. Real compression will be supported starting with algorithm 2.

* Zero detection on write (writing all-zeros), which requires the data
buffer to be scanned, will be supported as compression algorithm #1.
This allows the writing of 0's to create holes and will be the default
compression algorithm for HAMMER2.

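The interaction between power-of-2 physical allocation and compression, and
the zero-detection pass, might be sketched roughly as follows (illustrative
helper names and an assumed minimum allocation size, not the actual
implementation):

    #include <stddef.h>

    /* Round a length up to the next power-of-2 physical block size.
     * The 1KB minimum here is an assumption for the sketch. */
    static size_t
    h2_phys_alloc_size(size_t len)
    {
            size_t n = 1024;

            while (n < len)
                    n <<= 1;
            return n;
    }

    /* Compression only pays off if the rounded result saves >= 2:1. */
    static int
    h2_compression_worthwhile(size_t logical, size_t compressed)
    {
            return (h2_phys_alloc_size(compressed) <= logical / 2);
    }

    /* "Compression" algorithm #1: detect an all-zero buffer (a hole). */
    static int
    h2_is_zero_block(const char *buf, size_t len)
    {
            size_t i;

            for (i = 0; i < len; ++i) {
                    if (buf[i] != 0)
                            return 0;
            }
            return 1;
    }
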
* Copies support for redundancy. The media blockref structure would
have become too bloated but I found a clean way to do copies using the
blockset structure (which is a set of 8 fully associative blockref's).

The design is such that the filesystem should be able to function at
full speed even if disks are pulled or inserted, as long as at least one
good copy is present. A background task will be needed to resynchronize
missing copies (or remove excessive copies in the case where the copies
value is reduced on a live filesystem).

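A rough sketch of the blockset idea (names and layout are illustrative only,
not the on-media format):

    #include <stddef.h>

    #define H2_COPIES_MAX   8

    struct h2_blockref_sketch {
            unsigned long long modify_tid;
            unsigned long long mirror_tid;
            /* ... block address, check code, copy/sync flags ... */
    };

    /* A blockset: up to 8 fully associative copies of a logical block. */
    struct h2_blockset_sketch {
            struct h2_blockref_sketch blockref[H2_COPIES_MAX];
    };

    /* Return the first copy the caller's predicate marks good, or NULL. */
    static struct h2_blockref_sketch *
    h2_pick_copy(struct h2_blockset_sketch *bset,
                 int (*is_good)(struct h2_blockref_sketch *))
    {
            int i;

            for (i = 0; i < H2_COPIES_MAX; ++i) {
                    if (is_good(&bset->blockref[i]))
                            return &bset->blockref[i];
            }
            return NULL;
    }
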
* Intended to be clusterable, with a multi-master protocol under design
but not expected to be fully operational until mid-2013. The media
format for HAMMER1 was less conducive to logical clustering than I had
hoped so I was never able to get that aspect of my personal goals
working with HAMMER1. HAMMER2 effectively solves the issues that cropped
up with HAMMER1 (mainly that HAMMER1's B-Tree did not reflect the logical
file/directory hierarchy, making cache coherency very difficult).

* Hardlinks will be supported. All other standard features will be
supported too of course. Hardlinks in this sort of filesystem require
significant work.

* The media blockref structure is now large enough to support up to a
192-bit check value, which would typically be a cryptographic hash of some
sort. Multiple check value algorithms will be supported with the default
being a simple 32-bit iSCSI CRC.

* Fully verified deduplication will be supported and automatic (and
necessary in many respects).

* Non-verified de-duplication will be supported as a configurable option on
a file or subdirectory tree. Non-verified deduplication would use the
largest available check code (192 bits) and not bother to verify data
matches during the dedup pass, which is necessary on extremely large
filesystems with a great deal of deduplicable data (as otherwise a large
chunk of the media would have to be read to implement the dedup).

This feature is intended only for those files where occasional
corruption is ok, such as in a large data store of farmed web content.
GENERAL DESIGN

HAMMER2 generally implements a copy-on-write block design for the
filesystem, which is very different from HAMMER1's B-Tree design. Because
the design is copy-on-write it can be trivially snapshotted simply by
referencing an existing block, and because the media structures logically
match a standard filesystem directory/file hierarchy snapshots and other
similar operations can be trivially performed on an entire subdirectory
tree at any level in the filesystem.

The copy-on-write nature of the filesystem implies that any modification
whatsoever will have to eventually synchronize new disk blocks all the way
to the super-root of the filesystem and the volume header itself. This
forms the basis for crash recovery. All disk writes are to new blocks
except for the volume header, thus allowing all writes to run concurrently
except for the volume header update at the end.

Clearly this method requires intermediate modifications to the chain to be
cached so multiple modifications can be aggregated prior to being
synchronized. One advantage, however, is that the cache can be flushed at
any time WITHOUT having to allocate yet another new block when further
modifications are made as long as the volume header has not yet been
flushed. This means that buffer cache overhead is very well bounded and
can handle filesystem operations of any complexity even on boxes with very
small amounts of physical memory.

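A minimal sketch of the propagation rule described above, with purely
illustrative types (the real chain and flush code is far more involved):

    #include <stddef.h>

    /*
     * Sketch: modifying a block allocates a new block and dirties the
     * parent chain; a flush writes the dirty chain bottom-up and updates
     * the volume header last, which is what makes crash recovery work.
     */
    struct h2_chain_sketch {
            struct h2_chain_sketch *parent;   /* NULL == volume header level */
            int                     dirty;
            /* ... reference to the newly allocated media block ... */
    };

    /* Mark a chain element modified; the change must bubble to the root. */
    static void
    h2_chain_modify(struct h2_chain_sketch *chain)
    {
            for (; chain != NULL; chain = chain->parent)
                    chain->dirty = 1;
    }

    /*
     * A flush would then write dirty blocks child-first and write the
     * volume header exactly once at the very end; until that header
     * write, further modifications can keep reusing the already
     * allocated new blocks, which is what bounds buffer cache overhead.
     */
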
I intend to implement a shortcut to make fsync()'s run fast, and that is to
allow deep updates to blockrefs to shortcut to auxiliary space in the
volume header to satisfy the fsync requirement. The related blockref is
then recorded when the filesystem is mounted after a crash and the update
chain is reconstituted when a matching blockref is encountered again during
normal operation of the filesystem.

Basically this means that no real work needs to be done at mount-time
even after a crash.

Directories are hashed, and another major design element is that directory
entries ARE INODES. They are one and the same. In addition to directory
entries being inodes the data for very small files (512 bytes or smaller)
can be directly embedded in the inode (overloaded onto the same space that
the direct blockref array uses). This should result in very high
performance.

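A rough sketch of how an inode that doubles as a directory entry might
overload its direct blockref area for small-file data; the sizes and field
names here are illustrative only:

    #define H2_EMBEDDED_BYTES       512
    #define H2_INODE_EMBEDDED       0x0001  /* data lives in u.data[] */

    struct h2_inode_data_sketch {
            unsigned long long      inum;   /* stored in the inode itself */
            unsigned long long      size;
            unsigned int            flags;
            union {
                    /* normal files: the direct blockref array */
                    unsigned char blockrefs[H2_EMBEDDED_BYTES];
                    /* files <= 512 bytes: data embedded directly */
                    char          data[H2_EMBEDDED_BYTES];
            } u;
    };
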
Inode numbers are not spatially referenced, which complicates NFS servers
but doesn't complicate anything else. The inode number is stored in the
inode itself, an absolutely necessary feature in order to support the
hugely flexible snapshots that we want to have in HAMMER2.

HARDLINKS

Hardlinks are a particularly sticky problem for HAMMER2 due to the lack of
a spatial reference to the inode number. We do not want to have to have
an index of inode numbers for any basic HAMMER2 feature if we can help it.

Hardlinks are handled by placing the inode for a multiply-hardlinked file
in the closest common parent directory. If "a/x" and "a/y" are hardlinked
the inode for the hardlinked file will be placed in directory "a", e.g.
"a/3239944", but it will be invisible and will be in an out-of-band
namespace. The directory entries "a/x" and "a/y" will be given the same
inode number but in fact just be placemarks that cause HAMMER2 to recurse
upwards through the directory tree to find the invisible inode number.

Because directories are hashed and a different namespace (hash key range)
is used for hardlinked inodes, standard directory scans are able to
trivially skip this invisible namespace and inode-specific lookups can
restrict their lookup to within this space.

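A minimal sketch of the upward recursion used to resolve a hardlink
placemark to its hidden inode; the structures and the lookup callback are
hypothetical helpers, not the real code:

    #include <stddef.h>

    struct h2_inode_sketch;             /* opaque for this sketch */

    struct h2_dir_sketch {
            struct h2_dir_sketch *parent;
    };

    /*
     * A directory entry for a hardlink carries only the target inode
     * number; the real (hidden) inode is found by probing each parent
     * directory's out-of-band hardlink namespace on the way up.
     */
    static struct h2_inode_sketch *
    h2_resolve_hardlink(struct h2_dir_sketch *dir, unsigned long long inum,
                        struct h2_inode_sketch *(*lookup_hidden)
                            (struct h2_dir_sketch *, unsigned long long))
    {
            for (; dir != NULL; dir = dir->parent) {
                    struct h2_inode_sketch *ip = lookup_hidden(dir, inum);

                    if (ip != NULL)
                            return ip;  /* found in a common parent */
            }
            return NULL;                /* should not happen */
    }
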
The nature of snapshotting makes handling link-count 2->1 and 1->2 cases
trivial. Basically the inode media structure is copied as needed to
break-up or re-form the standard directory entry/inode. There are no
backpointers in HAMMER2 and no reference counts on the blocks (see FREEMAP
NOTES below), so it is an utterly trivial operation.

FREEMAP NOTES

In order to implement fast snapshots (and writable snapshots for that
matter), HAMMER2 does NOT ref-count allocations. The freemap, which
is still under design, just won't do that. All the freemap does is
keep track of 100% free blocks.

This not only trivializes all the snapshot features, it also trivializes
hardlink handling and solves the problem of keeping the freemap synchronized
in the event of a crash. Now all we have to do after a crash is make
sure blocks allocated before the freemap was flushed are properly
marked as allocated in the allocmap. This is a trivial exercise using the
same algorithm the mirror streaming code uses (which is very similar to
HAMMER1)... an incremental meta-data scan that covers only the blocks that
might have been allocated between the last allocation map sync and now.

Thus the freemap does not have to be synchronized during a fsync().

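A sketch of that post-crash fixup pass, reusing the mirror_tid pruning idea;
the structures and the marking helper are hypothetical:

    /*
     * Sketch: rescan only sub-trees whose mirror_tid is newer than the
     * last freemap sync and re-mark the blocks they reference as
     * allocated.  The decoded child table is assumed for illustration.
     */
    struct h2_bref_sketch {
            unsigned long long mirror_tid;
            unsigned long long data_off;        /* media offset of the block */
            int                nchildren;
            struct h2_bref_sketch *children;    /* decoded block table */
    };

    static void
    h2_freemap_recover(struct h2_bref_sketch *bref,
                       unsigned long long freemap_sync_tid,
                       void (*mark_allocated)(unsigned long long))
    {
            int i;

            if (bref->mirror_tid <= freemap_sync_tid)
                    return;             /* sub-tree predates the sync */
            mark_allocated(bref->data_off);
            for (i = 0; i < bref->nchildren; ++i) {
                    h2_freemap_recover(&bref->children[i], freemap_sync_tid,
                                       mark_allocated);
            }
    }
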
The complexity is in figuring out what can be freed... that is, when one
can mark blocks in the freemap as being free. HAMMER2 implements this as
a background task which essentially must scan available meta-data to
determine which blocks are not being referenced.

Part of the ongoing design work is finding ways to reduce the scope of this
meta-data scan so the entire filesystem's meta-data does not need to be
scanned (though in tests with HAMMER1, even full meta-data scans have
turned out to be fairly low cost). In other words, it's an area that we
can continue to improve on as the filesystem matures. Not only that, but
we can completely change the freemap algorithms without creating
incompatibilities (at worst simply having to require that a R+W mount do
a full meta-data scan when upgrading or downgrading the freemap algorithm).

CLUSTERING

Clustering, as always, is the most difficult bit but we have some
advantages with HAMMER2 that we did not have with HAMMER1. First,
HAMMER2's media structures generally follow the kernel's filesystem
hierarchy. Second, HAMMER2's writable snapshots make it possible to
implement several forms of multi-master clustering.

The general mechanics for most of the multi-master clustering
implementations will be as follows:

(a) Use the copies mechanism to specify all elements of the cluster,
both local and remote (networked).

(b) The core synchronization state operates just as it does for copies,
simply requiring a fully-flushed ack from the remote in order to
mark the blocks as having been fully synchronized.

The mirror_tid may be used to locate these blocks, allowing the
synchronization state to be updated on the fly at a much later
time without requiring the state to be maintained in-memory
(also for crash recovery resynchronization purposes).

(c) Data/meta-data can be retrieved from those copies which are marked
as being synchronized, with priority given to the local storage
relative to any given physical machine.

This means that e.g. even in a master-slave orientation the slave
may be able to satisfy a request from a program when the slave
happens to be the local storage.

(d) Transaction id synchronization between all elements of the cluster,
typically through masking (assigning a cluster number using the low
3 bits of the transaction id). A minimal sketch of this masking
follows this list.

(e) General access (synchronized or otherwise) may require cache
coherency mechanisms to run over the network.

Implementing cache coherency is a major complexity issue.

(f) General access (synchronized or otherwise) may require quorum
agreement, using the synchronization flags in the blockrefs
to determine whether agreement has been reached.

Implementing quorum voting is a major complexity issue.

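The transaction-id masking mentioned in item (d) might look roughly like
this; the constants and helper names are illustrative only:

    #define H2_CLUSTER_BITS  3
    #define H2_CLUSTER_MASK  ((1ULL << H2_CLUSTER_BITS) - 1)

    /*
     * Each node advances its transaction ids in strides of 8 so the id
     * always carries the node's cluster number in the low 3 bits and
     * never collides with ids issued by another node.
     */
    static unsigned long long
    h2_next_tid(unsigned long long last_tid, unsigned int node_id)
    {
            unsigned long long tid;

            tid = (last_tid & ~H2_CLUSTER_MASK) + (1ULL << H2_CLUSTER_BITS);
            return (tid | (node_id & H2_CLUSTER_MASK));
    }

    /* Recover the originating node from a transaction id. */
    static unsigned int
    h2_tid_to_node(unsigned long long tid)
    {
            return ((unsigned int)(tid & H2_CLUSTER_MASK));
    }
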
There are lots of ways to implement multi-master environments using the
above core features but the implementation is going to be fairly complex
even with HAMMER2's feature set.

Keep in mind that modifications propagate all the way to the super-root
and volume header, so in any clustered arrangement the use of (modify_tid)
and (mirror_tid) is critical in determining the synchronization state of
portion(s) of the filesystem.

Specifically, since any modification propagates to the root the
(mirror_tid) in higher level directories is going to be in a constant
state of flux. This state of flux DOES NOT invalidate the cache state for
these higher levels of directories. Instead, the (modify_tid) is used on
a node-by-node basis to determine cache state at any given level, and
(mirror_tid) is used to determine whether any recursively underlying state
is desynchronized.

* Simple semi-synchronized multi-master environment.

In this environment all nodes are considered masters and modifications
can be made on any of them, and then propagate to the others
asynchronously via HAMMER2 mirror streams. One difference here is
that the kernel can activate these userland-managed streams automatically
when the copies configuration is used to specify the cluster.

The only type of conflict which isn't readily resolvable by comparing
the (modify_tid) is when file data is updated. In this case user
intervention might be required but, theoretically, it should be
possible to automate most merges using a multi-way patch and, if not,
choosing one and creating backup copies of the others to allow the
user or sysop to resolve the conflict later.

* Simple fully synchronized fail-over environment.

In this environment there is one designated master and the remaining
nodes are slaves. If the master fails all remaining nodes agree on a
new master, possibly with the requirement that a quorum be achieved
(if you don't want to allow the cluster to split).

If network splits are allowed then each sub-cluster operates in this
mode but recombining the clusters reverts to the first algorithm.
If not allowed, whoever no longer has a quorum will be forced to stall.

In this environment the current designated master is responsible for
managing locks for modifying operations. The designated master will
proactively tell the other nodes to mark the blocks related to the
modifying operation as no longer being synchronized while any local
data at the node that acquired the lock (master or slave) remains
marked as being synchronized.

The node that successfully gets the lock then issues the modifying
operation to both its local copy and to the master, marking the
master as being desynchronized until the master acknowledges receipt.

In this environment any node can access data from local storage if
the designated master copy is marked synchronized AND its (modify_tid)
matches the slave copy's (modify_tid).

However, if a slave disconnects from the master then reconnects the
slave will have lost the master's desynchronization stream and must
mark its root blockref for the master copy HAMMER2_BREF_DESYNCHLD as
well as clear the SYNC1/SYNC2 bits. Setting DESYNCCHLD forces
on-demand recursive reverification that the master and slave are (or are
not) in sync in order to reestablish on the slave the synchronization
state of the master.

That might be a bit confusing but the whole point here is to allow
read accesses to the filesystem to be satisfied by any node in a
multi-master cluster, not just by the current designated master.

* Fully cache coherent and synchronized multi-master environment.

In this environment a quorum is required to perform any modifying
action. All nodes are masters (there is no 'designated' master)
and all nodes connect to all other nodes in a cross-bar.

The quorum is specified by copies setup in the root volume
configuration. A quorum of nodes in the cluster must agree on the copies
configuration. If they do not the cluster cannot proceed to mount. Any
other nodes in the cluster which are not in the quorum and which disagree
with the configuration will inherit the copies configuration from the quorum.

Any modifying action will initiate a lock request locally to all nodes
in the cluster. The modifying action is allowed to proceed the instant
a quorum of nodes respond in the affirmative (even if some have not
yet responded or are down). The modifying action is considered
complete once the two-phase commit protocol succeeds. The modifying
action typically creates and commits a temporary snapshot on at least a
quorum of masters as phase-1 and then ties the snapshot back into the main
mount as phase-2.

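A minimal sketch of the quorum test described above; the networking,
timeouts, and the actual two-phase commit are omitted and the names are
illustrative:

    /*
     * Sketch: send the lock request to every master and allow the
     * modifying action to proceed as soon as a majority has answered
     * in the affirmative.
     */
    struct h2_lock_req_sketch {
            int nnodes;         /* total masters in the cluster */
            int nacks;          /* affirmative responses so far  */
    };

    static int
    h2_quorum_reached(const struct h2_lock_req_sketch *req)
    {
            return (req->nacks > req->nnodes / 2);
    }

    /* Called for each affirmative response; returns 1 when the caller
     * may proceed with phase-1 of the two-phase commit. */
    static int
    h2_lock_response(struct h2_lock_req_sketch *req)
    {
            req->nacks++;
            return (h2_quorum_reached(req));
    }
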
These locks are cache-coherency locks and may be passively maintained
in order to aggregate multiple operations under the same lock and thus
under the same transaction from the point of view of the rest of the
quorum.

A lock request which interferes with a passively maintained lock will
force the two-phase commit protocol to complete and then transfer
ownership to the requesting entity, thus avoiding having to deal with
deadlock protocols at this point in the state machine.

Since any node can initiate concurrent lock requests to many other
nodes it is possible to deadlock. When two nodes initiate conflicting
lock requests to the cluster the one achieving the quorum basically wins
and the other is forced to retry (go back one paragraph). In this
situation no deadlock will occur.

If three or more nodes initiate conflicting lock requests to the
cluster a deadlock can occur whereby none of the nodes achieve a
quorum. In this case every node will know which of the other nodes was
granted the lock(s). Deadlock resolution then proceeds simultaneously on
the three nodes (since they have the same information), whereby the lock
holders on the losing end of the algorithm transfer their locks to one of
the other nodes. The lock state and knowledge of the lock state is
updated in real time on all nodes until a quorum is achieved.

* Fully cache coherent and synchronized multi-master environment with
passive read locking.

This is a more complex form of clustering than the previous form.
Take the previous form and add the ability to passively hold SHARED
locks in addition to the EXCLUSIVE locks the previous form is able
to hold.

The advantage of being able to passively hold a shared lock on a
sub-tree (locks can be held on single nodes or entire sub-trees) is that
it is then possible for all nodes to validate a node (modify_tid) or
entire sub-tree (mirror_tid) with a very short network transaction and
then satisfy a large number of requests from local storage.

* Fully cache coherent and synchronized multi-master environment with
passive read locking and slave-only nodes.

This is the MOST complex form of clustering we intend to support.
In a multi-master environment requiring a quorum of masters to operate
we implement all of the above plus ALSO allow additional nodes to be
added to the cluster as slave-only nodes.

The difference between a slave-only node and setting up a manual
mirror-stream from the cluster to a read-only snapshot on another
HAMMER2 filesystem is that the slave-only node will be fully
cache coherent with either the cluster proper (if connected to a quorum
of masters), or to one or more other nodes in the cluster (if not
connected to a quorum of masters), EVEN if the slave itself is not
completely caught up.

So if the slave-only cluster node is connected to the rest of the
cluster over a slow connection you basically get a combination of local
disk speeds for any data that is locally in sync and network-limited
speeds for any data that is not locally in sync.

Slave-only cluster nodes run a standard mirror-stream in the background
to pull in the data as quickly as possible.

This is in contrast to a manual mirror-stream to a read-only
snapshot (basically a simple slave), which has no ability to bypass
the local storage to handle out-of-date requests (in fact it has no
ability to detect that the local storage is out-of-date anyway).

-Matt