jimklimov — zfs layout on nvpool
Oct 19th, 2020 (edited)
Hey all, some time ago I was complaining about a vast discrepancy between the pool space reported as used by datasets (about 500Gb) and the space actually used on the pool (over 700Gb), leaving me with little headroom to actually work in (and at least once crashing when it all ran out... the habit of keeping a dummy reservation dataset to quickly destroy saved that day).

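The dummy-reservation trick mentioned above can look something like this (a sketch only; the dataset name and size here are illustrative, not taken from this pool):

```shell
# Create an empty dataset whose only job is to hold back some space;
# refreservation keeps the space claimed even though nothing is written to it.
zfs create -o refreservation=2G -o mountpoint=none nvpool/spacer

# When the pool hits 100% and writes start failing with ENOSPC,
# destroying it instantly frees enough room to go delete snapshots/files:
zfs destroy nvpool/spacer
```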
One of my theories connected this to znapzend, which regularly created source snapshots but did not always have anywhere to send them, and/or did not always complete the whole sending operation, so either way it skipped the cleanup.

[2020-11-18 05:29:28.63135] [4106] [debug] cleaning up 3365 source snapshots recursively under nvpool/zones
ugh!

After some fixes to the script over the past months, I got it to run as far as the proper cleanup, and have reclaimed 50Gb already, and counting...

So now the question is: how do ZFS tools account for the space used by snapshot metadata? My guess is that this could be what ate my disks...

(UPDATE)

At least, after znapzend took 1280 minutes to send the meagre updates and clean up the mostly empty snapshots (on NVMe!), I have over 100Gb added back to usable disk space.

(UPDATE)

After the fixed znapzend cleaned up another dataset tree (with an unknown number of snaps - `zfs list -tsnapshot | wc -l` failed many times with "out of memory"; sometime closer to the end of the adventure there were 185K snaps left there), another 100G surfaced, and so most of the unexplained discrepancy is gone. Indirect blocks, or whatever else is involved in snapshots and not currently accounted in any dataset size metric, do take nontrivial space and add up considerably en masse.

With the cleanups of nvpool/ROOT and nvpool/zones behind me (about 30 older inactive BEs are around, with about 7 datasets per rootfs, each adding up for every existing recursive snapshot - overall 300-500 policied snaps under each rootfs tree), and the nvpool/export tree ongoing, the numbers are:

# zfs list -tfilesystem -d1 -o name -r nvpool | grep / | while read Z ; do printf '%25s\t%s\n' "$Z" "`zfs list -t snapshot -r $Z | wc -l`" ; done
             nvpool/Media   67
              nvpool/ROOT   11125
            nvpool/SHARED   5076
   nvpool/SHARED-nobackup   3
            nvpool/export   22511
           nvpool/kohsuke   28
              nvpool/temp   5
              nvpool/test   20
               nvpool/tmp   4
   nvpool/var-squid-cache   1
             nvpool/zones   14404

root@jimoi:/root# zfs list -d1 -ospace -sused -r nvpool
NAME                    AVAIL   USED  USEDSNAP  USEDDS  USEDREFRESERV  USEDCHILD
nvpool/tmp               197G    23K         0     23K              0          0
nvpool/kohsuke           197G    59K       13K     23K              0        23K
nvpool/test              197G    73K         0     25K              0        48K
nvpool/temp              197G   668M         0    668M              0          0
nvpool/dump              197G  1.00G         0   1.00G              0          0
nvpool/SHARED-nobackup   197G  1.38G         0     24K              0      1.38G
nvpool/var-squid-cache   197G  3.36G         0   3.36G              0          0
nvpool/swap              198G  8.50G         0   8.01G           501M          0
nvpool/zones             197G  9.42G         0     23K              0      9.42G
nvpool/SHARED            197G  18.4G         0     23K              0      18.4G
nvpool/ROOT              197G  87.0G         0     23K              0      87.0G
nvpool/Media             197G   141G      191M    141G              0          0
nvpool/export            197G   285G         0     23K              0       285G
nvpool                   197G   563G       14K     36K              0       563G

root@jimoi:/root# zfs list -d1 -p -ospace -sused -r nvpool
NAME                            AVAIL          USED   USEDSNAP        USEDDS  USEDREFRESERV     USEDCHILD
nvpool/tmp               211983997904         23552          0         23552              0             0
nvpool/kohsuke           211983997904         60416      13312         23552              0         23552
nvpool/test              211983997904         74752          0         25600              0         49152
nvpool/temp              211983997904     700783616          0     700783616              0             0
nvpool/dump              211983997904    1073847296          0    1073847296              0             0
nvpool/SHARED-nobackup   211983997904    1486294528          0         24576              0    1486269952
nvpool/var-squid-cache   211983997904    3610465792          0    3610465792              0             0
nvpool/swap              212509190096    9126805504          0    8601613312      525192192             0
nvpool/zones             211983997904   10115445248          0         23552              0   10115421696
nvpool/SHARED            211983997904   19763642368          0         23552              0   19763618816
nvpool/ROOT              211983997904   93384927744          0         23552              0   93384904192
nvpool/Media             211983997904  151921219072  200092672  151721126400              0             0
nvpool/export            211983997904  305842327040          0         23552              0  305842303488
nvpool                   211983997904  604379958272      14336         36864              0  604379907072

root@jimoi:/root# zfs list -d1 -Housed,name -sused -p -r nvpool | grep / | awk '{print $1}' | ( A=0; while read B; do A=$(($A+$B)) ; done ; echo $A )
597025938944

root@jimoi:/root# expr 604379958272 - 597025938944
7354019328

So now they are wonderfully close, with some 7G of "unaccounted" used space (and about 50K snaps remaining across nvpool in total).
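The subshell summing loop above can also be done in a single awk pass (an equivalent sketch; `-Hp` drops the header and prints raw bytes, and the `/` match on the name skips the pool root itself):

```shell
# Sum the parseable 'used' bytes of the pool's direct children only
zfs list -d1 -Hp -o name,used -r nvpool | awk '$1 ~ /\// { s += $2 } END { print s }'
```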

-----

Data for the original issue discussion:

root@jimoi:/usr/share/znapzend# zfs list -tall -d1 -s used -r nvpool
NAME                     USED  AVAIL  REFER  MOUNTPOINT
nvpool@20190826-01          0      -    23K  -
nvpool@20190830-01          0      -    23K  -
nvpool@20190910-02          0      -    23K  -
nvpool/tmp                23K  14.8G    23K  /nvpool/tmp
nvpool/kohsuke            59K  14.8G    23K  /nvpool/kohsuke
nvpool/test               73K  14.8G    25K  /nvpool/test
nvpool/SHARED-nobackup  70.8M  14.8G    24K  legacy
nvpool/temp              668M  14.8G   668M  /nvpool/temp
nvpool/dump             1.00G  14.8G  1.00G  -
nvpool/var-squid-cache  7.16G  14.8G  7.16G  /var/squid/cache
nvpool/swap             8.50G  15.3G  8.01G  -
nvpool/zones            9.42G  14.8G    23K  /zones
nvpool/SHARED           18.4G  14.8G    23K  legacy
nvpool/ROOT             89.7G  14.8G    23K  legacy
nvpool/Media             139G  14.8G   138G  /Media
nvpool/export            277G  14.8G    23K  /export
nvpool                   744G  14.8G    36K  /nvpool

root@jimoi:/usr/share/znapzend# zpool list nvpool
NAME     SIZE  ALLOC   FREE  CKPOINT  EXPANDSZ  FRAG  CAP  DEDUP  HEALTH    ALTROOT
nvpool   744G   704G  39.8G        -         -   67%  94%  3.40x  DEGRADED  -

### In `zfs list`, the "used" sizes of the datasets sum up to about 552Gb,
### however the pool reports 744Gb "used" and 14.8G "avail".
### In `zpool list`, 744G is its *size*, while "alloc" is 704G and "free" is 39.8G.
### I gather the difference of "avail" vs "free" is well known (some of it is the
### 1/64th-of-the-pool system reservation against fragmentation, if that still
### exists... though then how come ZFS performance still collapses due to
### fragmentation?) and might include blocks reserved but not actually written
### by certain datasets like swap, etc.
### But the zfs "used" being the same as the zpool "size" seems like a bug... or
### maybe not, if it is essentially the reservation and quota all in one on the
### current backing storage. Anyhow, those 552G used or reserved by datasets,
### vs. 704G allocated, is quite a difference.
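For scale, the 1/64 slop reservation guessed at above would account for only about a dozen gigabytes on this pool (a back-of-the-envelope check, assuming the 1/64 figure is right):

```shell
# 744G pool size divided by 64 - roughly the space such a reservation would hide
awk 'BEGIN { printf "%.3fG\n", 744 / 64 }'
# prints 11.625G - nowhere near the ~150G gap between 552G and 704G
```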

### This is a mirror, or rather the single disk that now remains of it,
### but metadata enjoys multiple block copies, right?

### So, back to the first question: are the metadata blocks of snapshots accounted
### in their parent datasets, or only in the pool directly? And other metadata, for
### that matter? Here it seems to have cost a quarter of the pool size, so I wonder... :)
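One way to get a pool-wide, category-by-category view of where the blocks actually went is a block traversal with zdb (a sketch; this walks every block pointer, so it can run for a very long time on a pool this size, and results are most trustworthy on a quiet or exported pool):

```shell
# Traverse all block pointers and print a space-usage breakdown,
# also reporting leaked or double-allocated space if any is found:
zdb -b nvpool

# -bb adds a more detailed per-object-type histogram:
zdb -bb nvpool
```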

# Other clues... a dedup table exists...

root@jimoi:/usr/share/znapzend# zdb -DDD nvpool
DDT-sha256-zap-duplicate: 64198 entries, size 291 on disk, 141 in core

bucket              allocated                       referenced
______   ______________________________   ______________________________
refcnt   blocks   LSIZE   PSIZE   DSIZE   blocks   LSIZE   PSIZE   DSIZE
------   ------   -----   -----   -----   ------   -----   -----   -----
     2    26.9K    941M    233M    233M    60.1K   2.47G    552M    552M
     4    35.1K    815M    255M    255M     142K   3.27G   1.02G   1.02G
     8      471   6.74M   2.05M   2.05M    4.08K   64.9M   19.9M   19.9M
    16      198    144K    112K    112K    3.80K   2.71M   2.13M   2.13M
    32       33   16.5K   16.5K   16.5K    1.31K    668K    668K    668K
    64        6   5.50K   3.50K   3.50K      452    402K    262K    262K
   128        3      2K   1.50K   1.50K      444    302K    222K    222K

DDT-sha256-zap-unique: 38603 entries, size 394 on disk, 251 in core

bucket              allocated                       referenced
______   ______________________________   ______________________________
refcnt   blocks   LSIZE   PSIZE   DSIZE   blocks   LSIZE   PSIZE   DSIZE
------   ------   -----   -----   -----   ------   -----   -----   -----
     1    37.7K   2.09G    645M    645M    37.7K   2.09G    645M    645M

DDT-edonr-zap-duplicate: 105314 entries, size 301 on disk, 162 in core

bucket              allocated                       referenced
______   ______________________________   ______________________________
refcnt   blocks   LSIZE   PSIZE   DSIZE   blocks   LSIZE   PSIZE   DSIZE
------   ------   -----   -----   -----   ------   -----   -----   -----
     2    43.2K   4.80G   4.57G   4.57G    87.6K   9.60G   9.14G   9.14G
     4    48.6K   5.60G   5.47G   5.47G     271K   31.3G   30.6G   30.6G
     8    9.48K    958M    914M    914M    87.6K   8.65G   8.25G   8.25G
    16    1.46K   95.4M   89.4M   89.4M    29.6K   1.82G   1.70G   1.70G
    32      105    276K    156K    156K    3.94K   10.4M   5.89M   5.89M
    64        9   4.50K   4.50K   4.50K      748    374K    374K    374K
   128        2      1K      1K      1K      268    134K    134K    134K
   512        1    512B    512B    512B      550    275K    275K    275K
    1K        5   2.50K   2.50K   2.50K    7.77K   3.88M   3.88M   3.88M

DDT-edonr-zap-unique: 47842 entries, size 310 on disk, 164 in core

bucket              allocated                       referenced
______   ______________________________   ______________________________
refcnt   blocks   LSIZE   PSIZE   DSIZE   blocks   LSIZE   PSIZE   DSIZE
------   ------   -----   -----   -----   ------   -----   -----   -----
     1    46.7K   4.75G   4.38G   4.38G    46.7K   4.75G   4.38G   4.38G

DDT histogram (aggregated over all DDTs):

bucket              allocated                       referenced
______   ______________________________   ______________________________
refcnt   blocks   LSIZE   PSIZE   DSIZE   blocks   LSIZE   PSIZE   DSIZE
------   ------   -----   -----   -----   ------   -----   -----   -----
     1    84.4K   6.84G   5.01G   5.01G    84.4K   6.84G   5.01G   5.01G
     2    70.1K   5.72G   4.80G   4.80G     148K   12.1G   9.68G   9.68G
     4    83.7K   6.40G   5.72G   5.72G     413K   34.6G   31.6G   31.6G
     8    9.94K    965M    916M    916M    91.7K   8.71G   8.27G   8.27G
    16    1.65K   95.5M   89.5M   89.5M    33.4K   1.82G   1.70G   1.70G
    32      138    293K    172K    172K    5.25K   11.0M   6.54M   6.54M
    64       15     10K      8K      8K    1.17K    776K    636K    636K
   128        5      3K   2.50K   2.50K      712    436K    356K    356K
   512        1    512B    512B    512B      550    275K    275K    275K
    1K        5   2.50K   2.50K   2.50K    7.77K   3.88M   3.88M   3.88M
 Total     250K   20.0G   16.5G   16.5G     786K   64.0G   56.3G   56.3G

dedup = 3.41, compress = 1.14, copies = 1.00, dedup * compress / copies = 3.88
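That summary line checks out against the aggregated Total row above (a quick sanity check: dedup is the referenced/allocated DSIZE ratio, and compress is the LSIZE/PSIZE ratio):

```shell
# dedup    = referenced DSIZE / allocated DSIZE = 56.3G / 16.5G
# compress = referenced LSIZE / referenced PSIZE = 64.0G / 56.3G
awk 'BEGIN {
    d = 56.3 / 16.5
    c = 64.0 / 56.3
    printf "dedup=%.2f compress=%.2f dedup*compress=%.2f\n", d, c, d * c
}'
# prints dedup=3.41 compress=1.14 dedup*compress=3.88
```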

root@jimoi:/usr/share/znapzend# zpool get all nvpool
NAME    PROPERTY                       VALUE                                         SOURCE
nvpool  size                           744G                                          -
nvpool  capacity                       94%                                           -
nvpool  altroot                        -                                             default
nvpool  health                         DEGRADED                                      -
nvpool  guid                           185555628905970312                            default
nvpool  version                        -                                             default
nvpool  bootfs                         nvpool/ROOT/hipster_2020.04-20200809T192157Z  local
nvpool  delegation                     on                                            default
nvpool  autoreplace                    off                                           default
nvpool  cachefile                      -                                             default
nvpool  failmode                       continue                                      local
nvpool  listsnapshots                  off                                           default
nvpool  autoexpand                     off                                           default
nvpool  dedupditto                     0                                             default
nvpool  dedupratio                     3.40x                                         -
nvpool  free                           39.8G                                         -
nvpool  allocated                      704G                                          -
nvpool  readonly                       off                                           -
nvpool  comment                        -                                             default
nvpool  expandsize                     -                                             -
nvpool  freeing                        0                                             default
nvpool  fragmentation                  67%                                           -
nvpool  leaked                         0                                             default
nvpool  bootsize                       -                                             default
nvpool  checkpoint                     -                                             -
nvpool  multihost                      off                                           default
nvpool  ashift                         0                                             default
nvpool  autotrim                       off                                           default
nvpool  feature@async_destroy          enabled                                       local
nvpool  feature@empty_bpobj            active                                        local
nvpool  feature@lz4_compress           active                                        local
nvpool  feature@multi_vdev_crash_dump  enabled                                       local
nvpool  feature@spacemap_histogram     active                                        local
nvpool  feature@enabled_txg            active                                        local
nvpool  feature@hole_birth             active                                        local
nvpool  feature@extensible_dataset     active                                        local
nvpool  feature@embedded_data          active                                        local
nvpool  feature@bookmarks              enabled                                       local
nvpool  feature@filesystem_limits      enabled                                       local
nvpool  feature@large_blocks           active                                        local
nvpool  feature@large_dnode            enabled                                       local
nvpool  feature@sha512                 enabled                                       local
nvpool  feature@skein                  enabled                                       local
nvpool  feature@edonr                  active                                        local
nvpool  feature@device_removal         enabled                                       local
nvpool  feature@obsolete_counts        enabled                                       local
nvpool  feature@zpool_checkpoint       enabled                                       local
nvpool  feature@spacemap_v2            active                                        local
nvpool  feature@allocation_classes     disabled                                      local
nvpool  feature@resilver_defer         disabled                                      local
nvpool  feature@encryption             disabled                                      local
nvpool  feature@bookmark_v2            disabled                                      local
nvpool  feature@userobj_accounting     disabled                                      local
nvpool  feature@project_quota          disabled                                      local
nvpool  feature@log_spacemap           disabled                                      local

### UPDATE after a while: now "USED" sums up to more than the pool size

root@jimoi:/root# zfs list -d1 -ospace -sused -r nvpool
NAME                    AVAIL   USED  USEDSNAP  USEDDS  USEDREFRESERV  USEDCHILD
nvpool/tmp              8.16G    23K         0     23K              0          0
nvpool/kohsuke          8.16G    59K       13K     23K              0        23K
nvpool/test             8.16G    73K         0     25K              0        48K
nvpool/SHARED-nobackup  8.16G  70.8M         0     24K              0      70.7M
nvpool/temp             8.16G   668M         0    668M              0          0
nvpool/dump             8.16G  1.00G         0   1.00G              0          0
nvpool/var-squid-cache  8.16G  7.16G         0   7.16G              0          0
nvpool/swap             8.65G  8.50G         0   8.01G           501M          0
nvpool/zones            8.16G  9.42G         0     23K              0      9.42G
nvpool/SHARED           8.16G  18.4G         0     23K              0      18.4G
nvpool/ROOT             8.16G  89.7G         0     23K              0      89.7G
nvpool/Media            8.16G   139G      184M    138G              0          0
nvpool/export           8.16G   277G         0     23K              0       277G
nvpool                  8.16G   751G       14K     36K              0       751G

root@jimoi:/root# zfs list -d1 -ospace -sused -p -r nvpool
NAME                          AVAIL          USED   USEDSNAP        USEDDS  USEDREFRESERV     USEDCHILD
nvpool/tmp               8775617888         23552          0         23552              0             0
nvpool/kohsuke           8775617888         60416      13312         23552              0         23552
nvpool/test              8775617888         74752          0         25600              0         49152
nvpool/SHARED-nobackup   8775617888      74200576          0         24576              0      74176000
nvpool/temp              8775617888     700783616          0     700783616              0             0
nvpool/dump              8775617888    1073847296          0    1073847296              0             0
nvpool/var-squid-cache   8775617888    7692004864          0    7692004864              0             0
nvpool/swap              9300844896    9126805504          0    8601578496      525227008             0
nvpool/zones             8775617888   10115492352          0         23552              0   10115468800
nvpool/SHARED            8775617888   19771040768          0         23552              0   19771017216
nvpool/ROOT              8775617888   96337416704          0         23552              0   96337393152
nvpool/Media             8775617888  148885252096  192455680  148692796416              0             0
nvpool/export            8775617888  297791069696          0         23552              0  297791046144
nvpool                   8775617888  806495139840      14336         36864              0  806495088640

# ... and dude, where's my 200 gig?
root@jimoi:/root# expr 297791069696 + 148885252096 + 96337416704 + 19771040768 + 10115492352 + 9126805504 + 7692004864 + 1073847296 + 700783616 + 74200576
591567913472

root@jimoi:/root# expr 806495139840 - 591567913472
214927226368
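And that difference really is the missing ~200 gig once converted from bytes (plain integer shell arithmetic):

```shell
# 214927226368 bytes expressed in whole GiB (1 GiB = 1024^3 bytes)
echo $(( 214927226368 / 1024 / 1024 / 1024 ))
# prints 200
```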

root@jimoi:/root# zpool list nvpool
NAME     SIZE  ALLOC   FREE  CKPOINT  EXPANDSZ  FRAG  CAP  DEDUP  HEALTH    ALTROOT
nvpool   744G   711G  33.1G        -         -   71%  95%  3.40x  DEGRADED  -

root@jimoi:/root# zpool list -p nvpool
NAME            SIZE         ALLOC         FREE  CKPOINT  EXPANDSZ  FRAG  CAP  DEDUP  HEALTH    ALTROOT
nvpool  798863917056  763290726400  35573190656        -         -   71%   95  3.40x  DEGRADED  -
