zfs write tuning
a guest
Mar 28th, 2019
04:22 < scastano> So… this is where I'm at and could use some pointers…. long story short….
04:22 < scastano> I imaged my system, reinstalled on Ubuntu 18.04, compiled ZFS v0.8 from source and when I was done, performance had dropped to basically 35% of what I was getting before.
04:23 < scastano> So… I kept Ubuntu 18.04, but rolled back to ZFS v0.7.13 and now most of my performance is back… 3.6GB/s on reads, but my write speed didn't come all the way back to the 4.5GB/s I was at before… it's stuck around 3GB/s
04:24 < scastano> The major difference is that during read, my disks are showing 100% utilization just like they used to… but under write, they barely stay at about 60%… no CPU peg, no major iowait…. it seems like something else is holding it back.
04:25 < scastano> I have a theory that maybe it could be the number of threads, like vdev limits for async and sync writes, and I'm wondering if that's possible and/or a good place to try some tuning.
04:25 < jtara> it's a complex enough system it's hard to troubleshoot from here, but i would look to see if your txg size is stabilizing under write load before doing anything
04:25 < scastano> I've been reading a ton on the IO scheduler, I think that might be a possible place where things are getting stuck.
04:25 < jtara> and use zpool iostat -q to see how many write threads you typically have
04:26 < jtara> it's very easy to get focused on the io scheduler, but if the pieces in front of it are not streaming it data fast enough, tuning it will not help
04:27 < scastano> Ok, good to know, I'm running zpool iostat -q now and starting a write benchmark
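While the benchmark runs, the queue depths can be watched roughly like this. The pool name `tank` is a stand-in, and the awk field positions are an assumption based on the usual `zpool iostat -q` column order (name, capacity, operations, bandwidth, then pend/activ pairs for syncq_read, syncq_write, asyncq_read, asyncq_write, scrubq_read):

```shell
# Live command:  zpool iostat -q tank 10   ("tank" is a stand-in pool name)
# The fabricated sample below stands in for two data lines of that output:
sample='tank 100G 400G 0 12.1K 0 3.0G 0 0 0 0 0 0 0 370 0 0
tank 100G 400G 0 11.4K 0 2.8G 0 0 0 0 0 0 1100 370 0 0'
# In this layout, asyncq_write pend and activ are the 14th and 15th fields:
echo "$sample" | awk '{ printf "asyncq_write pend=%s activ=%s\n", $14, $15 }'
```

The numbers scastano reports below (370 active, pending bouncing up to 1.1K) are exactly what those two columns show.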
04:27 < jtara> also consider a test on a one-vdev pool without raidz, just to see how it does there
04:27 < jtara> whatever thread settings you come up with will not be that much different than what will be optimal with a raidz
04:29 < jtara> past a certain point, if you add more threads, each thread just grabs a block to write before as many have readied in their buffers
04:29 < jtara> so you get more writes but smaller
04:29 < scastano> I can do both of those… so I'm getting numbers coming back, but I'm not sure what they mean or if they are good or bad… with just the write test going it looks like it bounces around a little bit, but there's a max of about 370 active in the asyncq_write, most of the time with nothing pending… other times it bounces up to 1.1K
04:29 < jtara> the bad news is that causes higher latency, the good news is it will degrade throughput less as your average write size shrinks
04:30 < jtara> there should be a "how many are actually queued to the device" stat in there somewhere
04:31 < scastano> it just shows iops, speed and then queue numbers for sync, async and scrub
04:32 < jtara> paste it somewhere and /proc/spl/kstat/{pool name}/txgs
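The kstat jtara points at conventionally lives at `/proc/spl/kstat/zfs/<pool>/txgs` on ZFS on Linux (note the `zfs/` path component his shorthand omits). A sketch of pulling the peak `ndirty` out of it, using fabricated sample lines in place of the real file, with the column order assumed to be txg, birth, state, ndirty, …:

```shell
# Live command:  cat /proc/spl/kstat/zfs/tank/txgs   ("tank" is a stand-in)
# Fabricated sample standing in for the real file:
sample='txg      birth          state ndirty     nread nwritten
100      5000000000000  C     4294967296 0     3221225472
101      5001000000000  C     2147483648 0     3221225472'
# ndirty is the 4th column; report the largest value seen, in GiB:
echo "$sample" | awk 'NR > 1 && $4 > max { max = $4 } END { printf "peak ndirty: %.1f GiB\n", max / 2^30 }'
```

A peak that sits pinned at `zfs_dirty_data_max` (4 GiB in the sample, as in this pool's case below) is the signature of the problem diagnosed in the rest of the conversation.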
04:32 < scastano> And now it seems to have flattened out… on 10 second intervals, 4 out of every 5 lines of output shows 0 for pending and active.
04:33 < scastano> iostat: https://imgur.com/a/g4a8fo3
04:35 < scastano> With txgs: https://imgur.com/a/g4a8fo3
04:35 < jtara> that's interesting.  you haven't raised zfs_dirty_data_sync have you?
04:36 < scastano> no… the only thing I changed was to set the zfetch_max_distance to 128M which was the sweet spot for performance when I was running under Ubuntu 16.04
04:37 < jtara> for what it's worth, i think you'll find the r/w gap settings much more useful than zfetch
04:37 < jtara> but
04:38 < jtara> well it looks like your dirty data is hitting 4G and then committing
04:38 < jtara> and i see surges in your iostat output
04:38 < jtara> i would check your dirty data throttle and make sure that number is in the middle of it
04:38 < sarnold> ahoy JollyRoger` :)
04:39 < scastano> Ok, that makes sense as to why it would burst around like that…
04:39 < JollyRoger`> Ahoy, sarnold! \o/
04:39 < jtara> if your dirty data throttle isn't set right then you will slam into the dirty data max variable
04:39 < jtara> and then nobody's happy
04:39 < scastano> so which dirty data setting should I be looking at, there's like 10 of them! Hahaha
04:40 < jtara> how big is your arc
04:40 < jtara> also zfs_dirty_data_max and zfs_delay_min_dirty_percent and zfs_delay_scale
04:41 < scastano> ARC is 94G out of the 196 I have installed in the server, I'm grabbing the other values now.
04:41 < jtara> the idea is to get your dirty data to sort of float around one area
04:41 < jtara> it's not healthy for it to either bounce or to hit the dirty data max hard
04:42 < jtara> i've no idea what else may be in play here but if you don't get that right, nothing will be right
04:42 < scastano> So the dirty data max is right at 4G, percent is 60, delay scale is 500,000
04:43 < jtara> ah, so it is hitting your dirty data max
04:43 < jtara> what that means is that delays will start at 60% of 4G (2.4G), and at the midpoint (3.2G) they will be delayed by 500us
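jtara's numbers check out against the OpenZFS write throttle's documented delay curve, where the delay grows as dirty data climbs from the delay threshold toward the max:

```shell
# Delay curve:  delay_ns = zfs_delay_scale * (dirty - start) / (max - dirty)
# where start = zfs_delay_min_dirty_percent% of zfs_dirty_data_max.
# Values taken from the conversation: max=4G, percent=60, scale=500000.
awk 'BEGIN {
  max   = 4 * 2^30          # zfs_dirty_data_max
  start = 0.60 * max        # delays begin at 2.4G
  scale = 500000            # zfs_delay_scale, in nanoseconds
  mid   = (start + max) / 2 # 3.2G, midpoint of the throttled range
  delay = scale * (mid - start) / (max - mid)
  printf "delay starts at %.1fG; at %.1fG the delay is %dus\n", start/2^30, mid/2^30, delay/1000
}'
```

At the midpoint the two factors cancel, so the delay equals `zfs_delay_scale` exactly: 500000 ns, i.e. the 500us jtara quotes.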
04:44 < jtara> what about zfs_dirty_data_max_max, zfs_dirty_data_max_max_percent and zfs_dirty_data_max_percent
04:44 < scastano> ok… so do I drop the delay so things happen faster? Or lower the percent?
04:44 < jtara> honestly i'd be inclined to just raise zfs_dirty_data_max, but you may need to raise other vars to make sure it sticks
04:44 < jtara> basically when you start dumping data into zfs it starts doing a txg commit to drain the pool
04:45 < jtara> but it can take a second or two to get going
04:45 < jtara> in that time, you smash into the dirty data max
04:45 < jtara> then the pool drains some, you get more data, and smash into it again
04:45 < jtara> it's aggravated because your block size is so large
04:45 < scastano> so the max_max is also 4G, max_max_percent is 25, max_percent is 10
04:46 < scastano> And would this be why I see the txg_sync processes at the top of my iotop list with 99% io?
04:46 < jtara> i'd try setting max_max to 20G, max to 20G, and max_percent to 25, and see what happens
04:46 < jtara> if that helps, you can do other things to make it work better with lower memory
04:46 < jtara> but the quickest way to solve hitting dirty data max is to raise it
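On ZFS on Linux these parameters live under `/sys/module/zfs/parameters/` at runtime and are made persistent via modprobe options; a sketch of jtara's suggested values (20G = 21474836480 bytes), keeping in mind that `zfs_dirty_data_max_max` only takes effect as a module load option:

```shell
# Runtime change (immediate, but still capped by the currently loaded
# zfs_dirty_data_max_max until the module is reloaded):
echo $((20 * 1024 * 1024 * 1024)) > /sys/module/zfs/parameters/zfs_dirty_data_max

# Persistent across reboots, including max_max (applied at module load):
cat >> /etc/modprobe.d/zfs.conf <<'EOF'
options zfs zfs_dirty_data_max=21474836480 zfs_dirty_data_max_max=21474836480 zfs_dirty_data_max_percent=25
EOF
```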
04:46 < scastano> ok… here goes…
04:47 < jtara> you'll need a reboot
04:47 < jtara> basically when you hit the max then the pool stops accepting writes completely
04:47 < jtara> until it drains some
04:48 < scastano> I can't set these live like the others?
04:48 < jtara> so it's like a huge locomotive with cars jolting to a stop; it takes time to get going again
04:48 < jtara> you can't set max_max dynamically, i think
04:48 < jtara> and it caps everything
04:50 < scastano> yeah, all it let me change live was the max, not max_max or max_percent.
04:50 < scastano> but I do see that change reflected… it now looks like it's letting it go up to about 17 - 18G
04:51 < jtara> i think max_max will still cap it, but you can tell by doing some big writes and looking at the txgs file again
04:51 < scastano> I'm also seeing peak write speeds bounce back up way higher now… over 5.5GB/s
04:51 < jtara> the "ndirty" column
04:51 < scastano> yeah, ndirty is going up to 17 - 18G now
04:51 < jtara> you want that number to stabilize around the midpoint of your dirty data throttle area or below
04:51 < jtara> if possible
04:52 < scastano> If I'm set to 20G, I should want that to hit around 10?
04:52 < scastano> And since I'm hitting 17, should I up it from 20 - 30G?
04:52 < jtara> if your dirty throttle starts at 60%, you kind of want to aim for it hitting 80% under steady state
04:52 < jtara> it's not super critical
04:52 < jtara> but you don't want it hitting 100%
04:53 < scastano> Ah.. ok… well it was basically pegging at 3.9 before… now the highest I've seen is just over 18 of the 20.
04:54 < jtara> if you want an easy test, set zfs_delay_min_dirty_percent to 40% and delay scale to 1000000
04:54 < jtara> i'm not suggesting you keep those values, just use them to see how to get your txg size to stabilize
04:54 < scastano> And my write speed is way up already… i'm letting this 15 minute fio run complete, then I'll run one where I'm not changing settings in the middle, set the other values you mentioned in zfs.conf for reboot and see what happens.
04:54 < jtara> once it's stable around 80% of your total dirty data space then you can work on other stuff :)
04:55 < jtara> great
04:55 < jtara> that's why i said this is fluid dynamics, you have to keep everything flowing for it to do well
04:56 < scastano> Changing the dirty_percent and delay_scale like that made it so ndirty is only hitting about 13 - 14G now.
04:57  * jtara nods...so now you have a stable flow of dirty data
04:57 < jtara> i bet zpool iostat -q will look cleaner
04:58 < scastano> it does, much cleaner now… there's operations on just about every line for the most part.
04:58 < scastano> Write speed still seems to be bouncing all over the place though… but still faster than it was before.
04:59 < scastano> so as I adjust these the goal is to keep it stable in the middle there…. ok… is there any downside to setting it larger than 20G now and turning the percent back up to 60?
05:00 < jtara> to really troubleshoot/tune this stuff, you have to follow the chain of data, like for async flow you need to first make sure dirty data stabilizes, then that the task queues are ok, then the zio throttle, then vdev aggregation, then vdev thread counts
05:00 < jtara> in that order
05:00 < jtara> most people start with the vdev thread counts and wonder where they went wrong
05:00 < scastano> Yup, that's basically what I was about to do!
05:01 < jtara> it's like a river, if there's a blockage upstream it doesn't matter what you do downstream
05:01 < jtara> the downside is mostly if you have multiple storage pools that are all processing a lot of throughput-oriented data
05:01 < jtara> if they're all consuming a lot of dirty data at once
05:02 < jtara> if you'll have one primary pool then you can raise it fairly high
05:02 < jtara> though don't go over 25% of your arc size
05:03 < jtara> you might be able to go back to 60% with the 1000000 delay scale actually
05:03 < scastano> Ah… ok… so yeah, this will really be a single pool for major throughput… this huge one with the 36 x 14TB disks, then maybe a small mirror of 1.5TB NVMe drives to run like 12 virtual machines over NFS
05:03  * jtara nods
05:04 < jtara> pools under mild load usually stabilize their dirty data between 0 and dirty_data_sync
05:04 < jtara> or close to it
05:04 < scastano> I've also left the arc_max default… so it's set at the 50% RAM mark.. which this is a storage box only, so I can up it much higher… 150G+ I would think
05:05 < jtara> i'd leave it where it is for now and think about that at the end
05:06 < jtara> anyway, other rules of thumb: zfs_vdev_async_write_active_max_dirty_percent should roughly equal zfs_delay_min_dirty_percent, and zfs_vdev_async_write_active_min_dirty_percent should be between dirty_data_sync and zfs_vdev_async_write_active_max_dirty_percent
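That rule of thumb could be expressed as module options like this. The numbers are illustrative only (they assume `zfs_delay_min_dirty_percent` is at 60), and the `zfs.conf` file name just follows the convention scastano uses elsewhere in the conversation:

```shell
# Illustrative only -- derive the numbers from your own delay settings:
cat >> /etc/modprobe.d/zfs.conf <<'EOF'
# max ~= zfs_delay_min_dirty_percent (60 in this example)
options zfs zfs_vdev_async_write_active_max_dirty_percent=60
# min somewhere between dirty_data_sync and the max above
options zfs zfs_vdev_async_write_active_min_dirty_percent=30
EOF
```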
05:06 < scastano> ok, so where I am now is a dirty_data_max of 20G, delay at 1000000 and dirty_percent at 60
05:07 < jtara> ok
05:08 < scastano> What's really weird now… performance is back down again and ndirty is basically at 0
05:09 < scastano> NEVERMIND… I'm an idiot, I started a read test!
05:09  * scastano is an idiot
05:10 < jtara> hah
05:11 < jtara> anyway zfs_vdev_async_write_active_min_dirty_percent and max are basically so you can divide your pool into latency and throughput modalities
05:11 < scastano> Might this affect my read performance at all?
05:12 < jtara> it shouldn't
05:12 < scastano> I'm looking for those settings now… and taking notes so I can put this all in my zfs.conf
05:12 < jtara> these are only dirty data parameters and data you read isn't dirty
05:13 < jtara> they're documented pretty well but basically they let you scale your write threads depending on the ndirty for the pool
05:13 < scastano> Ah ok.. I get that, so this is basically just my write buffer so to speak.
05:14 < jtara> so for throughput based workloads you will have a high ndirty above the start of delay_min
05:14 < jtara> for latency you have a low ndirty usually at zfs_dirty_data_sync or below
05:15 < jtara> anyway, test it, see how the txg ndirty looks, and start there
05:15 < scastano> Yeah… it looks like as I push things up a little bit… I went up in 2G chunks… right around 30G seems to be the sweet spot right now… I'm actually seeing near 100% utilization on my disks, and a per disk write speed of 180 - 200MB/s, but it's bouncing around a lot, down as low as 80 - 100 sometimes… the max on the hardware is 250MB/s, but I'm getting close.
05:16 < scastano> at 30G max, with 60 for my percent… the math is working out… it's sitting about 23/24G which is right around the 60% mark.
05:17 < jtara> you'd expect it a little higher but that's not bad at all - you need some headroom to accommodate surges
05:17 < jtara> so it's good
05:18 < scastano> Read performance has come way back, I was at 2.8 - 2.9GB/s before… with these settings I'm around 5.8GB/s which is where I was yesterday under Ubuntu 16.04… so I'm pumped about that.
05:18 < jtara> :)
05:18 < scastano> After this I need to run a read test, put all these things in my zfs.conf, reboot and test again.
05:18 < scastano> Then I can look at the other vdev settings you mentioned to make sure they're in the right range.
05:19 < scastano> At what point do you think I should push my arc_max up to where it can be?
05:20 < jtara> you probably can now but i wouldn't push it above 70% in any event.  linux likes its pagecache.
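A quick check of what that 70% ceiling works out to on this box (196G of RAM, per the conversation); `zfs_arc_max` takes a byte value:

```shell
# 70% of 196 GiB, expressed as a zfs_arc_max byte value:
awk 'BEGIN {
  ram = 196 * 2^30        # installed RAM in bytes
  arc = int(ram * 0.70)   # the 70% ceiling suggested above
  printf "zfs_arc_max=%d (~%.1f GiB)\n", arc, arc / 2^30
}'
```

That comes to roughly 137 GiB, which lines up with scastano's "north of 130G" below.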
05:20 < MilkmanDan> How much is necessary when zfs is doing all the page management?
05:21 < scastano> That's fine, that still puts me north of 130G, that's a good 45G more than I have now.
05:21 < jtara> double buffering for the win
05:22 < jtara> anyway the next step is sort of like opening up the exhaust on an engine and tuning the mix, lol
05:23 < scastano> See… now you're speaking my language! Hahaha
05:23 < jtara> yeah now you know that you have a steady fuel supply basically
05:23 < scastano> I just put in an X pipe and dual 3" outs on my 67 mustang with a 427 and let the carb run a little more rich.
05:23 < jtara> instead of jolts and spurts
05:26 < scastano> it looks like bringing the min_dirty_percent back down to 40 has evened out the writes even more, ndirty isn't much over 18/19 now and average speed seems to be around 5.3GB/s… we'll see where it ends. The last fio run ended at 4.2GB/s
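Gathered together, the values the session converged on would persist like this. This is a sketch, not a recommendation: the byte values encode the 30G dirty max and 128M zfetch distance mentioned above, and the right numbers are workload-specific:

```shell
# /etc/modprobe.d/zfs.conf -- settings from this session, applied at module load:
cat > /etc/modprobe.d/zfs.conf <<'EOF'
options zfs zfs_dirty_data_max=32212254720
options zfs zfs_dirty_data_max_max=32212254720
options zfs zfs_dirty_data_max_percent=25
options zfs zfs_delay_min_dirty_percent=40
options zfs zfs_delay_scale=1000000
options zfs zfetch_max_distance=134217728
EOF
```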
05:28  * jtara nods
05:28 < jtara> there are a lot more steps to do a full throughput tuning but i think you found the big issue anyway
05:28 < scastano> Dude, amazing… thank you! This has by far been one of the most informative bits about tuning and what the HELL I'm looking at!
05:29 < jtara> haha thanks
05:29 < jtara> i really need to write up an even-eviller tuning guide
05:29 < jtara> i really should write this down but basically for throughput: 1. tune dirty data to be stable; 2. open the zio throttle and raise write threads up pretty high; 3. lower sync taskq without impacting throughput; 4. tune zio throttle; 5. adjust aggregation; 6. adjust vdev thread counts
05:29 < scastano> Yeah, and now I know where to look next… but again, if it stays right here… I'm still over my 3.5GB/s goal to max out my network connection and I've switched to a newer OS that will be supported for 2 years longer so there's less to worry about as I move forward too.
05:30 < jtara> so there's step 1 ;)
05:31 < scastano> Yeah… I mean, now that I know what each piece does and what numbers I basically need to look for… now I can do some tweaking and testing a little to learn more.
05:31 < jtara> there are a lot of things that affect each other but once you know which ones it's a lot easier
05:32 < scastano> It's just a super hard process to understand when you're starting from zero… the other 10 or so places I'm using ZFS, it's on maybe 1 - 4TB systems, maybe 3 - 5 disks… did the install out of the distro packages, mounted the filesystems and moved on. Totally default.
05:33 < scastano> So when I found 36 250MB/s enterprise disks that couldn't beat my 4 and 6TB small pools in throughput and iops, I knew something was up, but had no idea where to start!
05:44 <@zfs> [zfsonlinux/zfs] Add TRIM support (#8419) comment by Matthew Ahrens <https://github.com/zfsonlinux/zfs/issues/8419>
06:00 < PMT_> jtara: edit access to the Open-ZFS wiki isn't that hard to get, if you actually wanted it there.
06:00 < PMT_> :P
06:01 < jtara> yeah, i really should, it's a good point
06:01 < jtara> i know some of my ideas are probably a bit controversial
06:03 < scastano> Dude, for sure go for it… if you're able to help me and explain things this quickly over just IRC, I'm sure others could benefit from this same info big time.