zfs write tuning
a guest
Mar 28th, 2019
04:22 < scastano> So… this is where I'm at and could use some pointers…. long story short….
04:22 < scastano> I imaged my system, reinstalled on Ubuntu 18.04, compiled ZFS v0.8 from source and when I was done, performance had dropped to basically 35% of what I was getting before.
04:23 < scastano> So… I kept Ubuntu 18.04, but rolled back to ZFS v0.7.13 and now most of my performance is back… 3.6GB/s on reads, but my write speed didn't come all the way back to the 4.5GB/s I was at before… it's stuck around 3GB/s
04:24 < scastano> The major difference is that during read, my disks are showing 100% utilization just like they used to… but under write, they barely stay at about 60%… no CPU peg, no major iowait…. it seems like something else is holding it back.
04:25 < scastano> I have a theory that maybe it could be the number of threads, like vdev limits for async and sync writes, and I'm wondering if that's possible and/or a good place to try some tuning.
04:25 < jtara> it's a complex enough system it's hard to troubleshoot from here, but i would look to see if your txg size is stabilizing under write load before doing anything
04:25 < scastano> I've been reading a ton on the IO scheduler, I think that might be a possible place where things are getting stuck.
04:25 < jtara> and use zpool iostat -q to see how many write threads you typically have
04:26 < jtara> it's very easy to get focused on the io scheduler, but if the pieces in front of it are not streaming it data fast enough, tuning it will not help
04:27 < scastano> Ok, good to know, I'm running zpool iostat -q now and starting a write benchmark
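While the benchmark runs, the queue depths can be watched roughly like this. The pool name `tank` is a stand-in, and the awk field positions are an assumption based on the usual `zpool iostat -q` column order (name, capacity, operations, bandwidth, then pend/activ pairs for syncq_read, syncq_write, asyncq_read, asyncq_write, scrubq_read):

```shell
# Live command:  zpool iostat -q tank 10   ("tank" is a stand-in pool name)
# The fabricated sample below stands in for two data lines of that output:
sample='tank 100G 400G 0 12.1K 0 3.0G 0 0 0 0 0 0 0 370 0 0
tank 100G 400G 0 11.4K 0 2.8G 0 0 0 0 0 0 1100 370 0 0'
# In this layout, asyncq_write pend and activ are the 14th and 15th fields:
echo "$sample" | awk '{ printf "asyncq_write pend=%s activ=%s\n", $14, $15 }'
```

The numbers scastano reports below (370 active, pending bouncing up to 1.1K) are exactly what those two columns show.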
04:27 < jtara> also consider a test on a one-vdev pool without raidz, just to see how it does there
04:27 < jtara> whatever thread settings you come up with will not be that much different than what will be optimal with a raidz
04:29 < jtara> past a certain point, if you add more threads, each thread just grabs a block to write before as many have readied in their buffers
04:29 < jtara> so you get more writes but smaller
04:29 < scastano> I can do both of those… so I'm getting numbers coming back, but I'm not sure what they mean or if they are good or bad… with just the write test going it looks like it bounces around a little bit, but there's a max of about 370 active in the asyncq_write, most of the time with nothing pending… other times it bounces up to 1.1K
04:29 < jtara> the bad news is that causes higher latency, the good news is it will degrade throughput less as your average write size shrinks
04:30 < jtara> there should be a "how many are actually queued to the device" stat in there somewhere
04:31 < scastano> it just shows iops, speed and then queue numbers for sync, async and scrub
04:32 < jtara> paste it somewhere and /proc/spl/kstat/{pool name}/txgs
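The kstat jtara points at conventionally lives at `/proc/spl/kstat/zfs/<pool>/txgs` on ZFS on Linux (note the `zfs/` path component his shorthand omits). A sketch of pulling the peak `ndirty` out of it, using fabricated sample lines in place of the real file, with the column order assumed to be txg, birth, state, ndirty, …:

```shell
# Live command:  cat /proc/spl/kstat/zfs/tank/txgs   ("tank" is a stand-in)
# Fabricated sample standing in for the real file:
sample='txg      birth          state ndirty     nread nwritten
100      5000000000000  C     4294967296 0     3221225472
101      5001000000000  C     2147483648 0     3221225472'
# ndirty is the 4th column; report the largest value seen, in GiB:
echo "$sample" | awk 'NR > 1 && $4 > max { max = $4 } END { printf "peak ndirty: %.1f GiB\n", max / 2^30 }'
```

A peak that sits pinned at `zfs_dirty_data_max` (4 GiB in the sample, as in this pool's case below) is the signature of the problem diagnosed in the rest of the conversation.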
04:32 < scastano> And now it seems to have flattened out… on 10 second intervals, 4 out of every 5 lines of output shows 0 for pending and active.
04:33 < scastano> iostat: https://imgur.com/a/g4a8fo3
04:35 < scastano> With txgs: https://imgur.com/a/g4a8fo3
04:35 < jtara> that's interesting.  you haven't raised zfs_dirty_data_sync have you?
04:36 < scastano> no… the only thing I changed was to set the zfetch_max_distance to 128M which was the sweet spot for performance when I was running under Ubuntu 16.04
04:37 < jtara> for what it's worth, i think you'll find the r/w gap settings much more useful than zfetch
04:37 < jtara> but
04:38 < jtara> well it looks like your dirty data is hitting 4G and then committing
04:38 < jtara> and i see surges in your iostat output
04:38 < jtara> i would check your dirty data throttle and make sure that number is in the middle of it
04:38 < sarnold> ahoy JollyRoger` :)
04:39 < scastano> Ok, that makes sense as to why it would burst around like that…
04:39 < JollyRoger`> Ahoy, sarnold! \o/
04:39 < jtara> if your dirty data throttle isn't set right then you will slam into the dirty data max variable
04:39 < jtara> and then nobody's happy
04:39 < scastano> so which dirty data setting should I be looking at, there's like 10 of them! Hahaha
04:40 < jtara> how big is your arc
04:40 < jtara> also zfs_dirty_data_max and zfs_delay_min_dirty_percent and zfs_delay_scale
04:41 < scastano> ARC is 94G out of the 196 I have installed in the server, I'm grabbing the other values now.
04:41 < jtara> the idea is to get your dirty data to sort of float around one area
04:41 < jtara> it's not healthy for it to either bounce or to hit the dirty data max hard
04:42 < jtara> i've no idea what else may be in play here but if you don't get that right, nothing will be right
04:42 < scastano> So the dirty data max is right at 4G, percent is 60, delay scale is 500,000
04:43 < jtara> ah, so it is hitting your dirty data max
04:43 < jtara> what that means is that delays will start at 60% of 4G (2.4G), and at the midpoint (3.2G) they will be delayed by 500us
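jtara's numbers check out against the OpenZFS write throttle's documented delay curve, where the delay grows as dirty data climbs from the delay threshold toward the max:

```shell
# Delay curve:  delay_ns = zfs_delay_scale * (dirty - start) / (max - dirty)
# where start = zfs_delay_min_dirty_percent% of zfs_dirty_data_max.
# Values taken from the conversation: max=4G, percent=60, scale=500000.
awk 'BEGIN {
  max   = 4 * 2^30          # zfs_dirty_data_max
  start = 0.60 * max        # delays begin at 2.4G
  scale = 500000            # zfs_delay_scale, in nanoseconds
  mid   = (start + max) / 2 # 3.2G, midpoint of the throttled range
  delay = scale * (mid - start) / (max - mid)
  printf "delay starts at %.1fG; at %.1fG the delay is %dus\n", start/2^30, mid/2^30, delay/1000
}'
```

At the midpoint the two factors cancel, so the delay equals `zfs_delay_scale` exactly: 500000 ns, i.e. the 500us jtara quotes.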
04:44 < jtara> what about zfs_dirty_data_max_max, zfs_dirty_data_max_max_percent and zfs_dirty_data_max_percent
04:44 < scastano> ok… so do I drop the delay so things happen faster? Or lower the percent?
04:44 < jtara> honestly i'd be inclined to just raise zfs_dirty_data_max, but you may need to raise other vars to make sure it sticks
04:44 < jtara> basically when you start dumping data into zfs it starts doing a txg commit to drain the pool
04:45 < jtara> but it can take a second or two to get going
04:45 < jtara> in that time, you smash into the dirty data max
04:45 < jtara> then the pool drains some, you get more data, and smash into it again
04:45 < jtara> it's aggravated because your block size is so large
04:45 < scastano> so the max_max is also 4G, max_max_percent is 25, max_percent is 10
04:46 < scastano> And would this be why I see the txg_sync processes at the top of my iotop list with 99% io?
04:46 < jtara> i'd try setting max_max to 20G, max to 20G, and max_percent to 25, and see what happens
04:46 < jtara> if that helps, you can do other things to make it work better with lower memory
04:46 < jtara> but the quickest way to solve hitting dirty data max is to raise it
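On ZFS on Linux these parameters live under `/sys/module/zfs/parameters/` at runtime and are made persistent via modprobe options; a sketch of jtara's suggested values (20G = 21474836480 bytes), keeping in mind that `zfs_dirty_data_max_max` only takes effect as a module load option:

```shell
# Runtime change (immediate, but still capped by the currently loaded
# zfs_dirty_data_max_max until the module is reloaded):
echo $((20 * 1024 * 1024 * 1024)) > /sys/module/zfs/parameters/zfs_dirty_data_max

# Persistent across reboots, including max_max (applied at module load):
cat >> /etc/modprobe.d/zfs.conf <<'EOF'
options zfs zfs_dirty_data_max=21474836480 zfs_dirty_data_max_max=21474836480 zfs_dirty_data_max_percent=25
EOF
```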
04:46 < scastano> ok… here goes…
04:47 < jtara> you'll need a reboot
04:47 < jtara> basically when you hit the max then the pool stops accepting writes completely
04:47 < jtara> until it drains some
04:48 < scastano> I can't set these live like the others?
04:48 < jtara> so it's like a huge locomotive with cars jolting to a stop; it takes time to get going again
04:48 < jtara> you can't set max_max dynamically, i think
04:48 < jtara> and it caps everything
04:50 < scastano> yeah, all it let me change live was the max, not max_max or max_percent.
04:50 < scastano> but I do see that change reflected… it now looks like it's letting it go up to about 17 - 18G
04:51 < jtara> i think max_max will still cap it, but you can tell by doing some big writes and looking at the txgs file again
04:51 < scastano> I'm also seeing peak write speeds bounce back up way higher now… over 5.5GB/s
04:51 < jtara> the "ndirty" column
04:51 < scastano> yeah, ndirty is going up to 17 - 18G now
04:51 < jtara> you want that number to stabilize around the midpoint of your dirty data throttle area or below
04:51 < jtara> if possible
04:52 < scastano> If I'm set to 20G, I should want that to hit around 10?
04:52 < scastano> And since I'm hitting 17, should I up it from 20 - 30G?
04:52 < jtara> if your dirty throttle starts at 60%, you kind of want to aim for it hitting 80% under steady state
04:52 < jtara> it's not super critical
04:52 < jtara> but you don't want it hitting 100%
04:53 < scastano> Ah.. ok… well it was basically pegging at 3.9 before… now the highest I've seen is just over 18 of the 20.
04:54 < jtara> if you want an easy test, set zfs_delay_min_dirty_percent to 40% and delay scale to 1000000
04:54 < jtara> i'm not suggesting you keep those values, just use them to see how to get your txg size to stabilize
04:54 < scastano> And my write speed is way up already… i'm letting this 15 minute fio run complete, then I'll run one where I'm not changing settings in the middle, set the other values you mentioned in zfs.conf for reboot and see what happens.
04:54 < jtara> once it's stable around 80% of your total dirty data space then you can work on other stuff :)
04:55 < jtara> great
04:55 < jtara> that's why i said this is fluid dynamics, you have to keep everything flowing for it to do well
04:56 < scastano> Changing the dirty_percent and delay_scale like that made it so ndirty is only hitting about 13 - 14G now.
04:57  * jtara nods...so now you have a stable flow of dirty data
04:57 < jtara> i bet zpool iostat -q will look cleaner
04:58 < scastano> it does, much cleaner now… there's operations on just about every line for the most part.
04:58 < scastano> Write speed still seems to be bouncing all over the place though… but still faster than it was before.
04:59 < scastano> so as I adjust these the goal is to keep it stable in the middle there…. ok… is there any downside to setting it larger than 20G now and turning the percent back up to 60?
05:00 < jtara> to really troubleshoot/tune this stuff, you have to follow the chain of data, like for async flow you need to first make sure dirty data stabilizes, then that the task queues are ok, then the zio throttle, then vdev aggregation, then vdev thread counts
05:00 < jtara> in that order
05:00 < jtara> most people start with the vdev thread counts and wonder where they went wrong
05:00 < scastano> Yup, that's basically what I was about to do!
05:01 < jtara> it's like a river, if there's a blockage upstream it doesn't matter what you do downstream
05:01 < jtara> the downside is mostly if you have multiple storage pools that are all processing a lot of throughput-oriented data
05:01 < jtara> if they're all consuming a lot of dirty data at once
05:02 < jtara> if you'll have one primary pool then you can raise it fairly high
05:02 < jtara> though don't go over 25% of your arc size
05:03 < jtara> you might be able to go back to 60% with the 1000000 delay scale actually
05:03 < scastano> Ah… ok… so yeah, this will really be a single pool for major throughput… this huge one with the 36 x 14TB disks, then maybe a small mirror of 1.5TB NVMe drives to run like 12 virtual machines over NFS
05:03  * jtara nods
05:04 < jtara> pools under mild load usually stabilize their dirty data between 0 and dirty_data_sync
05:04 < jtara> or close to it
05:04 < scastano> I've also left the arc_max default… so it's set at the 50% RAM mark.. which this is a storage box only, so I can up it much higher… 150G+ I would think
05:05 < jtara> i'd leave it where it is for now and think about that at the end
05:06 < jtara> anyway, other rules of thumb: zfs_vdev_async_write_active_max_dirty_percent should roughly equal zfs_delay_min_dirty_percent, and zfs_vdev_async_write_active_min_dirty_percent should be between dirty_data_sync and zfs_vdev_async_write_active_max_dirty_percent
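That rule of thumb could be expressed as module options like this. The numbers are illustrative only (they assume `zfs_delay_min_dirty_percent` is at 60), and the `zfs.conf` file name just follows the convention scastano uses elsewhere in the conversation:

```shell
# Illustrative only -- derive the numbers from your own delay settings:
cat >> /etc/modprobe.d/zfs.conf <<'EOF'
# max ~= zfs_delay_min_dirty_percent (60 in this example)
options zfs zfs_vdev_async_write_active_max_dirty_percent=60
# min somewhere between dirty_data_sync and the max above
options zfs zfs_vdev_async_write_active_min_dirty_percent=30
EOF
```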
05:06 < scastano> ok, so where I am now is a dirty_data_max of 20G, delay at 1000000 and dirty_percent at 60
05:07 < jtara> ok
05:08 < scastano> What's really weird now… performance is back down again and ndirty is basically at 0
05:09 < scastano> NEVERMIND… I'm an idiot, I started a read test!
05:09  * scastano is an idiot
05:10 < jtara> hah
05:11 < jtara> anyway zfs_vdev_async_write_active_min_dirty_percent and max are basically so you can divide your pool into latency and throughput modalities
05:11 < scastano> Might this affect my read performance at all?
05:12 < jtara> it shouldn't
05:12 < scastano> I'm looking for those settings now… and taking notes so I can put this all in my zfs.conf
05:12 < jtara> these are only dirty data parameters and data you read isn't dirty
05:13 < jtara> they're documented pretty well but basically they let you scale your write threads depending on the ndirty for the pool
05:13 < scastano> Ah ok.. I get that, so this is basically just my write buffer so to speak.
05:14 < jtara> so for throughput based workloads you will have a high ndirty above the start of delay_min
05:14 < jtara> for latency you have a low ndirty usually at zfs_dirty_data_sync or below
05:15 < jtara> anyway, test it, see how the txg ndirty looks, and start there
05:15 < scastano> Yeah… it looks like as I push things up a little bit… I went up in 2G chunks… right around 30G seems to be the sweet spot right now… I'm actually seeing near 100% utilization on my disks, and a per disk write speed of 180 - 200MB/s, but it's bouncing around a lot, down as low as 80 - 100 sometimes… the max on the hardware is 250MB/s, but I'm getting close.
05:16 < scastano> at 30G max, with 60 for my percent… the math is working out… it's sitting about 23/24G which is right around the 60% mark.
05:17 < jtara> you'd expect it a little higher but that's not bad at all - you need some headroom to accommodate surges
05:17 < jtara> so it's good
05:18 < scastano> Read performance has come way back, I was at 2.8 - 2.9GB/s before… with these settings I'm around 5.8GB/s which is where I was yesterday under Ubuntu 16.04… so I'm pumped about that.
05:18 < jtara> :)
05:18 < scastano> After this I need to run a read test, put all these things in my zfs.conf, reboot and test again.
05:18 < scastano> Then I can look at the other vdev settings you mentioned to make sure they're in the right range.
05:19 < scastano> At what point do you think I should push my arc_max up to where it can be?
05:20 < jtara> you probably can now but i wouldn't push it above 70% in any event.  linux likes its pagecache.
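A quick check of what that 70% ceiling works out to on this box (196G of RAM, per the conversation); `zfs_arc_max` takes a byte value:

```shell
# 70% of 196 GiB, expressed as a zfs_arc_max byte value:
awk 'BEGIN {
  ram = 196 * 2^30        # installed RAM in bytes
  arc = int(ram * 0.70)   # the 70% ceiling suggested above
  printf "zfs_arc_max=%d (~%.1f GiB)\n", arc, arc / 2^30
}'
```

That comes to roughly 137 GiB, which lines up with scastano's "north of 130G" below.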
05:20 < MilkmanDan> How much is necessary when zfs is doing all the page management?
05:21 < scastano> That's fine, that still puts me north of 130G, that's a good 45G more than I have now.
05:21 < jtara> double buffering for the win
05:22 < jtara> anyway the next step is sort of like opening up the exhaust on an engine and tuning the mix, lol
05:23 < scastano> See… now you're speaking my language! Hahaha
05:23 < jtara> yeah now you know that you have a steady fuel supply basically
05:23 < scastano> I just put in an X pipe and dual 3" outs on my 67 mustang with a 427 and let the carb run a little more rich.
05:23 < jtara> instead of jolts and spurts
05:26 < scastano> it looks like bringing the min_dirty_percent back down to 40 has evened out the writes even more, ndirty isn't much over 18/19 now and average speed seems to be around 5.3GB/s… we'll see where it ends. The last fio run ended at 4.2GB/s
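Gathered together, the values the session converged on would persist like this. This is a sketch, not a recommendation: the byte values encode the 30G dirty max and 128M zfetch distance mentioned above, and the right numbers are workload-specific:

```shell
# /etc/modprobe.d/zfs.conf -- settings from this session, applied at module load:
cat > /etc/modprobe.d/zfs.conf <<'EOF'
options zfs zfs_dirty_data_max=32212254720
options zfs zfs_dirty_data_max_max=32212254720
options zfs zfs_dirty_data_max_percent=25
options zfs zfs_delay_min_dirty_percent=40
options zfs zfs_delay_scale=1000000
options zfs zfetch_max_distance=134217728
EOF
```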
05:28  * jtara nods
05:28 < jtara> there are a lot more steps to do a full throughput tuning but i think you found the big issue anyway
05:28 < scastano> Dude, amazing… thank you! This has by far been one of the most informative bits about tuning and what the HELL I'm looking at!
05:29 < jtara> haha thanks
05:29 < jtara> i really need to write up an even-eviller tuning guide
05:29 < jtara> i really should write this down but basically for throughput: 1. tune dirty data to be stable; 2. open the zio throttle and raise write threads up pretty high; 3. lower sync taskq without impacting throughput; 4. tune zio throttle; 5. adjust aggregation; 6. adjust vdev thread counts
05:29 < scastano> Yeah, and now I know where to look next… but again, if it stays right here… I'm still over my 3.5GB/s goal to max out my network connection and I've switched to a newer OS that will be supported for 2 years longer so there's less to worry about as I move forward too.
05:30 < jtara> so there's step 1 ;)
05:31 < scastano> Yeah… I mean, now that I know what each piece does and what numbers I basically need to look for… now I can do some tweaking and testing a little to learn more.
05:31 < jtara> there are a lot of things that affect each other but once you know which ones it's a lot easier
05:32 < scastano> It's just a super hard process to understand when you're starting from zero… the other 10 or so places I'm using ZFS, it's on maybe 1 - 4TB systems, maybe 3 - 5 disks… did the install out of the distro packages, mounted the filesystems and moved on. Totally default.
05:33 < scastano> So when I found 36 250MB/s enterprise disks that couldn't beat my 4 and 6TB small pools in throughput and iops, I knew something was up, but had no idea where to start!
05:44 <@zfs> [zfsonlinux/zfs] Add TRIM support (#8419) comment by Matthew Ahrens <https://github.com/zfsonlinux/zfs/issues/8419>
06:00 < PMT_> jtara: edit access to the Open-ZFS wiki isn't that hard to get, if you actually wanted it there.
06:00 < PMT_> :P
06:01 < jtara> yeah, i really should, it's a good point
06:01 < jtara> i know some of my ideas are probably a bit controversial
06:03 < scastano> Dude, for sure go for it… if you're able to help me and explain things this quickly over just IRC, I'm sure others could benefit from this same info big time.