- 04:22 < scastano> So… this is where I'm at and could use some pointers…. long story short….
- 04:22 < scastano> I imaged my system, reinstalled on Ubuntu 18.04, compiled ZFS v0.8 from source and when I was done, performance had dropped to basically 35% of what I was getting before.
- 04:23 < scastano> So… I kept Ubuntu 18.04, but rolled back to ZFS v0.7.13 and now most of my performance is back… 3.6GB/s on reads, but my write speed didn't come all the way back to the 4.5GB/s I was at before… it's stuck around 3GB/s
- 04:24 < scastano> The major difference is that during read, my disks are showing 100% utilization just like they used to… but under write, they barely say about 60%… no CPU peg, no major iowait…. it seems like something else is holding it back.
- 04:25 < scastano> I have a theory that maybe it could be the number of threads, like vdev limits for async and sync writes, and I'm wondering if that's possible and/or a good place to try some tuning.
- 04:25 < jtara> it's a complex enough system it's hard to troubleshoot from here, but i would look to see if your txg size is stabilizing under write load before doing anything
- 04:25 < scastano> I've been reading a ton on the IO scheduler, I think that might be a possible place where things are getting stuck.
- 04:25 < jtara> and use zpool iostat -q to see how many write threads you typically have
- 04:26 < jtara> it's very easy to get focused on the io scheduler, but if the pieces in front of it are not streaming it data fast enough, tuning it will not help
- 04:27 < scastano> Ok, good to know, I'm running zpool iostat -q now and starting a write benchmark
- 04:27 < jtara> also consider a test on a one-vdev pool without raidz, just to see how it does there
- 04:27 < jtara> whatever thread settings you come up with will not be that much different than what will be optimal with a raidz
- 04:29 < jtara> past a certain point, if you add more threads, each thread just grabs a block to write before as much data has been readied in the buffers
- 04:29 < jtara> so you get more writes but smaller
- 04:29 < scastano> I can do both of those… so I'm getting numbers coming back, but I'm not sure what they mean or if they are good or bad… with just the write test going it looks like it bounces around a little bit, but there's a max of about 370 active in the asyncq_write, most of the time with nothing pending… other times it bounces up to 1.1K
- 04:29 < jtara> the bad news is that causes higher latency, the good news is it will degrade throughput less as your average write size shrinks
- 04:30 < jtara> there should be a "how many are actually queued to the device" stat in there somewhere
- 04:31 < scastano> it just shows iops, speed and then queue numbers for sync, async and scrub
- 04:32 < jtara> paste it somewhere and /proc/spl/kstat/{pool name}/txgs
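Watching the txgs kstat alongside `zpool iostat -q` is easier if you pull out just the ndirty column. A minimal sketch, assuming the 0.7.x column layout (`txg birth state ndirty nread nwritten ...`); the two-row sample here is fabricated purely to show the parsing, the real file lives at `/proc/spl/kstat/zfs/<pool>/txgs`:

```shell
# Fabricated excerpt of a txgs kstat file (layout assumed from ZFS 0.7.x).
txgs_sample=$(cat <<'EOF'
txg      birth            state ndirty       nread        nwritten     reads    writes   otime        qtime        wtime        stime
1234567  5489072756541    C     4294967296   0            3221225472   0        24576    5003231001   12023        8812         1820044210
1234568  5494075987542    C     4294967296   0            3221225472   0        24576    5001120031   11724        9033         1795033110
EOF
)
# Print the txg number and ndirty in GiB for each row past the header.
echo "$txgs_sample" | awk 'NR > 1 { printf "txg %s ndirty %.1fG\n", $1, $4 / (1024^3) }'
```

Against the live file you would replace the sample with `cat /proc/spl/kstat/zfs/<pool>/txgs` and watch whether ndirty stabilizes or keeps slamming into the max.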
- 04:32 < scastano> And now seems to have flattened out… on 10 second intervals, 4 out of every 5 lines of output shows 0 for pending and active.
- 04:33 < scastano> iostat: https://imgur.com/a/g4a8fo3
- 04:35 < scastano> With txgs: https://imgur.com/a/g4a8fo3
- 04:35 < jtara> that's interesting. you haven't raised zfs_dirty_data_sync have you?
- 04:36 < scastano> no… the only thing I changed was to set the zfetch_max_distance to 128M which was the sweet spot for performance when I was running under Ubuntu 16.04
- 04:37 < jtara> for what it's worth, i think you'll find the r/w gap settings much more useful than zfetch
- 04:37 < jtara> but
- 04:38 < jtara> well it looks like your dirty data is hitting 4G and then committing
- 04:38 < jtara> and i see surges in your iostat output
- 04:38 < jtara> i would check your dirty data throttle and make sure that number is in the middle of it
- 04:38 < sarnold> ahoy JollyRoger` :)
- 04:39 < scastano> Ok, that makes sense as to why it would burst around like that…
- 04:39 < JollyRoger`> Ahoy, sarnold! \o/
- 04:39 < jtara> if your dirty data throttle isn't set right then you will slam into the dirty data max variable
- 04:39 < jtara> and then nobody's happy
- 04:39 < scastano> so which dirty data setting should I be looking at, there's like 10 of them! Hahaha
- 04:40 < jtara> how big is your arc
- 04:40 < jtara> also zfs_dirty_data_max and zfs_delay_min_dirty_percent and zfs_delay_scale
- 04:41 < scastano> ARC is 94G out of the 196 I have installed in the server, I'm grabbing the other values now.
- 04:41 < jtara> the idea is to get your dirty data to sort of float around one area
- 04:41 < jtara> it's not healthy for it to either bounce or to hit the dirty data max hard
- 04:42 < jtara> i've no idea what else may be in play here but if you don't get that right, nothing will be right
- 04:42 < scastano> So the dirty data max is right at 4G, percent is 60, delay scale is 500,000
- 04:43 < jtara> ah, so it is hitting your dirty data max
- 04:43 < jtara> what that means is that delays will start at 60% of 4G (2.4G), and at the midpoint (3.2G) they will be delayed by 500us
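The arithmetic behind those numbers, as a quick sketch (the throttle starts at `zfs_delay_min_dirty_percent` of `zfs_dirty_data_max`, and at the midpoint of the throttled range the per-write delay roughly equals `zfs_delay_scale` nanoseconds):

```shell
# Reproduce jtara's figures from the reported settings.
dirty_data_max_gib=4        # zfs_dirty_data_max = 4G
delay_min_dirty_percent=60  # zfs_delay_min_dirty_percent
delay_scale_ns=500000       # zfs_delay_scale, in nanoseconds

awk -v max="$dirty_data_max_gib" -v pct="$delay_min_dirty_percent" \
    -v scale="$delay_scale_ns" 'BEGIN {
    start = max * pct / 100      # throttling begins here
    mid   = (start + max) / 2    # midpoint of the throttled range
    printf "delay starts at %.1fG, midpoint %.1fG, delay there ~%dus\n",
           start, mid, scale / 1000
}'
```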
- 04:44 < jtara> what about zfs_dirty_data_max_max, zfs_dirty_data_max_max_percent and zfs_dirty_data_max_percent
- 04:44 < scastano> ok… so do I drop the delay so things happen faster? Or lower the percent?
- 04:44 < jtara> honestly i'd be inclined to just raise zfs_dirty_data_max, but you may need to raise other vars to make sure it sticks
- 04:44 < jtara> basically when you start dumping data into zfs it starts doing a txg commit to drain the pool
- 04:45 < jtara> but it can take a second or two to get going
- 04:45 < jtara> in that time, you smash into the dirty data max
- 04:45 < jtara> then the pool drains some, you get more data, and smash into it again
- 04:45 < jtara> it's aggravated because your block size is so large
- 04:45 < scastano> so the max_max is also 4G, max_max_percent is 25, max_percent is 10
- 04:46 < scastano> And would this be why I see the txg_sync processes at the top of my iotop list with 99% io?
- 04:46 < jtara> i'd try setting max_max to 20G, max to 20G, and max_percent to 25, and see what happens
- 04:46 < jtara> if that helps, you can do other things to make it work better with lower memory
- 04:46 < jtara> but the quickest way to solve hitting dirty data max is to raise it
- 04:46 < scastano> ok… here goes…
- 04:47 < jtara> you'll need a reboot
- 04:47 < jtara> basically when you hit the max then the pool stops accepting writes completely
- 04:47 < jtara> until it drains some
- 04:48 < scastano> I can't set these live like the others?
- 04:48 < jtara> so it's like a huge locomotive with cars jolting to a stop; it takes time to get going again
- 04:48 < jtara> you can't set max_max dynamically, i think
- 04:48 < jtara> and it caps everything
- 04:50 < scastano> yeah, all it let me change live was the max, not max_max or max_percent.
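For the parameter that can be changed live, the value has to be in bytes. A small helper sketch that converts GiB and prints the command instead of running it (so nothing is written to `/sys` here; `zfs_dirty_data_max` accepts live writes, `zfs_dirty_data_max_max` does not):

```shell
# Convert a GiB count to the byte value the module parameter expects.
gib_to_bytes() { echo $(( $1 * 1024 * 1024 * 1024 )); }

# Emit (rather than execute) the live-tuning command for 20G.
bytes=$(gib_to_bytes 20)
echo "echo $bytes > /sys/module/zfs/parameters/zfs_dirty_data_max"
```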
- 04:50 < scastano> but I do see that change reflected… it now looks like it's letting it go up to about 17 - 18G
- 04:51 < jtara> i think max_max will still cap it, but you can tell by doing some big writes and looking at the txgs file again
- 04:51 < scastano> I'm also seeing peak write speeds bounce back up way higher now… over 5.5GB/s
- 04:51 < jtara> the "ndirty" column
- 04:51 < scastano> yeah, ndirty is going up to 17 - 18G now
- 04:51 < jtara> you want that number to stabilize around the midpoint of your dirty data throttle area or below
- 04:51 < jtara> if possible
- 04:52 < scastano> If I'm set to 20G, I should want that to hit around 10?
- 04:52 < scastano> And since I'm hitting 17, should I up it from 20 - 30G?
- 04:52 < jtara> if your dirty throttle starts at 60%, you kind of want to aim for it hitting 80% under steady state
- 04:52 < jtara> it's not super critical
- 04:52 < jtara> but you don't want it hitting 100%
- 04:53 < scastano> Ah… ok… well it was basically pegging at 3.9 before… now the highest I've seen is just over 18 of the 20.
- 04:54 < jtara> if you want an easy test, set zfs_delay_min_dirty_percent to 40% and delay scale to 1000000
- 04:54 < jtara> i'm not suggesting you keep those values, just use them to see how to get your txg size to stabilize
- 04:54 < scastano> And my write speed is way up already… i'm letting this 15 minute fio run complete, then I'll run one where I'm not changing settings in the middle, set the other values you mentioned in zfs.conf for reboot and see what happens.
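A sketch of what that `zfs.conf` might look like after the dust settles, using the values discussed above (byte values: 20G = 21474836480, 128M = 134217728). These are the settings from this conversation, not recommended defaults; adjust after your own testing:

```
# /etc/modprobe.d/zfs.conf -- applied at module load / reboot
options zfs zfs_dirty_data_max=21474836480
options zfs zfs_dirty_data_max_max=21474836480
options zfs zfs_dirty_data_max_percent=25
options zfs zfs_delay_min_dirty_percent=60
options zfs zfs_delay_scale=1000000
options zfs zfetch_max_distance=134217728
```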
- 04:54 < jtara> once it's stable around 80% of your total dirty data space then you can work on other stuff :)
- 04:55 < jtara> great
- 04:55 < jtara> that's why i said this is fluid dynamics, you have to keep everything flowing for it to do well
- 04:56 < scastano> Changing the dirty_percent and delay_scale like that made it so ndirty is only hitting about 13 - 14G now.
- 04:57 * jtara nods...so now you have a stable flow of dirty data
- 04:57 < jtara> i bet zpool iostat -q will look cleaner
- 04:58 < scastano> it does, much cleaner now… there's operations on just about every line for the most part.
- 04:58 < scastano> Write speed still seems to be bouncing all over the place though… but still faster than it was before.
- 04:59 < scastano> so as I adjust these the goal is to keep it stable in the middle there… ok… is there any downside to setting it larger than 20G now and turning the percent back up to 60?
- 05:00 < jtara> to really troubleshoot/tune this stuff, you have to follow the chain of data, like for async flow you need to first make sure dirty data stabilizes, then that the task queues are ok, then the zio throttle, then vdev aggregation, then vdev thread counts
- 05:00 < jtara> in that order
- 05:00 < jtara> most people start with the vdev thread counts and wonder where they went wrong
- 05:00 < scastano> Yup, that's basically what I was about to do!
- 05:01 < jtara> it's like a river, if there's a blockage upstream it doesn't matter what you do downstream
- 05:01 < jtara> the downside is mostly if you have multiple storage pools that are all processing a lot of throughput-oriented data
- 05:01 < jtara> if they're all consuming a lot of dirty data at once
- 05:02 < jtara> if you'll have one primary pool then you can raise it fairly high
- 05:02 < jtara> though don't go over 25% of your arc size
- 05:03 < jtara> you might be able to go back to 60% with the 1000000 delay scale actually
- 05:03 < scastano> Ah… ok… so yeah, this will really be a single pool for major throughput… this huge one with the 36 x 14TB disks, then maybe a small mirror of 1.5TB NVMe drives to run like 12 virtual machines over NFS
- 05:03 * jtara nods
- 05:04 < jtara> pools under mild load usually stabilize their dirty data between 0 and dirty_data_sync
- 05:04 < jtara> or close to it
- 05:04 < scastano> I've also left the arc_max default… so it's set at the 50% RAM mark… and this is a storage box only, so I can up it much higher… 150G+ I would think
- 05:05 < jtara> i'd leave it where it is for now and think about that at the end
- 05:06 < jtara> anyway, other rules of thumb, zfs_vdev_async_write_active_max_dirty_percent should roughly equal zfs_delay_min_dirty_percent, zfs_vdev_async_write_active_min_dirty_percent should be between dirty_data_sync and zfs_vdev_async_write_active_max_dirty_percent
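Those rules of thumb can be written down as a quick sanity check. A sketch with sample values (the 30 for the min percent is an assumption for illustration; the percent values are percent of `zfs_dirty_data_max`, and the "between dirty_data_sync and max" constraint is checked only loosely here):

```shell
# Rule-of-thumb check for the async write thread scaling parameters.
delay_min_dirty_percent=60
async_write_active_max_dirty_percent=60   # should roughly equal delay_min
async_write_active_min_dirty_percent=30   # sample value: below the max above

[ "$async_write_active_max_dirty_percent" -eq "$delay_min_dirty_percent" ] \
    && echo "async max ~= delay_min: ok"
[ "$async_write_active_min_dirty_percent" -lt "$async_write_active_max_dirty_percent" ] \
    && echo "async min below async max: ok"
```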
- 05:06 < scastano> ok, so where I am now is a dirty_data_max of 20G, delay at 1000000 and dirty_percent at 60
- 05:07 < jtara> ok
- 05:08 < scastano> What's really weird now… performance is back down again and ndirty is basically at 0
- 05:09 < scastano> NEVERMIND… I'm an idiot, I started a read test!
- 05:09 * scastano is an idiot
- 05:10 < jtara> hah
- 05:11 < jtara> anyway zfs_vdev_async_write_active_min_dirty_percent and max are basically so you can divide your pool into latency and throughput modalities
- 05:11 < scastano> Might this affect my read performance at all?
- 05:12 < jtara> it shouldn't
- 05:12 < scastano> I'm looking for those settings now… and taking notes so I can put this all in my zfs.conf
- 05:12 < jtara> these are only dirty data parameters and data you read isn't dirty
- 05:13 < jtara> they're documented pretty well but basically they let you scale your write threads depending on the ndirty for the pool
- 05:13 < scastano> Ah ok.. I get that, so this is basically just my write buffer so to speak.
- 05:14 < jtara> so for throughput based workloads you will have a high ndirty above the start of delay_min
- 05:14 < jtara> for latency you have a low ndirty usually at zfs_dirty_data_sync or below
- 05:15 < jtara> anyway, test it, see how the txg ndirty looks, and start there
- 05:15 < scastano> Yeah… it looks like as I push things up a little bit… I went up in 2G chunks… right around 30G seems to be the sweet spot right now… I'm actually seeing near 100% utilization on my disks, and a per-disk write speed of 180 - 200MB/s, but it's bouncing around a lot, down as low as 80 - 100 sometimes… the max on the hardware is 250MB/s, but I'm getting close.
- 05:16 < scastano> at 30G max, with 60 for my percent… the math is working out… it's sitting about 23/24G which is right around the 60% mark.
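Checking that steady state against the ~80% rule of thumb from earlier, as a quick calculation:

```shell
# Where does ndirty of 23 - 24G sit relative to a 30G zfs_dirty_data_max?
awk -v max=30 -v lo=23 -v hi=24 'BEGIN {
    printf "ndirty sits at %.0f%% - %.0f%% of zfs_dirty_data_max\n",
           100 * lo / max, 100 * hi / max
}'
```

That lands in the 77 - 80% range: above the 60% throttle start, with headroom below the max.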
- 05:17 < jtara> you'd expect it a little higher but that's not bad at all - you need some headroom to accommodate surges
- 05:17 < jtara> so it's good
- 05:18 < scastano> Read performance has come way back, I was at 2.8 - 2.9GB/s before… with these settings I'm around 5.8GB/s which is where I was yesterday under Ubuntu 16.04… so I'm pumped about that.
- 05:18 < jtara> :)
- 05:18 < scastano> After this I need to run a read test, put all these things in my zfs.conf, reboot and test again.
- 05:18 < scastano> Then I can look at the other vdev settings you mentioned to make sure they're in the right range.
- 05:19 < scastano> At what point do you think I should push my arc_max up to where it can be?
- 05:20 < jtara> you probably can now but i wouldn't push it above 70% in any event. linux likes its pagecache.
- 05:20 < MilkmanDan> How much is necessary when zfs is doing all the page management?
- 05:21 < scastano> That's fine, that still puts me north of 130G, that's a good 45G more than I have now.
- 05:21 < jtara> double buffering for the win
- 05:22 < jtara> anyway the next step is sort of like opening up the exhaust on an engine and tuning the mix, lol
- 05:23 < scastano> See… now you're speaking my language! Hahaha
- 05:23 < jtara> yeah now you know that you have a steady fuel supply basically
- 05:23 < scastano> I just put in an X pipe and dual 3" outs on my '67 Mustang with a 427 and let the carb run a little rich.
- 05:23 < jtara> instead of jolts and spurts
- 05:26 < scastano> it looks like bringing the min_dirty_percent back down to 40 has evened out the writes even more, ndirty isn't much over 18/19 now and average speed seems to be around 5.3GB/s… we'll see where it ends. The last fio run ended at 4.2GB/s
- 05:28 * jtara nods
- 05:28 < jtara> there are a lot more steps to do a full throughput tuning but i think you found the big issue anyway
- 05:28 < scastano> Dude, amazing… thank you! This has by far been one of the most informative bits about tuning and what the HELL I'm looking at!
- 05:29 < jtara> haha thanks
- 05:29 < jtara> i really need to write up an even-eviller tuning guide
- 05:29 < jtara> i really should write this down but basically for throughput: 1. tune dirty data to be stable; 2. open the zio throttle and raise write threads up pretty high; 3. lower sync taskq without impacting throughput; 4. tune zio throttle; 5. adjust aggregation; 6. adjust vdev thread counts
- 05:29 < scastano> Yeah, and now I know where to look next… but again, if it stays right here… I'm still over my 3.5GB/s goal to max out my network connection, and I've switched to a newer OS that will be supported for 2 years longer, so there's less to worry about as I move forward too.
- 05:30 < jtara> so there's step 1 ;)
- 05:31 < scastano> Yeah… I mean, now that I know what each piece does and what numbers I basically need to look for… now I can do some tweaking and testing a little to learn more.
- 05:31 < jtara> there are a lot of things that affect each other but once you know which ones it's a lot easier
- 05:32 < scastano> It's just a super hard process to understand when you're starting from zero… the other 10 or so places I'm using ZFS, it's on maybe 1 - 4TB systems, maybe 3 - 5 disks… did the install from the distro packages, mounted the filesystems and moved on. Totally default.
- 05:33 < scastano> So when I found 36 250MB/s enterprise disks that couldn't beat my 4 and 6TB small pools in throughput and iops, I knew something was up, but had no idea where to start!
- 05:44 <@zfs> [zfsonlinux/zfs] Add TRIM support (#8419) comment by Matthew Ahrens <https://github.com/zfsonlinux/zfs/issues/8419>
- 06:00 < PMT_> jtara: edit access to the Open-ZFS wiki isn't that hard to get, if you actually wanted it there.
- 06:00 < PMT_> :P
- 06:01 < jtara> yeah, i really should, it's a good point
- 06:01 < jtara> i know some of my ideas are probably a bit controversial
- 06:03 < scastano> Dude, for sure go for it… if you're able to help me and explain things this quickly over just IRC, I'm sure others could benefit from this same info big time.