hydrus dupe sort

>>14742
Maybe I should have waited after all, lol
Oh well, I'm sure I'll get plenty more. I thought the current system is pretty good, but I guess I can think of a few things. Of course this will be based on v374, but from the handful of duplicates I've sorted since I don't think too much has changed.

Before I get started, I encountered a rare but more serious problem a few days ago, now that I'm done with duplicates. I was importing some video clips I got with jdownloader, but I didn't have much space on my external drive, so I was picking the best quality videos out of the folders, deleting the rest, and then importing them. I forgot to empty the recycle bin, so the disk actually filled up while I was importing (and thus copying) the videos. Hydrus recognized the problem and threw up a descriptive error message, but the UI started acting weird: I couldn't pause the file import process and couldn't really close the client, so I ended up just ending the process. When I did that it left something still running, some script, and threw an error when I tried to reopen. Sorry that I don't know what that was, but I think Hydrus itself was noticing it still running, or possibly Python was. It wasn't like the messages from just closing Hydrus wrong; I think something was still trying to copy even though the disk was full. Unmounting and remounting the Veracrypt partition didn't fix the problem, but restarting the machine did, so it wasn't really reading from disk, just something in memory. Not sure if that's helpful with so little detail but thought I'd mention it.

Now for dupe stuff, I'm not sure if the option moved or was removed, but I think I used to be able to set the batch size for duplicates. It's been 250 since some version a while ago, and I couldn't find the option to change it back. I prefer 50 at a time like the old preset, or maybe 100 or 150. 250 can be pretty exhausting and also makes the client hang for a bit when saving the changes, especially if you took long enough doing it (multitasking during a 15 minute period for instance) for the client's main window to get backgrounded by whatever Windows' program priority manager thing is. My computer is also a little old, hasn't had hardware upgrades for five years or so, but it's a decent budget tower. If the comparison window is fullscreened, like it usually is, this also hangs, and very occasionally it will actually crash if you click or tab out, and you can lose sorting progress and make your database unhappy when you reopen the client. Only happened once or twice the whole time, but I'd love to be able to change the size again, or if I already can, can I have a reminder where the setting is? I know I can just close at any time for a partial, but it's very unsatisfying, and it makes the already boring experience of duplicate detection even worse when you have to settle for a partial or do another 180 pairs for your chemical reward.

I was the guy who suggested the background color change way back when, and made that post about having all duplicates present at once. I'm not sure that's actually the way to go anymore since it could get confusing with related alternates and straight quality contrasts (you could end up with multiple alternates with multiple versions in one "exact" match for minor art edits or aspect ratio changes, and have to somehow make a relational web between them within one "choice"). However, it would be helpful to have all of a group of exact duplicates of one file appear together as a clump, in series. Sometimes I'd get three or four choices on the same image in a row, other times I wouldn't, or I'd get some right away, then some later on in the group. For instance, sometimes I'd have three or four instances of a picture, different filesizes, resolutions, pixelation, but then just one low-quality alternate version. So of course I tag it as an alternate, but then the picture it's an alternate to is obsoleted by a better quality one. I know the relationship carries over to the new pair, but what I don't know is where the higher quality version of the alternate is. Does it not exist to begin with in my database? Did I already accidentally obsolete it with the lower quality alternate, or with a copy of the original? Did I already set it as an alternate, but it hasn't gotten around to being compared to the other alternate yet? If so, will it, or are they considered separate alternates even if they're close enough to be considered alternates under this distance otherwise? Has it just not come along yet? It can give you a bit of sorting anxiety.
What I'd suggest is a process which might lead to a lag or a bit of a loading bar when opening up the duplicate sorting panel, but would also be more comprehensive: when picking duplicates, the code first finds an image with a duplicate, then it looks at that duplicate and finds any other duplicates for that duplicate at that same comparison distance, and so on, until it has created a complete web of potential duplicate relationships centered around that starting image. I'm not sure how efficient this is because I don't know how the database is set up, but it should be possible since we can already look up the duplicates for any one image, this is just that but with nesting lookup loops. For most images, that would be just two to four images total, so one to six potential pairs. For others it could potentially be huge, but you could for sanity and brevity dynamically limit the total number of pairs to include to the batch quantity of duplicates for the filter, or half of it or something. This can be extremely dreary and already just by chance or programming I often encountered long unbroken stretches of potential duplicates (mostly on wallpapers, anything 95% black with just some text or a logo in the middle will often come up as duplicates of other wallpapers in the same vein, and sometimes even different colored wallpapers would too, and not always just retints of the same thing), but it also lets you be more sure of what dupes you have - at least at that comparison difference. I haven't even begun with a distance greater than 0. Probably I'll find a lot of alternates in a sea of unrelated stuff is my guess for when I do.
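
To make the nested lookup idea concrete, here's a rough sketch in plain Python. It is not based on how the Hydrus database actually works, which I don't know. It assumes the potential duplicates at one distance are available as a simple list of file pairs, and the names `build_adjacency`, `duplicate_web` and `max_pairs` are all made up for illustration. It walks outward from a starting file until the whole web is collected or the pair cap is hit.

```python
from collections import defaultdict, deque

def build_adjacency(pairs):
    """Turn (file_a, file_b) potential-duplicate pairs into a neighbor map."""
    neighbors = defaultdict(set)
    for a, b in pairs:
        neighbors[a].add(b)
        neighbors[b].add(a)
    return neighbors

def duplicate_web(start, neighbors, max_pairs=None):
    """Collect every potential pair reachable from `start`, breadth-first.

    Stops adding pairs once `max_pairs` is reached (None means no cap).
    Returns (files_in_web, pairs_in_web).
    """
    seen_files = {start}
    seen_pairs = set()
    web_pairs = []
    queue = deque([start])
    while queue:
        current = queue.popleft()
        # sorted() only keeps the output order stable between runs
        for other in sorted(neighbors.get(current, ())):
            pair = frozenset((current, other))
            if pair in seen_pairs:
                continue
            if max_pairs is not None and len(web_pairs) >= max_pairs:
                return seen_files, web_pairs
            seen_pairs.add(pair)
            web_pairs.append((current, other))
            if other not in seen_files:
                seen_files.add(other)
                queue.append(other)
    return seen_files, web_pairs

# Hypothetical data: three versions of one picture, plus an unrelated pair.
pairs = [("a", "b"), ("b", "c"), ("c", "a"), ("x", "y")]
files, web = duplicate_web("a", build_adjacency(pairs))
```

The filter could then present everything in `web` back to back before moving on to the next starting file, with `max_pairs` set to the batch size (or half of it) so the wallpaper-style mega groups don't eat the whole session.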

Other than that, the duplicates window struggles a bit with the Windows magnifier, which is very necessary for me to use when I'm evaluating up to 16k images on a small 4k monitor. This is partially unavoidable because the magnifier works poorly with everything, including browser and file windows, but it has some more weird behavior with Hydrus, or maybe just with python. It works best when you open the dupes window maximized, let it load the first image, then open the magnifier (docked to top of screen), bumping the window down and resizing it. Hydrus has a little trouble with the mouse, when it dips up into the magnifier area of the screen, it sometimes makes the cursor unclickable if you hover back down to the window bar or even the top part of the dupe area sometimes. This might just be a problem with magnifier itself though.

Speaking of big files and older hardware, when I scroll from a lower res initial image to a really large second image, or between two large images, the filter window lags but the buttons are still responsive and clicking still works. This can cause a user to left click to select an image as the preferred duplicate while the other version of the image is still displayed, making the exact wrong choice. I think the file info (size, dimensions, tags, etc) also shows up before the image is replaced, so it's tricky if you're not watching for it, especially if it's much larger than the first image shown and looks nearly or exactly the same at the screen res used for dupe sorting (no bad pixelation). The larger image can also be worse, over-upscaled without AI or doctored so it's blurry and "soft", or straight up pixelated and blown up but then resaved as another filetype so the filter thinks it's "good quality" because there's no apparent compression anymore. This is also the most common action in the duplicates filter: scroll up, scroll down (or arrow keys), then scroll to the preferred image and click. For images larger than say 1024x1024 you might want to have it load in a thumbnail somehow, with a bar or timer showing that the full image is still loading and will be visible soon. Now that I think about it this might be the main issue with the dupe sorter right now. I always go back when I'm not sure so I don't think it's cost me any highest quality high res images, but it can be dangerous for the uninitiated. I think actually the simplest way to do this would be to code a few things into switching between images:
 - A reliable, slight delay of a quarter or fifth of a second or so, during which it both takes no new scroll/key inputs and locks the filesize etc data box from changing. Keeps you from giving yourself a seizure on small images or hanging the client on large images when your finger slips and you scroll 20 notches instantly. Also lets the person using the program get used to it and know to expect the delay.
 - A function to remove the previous image immediately after the scroll or key is received, during this locked period. A blank area will spook or confuse a user, and a picture could make them think it's a duplicate, so maybe just a larger Hydrus logo with some opacity should appear in the gray background between pics. Maybe during the fifth of a second window some loading of the next image could appear.
 - If the image isn't ready within that fifth of a second, re-enable the swapping inputs so they can swap off if they want. Sometimes a really big image will actually take some seconds to load and display if Hydrus is being deprived of RAM/CPU or something, and I think sometimes the dupe window itself as a whole can freeze or halt during this if it's big enough too. Being able to scroll back if it's not already freezing lets the user close some things or just get ready to load the big image. Also it can be hard to remember details of the previous image visually if there's a long gap, which is why I think it shouldn't be more than a fraction of a second. It's super useful to "watch the picture change" while scrolling between as it is right now. Also, after the delay, empty out the box of the old image's file data if the new file hasn't already appeared and overwritten it.
 - Once the new image has finished loading and is confirmed displayed, put in the new file info in the box.
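
Roughly what I mean by the list above, as a toy model in plain Python with no GUI code at all. Everything here is made up for illustration (`SwitchGuard`, its method names, the "placeholder" marker), timestamps are passed in by hand so it's easy to follow, and `can_select` is an extra I added on the assumption that clicking to pick a file should be refused while nothing real is on screen.

```python
class SwitchGuard:
    """Toy model of the proposed image-switch behaviour (not Hydrus code)."""

    def __init__(self, lock_seconds=0.2):
        self.lock_seconds = lock_seconds
        self.locked_until = 0.0
        self.canvas = None      # what is drawn: a file id, or "placeholder"
        self.info_box = None    # file info currently shown, None when empty
        self.pending = None     # file id that is still loading, if any

    def show_immediately(self, file_id, info):
        """Put an already-loaded image and its info on screen."""
        self.canvas, self.info_box, self.pending = file_id, info, None

    def request_switch(self, file_id, now):
        """Handle a scroll/key input. Returns False if the input is ignored."""
        if now < self.locked_until:
            return False                  # still inside the short lock window
        self.locked_until = now + self.lock_seconds
        self.canvas = "placeholder"       # old image is removed right away
        self.pending = file_id            # info box keeps the old data for now
        return True

    def tick(self, now):
        """Call periodically: once the lock expires, clear the stale info."""
        if self.pending is not None and now >= self.locked_until:
            self.info_box = None

    def image_loaded(self, file_id, info):
        """Loader callback: only now do the picture and its info appear."""
        if file_id != self.pending:
            return                        # stale load, the user moved on
        self.canvas, self.info_box, self.pending = file_id, info, None

    def can_select(self):
        """Picking a file is only allowed when a real image is displayed."""
        return self.pending is None and self.canvas not in (None, "placeholder")

guard = SwitchGuard(lock_seconds=0.2)
guard.show_immediately("A", "A: 800x600, 120KB")
guard.request_switch("B", now=1.0)   # accepted, placeholder goes up
guard.request_switch("A", now=1.1)   # ignored, still locked
guard.tick(now=1.3)                  # B still not ready, old info is cleared
guard.request_switch("A", now=1.3)   # lock expired, user can swap off again
```

The point of routing everything through one small object like this is that the picture, the info box and the "is it safe to click" answer can never disagree with each other.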

Another thing to mention, and sorry if this was already addressed during the versions I missed, I zoom in using Ctrl+Mousewheel, but I have to be very careful because the zoom task seems to be unoptimized so on a larger image it can hang the dupe window (or the file view window) and the Hydrus client window itself at the same time sometimes, for several seconds. Also the way the zoom function works now, each mousewheel tick doubles the zoom level rather than some lesser gradation, so a little scrolling which on most clients would zoom in 3x or 5x or so zooms into 30 blown up pixels in Hydrus. The main problem, though, is that the client reads each and every input, queues them, freezes the window while it figures out what it's doing, then loads each zoomed in slice in sequence while figuring out the next one, so the window hangs and freezes for a while then stutters inwards frame by frame accelerating as it goes since the higher zoom levels usually go quicker. Not a big deal but it's disorienting when you forget that it's like that and give an extra hefty ctrl-scroll to check some detail you can't really see even with magnifier.
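
For the zoom, something like this is what I have in mind, as a sketch only. I don't know how the client handles wheel events internally, so `coalesce_zoom` and all its parameters are hypothetical. The idea is to add up all the wheel ticks that arrived while the window was busy and render once at the final zoom, with a gentler per-tick step than doubling.

```python
def coalesce_zoom(current_zoom, wheel_ticks, step=1.25, min_zoom=0.05, max_zoom=32.0):
    """Collapse a burst of wheel ticks into one target zoom level.

    wheel_ticks is a list of +1 (zoom in) / -1 (zoom out) events that
    queued up; only the net movement matters, so only one redraw is needed.
    """
    net = sum(wheel_ticks)
    target = current_zoom * (step ** net)
    # Clamp so a hefty scroll can't run off to absurd zoom levels
    return max(min_zoom, min(max_zoom, target))

# Five ticks in with doubling lands at 32x, with a 1.25 step at about 3x.
doubling = coalesce_zoom(1.0, [1, 1, 1, 1, 1], step=2.0)
gentle = coalesce_zoom(1.0, [1, 1, 1, 1, 1], step=1.25)
```

With a 1.25 step, five ticks gets you about 3x, which is closer to what other viewers do, and the intermediate zoom levels never get rendered at all.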

If you select to delete a pair of files (send to Hydrus trash), or just one probably, then you change your mind and go back to that pair, it saves your decision, so if you decide "yeah, I'll delete them after all" it presents you with the option to delete them from disk rather than just to trash, just like if you were looking through thumbnails or something and selected to delete them a second time. I'm not sure if you can now page back and forth between pairs you've already sorted somehow, but in my monkey brain going back a pair undoes the decision for that pair. It's no big deal if it actually doesn't, since it's going to overwrite that with the new decision if I change my mind about a related/unrelated or a best-of-two, but for deletion it can leave me with at least the appearance that to move on in my queue without closing midway and finishing a partial, I have to either keep one of the images or directly delete both with no Hydrus trash.

On that note, if it does save choices even when you go back, but allows them to be overwritten, is there a way we could skip around back and forth freely among decisions we've already made, checking them? That would let us take closer looks at larger-than-two groups of files for decision purposes without the UI getting to be a mess, if we could just look at our past decisions between related pairs in a group freely.

If we ever have any auto dupe features, could we have an auto culling by dimensions/ratio? For instance, if I could custom set up thresholds for a button which, when pressed, would look through the latest searched tree of 0 distance duplicates and automatically delete any files which met all of these conditions:
 - Height and width both under 600 pixels, and
 - Has at least one potential duplicate with at least 250% pixel height, and
 - That same duplicate has at least 250% pixel width, and
 - That same duplicate is the same aspect ratio (exact or within a tolerance)
The idea being that no matter how blurry it may be, the larger picture is either better than the 600 pixel version, or if not I don't even want whatever this picture is to begin with so at least my database is now one pic smaller without me even seeing it individually. Of course my tolerance for pictures for ants is different to other peoples', hence the customization of criteria and thresholds.
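
In code the rule could look something like this. It's only a sketch of the check itself: `FileDims`, `should_cull` and the parameter names are made up, the defaults are just my thresholds from above, and how the client would fetch the potential duplicates for a file is left out entirely.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class FileDims:
    name: str
    width: int
    height: int

def same_ratio(a, b, tolerance=0.01):
    """True if the two aspect ratios match within a relative tolerance."""
    ratio_a = a.width / a.height
    ratio_b = b.width / b.height
    return abs(ratio_a - ratio_b) <= tolerance * ratio_b

def should_cull(small, candidates, max_side=600, min_scale=2.5, tolerance=0.01):
    """Apply the four conditions: small file, and one much larger same-ratio dupe."""
    # Condition 1: height and width both under the size threshold
    if small.width >= max_side or small.height >= max_side:
        return False
    # Conditions 2-4 must all hold for the SAME potential duplicate
    return any(
        big.height >= min_scale * small.height
        and big.width >= min_scale * small.width
        and same_ratio(small, big, tolerance)
        for big in candidates
    )

tiny = FileDims("tiny.jpg", 400, 300)
dupes = [FileDims("cropped.png", 1600, 900), FileDims("full.png", 1600, 1200)]
cull_it = should_cull(tiny, dupes)   # full.png satisfies all the conditions
```

Keeping the thresholds as arguments is the whole point, since everyone would want to set their own.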

That's all I can think of for now.