training writeup

The metadata can be pulled from the LoRA itself, but here's a writeup of the settings anyway.

DATA:
~100 images of school uniform x5
~40 images of frilled bikini x5
~40 images of nsfw x4
~20 images of random outfits x8
~15 images that I wanted deprioritized x1
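
A rough sketch in Python of the effective per-epoch image count, assuming the xN numbers above are kohya-style folder repeats (folders named like "5_school uniform"); the counts are approximate, so treat the result as ballpark:

# (approximate image count, repeats) per concept, using the numbers above
dataset = {
    "school uniform": (100, 5),
    "frilled bikini": (40, 5),
    "nsfw": (40, 4),
    "random outfits": (20, 8),
    "deprioritized": (15, 1),
}

effective_images_per_epoch = sum(count * repeats for count, repeats in dataset.values())
print(effective_images_per_epoch)  # ~1035 with these rough counts, same ballpark as the ~950/epoch below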


TRAINING COUNTS:
batch size = 3
this ran for 7 epochs
each epoch was around ~950 images trained
epochs 6 and 7 were beginning to show signs of overcooking, so I left them out
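
The step math written out, for reference (assumption on my part: "steps" in this writeup count images seen, not optimizer steps):

images_per_epoch = 950
batch_size = 3
epochs_kept = 5  # epochs 6 and 7 dropped for overcooking

optimizer_steps_per_epoch = images_per_epoch // batch_size       # ~316 at batch size 3
images_seen_across_kept_epochs = images_per_epoch * epochs_kept  # ~4750
print(optimizer_steps_per_epoch, images_seen_across_kept_epochs)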

Images were all tagged with
ajitani hifumi, <outfit>
using shuffle tags + keep_tokens 2.
I saw a distinct quality increase after doing this.
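
A minimal Python sketch of what I understand shuffle tags + keep_tokens 2 to do at train time (my approximation, not the trainer's actual code; the tags after the outfit are made-up examples):

import random

def shuffled_caption(caption, keep_tokens=2):
    tags = [t.strip() for t in caption.split(",")]
    fixed, rest = tags[:keep_tokens], tags[keep_tokens:]  # first 2 tags stay put
    random.shuffle(rest)                                  # the rest get reshuffled every time
    return ", ".join(fixed + rest)

print(shuffled_caption("ajitani hifumi, school uniform, smile, outdoors"))
# always starts with "ajitani hifumi, school uniform", trailing tags in random order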


SCHEDULER: cosine_with_restarts
I used cosine_with_restarts because I have no idea how to pick a scheduler
and BA anon was already using it. I haven't experimented with this yet.
WARMUP RATIO: again, copied BA anon. Haven't experimented.
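
To visualize what this scheduler does, a rough Python approximation of cosine with restarts plus warmup (not the trainer's actual implementation; the warmup ratio and restart count here are placeholders, not my real values):

import math

def lr_at(step, total_steps, base_lr, warmup_ratio=0.05, restarts=3):
    warmup_steps = max(1, int(total_steps * warmup_ratio))
    if step < warmup_steps:
        return base_lr * step / warmup_steps                        # linear warmup
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    cycle = (progress * restarts) % 1.0                             # decay resets `restarts` times
    return base_lr * 0.5 * (1 + math.cos(math.pi * cycle))

for s in (0, 100, 500, 1000, 1500, 1999):
    print(s, round(lr_at(s, 2000, 2e-4), 6))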


LEARNING RATE:
LR + unet LR: 2e-4
text encoder LR: 1e-4
After a lot of trial and error, I settled on a base LR of 1e-4
multiplied by two-thirds of the batch size (2/3 * 3 = 2, so 1e-4 -> 2e-4).
I read/heard somewhere that the text encoder LR should be half of the other LR.
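
The rule, written out in Python:

base_lr = 1e-4
batch_size = 3

unet_lr = base_lr * (2 / 3) * batch_size  # 1e-4 * 2 = 2e-4
text_lr = unet_lr / 2                     # 1e-4, per the "half the unet LR" rule of thumb
print(unet_lr, text_lr)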


BAD LEARNING RATE:
LR + unet LR: 3e-4
text encoder LR: 1.5e-4
This resulted in random stuff popping up where it shouldn't:
drinks appearing at school / in her hands
straps appearing on her clothes
?? objects just showing up

DIM/ALPHA:
I tried dim=128 many, many times, and dim=64 a couple of times.
dim=64 produced great results, but the quality was lacking compared to the dim=128 runs.
I don't know if that's because my dataset is too large/varied/multi-concept or because
I need to tune for it better, but I figured I'll leave further exploration of that for later.
I would definitely recommend anons give it a try with smaller-data loras; it cooks super
fast with amazing results.

I still don't know exactly what alpha does, since the technical details go over my head, but it does
seem to scale training speed relative to dim: the lower the alpha compared to dim, the slower things cook (see the sketch after this list).
128/128 - the way it was always done in the past; I found that this overcooks too easily
and makes it harder (for me) to pin down quality.
128/64 - this felt like it gave significantly more breathing room compared to 128/128.
128/32 - my early trials showed alpha=32 producing great results, but it took a lot more time to train,
so I stopped experimenting with it out of impatience. Another anon prefers 128/32 and gets their best results there.
128/1 - memes. This being a default seems wrong. At least one technical anon mentioned it doesn't make sense to use 128/1 and that
you should be lowering dim at that point instead.
Even training at 3x the LRs for 10 epochs (10,000+ steps), it was still undercooked,
and with all that it started showing weird flaws as well.
If there's a way to get value out of this, I'm not the one figuring it out.
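
The sketch, in Python: as far as I understand it, the LoRA update gets multiplied by alpha/dim, so lowering alpha shrinks every weight update, which is why low alpha cooks slower (my understanding, not the actual network code):

for dim, alpha in [(128, 128), (128, 64), (128, 32), (128, 1), (64, 64)]:
    scale = alpha / dim  # factor applied to the LoRA update, as I understand it
    print(f"dim={dim:3d} alpha={alpha:3d} -> update scale {scale:.4f}")
# 128/128 -> 1.00, 128/64 -> 0.50, 128/32 -> 0.25, 128/1 -> 0.0078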


MIXED PRECISION: BP, SAVE PRECISION: FP
When I use FP for mixed precision I run out of VRAM, so it's not really a choice unless I want
to reduce my batch size. I prioritized experimenting with things that don't slow me down
significantly.


FLIP AUGMENT: OFF
I tried the flip augment flag many times. It has a significant impact:
it boosts the speed at which the data converges significantly and sometimes
gets better quality. I think it might make it too easy to overcook, though.
I haven't tested it since further data cleanups + reduced LRs, so it might still have value,
but I was often erring towards overcooking, so I turned it off in the most recent bakes.


COLOR AUG: OFF
I tried experimenting with this. It does things. It changes something.
But I can't tell you what exactly it did or whether it was beneficial enough to be worth using.
Stopped experimenting with it after one trial since it didn't amaze.


RESOLUTION: 512,512
I did one trial of 768,768 and it APPEARS to have brought some nice improvements,
but it required dropping my batch size to 1 and took 3+ hours to train.
It would take further experimenting to find the right combination of settings to get the
right mileage out of this, and since it takes me aeons to train at it, I dropped back to 512.
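
For what it's worth, the slowdown is roughly what the pixel count alone predicts (back-of-envelope, not a benchmark):

print((768 * 768) / (512 * 512))  # 2.25x the pixels per image vs 512x512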