Experimental Training Code with Full Finetune and LoRA Support, Mimicking the Features Implemented by Tdrussell (for Linux or WSL2)
I just pushed to my repo all the changes needed to partially emulate what Tdrussell described in his comment.
Features for Anima leveraged from baseline diffusion-pipe:
- Finetuning on single/multi-GPU with each GPU holding a copy of the model (DDP-style training)
- Finetuning on multi-GPU with sharding across GPUs: the model is sharded across X GPUs, but it's a single copy, controlled by pipeline_stages
- LoRA training on single GPU, multi-GPU, or sharded across GPUs; same rules as above
- Pure tags training
- Pure captions training; these must have the suffix _nl.txt
- Mixed captions + tags training
- Mixed tags + captions training; the order in which both sets are "fused" is swapped
- Combination of all four methods, with the split controlled via percentages.
- Caching text encoder outputs. DO NOT do this if you want shuffle, dropout, or mixed captions!
- Tag shuffle for danbooru-style tags.
- Sentence shuffle for captions; it will shuffle by segments delimited by periods.
Example of a natural language caption:
"A medium close-up of a young anime girl with long dark blue hair and large, shimmering blue-green eyes. She has a shy, embarrassed expression with a noticeable blush on her cheeks and is looking up at the viewer. Her hands are clasped together near her chest. She wears a school uniform consisting of a white shirt, a teal sweater vest, and a red ribbon tie, with a green armband on her left arm. The background suggests a classroom setting with a desk and curtains."
The shuffling is done by segments like "A medium close-up of a young anime girl with long dark blue hair and large, shimmering blue-green eyes."
- train_llm_adapter: you can decide to train the LLMAdapter used by the model, in a similar fashion to CLIP in SDXL (not recommended)
- Tag dropout, 10% by default: drops 10% of tags randomly (a minimum of 3 tags always survives).
- Full uncond/caption dropout, 5% by default: drops absolutely everything, and the model trains those images against "" instead of captions or tags
- debug_caption_processing: have the console output how the captions are being processed. The output is truncated to avoid spamming your console, but the model sees the full content. You can set how often you see these prints with debug_caption_interval = 100, where 100 is the number of optimizer steps per print (see the config sketch after this list).
- Added the CAME optimizer for folks who used it in LoRA Easy Training Scripts; the baseline LR for it is 5e-6, yes, -6
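For orientation, here is a minimal sketch of how these caption options could look in the config. Only train_llm_adapter, debug_caption_processing, and debug_caption_interval are names taken from the features above; the dropout keys are hypothetical placeholders of mine, so treat the commented anima.toml in the repo as the source of truth:

```toml
# Sketch only; keys marked (assumed) are hypothetical placeholders, not confirmed names.
train_llm_adapter = false          # LLMAdapter training, like CLIP in SDXL; not recommended
debug_caption_processing = true    # print truncated samples of the processed captions
debug_caption_interval = 100       # one debug print every 100 optimizer steps
tag_dropout = 0.10                 # (assumed) drop 10% of tags, minimum 3 always survive
caption_dropout = 0.05             # (assumed) full uncond dropout, trains against ""
```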
Training information with AdamW8Kahan for LoRA:
- Batch size 1, resolution 512px, and checkpointing: about 10GB VRAM usage.
- Batch size 1, resolution 1024px, and checkpointing: about 14GB VRAM usage.
- Suggested LR is 5e-5 or lower!
- Suggested resolution is 768px, as the model was trained at 512px due to budget constraints; treat the resolutions as you would with SD1.5 for now. The model can TOTALLY take 1024px in a LoRA if you want, no problems, but some issues might arise due to the lack of base model training! A config sketch follows this list.
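Putting those numbers together, a LoRA excerpt of anima.toml might look like the sketch below. The top-level keys are standard diffusion-pipe options; the rank value is my own pick and the optimizer type string is an assumption based on the name used above, so verify both against the commented example in the repo:

```toml
# LoRA sketch; the [adapter] table is the segment you need to UNCOMMENT in anima.toml
micro_batch_size_per_gpu = 1       # batch size 1, as in the VRAM figures above
pipeline_stages = 1                # keep at 1 if each GPU can hold the full model
activation_checkpointing = true

[adapter]
type = 'lora'
rank = 32                          # (assumed) my own pick, not a recommendation from this post

[optimizer]
type = 'AdamW8Kahan'               # (assumed) exact type string per the repo's anima.toml
lr = 5e-5                          # suggested LR or lower
```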
Training information with AdamW8Kahan for full finetuning:
- Batch size 1, resolution 512px, and checkpointing: about 31GB VRAM usage.
- Batch size 1, resolution 1024px, and checkpointing: about 33GB VRAM usage.
- Suggested LR is 8e-6 or lower!
- Suggested resolution is 768px, as the model was trained at 512px due to budget constraints; treat the resolutions as you would with SD1.5 for now. The model can TOTALLY take 1024px in finetuning if you want, no problems, but some issues might arise due to the lack of base model training. Prepare a lot of data!
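For full finetuning, the same sketch applies with the [adapter] table left commented out and the lower learning rate:

```toml
# Full finetune sketch: no [adapter] table, lower LR
[optimizer]
type = 'AdamW8Kahan'               # (assumed) exact type string per the repo's anima.toml
lr = 8e-6                          # suggested LR or lower
```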
Here's the repo, with anima.toml and dataset.toml ready for use: https://github.com/bluvoll/diffusion-pipe/tree/main
A LoRA trained on Delta26's style at 1024px, if you want to test the results of training (it's overfit on purpose, for testing): https://drive.google.com/file/d/1r9JtTD5bzPV3NvfLFz_HB1hm5uHKKtxs/view?usp=sharing
Tune pipeline_stages for your number of GPUs: setting it to 2 will default to sharding if you have multiple GPUs. If you can afford to load the model in full for LoRA or finetuning, keep this at 1!
All tests were done on 2x4090. When in doubt, check the anima.toml inside examples; it has comments on everything. Also remember to UNCOMMENT the lora segment in the anima.toml at the root of the folder to enable LoRA training.
To run, with your conda/venv activated:
deepspeed train.py --deepspeed --config anima.toml (multi-GPU), or
deepspeed --num_gpus=1 train.py --deepspeed --config anima.toml (single GPU)
This is a long read
I know, if you want I can expand on the features to make it more detailed :doro:
not really that much when you consider it's only 1k words
Is this supported in musubi-tuner? diffusion-pipe is 2x slower for me.
As for captioning, would Kimi K2.5 suffice as a captioner? It's the most accurate uncensored captioner in my tests.
Hello, the project runs well for me, but when I tried to reduce VRAM usage and improve training speed by enabling cache_text_embeddings = true, the program crashed during the caching process. The script enters dataset_manager.cache() and attempts to move a model to the meta device. If the model contains bitsandbytes 4-bit (nf4) weights (Params4bit), a RuntimeError is raised during this step. The error message indicates a failure when creating a Parameter from Params4bit, stating that detach() returns a Tensor instead of the same parameter type.
I got this bug as well and I'm trying to fix it. Keep in mind that caching text embeddings disables all the special things I did for captions/tags (mixing, dropout, shuffle), so take that into consideration. In the meantime, you can disable NF4 and enable caching embeddings; as I understand it, the two are incompatible, or I just lack the knowledge to make them work in tandem.
For now this guideline holds: NF4 + caching = fails; FP8 + caching = works; BF16/FP16 + caching = works.
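In config terms, the working combinations would look roughly like the sketch below, assuming the weight dtype is selected via transformer_dtype as in baseline diffusion-pipe and that cache_text_embeddings sits at the top level; verify both against anima.toml:

```toml
# Workaround sketch: pair caching with FP8 or BF16 instead of NF4
cache_text_embeddings = true       # remember: this disables mixing/dropout/shuffle

[model]
transformer_dtype = 'float8'       # or 'bfloat16'; NF4 + caching currently fails
```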
That said, I advise against caching; you lose a lot of neat utilities.