Training question from a beginner
Hi everyone
I'm a total noob at this and I'll openly admit I don't know what I'm doing. I mixed and matched a bunch of training parameters from random online tutorials and LoRA guides (this Hugging Face page and Civitai), so my TOML is probably a Frankenstein's monster.
The generations come out with a really ugly green wash/tint, especially in backgrounds, water, space scenes, and shadows (everything looks like murky olive soup), plus a heavy painterly/scratchy texture on skin and flat surfaces that shouldn't have it.
Here are two example outputs that show the problems (Nami from One Piece on water and the guy in the spaceship interior):
This is the exact TOML (in the "LoRA_Easy_Training_Scripts-refresh" program) I'm using right now:
[[subsets]]
caption_extension = ".txt"
image_dir = "/content/drive/MyDrive/Loras/[ARTISTSTYLE]/dataset"
name = "subset 1"
num_repeats = 1
random_crop_padding_percent = 0.0
[train_mode]
train_mode = "lora"
[general_args.args]
persistent_data_loader_workers = true
pretrained_model_name_or_path = ""
full_bf16 = true
mixed_precision = "bf16"
gradient_checkpointing = true
seed = 23
max_data_loader_n_workers = 1
max_token_length = 225
prior_loss_weight = 1.0
sdpa = true
cache_latents = true
vae_batch_size = 1
max_train_epochs = 20
[general_args.dataset_args]
resolution = 1024
batch_size = 1
[network_args.args]
network_dim = 8
network_alpha = 8.0
min_timestep = 0
max_timestep = 1000
[optimizer_args.args]
lr_scheduler = "cosine"
warmup_ratio = 0.1
max_grad_norm = 1.0
optimizer_type = "Came"
loss_type = "l2"
learning_rate = 1.25e-5
[saving_args.args]
output_dir = "/content/drive/MyDrive/Loras/[ARTISTSTYLE]/output"
output_name = "[ARTISTSTYLE]"
save_precision = "bf16"
save_model_as = "safetensors"
save_every_n_epochs = 1
save_last_n_epochs_state = 1
save_state = true
[logging_args.args]
log_prefix_mode = "disabled"
run_name_mode = "default"
[anima_args.args]
pretrained_model_name_or_path = "/content/drive/MyDrive/Downloaded_models/anima-preview2.safetensors"
qwen3 = "/content/drive/MyDrive/Downloaded_text_encoders/qwen_3_06b_base.safetensors"
vae = "/content/drive/MyDrive/Downloaded_VAEs/qwen_image_vae.safetensors"
qwen3_max_token_length = 512
t5_max_token_length = 512
timestep_sampling = "sigma"
discrete_flow_shift = 3.0
[edm_loss_args.args]
edm2_loss_weighting = false
[extra_args.args]
weighting_scheme = "uniform"
debiased_estimation_loss = true
noise_offset = 0.1
[bucket_args.dataset_args]
enable_bucket = true
min_bucket_reso = 512
max_bucket_reso = 1536
bucket_reso_steps = 64
[network_args.args.network_args]
loraplus_lr_ratio = "2.0"
network_reg_dims = ".*blocks\\\\.(1[89]|2[0-7])\\\\..*=16"
include_patterns = "['.*unet_blocks_([0-9]|1[0-9]|2[0-7])\\\\..*']"
exclude_patterns = "['.*_te_layers_.*', '.*adaln_modulation.*']"
network_reg_lrs = ".*blocks\\\\.(1[89]|2[0-7])\\\\..*=3e-05, .*blocks\\\\.([0-9]|1[0-7])\\\\..*=1.25e-05"
[optimizer_args.args.optimizer_args]
weight_decay = "0.1"
I'm using the Lora Easy Training Colab notebook: Lora_Easy_Training_Colab
I'm running on the free Colab T4 GPU with full_bf16 = true. I read that the T4 has trouble with full bfloat16 (bf16) and can easily produce NaNs. Could this be causing my issues?
I also tried training in fp16, but the style barely learned at all; the results were much weaker than with bf16 (even though the colors are muted, you can still see the style starting to come through in bf16).
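In case it helps diagnose: from what I've read, native bf16 hardware support needs CUDA compute capability 8.0+ (Ampere), while the T4 is Turing (sm_75), so bf16 there runs through slower software paths rather than tensor cores. A quick sketch of that check (I'd query the real value with torch.cuda.get_device_capability() when torch and a GPU are available):

```python
def bf16_hw_accelerated(capability):
    """True if a CUDA compute capability (major, minor) has native bf16
    tensor-core support -- Ampere (sm_80) and newer. On older cards bf16
    still runs, just without hardware acceleration."""
    major, minor = capability
    return (major, minor) >= (8, 0)

# The Colab T4 is Turing (sm_75); an A100 is Ampere (sm_80):
assert bf16_hw_accelerated((7, 5)) is False  # T4: bf16 works, but slowly
assert bf16_hw_accelerated((8, 0)) is True   # A100: native bf16
```

Note that bf16's wide exponent range should actually make overflow/NaN less likely than fp16, which might explain why my fp16 run came out worse.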
Any help would be super appreciated! I'm happy to try any suggested changes to the TOML or training settings.
You definitely shouldn't be using noise offset for this model; that's the first and only thing I can pretty confidently say is "wrong."
It could 100% lead to weird colors and grainy stuff happening where it shouldn't. Try a run without it.
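For context on why it causes color shifts, here's a rough sketch of the kohya-style noise offset (the names and 1-D shape are illustrative, not the actual implementation):

```python
import random

def offset_noise(noise, offset, rng):
    """Kohya-style noise offset (sketch): a Gaussian scalar, scaled by
    `offset`, is added to the whole noise map. In the real code it's one
    scalar per (sample, channel) broadcast over H and W; collapsed to a
    single scalar here. The constant shift biases training toward global
    brightness/color changes -- great for dark-image hacks on older
    eps-prediction models, a color-cast risk elsewhere."""
    shift = offset * rng.gauss(0.0, 1.0)
    return [n + shift for n in noise]

# offset = 0.0 leaves the noise untouched -- the safe default here:
assert offset_noise([0.1, -0.2], 0.0, random.Random(0)) == [0.1, -0.2]
```

As far as I know, noise offset was a workaround for SD1.5/SDXL-style eps-prediction models; a flow-matching model like this one shouldn't need it at all.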
Try the default config first.
Alpha should be half of dim if you don't want to overtrain quickly.
network_dim=8
network_alpha=4
learning_rate=2e-4
Remove this fancy stuff that isn't even required:
prior_loss_weight = 1.0
edm2_loss_weighting = false
timestep_sampling = "sigma"
discrete_flow_shift = 3.0
weighting_scheme = "uniform"
debiased_estimation_loss = true
noise_offset = 0.1
Also remove LoRA+ (the loraplus_lr_ratio setting).
How are you even able to train it? It's constantly giving me a 'CUDA out of memory' error ;_;
64 GB of RAM and 24 GB of VRAM.
> How are you even able to train it? It's constantly giving me a 'CUDA out of memory' error ;_;
What are your specs?
> How are you even able to train it? It's constantly giving me a 'CUDA out of memory' error ;_;
> What are your specs?
It's the same Colab as OP, so it's a T4, but he's seemingly able to train it just fine.
EDIT: I'm silly and didn't cache latents earlier. It works now! Hopefully I don't have OP's issues with it.
Hey guys, thanks for the help!
Huge facepalm from me… turns out it wasn’t a training problem at all.
I was testing with the "image → VAE" node and had lowered the denoise in the KSampler. After I was done, I forgot to set it back to 1.0, so I wasted days retraining when the green tint was just a dumb workflow mistake. Classic newbie error.
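For anyone else who hits this: as I understand it, a denoise below 1.0 makes the KSampler skip the earliest (noisiest) steps of the schedule, so the output stays anchored to whatever latent you feed in (in my case, the green-tinted one). Rough sketch, with the exact step math being illustrative:

```python
def start_step(total_steps, denoise):
    """Which sampler step a KSampler effectively starts from (sketch):
    denoise < 1.0 skips the earliest steps, so the result stays close
    to the incoming latent instead of being generated from pure noise."""
    return round(total_steps * (1.0 - denoise))

assert start_step(20, 1.0) == 0    # full denoise: fresh generation
assert start_step(20, 0.4) == 12   # partial: 12 of 20 steps skipped
```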
That said, your advice still helped a lot! Removing noise_offset, setting alpha to half the dim, and stripping out all the extra fancy stuff made the LoRA train quite a bit cleaner.
Here are the new results with the fixed workflow (same prompt/seed):
And here’s my current (simpler) TOML:
[[subsets]]
caption_extension = ".txt"
image_dir = "/content/drive/MyDrive/Loras/[ARTISTSTYLE]/dataset"
name = "subset 1"
num_repeats = 1
random_crop_padding_percent = 0.0
[train_mode]
train_mode = "lora"
[general_args.args]
persistent_data_loader_workers = true
pretrained_model_name_or_path = ""
gradient_checkpointing = true
seed = 23
max_data_loader_n_workers = 1
max_token_length = 225
sdpa = true
max_train_epochs = 20
cache_latents = true
vae_batch_size = 1
full_bf16 = true
mixed_precision = "bf16"
[general_args.dataset_args]
resolution = 1024
batch_size = 1
[network_args.args]
network_dim = 16
network_alpha = 8.0
min_timestep = 0
max_timestep = 1000
[optimizer_args.args]
optimizer_type = "AdamW8bit"
lr_scheduler = "cosine"
loss_type = "l2"
learning_rate = 0.0002
warmup_ratio = 0.1
max_grad_norm = 1.0
[saving_args.args]
output_dir = "/content/drive/MyDrive/Loras/[ARTISTSTYLE]/output"
save_precision = "bf16"
save_model_as = "safetensors"
save_every_n_epochs = 1
save_last_n_epochs_state = 1
save_state = true
output_name = "[ARTISTSTYLE]"
[logging_args.args]
log_prefix_mode = "disabled"
run_name_mode = "default"
[anima_args.args]
pretrained_model_name_or_path = "/content/drive/MyDrive/Downloaded_models/anima-preview2.safetensors"
qwen3 = "/content/drive/MyDrive/Downloaded_text_encoders/qwen_3_06b_base.safetensors"
vae = "/content/drive/MyDrive/Downloaded_VAEs/qwen_image_vae.safetensors"
qwen3_max_token_length = 512
t5_max_token_length = 512
timestep_sampling = "sigmoid"
sigmoid_scale = 1.0
discrete_flow_shift = 3.0
[edm_loss_args.args]
edm2_loss_weighting = false
[bucket_args.dataset_args]
enable_bucket = true
min_bucket_reso = 512
bucket_reso_steps = 64
max_bucket_reso = 1536
Also, @brokencontroller:
Quick tip since you're on the free T4 and can only train for 1.5–4 hours at a time:
Make sure you keep save_state = true and save_last_n_epochs_state = 1. That way you can resume training from where you left off instead of starting over every time Colab disconnects.
Lesson learned: always double-check the workflow before (re)training! Thanks again :)