The training script for the Anima model has already been implemented for sd-scripts.
I have completed the implementation for both LoRA and full fine-tuning. You can check my pull request here: https://github.com/kohya-ss/sd-scripts/pull/2260. And here is my repository implementing it: https://github.com/duongve13112002/sd-scripts/tree/sd3
Guide:
I. Training
This model can be trained with the sd-scripts training pipeline using either LoRA or full fine-tuning.
- LoRA Training
Use LoRA adapters to reduce memory usage and training time while keeping the base model frozen:
accelerate launch anima_train_network.py \
--dit_path <path_to_anima_dit> \
--vae_path <path_to_wanvae> \
--qwen3_path <path_to_qwen3> \
--dataset_config my_dataset_config.toml \
--output_dir output \
--output_name anima_lora \
--network_module networks.lora_anima \
--network_dim 16 \
--learning_rate 1e-4 \
--max_train_epochs 10 \
--cache_text_encoder_outputs \
--cache_latents
- Full Fine-tuning
For full model fine-tuning, including the DiT and the optional LLM Adapter components:
accelerate launch anima_train.py \
--dit_path <path_to_anima_dit> \
--vae_path <path_to_wanvae> \
--qwen3_path <path_to_qwen3> \
--dataset_config my_dataset_config.toml \
--output_dir output \
--output_name anima_ft \
--learning_rate 1e-5 \
--max_train_epochs 10 \
--cache_text_encoder_outputs \
--cache_latents \
--llm_adapter_lr 0
Setting --llm_adapter_lr 0 freezes the LLM Adapter during training.
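The freeze-via-zero-LR behavior can be sketched with plain PyTorch parameter groups (the module names below are placeholders, not the actual Anima classes): a group with lr=0 stays registered in the optimizer but receives no updates, which is what --llm_adapter_lr 0 amounts to.

```python
import torch

# Placeholder modules standing in for the DiT and the LLM Adapter.
dit = torch.nn.Linear(8, 8)
adapter = torch.nn.Linear(8, 8)

# Two param groups: the adapter group's lr of 0 means AdamW applies no update
# (both the gradient step and the decoupled weight decay are scaled by lr).
opt = torch.optim.AdamW(
    [
        {"params": dit.parameters(), "lr": 1e-5},
        {"params": adapter.parameters(), "lr": 0.0},
    ],
    weight_decay=0.02,
)

before = adapter.weight.detach().clone()
loss = (dit(torch.randn(4, 8)) + adapter(torch.randn(4, 8))).pow(2).mean()
loss.backward()
opt.step()

print(torch.equal(adapter.weight, before))  # True: the adapter is effectively frozen
```

The adapter's gradients are still computed, so this is cheaper on code but not on memory compared to setting requires_grad=False.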
II. Notes
- Anima is a ~2B parameter DiT-based model.
- VRAM usage can be significantly reduced by enabling:
--cache_text_encoder_outputs
--cache_latents
--gradient_checkpointing
- LoRA training supports per-layer rank configuration via
--network_args.
I trained a LoRA using your branch, but the output images look wrong.
Here are the parameters: accelerate launch anima_train_network.py
--dit_path "/root/autodl-tmp/anima-preview.safetensors"
--vae_path "/root/autodl-tmp/qwen_image_vae.safetensors"
--qwen3_path "/root/autodl-tmp/qwen"
--dataset_config "/root/autodl-tmp/sd-/dataset.toml"
--output_dir "/root/autodl-tmp/sd-/output"
--output_name "mmm_anima"
--network_module networks.lora_anima
--network_dim 32
--network_alpha 8
--learning_rate 7e-5
--max_train_epochs 40
--gradient_checkpointing
--train_batch_size 10
--cache_latents
--logging_dir "/root/autodl-tmp/sd-/output"
--xformers
--mixed_precision "bf16"
--optimizer_type "pytorch_optimizer.Adan"
--lr_scheduler "cosine"
--max_grad_norm 0
--lr_warmup_steps 50
--optimizer_args "weight_decay=0.02" "betas=0.98,0.92,0.99"
--save_every_n_epochs=2
--network_train_unet_only
I'm getting AttributeError: 'WanVAE_' object has no attribute 'device' after caching latents:
INFO Loaded Anima VAE successfully. anima_utils.py:159
INFO [Dataset 0] train_util.py:2624
INFO caching latents with caching strategy. train_util.py:1117
INFO caching latents... train_util.py:1167
100%|████████████████████████████████████████████████████████████████████| 63/63 [00:51<00:00, 1.23it/s]
2026-02-06 15:18:56 INFO move vae and unet to cpu to save memory anima_train_network.py:202
Traceback (most recent call last):
File "/mnt/FastDrive/opensource/sd-scripts/anima_train_network.py", line 535, in <module>
trainer.train(args)
File "/mnt/FastDrive/opensource/sd-scripts/train_network.py", line 633, in train
self.cache_text_encoder_outputs_if_needed(args, accelerator, unet, vae, text_encoders, train_dataset_group, weight_dtype)
File "/mnt/FastDrive/opensource/sd-scripts/anima_train_network.py", line 203, in cache_text_encoder_outputs_if_needed
org_vae_device = vae.device
^^^^^^^^^^
File "/home/simadude/.conda/envs/sd-scripts/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1965, in __getattr__
raise AttributeError(
AttributeError: 'WanVAE_' object has no attribute 'device'
EDIT: accidentally wrote "after one epoch" instead of "after caching latents"
Besides the above issue, which I also get, latents are re-cached on every run, and they are saved with incorrect keys. For a 1024x1024 image called "132.png":
NpzFile '132_1024x1024_anima.npz' with keys: latents_1x128, original_size_1x128, crop_ltrb_1x128"
And it is, of course, looking for the key latents_128x128, which does not exist.
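For anyone debugging the same mismatch, the cache layout is easy to inspect directly with numpy. The file below is a stand-in created on the spot, mimicking the wrong suffix described above, not a real cache:

```python
import numpy as np

# Create a stand-in cache file with the mis-sized key suffix reported above.
np.savez(
    "132_1024x1024_anima.npz",
    latents_1x128=np.zeros((16, 128, 128), dtype=np.float32),
)

# Listing the keys shows which resolution suffix was actually written, so it
# can be compared against the key the loader asks for (latents_128x128).
f = np.load("132_1024x1024_anima.npz")
print(sorted(f.keys()))  # ['latents_1x128']
```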
@Espamholding @SimaDude Oops, my bad, I hadn't checked caching of text embeddings and latents; I only tested without them. I will check it and push the fixed code soon, don't worry.
Where is the anima usage documentation? I would like to know the recommended training parameters, please.
Wish me luck y'all, I'm halfway through the training (6/10 epochs done) 🙏
Had to also fix this in anima_train_network.py for some reason:
@@ -250,7 +250,7 @@ class AnimaNetworkTrainer(train_network.NetworkTrainer):
text_encoders[0].to(accelerator.device, dtype=weight_dtype)
def sample_images(self, accelerator, args, epoch, global_step, device, vae, tokenizer, text_encoder, unet):
- text_encoders = text_encoder # compatibility
+ text_encoders = text_encoder if isinstance(text_encoder, list) else [text_encoder]
te = self.get_models_for_text_encoding(args, accelerator, text_encoders)
qwen3_te = te[0] if te is not None else None
@SimaDude @Espamholding @MightyCrimson i fixed all the problems and added the document for it. Please check it out
Well, I finished training now. After I wrote my first comment, I hacked away any VAE-related errors stopping me from training and kept going with the cached latents. The result I got was also fried like kazuyi1222's; since you didn't mention that person, I imagine that issue is not fixed. Smells like another VAE issue to me.
@Espamholding Let me check in more detail; I tested full fine-tuning and it worked well. It is so weird.
Shoving sd-scripts' cached latents of SDXL (with the EQ VAE) and of the Qwen VAE with your code respectively (after the 2nd commit) into Comfy:
Wrongly applied scaling factor? It certainly feels like we're training the model to be closer to that more fried-looking latent.
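A quick sanity check for a scaling-factor mismatch is to compare latent statistics before and after scaling; the factor below is purely illustrative (not the real Qwen/Anima VAE value), but it shows how a factor that is applied twice, or never undone, skews the magnitudes:

```python
import numpy as np

rng = np.random.default_rng(0)
z = rng.standard_normal((16, 128, 128)).astype(np.float32)  # roughly unit-variance latent

scale = 0.5  # illustrative scaling factor only
cached = z * scale

# Dividing by the matching factor restores the statistics (std back near 1.0)...
restored_std = float((cached / scale).std())
# ...while applying the factor a second time, a typical caching bug, does not.
double_scaled_std = float((cached * scale).std())

print(round(restored_std, 2), round(double_scaled_std, 2))  # ≈ 1.0 vs ≈ 0.25
```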
Code, for convenience, for converting an sd-scripts cached latent into a Comfy one, to test against Comfy's implementation of the Qwen VAE:
import argparse

import numpy as np
import torch
from safetensors.torch import save_file

parser = argparse.ArgumentParser()
parser.add_argument("-i", help="input latent (.npz) path")
parser.add_argument("-o", help="output latent (.safetensors) path")
args = parser.parse_args()

lat_np_f = np.load(args.i)

# Pick whichever key holds the latent tensor (e.g. "latents_1x128").
target_key = next(k for k in lat_np_f.keys() if k.startswith("latents"))
lat_t = torch.from_numpy(lat_np_f[target_key])

# Comfy latent files carry a version marker plus the tensor with a batch dim.
save_file({"latent_format_version_0": torch.Tensor([]), "latent_tensor": lat_t.unsqueeze(0)}, args.o)
@duongve
Using the flag --cache_text_encoder_outputs throws an error. Caching latents works.
override steps. steps for 10 epochs is / 指定エポックまでのステップ数: 630
running training / 学習開始
num train images * repeats / 学習画像の数×繰り返し回数: 63
num validation images * repeats / 学習画像の数×繰り返し回数: 0
num reg images / 正則化画像の数: 0
num batches per epoch / 1epochのバッチ数: 63
num epochs / epoch数: 10
batch size per device / バッチサイズ: 1
gradient accumulation steps / 勾配を合計するステップ数 = 1
total optimization steps / 学習ステップ数: 630
2026-02-06 22:24:55 INFO text_encoder is not needed for training. deleting to train_network.py:1322
save memory.
2026-02-06 22:24:56 INFO unet dtype: torch.bfloat16, device: cuda:0 train_network.py:1347
steps: 0%| | 0/630 [00:00<?, ?it/s]
epoch 1/10
huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
To disable this warning, you can either:
- Avoid using `tokenizers` before the fork if possible
- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)
2026-02-06 22:24:57 INFO epoch is incremented. current_epoch: 0, epoch: 1 train_util.py:781
Traceback (most recent call last):
File "/mnt/FastDrive/opensource/sd-scripts/anima_train_network.py", line 532, in <module>
trainer.train(args)
File "/mnt/FastDrive/opensource/sd-scripts/train_network.py", line 1427, in train
loss = self.process_batch(
^^^^^^^^^^^^^^^^^^^
File "/mnt/FastDrive/opensource/sd-scripts/anima_train_network.py", line 423, in process_batch
input_ids = [ids.to(accelerator.device) for ids in batch["input_ids_list"]]
~~~~~^^^^^^^^^^^^^^^^^^
TypeError: 'NoneType' object is not iterable
steps: 0%| | 0/630 [00:01<?, ?it/s]
@Mightys Yes, but I think discrete_flow_shift should be 1, because I checked it in diffusion-pipe. I am not sure about this value, but I am using 3 at the moment.
@duongve
I've already solved it, but now I have another problem (sorry for so many comments). When I try to train in both mixed fp16 and full fp16, it results in NaN. Is this a script issue, or can Anima not be trained in fp16?
Anima uses full bf16, so it cannot be trained with fp16.
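The underlying reason is dynamic range: fp16 tops out at 65504, while bf16 keeps float32's exponent range (~3.4e38), so values that are fine in a bf16-trained model overflow to inf in fp16 and then propagate as NaN. numpy has no bfloat16, so the bf16 side is only noted in comments here:

```python
import numpy as np

# fp16's largest finite value; bf16 shares float32's exponent range (~3.4e38).
print(np.finfo(np.float16).max)  # 65504.0

with np.errstate(over="ignore", invalid="ignore"):
    x = np.float32(70000.0)  # representable in bf16 and float32
    y = np.float16(x)        # exceeds 65504, so fp16 overflows to inf
    d = y - y                # inf - inf: the NaN that then spreads through training
    print(y, d)
```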
Congrats for merging back to mainline and thanks for the training scripts.
EDIT: Turns out that I was using the wrong (fp8) base model.
Anyone successfully training character LoRAs with a 4090 (Ada Lovelace)? (Driver version 580.126.09, CUDA version 13.0, torch 2.9.1)
I tried with a simple dataset of Danbooru images (around 100), tagged with the WD tagger plus minor manual editing, as well as a smaller one (roughly 20 images, which used to work for Lumina and Illustrious). Even overfitting on a single image breaks the LoRA weights, making at least ~1/10 of the lora_up weights bimodal, as below.
The resulting image becomes blurry and brighter (visible at 50 steps with a single image), until it totally collapses into a white/grid image, regardless of the learning rate.
I have tried different settings and different datasets, but I get consistent results (learning rates of 1e-5 or 5e-6 also didn't help).
Tried different ranks (8, 16, 32) and different alphas (1, 2, 4, 8, 16). Also tried training without the bf16 precision settings. Adding/removing --no_half_vae, --network_train_unet_only, and --gradient_checkpointing didn't help either. I also tried the Wan 2.1 VAE, but it wasn't really helpful either.
--network_module=networks.lora_anima \
--save_model_as=safetensors \
--optimizer_type="AdamW8bit" \
--lr_scheduler="constant" \
--network_dim=8 \
--network_alpha=8 \
--learning_rate=1e-4 \
--max_train_epochs=5 \
--mixed_precision="bf16" \
--save_every_n_epochs=1 \
--save_precision="bf16" \
--cache_text_encoder_outputs \
--cache_latents
Any hints would be greatly appreciated.
@cpbmc That's odd, I don't have any issues on my RTX 4060 Ti, but here are my flags if they help:
--network_module networks.lora_anima --network_dim 16 --network_alpha 16 --learning_rate 1e-4 --max_train_epochs 10 --gradient_checkpointing --mixed_precision bf16 --cache_latents --cache_text_encoder_outputs
Maybe there's something wrong with your dataset? I remember having issues because all my image sizes were off when I was just starting with NoobAI and Illustrious. Try making a dataset with 1024x1024 images only, or enable bucketing (if you haven't yet). Here's how I have it in my dataset config (I disabled shuffle_caption because I have no idea how to work with it):
[general]
shuffle_caption = false
caption_extension = '.txt'
enable_bucket = true
min_bucket_reso = 512
max_bucket_reso = 2048
bucket_reso_steps = 64
bucket_no_upscale = true
@cpbmc I am also not seeing this issue. Have you tried regular AdamW instead of the 8-bit one? That is the most likely difference I can think of between my config and what you've described. Otherwise: I have an Intel Arc A770 16GB, and I'm training rank/alpha 32/32 LoRAs at a constant LR of 0.000122 with regular AdamW, batch size 1.
That covers two LoRAs: one 12000-step LoRA for 3 characters, and one 4000-step LoRA for a style. No blurriness or greyness. I doubt you trained more than 10000 steps, especially with the 20-image dataset, so I'd expect the issue to show up here if it were general?
@SimaDude @Espamholding Thank you for confirming that it works on 40xx cards; it gave me more confidence in debugging. It turns out it was because I was using the fp8 DiT model... a very stupid mistake. Now I've got it working.