The training script for the Anima model has already been implemented for sd-scripts.
I have completed the implementation for both LoRA and full fine-tuning. You can check my pull request here: https://github.com/kohya-ss/sd-scripts/pull/2260. And here is my repository implementing it: https://github.com/duongve13112002/sd-scripts/tree/sd3
Guide:
I. Training
This model can be trained with the sd-scripts training pipeline using either LoRA or full fine-tuning.
- LoRA Training
Use LoRA adapters to reduce memory usage and training time while keeping the base model frozen:
accelerate launch anima_train_network.py \
--dit_path <path_to_anima_dit> \
--vae_path <path_to_wanvae> \
--qwen3_path <path_to_qwen3> \
--dataset_config my_dataset_config.toml \
--output_dir output \
--output_name anima_lora \
--network_module networks.lora_anima \
--network_dim 16 \
--learning_rate 1e-4 \
--max_train_epochs 10 \
--cache_text_encoder_outputs \
--cache_latents
- Full Fine-tuning
For full model fine-tuning, including the DiT and the optional LLM Adapter components:
accelerate launch anima_train.py \
--dit_path <path_to_anima_dit> \
--vae_path <path_to_wanvae> \
--qwen3_path <path_to_qwen3> \
--dataset_config my_dataset_config.toml \
--output_dir output \
--output_name anima_ft \
--learning_rate 1e-5 \
--max_train_epochs 10 \
--cache_text_encoder_outputs \
--cache_latents \
--llm_adapter_lr 0
Setting --llm_adapter_lr 0 freezes the LLM Adapter during training.
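The freeze-via-zero-LR behavior can be sketched with plain PyTorch parameter groups (the module names below are placeholders, not the actual Anima classes): a group with lr=0 stays registered in the optimizer but receives no updates, which is what --llm_adapter_lr 0 amounts to.

```python
import torch

# Placeholder modules standing in for the DiT and the LLM Adapter.
dit = torch.nn.Linear(8, 8)
adapter = torch.nn.Linear(8, 8)

# Two param groups: the adapter group's lr of 0 means AdamW applies no update
# (both the gradient step and the decoupled weight decay are scaled by lr).
opt = torch.optim.AdamW(
    [
        {"params": dit.parameters(), "lr": 1e-5},
        {"params": adapter.parameters(), "lr": 0.0},
    ],
    weight_decay=0.02,
)

before = adapter.weight.detach().clone()
loss = (dit(torch.randn(4, 8)) + adapter(torch.randn(4, 8))).pow(2).mean()
loss.backward()
opt.step()

print(torch.equal(adapter.weight, before))  # True: the adapter is effectively frozen
```

The adapter's gradients are still computed, so this is cheaper on code but not on memory compared to setting requires_grad=False.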
II. Notes
- Anima is a ~2B parameter DiT-based model.
- VRAM usage can be significantly reduced by enabling:
--cache_text_encoder_outputs
--cache_latents
--gradient_checkpointing
- LoRA training supports per-layer rank configuration via
--network_args.
I trained a LoRA using your branch, but the output images look wrong.
Here are the parameters: accelerate launch anima_train_network.py
--dit_path "/root/autodl-tmp/anima-preview.safetensors"
--vae_path "/root/autodl-tmp/qwen_image_vae.safetensors"
--qwen3_path "/root/autodl-tmp/qwen"
--dataset_config "/root/autodl-tmp/sd-/dataset.toml"
--output_dir "/root/autodl-tmp/sd-/output"
--output_name "mmm_anima"
--network_module networks.lora_anima
--network_dim 32
--network_alpha 8
--learning_rate 7e-5
--max_train_epochs 40
--gradient_checkpointing
--train_batch_size 10
--cache_latents
--logging_dir "/root/autodl-tmp/sd-/output"
--xformers
--mixed_precision "bf16"
--optimizer_type "pytorch_optimizer.Adan"
--lr_scheduler "cosine"
--max_grad_norm 0
--lr_warmup_steps 50
--optimizer_args "weight_decay=0.02" "betas=0.98,0.92,0.99"
--save_every_n_epochs=2
--network_train_unet_only
I'm getting AttributeError: 'WanVAE_' object has no attribute 'device' after caching latents:
INFO Loaded Anima VAE successfully. anima_utils.py:159
INFO [Dataset 0] train_util.py:2624
INFO caching latents with caching strategy. train_util.py:1117
INFO caching latents... train_util.py:1167
100%|████████████████████████████████████████████████████████████████████| 63/63 [00:51<00:00, 1.23it/s]
2026-02-06 15:18:56 INFO move vae and unet to cpu to save memory anima_train_network.py:202
Traceback (most recent call last):
File "/mnt/FastDrive/opensource/sd-scripts/anima_train_network.py", line 535, in <module>
trainer.train(args)
File "/mnt/FastDrive/opensource/sd-scripts/train_network.py", line 633, in train
self.cache_text_encoder_outputs_if_needed(args, accelerator, unet, vae, text_encoders, train_dataset_group, weight_dtype)
File "/mnt/FastDrive/opensource/sd-scripts/anima_train_network.py", line 203, in cache_text_encoder_outputs_if_needed
org_vae_device = vae.device
^^^^^^^^^^
File "/home/simadude/.conda/envs/sd-scripts/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1965, in __getattr__
raise AttributeError(
AttributeError: 'WanVAE_' object has no attribute 'device'
EDIT: accidentally wrote "after one epoch" instead of "after caching latents"
Besides the above issue, which I also get, latents are re-cached on every run, and they are saved with incorrect keys. For a 1024x1024 image called "132.png":
NpzFile '132_1024x1024_anima.npz' with keys: latents_1x128, original_size_1x128, crop_ltrb_1x128"
And it is, of course, looking for the key latents_128x128, which does not exist.
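For anyone debugging the same mismatch, the cache layout is easy to inspect directly with numpy. The file below is a stand-in created on the spot, mimicking the wrong suffix described above, not a real cache:

```python
import numpy as np

# Create a stand-in cache file with the mis-sized key suffix reported above.
np.savez(
    "132_1024x1024_anima.npz",
    latents_1x128=np.zeros((16, 128, 128), dtype=np.float32),
)

# Listing the keys shows which resolution suffix was actually written, so it
# can be compared against the key the loader asks for (latents_128x128).
f = np.load("132_1024x1024_anima.npz")
print(sorted(f.keys()))  # ['latents_1x128']
```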
@Espamholding @SimaDude Oops, my bad, I hadn't checked caching of text embeddings and latents; I only tested without them. I will check it and push the fixed code soon, don't worry.
Where is the anima usage documentation? I would like to know the recommended training parameters, please.
Wish me luck y'all, I'm halfway through the training (6/10 epochs done) 🙏
Had to also fix this in anima_train_network.py for some reason:
@@ -250,7 +250,7 @@ class AnimaNetworkTrainer(train_network.NetworkTrainer):
text_encoders[0].to(accelerator.device, dtype=weight_dtype)
def sample_images(self, accelerator, args, epoch, global_step, device, vae, tokenizer, text_encoder, unet):
- text_encoders = text_encoder # compatibility
+ text_encoders = text_encoder if isinstance(text_encoder, list) else [text_encoder]
te = self.get_models_for_text_encoding(args, accelerator, text_encoders)
qwen3_te = te[0] if te is not None else None
@SimaDude @Espamholding @MightyCrimson i fixed all the problems and added the document for it. Please check it out
Well, I finished training now. After I wrote my first comment, I hacked away any VAE-related errors stopping me from training and kept going with the cached latents. The result I got was also fried like kazuyi1222's; since you didn't mention that person, I imagine that issue is not fixed. Smells like another VAE issue to me.
@Espamholding Let me check in more detail; I tested full fine-tuning and it worked well. It is so weird.
Shoving sd-scripts' cached latents of SDXL (with the EQ VAE) and of the Qwen VAE with your code respectively (after the 2nd commit) into Comfy:
Wrongly applied scaling factor? It certainly feels like we're training the model to be closer to that more fried-looking latent.
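A quick sanity check for a scaling-factor mismatch is to compare latent statistics before and after scaling; the factor below is purely illustrative (not the real Qwen/Anima VAE value), but it shows how a factor that is applied twice, or never undone, skews the magnitudes:

```python
import numpy as np

rng = np.random.default_rng(0)
z = rng.standard_normal((16, 128, 128)).astype(np.float32)  # roughly unit-variance latent

scale = 0.5  # illustrative scaling factor only
cached = z * scale

# Dividing by the matching factor restores the statistics (std back near 1.0)...
restored_std = float((cached / scale).std())
# ...while applying the factor a second time, a typical caching bug, does not.
double_scaled_std = float((cached * scale).std())

print(round(restored_std, 2), round(double_scaled_std, 2))  # ≈ 1.0 vs ≈ 0.25
```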
Code, for convenience, for converting an sd-scripts cached latent into a Comfy one, to test against Comfy's implementation of the Qwen VAE:
import argparse

import numpy as np
import torch
from safetensors.torch import save_file

parser = argparse.ArgumentParser()
parser.add_argument("-i", help="input latent (.npz) path")
parser.add_argument("-o", help="output latent (.safetensors) path")
args = parser.parse_args()

lat_np_f = np.load(args.i)

# Pick whichever key holds the latent tensor (e.g. "latents_1x128").
target_key = next(k for k in lat_np_f.keys() if k.startswith("latents"))
lat_t = torch.from_numpy(lat_np_f[target_key])

# Comfy latent files carry a version marker plus the tensor with a batch dim.
save_file({"latent_format_version_0": torch.Tensor([]), "latent_tensor": lat_t.unsqueeze(0)}, args.o)
@duongve
Using the flag --cache_text_encoder_outputs throws an error. Caching latents works.
override steps. steps for 10 epochs is / 指定エポックまでのステップ数: 630
running training / 学習開始
num train images * repeats / 学習画像の数×繰り返し回数: 63
num validation images * repeats / 学習画像の数×繰り返し回数: 0
num reg images / 正則化画像の数: 0
num batches per epoch / 1epochのバッチ数: 63
num epochs / epoch数: 10
batch size per device / バッチサイズ: 1
gradient accumulation steps / 勾配を合計するステップ数 = 1
total optimization steps / 学習ステップ数: 630
2026-02-06 22:24:55 INFO text_encoder is not needed for training. deleting to train_network.py:1322
save memory.
2026-02-06 22:24:56 INFO unet dtype: torch.bfloat16, device: cuda:0 train_network.py:1347
steps: 0%| | 0/630 [00:00<?, ?it/s]
epoch 1/10
huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
To disable this warning, you can either:
- Avoid using `tokenizers` before the fork if possible
- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)
2026-02-06 22:24:57 INFO epoch is incremented. current_epoch: 0, epoch: 1 train_util.py:781
Traceback (most recent call last):
File "/mnt/FastDrive/opensource/sd-scripts/anima_train_network.py", line 532, in <module>
trainer.train(args)
File "/mnt/FastDrive/opensource/sd-scripts/train_network.py", line 1427, in train
loss = self.process_batch(
^^^^^^^^^^^^^^^^^^^
File "/mnt/FastDrive/opensource/sd-scripts/anima_train_network.py", line 423, in process_batch
input_ids = [ids.to(accelerator.device) for ids in batch["input_ids_list"]]
~~~~~^^^^^^^^^^^^^^^^^^
TypeError: 'NoneType' object is not iterable
steps: 0%| | 0/630 [00:01<?, ?it/s]
@Mightys Yes, but I think discrete_flow_shift should be 1, because I checked it in diffusion-pipe. I am not sure about this value, but I am using 3 at the moment.
@duongve
I've already solved it, but now I have another problem (sorry for so many comments). When I try to train in both mixed fp16 and full fp16, it results in NaN. Is this a script issue, or can Anima not be trained in fp16?
Anima uses full bf16, so it cannot be trained with fp16.
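The underlying reason is dynamic range: fp16 tops out at 65504, while bf16 keeps float32's exponent range (~3.4e38), so values that are fine in a bf16-trained model overflow to inf in fp16 and then propagate as NaN. numpy has no bfloat16, so the bf16 side is only noted in comments here:

```python
import numpy as np

# fp16's largest finite value; bf16 shares float32's exponent range (~3.4e38).
print(np.finfo(np.float16).max)  # 65504.0

with np.errstate(over="ignore", invalid="ignore"):
    x = np.float32(70000.0)  # representable in bf16 and float32
    y = np.float16(x)        # exceeds 65504, so fp16 overflows to inf
    d = y - y                # inf - inf: the NaN that then spreads through training
    print(y, d)
```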
Congrats for merging back to mainline and thanks for the training scripts.
EDIT: Turns out that I was using the wrong (fp8) base model.
Anyone successfully training character LoRAs with a 4090 (Ada Lovelace)? (Driver version 580.126.09, CUDA version 13.0, torch 2.9.1)
I tried with a simple dataset of Danbooru images (around 100), tagged with the WD tagger plus minor manual editing, as well as a smaller one (roughly 20 images, which used to work for Lumina and Illustrious). Even overfitting on a single image breaks the LoRA weights, making at least ~1/10 of the lora_up weights bimodal, as below.
The resulting image becomes blurry and brighter (visible at 50 steps with a single image), until it totally collapses into a white/grid image, regardless of the learning rate.
I have tried different settings and different datasets, but I get consistent results (learning rates of 1e-5 or 5e-6 also didn't help).
Tried different ranks (8, 16, 32) and different alphas (1, 2, 4, 8, 16). Also tried training without the bf16 precision settings. Adding/removing --no_half_vae, --network_train_unet_only, and --gradient_checkpointing didn't help either. I also tried the Wan 2.1 VAE, but it wasn't really helpful either.
--network_module=networks.lora_anima \
--save_model_as=safetensors \
--optimizer_type="AdamW8bit" \
--lr_scheduler="constant" \
--network_dim=8 \
--network_alpha=8 \
--learning_rate=1e-4 \
--max_train_epochs=5 \
--mixed_precision="bf16" \
--save_every_n_epochs=1 \
--save_precision="bf16" \
--cache_text_encoder_outputs \
--cache_latents
Any hints would be greatly appreciated.
@cpbmc That's odd, I don't have any issues on my RTX 4060 Ti, but here are my flags if they help:
--network_module networks.lora_anima --network_dim 16 --network_alpha 16 --learning_rate 1e-4 --max_train_epochs 10 --gradient_checkpointing --mixed_precision bf16 --cache_latents --cache_text_encoder_outputs
Maybe there's something wrong with your dataset? I remember having issues because all my image sizes were off when I was just starting with NoobAI and Illustrious. Try making a dataset with 1024x1024 images only, or enable bucketing (if you haven't yet). Here's how I have it in my dataset config (I disabled shuffle_caption because I have no idea how to work with it):
[general]
shuffle_caption = false
caption_extension = '.txt'
enable_bucket = true
min_bucket_reso = 512
max_bucket_reso = 2048
bucket_reso_steps = 64
bucket_no_upscale = true
@cpbmc I am also not seeing this issue. Have you tried regular AdamW instead of the 8-bit one? That is the most likely difference I can think of between my config and what you've described. Otherwise: I have an Intel Arc A770 16GB, and I'm training rank/alpha 32/32 LoRAs at a constant LR of 0.000122 with regular AdamW, batch size 1.
That covers two LoRAs: one 12000-step LoRA for 3 characters, and one 4000-step LoRA for a style. No blurriness or greyness. I doubt you trained more than 10000 steps, especially with the 20-image dataset, so I'd expect the issue to show up here if it were general?
@SimaDude @Espamholding Thank you for confirming that it works on 40xx cards; it gave me more confidence in debugging. It turns out it was because I was using the fp8 DiT model... a very stupid mistake. Now I've got it working.