Professional training script for Wan 2.1 I2V LoRA that actually works for character consistency in video generation.
| Feature | wavespeed.ai | This Script |
|---|---|---|
| Data format | Images (zip) | Video clips |
| Rank | Max 64 | 128-512 |
| I2V layers | Unknown | add_k_proj, add_v_proj (image cross-attention) |
| Training type | T2I-style on images | True I2V on video |
| Temporal consistency | ❌ None | ✅ Learned from video |
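The `add_k_proj` / `add_v_proj` layers are where the reference image enters via cross-attention, so including them in the LoRA is what makes the adapter I2V-aware. A minimal sketch with `peft` (the other target-module names are assumptions based on Diffusers' usual attention naming, not confirmed from the script):

```python
from peft import LoraConfig

# Sketch of a LoRA config covering Wan's attention projections.
# add_k_proj / add_v_proj carry the image conditioning (I2V cross-attention);
# to_q/to_k/to_v/to_out.0 are assumed standard Diffusers projection names.
lora_config = LoraConfig(
    r=128,
    lora_alpha=128,
    target_modules=[
        "to_q", "to_k", "to_v", "to_out.0",
        "add_k_proj", "add_v_proj",
    ],
)
```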
Included files:

- `train_wan_i2v_lora.py` — Full training script
- `TRAINING_GUIDE.md` — Detailed setup instructions

Prepare your dataset:

```bash
mkdir dataset
cp your_videos/*.mp4 dataset/
# Create captions.txt with format: video_name|SKSCHAR your prompt here
cat > dataset/captions.txt << 'EOF'
video_0|SKSCHAR woman walking in park
video_1|SKSCHAR woman talking to camera
EOF
```
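Each line maps a video's file stem to its caption, with the trigger word at the front. An illustrative parser for this format (the script's actual loader may differ):

```python
# Illustrative parser for dataset/captions.txt ("video_name|caption" per line);
# not the training script's actual code.
def load_captions(path="dataset/captions.txt"):
    captions = {}
    with open(path) as f:
        for line in f:
            line = line.strip()
            if not line or "|" not in line:
                continue
            name, caption = line.split("|", 1)
            captions[name] = caption  # e.g. "video_0" -> "SKSCHAR woman walking in park"
    return captions
```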
Install dependencies and launch training:

```bash
# Requires A100 80GB / L40S / H100 (48GB+ VRAM)
# ~$2-3/hour on RunPod/Vast.ai/Lambda Labs
pip install torch transformers diffusers accelerate peft
accelerate launch train_wan_i2v_lora.py \
--pretrained_model Wan-AI/Wan2.1-I2V-14B-480P-Diffusers \
--dataset_dir ./dataset \
--output_dir ./output \
--rank 128 \
--lora_alpha 128 \
--lr 1e-4 \
--max_steps 1000 \
--grad_accum 4 \
--mixed_precision bf16 \
--trigger_word SKSCHAR \
--push_to_hub \
--hub_model_id yourname/character-lora
```
Run inference with the trained LoRA:

```python
import torch
from diffusers import WanImageToVideoPipeline, AutoencoderKLWan
from diffusers.utils import export_to_video, load_image
from transformers import CLIPVisionModel
model_id = "Wan-AI/Wan2.1-I2V-14B-480P-Diffusers"
# VAE and image encoder in fp32 for quality; the transformer runs in bf16
pipe = WanImageToVideoPipeline.from_pretrained(
    model_id,
    vae=AutoencoderKLWan.from_pretrained(model_id, subfolder="vae", torch_dtype=torch.float32),
    image_encoder=CLIPVisionModel.from_pretrained(model_id, subfolder="image_encoder", torch_dtype=torch.float32),
    torch_dtype=torch.bfloat16,
)
pipe.to("cuda")

# Load the trained character LoRA and set its strength
pipe.load_lora_weights("./output/final", adapter_name="char")
pipe.set_adapters(["char"], [0.8])
# Reference image that the video will animate
image = load_image("reference.jpg")
output = pipe(
image=image,
prompt="SKSCHAR woman dancing gracefully",
height=480, width=832,
num_frames=81,
guidance_scale=5.0,
num_inference_steps=25,
).frames[0]
export_to_video(output, "result.mp4", fps=16)
```
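If you generate many clips at a fixed adapter strength, you can optionally bake the LoRA into the base weights with diffusers' `fuse_lora` (the 0.8 scale mirrors the `set_adapters` call above):

```python
# Optional: fuse the LoRA into the base weights for repeated generation,
# instead of applying it on the fly via set_adapters.
pipe.fuse_lora(lora_scale=0.8)
# ... run pipe(...) as above ...
pipe.unfuse_lora()  # restore the original weights
```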
Ready-to-use test dataset: [Useravailablepls/wan-i2v-lora-demo-videos](https://huggingface.co/datasets/Useravailablepls/wan-i2v-lora-demo-videos). Download it with:
```bash
python3 -c "
import requests, os

os.makedirs('demo_dataset', exist_ok=True)
base = 'https://huggingface.co/datasets/Useravailablepls/wan-i2v-lora-demo-videos/resolve/main/'
for f in ['video_0.mp4', 'video_1.mp4', 'captions.txt']:
    r = requests.get(base + f)
    open(f'demo_dataset/{f}', 'wb').write(r.content)
    print(f'Downloaded {f}')
"
```
| Parameter | Recommended | Why |
|---|---|---|
| `rank` | 128-256 (14B), 64 (1.3B) | Higher = better consistency |
| `lora_alpha` | = rank | Standard practice |
| `lr` | 1e-4 | Constant schedule |
| `max_steps` | 500-1000 | More steps risk overfitting |
| `grad_accum` | 4-8 | Sets effective batch size |
| `num_frames` | 81 | (81-1)/4+1 = 21 latent frames |
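The `num_frames` row follows from the VAE's 4× temporal compression, which keeps the first frame whole, so valid frame counts have the form 4k + 1:

```python
# Latent frame count under 4x temporal compression (first frame uncompressed):
# latent_frames = (num_frames - 1) // 4 + 1, so num_frames should be 4k + 1.
for n in (17, 49, 81):
    print(n, "->", (n - 1) // 4 + 1)  # 17 -> 5, 49 -> 13, 81 -> 21
```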
| Model | VRAM | GPU Examples | Cost/hr |
|---|---|---|---|
| 14B | 48-80GB | A100, L40S, H100 | $2-6 |
| 1.3B | 16-24GB | T4, A10G, RTX 3090 | Free-$1 |
Free options for 1.3B: Google Colab (T4), Kaggle (T4)
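For inference on the smaller GPUs in this table, the standard diffusers memory helpers trade speed for VRAM (tiling support depends on your diffusers version):

```python
# Standard diffusers memory savers for inference on smaller GPUs.
pipe.enable_model_cpu_offload()  # keep submodules on CPU until needed
pipe.vae.enable_tiling()         # decode latents in tiles (if supported)
```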
Citation:

```bibtex
@article{wan2025wan21,
  title={Wan 2.1: Comprehensive and Efficient Video Generation},
  author={Wan Video Team},
  journal={arXiv preprint arXiv:2503.20314},
  year={2025}
}
```
License: Apache-2.0 (same as the base Wan 2.1 model).