Professional training script for Wan 2.1 I2V LoRA that actually works for character consistency in video generation.
| Feature | wavespeed.ai | This Script |
|---|---|---|
| Data format | Images (zip) | Video clips |
| Rank | Max 64 | 128-512 |
| I2V layers | Unknown | add_k_proj, add_v_proj (image cross-attention) |
| Training type | T2I-style on images | True I2V on video |
| Temporal consistency | ❌ None | ✅ Learned from video |
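The `add_k_proj` / `add_v_proj` layers are where the reference image enters via cross-attention, so including them in the LoRA is what makes the adapter I2V-aware. A minimal sketch with `peft` (the other target-module names are assumptions based on Diffusers' usual attention naming, not confirmed from the script):

```python
from peft import LoraConfig

# Sketch of a LoRA config covering Wan's attention projections.
# add_k_proj / add_v_proj carry the image conditioning (I2V cross-attention);
# to_q/to_k/to_v/to_out.0 are assumed standard Diffusers projection names.
lora_config = LoraConfig(
    r=128,
    lora_alpha=128,
    target_modules=[
        "to_q", "to_k", "to_v", "to_out.0",
        "add_k_proj", "add_v_proj",
    ],
)
```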
Included files:

- `train_wan_i2v_lora.py` — Full training script
- `TRAINING_GUIDE.md` — Detailed setup instructions

Prepare your dataset:

```bash
mkdir dataset
cp your_videos/*.mp4 dataset/
# Create captions.txt with format: video_name|SKSCHAR your prompt here
cat > dataset/captions.txt << 'EOF'
video_0|SKSCHAR woman walking in park
video_1|SKSCHAR woman talking to camera
EOF
```
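Each line maps a video's file stem to its caption, with the trigger word at the front. An illustrative parser for this format (the script's actual loader may differ):

```python
# Illustrative parser for dataset/captions.txt ("video_name|caption" per line);
# not the training script's actual code.
def load_captions(path="dataset/captions.txt"):
    captions = {}
    with open(path) as f:
        for line in f:
            line = line.strip()
            if not line or "|" not in line:
                continue
            name, caption = line.split("|", 1)
            captions[name] = caption  # e.g. "video_0" -> "SKSCHAR woman walking in park"
    return captions
```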
Install dependencies and launch training:

```bash
# Requires A100 80GB / L40S / H100 (48GB+ VRAM)
# ~$2-3/hour on RunPod/Vast.ai/Lambda Labs
pip install torch transformers diffusers accelerate peft
accelerate launch train_wan_i2v_lora.py \
--pretrained_model Wan-AI/Wan2.1-I2V-14B-480P-Diffusers \
--dataset_dir ./dataset \
--output_dir ./output \
--rank 128 \
--lora_alpha 128 \
--lr 1e-4 \
--max_steps 1000 \
--grad_accum 4 \
--mixed_precision bf16 \
--trigger_word SKSCHAR \
--push_to_hub \
--hub_model_id yourname/character-lora
```
Run inference with the trained LoRA:

```python
import torch
from diffusers import WanImageToVideoPipeline, AutoencoderKLWan
from diffusers.utils import export_to_video, load_image
from transformers import CLIPVisionModel
model_id = "Wan-AI/Wan2.1-I2V-14B-480P-Diffusers"
# VAE and image encoder in fp32 for quality; the transformer runs in bf16
pipe = WanImageToVideoPipeline.from_pretrained(
    model_id,
    vae=AutoencoderKLWan.from_pretrained(model_id, subfolder="vae", torch_dtype=torch.float32),
    image_encoder=CLIPVisionModel.from_pretrained(model_id, subfolder="image_encoder", torch_dtype=torch.float32),
    torch_dtype=torch.bfloat16,
)
pipe.to("cuda")

# Load the trained character LoRA and set its strength
pipe.load_lora_weights("./output/final", adapter_name="char")
pipe.set_adapters(["char"], [0.8])
# Reference image that the video will animate
image = load_image("reference.jpg")
output = pipe(
image=image,
prompt="SKSCHAR woman dancing gracefully",
height=480, width=832,
num_frames=81,
guidance_scale=5.0,
num_inference_steps=25,
).frames[0]
export_to_video(output, "result.mp4", fps=16)
```
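If you generate many clips at a fixed adapter strength, you can optionally bake the LoRA into the base weights with diffusers' `fuse_lora` (the 0.8 scale mirrors the `set_adapters` call above):

```python
# Optional: fuse the LoRA into the base weights for repeated generation,
# instead of applying it on the fly via set_adapters.
pipe.fuse_lora(lora_scale=0.8)
# ... run pipe(...) as above ...
pipe.unfuse_lora()  # restore the original weights
```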
Ready-to-use test dataset: [Useravailablepls/wan-i2v-lora-demo-videos](https://huggingface.co/datasets/Useravailablepls/wan-i2v-lora-demo-videos). Download it with:
```bash
python3 -c "
import requests, os

os.makedirs('demo_dataset', exist_ok=True)
base = 'https://huggingface.co/datasets/Useravailablepls/wan-i2v-lora-demo-videos/resolve/main/'
for f in ['video_0.mp4', 'video_1.mp4', 'captions.txt']:
    r = requests.get(base + f)
    open(f'demo_dataset/{f}', 'wb').write(r.content)
    print(f'Downloaded {f}')
"
```
| Parameter | Recommended | Why |
|---|---|---|
| `rank` | 128-256 (14B), 64 (1.3B) | Higher = better consistency |
| `lora_alpha` | = rank | Standard practice |
| `lr` | 1e-4 | Constant schedule |
| `max_steps` | 500-1000 | More steps risk overfitting |
| `grad_accum` | 4-8 | Sets effective batch size |
| `num_frames` | 81 | (81-1)/4+1 = 21 latent frames |
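The `num_frames` row follows from the VAE's 4× temporal compression, which keeps the first frame whole, so valid frame counts have the form 4k + 1:

```python
# Latent frame count under 4x temporal compression (first frame uncompressed):
# latent_frames = (num_frames - 1) // 4 + 1, so num_frames should be 4k + 1.
for n in (17, 49, 81):
    print(n, "->", (n - 1) // 4 + 1)  # 17 -> 5, 49 -> 13, 81 -> 21
```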
| Model | VRAM | GPU Examples | Cost/hr |
|---|---|---|---|
| 14B | 48-80GB | A100, L40S, H100 | $2-6 |
| 1.3B | 16-24GB | T4, A10G, RTX 3090 | Free-$1 |
Free options for 1.3B: Google Colab (T4), Kaggle (T4)
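For inference on the smaller GPUs in this table, the standard diffusers memory helpers trade speed for VRAM (tiling support depends on your diffusers version):

```python
# Standard diffusers memory savers for inference on smaller GPUs.
pipe.enable_model_cpu_offload()  # keep submodules on CPU until needed
pipe.vae.enable_tiling()         # decode latents in tiles (if supported)
```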
Citation:

```bibtex
@article{wan2025wan21,
  title={Wan 2.1: Comprehensive and Efficient Video Generation},
  author={Wan Video Team},
  journal={arXiv preprint arXiv:2503.20314},
  year={2025}
}
```
License: Apache-2.0 (same as the base Wan 2.1 model).