Wan2.1 LoRA SFT for Robot Video Generation

A LoRA adapter fine-tuned from Wan2.1-I2V-14B-480P with the VideoX-Fun framework on the RoboTwin dataset.

Part of the WorldArena World Model Challenge.

Training Details

| Parameter | Value |
|---|---|
| Framework | VideoX-Fun |
| Base Model | Wan2.1-I2V-14B-480P |
| Training Data | RoboTwin aloha-agilex_clean_50 (2,500 videos, 50 tasks) |
| LoRA Rank | 32 |
| LoRA Alpha | 16 |
| LoRA Targets | q, k, v, ffn.0, ffn.2 |
| Learning Rate | 1e-4 (constant with warmup) |
| Warmup Steps | 100 |
| Precision | bf16 |
| Resolution | 640 × 640 |
| Frames | 81 per video |
| Batch Size | 1 per GPU (2 GPUs) |
| Total Steps | 12,500 (planned), 2,200 (completed) |
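
To make the LoRA and learning-rate rows above concrete, here is a minimal sketch of the two quantities they imply. It assumes the standard LoRA convention (the low-rank update is scaled by alpha / rank) and a linear warmup from 0 to the base rate; the card does not state either detail, and the function names `lora_scale` and `lr_at` are illustrative only, not part of VideoX-Fun.

```python
# Values copied from the training table above.
LORA_RANK = 32
LORA_ALPHA = 16
BASE_LR = 1e-4
WARMUP_STEPS = 100


def lora_scale(alpha: float, rank: int) -> float:
    """Scaling factor applied to the low-rank update in standard LoRA.

    Assumption: the trainer uses the common alpha / rank convention.
    """
    return alpha / rank


def lr_at(step: int, base_lr: float = BASE_LR, warmup_steps: int = WARMUP_STEPS) -> float:
    """Learning rate at a given optimizer step.

    Assumption: warmup is linear from 0 up to base_lr, then constant.
    """
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    return base_lr


if __name__ == "__main__":
    print(f"LoRA scale (alpha / rank): {lora_scale(LORA_ALPHA, LORA_RANK)}")
    for step in (0, 50, 100, 2200):
        print(f"step {step:>5}: lr = {lr_at(step):.2e}")
```

Under these assumptions the adapter update is applied at half strength (16 / 32), and the learning rate reaches its full value of 1e-4 at step 100, well before either released checkpoint.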

Checkpoints Available

| File | Training Steps | Description |
|---|---|---|
| checkpoint-1200.safetensors | 1,200 | ~1 epoch of training |
| checkpoint-2200.safetensors | 2,200 | Latest checkpoint, used for 1,000-video inference |

Each checkpoint has a ComfyUI-compatible version.
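
The "~1 epoch" note can be cross-checked against the training details. The sketch below converts step counts to epochs, assuming no gradient accumulation (the card does not mention any) and that each of the 2,500 videos is seen once per epoch; `GRAD_ACCUM_STEPS` and the helper names are hypothetical.

```python
import math

# Values taken from the Training Details section.
NUM_VIDEOS = 2_500
BATCH_PER_GPU = 1
NUM_GPUS = 2
GRAD_ACCUM_STEPS = 1  # assumption: not stated in the card


def steps_per_epoch(num_videos: int = NUM_VIDEOS) -> int:
    """Optimizer steps needed to see every training video once."""
    effective_batch = BATCH_PER_GPU * NUM_GPUS * GRAD_ACCUM_STEPS
    return math.ceil(num_videos / effective_batch)


def epochs_at(step: int) -> float:
    """Fraction of epochs completed after `step` optimizer steps."""
    return step / steps_per_epoch()


if __name__ == "__main__":
    print(f"steps per epoch: {steps_per_epoch()}")
    for step in (1_200, 2_200, 12_500):
        print(f"step {step:>6}: {epochs_at(step):.2f} epochs")
```

With these numbers one epoch is 1,250 steps, so checkpoint-1200 sits just under one epoch, checkpoint-2200 at about 1.76 epochs, and the planned 12,500 steps would correspond to 10 epochs.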

Usage

Inference with VideoX-Fun:

Evaluation Results (WorldArena Track 1, 10 samples)

| Metric | Baseline Wan2.1 | SFT-Wan2.1 (ckpt300) |
|---|---|---|
| Image Quality | 70.44 | 48.80 |
| Background Consistency | 69.58 | 90.40 |
| Subject Consistency | 59.90 | 82.33 |
| Flow Score | 16.18 | 21.29 |

On this 10-sample evaluation, SFT-Wan2.1 improves over the baseline in background consistency (+20.8), subject consistency (+22.4), and flow score (+5.1), while image quality drops (-21.6).
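
The per-metric differences can be reproduced directly from the results table. This is a small sketch that only subtracts the baseline score from the fine-tuned score; the dictionary and the `deltas` helper are illustrative and not part of the WorldArena tooling.

```python
# (baseline, fine-tuned) scores copied from the results table above.
RESULTS = {
    "Image Quality": (70.44, 48.80),
    "Background Consistency": (69.58, 90.40),
    "Subject Consistency": (59.90, 82.33),
    "Flow Score": (16.18, 21.29),
}


def deltas(results: dict[str, tuple[float, float]]) -> dict[str, float]:
    """Signed difference (fine-tuned minus baseline) for each metric."""
    return {name: round(sft - base, 2) for name, (base, sft) in results.items()}


if __name__ == "__main__":
    for name, diff in deltas(RESULTS).items():
        print(f"{name:<24} {diff:+.2f}")
```

Note that the numbers come from only 10 samples, so the differences are indicative rather than statistically established.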

Limitations

  • Intended for research on robot video world models
  • May fail on long-horizon manipulation and complex physical contact
  • Instruction-following accuracy is still limited (~54%)
  • Full 1,000-video WorldArena evaluation in progress
