Wan2.1 LoRA SFT for Robot Video Generation
Fine-tuned from Wan2.1-I2V-14B-480P using the VideoX-Fun framework on the RoboTwin dataset.
Part of the WorldArena World Model Challenge.
Training Details
| Parameter | Value |
|---|---|
| Framework | VideoX-Fun |
| Base Model | Wan2.1-I2V-14B-480P |
| Training Data | RoboTwin aloha-agilex_clean_50 (2,500 videos, 50 tasks) |
| LoRA Rank | 32 |
| LoRA Alpha | 16 |
| LoRA Targets | q, k, v, ffn.0, ffn.2 |
| Learning Rate | 1e-4 (constant with warmup) |
| Warmup Steps | 100 |
| Precision | bf16 |
| Resolution | 640 × 640 |
| Frames | 81 per video |
| Batch Size | 1 per GPU (2 GPUs) |
| Total Steps | 12,500 (planned), 2,200 (completed) |
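The table values imply a few derived quantities that are useful when comparing checkpoints. The sketch below works them out under stated assumptions: no gradient accumulation, the common LoRA convention of scaling the update by alpha / rank, and a linear warmup (the table only says "constant with warmup"). The variable and function names are illustrative and are not taken from VideoX-Fun.

```python
# Values copied from the training table above.
NUM_VIDEOS = 2_500
PER_GPU_BATCH = 1
NUM_GPUS = 2
PLANNED_STEPS = 12_500
COMPLETED_STEPS = 2_200
LORA_RANK = 32
LORA_ALPHA = 16
BASE_LR = 1e-4
WARMUP_STEPS = 100

# Assumption: no gradient accumulation, so each optimizer step
# consumes one sample per GPU.
effective_batch = PER_GPU_BATCH * NUM_GPUS
steps_per_epoch = NUM_VIDEOS // effective_batch
planned_epochs = PLANNED_STEPS / steps_per_epoch
completed_epochs = COMPLETED_STEPS / steps_per_epoch

# Assumption: the common LoRA convention, where the low-rank update
# is multiplied by alpha / rank before being added to the base weight.
lora_scale = LORA_ALPHA / LORA_RANK


def learning_rate(step: int) -> float:
    """Assumed schedule: linear warmup to BASE_LR, then constant."""
    if step < WARMUP_STEPS:
        return BASE_LR * step / WARMUP_STEPS
    return BASE_LR


if __name__ == "__main__":
    print(f"effective batch size: {effective_batch}")
    print(f"steps per epoch:      {steps_per_epoch}")
    print(f"planned epochs:       {planned_epochs:.2f}")
    print(f"completed epochs:     {completed_epochs:.2f}")
    print(f"LoRA scale:           {lora_scale}")
    print(f"LR at step 50:        {learning_rate(50):.2e}")
```

Under these assumptions one epoch is 1,250 steps, so the 12,500 planned steps correspond to 10 epochs and the 2,200 completed steps to roughly 1.76 epochs.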
Checkpoints Available
| File | Training Steps | Description |
|---|---|---|
| checkpoint-1200.safetensors | 1,200 | ~1 epoch of training |
| checkpoint-2200.safetensors | 2,200 | Latest checkpoint, used for 1,000-video inference |
Each checkpoint has a ComfyUI-compatible version.
Usage
Inference with VideoX-Fun:
Evaluation Results (WorldArena Track 1, 10 samples)
| Metric | Baseline Wan2.1 | SFT-Wan2.1 (ckpt300) |
|---|---|---|
| Image Quality | 70.44 | 48.80 |
| Background Consistency | 69.58 | 90.40 |
| Subject Consistency | 59.90 | 82.33 |
| Flow Score | 16.18 | 21.29 |
On these 10 samples, SFT-Wan2.1 improves on the baseline in background consistency (+20.8), subject consistency (+22.4), and flow score (+5.1), while image quality drops (-21.6).
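The per-metric differences can be recomputed directly from the table. This is a minimal sketch: the `RESULTS` dictionary and the `deltas` helper are illustrative names, and the numbers are copied from the table above.

```python
# (baseline, SFT) scores copied from the evaluation table above.
RESULTS = {
    "Image Quality": (70.44, 48.80),
    "Background Consistency": (69.58, 90.40),
    "Subject Consistency": (59.90, 82.33),
    "Flow Score": (16.18, 21.29),
}


def deltas(results: dict[str, tuple[float, float]]) -> dict[str, float]:
    """Return SFT minus baseline for each metric, rounded to 2 decimals."""
    return {
        metric: round(sft - baseline, 2)
        for metric, (baseline, sft) in results.items()
    }


if __name__ == "__main__":
    for metric, delta in deltas(RESULTS).items():
        print(f"{metric:<24} {delta:+.2f}")
```

Listing all four metrics side by side makes the trade-off visible: the consistency and flow metrics rise while image quality falls.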
Limitations
- Intended for research on robot video world models
- May fail on long-horizon manipulation and complex physical contact
- Instruction-following accuracy is still limited (~54%)
- Full 1,000-video WorldArena evaluation in progress