RefAlign: Representation Alignment for Reference-to-Video Generation
Paper • arXiv:2603.25743 • Published
RefAlign achieves state-of-the-art performance on OpenS2V-Eval across multiple metrics.
| Model | Venue | TotalScore ↑ | Aesthetic ↑ | MotionSmoothness ↑ | MotionAmplitude ↑ | FaceSim ↑ | GmeScore ↑ | NexusScore ↑ | NaturalScore ↑ |
|---|---|---|---|---|---|---|---|---|---|
| 🥇 RefAlign-14B (Ours) | Open-Source | 60.42% | 46.84% | 97.61% | 22.48% | 55.23% | 68.32% | 48.52% | 73.63% |
| 🥈 RefAlign-1.3B (Ours) | Open-Source | 56.30% | 42.96% | 94.74% | 20.74% | 53.06% | 66.85% | 43.97% | 66.25% |
| Saber | Closed-Source | 57.91% | 42.42% | 96.12% | 21.12% | 49.89% | 67.50% | 47.22% | 72.55% |
| VINO | Open-Source | 57.85% | 45.92% | 94.73% | 12.30% | 52.00% | 69.69% | 42.67% | 71.99% |
| BindWeave | Closed-Source | 57.61% | 45.55% | 95.90% | 13.91% | 53.71% | 67.79% | 46.84% | 66.85% |
| VACE-14B | Open-Source | 57.55% | 47.21% | 94.97% | 15.02% | 55.09% | 67.27% | 44.08% | 67.04% |
| Phantom-14B | Open-Source | 56.77% | 46.39% | 96.31% | 33.42% | 51.46% | 70.65% | 37.43% | 69.35% |
| Kling1.6 | Closed-Source | 56.23% | 44.59% | 86.93% | 41.60% | 40.10% | 66.20% | 45.89% | 74.59% |
| Phantom-1.3B | Open-Source | 54.89% | 46.67% | 93.30% | 14.29% | 48.56% | 69.43% | 42.48% | 62.50% |
| MAGREF-480P | Open-Source | 52.51% | 45.02% | 93.17% | 21.81% | 30.83% | 70.47% | 43.04% | 66.90% |
| SkyReels-A2-P14B | Open-Source | 52.25% | 39.41% | 87.93% | 25.60% | 45.95% | 64.54% | 43.75% | 60.32% |
| Vidu2.0 | Closed-Source | 51.95% | 41.48% | 90.45% | 13.52% | 35.11% | 67.57% | 43.37% | 65.88% |
⚠️ Note

The provided weights are DiT (Diffusion Transformer) checkpoints fine-tuned from Wan2.1. To run RefAlign:
- Download the original Wan2.1 model (including the VAE, text encoder, etc.).
- Replace the DiT weights in Wan2.1 with the RefAlign weights provided above.

No other components need modification.
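The weight swap above can be sketched as a small helper. This is a minimal illustration, not part of the official repo: the DiT filename `diffusion_pytorch_model.safetensors` and the function name are assumptions, so check the actual filename in your local Wan2.1 checkout before running it.

```python
# Sketch: swap the DiT weights in a local Wan2.1 directory for the
# RefAlign checkpoint, keeping a backup of the originals.
# NOTE: the default `dit_filename` is an assumption; verify it against
# your downloaded Wan2.1 files.
import shutil
from pathlib import Path


def swap_dit_weights(
    wan_dir: str,
    refalign_ckpt: str,
    dit_filename: str = "diffusion_pytorch_model.safetensors",
) -> Path:
    """Back up the original DiT weights, then copy the RefAlign checkpoint in."""
    target = Path(wan_dir) / dit_filename
    if target.exists():
        # Keep the original Wan2.1 DiT weights as a .bak file.
        shutil.copy2(target, Path(str(target) + ".bak"))
    shutil.copy2(refalign_ckpt, target)  # VAE, text encoder, etc. stay untouched
    return target
```

After the copy, point the inference scripts below at the Wan2.1 directory as usual; only the DiT weights differ.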
```shell
# Inference with RefAlign-1.3B
python examples/wanvideo/model_inference/Wan2.1-T2V-1.3B_subject.py

# Inference with RefAlign-14B
python examples/wanvideo/model_inference/Wan2.1-T2V-14B_subject.py
```
If you find RefAlign useful, please consider giving our repository a star (⭐) and citing our paper.
```bibtex
@misc{wang2026refalign,
      title={RefAlign: Representation Alignment for Reference-to-Video Generation},
      author={Lei Wang and Yuxin Song and Ge Wu and Haocheng Feng and Hang Zhou and Jingdong Wang and Yaxing Wang and Jian Yang},
      year={2026},
      eprint={2603.25743},
      archivePrefix={arXiv},
      primaryClass={cs.CV}
}
```
This project is based on DiffSynth-Studio. Thanks for their awesome work. We sincerely acknowledge the excellent and inspiring prior works Phantom, VINO, OpenS2V, Phantom-Data, and Wan2.1.
If you have any questions, please feel free to reach out to me at scitop1998@gmail.com.
Base model: Wan-AI/Wan2.1-T2V-1.3B