	---
	license: mit
	datasets:
	- BestWishYsh/OpenS2V-5M
	- ZhuoweiChen/Phantom-data-Koala36M
	base_model:
	- Wan-AI/Wan2.1-T2V-1.3B
	pipeline_tag: image-text-to-video
	---


	# 🚀 RefAlign: Representation Alignment for Reference-to-Video Generation

	[![arXiv](https://img.shields.io/badge/arXiv-RefAlign-b31b1b.svg)](https://arxiv.org/abs/2603.25743) [![arXiv](https://img.shields.io/badge/paper-RefAlign-b31b1b.svg)](https://arxiv.org/pdf/2603.25743) ![Visitors](https://visitor-badge.laobi.icu/badge?page_id=gudaochangsheng/RefAlign) [![HF-1.3B](https://img.shields.io/badge/HF-RefAlign--1.3B-yellow?logo=huggingface)](https://huggingface.co/gudaochangsheng/RefAlign-1.3B)
	[![HF-14B](https://img.shields.io/badge/HF-RefAlign--14B-yellow?logo=huggingface)](https://huggingface.co/gudaochangsheng/RefAlign-14B)
	[![MS-1.3B](https://img.shields.io/badge/ModelScope-RefAlign--1.3B-blue)](https://www.modelscope.cn/models/gudaochangsheng98/RefAlign-1.3B)
	[![MS-14B](https://img.shields.io/badge/ModelScope-RefAlign--14B-blue)](https://www.modelscope.cn/models/gudaochangsheng98/RefAlign-14B)
	[![Code](https://img.shields.io/badge/Code-RefAlign-black?style=flat&logo=github)](https://github.com/gudaochangsheng/RefAlign)
	[![Project Page](https://img.shields.io/badge/Project-Page-2ea44f?style=flat-square)](https://gudaochangsheng.github.io/RefAlign-Page/)

	<div align="center">
	<a href="https://gudaochangsheng.github.io/">Lei Wang</a><sup>1,2,*,&ddagger;</sup>,
	<a href="https://scholar.google.com/citations?hl=zh-TW&user=1uL_9HAAAAAJ">Yuxin Song</a><sup>2,&ddagger;</sup>,
	<a href="https://github.com/Martinser">Ge Wu</a><sup>1</sup>,
	<a href="https://scholar.google.com.hk/citations?user=pnuQ5UsAAAAJ&hl=zh-CN&oi=ao">Haocheng Feng</a><sup>2</sup>,
	<a href="https://hangz-nju-cuhk.github.io/">Hang Zhou</a><sup>2</sup>,
	<a href="https://jingdongwang2017.github.io/">Jingdong Wang</a><sup>2</sup>,
	<a href="https://yaxingwang.github.io/">Yaxing Wang</a><sup>4,&dagger;</sup>,
	<a href="https://scholar.google.com.hk/citations?user=6CIDtZQAAAAJ&hl=en">Jian Yang</a><sup>1,3,&dagger;</sup>
	</div>

	<div align="center">
	<sup>1</sup> PCA Lab, VCIP, College of Computer Science, Nankai University
	<sup>2</sup> Baidu Inc.
	<sup>3</sup> PCA Lab, School of Intelligence Science and Technology, Nanjing University
	<sup>4</sup> College of Artificial Intelligence, Jilin University
	</div>

	<div align="center">
	&dagger;Corresponding authors &nbsp; *Intern at Baidu Inc. &nbsp; &ddagger;Equal contribution
	</div>

	<div align="center">
	<img src="asserts/abstract-refalign.png" alt="demo" style="width: 100%;" />
	<br>
	</div>

	---

	## 🏆 OpenS2V-Eval Leaderboard

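	> On [OpenS2V-Eval](https://huggingface.co/spaces/BestWishYsh/OpenS2V-Eval), RefAlign-14B achieves the highest TotalScore (60.42%) among the models listed below and also leads on MotionSmoothness, FaceSim, and NexusScore. RefAlign-1.3B outscores Phantom-1.3B, the other 1.3B model listed, on TotalScore (56.30% vs. 54.89%).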

	| Model | Venue | TotalScore ↑ | Aesthetic ↑ | MotionSmoothness ↑ | MotionAmplitude ↑ | FaceSim ↑ | GmeScore ↑ | NexusScore ↑ | NaturalScore ↑ |
	|---|---|---:|---:|---:|---:|---:|---:|---:|---:|
	| 🥇 RefAlign-14B (Ours) | Open-Source | 60.42% | 46.84% | 97.61% | 22.48% | 55.23% | 68.32% | 48.52% | 73.63% |
	| 🥇 RefAlign-1.3B (Ours) | Open-Source | 56.30% | 42.96% | 94.74% | 20.74% | 53.06% | 66.85% | 43.97% | 66.25% |
	| Saber | Closed-Source | 57.91% | 42.42% | 96.12% | 21.12% | 49.89% | 67.50% | 47.22% | 72.55% |
	| VINO | Open-Source | 57.85% | 45.92% | 94.73% | 12.30% | 52.00% | 69.69% | 42.67% | 71.99% |
	| BindWeave | Closed-Source | 57.61% | 45.55% | 95.90% | 13.91% | 53.71% | 67.79% | 46.84% | 66.85% |
	| VACE-14B | Open-Source | 57.55% | 47.21% | 94.97% | 15.02% | 55.09% | 67.27% | 44.08% | 67.04% |
	| Phantom-14B | Open-Source | 56.77% | 46.39% | 96.31% | 33.42% | 51.46% | 70.65% | 37.43% | 69.35% |
	| Kling1.6 | Closed-Source | 56.23% | 44.59% | 86.93% | 41.60% | 40.10% | 66.20% | 45.89% | 74.59% |
	| Phantom-1.3B | Open-Source | 54.89% | 46.67% | 93.30% | 14.29% | 48.56% | 69.43% | 42.48% | 62.50% |
	| MAGREF-480P | Open-Source | 52.51% | 45.02% | 93.17% | 21.81% | 30.83% | 70.47% | 43.04% | 66.90% |
	| SkyReels-A2-P14B | Open-Source | 52.25% | 39.41% | 87.93% | 25.60% | 45.95% | 64.54% | 43.75% | 60.32% |
	| Vidu2.0 | Closed-Source | 51.95% | 41.48% | 90.45% | 13.52% | 35.11% | 67.57% | 43.37% | 65.88% |
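
To see at a glance which model leads each column, the table can be recomputed with a few lines of Python. This is only a convenience sketch: the numbers are copied by hand from the table above, and the names `METRICS`, `SCORES`, and `metric_leaders` are illustrative and not part of the RefAlign code base. All metrics are treated as higher-is-better, as the ↑ markers indicate.

```python
# Scores copied from the leaderboard table above (values in percent).
METRICS = [
    "TotalScore", "Aesthetic", "MotionSmoothness", "MotionAmplitude",
    "FaceSim", "GmeScore", "NexusScore", "NaturalScore",
]

SCORES = {
    "RefAlign-14B":     [60.42, 46.84, 97.61, 22.48, 55.23, 68.32, 48.52, 73.63],
    "RefAlign-1.3B":    [56.30, 42.96, 94.74, 20.74, 53.06, 66.85, 43.97, 66.25],
    "Saber":            [57.91, 42.42, 96.12, 21.12, 49.89, 67.50, 47.22, 72.55],
    "VINO":             [57.85, 45.92, 94.73, 12.30, 52.00, 69.69, 42.67, 71.99],
    "BindWeave":        [57.61, 45.55, 95.90, 13.91, 53.71, 67.79, 46.84, 66.85],
    "VACE-14B":         [57.55, 47.21, 94.97, 15.02, 55.09, 67.27, 44.08, 67.04],
    "Phantom-14B":      [56.77, 46.39, 96.31, 33.42, 51.46, 70.65, 37.43, 69.35],
    "Kling1.6":         [56.23, 44.59, 86.93, 41.60, 40.10, 66.20, 45.89, 74.59],
    "Phantom-1.3B":     [54.89, 46.67, 93.30, 14.29, 48.56, 69.43, 42.48, 62.50],
    "MAGREF-480P":      [52.51, 45.02, 93.17, 21.81, 30.83, 70.47, 43.04, 66.90],
    "SkyReels-A2-P14B": [52.25, 39.41, 87.93, 25.60, 45.95, 64.54, 43.75, 60.32],
    "Vidu2.0":          [51.95, 41.48, 90.45, 13.52, 35.11, 67.57, 43.37, 65.88],
}


def metric_leaders(scores):
    """Return {metric: model with the highest score} for every metric column."""
    return {
        metric: max(scores, key=lambda model: scores[model][i])
        for i, metric in enumerate(METRICS)
    }


leaders = metric_leaders(SCORES)
# Metrics on which RefAlign-14B has the top score in this table.
refalign_wins = [m for m, model in leaders.items() if model == "RefAlign-14B"]
print(refalign_wins)
# ['TotalScore', 'MotionSmoothness', 'FaceSim', 'NexusScore']
```

The remaining columns are led by VACE-14B (Aesthetic), Kling1.6 (MotionAmplitude and NaturalScore), and Phantom-14B (GmeScore).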

	## 📦 Model Weights

	| Model | Params | Hugging Face | ModelScope |
	|---|---:|---|---|
	| RefAlign-1.3B | 1.3B | [![HF Download](https://img.shields.io/badge/HuggingFace-Download-yellow?logo=huggingface)](https://huggingface.co/gudaochangsheng/RefAlign-1.3B) | [![MS Download](https://img.shields.io/badge/ModelScope-Download-blue)](https://www.modelscope.cn/models/gudaochangsheng98/RefAlign-1.3B) |
	| RefAlign-14B | 14B | [![HF Download](https://img.shields.io/badge/HuggingFace-Download-yellow?logo=huggingface)](https://huggingface.co/gudaochangsheng/RefAlign-14B) | [![MS Download](https://img.shields.io/badge/ModelScope-Download-blue)](https://www.modelscope.cn/models/gudaochangsheng98/RefAlign-14B) |

	> ⚠️ Note
	>
	> The provided weights are DiT (Diffusion Transformer) checkpoints fine-tuned from Wan2.1.
	> To run RefAlign, please:
	>
	> 1. Download the original [Wan2.1](https://huggingface.co/collections/Wan-AI/wan21) model (including VAE, text encoder, etc.).
	> 2. Replace the DiT weights in Wan2.1 with the RefAlign weights provided above.
	>
	> No modification is required for other components.
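
The two steps in the note above could be scripted roughly as follows. This is a minimal sketch, not the project's official setup procedure: it assumes the `huggingface_hub` package is installed, the `models/` directory layout is arbitrary, the 14B base repo ID is inferred from the inference script name, and the checkpoint file names are deliberately left as parameters because this README does not state them. The helper names `download_checkpoints` and `swap_dit_weights` are hypothetical.

```python
import shutil
from pathlib import Path

# Repo IDs as given in this README; the 14B base repo ID is an assumption
# inferred from the script name Wan2.1-T2V-14B_subject.py.
BASE_REPOS = {"1.3B": "Wan-AI/Wan2.1-T2V-1.3B", "14B": "Wan-AI/Wan2.1-T2V-14B"}
REFALIGN_REPOS = {
    "1.3B": "gudaochangsheng/RefAlign-1.3B",
    "14B": "gudaochangsheng/RefAlign-14B",
}


def download_checkpoints(size="1.3B", root="models"):
    """Step 1: fetch the Wan2.1 base model and the matching RefAlign weights."""
    # Imported lazily so the rest of the module works without the package.
    from huggingface_hub import snapshot_download

    base_dir = Path(root) / BASE_REPOS[size].split("/")[-1]
    refalign_dir = Path(root) / REFALIGN_REPOS[size].split("/")[-1]
    snapshot_download(repo_id=BASE_REPOS[size], local_dir=base_dir)
    snapshot_download(repo_id=REFALIGN_REPOS[size], local_dir=refalign_dir)
    return base_dir, refalign_dir


def swap_dit_weights(base_dir, refalign_ckpt, dit_filename):
    """Step 2: overwrite the base DiT file with the RefAlign checkpoint.

    `dit_filename` is the name of the DiT weight file inside the Wan2.1
    folder; check the downloaded folder for the real name. The original
    file is kept next to it with an `.orig` suffix.
    """
    target = Path(base_dir) / dit_filename
    if target.exists():
        shutil.copy2(target, target.with_name(target.name + ".orig"))
    shutil.copy2(refalign_ckpt, target)
    return target
```

A typical call would be `base, ref = download_checkpoints("1.3B")` followed by `swap_dit_weights(base, ref / "<refalign checkpoint file>", "<DiT file in Wan2.1>")`, with both file names filled in after inspecting the downloaded folders. The VAE and text encoder are left untouched, as the note requires.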

	## 🎬 Inference


	```shell
	# Run inference with RefAlign-1.3B
	python examples/wanvideo/model_inference/Wan2.1-T2V-1.3B_subject.py

	# Run inference with RefAlign-14B
	python examples/wanvideo/model_inference/Wan2.1-T2V-14B_subject.py
	```

	## Citation

	If you find RefAlign useful, please consider giving our repository a star (⭐) and citing our [paper](https://arxiv.org/abs/2603.25743).

	```bibtex
	@misc{wang2026refalign,
	  title={RefAlign: Representation Alignment for Reference-to-Video Generation},
	  author={Lei Wang and Yuxin Song and Ge Wu and Haocheng Feng and Hang Zhou and Jingdong Wang and Yaxing Wang and Jian Yang},
	  year={2026},
	  eprint={2603.25743},
	  archivePrefix={arXiv},
	  primaryClass={cs.CV}
	}
	```

	## Acknowledgement

	This project is built on [DiffSynth-Studio](https://github.com/modelscope/DiffSynth-Studio); thanks to its authors for their excellent work.
	We also gratefully acknowledge the prior work that inspired this project: [Phantom](https://github.com/Phantom-video/Phantom), [VINO](https://sotamak1r.github.io/VINO-web/), [OpenS2V](https://github.com/PKU-YuanGroup/OpenS2V-Nexus), [Phantom-Data](https://phantom-video.github.io/Phantom-Data/), and [Wan2.1](https://wan.video/).

	## Contact

	If you have any questions, please feel free to reach out to me at `scitop1998@gmail.com`.