| --- |
| license: mit |
| datasets: |
| - BestWishYsh/OpenS2V-5M |
| - ZhuoweiChen/Phantom-data-Koala36M |
| base_model: |
| - Wan-AI/Wan2.1-T2V-1.3B |
| pipeline_tag: image-text-to-video |
| --- |
| |
|
|
| # RefAlign: Representation Alignment for Reference-to-Video Generation
|
|
| [arXiv](https://arxiv.org/abs/2603.25743) [PDF](https://arxiv.org/pdf/2603.25743) [Hugging Face: RefAlign-1.3B](https://huggingface.co/gudaochangsheng/RefAlign-1.3B)
| [Hugging Face: RefAlign-14B](https://huggingface.co/gudaochangsheng/RefAlign-14B)
| [ModelScope: RefAlign-1.3B](https://www.modelscope.cn/models/gudaochangsheng98/RefAlign-1.3B)
| [ModelScope: RefAlign-14B](https://www.modelscope.cn/models/gudaochangsheng98/RefAlign-14B)
| [GitHub](https://github.com/gudaochangsheng/RefAlign)
| [Project Page](https://gudaochangsheng.github.io/RefAlign-Page/)
|
|
| <div align="center"> |
| <a href="https://gudaochangsheng.github.io/">Lei Wang</a><sup>1,2,*,‡</sup>, |
| <a href="https://scholar.google.com/citations?hl=zh-TW&user=1uL_9HAAAAAJ">Yuxin Song</a><sup>2,‡</sup>, |
| <a href="https://github.com/Martinser">Ge Wu</a><sup>1</sup>, |
| <a href="https://scholar.google.com.hk/citations?user=pnuQ5UsAAAAJ&hl=zh-CN&oi=ao">Haocheng Feng</a><sup>2</sup>, |
| <a href="https://hangz-nju-cuhk.github.io/">Hang Zhou</a><sup>2</sup>, |
| <a href="https://jingdongwang2017.github.io/">Jingdong Wang</a><sup>2</sup>, |
| <a href="https://yaxingwang.github.io/">Yaxing Wang</a><sup>4,†</sup>, |
| <a href="https://scholar.google.com.hk/citations?user=6CIDtZQAAAAJ&hl=en">Jian Yang</a><sup>1,3,†</sup> |
| </div> |
| |
| <div align="center"> |
| <sup>1</sup> PCA Lab, VCIP, College of Computer Science, Nankai University |
| <sup>2</sup> Baidu Inc. |
| <sup>3</sup> PCA Lab, School of Intelligence Science and Technology, Nanjing University |
| <sup>4</sup> College of Artificial Intelligence, Jilin University |
| </div> |
| |
| <div align="center"> |
| †Corresponding authors; *Intern at Baidu Inc.; ‡Equal contribution |
| </div> |
|
|
| <div align="center"> |
| <img src="asserts/abstract-refalign.png" alt="demo" style="width: 100%;" /> |
| <br> |
| </div> |
|
|
| --- |
|
|
| ## OpenS2V-Eval Leaderboard
|
|
| > RefAlign-14B achieves the **highest TotalScore (60.42%)** among the models listed below on [OpenS2V-Eval](https://huggingface.co/spaces/BestWishYsh/OpenS2V-Eval), and also leads on MotionSmoothness, FaceSim, and NexusScore.
|
|
| Model | Venue | TotalScore ↑ | Aesthetic ↑ | MotionSmoothness ↑ | MotionAmplitude ↑ | FaceSim ↑ | GmeScore ↑ | NexusScore ↑ | NaturalScore ↑ |
| |---|---|---:|---:|---:|---:|---:|---:|---:|---:| |
| **RefAlign-14B (Ours)** | Open-Source | **60.42%** | 46.84% | **97.61%** | 22.48% | **55.23%** | 68.32% | **48.52%** | 73.63% |
| **RefAlign-1.3B (Ours)** | Open-Source | **56.30%** | 42.96% | 94.74% | 20.74% | 53.06% | 66.85% | 43.97% | 66.25% |
| | Saber | Closed-Source | 57.91% | 42.42% | 96.12% | 21.12% | 49.89% | 67.50% | 47.22% | 72.55% | |
| | VINO | Open-Source | 57.85% | 45.92% | 94.73% | 12.30% | 52.00% | 69.69% | 42.67% | 71.99% | |
| | BindWeave | Closed-Source | 57.61% | 45.55% | 95.90% | 13.91% | 53.71% | 67.79% | 46.84% | 66.85% | |
| | VACE-14B | Open-Source | 57.55% | **47.21%** | 94.97% | 15.02% | 55.09% | 67.27% | 44.08% | 67.04% | |
| | Phantom-14B | Open-Source | 56.77% | 46.39% | 96.31% | **33.42%** | 51.46% | **70.65%** | 37.43% | 69.35% | |
| | Kling1.6 | Closed-Source | 56.23% | 44.59% | 86.93% | **41.60%** | 40.10% | 66.20% | 45.89% | **74.59%** | |
| | Phantom-1.3B | Open-Source | 54.89% | 46.67% | 93.30% | 14.29% | 48.56% | 69.43% | 42.48% | 62.50% | |
| | MAGREF-480P | Open-Source | 52.51% | 45.02% | 93.17% | 21.81% | 30.83% | 70.47% | 43.04% | 66.90% | |
| | SkyReels-A2-P14B | Open-Source | 52.25% | 39.41% | 87.93% | 25.60% | 45.95% | 64.54% | 43.75% | 60.32% | |
| | Vidu2.0 | Closed-Source | 51.95% | 41.48% | 90.45% | 13.52% | 35.11% | 67.57% | 43.37% | 65.88% | |
|
|
| ## Model Weights
|
|
| | Model | Params | Hugging Face | ModelScope | |
| |---|---:|---|---| |
| RefAlign-1.3B | 1.3B | [gudaochangsheng/RefAlign-1.3B](https://huggingface.co/gudaochangsheng/RefAlign-1.3B) | [gudaochangsheng98/RefAlign-1.3B](https://www.modelscope.cn/models/gudaochangsheng98/RefAlign-1.3B) |
| RefAlign-14B | 14B | [gudaochangsheng/RefAlign-14B](https://huggingface.co/gudaochangsheng/RefAlign-14B) | [gudaochangsheng98/RefAlign-14B](https://www.modelscope.cn/models/gudaochangsheng98/RefAlign-14B) |
|
|
| > ⚠️ **Note**
| > |
| > The provided weights are **DiT (Diffusion Transformer) checkpoints fine-tuned from Wan2.1**. |
| > To run RefAlign, please: |
| > |
| > 1. Download the original **[Wan2.1](https://huggingface.co/collections/Wan-AI/wan21)** model (including VAE, text encoder, etc.). |
| > 2. Replace the **DiT weights** in Wan2.1 with the RefAlign weights provided above. |
| > |
| > No modification is required for other components. |
|
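The two download steps in the note above could be scripted roughly as follows. This is a minimal sketch, not the authors' tooling: the repo IDs come from this model card, while the `models/` layout and the helper names `plan_downloads` and `download_all` are assumptions made for illustration. Where the RefAlign checkpoint must finally be placed depends on the paths the inference scripts expect, so check those scripts before running.

```python
from pathlib import Path

# Repo IDs taken from this model card; the local layout is an assumption.
BASE_REPO = "Wan-AI/Wan2.1-T2V-1.3B"
REFALIGN_REPO = "gudaochangsheng/RefAlign-1.3B"


def plan_downloads(root="models"):
    """Map each repo ID to a hypothetical local directory under `root`."""
    root = Path(root)
    return {repo: root / repo.split("/")[-1] for repo in (BASE_REPO, REFALIGN_REPO)}


def download_all(root="models"):
    """Fetch both repos with huggingface_hub (needs network access)."""
    # Imported lazily so that planning works without the package installed.
    from huggingface_hub import snapshot_download

    for repo, target in plan_downloads(root).items():
        snapshot_download(repo_id=repo, local_dir=str(target))
```

Calling `download_all()` would fetch the base Wan2.1 model (VAE, text encoder, and original DiT) and the RefAlign DiT checkpoint into sibling folders. For the 14B variant, swap in the corresponding 14B repo IDs.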
|
| ## Inference
|
|
|
|
| ```shell |
| # Run inference with RefAlign-1.3B
| python examples/wanvideo/model_inference/Wan2.1-T2V-1.3B_subject.py
| |
| # Run inference with RefAlign-14B
| python examples/wanvideo/model_inference/Wan2.1-T2V-14B_subject.py
| ``` |
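If you prefer to launch the scripts from Python (for example, inside a batch job), the two commands above can be wrapped as in the sketch below. The script paths are the ones listed above. The `SCRIPTS` table and the `inference_command` helper are hypothetical names introduced here, not part of the repository.

```python
import sys

# Script paths as listed in the Inference section of this README.
SCRIPTS = {
    "1.3B": "examples/wanvideo/model_inference/Wan2.1-T2V-1.3B_subject.py",
    "14B": "examples/wanvideo/model_inference/Wan2.1-T2V-14B_subject.py",
}


def inference_command(size):
    """Return the argv list that runs the inference script for `size`."""
    if size not in SCRIPTS:
        raise ValueError(
            f"unknown model size {size!r}; expected one of {sorted(SCRIPTS)}"
        )
    # Use the current interpreter so the active environment is respected.
    return [sys.executable, SCRIPTS[size]]
```

From the repository root, `subprocess.run(inference_command("1.3B"), check=True)` would then behave like the first shell command above.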
| ## Citation |
|
|
| If you find RefAlign useful, please consider giving our repository a star (⭐) and citing our [paper](https://arxiv.org/abs/2603.25743).
|
|
| ```bibtex
| @misc{wang2026refalign, |
| title={RefAlign: Representation Alignment for Reference-to-Video Generation}, |
| author={Lei Wang and Yuxin Song and Ge Wu and Haocheng Feng and Hang Zhou and Jingdong Wang and Yaxing Wang and Jian Yang}, |
| year={2026}, |
| eprint={2603.25743}, |
| archivePrefix={arXiv}, |
| primaryClass={cs.CV} |
| } |
| ``` |
| ## Acknowledgement |
|
|
| This project is built on [DiffSynth-Studio](https://github.com/modelscope/DiffSynth-Studio); we thank its authors for their excellent work.
| We also gratefully acknowledge the prior work that inspired this project: [Phantom](https://github.com/Phantom-video/Phantom), [VINO](https://sotamak1r.github.io/VINO-web/), [OpenS2V](https://github.com/PKU-YuanGroup/OpenS2V-Nexus), [Phantom-Data](https://phantom-video.github.io/Phantom-Data/), and [Wan2.1](https://wan.video/).
| ## Contact |
| If you have any questions, please feel free to reach out to me at `scitop1998@gmail.com`. |