
🚀 RefAlign: Representation Alignment for Reference-to-Video Generation


Lei Wang1,2,*,‡, Yuxin Song2,‡, Ge Wu1, Haocheng Feng2, Hang Zhou2, Jingdong Wang2, Yaxing Wang4,†, Jian Yang1,3,†
1 PCA Lab, VCIP, College of Computer Science, Nankai University    2 Baidu Inc.    3 PCA Lab, School of Intelligence Science and Technology, Nanjing University    4 College of Artificial Intelligence, Jilin University
†Corresponding authors    *Interns at Baidu Inc.    ‡Equal contribution

πŸ† OpenS2V-Eval Leaderboard

RefAlign achieves state-of-the-art performance on OpenS2V-Eval across multiple metrics.

| Model | Venue | TotalScore ↑ | Aesthetic ↑ | MotionSmoothness ↑ | MotionAmplitude ↑ | FaceSim ↑ | GmeScore ↑ | NexusScore ↑ | NaturalScore ↑ |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 🥇 RefAlign-14B (Ours) | Open-Source | 60.42% | 46.84% | 97.61% | 22.48% | 55.23% | 68.32% | 48.52% | 73.63% |
| 🥇 RefAlign-1.3B (Ours) | Open-Source | 56.30% | 42.96% | 94.74% | 20.74% | 53.06% | 66.85% | 43.97% | 66.25% |
| Saber | Closed-Source | 57.91% | 42.42% | 96.12% | 21.12% | 49.89% | 67.50% | 47.22% | 72.55% |
| VINO | Open-Source | 57.85% | 45.92% | 94.73% | 12.30% | 52.00% | 69.69% | 42.67% | 71.99% |
| BindWeave | Closed-Source | 57.61% | 45.55% | 95.90% | 13.91% | 53.71% | 67.79% | 46.84% | 66.85% |
| VACE-14B | Open-Source | 57.55% | 47.21% | 94.97% | 15.02% | 55.09% | 67.27% | 44.08% | 67.04% |
| Phantom-14B | Open-Source | 56.77% | 46.39% | 96.31% | 33.42% | 51.46% | 70.65% | 37.43% | 69.35% |
| Kling1.6 | Closed-Source | 56.23% | 44.59% | 86.93% | 41.60% | 40.10% | 66.20% | 45.89% | 74.59% |
| Phantom-1.3B | Open-Source | 54.89% | 46.67% | 93.30% | 14.29% | 48.56% | 69.43% | 42.48% | 62.50% |
| MAGREF-480P | Open-Source | 52.51% | 45.02% | 93.17% | 21.81% | 30.83% | 70.47% | 43.04% | 66.90% |
| SkyReels-A2-P14B | Open-Source | 52.25% | 39.41% | 87.93% | 25.60% | 45.95% | 64.54% | 43.75% | 60.32% |
| Vidu2.0 | Closed-Source | 51.95% | 41.48% | 90.45% | 13.52% | 35.11% | 67.57% | 43.37% | 65.88% |

📦 Model Weights

| Model | Params | Hugging Face | ModelScope |
| --- | --- | --- | --- |
| RefAlign-1.3B | 1.3B | HF Download | MS Download |
| RefAlign-14B | 14B | HF Download | MS Download |

⚠️ Note

The provided weights are DiT (Diffusion Transformer) checkpoints fine-tuned from Wan2.1.
To run RefAlign, please:

  1. Download the original Wan2.1 model (including VAE, text encoder, etc.).
  2. Replace the DiT weights in Wan2.1 with the RefAlign weights provided above.

The other components require no modification.
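The two steps above can be sketched as a small helper that overwrites the Wan2.1 DiT checkpoint with the RefAlign one while keeping a backup. This is a minimal sketch, not part of the official repo; the checkpoint filename `diffusion_pytorch_model.safetensors` and the function name are assumptions and may differ from your local Wan2.1 layout.

```python
import shutil
from pathlib import Path


def swap_dit_weights(wan_dir, refalign_ckpt,
                     dit_name="diffusion_pytorch_model.safetensors"):
    """Replace the Wan2.1 DiT weights with a RefAlign checkpoint.

    Only the DiT file is touched; the VAE, text encoder, and all other
    components in ``wan_dir`` are left as-is. The original DiT file is
    kept next to it with a ``.bak`` suffix.
    """
    target = Path(wan_dir) / dit_name
    if not target.exists():
        raise FileNotFoundError(f"expected the Wan2.1 DiT checkpoint at {target}")
    shutil.copyfile(target, target.with_suffix(target.suffix + ".bak"))
    shutil.copyfile(refalign_ckpt, target)
    return target
```

Keeping the `.bak` copy lets you switch back to the vanilla Wan2.1 DiT without re-downloading anything.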

🎬 Inference

```shell
# Inference with RefAlign-1.3B
python examples/wanvideo/model_inference/Wan2.1-T2V-1.3B_subject.py

# Inference with RefAlign-14B
python examples/wanvideo/model_inference/Wan2.1-T2V-14B_subject.py
```

Citation

If you find RefAlign useful, please consider giving our repository a star (⭐) and citing our paper.

```bibtex
@misc{wang2026refalign,
  title={RefAlign: Representation Alignment for Reference-to-Video Generation},
  author={Lei Wang and Yuxin Song and Ge Wu and Haocheng Feng and Hang Zhou and Jingdong Wang and Yaxing Wang and Jian Yang},
  year={2026},
  eprint={2603.25743},
  archivePrefix={arXiv},
  primaryClass={cs.CV}
}
```

Acknowledgement

This project is built on DiffSynth-Studio; thanks for their excellent work. We also sincerely acknowledge the inspiring prior works Phantom, VINO, OpenS2V, Phantom-Data, and Wan2.1.

Contact

If you have any questions, please feel free to reach out to me at scitop1998@gmail.com.
