LumosX: Relate Any Identities with Their Attributes for Personalized Video Generation

Recent advances in diffusion models have significantly improved text-to-video generation, enabling personalized content creation with fine-grained control over both foreground and background elements. However, precise face-attribute alignment across subjects remains challenging, as existing methods lack explicit mechanisms to ensure intra-group consistency. We propose LumosX, a framework that advances both data and model design to achieve state-of-the-art performance in fine-grained, identity-consistent, and semantically aligned personalized multi-subject video generation.

arXiv OpenReview GitHub Project Page


💻 Authors

Jiazheng Xing1,4,2,*, Fei Du2,3,*, Hangjie Yuan2,3,1,*, Pengwei Liu1,2, Hongbin Xu4, Hai Ci4, Ruigang Niu2,3, Weihua Chen2,3†, Fan Wang2, Yong Liu1†

1Zhejiang University, 2DAMO Academy, Alibaba Group, 3Hupan Lab, 4National University of Singapore

*Equal contributions  ·  †Corresponding authors

Contact: jiazhengxing@zju.edu.cn, kugang.cwh@alibaba-inc.com, yongliu@iipc.zju.edu.cn

📘 Abstract

Recent advances in diffusion models have significantly improved text-to-video generation, enabling personalized content creation with fine-grained control over both foreground and background elements. However, precise face-attribute alignment across subjects remains challenging, as existing methods lack explicit mechanisms to ensure intra-group consistency. Addressing this gap requires both explicit modeling strategies and face-attribute-aware data resources. We therefore propose LumosX, a framework that advances both data and model design.

On the data side, a tailored collection pipeline orchestrates captions and visual cues from independent videos, while multimodal large language models (MLLMs) infer and assign subject-specific dependencies. These extracted relational priors impose a finer-grained structure that amplifies the expressive control of personalized video generation and enables the construction of a comprehensive benchmark.

On the modeling side, Relational Self-Attention and Relational Cross-Attention intertwine position-aware embeddings with refined attention dynamics to inscribe explicit subject-attribute dependencies, enforcing disciplined intra-group cohesion and amplifying the separation between distinct subject clusters. Comprehensive evaluations on our benchmark demonstrate that LumosX achieves state-of-the-art performance in fine-grained, identity-consistent, and semantically aligned personalized multi-subject video generation.
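The abstract does not describe how Relational Self-Attention or Relational Cross-Attention are implemented. As a rough illustration of the general idea only (stronger attention within a subject's group, weaker attention across groups), the toy sketch below adds a group-dependent bias to raw attention scores before the softmax. The function name `group_biased_attention` and the parameters `intra_bias` and `inter_penalty` are hypothetical and are not taken from the LumosX code.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def group_biased_attention(scores, group_ids, intra_bias=1.0, inter_penalty=1.0):
    """Toy group-aware attention weights (illustrative assumption, not LumosX).

    scores    : n x n list of raw attention logits (query i, key j)
    group_ids : length-n list; tokens of one subject and its attributes share an id
    Returns an n x n list of attention weights whose rows sum to 1.
    """
    n = len(group_ids)
    weights = []
    for i in range(n):
        biased = []
        for j in range(n):
            same_group = group_ids[i] == group_ids[j]
            # Raise logits inside a group, lower them across groups.
            bias = intra_bias if same_group else -inter_penalty
            biased.append(scores[i][j] + bias)
        weights.append(softmax(biased))
    return weights


# Hypothetical layout: [face A, attribute of A, face B, attribute of B]
group_ids = [0, 0, 1, 1]
scores = [[0.0] * 4 for _ in range(4)]  # uniform logits to isolate the bias
weights = group_biased_attention(scores, group_ids)
for row in weights:
    print([round(w, 3) for w in row])
```

With uniform logits, each token puts more weight on tokens of its own group than on tokens of the other group, which is the intra-group cohesion and inter-group separation the abstract refers to. An additive bias is only one possible choice; a hard mask or position-aware embeddings would be alternatives.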

📜 News

[2026/1/26] Accepted by ICLR 2026!

[2026/3/21] Code is available in Lumos-Custom / LumosX!

📎 Citation

If you find this work useful, please cite:

@inproceedings{xinglumosx,
  title={LumosX: Relate Any Identities with Their Attributes for Personalized Video Generation},
  author={Xing, Jiazheng and Du, Fei and Yuan, Hangjie and Liu, Pengwei and Xu, Hongbin and Ci, Hai and Niu, Ruigang and Chen, Weihua and Wang, Fan and Liu, Yong},
  booktitle={The Fourteenth International Conference on Learning Representations},
  year={2026}
}

📣 Disclaimer

This is the official release channel for LumosX weights.
