Update #2 by Ockham98 - opened

README.md CHANGED
---
license: apache-2.0
language: en
library_name: pytorch
pipeline_tag: text-to-video
tags:
- video generation
- personalized-generation
- text-to-video
- diffusion
downloads: true
---

# LumosX: Relate Any Identities with Their Attributes for Personalized Video Generation

> Recent advances in diffusion models have significantly improved text-to-video generation, enabling personalized content creation with fine-grained control over both foreground and background elements. However, precise face-attribute alignment across subjects remains challenging, as existing methods lack explicit mechanisms to ensure intra-group consistency. We propose **LumosX**, a framework that advances both data and model design to achieve state-of-the-art performance in fine-grained, identity-consistent, and semantically aligned personalized multi-subject video generation.

[arXiv](https://arxiv.org/abs/46f333f179)
[OpenReview](https://openreview.net/forum?id=r5o6PWgzav)
[Code](https://github.com/alibaba-damo-academy/Lumos-Custom/tree/main/LumosX)
[Project Page](https://jiazheng-xing.github.io/lumosx-home/)

---

### Authors

<div align="center">

[Jiazheng Xing](https://jiazheng-xing.github.io/)<sup>1,4,2,\*</sup>, Fei Du<sup>2,3,\*</sup>, [Hangjie Yuan](https://jacobyuan7.github.io/)<sup>2,3,1,\*</sup>, Pengwei Liu<sup>1,2</sup>, Hongbin Xu<sup>4</sup>, Hai Ci<sup>4</sup>, Ruigang Niu<sup>2,3</sup>, Weihua Chen<sup>2,3,†</sup>, Fan Wang<sup>2</sup>, Yong Liu<sup>1,†</sup>

<sup>1</sup>Zhejiang University, <sup>2</sup>DAMO Academy, Alibaba Group, <sup>3</sup>Hupan Lab, <sup>4</sup>National University of Singapore

<sup>\*</sup>Equal contributions · <sup>†</sup>Corresponding authors

Contact: jiazhengxing@zju.edu.cn, kugang.cwh@alibaba-inc.com, yongliu@iipc.zju.edu.cn

</div>

<details>
<summary><strong>Click to view Abstract</strong></summary>

> Recent advances in diffusion models have significantly improved text-to-video generation, enabling personalized content creation with fine-grained control over both foreground and background elements. However, precise face-attribute alignment across subjects remains challenging, as existing methods lack explicit mechanisms to ensure intra-group consistency. Addressing this gap requires both explicit modeling strategies and face-attribute-aware data resources. We therefore propose **LumosX**, a framework that advances both data and model design.

> On the data side, a tailored collection pipeline orchestrates captions and visual cues from independent videos, while multimodal large language models (MLLMs) infer and assign subject-specific dependencies. These extracted relational priors impose a finer-grained structure that amplifies the expressive control of personalized video generation and enables the construction of a comprehensive benchmark.

> On the modeling side, Relational Self-Attention and Relational Cross-Attention intertwine position-aware embeddings with refined attention dynamics to inscribe explicit subject-attribute dependencies, enforcing disciplined intra-group cohesion and amplifying the separation between distinct subject clusters. Comprehensive evaluations on our benchmark demonstrate that LumosX achieves state-of-the-art performance in fine-grained, identity-consistent, and semantically aligned personalized multi-subject video generation.

</details>
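This model card does not ship standalone code for the relational attention modules. As a rough, hypothetical illustration of the general idea only — biasing attention scores so that tokens belonging to the same subject group (an identity and its attributes) attend to each other more strongly than to other groups — here is a minimal NumPy sketch. The function name, shapes, and the simple additive group bias are all assumptions for illustration, not the paper's actual implementation.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def relational_self_attention(tokens, group_ids, d=16, bias=4.0, seed=0):
    """Toy single-head self-attention with a subject-group score bias.

    tokens:    (n, f) token features
    group_ids: (n,)   subject-group id per token; tokens sharing an id
               (e.g. one identity's face token and its attribute tokens)
               get a positive score bias, pulling intra-group attention
               together and separating distinct subject clusters.
    NOTE: an illustrative stand-in, not LumosX's actual modules.
    """
    rng = np.random.default_rng(seed)
    f = tokens.shape[1]
    Wq, Wk, Wv = (rng.standard_normal((f, d)) / np.sqrt(f) for _ in range(3))
    q, k, v = tokens @ Wq, tokens @ Wk, tokens @ Wv
    scores = (q @ k.T) / np.sqrt(d)
    same_group = group_ids[:, None] == group_ids[None, :]
    scores = scores + bias * same_group   # explicit relational prior
    attn = softmax(scores, axis=-1)       # each row sums to 1
    return attn @ v, attn

# Four tokens, two subject groups: [face0, attr0, face1, attr1]
feats = np.random.default_rng(1).standard_normal((4, 16))
gids = np.array([0, 0, 1, 1])
out, attn = relational_self_attention(feats, gids)
```

With the group bias applied, most of each row's attention mass stays inside its own subject group, which is the intra-group cohesion effect the abstract describes.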
| 49 |
+
|
| 50 |
+
## ๐ News
|
| 51 |
+
|
| 52 |
+
**[2026/1/26]** Accepted by [ICLR 2026](https://iclr.cc/Conferences/2026) !
|
| 53 |
+
|
| 54 |
+
**[2026/3/21]** Code is available in [Lumos-Custom / LumosX](https://github.com/alibaba-damo-academy/Lumos-Custom/tree/main/LumosX) !
|
| 55 |
+
|
| 56 |
+
|
| 57 |
+
|
| 58 |
+
## ๐ Citation
|
| 59 |
+
|
| 60 |
+
If you find this work useful, please cite:
|
| 61 |
+
|
| 62 |
+
```bibtex
|
| 63 |
+
@inproceedings{xinglumosx,
|
| 64 |
+
title={LumosX: Relate Any Identities with Their Attributes for Personalized Video Generation},
|
| 65 |
+
author={Xing, Jiazheng and Du, Fei and Yuan, Hangjie and Liu, Pengwei and Xu, Hongbin and Ci, Hai and Niu, Ruigang and Chen, Weihua and Wang, Fan and Liu, Yong},
|
| 66 |
+
booktitle={The Fourteenth International Conference on Learning Representations}
|
| 67 |
+
}
|
| 68 |
+
```
|

## Disclaimer

This is the official release channel for LumosX weights.