LumosX: Relate Any Identities with Their Attributes for Personalized Video Generation
Abstract
The LumosX framework enhances text-to-video generation through relational attention mechanisms and a structured data pipeline, improving face-attribute alignment and subject consistency.
Recent advances in diffusion models have significantly improved text-to-video generation, enabling personalized content creation with fine-grained control over both foreground and background elements. However, precise face-attribute alignment across subjects remains challenging, as existing methods lack explicit mechanisms to ensure intra-group consistency. Addressing this gap requires both explicit modeling strategies and face-attribute-aware data resources. We therefore propose LumosX, a framework that advances both data and model design. On the data side, a tailored collection pipeline aggregates captions and visual cues from independent videos, while multimodal large language models (MLLMs) infer and assign subject-specific dependencies. These extracted relational priors impose a finer-grained structure that strengthens expressive control in personalized video generation and enables the construction of a comprehensive benchmark. On the modeling side, Relational Self-Attention and Relational Cross-Attention combine position-aware embeddings with refined attention dynamics to encode explicit subject-attribute dependencies, enforcing intra-group cohesion and sharpening the separation between distinct subject clusters. Comprehensive evaluations on our benchmark demonstrate that LumosX achieves state-of-the-art performance in fine-grained, identity-consistent, and semantically aligned personalized multi-subject video generation. Code and models are available at https://jiazheng-xing.github.io/lumosx-home/.
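The abstract names the mechanism but not its implementation. As a rough illustration only, the sketch below shows one plausible way a "relational" self-attention layer could bind tokens to subject groups: each token carries a group id, a learned per-group embedding acts as the position-aware cue, and an additive logit bias rewards same-group attention while penalizing cross-group attention. The class name, the `group_ids` input, and the intra/inter bias scheme are all assumptions for illustration, not the paper's actual design.

```python
# Hypothetical sketch of relational self-attention (PyTorch). All names,
# shapes, and the bias scheme are assumptions, not LumosX's implementation.
import torch
import torch.nn as nn
import torch.nn.functional as F


class RelationalSelfAttention(nn.Module):
    def __init__(self, dim: int, num_heads: int = 8, max_groups: int = 8):
        super().__init__()
        self.num_heads = num_heads
        self.head_dim = dim // num_heads
        self.qkv = nn.Linear(dim, dim * 3)
        self.proj = nn.Linear(dim, dim)
        # One learned "position-aware" embedding per subject/attribute group.
        self.group_embed = nn.Embedding(max_groups, dim)
        # Learned scalar biases on the attention logits: reward same-group
        # pairs (cohesion), penalize cross-group pairs (separation).
        self.intra_bias = nn.Parameter(torch.tensor(1.0))
        self.inter_bias = nn.Parameter(torch.tensor(-1.0))

    def forward(self, x: torch.Tensor, group_ids: torch.Tensor) -> torch.Tensor:
        # x: (B, N, dim) token features; group_ids: (B, N) integer labels
        # assigning each token to a subject/attribute cluster.
        B, N, dim = x.shape
        x = x + self.group_embed(group_ids)  # inject the relational cue
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        q = q.view(B, N, self.num_heads, self.head_dim).transpose(1, 2)
        k = k.view(B, N, self.num_heads, self.head_dim).transpose(1, 2)
        v = v.view(B, N, self.num_heads, self.head_dim).transpose(1, 2)

        attn = (q @ k.transpose(-2, -1)) / self.head_dim ** 0.5  # (B, h, N, N)
        # (B, N, N) mask: True where two tokens share a group.
        same_group = group_ids.unsqueeze(2) == group_ids.unsqueeze(1)
        bias = torch.where(same_group, self.intra_bias, self.inter_bias)
        attn = attn + bias.unsqueeze(1)  # broadcast the bias over heads
        out = F.softmax(attn, dim=-1) @ v
        out = out.transpose(1, 2).reshape(B, N, dim)
        return self.proj(out)


# Minimal usage: two subjects, each with three attribute tokens.
layer = RelationalSelfAttention(dim=64, num_heads=4)
feats = torch.randn(1, 6, 64)
groups = torch.tensor([[0, 0, 0, 1, 1, 1]])  # subject 0 vs. subject 1
print(layer(feats, groups).shape)  # torch.Size([1, 6, 64])
```

A relational cross-attention variant would presumably apply the same group bias between video tokens and per-subject reference tokens; that extension is left out here since the abstract gives no further detail.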
Community
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API
- Identity as Presence: Towards Appearance and Voice Personalized Joint Audio-Video Generation (2026)
- SIGMA: Selective-Interleaved Generation with Multi-Attribute Tokens (2026)
- Hierarchical Concept-to-Appearance Guidance for Multi-Subject Image Generation (2026)
- 3DreamBooth: High-Fidelity 3D Subject-Driven Video Generation Model (2026)
- ConsID-Gen: View-Consistent and Identity-Preserving Image-to-Video Generation (2026)
- DreamID-Omni: Unified Framework for Controllable Human-Centric Audio-Video Generation (2026)
- AnyPhoto: Multi-Person Identity Preserving Image Generation with ID Adaptive Modulation on Location Canvas (2026)