OmniShow: Unifying Multimodal Conditions for Human-Object Interaction Video Generation
Abstract
OmniShow is an end-to-end framework for human-object interaction video generation that effectively integrates multiple modalities through unified conditioning and attention mechanisms while addressing data scarcity via decoupled training strategies.
In this work, we study Human-Object Interaction Video Generation (HOIVG), which aims to synthesize high-quality human-object interaction videos conditioned on text, reference images, audio, and pose. This task holds significant practical value for automating content creation in real-world applications, such as e-commerce demonstrations, short video production, and interactive entertainment. However, existing approaches fail to accommodate all these requisite conditions. We present OmniShow, an end-to-end framework tailored for this practical yet challenging task, capable of harmonizing multimodal conditions and delivering industry-grade performance. To overcome the trade-off between controllability and quality, we introduce Unified Channel-wise Conditioning for efficient image and pose injection, and Gated Local-Context Attention to ensure precise audio-visual synchronization. To effectively address data scarcity, we develop a Decoupled-Then-Joint Training strategy that leverages a multi-stage training process with model merging to efficiently harness heterogeneous sub-task datasets. Furthermore, to fill the evaluation gap in this field, we establish HOIVG-Bench, a dedicated and comprehensive benchmark for HOIVG. Extensive experiments demonstrate that OmniShow achieves overall state-of-the-art performance across various multimodal conditioning settings, setting a solid standard for the emerging HOIVG task.
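The abstract names two conditioning mechanisms (Unified Channel-wise Conditioning and Gated Local-Context Attention) without implementation detail. The PyTorch sketch below shows one common way such mechanisms are realized: condition latents concatenated along the channel axis, and audio cross-attention restricted to a local temporal window behind a zero-initialized tanh gate. This is a hypothetical illustration, not OmniShow's released code; every module, parameter, and shape here (`ChannelwiseConditioner`, `GatedLocalContextAttention`, the window size, the gate) is an assumption.

```python
import torch
import torch.nn as nn


class ChannelwiseConditioner(nn.Module):
    """Assumed design: concatenate reference-image and pose latents with
    the noisy video latent along the channel axis, then project back to
    the backbone width with a 1x1x1 convolution."""

    def __init__(self, latent_ch: int = 16):
        super().__init__()
        # 3 * latent_ch inputs: noisy latent + image latent + pose latent.
        self.proj = nn.Conv3d(3 * latent_ch, latent_ch, kernel_size=1)

    def forward(self, noisy, ref_img, pose):
        # All tensors: (B, C, T, H, W); a static reference image is
        # broadcast across the time axis before concatenation.
        if ref_img.shape[2] == 1:
            ref_img = ref_img.expand(-1, -1, noisy.shape[2], -1, -1)
        return self.proj(torch.cat([noisy, ref_img, pose], dim=1))


class GatedLocalContextAttention(nn.Module):
    """Assumed design: each frame's video tokens cross-attend only to
    audio features in a local temporal window, and the result enters the
    residual stream through a zero-initialized tanh gate, so training
    starts from the unconditioned backbone's behavior."""

    def __init__(self, dim: int, audio_dim: int, window: int = 2):
        super().__init__()
        self.window = window
        self.attn = nn.MultiheadAttention(dim, num_heads=8, kdim=audio_dim,
                                          vdim=audio_dim, batch_first=True)
        self.gate = nn.Parameter(torch.zeros(1))  # tanh(0) = 0 at init

    def forward(self, x, audio):
        # x: (B, T, N, D) video tokens; audio: (B, T, D_a) per-frame feats.
        B, T, N, D = x.shape
        out = torch.zeros_like(x)
        for t in range(T):
            lo, hi = max(0, t - self.window), min(T, t + self.window + 1)
            ctx = audio[:, lo:hi]          # audio within the local window
            attended, _ = self.attn(x[:, t], ctx, ctx)
            out[:, t] = attended
        return x + torch.tanh(self.gate) * out
```

Similarly, a Decoupled-Then-Joint Training strategy with model merging is often realized as weight averaging of sub-task expert checkpoints followed by joint finetuning; `merge_experts` below is a hypothetical sketch of the merging step only.

```python
def merge_experts(state_dicts):
    """Average the parameters of several sub-task experts (assumed merge
    rule; the paper's actual scheme may be weighted or selective)."""
    return {k: torch.stack([sd[k].float() for sd in state_dicts]).mean(0)
            for k in state_dicts[0]}
```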
Community
🔥 Introducing OmniShow, an all-in-one model for Human-Object Interaction Video Generation.
- Project page: https://correr-zhou.github.io/OmniShow
- Paper: https://arxiv.org/pdf/2604.11804
- GitHub repo: https://github.com/Correr-Zhou/OmniShow
- HOIVG-Bench: https://huggingface.co/datasets/donghao-zhou/HOIVG-Bench
This is an automated message from the Librarian Bot. The following papers, similar to this one, were recommended by the Semantic Scholar API:
- Identity as Presence: Towards Appearance and Voice Personalized Joint Audio-Video Generation (2026)
- DISPLAY: Directable Human-Object Interaction Video Generation via Sparse Motion Guidance and Multi-Task Auxiliary (2026)
- SkyReels-V4: Multi-modal Video-Audio Generation, Inpainting and Editing model (2026)
- InterDyad: Interactive Dyadic Speech-to-Video Generation by Querying Intermediate Visual Guidance (2026)
- DreamVideo-Omni: Omni-Motion Controlled Multi-Subject Video Customization with Latent Identity Reinforcement Learning (2026)
- UniTalking: A Unified Audio-Video Framework for Talking Portrait Generation (2026)
- Kiwi-Edit: Versatile Video Editing via Instruction and Reference Guidance (2026)
Get this paper in your agent:
hf papers read 2604.11804
Don't have the latest CLI? Install it with:
curl -LsSf https://hf.co/cli/install.sh | bash
Models citing this paper: 0
Datasets citing this paper: 1
Spaces citing this paper: 0