From Pixels to Words -- Towards Native One-Vision Models at Scale Paper • 2605.28820 • Published 6 days ago • 68
SenseNova-U1: Unifying Multimodal Understanding and Generation with NEO-unify Architecture Paper • 2605.12500 • Published 21 days ago • 191
EmbodMocap: In-the-Wild 4D Human-Scene Reconstruction for Embodied Agents Paper • 2602.23205 • Published Feb 26 • 11
Bridging Semantic and Kinematic Conditions with Diffusion-based Discrete Motion Tokenizer Paper • 2603.19227 • Published Mar 19 • 42
The Quest for Generalizable Motion Generation: Data, Model, and Evaluation Paper • 2510.26794 • Published Oct 30, 2025 • 27
ConsistCompose: Unified Multimodal Layout Control for Image Composition Paper • 2511.18333 • Published Nov 23, 2025 • 5