PerceptionRubrics: Calibrating Multimodal Evaluation to Human Perception Paper • 2606.28322 • Published 8 days ago • 36
ImageWAM: Do World Action Models Really Need Video Generation, or Just Image Editing? Paper • 2606.19531 • Published 17 days ago • 22
Humanoid-GPT: Scaling Data and Structure for Zero-Shot Motion Tracking Paper • 2606.03985 • Published Jun 2 • 41
Qwen-VLA: Unifying Vision-Language-Action Modeling across Tasks, Environments, and Robot Embodiments Paper • 2605.30280 • Published May 28 • 146
Learning Humanoid End-Effector Control for Open-Vocabulary Visual Loco-Manipulation Paper • 2602.16705 • Published Feb 18 • 26
VLA-JEPA: Enhancing Vision-Language-Action Model with Latent World Model Paper • 2602.10098 • Published Feb 10 • 22
3D-Aware Implicit Motion Control for View-Adaptive Human Video Generation Paper • 2602.03796 • Published Feb 3 • 65
OmniSpatial Collection Collections of ICLR 2026 paper: "OmniSpatial: Towards Comprehensive Spatial Reasoning Benchmark for Vision Language Models" • 4 items • Updated Jan 27 • 1
GS-Reasoner Collection Collections of paper "Reasoning in Space via Grounding in the World" • 6 items • Updated Oct 20, 2025 • 2
InternVL3.5: Advancing Open-Source Multimodal Models in Versatility, Reasoning, and Efficiency Paper • 2508.18265 • Published Aug 25, 2025 • 224
ODYSSEY: Open-World Quadrupeds Exploration and Manipulation for Long-Horizon Tasks Paper • 2508.08240 • Published Aug 11, 2025 • 45
NextStep-1: Toward Autoregressive Image Generation with Continuous Tokens at Scale Paper • 2508.10711 • Published Aug 14, 2025 • 146
Diffuman4D: 4D Consistent Human View Synthesis from Sparse-View Videos with Spatio-Temporal Diffusion Models Paper • 2507.13344 • Published Jul 17, 2025 • 59
Open Vision Reasoner: Transferring Linguistic Cognitive Behavior for Visual Reasoning Paper • 2507.05255 • Published Jul 7, 2025 • 75
DreamVLA: A Vision-Language-Action Model Dreamed with Comprehensive World Knowledge Paper • 2507.04447 • Published Jul 6, 2025 • 45
OmniSpatial: Towards Comprehensive Spatial Reasoning Benchmark for Vision Language Models Paper • 2506.03135 • Published Jun 3, 2025 • 40
ShapeLLM-Omni: A Native Multimodal LLM for 3D Generation and Understanding Paper • 2506.01853 • Published Jun 2, 2025 • 32