[CLS] Token is All You Need for Zero-Shot Semantic Segmentation Paper • 2304.06212 • Published Apr 13, 2023
OmniSpatial: Towards Comprehensive Spatial Reasoning Benchmark for Vision Language Models Paper • 2506.03135 • Published Jun 3, 2025 • 40
Hybrid-grained Feature Aggregation with Coarse-to-fine Language Guidance for Self-supervised Monocular Depth Estimation Paper • 2510.09320 • Published Oct 10, 2025 • 3
Hierarchical Temporal Context Learning for Camera-based Semantic Scene Completion Paper • 2407.02077 • Published Jul 2, 2024
VLA-JEPA: Enhancing Vision-Language-Action Model with Latent World Model Paper • 2602.10098 • Published Feb 10 • 22
ReWorld: Multi-Dimensional Reward Modeling for Embodied World Models Paper • 2601.12428 • Published Jan 18
AIM: Intent-Aware Unified world action Modeling with Spatial Value Maps Paper • 2604.11135 • Published Apr 13
Disentangled Robot Learning via Separate Forward and Inverse Dynamics Pretraining Paper • 2604.16391 • Published Mar 27 • 4
Humanoid-GPT: Scaling Data and Structure for Zero-Shot Motion Tracking Paper • 2606.03985 • Published 27 days ago • 41
MaskWAM: Unifying Mask Prompting and Prediction for World-Action Models Paper • 2606.13515 • Published 18 days ago • 2
ImageWAM: Do World Action Models Really Need Video Generation, or Just Image Editing? Paper • 2606.19531 • Published 12 days ago • 21
AHA-WAM: Asynchronous Horizon-Adaptive World-Action Modeling with Observation-Guided Context Routing Paper • 2606.09811 • Published 21 days ago • 15
VLA-JEPA Collection VLA-JEPA model checkpoints (LIBERO, Pretrain, SimplerEnv) • 3 items • Updated May 28 • 14
FrameSkip: Learning from Fewer but More Informative Frames in VLA Training Paper • 2605.13757 • Published May 13 • 21