X-VLA: Soft-Prompted Transformer as Scalable Cross-Embodiment Vision-Language-Action Model • Paper • 2510.10274 • Published Oct 11 • 15
F1: A Vision-Language-Action Model Bridging Understanding and Generation to Actions • Paper • 2509.06951 • Published Sep 8 • 32
LoGoPlanner: Localization Grounded Navigation Policy with Metric-aware Visual Geometry • Paper • 2512.19629 • Published 3 days ago • 21
Towards Latency-Aware 3D Streaming Perception for Autonomous Driving • Paper • 2504.19115 • Published Apr 27 • 1
MMSI-Video-Bench: A Holistic Benchmark for Video-Based Spatial Intelligence • Paper • 2512.10863 • Published 14 days ago • 21
Ground Slow, Move Fast: A Dual-System Foundation Model for Generalizable Vision-and-Language Navigation • Paper • 2512.08186 • Published 17 days ago • 21
StreamVLN: Streaming Vision-and-Language Navigation via SlowFast Context Modeling • Paper • 2507.05240 • Published Jul 7 • 47