---
license: apache-2.0
short_description: 'Vision-Language-Action Models for Autonomous Driving: Past'
---
# Vision-Language-Action Models for Autonomous Driving: Past, Present, and Future

## Introduction

The pursuit of fully autonomous driving (AD) has long been a central goal of AI and robotics. Conventional AD systems typically adopt a modular "Perception-Decision-Action" pipeline, in which mapping, object detection, motion prediction, and trajectory planning are developed and optimized as separate components.

While this design has achieved strong performance in structured environments, its reliance on hand-crafted interfaces and rules limits adaptability in complex, dynamic, and long-tailed scenarios.

This survey reviews **Vision-Language-Action (VLA)** models, an emerging paradigm that integrates visual perception, natural language reasoning, and executable actions for autonomous driving. Charting the evolution from precursor **Vision-Action (VA)** models to modern VLA frameworks, we provide historical context and clarify the motivations behind this paradigm shift.

## Definition

**Vision-Action (VA):**
A vision-centric driving system that directly maps raw sensory observations to driving actions, thereby avoiding explicit modular decomposition into perception, prediction, and planning. VA models learn end-to-end policies through imitation learning or reinforcement learning.
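As a concrete illustration of this direct observation-to-control mapping, the sketch below flattens a camera frame and maps it to continuous controls in one step. The single linear layer, the toy input size, and the `[steering, throttle]` action layout are illustrative assumptions, not any specific model from the literature.

```python
import math
import random

random.seed(0)

class TinyVAPolicy:
    """Toy end-to-end policy: raw pixels in, continuous controls out.
    Real VA models use deep learned backbones; this sketch only shows
    the observation -> action interface, with random (untrained) weights."""

    def __init__(self, n_pixels, n_actions=2):
        # Hypothetical random weights; a real policy would learn these
        # via imitation or reinforcement learning.
        self.w = [[random.gauss(0.0, 0.01) for _ in range(n_pixels)]
                  for _ in range(n_actions)]
        self.b = [0.0] * n_actions

    def act(self, pixels):
        # tanh bounds each control to [-1, 1]: [steering, throttle].
        return [math.tanh(sum(wi * p for wi, p in zip(row, pixels)) + bi)
                for row, bi in zip(self.w, self.b)]

policy = TinyVAPolicy(n_pixels=64)
frame = [random.random() for _ in range(64)]   # stand-in for an image
action = policy.act(frame)
print(len(action))  # 2 controls: steering, throttle
```

The key point is the absence of intermediate perception or planning outputs: the policy is a single learned function from pixels to controls.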

**Vision-Language-Action (VLA):**
A multimodal reasoning system that couples visual perception with large vision-language models (VLMs) to produce executable driving actions. VLAs integrate visual understanding, linguistic reasoning, and actionable outputs within a unified framework, enabling more interpretable, generalizable, and human-aligned driving policies through natural language instructions and chain-of-thought reasoning.
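The coupling of language reasoning and action output can be sketched as a single inference step. Everything here is hypothetical scaffolding: the prompt format, the `DrivingAction` fields, and the stub model standing in for a real VLM backbone.

```python
from dataclasses import dataclass

@dataclass
class DrivingAction:
    steering: float  # normalized to [-1, 1]
    throttle: float  # normalized to [0, 1]

def vla_step(image_tokens, instruction, vlm):
    """One VLA inference step: the model consumes visual tokens plus a
    natural-language instruction, emits a chain-of-thought rationale,
    and terminates with a structured, executable action."""
    prompt = f"<image> Instruction: {instruction}\nReason step by step, then act."
    rationale, action_text = vlm(image_tokens, prompt)
    steering, throttle = (float(v) for v in action_text.split())
    return rationale, DrivingAction(steering, throttle)

def stub_vlm(image_tokens, prompt):
    # Stand-in for a real vision-language model.
    return "Pedestrian near crosswalk; keep lane and slow down.", "0.0 0.2"

rationale, action = vla_step([0] * 16, "drive cautiously", stub_vlm)
print(action)  # DrivingAction(steering=0.0, throttle=0.2)
```

Unlike the VA sketch, the intermediate rationale is a first-class output, which is what makes VLA policies inspectable and steerable through language.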