metadata
title: Vla4ad
emoji: 🔥
colorFrom: blue
colorTo: green
sdk: gradio
sdk_version: 6.1.0
app_file: app.py
pinned: false
license: apache-2.0
short_description: 'Vision-Language-Action Models for Autonomous Driving: Past'

Vision-Language-Action Models for Autonomous Driving: Past, Present, and Future

Introduction

The pursuit of fully autonomous driving (AD) has long been a central goal in AI and robotics. Conventional AD systems typically adopt a modular "Perception-Decision-Action" pipeline, where mapping, object detection, motion prediction, and trajectory planning are developed and optimized as separate components.

While this design has achieved strong performance in structured environments, its reliance on hand-crafted interfaces and rules limits adaptability in complex, dynamic, and long-tailed scenarios.

This survey reviews Vision-Language-Action (VLA) models, an emerging paradigm that integrates visual perception, natural language reasoning, and executable actions for autonomous driving. By charting the evolution from precursor Vision-Action (VA) models to modern VLA frameworks, we provide historical context and clarify the motivations behind this paradigm shift.

Definition

Vision-Action (VA): A vision-centric driving system that directly maps raw sensory observations to driving actions, thereby avoiding explicit modular decomposition into perception, prediction, and planning. VA models learn end-to-end policies through imitation learning or reinforcement learning.

Vision-Language-Action (VLA): A multimodal reasoning system that couples visual perception with large vision-language models (VLMs) to produce executable driving actions. VLAs integrate visual understanding, linguistic reasoning, and actionable outputs within a unified framework, enabling more interpretable, generalizable, and human-aligned driving policies through natural language instructions and chain-of-thought reasoning.
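
The two definitions differ mainly in their input/output interface, which can be made concrete with a small sketch. The Python below is illustrative only: the names `Observation`, `Action`, `toy_va_policy`, and `toy_vla_policy` are hypothetical and do not come from the survey, and the hand-written rules merely stand in for learned models. It shows a VA policy as a mapping from observations to actions, and a VLA policy as a mapping from observations plus a language instruction to a reasoning trace and an action.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

# Hypothetical stand-in for raw sensory input (e.g. lateral offsets from the lane centre).
Observation = List[float]


@dataclass
class Action:
    steer: float  # steering command, arbitrary units
    accel: float  # acceleration command, arbitrary units (negative = brake)


# VA interface: observation -> action, with no language in the loop.
VAPolicy = Callable[[Observation], Action]

# VLA interface: (observation, instruction) -> (reasoning text, action).
VLAPolicy = Callable[[Observation, str], Tuple[str, Action]]


def toy_va_policy(obs: Observation) -> Action:
    """Toy placeholder for a learned end-to-end policy: steer against the mean offset."""
    offset = sum(obs) / len(obs) if obs else 0.0
    return Action(steer=-offset, accel=0.5)


def toy_vla_policy(obs: Observation, instruction: str) -> Tuple[str, Action]:
    """Toy placeholder for a VLA: same observation, plus an instruction and a reasoning trace."""
    action = toy_va_policy(obs)
    if "stop" in instruction.lower():
        reasoning = "The instruction asks to stop, so brake while keeping the lane."
        action = Action(steer=action.steer, accel=-1.0)
    else:
        reasoning = "No stop request, so keep following the lane."
    return reasoning, action


if __name__ == "__main__":
    obs = [0.2, 0.4]
    print(toy_va_policy(obs))
    print(toy_vla_policy(obs, "Please stop at the crosswalk"))
```

In a real system both functions would be learned models; the sketch only highlights that a VLA takes language as an extra input and returns an explicit reasoning trace alongside the action.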