---
title: Vla4ad
emoji: 🔥
colorFrom: blue
colorTo: green
sdk: gradio
sdk_version: 6.1.0
app_file: app.py
pinned: false
license: apache-2.0
short_description: 'Vision-Language-Action Models for Autonomous Driving: Past'
---
# Vision-Language-Action Models for Autonomous Driving: Past, Present, and Future

## Introduction

The pursuit of fully autonomous driving (AD) has long been a central goal in AI and robotics. Conventional AD systems typically adopt a modular "Perception-Decision-Action" pipeline, in which mapping, object detection, motion prediction, and trajectory planning are developed and optimized as separate components. While this design has achieved strong performance in structured environments, its reliance on hand-crafted interfaces and rules limits adaptability in complex, dynamic, and long-tailed scenarios.
This survey reviews **Vision-Language-Action (VLA)** models, an emerging paradigm that integrates visual perception, natural language reasoning, and executable actions for autonomous driving. Charting the evolution from precursor **Vision-Action (VA)** models to modern VLA frameworks, we provide historical context and clarify the motivations behind this paradigm shift.
## Definition

**Vision-Action (VA)**:
A vision-centric driving system that directly maps raw sensory observations to driving actions, thereby avoiding explicit modular decomposition into perception, prediction, and planning. VA models learn end-to-end policies through imitation learning or reinforcement learning.
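To make the VA definition concrete, the following is a minimal PyTorch sketch of such an end-to-end policy. The backbone, waypoint parameterization, and layer sizes are illustrative assumptions, not a specific model from the literature; the point is only that raw frames map directly to controls with no intermediate perception or planning modules.

```python
import torch
import torch.nn as nn

class VAPolicy(nn.Module):
    """Illustrative end-to-end Vision-Action policy: camera frames in, waypoints out."""

    def __init__(self, num_waypoints: int = 4):
        super().__init__()
        self.num_waypoints = num_waypoints
        # A small CNN stands in for any visual backbone (ResNet, ViT, ...).
        self.encoder = nn.Sequential(
            nn.Conv2d(3, 32, kernel_size=5, stride=2), nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=5, stride=2), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        )
        # Regress (x, y) waypoints directly from visual features: no explicit
        # perception, prediction, or planning stages in between.
        self.head = nn.Linear(64, num_waypoints * 2)

    def forward(self, frames: torch.Tensor) -> torch.Tensor:
        # frames: (B, 3, H, W) -> waypoints: (B, num_waypoints, 2)
        return self.head(self.encoder(frames)).view(-1, self.num_waypoints, 2)

# Imitation learning then reduces to regressing expert trajectories, e.g.
#   loss = torch.nn.functional.mse_loss(policy(frames), expert_waypoints)
```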
**Vision-Language-Action (VLA)**:
A multimodal reasoning system that couples visual perception with large vision-language models (VLMs) to produce executable driving actions. VLAs integrate visual understanding, linguistic reasoning, and actionable outputs within a unified framework, enabling more interpretable, generalizable, and human-aligned driving policies through natural language instructions and chain-of-thought reasoning.
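For contrast, the sketch below shows the VLA interface pattern: an image and a language instruction go into a VLM, which reasons in text before emitting a parseable action. The `VisionLanguageModel` protocol, prompt format, and `DrivingAction` fields are hypothetical placeholders, not the API of any surveyed system.

```python
from dataclasses import dataclass
from typing import Any, Protocol

@dataclass
class DrivingAction:
    steer: float      # normalized steering angle in [-1, 1]
    throttle: float   # normalized throttle in [0, 1]
    rationale: str    # natural-language chain of thought behind the action

class VisionLanguageModel(Protocol):
    """Stand-in interface for any multimodal backbone (hypothetical)."""
    def generate(self, image: Any, prompt: str) -> str: ...

def vla_step(vlm: VisionLanguageModel, image: Any, instruction: str) -> DrivingAction:
    # Ask the model to reason in language before committing to numeric controls.
    prompt = (
        f"Instruction: {instruction}\n"
        "Describe the scene, reason step by step, then end with a line "
        "'ACTION: steer=<value> throttle=<value>'."
    )
    response = vlm.generate(image, prompt)
    # Everything before the final ACTION line is the interpretable rationale.
    rationale, _, action_line = response.rpartition("ACTION:")
    fields = dict(kv.split("=") for kv in action_line.split())
    return DrivingAction(float(fields["steer"]), float(fields["throttle"]), rationale.strip())
```

The design choice this illustrates is the one emphasized above: the language channel carries both the instruction in and the reasoning out, which is what makes the resulting policy inspectable and steerable in ways a pure VA regressor is not.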