	<div align="center">
	<h1>DriveWAM: Video Generative Priors Enable Scalable World-Action Modeling for Autonomous Driving</h1>

	<a href="https://arxiv.org/abs/2605.28544"><img src="https://img.shields.io/badge/Paper-b31b1b" alt="Paper"></a>
	<a href="https://chenshi3.github.io/drivewam.github.io/"><img src="https://img.shields.io/badge/Project_Page-green" alt="Project Page"></a>

	<br>

	Chen Shi\*, Jinrui Xu\*, Shaoshuai Shi, Kehua Sheng, Bo Zhang, Li Jiang†

	The Chinese University of Hong Kong, Shenzhen & Voyager Research, Didi Chuxing

	\*Equal Contribution, †Corresponding Author

	</div>

	DriveWAM is a joint video generation and action prediction model for autonomous driving. It adapts a pretrained video diffusion transformer into an autoregressive video-action policy: video and action streams are organized into a unified temporal token sequence and trained under a joint flow-matching objective. This preserves the video generation priors while extending the model to ego-motion action prediction.

	<!-- Demo video: add a github user-attachments link here -->

	## Highlights

	### NavSim

	Comparison on NAVSIM v1. \*: results with imitation learning. †: trained with multiple trajectory anchors. MV: multi-view cameras; SV: single-view camera; L: LiDAR.

	<div align="center">

	| Method | Ref | Sensors | NC ↑ | DAC ↑ | TTC ↑ | Comf. ↑ | EP ↑ | PDMS ↑ |
	|---|---|---|---|---|---|---|---|---|
	| Human | – | – | 100.0 | 100.0 | 100.0 | 99.9 | 87.5 | 94.8 |
	| UniAD | CVPR'23 | MV | 97.8 | 91.9 | 92.9 | 100.0 | 78.8 | 83.4 |
	| TransFuser | TPAMI'23 | MV & L | 97.7 | 92.8 | 92.8 | 100.0 | 79.2 | 84.0 |
	| PARA-Drive | CVPR'24 | MV | 97.9 | 92.4 | 93.0 | 99.8 | 79.3 | 84.0 |
	| LAW | ICLR'25 | SV | 96.4 | 95.4 | 88.7 | 99.9 | 81.7 | 84.6 |
	| DiffusionDrive | CVPR'25 | MV & L | 98.2 | 96.2 | 94.7 | 100.0 | 82.2 | 88.1 |
	| WoTE | ICCV'25 | MV & L | 98.5 | 96.8 | 94.4 | 99.9 | 81.9 | 88.3 |
	| VLA-based Methods | | | | | | | | |
	| ReCogDrive\* | ICLR'26 | MV | 98.1 | 94.7 | 94.2 | 100.0 | 80.9 | 86.5 |
	| DriveVLA-W0 | ICLR'26 | SV | 98.7 | 96.2 | 95.5 | 100.0 | 82.2 | 88.4 |
	| AutoVLA | NeurIPS'25 | MV | 98.4 | 95.6 | 98.0 | 99.9 | 81.9 | 89.1 |
	| DriveDreamer-Policy | arXiv'26 | MV | 98.4 | 97.1 | 95.1 | 100.0 | 83.5 | 89.2 |
	| DriveVLA-W0† | ICLR'26 | SV | 98.7 | 99.1 | 95.3 | 99.3 | 83.3 | 90.2 |
	| WA-based Methods | | | | | | | | |
	| Epona | ICCV'25 | SV | 97.9 | 95.1 | 93.8 | 99.9 | 80.4 | 86.2 |
	| WorldDrive | arXiv'26 | SV | 98.4 | 95.8 | 95.2 | 99.8 | 83.3 | 89.0 |
	| DriveWAM (Ours) | – | SV | 98.3 | 98.1 | 95.2 | 100.0 | 84.3 | 90.1 |

	</div>
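As a reminder of how the columns relate: NAVSIM v1 defines the PDM score per scenario as the product of the two penalty terms (NC, DAC) and a weighted average of EP, TTC, and comfort with weights 5, 5, and 2. The sketch below restates that formula only. The function name is ours, and this is not the code path the NavSim devkit uses.

```python
def pdm_score(nc, dac, ttc, comfort, ep):
    """Per-scenario PDM score; every input is expected to lie in [0, 1]."""
    # Soft terms: weighted average with weights 5 (EP), 5 (TTC), 2 (comfort).
    weighted = (5.0 * ep + 5.0 * ttc + 2.0 * comfort) / 12.0
    # Hard penalties: a collision (nc = 0) or leaving the drivable area
    # (dac = 0) zeroes the whole score.
    return nc * dac * weighted


# A scenario with no infractions and 80% ego progress.
print(round(pdm_score(nc=1.0, dac=1.0, ttc=1.0, comfort=1.0, ep=0.8), 4))  # 0.9167
```

Because the score is computed per scenario and then averaged over the split, plugging the column averages from the table into this function will not reproduce the PDMS column exactly.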

	### PhysicalAI-AV

	Comparison on PhysicalAI-Autonomous-Vehicles.

	<div align="center">

	| Method | Source | ADE@3s ↓ | FDE@3s ↓ | ADE@4s ↓ | FDE@4s ↓ |
	|---|---|---|---|---|---|
	| VaVAM | Valeo | 2.31 | 4.32 | – | – |
	| Alpamayo-1.5 | NVIDIA | 0.80 | 2.31 | 1.44 | 4.18 |
	| DriveWAM (Ours) | – | 0.47 | 1.35 | 0.83 | 2.47 |

	</div>
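For readers unfamiliar with the metrics: ADE@T is the mean Euclidean distance between predicted and ground-truth waypoints over the first T seconds, and FDE@T is the distance at the final waypoint of that horizon. The sketch below is a minimal illustration, not the repository's evaluation code. The 10 Hz waypoint rate, the metre units, and the function name are assumptions made for the example.

```python
import math


def ade_fde(pred, gt, horizon_s, hz=10.0):
    """Return (ADE, FDE) over the first `horizon_s` seconds.

    pred, gt: sequences of (x, y) future waypoints, one per step at `hz`.
    The 10 Hz default and metre units are assumptions of this sketch.
    """
    n = int(round(horizon_s * hz))
    if n < 1 or len(pred) < n or len(gt) < n:
        raise ValueError("trajectories are shorter than the requested horizon")
    # Per-step Euclidean displacement between prediction and ground truth.
    dists = [
        math.hypot(px - gx, py - gy)
        for (px, py), (gx, gy) in zip(pred[:n], gt[:n])
    ]
    return sum(dists) / n, dists[-1]


# Toy example: the prediction is offset sideways by 0.5 at every step.
gt = [(0.1 * (i + 1), 0.0) for i in range(40)]
pred = [(x, 0.5) for x, _ in gt]
print(ade_fde(pred, gt, horizon_s=3.0))
```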

	### Qualitative Results

	<div align="center">

	![Qualitative Results](assets/result_vis.jpg)

	</div>

	### Data Scaling

	DriveWAM's action prediction error decreases consistently as training data scales from 4k to 100k clips: ADE@4s falls from 1.21 to 0.92 without scene-evolving (SE) guidance and from 1.01 to 0.83 with it. SE guidance lowers both ADE and FDE at every scale.

	<div align="center">

	<img src="assets/data_scaling.jpg" width="300">

	| # Clips | # Iters | SE Guidance | ADE@4s ↓ | FDE@4s ↓ |
	|---|---|---|---|---|
	| 4k | 50k | ✗ | 1.21 | 3.65 |
	| 4k | 50k | ✓ | 1.01 | 2.95 |
	| 20k | 50k | ✗ | 0.95 | 2.94 |
	| 20k | 50k | ✓ | 0.94 | 2.65 |
	| 100k | 50k | ✗ | 0.92 | 2.75 |
	| 100k | 50k | ✓ | 0.83 | 2.47 |

	</div>
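To put the SE-guidance gains on a common footing, the snippet below computes the relative error reduction at each data scale. The numbers are copied from the table above; the helper name is ours.

```python
# (clips, SE guidance) -> (ADE@4s, FDE@4s), copied from the table above.
results = {
    ("4k", False): (1.21, 3.65),
    ("4k", True): (1.01, 2.95),
    ("20k", False): (0.95, 2.94),
    ("20k", True): (0.94, 2.65),
    ("100k", False): (0.92, 2.75),
    ("100k", True): (0.83, 2.47),
}


def relative_reduction(baseline, improved):
    """Fractional error reduction of `improved` relative to `baseline`."""
    return (baseline - improved) / baseline


for scale in ("4k", "20k", "100k"):
    ade_off, fde_off = results[(scale, False)]
    ade_on, fde_on = results[(scale, True)]
    print(
        f"{scale}: ADE -{relative_reduction(ade_off, ade_on):.1%}, "
        f"FDE -{relative_reduction(fde_off, fde_on):.1%}"
    )
```

At 100k clips this works out to roughly a 10% reduction in both ADE@4s and FDE@4s.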

	## News
	- [Jun 7, 2026] We open-source all code and model weights.
	- [May 27, 2026] We release the paper and project page.

	## Getting Started

	### Installation

	First, clone this repository and set up the environment.

	```bash
	git clone <repo-url>
	cd DriveWAM

	# 1. Create conda environment
	conda env create -f environment.yml
	conda activate drivewam

	# 2. Install PyTorch (CUDA 12.6)
	pip install torch==2.9.0 torchvision==0.24.0 --index-url https://download.pytorch.org/whl/cu126

	# 3. Install Flash Attention
	pip install flash-attn==2.8.3 --no-build-isolation
	```
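After installation, a quick check that the pinned versions were picked up can save a failed launch later. The sketch below uses only the standard library and compares installed distribution versions against the pins in the commands above. The helper name is ours, and it does not verify that CUDA or Flash Attention work at runtime.

```python
from importlib import metadata

# Version pins taken from the install commands above.
EXPECTED = {"torch": "2.9.0", "torchvision": "0.24.0", "flash-attn": "2.8.3"}


def check_versions(expected):
    """Return {package: (installed_or_None, expected, ok)}."""
    report = {}
    for name, want in expected.items():
        try:
            have = metadata.version(name)
        except metadata.PackageNotFoundError:
            have = None
        # Local build tags such as "+cu126" are ignored for the comparison.
        ok = have is not None and have.split("+")[0] == want
        report[name] = (have, want, ok)
    return report


for name, (have, want, ok) in check_versions(EXPECTED).items():
    status = "OK" if ok else "MISMATCH"
    print(f"{name}: installed={have} expected={want} {status}")
```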

	Two optional extras can be installed when you need the corresponding feature:

	```bash
	# NavSim evaluation extras (for the NavSim benchmark)
	pip install -r requirements-navsim.txt

	# VLM preprocessing extras (to generate navigation guidance)
	pip install vllm qwen-vl-utils
	```

	## Data Preparation

	DriveWAM trains and evaluates on two benchmarks. Prepare whichever you need.

	### NavSim

	Follow the [NavSim installation guide](https://github.com/autonomousvision/navsim) to download the nuPlan-based dataset and cache metric files, and export the environment variables from that guide (`OPENSCENE_DATA_ROOT`, `NUPLAN_MAPS_ROOT`, `NUPLAN_MAP_VERSION`). Then extract per-scene samples:

	```bash
	# navtrain split (training)
	python -m src.navsim.process_data --output-path ./data/navsim/trainval

	# navtest split (evaluation)
	python -m src.navsim.process_data \
	--navsim-log-path $OPENSCENE_DATA_ROOT/navsim_logs/test \
	--sensor-blobs-path $OPENSCENE_DATA_ROOT/sensor_blobs/test \
	--scene-filter-config navsim/planning/script/config/common/train_test_split/scene_filter/navtest.yaml \
	--output-path ./data/navsim/test
	```

	Each scene becomes one pkl file, which is what the training and evaluation scripts read by default:

	```
	./data/navsim/trainval/
	├── sample_000000.pkl
	├── sample_000001.pkl
	└── ...
	```
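To inspect what the extraction step produced, the per-scene files can be loaded with the standard `pickle` module. This is a minimal sketch. The helper name is ours, the payload structure is not documented here and so is only printed by type, and unpickling may require the repository's modules to be importable.

```python
import pickle
from pathlib import Path


def iter_samples(root):
    """Yield (path, payload) for each sample_*.pkl under `root`, in name order."""
    for path in sorted(Path(root).glob("sample_*.pkl")):
        with path.open("rb") as f:
            # Only unpickle files you generated yourself; pickle can run code.
            yield path, pickle.load(f)


if __name__ == "__main__":
    root = Path("./data/navsim/trainval")
    for path, payload in iter_samples(root):
        print(path.name, type(payload).__name__)
        break  # inspect just the first sample
```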

	### PhysicalAI-Autonomous-Vehicles

	The raw dataset is hosted on [Hugging Face](https://huggingface.co/datasets/nvidia/PhysicalAI-Autonomous-Vehicles) and accessed through the [physical_ai_av](https://github.com/NVlabs/physical_ai_av) devkit. The devkit requires Python ≥ 3.11, so install it in a separate environment from `drivewam`:

	```bash
	pip install physical_ai_av
	```

	Accept the dataset license on the Hugging Face page, then download the dataset (or a subset of chunks) to `./data/physicalai`. DriveWAM only needs the `camera_front_wide_120fov` camera plus the egomotion and calibration features. Extract 10 Hz clips from the download:

	```bash
	python -m src.physicalai.process_data \
	--dataset_root ./data/physicalai \
	--output_dir ./data/physicalai/front \
	--num_workers 16
	```

	This writes one directory per clip:

	```
	./data/physicalai/
	├── clip_index.parquet # official train/test split; keep it even if you prune the raw chunks
	└── front/
	    └── <clip_id>/
	        ├── camera_front_wide_120fov.mp4
	        └── camera_front_wide_120fov_ego.pkl
	```
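Since extraction runs over many clips in parallel, it can be worth confirming that every clip directory contains both files from the layout above before training. The sketch below does only that. The helper name is ours, and it does not validate the contents of the video or the ego-motion pickle.

```python
from pathlib import Path

# File names taken from the directory layout above.
REQUIRED = ("camera_front_wide_120fov.mp4", "camera_front_wide_120fov_ego.pkl")


def find_incomplete_clips(front_dir):
    """Return {clip_id: [missing file names]} for clips lacking a required file."""
    incomplete = {}
    for clip_dir in sorted(p for p in Path(front_dir).iterdir() if p.is_dir()):
        missing = [name for name in REQUIRED if not (clip_dir / name).is_file()]
        if missing:
            incomplete[clip_dir.name] = missing
    return incomplete


if __name__ == "__main__":
    front = Path("./data/physicalai/front")
    if front.is_dir():
        for clip_id, missing in find_incomplete_clips(front).items():
            print(clip_id, "is missing", ", ".join(missing))
```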

	VLM navigation prompts are used as conditioning during training and inference. We provide pregenerated prompts: training-split prompts are available on [Hugging Face](https://huggingface.co/chenchenshi/DriveWAM); the 1k-sample test-split prompts used for evaluation are included in the repo at `src/physicalai/eval_data/prompts_test_sample_1k.json`. To regenerate them yourself:

	```bash
	# Step 1 – generate route / BEV / scene-evolving guidance
	bash scripts/drivewam_physicalai_vlm_preprocess.sh

	# Step 2 – VLM-based clip quality filtering and sub-sampling
	SPLIT=train \
	bash scripts/drivewam_physicalai_vlm_data_sample.sh
	```

	## Training

	DriveWAM model checkpoints are available on [Hugging Face](https://huggingface.co/chenchenshi/DriveWAM). DriveWAM is trained on top of [LingBot-VA Base](https://huggingface.co/robbyant/lingbot-va-base), a pretrained autoregressive diffusion transformer. Download the base model weights before training.

	Key training hyperparameters (see configs for full details):

	| Hyperparameter | NavSim / PhysicalAI |
	|---|---|
	| Training steps | 50,000 |
	| Learning rate | 1e-5 |
	| Optimizer | AdamW (β₁=0.9, β₂=0.95, wd=0.1) |
	| Warmup steps | 10 |
	| Batch size (per GPU) | 1 |
	| Precision | bfloat16 |
	| Input resolution | 256×448 |
	| SNR shift (video / action) | 5.0 / 1.0 |

	All experiments are conducted on 48 × NVIDIA H20 GPUs.
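The "SNR shift" row most likely refers to the timestep shift commonly used with flow-matching schedules, which remaps a noise level t in [0, 1] to t' = s·t / (1 + (s − 1)·t). That reading is an assumption on our part, and the configs remain the authority on how the values are applied. Under it, a shift of 5.0 pushes video tokens toward higher noise levels, while a shift of 1.0 leaves the action schedule unchanged. A minimal sketch, with a hypothetical function name:

```python
def shift_timestep(t, shift):
    """Remap a noise level t in [0, 1] (1 = pure noise) with a shift factor.

    Assumed form: t' = shift * t / (1 + (shift - 1) * t).
    shift = 1 is the identity; shift > 1 moves t toward 1.
    """
    if not 0.0 <= t <= 1.0:
        raise ValueError("t must lie in [0, 1]")
    return shift * t / (1.0 + (shift - 1.0) * t)


for t in (0.1, 0.5, 0.9):
    video = shift_timestep(t, shift=5.0)   # video stream, shift 5.0
    action = shift_timestep(t, shift=1.0)  # action stream, shift 1.0
    print(f"t={t:.1f} -> video {video:.3f}, action {action:.3f}")
```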

	Edit the config (`src/configs/navsim_cfg.py` or `src/configs/physicalai_cfg.py`) to set your paths and hyperparameters, then launch with the matching script:

	| Benchmark | Config | Launch script |
	|---|---|---|
	| NavSim | `src/configs/navsim_cfg.py` | `scripts/drivewam_navsim_train.sh` |
	| PhysicalAI | `src/configs/physicalai_cfg.py` | `scripts/drivewam_physicalai_train.sh` |

	```bash
	# NavSim
	bash scripts/drivewam_navsim_train.sh

	# PhysicalAI
	bash scripts/drivewam_physicalai_train.sh
	```

	For PhysicalAI, we provide three CSV files listing clip IDs at different training data scales (4k / 20k / 100k clips), available on [Hugging Face](https://huggingface.co/chenchenshi/DriveWAM). Set the `clip_csv` field in `src/configs/physicalai_cfg.py` to the desired scale before training.

	## Evaluation

	### NavSim (PDM Score)

	PDM score evaluation requires a metric cache — a set of per-scenario `.pkl` files that store precomputed map and route information to score each predicted trajectory. The precomputed metric cache for the navtest split is available for download on [Hugging Face](https://huggingface.co/chenchenshi/DriveWAM). To generate it yourself, run:

	```bash
	python navsim/planning/script/run_metric_caching.py \
	train_test_split=navtest \
	cache.cache_path=./data/navsim/metric_cache
	```

	This writes one `metric_cache.pkl` per scenario token under `./data/navsim/metric_cache/`. Pass the resulting directory to the evaluation script via `--metric-cache-path`.

	```bash
	python -m src.navsim.eval \
	--checkpoint-path /path/to/checkpoint \
	--config-name navsim_cfg \
	--dataset-path ./data/navsim/test \
	--metric-cache-path ./data/navsim/metric_cache
	```
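Before a long evaluation run, it can help to confirm that the metric cache directory is populated. The sketch below counts the `metric_cache.pkl` files described above. The helper name is ours, and the search is recursive because the exact nesting under the cache root is not specified here.

```python
from pathlib import Path


def count_metric_caches(cache_root):
    """Count metric_cache.pkl files anywhere under `cache_root`."""
    return sum(1 for _ in Path(cache_root).rglob("metric_cache.pkl"))


if __name__ == "__main__":
    root = Path("./data/navsim/metric_cache")
    if root.is_dir():
        print(f"{count_metric_caches(root)} cached scenarios under {root}")
```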

	### PhysicalAI

	```bash
	python -m src.physicalai.eval \
	--checkpoint-path /path/to/checkpoint \
	--config-name physicalai_cfg
	```

	## Citation

	If you find DriveWAM useful, please cite:

	```bibtex
	@article{shi2026drivewam,
	title={DriveWAM: Video Generative Priors Enable Scalable World-Action Modeling for Autonomous Driving},
	author={Shi, Chen and Xu, Jinrui and Shi, Shaoshuai and Sheng, Kehua and Zhang, Bo and Jiang, Li},
	journal={arXiv preprint arXiv:2605.28544},
	year={2026}
	}
	```

	## Acknowledgements

	We gratefully acknowledge the following open-source projects that DriveWAM builds upon: [Wan2.2](https://github.com/Wan-Video/Wan2.2), [LingBot-VA](https://github.com/robbyant/lingbot-va), [NavSim](https://github.com/autonomousvision/navsim), [NVIDIA PhysicalAI-Autonomous-Vehicles](https://huggingface.co/datasets/nvidia/PhysicalAI-Autonomous-Vehicles).