<div align="center">
<h1>DriveWAM: Video Generative Priors Enable Scalable World-Action Modeling for Autonomous Driving</h1>
<a href="https://arxiv.org/abs/2605.28544"><img src="https://img.shields.io/badge/Paper-b31b1b" alt="Paper"></a>
<a href="https://chenshi3.github.io/drivewam.github.io/"><img src="https://img.shields.io/badge/Project_Page-green" alt="Project Page"></a>
<br>
Chen Shi\*, Jinrui Xu\*, Shaoshuai Shi, Kehua Sheng, Bo Zhang, Li Jiang†
*The Chinese University of Hong Kong, Shenzhen &amp; Voyager Research, Didi Chuxing*
\*Equal Contribution, †Corresponding Author
</div>
**DriveWAM** is a joint **video generation and action prediction** model for autonomous driving. It adapts a pretrained video diffusion transformer into an **autoregressive video-action policy**, organizing video and action streams into a unified temporal token sequence trained under a joint flow-matching objective. This preserves video generation priors while extending the model to ego-motion action prediction.
<!-- Demo video: add a github user-attachments link here -->
## Highlights
### NavSim
Comparison on NAVSIM v1. \*: results with imitation learning. †: trained with multiple trajectory anchors. MV: multi-view cameras; SV: single-view camera; L: LiDAR.
<div align="center">
| Method | Ref | Sensors | NC ↑ | DAC ↑ | TTC ↑ | C. ↑ | EP ↑ | PDMS ↑ |
|---|---|---|---|---|---|---|---|---|
| Human | – | – | 100.0 | 100.0 | 100.0 | 99.9 | 87.5 | 94.8 |
| UniAD | CVPR'23 | MV | 97.8 | 91.9 | 92.9 | 100.0 | 78.8 | 83.4 |
| TransFuser | TPAMI'23 | MV & L | 97.7 | 92.8 | 92.8 | 100.0 | 79.2 | 84.0 |
| PARA-Drive | CVPR'24 | MV | 97.9 | 92.4 | 93.0 | 99.8 | 79.3 | 84.0 |
| LAW | ICLR'25 | SV | 96.4 | 95.4 | 88.7 | 99.9 | 81.7 | 84.6 |
| DiffusionDrive | CVPR'25 | MV & L | 98.2 | 96.2 | 94.7 | 100.0 | 82.2 | 88.1 |
| WoTE | ICCV'25 | MV & L | 98.5 | 96.8 | 94.4 | 99.9 | 81.9 | 88.3 |
| *VLA-based Methods* | | | | | | | | |
| ReCogDrive\* | ICLR'26 | MV | 98.1 | 94.7 | 94.2 | 100.0 | 80.9 | 86.5 |
| DriveVLA-W0 | ICLR'26 | SV | 98.7 | 96.2 | 95.5 | 100.0 | 82.2 | 88.4 |
| AutoVLA | NeurIPS'25 | MV | 98.4 | 95.6 | 98.0 | 99.9 | 81.9 | 89.1 |
| DriveDreamer-Policy | arXiv'26 | MV | 98.4 | 97.1 | 95.1 | 100.0 | 83.5 | 89.2 |
| DriveVLA-W0† | ICLR'26 | SV | 98.7 | 99.1 | 95.3 | 99.3 | 83.3 | 90.2 |
| *WA-based Methods* | | | | | | | | |
| Epona | ICCV'25 | SV | 97.9 | 95.1 | 93.8 | 99.9 | 80.4 | 86.2 |
| WorldDrive | arXiv'26 | SV | 98.4 | 95.8 | 95.2 | 99.8 | 83.3 | 89.0 |
| **DriveWAM (Ours)** | – | SV | 98.3 | **98.1** | 95.2 | 100.0 | **84.3** | **90.1** |
</div>
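For reference, NAVSIM v1 combines the sub-metrics above into the PDM Score (PDMS) per scenario: the two penalty terms (NC, DAC) multiply a weighted average of EP, TTC, and C with weights 5, 5, and 2. The sketch below illustrates that formula; the function name `pdm_score` is hypothetical and the inputs are assumed to be normalized to [0, 1]. Because the benchmark averages per-scenario scores, plugging the column averages from the table into this formula only approximates the reported PDMS.

```python
def pdm_score(nc: float, dac: float, ttc: float, comfort: float, ep: float) -> float:
    """Per-scenario PDM Score following the NAVSIM v1 definition (inputs in [0, 1])."""
    # Weighted average of ego progress, time-to-collision, and comfort.
    weighted = (5.0 * ep + 5.0 * ttc + 2.0 * comfort) / 12.0
    # No-collision and drivable-area compliance act as multiplicative penalties.
    return nc * dac * weighted


# "Human" row of the table, converted from percentages to fractions.
human = pdm_score(nc=1.0, dac=1.0, ttc=1.0, comfort=0.999, ep=0.875)
print(f"{100 * human:.1f}")
```

For the Human row, where NC and DAC are both 100, the result rounds to the 94.8 listed in the table.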
### PhysicalAI-AV
Comparison on PhysicalAI-Autonomous-Vehicles.
<div align="center">
| Method | Source | ADE@3s ↓ | FDE@3s ↓ | ADE@4s ↓ | FDE@4s ↓ |
|---|---|---|---|---|---|
| VaVAM | Valeo | 2.31 | 4.32 | - | - |
| Alpamayo-1.5 | NVIDIA | 0.80 | 2.31 | 1.44 | 4.18 |
| **DriveWAM (Ours)** | – | **0.47** | **1.35** | **0.83** | **2.47** |
</div>
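ADE (average displacement error) and FDE (final displacement error) in the table follow the usual definitions: the mean Euclidean distance between predicted and ground-truth waypoints up to the horizon, and the distance at the horizon itself. The following is a minimal sketch of those definitions, not the repository's evaluation code; the function name `ade_fde`, the 10 Hz waypoint rate, and the 2D `(x, y)` waypoint format are assumptions made for illustration.

```python
import math
from typing import Sequence, Tuple

Point = Tuple[float, float]


def ade_fde(pred: Sequence[Point], gt: Sequence[Point],
            horizon_s: float, hz: float = 10.0) -> Tuple[float, float]:
    """Return (ADE, FDE) over the first `horizon_s` seconds of two trajectories."""
    steps = int(round(horizon_s * hz))  # assumed waypoint rate, e.g. 3 s -> 30 steps
    if steps < 1 or len(pred) < steps or len(gt) < steps:
        raise ValueError("trajectories are shorter than the requested horizon")
    errors = [math.dist(p, g) for p, g in zip(pred[:steps], gt[:steps])]
    return sum(errors) / steps, errors[-1]


# Toy example: ground truth drives straight, prediction drifts 0.1 per step.
gt = [(float(i + 1), 0.0) for i in range(40)]
pred = [(float(i + 1), 0.1 * (i + 1)) for i in range(40)]
ade3, fde3 = ade_fde(pred, gt, horizon_s=3.0)
print(f"ADE@3s={ade3:.2f}, FDE@3s={fde3:.2f}")
```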
### Qualitative Results
<div align="center">
![Qualitative Results](assets/result_vis.jpg)
</div>
### Data Scaling
DriveWAM's action prediction error decreases consistently as training data scales from 4k to 100k clips: without scene-evolving (SE) guidance, ADE@4s falls from 1.21 to 0.92. SE guidance lowers both ADE@4s and FDE@4s further at every scale.
<div align="center">
<img src="assets/data_scaling.jpg" width="300">
| # Clips | # Iters | SE Guidance | ADE@4s ↓ | FDE@4s ↓ |
|---|---|---|---|---|
| 4k | 50k | ✗ | 1.21 | 3.65 |
| 4k | 50k | ✓ | 1.01 | 2.95 |
| 20k | 50k | ✗ | 0.95 | 2.94 |
| 20k | 50k | ✓ | 0.94 | 2.65 |
| 100k | 50k | ✗ | 0.92 | 2.75 |
| 100k | 50k | ✓ | **0.83** | **2.47** |
</div>
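To make the effect of SE guidance easier to compare across scales, the relative error reduction can be computed directly from the table. The snippet below is only arithmetic on the numbers above; the helper name `relative_reduction` is introduced here for illustration.

```python
# (clips, ADE@4s without SE, ADE@4s with SE, FDE@4s without SE, FDE@4s with SE)
ROWS = [
    ("4k", 1.21, 1.01, 3.65, 2.95),
    ("20k", 0.95, 0.94, 2.94, 2.65),
    ("100k", 0.92, 0.83, 2.75, 2.47),
]


def relative_reduction(before: float, after: float) -> float:
    """Percentage by which `after` is lower than `before`."""
    return 100.0 * (before - after) / before


for clips, ade_off, ade_on, fde_off, fde_on in ROWS:
    print(
        f"{clips:>4}: ADE@4s -{relative_reduction(ade_off, ade_on):.1f}%, "
        f"FDE@4s -{relative_reduction(fde_off, fde_on):.1f}%"
    )
```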
## News
- [Jun 7, 2026] We open-source all code and model weights.
- [May 27, 2026] We release the paper and project page.
## Getting Started
### Installation
First, clone this repository and set up the environment.
```bash
git clone <repo-url>
cd DriveWAM
# 1. Create conda environment
conda env create -f environment.yml
conda activate drivewam
# 2. Install PyTorch (CUDA 12.6)
pip install torch==2.9.0 torchvision==0.24.0 --index-url https://download.pytorch.org/whl/cu126
# 3. Install Flash Attention
pip install flash-attn==2.8.3 --no-build-isolation
```
Two optional extras can be installed when you need the corresponding feature:
```bash
# NavSim evaluation extras (for the NavSim benchmark)
pip install -r requirements-navsim.txt
# VLM preprocessing extras (to generate navigation guidance)
pip install vllm qwen-vl-utils
```
## Data Preparation
DriveWAM trains and evaluates on two benchmarks. Prepare whichever you need.
### NavSim
Follow the [NavSim installation guide](https://github.com/autonomousvision/navsim) to download the nuPlan-based dataset and cache metric files, and export the environment variables from that guide (`OPENSCENE_DATA_ROOT`, `NUPLAN_MAPS_ROOT`, `NUPLAN_MAP_VERSION`). Then extract per-scene samples:
```bash
# navtrain split (training)
python -m src.navsim.process_data --output-path ./data/navsim/trainval
# navtest split (evaluation)
python -m src.navsim.process_data \
    --navsim-log-path $OPENSCENE_DATA_ROOT/navsim_logs/test \
    --sensor-blobs-path $OPENSCENE_DATA_ROOT/sensor_blobs/test \
    --scene-filter-config navsim/planning/script/config/common/train_test_split/scene_filter/navtest.yaml \
    --output-path ./data/navsim/test
```
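The extraction commands above depend on the three environment variables from the NavSim guide. A small pre-flight check such as the following sketch can catch a missing variable before a long job starts; the helper name `missing_env_vars` is hypothetical and not part of the repository.

```python
import os
from typing import List, Mapping

REQUIRED_VARS = ("OPENSCENE_DATA_ROOT", "NUPLAN_MAPS_ROOT", "NUPLAN_MAP_VERSION")


def missing_env_vars(env: Mapping[str, str] = os.environ) -> List[str]:
    """Return the required variables that are unset or empty in `env`."""
    return [name for name in REQUIRED_VARS if not env.get(name)]


missing = missing_env_vars()
if missing:
    print("Set these before running process_data:", ", ".join(missing))
else:
    print("NavSim environment variables are set.")
```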
Each scene becomes one pkl file, which is what the training and evaluation scripts read by default:
```
./data/navsim/trainval/
    sample_000000.pkl
    sample_000001.pkl
    ...
```
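To sanity-check the extraction, the generated files can be listed and one of them opened. The sketch below assumes only the `sample_*.pkl` naming shown above; the structure of the unpickled object is not documented here, so the code just reports its type. If the pickles reference classes from the repository, run it from the repository root so they can be imported.

```python
import pickle
from pathlib import Path
from typing import Any, List


def list_samples(root: str) -> List[Path]:
    """Return the sample_*.pkl files directly under `root`, sorted by name."""
    return sorted(Path(root).glob("sample_*.pkl"))


def load_sample(path: Path) -> Any:
    """Unpickle one sample; only do this for files you generated yourself."""
    with path.open("rb") as f:
        return pickle.load(f)


samples = list_samples("./data/navsim/trainval")
print(f"found {len(samples)} samples")
if samples:
    first = load_sample(samples[0])
    print("first sample type:", type(first).__name__)
```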
### PhysicalAI-Autonomous-Vehicles
The raw dataset is hosted on [Hugging Face](https://huggingface.co/datasets/nvidia/PhysicalAI-Autonomous-Vehicles) and accessed through the [physical_ai_av](https://github.com/NVlabs/physical_ai_av) devkit. The devkit requires Python ≥ 3.11, so install it in a separate environment from `drivewam`:
```bash
pip install physical_ai_av
```
Accept the dataset license on the Hugging Face page, then download the dataset (or a subset of chunks) to `./data/physicalai`. DriveWAM only needs the **`camera_front_wide_120fov`** camera plus the egomotion and calibration features. Extract 10 Hz clips from the download:
```bash
python -m src.physicalai.process_data \
--dataset_root ./data/physicalai \
--output_dir ./data/physicalai/front \
--num_workers 16
```
This writes one directory per clip:
```
./data/physicalai/
├── clip_index.parquet          # official train/test split; keep it even if you prune the raw chunks
└── front/
    └── <clip_id>/
        ├── camera_front_wide_120fov.mp4
        └── camera_front_wide_120fov_ego.pkl
```
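After extraction, it can be useful to confirm that every clip directory contains both files from the layout above. The following sketch checks only for file presence; the helper name `incomplete_clips` is hypothetical, and the file names are taken from the listing above.

```python
from pathlib import Path
from typing import List

EXPECTED_FILES = (
    "camera_front_wide_120fov.mp4",
    "camera_front_wide_120fov_ego.pkl",
)


def incomplete_clips(front_dir: str) -> List[str]:
    """Return IDs of clip directories that lack one of the expected files."""
    root = Path(front_dir)
    if not root.is_dir():
        return []
    return sorted(
        clip.name
        for clip in root.iterdir()
        if clip.is_dir()
        and any(not (clip / name).is_file() for name in EXPECTED_FILES)
    )


bad = incomplete_clips("./data/physicalai/front")
print(f"{len(bad)} incomplete clip directories")
```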
VLM navigation prompts are used as conditioning during training and inference. We provide pregenerated prompts: training-split prompts are available on [Hugging Face](https://huggingface.co/chenchenshi/DriveWAM); the 1k-sample test-split prompts used for evaluation are included in the repo at `src/physicalai/eval_data/prompts_test_sample_1k.json`. To regenerate them yourself:
```bash
# Step 1 – generate route / BEV / scene-evolving guidance
bash scripts/drivewam_physicalai_vlm_preprocess.sh
# Step 2 – VLM-based clip quality filtering and sub-sampling
SPLIT=train \
bash scripts/drivewam_physicalai_vlm_data_sample.sh
```
## Training
DriveWAM model checkpoints are available on [Hugging Face](https://huggingface.co/chenchenshi/DriveWAM). DriveWAM is trained on top of [LingBot-VA Base](https://huggingface.co/robbyant/lingbot-va-base), a pretrained autoregressive diffusion transformer. Download the base model weights before training.
Key training hyperparameters (see configs for full details):
| Hyperparameter | NavSim / PhysicalAI |
|---|---|
| Training steps | 50 000 |
| Learning rate | 1e-5 |
| Optimizer | AdamW (β₁=0.9, β₂=0.95, wd=0.1) |
| Warmup steps | 10 |
| Batch size (per GPU) | 1 |
| Precision | bfloat16 |
| Input resolution | 256×448 |
| SNR shift (video / action) | 5.0 / 1.0 |
All experiments are conducted on 48 × NVIDIA H20 GPUs.
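The SNR shift values in the table suggest a timestep-shifting schedule of the kind used in several flow-matching models, where a sampled noise level `t` in [0, 1] is remapped to `shift * t / (1 + (shift - 1) * t)`. Whether DriveWAM uses exactly this formulation is an assumption; the configs and training code are authoritative. Under the convention that `t = 1` is pure noise, the sketch below shows how a shift of 5.0 (video) pushes samples toward higher noise while a shift of 1.0 (action) leaves them unchanged.

```python
def shift_timestep(t: float, shift: float) -> float:
    """Remap a noise level t in [0, 1]; shift > 1 moves it toward 1."""
    if not 0.0 <= t <= 1.0:
        raise ValueError("t must lie in [0, 1]")
    return shift * t / (1.0 + (shift - 1.0) * t)


for t in (0.25, 0.5, 0.75):
    video_t = shift_timestep(t, shift=5.0)   # video shift from the table
    action_t = shift_timestep(t, shift=1.0)  # action shift from the table
    print(f"t={t:.2f} -> video {video_t:.3f}, action {action_t:.3f}")
```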
Edit the config (`src/configs/navsim_cfg.py` or `src/configs/physicalai_cfg.py`) to set your paths and hyperparameters, then launch with the matching script:
| Benchmark | Config | Launch script |
|---|---|---|
| NavSim | `src/configs/navsim_cfg.py` | `scripts/drivewam_navsim_train.sh` |
| PhysicalAI | `src/configs/physicalai_cfg.py` | `scripts/drivewam_physicalai_train.sh` |
```bash
# NavSim
bash scripts/drivewam_navsim_train.sh
# PhysicalAI
bash scripts/drivewam_physicalai_train.sh
```
For PhysicalAI, we provide three CSV files listing clip IDs at different training data scales (4k / 20k / 100k clips), available on [Hugging Face](https://huggingface.co/chenchenshi/DriveWAM). Set the `clip_csv` field in `src/configs/physicalai_cfg.py` to the desired scale before training.
## Evaluation
### NavSim (PDM Score)
PDM score evaluation requires a metric cache: a set of per-scenario `.pkl` files that store precomputed map and route information to score each predicted trajectory. The precomputed metric cache for the navtest split is available for download on [Hugging Face](https://huggingface.co/chenchenshi/DriveWAM). To generate it yourself, run:
```bash
python navsim/planning/script/run_metric_caching.py \
train_test_split=navtest \
cache.cache_path=./data/navsim/metric_cache
```
This writes one `metric_cache.pkl` per scenario token under `./data/navsim/metric_cache/`. Pass the resulting directory to the evaluation script via `--metric-cache-path`.
```bash
python -m src.navsim.eval \
--checkpoint-path /path/to/checkpoint \
--config-name navsim_cfg \
--dataset-path ./data/navsim/test \
--metric-cache-path ./data/navsim/metric_cache
```
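Before launching the evaluation above, a quick count of the cached scenarios can confirm that the metric cache was downloaded or generated completely. This sketch relies only on the `metric_cache.pkl` file name mentioned above and searches recursively, since the exact subdirectory layout is not specified here; the helper name `count_metric_caches` is hypothetical.

```python
from pathlib import Path


def count_metric_caches(cache_dir: str) -> int:
    """Count metric_cache.pkl files anywhere under `cache_dir` (0 if it is missing)."""
    root = Path(cache_dir)
    if not root.is_dir():
        return 0
    return sum(1 for _ in root.rglob("metric_cache.pkl"))


print(count_metric_caches("./data/navsim/metric_cache"), "scenarios cached")
```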
### PhysicalAI
```bash
python -m src.physicalai.eval \
--checkpoint-path /path/to/checkpoint \
--config-name physicalai_cfg
```
## Citation
If you find DriveWAM useful, please cite:
```bibtex
@article{shi2026drivewam,
title={DriveWAM: Video Generative Priors Enable Scalable World-Action Modeling for Autonomous Driving},
author={Shi, Chen and Xu, Jinrui and Shi, Shaoshuai and Sheng, Kehua and Zhang, Bo and Jiang, Li},
journal={arXiv preprint arXiv:2605.28544},
year={2026}
}
```
## Acknowledgements
We gratefully acknowledge the following open-source projects that DriveWAM builds upon: [Wan2.2](https://github.com/Wan-Video/Wan2.2), [LingBot-VA](https://github.com/robbyant/lingbot-va), [NavSim](https://github.com/autonomousvision/navsim), [NVIDIA PhysicalAI-Autonomous-Vehicles](https://huggingface.co/datasets/nvidia/PhysicalAI-Autonomous-Vehicles).