<div align="center">
<h1>DriveWAM: Video Generative Priors Enable Scalable World-Action Modeling for Autonomous Driving</h1>
<a href="https://arxiv.org/abs/2605.28544"><img src="https://img.shields.io/badge/Paper-b31b1b" alt="Paper"></a>
<a href="https://chenshi3.github.io/drivewam.github.io/"><img src="https://img.shields.io/badge/Project_Page-green" alt="Project Page"></a>
<br>
Chen Shi\*, Jinrui Xu\*, Shaoshuai Shi, Kehua Sheng, Bo Zhang, Li Jiang†
*The Chinese University of Hong Kong, Shenzhen & Voyager Research, Didi Chuxing*
\*Equal Contribution, † Corresponding Author
</div>
**DriveWAM** is a joint **video generation and action prediction** model for autonomous driving. It adapts a pretrained video diffusion transformer into an **autoregressive video-action policy**, organizing video and action streams into a unified temporal token sequence trained under a joint flow-matching objective. This preserves the video generation priors while extending the model to ego-motion action prediction.
<!-- Demo video: add a github user-attachments link here -->
## Highlights
### NavSim
Comparison on NAVSIM v1. \*: results with imitation learning. †: trained with multiple trajectory anchors. MV: multi-view cameras; SV: single-view camera; L: LiDAR.
<div align="center">
| Method | Ref | Sensors | NC ↑ | DAC ↑ | TTC ↑ | C. ↑ | EP ↑ | PDMS ↑ |
|---|---|---|---|---|---|---|---|---|
| Human | - | - | 100.0 | 100.0 | 100.0 | 99.9 | 87.5 | 94.8 |
| UniAD | CVPR'23 | MV | 97.8 | 91.9 | 92.9 | 100.0 | 78.8 | 83.4 |
| TransFuser | TPAMI'23 | MV & L | 97.7 | 92.8 | 92.8 | 100.0 | 79.2 | 84.0 |
| PARA-Drive | CVPR'24 | MV | 97.9 | 92.4 | 93.0 | 99.8 | 79.3 | 84.0 |
| LAW | ICLR'25 | SV | 96.4 | 95.4 | 88.7 | 99.9 | 81.7 | 84.6 |
| DiffusionDrive | CVPR'25 | MV & L | 98.2 | 96.2 | 94.7 | 100.0 | 82.2 | 88.1 |
| WoTE | ICCV'25 | MV & L | 98.5 | 96.8 | 94.4 | 99.9 | 81.9 | 88.3 |
| *VLA-based Methods* | | | | | | | | |
| ReCogDrive\* | ICLR'26 | MV | 98.1 | 94.7 | 94.2 | 100.0 | 80.9 | 86.5 |
| DriveVLA-W0 | ICLR'26 | SV | 98.7 | 96.2 | 95.5 | 100.0 | 82.2 | 88.4 |
| AutoVLA | NeurIPS'25 | MV | 98.4 | 95.6 | 98.0 | 99.9 | 81.9 | 89.1 |
| DriveDreamer-Policy | arXiv'26 | MV | 98.4 | 97.1 | 95.1 | 100.0 | 83.5 | 89.2 |
| DriveVLA-W0† | ICLR'26 | SV | 98.7 | 99.1 | 95.3 | 99.3 | 83.3 | 90.2 |
| *WA-based Methods* | | | | | | | | |
| Epona | ICCV'25 | SV | 97.9 | 95.1 | 93.8 | 99.9 | 80.4 | 86.2 |
| WorldDrive | arXiv'26 | SV | 98.4 | 95.8 | 95.2 | 99.8 | 83.3 | 89.0 |
| **DriveWAM (Ours)** | - | SV | 98.3 | **98.1** | 95.2 | 100.0 | **84.3** | **90.1** |
</div>
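For readers unfamiliar with the metric: NAVSIM v1 defines the PDM score per scenario as the product of the two penalty terms (NC, DAC) and a weighted average of EP, TTC, and comfort with weights 5, 5, and 2. The sketch below is a minimal illustration of that formula, not code from this repository, and the function name `pdm_score` is hypothetical. Because the benchmark averages the per-scenario score, plugging the averaged columns of the table into the formula only approximates the reported PDMS.

```python
def pdm_score(nc: float, dac: float, ep: float, ttc: float, comfort: float) -> float:
    """Per-scenario PDM score (NAVSIM v1); all inputs are fractions in [0, 1]."""
    # Weighted average of the three sub-scores with weights 5, 5, and 2.
    weighted = (5.0 * ep + 5.0 * ttc + 2.0 * comfort) / 12.0
    # NC and DAC act as multiplicative penalties.
    return nc * dac * weighted


# "Human" row of the table, expressed as fractions.
human = pdm_score(nc=1.0, dac=1.0, ep=0.875, ttc=1.0, comfort=0.999)
print(f"Human PDMS: {100 * human:.1f}")
```

For the Human row, where NC, DAC, and TTC are all 100, the formula reproduces the reported 94.8.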
### PhysicalAI-AV
Comparison on PhysicalAI-Autonomous-Vehicles.
<div align="center">
| Method | Source | ADE@3s ↓ | FDE@3s ↓ | ADE@4s ↓ | FDE@4s ↓ |
|---|---|---|---|---|---|
| VaVAM | Valeo | 2.31 | 4.32 | - | - |
| Alpamayo-1.5 | NVIDIA | 0.80 | 2.31 | 1.44 | 4.18 |
| **DriveWAM (Ours)** | - | **0.47** | **1.35** | **0.83** | **2.47** |
</div>
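ADE and FDE are reported at fixed horizons. As a minimal sketch, assuming the common definitions (mean and final L2 distance between predicted and ground-truth waypoints) and trajectories sampled at 10 Hz, they could be computed as follows. The function name, the waypoint layout, and the sampling-rate default are illustrative assumptions, not the repository's evaluation code.

```python
import math


def ade_fde(pred, gt, horizon_s, hz=10.0):
    """Return (ADE, FDE) over the first `horizon_s` seconds.

    pred, gt: sequences of (x, y) waypoints in metres, one per 1/hz seconds,
    starting one step after the current pose (an assumed convention).
    """
    n = int(round(horizon_s * hz))
    if n < 1 or len(pred) < n or len(gt) < n:
        raise ValueError("trajectories are shorter than the requested horizon")
    dists = [math.dist(p, g) for p, g in zip(pred[:n], gt[:n])]
    return sum(dists) / n, dists[-1]


# Toy example: the prediction drifts laterally by 0.1 m per step.
gt = [(0.5 * (i + 1), 0.0) for i in range(40)]
pred = [(0.5 * (i + 1), 0.1 * (i + 1)) for i in range(40)]
ade3, fde3 = ade_fde(pred, gt, horizon_s=3.0)
print(f"ADE@3s={ade3:.2f} m, FDE@3s={fde3:.2f} m")
```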
### Qualitative Results
<div align="center">

</div>
### Data Scaling
DriveWAM's action prediction error decreases consistently as the training data scales from 4k to 100k clips. Scene-evolving (SE) guidance provides a complementary benefit at every scale.
<div align="center">
<img src="assets/data_scaling.jpg" width="300">
| # Clips | # Iters | SE Guidance | ADE@4s ↓ | FDE@4s ↓ |
|---|---|---|---|---|
| 4k | 50k | ✗ | 1.21 | 3.65 |
| 4k | 50k | ✓ | 1.01 | 2.95 |
| 20k | 50k | ✗ | 0.95 | 2.94 |
| 20k | 50k | ✓ | 0.94 | 2.65 |
| 100k | 50k | ✗ | 0.92 | 2.75 |
| 100k | 50k | ✓ | **0.83** | **2.47** |
</div>
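To make the effect of SE guidance easier to read off the table, the snippet below computes the relative ADE@4s reduction at each data scale. The numbers are copied from the table above, taking the lower-error row of each pair as the one with SE guidance enabled; the variable and function names are illustrative.

```python
# (ADE@4s without SE guidance, ADE@4s with SE guidance), copied from the table.
ade_4s = {"4k": (1.21, 1.01), "20k": (0.95, 0.94), "100k": (0.92, 0.83)}


def relative_reduction(without: float, with_se: float) -> float:
    """Fractional error reduction obtained by enabling SE guidance."""
    return (without - with_se) / without


for clips, (without, with_se) in ade_4s.items():
    pct = 100 * relative_reduction(without, with_se)
    print(f"{clips}: {pct:.1f}% lower ADE@4s with SE guidance")
```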
## News
- [Jun 7, 2026] We open-source all code and model weights.
- [May 27, 2026] We release the paper and project page.
## Getting Started
### Installation
First, clone this repository and set up the environment.
```bash
git clone <repo-url>
cd DriveWAM
# 1. Create conda environment
conda env create -f environment.yml
conda activate drivewam
# 2. Install PyTorch (CUDA 12.6)
pip install torch==2.9.0 torchvision==0.24.0 --index-url https://download.pytorch.org/whl/cu126
# 3. Install Flash Attention
pip install flash-attn==2.8.3 --no-build-isolation
```
Two optional extras can be installed when you need the corresponding feature:
```bash
# NavSim evaluation extras (for the NavSim benchmark)
pip install -r requirements-navsim.txt
# VLM preprocessing extras (to generate navigation guidance)
pip install vllm qwen-vl-utils
```
## Data Preparation
DriveWAM trains and evaluates on two benchmarks. Prepare whichever you need.
### NavSim
Follow the [NavSim installation guide](https://github.com/autonomousvision/navsim) to download the nuPlan-based dataset and cache metric files, and export the environment variables from that guide (`OPENSCENE_DATA_ROOT`, `NUPLAN_MAPS_ROOT`, `NUPLAN_MAP_VERSION`). Then extract per-scene samples:
```bash
# navtrain split (training)
python -m src.navsim.process_data --output-path ./data/navsim/trainval
# navtest split (evaluation)
python -m src.navsim.process_data \
--navsim-log-path $OPENSCENE_DATA_ROOT/navsim_logs/test \
--sensor-blobs-path $OPENSCENE_DATA_ROOT/sensor_blobs/test \
--scene-filter-config navsim/planning/script/config/common/train_test_split/scene_filter/navtest.yaml \
--output-path ./data/navsim/test
```
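The extraction commands above rely on the NavSim environment variables being exported. A small pre-flight check, sketched here with a hypothetical helper name, can catch an unset variable before a long preprocessing run starts:

```python
import os

REQUIRED_VARS = ("OPENSCENE_DATA_ROOT", "NUPLAN_MAPS_ROOT", "NUPLAN_MAP_VERSION")


def missing_navsim_vars(env=os.environ):
    """Return the names of required NavSim variables that are unset or empty."""
    return [name for name in REQUIRED_VARS if not env.get(name)]


missing = missing_navsim_vars()
if missing:
    print("Unset NavSim variables:", ", ".join(missing))
else:
    print("All NavSim variables are set.")
```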
Each scene becomes one pkl file, which is what the training and evaluation scripts read by default:
```
./data/navsim/trainval/
    sample_000000.pkl
    sample_000001.pkl
    ...
```
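The contents of each pkl file are not documented here, so a quick way to get oriented is to load a few samples and print their top-level structure. The following is a minimal sketch with a hypothetical helper name; it makes no assumption about the keys inside a sample. If the samples contain custom classes, unpickling may require running from the repository root so those classes can be imported.

```python
import pickle
from pathlib import Path


def summarize_samples(root, limit=3):
    """Load the first few sample_*.pkl files and describe their top-level structure."""
    root = Path(root)
    if not root.is_dir():
        return []
    summaries = []
    for path in sorted(root.glob("sample_*.pkl"))[:limit]:
        with path.open("rb") as f:
            sample = pickle.load(f)  # only unpickle files you generated yourself
        keys = list(sample) if isinstance(sample, dict) else None
        summaries.append((path.name, type(sample).__name__, keys))
    return summaries


for name, kind, keys in summarize_samples("./data/navsim/trainval"):
    print(name, kind, keys)
```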
### PhysicalAI-Autonomous-Vehicles
The raw dataset is hosted on [Hugging Face](https://huggingface.co/datasets/nvidia/PhysicalAI-Autonomous-Vehicles) and accessed through the [physical_ai_av](https://github.com/NVlabs/physical_ai_av) devkit. The devkit requires Python ≥ 3.11, so install it in a separate environment from `drivewam`:
```bash
pip install physical_ai_av
```
Accept the dataset license on the Hugging Face page, then download the dataset (or a subset of chunks) to `./data/physicalai`. DriveWAM only needs the **`camera_front_wide_120fov`** camera plus the egomotion and calibration features. Extract 10 Hz clips from the download:
```bash
python -m src.physicalai.process_data \
--dataset_root ./data/physicalai \
--output_dir ./data/physicalai/front \
--num_workers 16
```
This writes one directory per clip:
```
./data/physicalai/
├── clip_index.parquet    # official train/test split; keep it even if you prune the raw chunks
└── front/
    └── <clip_id>/
        ├── camera_front_wide_120fov.mp4
        └── camera_front_wide_120fov_ego.pkl
```
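After extraction, it can be worth confirming that every clip directory contains both expected files, since an interrupted run may leave partial clips behind. The check below is a sketch based only on the layout shown above; the helper name is hypothetical.

```python
from pathlib import Path

EXPECTED_FILES = (
    "camera_front_wide_120fov.mp4",
    "camera_front_wide_120fov_ego.pkl",
)


def incomplete_clips(front_dir):
    """Map each clip directory name to the expected files it is missing."""
    problems = {}
    for clip_dir in sorted(p for p in Path(front_dir).iterdir() if p.is_dir()):
        missing = [name for name in EXPECTED_FILES if not (clip_dir / name).is_file()]
        if missing:
            problems[clip_dir.name] = missing
    return problems


front = Path("./data/physicalai/front")
if front.is_dir():
    for clip_id, missing in incomplete_clips(front).items():
        print(f"{clip_id}: missing {', '.join(missing)}")
```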
VLM navigation prompts are used as conditioning during training and inference. We provide pregenerated prompts: training-split prompts are available on [Hugging Face](https://huggingface.co/chenchenshi/DriveWAM); the 1k-sample test-split prompts used for evaluation are included in the repo at `src/physicalai/eval_data/prompts_test_sample_1k.json`. To regenerate them yourself:
```bash
# Step 1: generate route / BEV / scene-evolving guidance
bash scripts/drivewam_physicalai_vlm_preprocess.sh
# Step 2: VLM-based clip quality filtering and sub-sampling
SPLIT=train \
bash scripts/drivewam_physicalai_vlm_data_sample.sh
```
## Training
DriveWAM model checkpoints are available on [Hugging Face](https://huggingface.co/chenchenshi/DriveWAM). DriveWAM is trained on top of [LingBot-VA Base](https://huggingface.co/robbyant/lingbot-va-base), a pretrained autoregressive diffusion transformer. Download the base model weights before training.
Key training hyperparameters (see configs for full details):
| Hyperparameter | NavSim / PhysicalAI |
|---|---|
| Training steps | 50 000 |
| Learning rate | 1e-5 |
| Optimizer | AdamW (β₁=0.9, β₂=0.95, wd=0.1) |
| Warmup steps | 10 |
| Batch size (per GPU) | 1 |
| Precision | bfloat16 |
| Input resolution | 256×448 |
| SNR shift (video / action) | 5.0 / 1.0 |
All experiments are conducted on 48 × NVIDIA H20 GPUs.
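The separate SNR shift values for video and action can be read as different noise-level schedules for the two streams. The sketch below assumes the timestep-shift mapping commonly used by flow-matching samplers, `sigma' = shift * sigma / (1 + (shift - 1) * sigma)`. Whether DriveWAM applies exactly this mapping is an assumption, and the function name is hypothetical. Under this mapping, a shift of 1.0 leaves the schedule unchanged, while 5.0 pushes intermediate noise levels toward the high-noise end.

```python
def shift_sigma(sigma: float, shift: float) -> float:
    """Map a noise level in [0, 1] through the (assumed) flow-matching timestep shift."""
    return shift * sigma / (1.0 + (shift - 1.0) * sigma)


for sigma in (0.25, 0.5, 0.75):
    video = shift_sigma(sigma, shift=5.0)   # video stream, shift 5.0
    action = shift_sigma(sigma, shift=1.0)  # action stream, shift 1.0 (identity)
    print(f"sigma={sigma:.2f} -> video {video:.3f}, action {action:.3f}")
```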
Edit the config (`src/configs/navsim_cfg.py` or `src/configs/physicalai_cfg.py`) to set your paths and hyperparameters, then launch with the matching script:
| Benchmark | Config | Launch script |
|---|---|---|
| NavSim | `src/configs/navsim_cfg.py` | `scripts/drivewam_navsim_train.sh` |
| PhysicalAI | `src/configs/physicalai_cfg.py` | `scripts/drivewam_physicalai_train.sh` |
```bash
# NavSim
bash scripts/drivewam_navsim_train.sh
# PhysicalAI
bash scripts/drivewam_physicalai_train.sh
```
For PhysicalAI, we provide three CSV files listing clip IDs at different training data scales (4k / 20k / 100k clips), available on [Hugging Face](https://huggingface.co/chenchenshi/DriveWAM). Set the `clip_csv` field in `src/configs/physicalai_cfg.py` to the desired scale before training.
## Evaluation
### NavSim (PDM Score)
PDM score evaluation requires a metric cache: a set of per-scenario `.pkl` files that store the precomputed map and route information used to score each predicted trajectory. The precomputed metric cache for the navtest split is available for download on [Hugging Face](https://huggingface.co/chenchenshi/DriveWAM). To generate it yourself, run:
```bash
python navsim/planning/script/run_metric_caching.py \
train_test_split=navtest \
cache.cache_path=./data/navsim/metric_cache
```
This writes one `metric_cache.pkl` per scenario token under `./data/navsim/metric_cache/`. Pass the resulting directory to the evaluation script via `--metric-cache-path`.
```bash
python -m src.navsim.eval \
--checkpoint-path /path/to/checkpoint \
--config-name navsim_cfg \
--dataset-path ./data/navsim/test \
--metric-cache-path ./data/navsim/metric_cache
```
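Before launching a full evaluation, a quick count of the cached scenarios can confirm that the metric cache directory is populated. This is a minimal sketch with a hypothetical helper name; it relies only on the `metric_cache.pkl` file name described above and searches recursively, so it does not assume a particular directory depth.

```python
from pathlib import Path


def count_metric_caches(cache_root):
    """Count metric_cache.pkl files anywhere under the cache directory."""
    root = Path(cache_root)
    if not root.is_dir():
        return 0
    return sum(1 for _ in root.rglob("metric_cache.pkl"))


n = count_metric_caches("./data/navsim/metric_cache")
print(f"found {n} cached scenarios")
```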
### PhysicalAI
```bash
python -m src.physicalai.eval \
--checkpoint-path /path/to/checkpoint \
--config-name physicalai_cfg
```
## Citation
If you find DriveWAM useful, please cite:
```bibtex
@article{shi2026drivewam,
title={DriveWAM: Video Generative Priors Enable Scalable World-Action Modeling for Autonomous Driving},
author={Shi, Chen and Xu, Jinrui and Shi, Shaoshuai and Sheng, Kehua and Zhang, Bo and Jiang, Li},
journal={arXiv preprint arXiv:2605.28544},
year={2026}
}
```
## Acknowledgements
We gratefully acknowledge the following open-source projects that DriveWAM builds upon: [Wan2.2](https://github.com/Wan-Video/Wan2.2), [LingBot-VA](https://github.com/robbyant/lingbot-va), [NavSim](https://github.com/autonomousvision/navsim), [NVIDIA PhysicalAI-Autonomous-Vehicles](https://huggingface.co/datasets/nvidia/PhysicalAI-Autonomous-Vehicles).