| <div align="center"> | |
| <h1>DriveWAM: Video Generative Priors Enable Scalable World-Action Modeling for Autonomous Driving</h1> | |
| <a href="https://arxiv.org/abs/2605.28544"><img src="https://img.shields.io/badge/Paper-b31b1b" alt="Paper"></a> | |
| <a href="https://chenshi3.github.io/drivewam.github.io/"><img src="https://img.shields.io/badge/Project_Page-green" alt="Project Page"></a> | |
| <br> | |
| Chen Shi\*, Jinrui Xu\*, Shaoshuai Shi, Kehua Sheng, Bo Zhang, Li Jiang† | |
| *The Chinese University of Hong Kong, Shenzhen & Voyager Research, Didi Chuxing* | |
| \*Equal Contribution, † Corresponding Author | |
| </div> | |
| **DriveWAM** is a joint **video generation and action prediction** model for autonomous driving. It adapts a pretrained video diffusion transformer into an **autoregressive video-action policy** by organizing the video and action streams into a unified temporal token sequence trained under a joint flow-matching objective. This preserves the video generation priors while extending the model to ego-motion action prediction. | |
| <!-- Demo video: add a github user-attachments link here --> | |
| ## Highlights | |
| ### NavSim | |
| Comparison on NAVSIM v1. \*: results with imitation learning. †: trained with multiple trajectory anchors. MV: multi-view cameras; SV: single-view camera; L: LiDAR. | |
| <div align="center"> | |
| | Method | Ref | Sensors | NC ↑ | DAC ↑ | TTC ↑ | C. ↑ | EP ↑ | PDMS ↑ | | |
| |---|---|---|---|---|---|---|---|---| | |
| | Human | - | - | 100.0 | 100.0 | 100.0 | 99.9 | 87.5 | 94.8 | | |
| | UniAD | CVPR'23 | MV | 97.8 | 91.9 | 92.9 | 100.0 | 78.8 | 83.4 | | |
| | TransFuser | TPAMI'23 | MV & L | 97.7 | 92.8 | 92.8 | 100.0 | 79.2 | 84.0 | | |
| | PARA-Drive | CVPR'24 | MV | 97.9 | 92.4 | 93.0 | 99.8 | 79.3 | 84.0 | | |
| | LAW | ICLR'25 | SV | 96.4 | 95.4 | 88.7 | 99.9 | 81.7 | 84.6 | | |
| | DiffusionDrive | CVPR'25 | MV & L | 98.2 | 96.2 | 94.7 | 100.0 | 82.2 | 88.1 | | |
| | WoTE | ICCV'25 | MV & L | 98.5 | 96.8 | 94.4 | 99.9 | 81.9 | 88.3 | | |
| | *VLA-based Methods* | | | | | | | | | | |
| | ReCogDrive\* | ICLR'26 | MV | 98.1 | 94.7 | 94.2 | 100.0 | 80.9 | 86.5 | | |
| | DriveVLA-W0 | ICLR'26 | SV | 98.7 | 96.2 | 95.5 | 100.0 | 82.2 | 88.4 | | |
| | AutoVLA | NeurIPS'25 | MV | 98.4 | 95.6 | 98.0 | 99.9 | 81.9 | 89.1 | | |
| | DriveDreamer-Policy | arXiv'26 | MV | 98.4 | 97.1 | 95.1 | 100.0 | 83.5 | 89.2 | | |
| | DriveVLA-W0† | ICLR'26 | SV | 98.7 | 99.1 | 95.3 | 99.3 | 83.3 | 90.2 | | |
| | *WA-based Methods* | | | | | | | | | | |
| | Epona | ICCV'25 | SV | 97.9 | 95.1 | 93.8 | 99.9 | 80.4 | 86.2 | | |
| | WorldDrive | arXiv'26 | SV | 98.4 | 95.8 | 95.2 | 99.8 | 83.3 | 89.0 | | |
| | **DriveWAM (Ours)** | - | SV | 98.3 | **98.1** | 95.2 | 100.0 | **84.3** | **90.1** | | |
| </div> | |
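For readers unfamiliar with the PDMS column: in NAVSIM v1 it combines the other sub-scores per scenario as NC × DAC × (5·EP + 5·TTC + 2·C) / 12, and the benchmark number is the mean over scenarios. The sketch below illustrates that aggregation on made-up scenarios; the `SubScores` class and the example values are illustrative assumptions, not code or data from this repository. Because the product is taken per scenario before averaging, plugging the column averages from the table into the formula will not reproduce the reported PDMS.

```python
from dataclasses import dataclass


@dataclass
class SubScores:
    """Per-scenario sub-scores, each in [0, 1] (hypothetical container)."""
    nc: float   # no at-fault collision (multiplicative penalty)
    dac: float  # drivable area compliance (multiplicative penalty)
    ttc: float  # time-to-collision
    c: float    # comfort
    ep: float   # ego progress


def pdm_score(s: SubScores) -> float:
    """Penalties multiply a weighted average with weights 5 / 5 / 2."""
    weighted = (5.0 * s.ep + 5.0 * s.ttc + 2.0 * s.c) / 12.0
    return s.nc * s.dac * weighted


def mean_pdms(scenarios: list[SubScores]) -> float:
    """Benchmark-level PDMS: mean of per-scenario scores, in percent."""
    return 100.0 * sum(pdm_score(s) for s in scenarios) / len(scenarios)


# Two made-up scenarios: one clean drive, one that leaves the drivable area.
scenarios = [
    SubScores(nc=1.0, dac=1.0, ttc=1.0, c=1.0, ep=0.8),
    SubScores(nc=1.0, dac=0.0, ttc=1.0, c=1.0, ep=0.9),
]
print(round(mean_pdms(scenarios), 2))  # 45.83
```

A single drivable-area violation zeroes the whole scenario, which is why DAC and NC have such a large effect on the final score.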
| ### PhysicalAI-AV | |
| Comparison on PhysicalAI-Autonomous-Vehicles. | |
| <div align="center"> | |
| | Method | Source | ADE@3s ↓ | FDE@3s ↓ | ADE@4s ↓ | FDE@4s ↓ | | |
| |---|---|---|---|---|---| | |
| | VaVAM | Valeo | 2.31 | 4.32 | - | - | | |
| | Alpamayo-1.5 | NVIDIA | 0.80 | 2.31 | 1.44 | 4.18 | | |
| | **DriveWAM (Ours)** | - | **0.47** | **1.35** | **0.83** | **2.47** | | |
| </div> | |
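ADE and FDE are conventionally the average and final L2 distance between predicted and ground-truth waypoints up to a given horizon. The following is a minimal sketch of that convention, not the repository's evaluation code; the function name, the 2D waypoint format, and the 10 Hz waypoint rate are assumptions made for illustration.

```python
import math

Point = tuple[float, float]


def displacement_errors(pred: list[Point], gt: list[Point],
                        horizon_s: float, hz: float = 10.0) -> tuple[float, float]:
    """Return (ADE, FDE) in metres over the first `horizon_s` seconds.

    Assumes both trajectories hold future waypoints sampled at `hz`,
    starting one step after the current frame (an illustrative convention).
    """
    n = int(round(horizon_s * hz))
    if n < 1 or len(pred) < n or len(gt) < n:
        raise ValueError("trajectories are shorter than the requested horizon")
    dists = [math.dist(p, g) for p, g in zip(pred[:n], gt[:n])]
    return sum(dists) / n, dists[-1]


# Toy example: ground truth drives straight, prediction drifts sideways by 1 cm per step.
gt = [(0.5 * k, 0.0) for k in range(1, 41)]
pred = [(0.5 * k, 0.01 * k) for k in range(1, 41)]
for horizon in (3.0, 4.0):
    ade, fde = displacement_errors(pred, gt, horizon)
    print(f"ADE@{horizon:.0f}s={ade:.3f}  FDE@{horizon:.0f}s={fde:.3f}")
```

FDE only looks at the last waypoint inside the horizon, so it is always at least as sensitive to accumulated drift as ADE.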
| ### Qualitative Results | |
| <div align="center"> | |
|  | |
| </div> | |
| ### Data Scaling | |
| DriveWAM's action prediction error decreases consistently as training data scales from 4k to 100k clips. Scene-evolving (SE) guidance provides a complementary benefit at every scale. | |
| <div align="center"> | |
| <img src="assets/data_scaling.jpg" width="300"> | |
| | # Clips | # Iters | SE Guidance | ADE@4s ↓ | FDE@4s ↓ | | |
| |---|---|---|---|---| | |
| | 4k | 50k | ✗ | 1.21 | 3.65 | | |
| | 4k | 50k | ✓ | 1.01 | 2.95 | | |
| | 20k | 50k | ✗ | 0.95 | 2.94 | | |
| | 20k | 50k | ✓ | 0.94 | 2.65 | | |
| | 100k | 50k | ✗ | 0.92 | 2.75 | | |
| | 100k | 50k | ✓ | **0.83** | **2.47** | | |
| </div> | |
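To put the scaling table in relative terms, the short calculation below derives percentage reductions in ADE@4s from the numbers listed above. It assumes the first row of each pair is the run without SE guidance and the second is the run with it; the helper name is illustrative.

```python
# ADE@4s values copied from the table above: (without SE, with SE) per data scale.
ade_4s = {"4k": (1.21, 1.01), "20k": (0.95, 0.94), "100k": (0.92, 0.83)}


def relative_reduction(before: float, after: float) -> float:
    """Percentage by which `after` is lower than `before`."""
    return 100.0 * (before - after) / before


# Effect of SE guidance at each data scale.
for scale, (without_se, with_se) in ade_4s.items():
    gain = relative_reduction(without_se, with_se)
    print(f"{scale}: SE guidance lowers ADE@4s by {gain:.1f}%")

# Effect of scaling data from 4k to 100k clips, with SE guidance at both ends.
scaling_gain = relative_reduction(ade_4s["4k"][1], ade_4s["100k"][1])
print(f"4k -> 100k (with SE): {scaling_gain:.1f}%")
```

Under that assumption, SE guidance lowers ADE@4s by roughly 16.5%, 1.1%, and 9.8% at 4k, 20k, and 100k clips, and scaling from 4k to 100k clips with SE guidance lowers it by about 17.8%.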
| ## News | |
| - [Jun 7, 2026] We open-source all code and model weights. | |
| - [May 27, 2026] We release the paper and project page. | |
| ## Getting Started | |
| ### Installation | |
| First, clone this repository and set up the environment. | |
| ```bash | |
| git clone <repo-url> | |
| cd DriveWAM | |
| # 1. Create conda environment | |
| conda env create -f environment.yml | |
| conda activate drivewam | |
| # 2. Install PyTorch (CUDA 12.6) | |
| pip install torch==2.9.0 torchvision==0.24.0 --index-url https://download.pytorch.org/whl/cu126 | |
| # 3. Install Flash Attention | |
| pip install flash-attn==2.8.3 --no-build-isolation | |
| ``` | |
| Two optional extras can be installed when you need the corresponding feature: | |
| ```bash | |
| # NavSim evaluation extras (for the NavSim benchmark) | |
| pip install -r requirements-navsim.txt | |
| # VLM preprocessing extras (to generate navigation guidance) | |
| pip install vllm qwen-vl-utils | |
| ``` | |
| ## Data Preparation | |
| DriveWAM trains and evaluates on two benchmarks. Prepare whichever you need. | |
| ### NavSim | |
| Follow the [NavSim installation guide](https://github.com/autonomousvision/navsim) to download the nuPlan-based dataset and cache metric files, and export the environment variables from that guide (`OPENSCENE_DATA_ROOT`, `NUPLAN_MAPS_ROOT`, `NUPLAN_MAP_VERSION`). Then extract per-scene samples: | |
| ```bash | |
| # navtrain split (training) | |
| python -m src.navsim.process_data --output-path ./data/navsim/trainval | |
| # navtest split (evaluation) | |
| python -m src.navsim.process_data \ | |
| --navsim-log-path $OPENSCENE_DATA_ROOT/navsim_logs/test \ | |
| --sensor-blobs-path $OPENSCENE_DATA_ROOT/sensor_blobs/test \ | |
| --scene-filter-config navsim/planning/script/config/common/train_test_split/scene_filter/navtest.yaml \ | |
| --output-path ./data/navsim/test | |
| ``` | |
| Each scene becomes one pkl file, which is what the training and evaluation scripts read by default: | |
| ``` | |
| ./data/navsim/trainval/ | |
| sample_000000.pkl | |
| sample_000001.pkl | |
| ... | |
| ``` | |
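Before launching a long training run, it can help to confirm that the extracted samples load. The sketch below opens the first few `sample_*.pkl` files and reports what they contain without assuming any schema; the helper names are illustrative and not part of the repository.

```python
import pickle
from pathlib import Path


def load_samples(root: str, limit: int = 3):
    """Yield (path, object) for the first `limit` sample_*.pkl files under `root`."""
    for path in sorted(Path(root).glob("sample_*.pkl"))[:limit]:
        with path.open("rb") as f:
            # Only unpickle files you generated yourself: pickle can execute code.
            yield path, pickle.load(f)


def describe(obj) -> str:
    """Summarize a sample without assuming its schema."""
    if isinstance(obj, dict):
        return f"dict with keys {sorted(map(str, obj))}"
    return type(obj).__name__


if __name__ == "__main__":
    for path, sample in load_samples("./data/navsim/trainval"):
        print(path.name, "->", describe(sample))
```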
| ### PhysicalAI-Autonomous-Vehicles | |
| The raw dataset is hosted on [Hugging Face](https://huggingface.co/datasets/nvidia/PhysicalAI-Autonomous-Vehicles) and accessed through the [physical_ai_av](https://github.com/NVlabs/physical_ai_av) devkit. The devkit requires Python ≥ 3.11, so install it in a separate environment from `drivewam`: | |
| ```bash | |
| pip install physical_ai_av | |
| ``` | |
| Accept the dataset license on the Hugging Face page, then download the dataset (or a subset of chunks) to `./data/physicalai`. DriveWAM only needs the **`camera_front_wide_120fov`** camera plus the egomotion and calibration features. Extract 10 Hz clips from the download: | |
| ```bash | |
| python -m src.physicalai.process_data \ | |
| --dataset_root ./data/physicalai \ | |
| --output_dir ./data/physicalai/front \ | |
| --num_workers 16 | |
| ``` | |
| This writes one directory per clip: | |
| ``` | |
| ./data/physicalai/ | |
| ├── clip_index.parquet # official train/test split; keep it even if you prune the raw chunks | |
| └── front/ | |
|     └── <clip_id>/ | |
|         ├── camera_front_wide_120fov.mp4 | |
|         └── camera_front_wide_120fov_ego.pkl | |
| ``` | |
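Extraction over many clips can fail partway, so a quick completeness check is useful. The sketch below lists clip directories that are missing either of the two files shown in the layout above; the function name is illustrative, and the check only looks at file presence, not content.

```python
from pathlib import Path

# File names taken from the directory layout above.
EXPECTED = ("camera_front_wide_120fov.mp4", "camera_front_wide_120fov_ego.pkl")


def find_incomplete_clips(front_dir: str) -> dict[str, list[str]]:
    """Map clip_id -> missing file names for every clip directory under `front_dir`."""
    incomplete = {}
    for clip_dir in sorted(p for p in Path(front_dir).iterdir() if p.is_dir()):
        missing = [name for name in EXPECTED if not (clip_dir / name).is_file()]
        if missing:
            incomplete[clip_dir.name] = missing
    return incomplete


if __name__ == "__main__":
    root = Path("./data/physicalai/front")
    if root.is_dir():
        for clip_id, missing in find_incomplete_clips(str(root)).items():
            print(f"{clip_id}: missing {', '.join(missing)}")
```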
| VLM navigation prompts are used as conditioning during training and inference. We provide pregenerated prompts: training-split prompts are available on [Hugging Face](https://huggingface.co/chenchenshi/DriveWAM); the 1k-sample test-split prompts used for evaluation are included in the repo at `src/physicalai/eval_data/prompts_test_sample_1k.json`. To regenerate them yourself: | |
| ```bash | |
| # Step 1: generate route / BEV / scene-evolving guidance | |
| bash scripts/drivewam_physicalai_vlm_preprocess.sh | |
| # Step 2: VLM-based clip quality filtering and sub-sampling | |
| SPLIT=train \ | |
| bash scripts/drivewam_physicalai_vlm_data_sample.sh | |
| ``` | |
| ## Training | |
| DriveWAM model checkpoints are available on [Hugging Face](https://huggingface.co/chenchenshi/DriveWAM). DriveWAM is trained on top of [LingBot-VA Base](https://huggingface.co/robbyant/lingbot-va-base), a pretrained autoregressive diffusion transformer. Download the base model weights before training. | |
| Key training hyperparameters (see configs for full details): | |
| | Hyperparameter | NavSim / PhysicalAI | | |
| |---|---| | |
| | Training steps | 50 000 | | |
| | Learning rate | 1e-5 | | |
| | Optimizer | AdamW (β₁=0.9, β₂=0.95, wd=0.1) | | |
| | Warmup steps | 10 | | |
| | Batch size (per GPU) | 1 | | |
| | Precision | bfloat16 | | |
| | Input resolution | 256×448 | | |
| | SNR shift (video / action) | 5.0 / 1.0 | | |
| All experiments are conducted on 48 × NVIDIA H20 GPUs. | |
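The "SNR shift" row is not defined further here. One common convention in flow-matching pipelines is a timestep shift that remaps a noise level t in [0, 1] to s·t / (1 + (s − 1)·t), which concentrates sampling toward higher noise when s > 1. The sketch below illustrates that convention only; whether DriveWAM's configs use exactly this form is an assumption, and the function name is hypothetical.

```python
def shift_timestep(t: float, shift: float) -> float:
    """Remap a noise level t in [0, 1]; shift=1.0 leaves t unchanged.

    Assumed convention (not confirmed by this README): t' = s*t / (1 + (s-1)*t).
    """
    if not 0.0 <= t <= 1.0:
        raise ValueError("t must lie in [0, 1]")
    return shift * t / (1.0 + (shift - 1.0) * t)


# Compare the two shift values from the table on a few noise levels.
for t in (0.25, 0.5, 0.75):
    video_t = shift_timestep(t, 5.0)   # video stream, shift 5.0
    action_t = shift_timestep(t, 1.0)  # action stream, shift 1.0 (identity)
    print(f"t={t:.2f}  video={video_t:.3f}  action={action_t:.3f}")
```

Under this reading, a shift of 5.0 moves the midpoint t = 0.5 to about 0.833 for video tokens, while the action stream keeps its original schedule.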
| Edit the config (`src/configs/navsim_cfg.py` or `src/configs/physicalai_cfg.py`) to set your paths and hyperparameters, then launch with the matching script: | |
| | Benchmark | Config | Launch script | | |
| |---|---|---| | |
| | NavSim | `src/configs/navsim_cfg.py` | `scripts/drivewam_navsim_train.sh` | | |
| | PhysicalAI | `src/configs/physicalai_cfg.py` | `scripts/drivewam_physicalai_train.sh` | | |
| ```bash | |
| # NavSim | |
| bash scripts/drivewam_navsim_train.sh | |
| # PhysicalAI | |
| bash scripts/drivewam_physicalai_train.sh | |
| ``` | |
| For PhysicalAI, we provide three CSV files listing clip IDs at different training data scales (4k / 20k / 100k clips), available on [Hugging Face](https://huggingface.co/chenchenshi/DriveWAM). Set the `clip_csv` field in `src/configs/physicalai_cfg.py` to the desired scale before training. | |
| ## Evaluation | |
| ### NavSim (PDM Score) | |
| PDM score evaluation requires a metric cache: a set of per-scenario `.pkl` files that store the precomputed map and route information used to score each predicted trajectory. The precomputed metric cache for the navtest split is available for download on [Hugging Face](https://huggingface.co/chenchenshi/DriveWAM). To generate it yourself, run: | |
| ```bash | |
| python navsim/planning/script/run_metric_caching.py \ | |
| train_test_split=navtest \ | |
| cache.cache_path=./data/navsim/metric_cache | |
| ``` | |
| This writes one `metric_cache.pkl` per scenario token under `./data/navsim/metric_cache/`. Pass the resulting directory to the evaluation script via `--metric-cache-path`. | |
| ```bash | |
| python -m src.navsim.eval \ | |
| --checkpoint-path /path/to/checkpoint \ | |
| --config-name navsim_cfg \ | |
| --dataset-path ./data/navsim/test \ | |
| --metric-cache-path ./data/navsim/metric_cache | |
| ``` | |
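A missing or empty input directory is easier to diagnose before the evaluation starts than after the checkpoint has loaded. The sketch below is a hypothetical preflight check for the two paths passed above. It assumes the test split uses the same `sample_*.pkl` naming shown for `trainval` and that each scenario's cache file is named `metric_cache.pkl`, as described earlier.

```python
from pathlib import Path


def preflight(dataset_path: str, metric_cache_path: str) -> list[str]:
    """Return a list of human-readable problems; an empty list means both inputs look usable."""
    problems = []
    ds, mc = Path(dataset_path), Path(metric_cache_path)
    # Count extracted samples (naming assumed to match the trainval listing).
    n_samples = len(list(ds.glob("sample_*.pkl"))) if ds.is_dir() else 0
    # Count per-scenario caches anywhere below the cache root.
    n_caches = sum(1 for _ in mc.rglob("metric_cache.pkl")) if mc.is_dir() else 0
    if n_samples == 0:
        problems.append(f"no sample_*.pkl files found in {ds}")
    if n_caches == 0:
        problems.append(f"no metric_cache.pkl files found under {mc}")
    return problems


if __name__ == "__main__":
    for problem in preflight("./data/navsim/test", "./data/navsim/metric_cache"):
        print("WARNING:", problem)
```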
| ### PhysicalAI | |
| ```bash | |
| python -m src.physicalai.eval \ | |
| --checkpoint-path /path/to/checkpoint \ | |
| --config-name physicalai_cfg | |
| ``` | |
| ## Citation | |
| If you find DriveWAM useful, please cite: | |
| ```bibtex | |
| @article{shi2026drivewam, | |
| title={DriveWAM: Video Generative Priors Enable Scalable World-Action Modeling for Autonomous Driving}, | |
| author={Shi, Chen and Xu, Jinrui and Shi, Shaoshuai and Sheng, Kehua and Zhang, Bo and Jiang, Li}, | |
| journal={arXiv preprint arXiv:2605.28544}, | |
| year={2026} | |
| } | |
| ``` | |
| ## Acknowledgements | |
| We gratefully acknowledge the following open-source projects that DriveWAM builds upon: [Wan2.2](https://github.com/Wan-Video/Wan2.2), [LingBot-VA](https://github.com/robbyant/lingbot-va), [NavSim](https://github.com/autonomousvision/navsim), [NVIDIA PhysicalAI-Autonomous-Vehicles](https://huggingface.co/datasets/nvidia/PhysicalAI-Autonomous-Vehicles). | |