# DriveWAM: Video Generative Priors Enable Scalable World-Action Modeling for Autonomous Driving

Chen Shi\*, Jinrui Xu\*, Shaoshuai Shi, Kehua Sheng, Bo Zhang, Li Jiang†

*The Chinese University of Hong Kong, Shenzhen & Voyager Research, Didi Chuxing*

\*Equal Contribution, †Corresponding Author

**DriveWAM** is a joint **video generation and action prediction** model for autonomous driving. It adapts a pretrained video diffusion transformer into an **autoregressive video-action policy** by organizing the video and action streams into a unified temporal token sequence trained under a joint flow-matching objective. This preserves the video generation priors while extending the model to ego-motion action prediction.

## Highlights

### NavSim

Comparison on NAVSIM v1. \*: results with imitation learning. †: trained with multiple trajectory anchors. MV: multi-view cameras; SV: single-view camera; L: LiDAR.

| Method | Ref | Sensors | NC ↑ | DAC ↑ | TTC ↑ | C. ↑ | EP ↑ | PDMS ↑ |
|---|---|---|---|---|---|---|---|---|
| Human | – | – | 100.0 | 100.0 | 100.0 | 99.9 | 87.5 | 94.8 |
| UniAD | CVPR'23 | MV | 97.8 | 91.9 | 92.9 | 100.0 | 78.8 | 83.4 |
| TransFuser | TPAMI'23 | MV & L | 97.7 | 92.8 | 92.8 | 100.0 | 79.2 | 84.0 |
| PARA-Drive | CVPR'24 | MV | 97.9 | 92.4 | 93.0 | 99.8 | 79.3 | 84.0 |
| LAW | ICLR'25 | SV | 96.4 | 95.4 | 88.7 | 99.9 | 81.7 | 84.6 |
| DiffusionDrive | CVPR'25 | MV & L | 98.2 | 96.2 | 94.7 | 100.0 | 82.2 | 88.1 |
| WoTE | ICCV'25 | MV & L | 98.5 | 96.8 | 94.4 | 99.9 | 81.9 | 88.3 |
| *VLA-based Methods* | | | | | | | | |
| ReCogDrive\* | ICLR'26 | MV | 98.1 | 94.7 | 94.2 | 100.0 | 80.9 | 86.5 |
| DriveVLA-W0 | ICLR'26 | SV | 98.7 | 96.2 | 95.5 | 100.0 | 82.2 | 88.4 |
| AutoVLA | NeurIPS'25 | MV | 98.4 | 95.6 | 98.0 | 99.9 | 81.9 | 89.1 |
| DriveDreamer-Policy | arXiv'26 | MV | 98.4 | 97.1 | 95.1 | 100.0 | 83.5 | 89.2 |
| DriveVLA-W0† | ICLR'26 | SV | 98.7 | 99.1 | 95.3 | 99.3 | 83.3 | 90.2 |
| *WA-based Methods* | | | | | | | | |
| Epona | ICCV'25 | SV | 97.9 | 95.1 | 93.8 | 99.9 | 80.4 | 86.2 |
| WorldDrive | arXiv'26 | SV | 98.4 | 95.8 | 95.2 | 99.8 | 83.3 | 89.0 |
| **DriveWAM (Ours)** | – | SV | 98.3 | **98.1** | 95.2 | 100.0 | **84.3** | **90.1** |
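
For readers new to the benchmark, the NAVSIM v1 PDM score (PDMS) aggregates the sub-scores in the table: NC (no at-fault collision) and DAC (drivable area compliance) act as multiplicative penalties on a weighted average of TTC (time-to-collision), C. (comfort), and EP (ego progress) with weights 5, 2, and 5. The sketch below applies that aggregation to the Human row as a worked example. The function name `pdm_score` is hypothetical and is not part of the NavSim devkit. The benchmark scores each scenario and then averages, so feeding column averages into the formula only approximates the reported PDMS for other rows.

```python
def pdm_score(nc, dac, ttc, comfort, ep):
    """Aggregate NAVSIM v1 sub-scores (each in [0, 1]) into a PDM score.

    Hypothetical helper for illustration; not the NavSim devkit API.
    """
    # Weighted average of the three soft terms (weights 5 : 2 : 5).
    weighted = (5.0 * ttc + 2.0 * comfort + 5.0 * ep) / 12.0
    # The two penalty terms scale the weighted average.
    return nc * dac * weighted


# Human row from the table, converted from percent to fractions.
human = pdm_score(nc=1.000, dac=1.000, ttc=1.000, comfort=0.999, ep=0.875)
print(f"Human PDMS: {100 * human:.3f}")
```

The Human row evaluates to about 94.8, in line with the table, because its NC and DAC are both 100.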
### PhysicalAI-AV

Comparison on PhysicalAI-Autonomous-Vehicles.

| Method | Source | ADE@3s ↓ | FDE@3s ↓ | ADE@4s ↓ | FDE@4s ↓ |
|---|---|---|---|---|---|
| VaVAM | Valeo | 2.31 | 4.32 | – | – |
| Alpamayo-1.5 | NVIDIA | 0.80 | 2.31 | 1.44 | 4.18 |
| **DriveWAM (Ours)** | – | **0.47** | **1.35** | **0.83** | **2.47** |
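
ADE and FDE are the standard displacement errors for trajectory prediction: ADE averages the Euclidean distance between predicted and ground-truth ego positions over all waypoints up to the horizon, and FDE is the distance at the last waypoint. A minimal sketch follows. The 10 Hz waypoint rate, the 2D `(x, y)` representation, and the function name `ade_fde` are assumptions for illustration and may differ from the repository's evaluation code.

```python
import math


def ade_fde(pred, gt, horizon_s, hz=10):
    """Return (ADE, FDE) over the first `horizon_s` seconds.

    pred, gt: sequences of (x, y) waypoints sampled at `hz` (assumed 10 Hz),
    with the first waypoint one step after the current frame.
    """
    n = int(round(horizon_s * hz))
    if n <= 0 or len(pred) < n or len(gt) < n:
        raise ValueError("trajectories are shorter than the requested horizon")
    # Per-waypoint Euclidean distance between prediction and ground truth.
    dists = [math.dist(p, g) for p, g in zip(pred[:n], gt[:n])]
    return sum(dists) / n, dists[-1]


# Toy example: ground truth moves straight ahead by 1 unit per step;
# the prediction drifts sideways by 0.01 units per step.
gt = [(float(t), 0.0) for t in range(1, 41)]
pred = [(float(t), 0.01 * t) for t in range(1, 41)]
ade3, fde3 = ade_fde(pred, gt, horizon_s=3)
ade4, fde4 = ade_fde(pred, gt, horizon_s=4)
print(f"ADE@3s={ade3:.3f} FDE@3s={fde3:.3f} ADE@4s={ade4:.3f} FDE@4s={fde4:.3f}")
```

Because errors typically grow with the horizon, FDE is larger than ADE at the same horizon, which is the pattern visible in every row of the table.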
### Qualitative Results
![Qualitative Results](assets/result_vis.jpg)
### Data Scaling

DriveWAM's action prediction error improves consistently as training data scales from 4k to 100k clips. Scene-evolving (SE) guidance provides complementary benefit at every scale.

| # Clips | # Iters | SE Guidance | ADE@4s ↓ | FDE@4s ↓ |
|---|---|---|---|---|
| 4k | 50k | ✗ | 1.21 | 3.65 |
| 4k | 50k | ✓ | 1.01 | 2.95 |
| 20k | 50k | ✗ | 0.95 | 2.94 |
| 20k | 50k | ✓ | 0.94 | 2.65 |
| 100k | 50k | ✗ | 0.92 | 2.75 |
| 100k | 50k | ✓ | **0.83** | **2.47** |
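
To make the effect of SE guidance at each scale easier to read off, the snippet below computes the relative error reduction from the numbers in the table. It is plain arithmetic on the reported values; the variable and function names are illustrative.

```python
# (clips, ADE@4s without SE, ADE@4s with SE, FDE@4s without SE, FDE@4s with SE)
rows = [
    ("4k", 1.21, 1.01, 3.65, 2.95),
    ("20k", 0.95, 0.94, 2.94, 2.65),
    ("100k", 0.92, 0.83, 2.75, 2.47),
]


def relative_reduction(baseline, improved):
    """Fractional error reduction of `improved` relative to `baseline`."""
    return (baseline - improved) / baseline


# Map each data scale to its (ADE, FDE) relative reductions.
reductions = {
    clips: (relative_reduction(ade0, ade1), relative_reduction(fde0, fde1))
    for clips, ade0, ade1, fde0, fde1 in rows
}

for clips, (ade_red, fde_red) in reductions.items():
    print(f"{clips:>4}: ADE@4s -{ade_red:.1%}, FDE@4s -{fde_red:.1%}")
```

On these numbers, SE guidance lowers FDE@4s by roughly 10 to 19 percent at every scale, while the ADE@4s gain is smallest at 20k clips (about 1 percent).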
## News

- [Jun 7, 2026] We open-source all code and model weights.
- [May 27, 2026] We release the paper and project page.

## Getting Started

### Installation

First, clone this repository and set up the environment.

```bash
git clone <repository-url>
cd DriveWAM

# 1. Create conda environment
conda env create -f environment.yml
conda activate drivewam

# 2. Install PyTorch (CUDA 12.6)
pip install torch==2.9.0 torchvision==0.24.0 --index-url https://download.pytorch.org/whl/cu126

# 3. Install Flash Attention
pip install flash-attn==2.8.3 --no-build-isolation
```

Two optional extras, installed when you need the corresponding feature:

```bash
# NavSim evaluation extras (for the NavSim benchmark)
pip install -r requirements-navsim.txt

# VLM preprocessing extras (to generate navigation guidance)
pip install vllm qwen-vl-utils
```

## Data Preparation

DriveWAM trains and evaluates on two benchmarks. Prepare whichever you need.

### NavSim

Follow the [NavSim installation guide](https://github.com/autonomousvision/navsim) to download the nuPlan-based dataset and cache metric files, and export the environment variables from that guide (`OPENSCENE_DATA_ROOT`, `NUPLAN_MAPS_ROOT`, `NUPLAN_MAP_VERSION`). Then extract per-scene samples:

```bash
# navtrain split (training)
python -m src.navsim.process_data --output-path ./data/navsim/trainval

# navtest split (evaluation)
python -m src.navsim.process_data \
    --navsim-log-path $OPENSCENE_DATA_ROOT/navsim_logs/test \
    --sensor-blobs-path $OPENSCENE_DATA_ROOT/sensor_blobs/test \
    --scene-filter-config navsim/planning/script/config/common/train_test_split/scene_filter/navtest.yaml \
    --output-path ./data/navsim/test
```

Each scene becomes one `.pkl` file, which is what the training and evaluation scripts read by default:

```
./data/navsim/trainval/
    sample_000000.pkl
    sample_000001.pkl
    ...
```

### PhysicalAI-Autonomous-Vehicles

The raw dataset is hosted on [Hugging Face](https://huggingface.co/datasets/nvidia/PhysicalAI-Autonomous-Vehicles) and accessed through the [physical_ai_av](https://github.com/NVlabs/physical_ai_av) devkit. The devkit requires Python ≥ 3.11, so install it in a separate environment from `drivewam`:

```bash
pip install physical_ai_av
```

Accept the dataset license on the Hugging Face page, then download the dataset (or a subset of chunks) to `./data/physicalai`. DriveWAM only needs the **`camera_front_wide_120fov`** camera plus the egomotion and calibration features.

Extract 10 Hz clips from the download:

```bash
python -m src.physicalai.process_data \
    --dataset_root ./data/physicalai \
    --output_dir ./data/physicalai/front \
    --num_workers 16
```

This writes one directory per clip:

```
./data/physicalai/
├── clip_index.parquet   # official train/test split; keep it even if you prune the raw chunks
└── front/
    └── <clip_id>/
        ├── camera_front_wide_120fov.mp4
        └── camera_front_wide_120fov_ego.pkl
```

VLM navigation prompts are used as conditioning during training and inference. We provide pregenerated prompts: training-split prompts are available on [Hugging Face](https://huggingface.co/chenchenshi/DriveWAM); the 1k-sample test-split prompts used for evaluation are included in the repo at `src/physicalai/eval_data/prompts_test_sample_1k.json`. To regenerate them yourself:

```bash
# Step 1 – generate route / BEV / scene-evolving guidance
bash scripts/drivewam_physicalai_vlm_preprocess.sh

# Step 2 – VLM-based clip quality filtering and sub-sampling
SPLIT=train \
bash scripts/drivewam_physicalai_vlm_data_sample.sh
```

## Training

DriveWAM model checkpoints are available on [Hugging Face](https://huggingface.co/chenchenshi/DriveWAM).

DriveWAM is trained on top of [LingBot-VA Base](https://huggingface.co/robbyant/lingbot-va-base), a pretrained autoregressive diffusion transformer. Download the base model weights before training.

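
One way to fetch the base weights is with the `huggingface_hub` Python package, sketched below. `snapshot_download` is that library's function for downloading a whole repository. The local directory `./checkpoints/lingbot-va-base` and the helper names are assumptions, so point them at whatever path your config expects.

```python
from pathlib import Path

BASE_REPO_ID = "robbyant/lingbot-va-base"  # repo linked in the Training section


def default_local_dir(repo_id, root="./checkpoints"):
    """Map a Hub repo id to a local folder (hypothetical layout)."""
    return Path(root) / repo_id.split("/")[-1]


def download_base_model(repo_id=BASE_REPO_ID, root="./checkpoints"):
    """Download every file of the repo and return the local path."""
    # Imported lazily so this module loads even without huggingface_hub installed.
    from huggingface_hub import snapshot_download

    local_dir = default_local_dir(repo_id, root)
    snapshot_download(repo_id=repo_id, local_dir=str(local_dir))
    return local_dir


# Usage (requires network access and `pip install huggingface_hub`):
# path = download_base_model()
# print(f"Base model saved to {path}")
```
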
Key training hyperparameters (see configs for full details):

| Hyperparameter | NavSim / PhysicalAI |
|---|---|
| Training steps | 50 000 |
| Learning rate | 1e-5 |
| Optimizer | AdamW (β₁=0.9, β₂=0.95, wd=0.1) |
| Warmup steps | 10 |
| Batch size (per GPU) | 1 |
| Precision | bfloat16 |
| Input resolution | 256×448 |
| SNR shift (video / action) | 5.0 / 1.0 |

All experiments are conducted on 48 × NVIDIA H20 GPUs.

Edit the config (`src/configs/navsim_cfg.py` or `src/configs/physicalai_cfg.py`) to set your paths and hyperparameters, then launch with the matching script:

| Benchmark | Config | Launch script |
|---|---|---|
| NavSim | `src/configs/navsim_cfg.py` | `scripts/drivewam_navsim_train.sh` |
| PhysicalAI | `src/configs/physicalai_cfg.py` | `scripts/drivewam_physicalai_train.sh` |

```bash
# NavSim
bash scripts/drivewam_navsim_train.sh

# PhysicalAI
bash scripts/drivewam_physicalai_train.sh
```

For PhysicalAI, we provide three CSV files listing clip IDs at different training data scales (4k / 20k / 100k clips), available on [Hugging Face](https://huggingface.co/chenchenshi/DriveWAM). Set the `clip_csv` field in `src/configs/physicalai_cfg.py` to the desired scale before training.

## Evaluation

### NavSim (PDM Score)

PDM score evaluation requires a metric cache: a set of per-scenario `.pkl` files that store the precomputed map and route information used to score each predicted trajectory. The precomputed metric cache for the navtest split is available for download on [Hugging Face](https://huggingface.co/chenchenshi/DriveWAM). To generate it yourself, run:

```bash
python navsim/planning/script/run_metric_caching.py \
    train_test_split=navtest \
    cache.cache_path=./data/navsim/metric_cache
```

This writes one `metric_cache.pkl` per scenario token under `./data/navsim/metric_cache/`. Pass the resulting directory to the evaluation script via `--metric-cache-path`.

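
Before launching the evaluation, it can help to confirm that the cache directory is populated. The sketch below counts `metric_cache.pkl` files under the cache root. The exact directory nesting between the root and each file is not specified here, so the search is recursive, and the function name `count_metric_caches` is hypothetical.

```python
from pathlib import Path


def count_metric_caches(cache_root):
    """Count `metric_cache.pkl` files anywhere below `cache_root`."""
    root = Path(cache_root)
    if not root.is_dir():
        raise FileNotFoundError(f"metric cache directory not found: {root}")
    # rglob walks all subdirectories, so the nesting depth does not matter.
    return sum(1 for _ in root.rglob("metric_cache.pkl"))


# Usage (path taken from the caching command):
# n = count_metric_caches("./data/navsim/metric_cache")
# print(f"{n} scenarios cached")
```

A count of zero suggests the caching step wrote to a different path than the one passed to `--metric-cache-path`. With the cache in place, the evaluation is launched with:
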
```bash
python -m src.navsim.eval \
    --checkpoint-path /path/to/checkpoint \
    --config-name navsim_cfg \
    --dataset-path ./data/navsim/test \
    --metric-cache-path ./data/navsim/metric_cache
```

### PhysicalAI

```bash
python -m src.physicalai.eval \
    --checkpoint-path /path/to/checkpoint \
    --config-name physicalai_cfg
```

## Citation

If you find DriveWAM useful, please cite:

```bibtex
@article{shi2026drivewam,
  title={DriveWAM: Video Generative Priors Enable Scalable World-Action Modeling for Autonomous Driving},
  author={Shi, Chen and Xu, Jinrui and Shi, Shaoshuai and Sheng, Kehua and Zhang, Bo and Jiang, Li},
  journal={arXiv preprint arXiv:2605.28544},
  year={2026}
}
```

## Acknowledgements

We gratefully acknowledge the following open-source projects that DriveWAM builds upon: [Wan2.2](https://github.com/Wan-Video/Wan2.2), [LingBot-VA](https://github.com/robbyant/lingbot-va), [NavSim](https://github.com/autonomousvision/navsim), [NVIDIA PhysicalAI-Autonomous-Vehicles](https://huggingface.co/datasets/nvidia/PhysicalAI-Autonomous-Vehicles).