<h1 align='center'>WAM-Diff: A Masked Diffusion VLA Framework with MoE and Online Reinforcement Learning for Autonomous Driving</h1>
<div align='center'>
<a href='https://github.com/xumingw' target='_blank'>Mingwang Xu</a><sup>1*</sup> 
<a href='https://cuijh26.github.io/' target='_blank'>Jiahao Cui</a><sup>1*</sup> 
<a href='https://github.com/fudan-generative-vision/WAM-Diff' target='_blank'>Feipeng Cai</a><sup>2*</sup> 
<a href='https://github.com/NinoNeumann' target='_blank'>Hanlin Shang</a><sup>1*</sup> 
<a href='https://github.com/SSSSSSuger' target='_blank'>Zhihao Zhu</a><sup>1</sup> 
<a href='https://github.com/isan089' target='_blank'>Shan Luan</a><sup>1</sup> 
</div>
<div align='center'>
<a href='https://github.com/YoucanBaby' target='_blank'>Yifang Xu</a><sup>1</sup> 
<a href='https://github.com/fudan-generative-vision/WAM-Diff' target='_blank'>Neng Zhang</a><sup>2</sup> 
<a href='https://github.com/fudan-generative-vision/WAM-Diff' target='_blank'>Yaoyi Li</a><sup>2</sup> 
<a href='https://github.com/fudan-generative-vision/WAM-Diff' target='_blank'>Jia Cai</a><sup>2</sup> 
<a href='https://sites.google.com/site/zhusiyucs/home' target='_blank'>Siyu Zhu</a><sup>1</sup> 
</div>
<div align='center'>
<sup>1</sup>Fudan University  <sup>2</sup>Yinwang Intelligent Technology Co., Ltd 
</div>
<br>
<div align='center'>
<a href='https://github.com/fudan-generative-vision/WAM-Diff'><img src='https://img.shields.io/github/stars/fudan-generative-vision/WAM-Diff?style=social'></a>
<a href='https://arxiv.org/abs/2512.11872'><img src='https://img.shields.io/badge/Paper-Arxiv-red'></a>
<a href='https://huggingface.co/fudan-generative-ai/WAM-Diff'><img src='https://img.shields.io/badge/%F0%9F%A4%97%20HuggingFace-Model-yellow'></a>
</div>
<br>
## 📰 News
- **`2026/02/01`**: 🔥🔥🔥 Released the pretrained models on [Hugging Face](https://huggingface.co/fudan-generative-ai/WAM-Diff).
- **`2025/12/06`**: 🔥🔥🔥 Paper submitted to [arXiv](https://arxiv.org/pdf/2512.11872).
## 🗓️ Roadmap
| Status | Milestone | ETA |
| :----: | :----------------------------------------------------------------------------------------------------: | :--------: |
| ✅ | **[Release the inference source code](https://github.com/fudan-generative-vision/WAM-Diff)** | 2025.12.21 |
| ✅ | **[Release the SFT and inference code](https://github.com/fudan-generative-vision/WAM-Diff)** | 2025.12.21 |
| ✅ | **[Release pretrained models on Hugging Face](https://huggingface.co/fudan-generative-ai/WAM-Diff)** | 2026.02.01 |
| 🚀 | **[Release NAVSIM evaluation code](https://huggingface.co/fudan-generative-ai/WAM-Diff)** | TBD |
| 🚀 | **[Release the RL code](https://github.com/fudan-generative-vision/WAM-Diff)** | TBD |
## 🛠️ Framework

## 📊 Qualitative Results on NAVSIM
### NAVSIM-v1 benchmark results
<div style="text-align: center;">
<img src="assets/navsim-v1.png" alt="navsim-v1" width="70%" />
</div>
### NAVSIM-v2 benchmark results
<div style="text-align: center;">
<img src="assets/navsim-v2.png" alt="navsim-v2" width="90%" />
</div>
## Quick Inference Demo
The pretrained WAM-Diff model is available on the [Hugging Face Hub](https://huggingface.co/fudan-generative-ai/WAM-Diff). To quickly test it, follow these steps:
1. **Clone the repository**
```bash
git clone https://github.com/fudan-generative-vision/WAM-Diff
cd WAM-Diff
```
2. **Initialize the environment**
If you prefer conda, run the environment setup script to install the necessary dependencies:
```bash
bash init_env.sh
```
Alternatively, you can use uv to create the environment:
```bash
uv venv && uv sync
```
3. **Prepare the Model**
Download the pretrained [WAM-Diff](https://huggingface.co/fudan-generative-ai/WAM-Diff) model from Hugging Face to the `./model/WAM-Diff` directory:
```
https://huggingface.co/fudan-generative-ai/WAM-Diff
```
Download the pretrained SigLIP 2 model from Hugging Face to the `./model/siglip2-so400m-patch14-384` directory:
```
https://huggingface.co/google/siglip2-so400m-patch14-384
```
4. **Run the demo script**
Execute the demo script to test WAM-Diff on an example image:
```bash
bash inf.sh
```
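As an optional alternative to downloading the checkpoints manually, both repos can be fetched with the `huggingface_hub` library. This is an illustrative sketch, not part of the official WAM-Diff scripts; the target directories simply mirror the paths given above.

```python
# Illustrative helper (not part of the official WAM-Diff scripts):
# download both pretrained checkpoints with huggingface_hub.

# Repo IDs mapped to the local directories the demo expects.
MODELS = {
    "fudan-generative-ai/WAM-Diff": "./model/WAM-Diff",
    "google/siglip2-so400m-patch14-384": "./model/siglip2-so400m-patch14-384",
}

def download_models(models=MODELS):
    """Fetch a snapshot of each repo into its local directory."""
    # Requires: pip install huggingface_hub
    from huggingface_hub import snapshot_download
    for repo_id, local_dir in models.items():
        snapshot_download(repo_id=repo_id, local_dir=local_dir)

# Call download_models() to start the downloads.
```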
## Training
To fine-tune WAM-Diff, please follow these steps:
1. **Set Up the Environment**
Follow the same environment setup steps as in the Quick Inference Demo section.
2. **Prepare the Data**
Prepare your training dataset as a JSON file with the following structure:
```json
[
  {
    "image": ["path/to/image1.png"],
    "conversations": [
      {
        "from": "human",
        "value": "Here is front views of a driving vehicle:\n<image>\nThe navigation information is: straight\nThe current position is (0.00,0.00)\nCurrent velocity is: (13.48,-0.29) and current accelerate is: (0.19,0.05)\nPredict the optimal driving action for the next 4 seconds with 8 new waypoints."
      },
      {
        "from": "gpt",
        "value": "6.60,-0.01,13.12,-0.03,19.58,-0.04,25.95,-0.03,32.27,-0.03,38.56,-0.05,44.88,-0.06,51.16,-0.09"
      }
    ]
  },
  ...
]
```
3. **Run the Training Script**
Execute the training script with the following command:
```bash
cd train
bash ./scripts/llada_v_finetune.sh
```
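For clarity on the target format in step 2: the `gpt` value is a flat comma-separated string encoding 8 waypoints as alternating x,y coordinates over the 4-second horizon. A minimal sketch of decoding it back into coordinate pairs (the helper name is my own, not from the repo):

```python
def parse_waypoints(answer: str):
    """Decode a flat 'x1,y1,x2,y2,...' string into a list of (x, y) tuples."""
    values = [float(v) for v in answer.split(",")]
    assert len(values) % 2 == 0, "expected an even number of coordinates"
    # Even indices are x coordinates, odd indices are y coordinates.
    return list(zip(values[0::2], values[1::2]))

# The example answer from the dataset format above:
waypoints = parse_waypoints(
    "6.60,-0.01,13.12,-0.03,19.58,-0.04,25.95,-0.03,"
    "32.27,-0.03,38.56,-0.05,44.88,-0.06,51.16,-0.09"
)
# 8 waypoints, one every 0.5 s over the next 4 seconds
```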
## 📝 Citation
If you find our work useful for your research, please consider citing the paper:
```
@article{xu2025wam,
  title={WAM-Diff: A Masked Diffusion VLA Framework with MoE and Online Reinforcement Learning for Autonomous Driving},
  author={Xu, Mingwang and Cui, Jiahao and Cai, Feipeng and Shang, Hanlin and Zhu, Zhihao and Luan, Shan and Xu, Yifang and Zhang, Neng and Li, Yaoyi and Cai, Jia and others},
  journal={arXiv preprint arXiv:2512.11872},
  year={2025}
}
```
## 🤗 Acknowledgements
We gratefully acknowledge the contributors to the [LLaDA-V](https://github.com/ML-GSAI/LLaDA-V) repository, whose commitment to open source has provided us with an excellent codebase and pretrained models.