<h1 align='center'>WAM-Diff: A Masked Diffusion VLA Framework with MoE and Online Reinforcement Learning for Autonomous Driving</h1>
<div align='center'>
    <a href='https://github.com/xumingw' target='_blank'>Mingwang Xu</a><sup>1*</sup>&emsp;
    <a href='https://cuijh26.github.io/' target='_blank'>Jiahao Cui</a><sup>1*</sup>&emsp;
    <a href='https://github.com/fudan-generative-vision/WAM-Diff' target='_blank'>Feipeng Cai</a><sup>2*</sup>&emsp;
    <a href='https://github.com/NinoNeumann' target='_blank'>Hanlin Shang</a><sup>1*</sup>&emsp;
    <a href='https://github.com/SSSSSSuger' target='_blank'>Zhihao Zhu</a><sup>1</sup>&emsp;
    <a href='https://github.com/isan089' target='_blank'>Shan Luan</a><sup>1</sup>&emsp;
</div>
<div align='center'>
    <a href='https://github.com/YoucanBaby' target='_blank'>Yifang Xu</a><sup>1</sup>&emsp;
    <a href='https://github.com/fudan-generative-vision/WAM-Diff' target='_blank'>Neng Zhang</a><sup>2</sup>&emsp;
    <a href='https://github.com/fudan-generative-vision/WAM-Diff' target='_blank'>Yaoyi Li</a><sup>2</sup>&emsp;
    <a href='https://github.com/fudan-generative-vision/WAM-Diff' target='_blank'>Jia Cai</a><sup>2</sup>&emsp;
    <a href='https://sites.google.com/site/zhusiyucs/home' target='_blank'>Siyu Zhu</a><sup>1</sup>&emsp;
</div>

<div align='center'>
    <sup>1</sup>Fudan University&emsp; <sup>2</sup>Yinwang Intelligent Technology Co., Ltd&emsp;
</div>

<br>
<div align='center'>
    <a href='https://github.com/fudan-generative-vision/WAM-Diff'><img src='https://img.shields.io/github/stars/fudan-generative-vision/WAM-Diff?style=social'></a>
    <a href='https://arxiv.org/abs/2512.11872'><img src='https://img.shields.io/badge/Paper-Arxiv-red'></a>
    <a href='https://huggingface.co/fudan-generative-ai/WAM-Diff'><img src='https://img.shields.io/badge/%F0%9F%A4%97%20HuggingFace-Model-yellow'></a>

</div>
<br>

## 📰 News

- **`2026/02/01`**: 🎉🎉🎉 Released the pretrained models on [Hugging Face](https://huggingface.co/fudan-generative-ai/WAM-Diff).
- **`2025/12/06`**: 🎉🎉🎉 Paper submitted to [arXiv](https://arxiv.org/pdf/2512.11872).

## πŸ“…οΈ Roadmap

| Status | Milestone                                                                                            |    ETA     |
| :----: | :--------------------------------------------------------------------------------------------------: | :--------: |
|   ✅   | **[Release the inference source code](https://github.com/fudan-generative-vision/WAM-Diff)**         | 2025.12.21 |
|   ✅   | **[Release the SFT and inference code](https://github.com/fudan-generative-vision/WAM-Diff)**        | 2025.12.21 |
|   ✅   | **[Release pretrained models on Hugging Face](https://huggingface.co/fudan-generative-ai/WAM-Diff)** | 2026.02.01 |
|   🚀   | **[Release NAVSIM evaluation code](https://huggingface.co/fudan-generative-ai/WAM-Diff)**            | TBD        |
|   🚀   | **[Release the RL code](https://github.com/fudan-generative-vision/WAM-Diff)**                       | TBD        |


## πŸ”§οΈ Framework
![framework](assets/main_arch.png)

## πŸ† Qualitative Results on NAVSIM
### NAVSIM-v1 benchmark results
<div style="text-align: center;">
  <img src="assets/navsim-v1.png" alt="navsim-v1" width="70%" />
</div>

### NAVSIM-v2 benchmark results
<div style="text-align: center;">
<img src="assets/navsim-v2.png" alt="navsim-v2" width="90%" />
</div>



## Quick Inference Demo
The pretrained WAM-Diff models are available on the [Hugging Face Hub](https://huggingface.co/fudan-generative-ai/WAM-Diff). To quickly test the model, follow these steps:

1. **Clone the repository**  
   ```bash
   git clone https://github.com/fudan-generative-vision/WAM-Diff
   cd WAM-Diff
   ```
2. **Initialize the environment**  
   If you prefer conda, run the environment setup script to install necessary dependencies:
   ```bash
   bash init_env.sh
   ```
   Alternatively, create the environment with uv:
   ```bash
   uv venv && uv sync
   ```
3. **Prepare the Model**
    Download the pretrained [WAM-Diff](https://huggingface.co/fudan-generative-ai/WAM-Diff) model from Hugging Face to the `./model/WAM-Diff` directory:
    ```
    https://huggingface.co/fudan-generative-ai/WAM-Diff
    ```
    Download the pretrained SigLIP 2 model from Hugging Face to the `./model/siglip2-so400m-patch14-384` directory:
   ```
   https://huggingface.co/google/siglip2-so400m-patch14-384
   ```


4. **Run the demo script**  
   Execute the demo script to test WAM-Diff on an example image:
   ```bash
   bash inf.sh
   ```

## Training
To fine-tune WAM-Diff, please follow these steps:
1. **Set Up the Environment**  
   Follow the same environment setup steps as in the Quick Inference Demo section.
2. **Prepare the Data**  
   Prepare your training dataset in JSON format, for example:
    ```json
    [
        {
            "image": ["path/to/image1.png"],
            "conversations": [
                {
                    "from": "human",
                    "value": "Here is front views of a driving vehicle:\n<image>\nThe navigation information is: straight\nThe current position is (0.00,0.00)\nCurrent velocity is: (13.48,-0.29)  and current accelerate is: (0.19,0.05)\nPredict the optimal driving action for the next 4 seconds with 8 new waypoints."
                },
                {
                    "from": "gpt",
                    "value": "6.60,-0.01,13.12,-0.03,19.58,-0.04,25.95,-0.03,32.27,-0.03,38.56,-0.05,44.88,-0.06,51.16,-0.09"
                }
            ]
        },
        ...
    ]
    ```
3. **Run the Training Script**  
   Execute the training script with the following command:
   ```bash
   cd train
   bash ./scripts/llada_v_finetune.sh
   ```
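Each sample above pairs an image and a driving prompt with a flat, comma-separated answer string encoding 8 `(x, y)` waypoints. As a minimal sketch (the helper names `build_sample` and `parse_waypoints` are hypothetical, not part of the released pipeline), here is how such a sample could be assembled and its waypoint string parsed back into coordinates:

```python
import json


def build_sample(image_path, prompt, waypoints):
    """Serialize one training sample in the JSON schema shown above.

    `waypoints` is a list of (x, y) tuples; the GPT answer is a flat
    comma-separated string with two decimals per value.
    """
    answer = ",".join(f"{x:.2f},{y:.2f}" for x, y in waypoints)
    return {
        "image": [image_path],
        "conversations": [
            {"from": "human", "value": prompt},
            {"from": "gpt", "value": answer},
        ],
    }


def parse_waypoints(answer):
    """Parse a flat 'x1,y1,...,xN,yN' string back into (x, y) pairs."""
    vals = [float(v) for v in answer.split(",")]
    assert len(vals) % 2 == 0, "expected an even number of coordinates"
    return list(zip(vals[0::2], vals[1::2]))


sample = build_sample(
    "path/to/image1.png",
    "Predict the optimal driving action for the next 4 seconds with 8 new waypoints.",
    [(6.60, -0.01), (13.12, -0.03), (19.58, -0.04), (25.95, -0.03),
     (32.27, -0.03), (38.56, -0.05), (44.88, -0.06), (51.16, -0.09)],
)
waypoints = parse_waypoints(sample["conversations"][1]["value"])
print(len(waypoints))  # → 8
print(json.dumps(sample, indent=4))
```

Note that the decoded waypoints are ego-centric coordinates relative to the current position `(0.00, 0.00)` given in the prompt.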

## πŸ“ Citation

If you find our work useful for your research, please consider citing the paper:

```bibtex
@article{xu2025wam,
  title={WAM-Diff: A Masked Diffusion VLA Framework with MoE and Online Reinforcement Learning for Autonomous Driving},
  author={Xu, Mingwang and Cui, Jiahao and Cai, Feipeng and Shang, Hanlin and Zhu, Zhihao and Luan, Shan and Xu, Yifang and Zhang, Neng and Li, Yaoyi and Cai, Jia and others},
  journal={arXiv preprint arXiv:2512.11872},
  year={2025}
}
```

## 🤗 Acknowledgements
We gratefully acknowledge the contributors to the [LLaDA-V](https://github.com/ML-GSAI/LLaDA-V) repository, whose commitment to open source has provided us with an excellent codebase and pretrained models.