---
license: apache-2.0
language:
- en
tags:
- autonomous-driving
- vision-language-action
- chain-of-thought
- trajectory-prediction
- VLA
base_model:
- Qwen/Qwen3-VL-4B-Instruct
pipeline_tag: image-text-to-text
---

# OneVL: One-Step Latent Reasoning and Planning with Vision-Language Explanation

**[📄 Paper (arXiv)](https://arxiv.org/abs/2604.18486)** | **[💻 GitHub](https://github.com/xiaomi-research/onevl)** | **[🌐 Project Page](https://Xiaomi-Embodied-Intelligence.github.io/OneVL/)**

*Xiaomi Embodied Intelligence Team*

---

## Overview

**OneVL** is a Vision-Language-Action (VLA) framework for autonomous driving that achieves **state-of-the-art trajectory prediction accuracy** while matching the inference latency of answer-only autoregressive models.

Prior latent Chain-of-Thought (CoT) methods compress reasoning into opaque hidden states; this is fast, but it consistently underperforms explicit CoT on driving tasks. OneVL identifies the root cause: purely linguistic latents encode abstract semantic labels rather than the spatiotemporal causal dynamics that govern real driving scenes. OneVL addresses this with **dual-modal auxiliary decoders** that force compact latent tokens to encode both human-readable reasoning *and* future scene dynamics simultaneously.

At inference, both decoders are discarded and all latents are **prefilled** into the prompt context in a single parallel pass, matching answer-only AR prediction speed while recovering the interpretability of explicit CoT in both vision and language.

OneVL is the **first latent CoT method to surpass explicit autoregressive CoT** across all four driving benchmarks.

---

## Architecture

OneVL augments **Qwen3-VL-4B-Instruct** with three components:

**Latent Token Interface**: 4 visual latent tokens and 2 language latent tokens are inserted into the assistant response before the answer, reusing existing vocabulary tokens (no new special tokens are added).

**Visual Auxiliary Decoder**: Predicts future-frame visual tokens at t+0.5s and t+1.0s from the visual latent hidden states (using the Emu3.5 IBQ 131k codebook). Acts as a **world model** supervision signal that forces the latent space to encode genuine physical scene dynamics (agent trajectories, road geometry, and environmental change) rather than abstract descriptions.

**Language Auxiliary Decoder**: Reconstructs explicit CoT reasoning text from the language latent hidden states, conditioned on ViT visual features. Recovers 97% of explicit CoT text quality while running at answer-only speed.

**Prefill Inference**: Both decoders are discarded at inference time. All latent tokens are processed in a single parallel prefill pass; only the trajectory answer is generated autoregressively. This achieves a **1.5× speedup over explicit CoT on NAVSIM** and **2.3× on ROADWork**.
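
To make the prefill step concrete, here is a toy, text-only sketch of the control flow (not the OneVL code): latent placeholders drawn from the existing vocabulary are appended to the prompt, the whole sequence is processed in one parallel forward pass, and only the answer is decoded autoregressively. The model, prompt, and placeholder token below are illustrative assumptions.

```python
# Toy illustration of prefill-style latent inference with a text-only LM.
# This is NOT the OneVL implementation; it only demonstrates the pattern of
# one parallel pass over prompt + latent tokens, followed by AR decoding.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

prompt_ids = tok("Front camera: clear road ahead. Plan the next trajectory:",
                 return_tensors="pt").input_ids
latent_ids = tok(" the" * 6, return_tensors="pt").input_ids  # 6 ordinary-vocab stand-ins (4 visual + 2 language)
ids = torch.cat([prompt_ids, latent_ids], dim=-1)

with torch.no_grad():
    # Single parallel prefill over the prompt and all latent tokens.
    out = model(ids, use_cache=True)
    past = out.past_key_values
    next_id = out.logits[:, -1].argmax(-1, keepdim=True)

    # Only the answer is generated autoregressively from the cached context.
    answer_ids = [next_id]
    for _ in range(32):
        out = model(answer_ids[-1], past_key_values=past, use_cache=True)
        past = out.past_key_values
        answer_ids.append(out.logits[:, -1].argmax(-1, keepdim=True))

print(tok.decode(torch.cat(answer_ids, dim=-1)[0]))
```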

### Three-Stage Training Pipeline

Training proceeds in three stages to ensure stable joint optimization:
- **Stage 0**: Main model warmup (trajectory prediction)
- **Stage 1**: Auxiliary decoder warmup (language + visual decoders independently)
- **Stage 2**: Joint end-to-end fine-tuning (all components together)

Staged training is essential: ablation shows that skipping it collapses the PDM-score from 88.84 to 67.13.
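
As a rough illustration of the schedule, the toy sketch below freezes and unfreezes module groups per stage. Which modules are actually trainable in each stage is an assumption here; the released training code will be the authoritative reference.

```python
# Toy sketch of stage-wise training via parameter freezing.
# Module names and the per-stage freezing policy are assumptions made for
# illustration; only the Stage 0 -> 1 -> 2 ordering follows the list above.
import torch
import torch.nn as nn

class ToyOneVL(nn.Module):
    def __init__(self):
        super().__init__()
        self.backbone = nn.Linear(16, 16)      # stand-in for the main VLM
        self.lang_decoder = nn.Linear(16, 16)  # language auxiliary decoder
        self.vis_decoder = nn.Linear(16, 16)   # visual auxiliary decoder

def set_trainable(model: nn.Module, prefixes: list[str]) -> None:
    """Freeze everything, then unfreeze parameters whose name matches a prefix."""
    for name, param in model.named_parameters():
        param.requires_grad = any(name.startswith(p) for p in prefixes)

model = ToyOneVL()
stages = [
    ("Stage 0: main-model warmup",            ["backbone"]),
    ("Stage 1a: language-decoder warmup",     ["lang_decoder"]),
    ("Stage 1b: visual-decoder warmup",       ["vis_decoder"]),
    ("Stage 2: joint end-to-end fine-tuning", ["backbone", "lang_decoder", "vis_decoder"]),
]
for label, trainable in stages:
    set_trainable(model, trainable)
    params = [p for p in model.parameters() if p.requires_grad]
    optimizer = torch.optim.AdamW(params, lr=1e-4)  # the stage's training loop would run here
    print(f"{label}: {sum(p.numel() for p in params)} trainable parameters")
```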

---

## Results

### NAVSIM

| Method | Model Size | PDM-score ↑ | Latency (s) ↓ | Interpretability |
|---|:---:|:---:|:---:|:---:|
| AR Answer | 4B | 87.47 | 4.49 | None |
| AR CoT+Answer | 4B | 88.29 | 6.58 | Language |
| COCONUT | 4B | 84.84 | 5.93 | None |
| CODI | 4B | 83.92 | 8.62 | None |
| SIM-CoT | 4B | 84.21 | 10.86 | Language |
| **OneVL** | **4B** | **88.84** | **4.46** | **Vision + Language** |

### ROADWork

| Method | ADE (px) ↓ | FDE (px) ↓ | Latency (s) ↓ |
|---|:---:|:---:|:---:|
| AR CoT+Answer | 13.18 | 29.98 | 10.74 |
| **OneVL** | **12.49** | **28.80** | **4.71** |

### Impromptu

| Method | ADE (m) ↓ | FDE (m) ↓ | Latency (s) ↓ |
|---|:---:|:---:|:---:|
| AR CoT+Answer | 1.42 | 3.96 | 6.84 |
| **OneVL** | **1.34** | **3.70** | **4.02** |

### APR1 (Alpamayo-R1)

| Method | ADE (m) ↓ | FDE (m) ↓ | Latency (s) ↓ |
|---|:---:|:---:|:---:|
| AR CoT+Answer | 2.99 | 8.54 | 3.51 |
| **OneVL** | **2.62** | 7.53 | **3.26** |

### CoT Text Quality (NAVSIM)

| Method | Meta Action Acc. ↑ | STS Score ↑ | LLM Judge ↑ | Latency (s) ↓ |
|---|:---:|:---:|:---:|:---:|
| AR CoT+Answer | 73.20 | 79.75 | 81.86 | 6.58 |
| **OneVL** | 71.00 | 78.26 | 79.13 | **4.46** |

OneVL's language auxiliary decoder recovers 97% of explicit CoT quality at answer-only inference speed.

---

## Usage

### Requirements

- Python 3.10+, CUDA GPU (≥16 GB VRAM recommended)
- `transformers >= 4.57.0` (required for `Qwen3VLForConditionalGeneration`)

```bash
uv venv venv/onevl --python 3.12
source venv/onevl/bin/activate
pip install -r requirements.txt
```
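
As a quick environment check, the snippet below loads the base architecture with the `Qwen3VLForConditionalGeneration` class mentioned above and runs a plain Qwen3-VL generation call. It does not exercise the OneVL latent-token pipeline (use `infer_onevl.py` for that); whether the OneVL checkpoint loads directly through this class, as well as the image path, are assumptions.

```python
# Sanity check: load the base Qwen3-VL model and run a single generation.
# Standard Qwen3-VL usage with transformers >= 4.57; the local image path
# is a placeholder and the OneVL checkpoint path is illustrative.
import torch
from transformers import AutoProcessor, Qwen3VLForConditionalGeneration

model_path = "Qwen/Qwen3-VL-4B-Instruct"  # or /path/to/OneVL-checkpoint
model = Qwen3VLForConditionalGeneration.from_pretrained(
    model_path, torch_dtype="auto", device_map="cuda:0"
)
processor = AutoProcessor.from_pretrained(model_path)

messages = [{
    "role": "user",
    "content": [
        {"type": "image", "image": "front_camera.jpg"},  # placeholder image
        {"type": "text", "text": "Describe the driving scene."},
    ],
}]
inputs = processor.apply_chat_template(
    messages, add_generation_prompt=True, tokenize=True,
    return_dict=True, return_tensors="pt",
).to(model.device)

with torch.no_grad():
    output_ids = model.generate(**inputs, max_new_tokens=64)

answer = processor.batch_decode(
    output_ids[:, inputs["input_ids"].shape[1]:], skip_special_tokens=True
)[0]
print(answer)
```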

### Inference (Trajectory Prediction Only)

```bash
python infer_onevl.py \
    --model_path /path/to/OneVL-checkpoint \
    --test_set_path test_data/navsim_test.json \
    --image_base_path "" \
    --output_path output/navsim/results.json \
    --device cuda:0 \
    --num_latent 2 --num_latent_vis 4 \
    --max_new_tokens 1024 --answer_prefix "[" --prefix_k 0
```

### Inference with Language + Visual Explanation

```bash
python infer_onevl.py \
    --model_path /path/to/OneVL-checkpoint \
    --test_set_path test_data/navsim_test.json \
    --image_base_path "" \
    --output_path output/navsim/results_explain.json \
    --device cuda:0 \
    --num_latent 2 --num_latent_vis 4 \
    --max_new_tokens 1024 --answer_prefix "[" --prefix_k 0 \
    --decoder_explain --aux_visual_condition \
    --c_thought 2 --max_explain_tokens 1024 \
    --visual_decoder_explain --visual_aux_visual_condition \
    --c_thought_visual 4 --max_visual_tokens 2560
```

### Multi-GPU Inference

```bash
export MODEL_PATH=/path/to/OneVL-checkpoint
export TEST_SET_PATH=test_data/navsim_test.json
export OUTPUT_PATH=output/navsim/navsim_results.json
bash run_infer.sh
```

Per-benchmark scripts are available in `scripts/`:

```bash
bash scripts/infer_navsim.sh
bash scripts/infer_ar1.sh
bash scripts/infer_roadwork.sh
bash scripts/infer_impromptu.sh
```

For full documentation, evaluation scripts, and data format details, see the [GitHub repository](https://github.com/xiaomi-research/onevl).

---

## Open-Source Status

| Component | Status |
|---|:---:|
| Technical Report | ✅ Released |
| Model Weights | ✅ Released |
| Inference Code | ✅ Released |
| Training Code | 🔜 Coming Soon |

---

## Citation

```bibtex
@article{lu2026onevl,
  title={OneVL: One-Step Latent Reasoning and Planning with Vision-Language Explanation},
  author={Lu, Jinghui and Guan, Jiayi and Huang, Zhijian and Li, Jinlong and Li, Guang and Kong, Lingdong and Li, Yingyan and Wang, Han and Xu, Shaoqing and Luo, Yuechen and others},
  journal={arXiv preprint arXiv:2604.18486},
  year={2026},
  url={https://arxiv.org/abs/2604.18486}
}
```

---

## License

Released under the [Apache 2.0 License](https://www.apache.org/licenses/LICENSE-2.0).

Model weights are built on [Qwen3-VL-4B-Instruct](https://huggingface.co/Qwen/Qwen3-VL-4B-Instruct) and the visual tokenizer is from [Emu3.5-VisionTokenizer](https://huggingface.co/BAAI/Emu3.5-VisionTokenizer); please refer to their respective licenses as well.