nielsr HF Staff commited on
Commit
a66028f
·
verified ·
1 Parent(s): c1e893e

Add library_name to metadata

Browse files

Hi! I'm Niels from the community science team at Hugging Face.

This PR adds `library_name: transformers` to the YAML metadata of the model card. Given that the model uses the `Qwen3VLForConditionalGeneration` architecture and requires `transformers >= 4.57.0`, adding this tag will enable the "Use in Transformers" feature on the Hub, making it easier for researchers to load and use your model.

I've also kept the existing detailed documentation, results, and usage examples.

Files changed (1) hide show
  1. README.md +18 -76
README.md CHANGED
@@ -1,16 +1,17 @@
1
  ---
2
- license: apache-2.0
 
3
  language:
4
  - en
 
 
 
5
  tags:
6
  - autonomous-driving
7
  - vision-language-action
8
  - chain-of-thought
9
  - trajectory-prediction
10
  - VLA
11
- base_model:
12
- - Qwen/Qwen3-VL-4B-Instruct
13
- pipeline_tag: image-text-to-text
14
  ---
15
 
16
  # OneVL: One-Step Latent Reasoning and Planning with Vision-Language Explanation
@@ -37,67 +38,24 @@ OneVL is the **first latent CoT method to surpass explicit autoregressive CoT**
37
 
38
  OneVL augments **Qwen3-VL-4B-Instruct** with three components:
39
 
40
- **Latent Token Interface** — 4 visual latent tokens + 2 language latent tokens are inserted in the assistant response before the answer, using existing vocabulary tokens (no new special tokens added).
41
-
42
- **Visual Auxiliary Decoder** — Predicts future-frame visual tokens at t+0.5s and t+1.0s from visual latent hidden states (using the Emu3.5 IBQ 131k codebook). Acts as a **world model** supervision signal that forces the latent space to encode genuine physical scene dynamics — agent trajectories, road geometry, and environmental change — rather than abstract descriptions.
43
-
44
- **Language Auxiliary Decoder** — Reconstructs explicit CoT reasoning text from language latent hidden states, conditioned on ViT visual features. Recovers 97% of explicit CoT text quality while running at answer-only speed.
45
-
46
- **Prefill Inference** — Both decoders are discarded at inference time. All latent tokens are processed in a single parallel prefill pass; only the trajectory answer is generated autoregressively. This achieves **1.5× speedup over explicit CoT on NAVSIM** and **2.3× on ROADWork**.
47
-
48
- ### Three-Stage Training Pipeline
49
-
50
- Training proceeds in three stages to ensure stable joint optimization:
51
- - **Stage 0**: Main model warmup (trajectory prediction)
52
- - **Stage 1**: Auxiliary decoder warmup (language + visual decoders independently)
53
- - **Stage 2**: Joint end-to-end fine-tuning (all components together)
54
-
55
- Staged training is essential — ablation shows that skipping it collapses PDM-score from 88.84 to 67.13.
56
 
57
  ---
58
 
59
  ## Results
60
 
61
- ### NAVSIM
62
 
63
  | Method | Model Size | PDM-score ↑ | Latency (s) ↓ | Interpretability |
64
  |---|:---:|:---:|:---:|:---:|
65
  | AR Answer | 4B | 87.47 | 4.49 | — |
66
  | AR CoT+Answer | 4B | 88.29 | 6.58 | Language |
67
- | COCONUT | 4B | 84.84 | 5.93 | — |
68
- | CODI | 4B | 83.92 | 8.62 | — |
69
- | SIM-CoT | 4B | 84.21 | 10.86 | Language |
70
  | **OneVL** | **4B** | **88.84** | **4.46** | **Vision + Language** |
71
 
72
- ### ROADWork
73
-
74
- | Method | ADE (px) ↓ | FDE (px) ↓ | Latency (s) ↓ |
75
- |---|:---:|:---:|:---:|
76
- | AR CoT+Answer | 13.18 | 29.98 | 10.74 |
77
- | **OneVL** | **12.49** | **28.80** | **4.71** |
78
-
79
- ### Impromptu
80
-
81
- | Method | ADE (m) ↓ | FDE (m) ↓ | Latency (s) ↓ |
82
- |---|:---:|:---:|:---:|
83
- | AR CoT+Answer | 1.42 | 3.96 | 6.84 |
84
- | **OneVL** | **1.34** | **3.70** | **4.02** |
85
-
86
- ### APR1 (Alpamayo-R1)
87
-
88
- | Method | ADE (m) ↓ | FDE (m) ↓ | Latency (s) ↓ |
89
- |---|:---:|:---:|:---:|
90
- | AR CoT+Answer | 2.99 | 8.54 | 3.51 |
91
- | **OneVL** | **2.62** | 7.53 | **3.26** |
92
-
93
- ### CoT Text Quality (NAVSIM)
94
-
95
- | Method | Meta Action Acc. ↑ | STS Score ↑ | LLM Judge ↑ | Latency (s) ↓ |
96
- |---|:---:|:---:|:---:|:---:|
97
- | AR CoT+Answer | 73.20 | 79.75 | 81.86 | 6.58 |
98
- | **OneVL** | 71.00 | 78.26 | 79.13 | **4.46** |
99
-
100
- OneVL's language auxiliary decoder recovers 97% of explicit CoT quality at answer-only inference speed.
101
 
102
  ---
103
 
@@ -114,7 +72,9 @@ source venv/onevl/bin/activate
114
  pip install -r requirements.txt
115
  ```
116
 
117
- ### Inference (Trajectory Prediction Only)
 
 
118
 
119
  ```bash
120
  python infer_onevl.py \
@@ -127,7 +87,9 @@ python infer_onevl.py \
127
  --max_new_tokens 1024 --answer_prefix "[" --prefix_k 0
128
  ```
129
 
130
- ### Inference with Language + Visual Explanation
 
 
131
 
132
  ```bash
133
  python infer_onevl.py \
@@ -144,26 +106,6 @@ python infer_onevl.py \
144
  --c_thought_visual 4 --max_visual_tokens 2560
145
  ```
146
 
147
- ### Multi-GPU Inference
148
-
149
- ```bash
150
- export MODEL_PATH=/path/to/OneVL-checkpoint
151
- export TEST_SET_PATH=test_data/navsim_test.json
152
- export OUTPUT_PATH=output/navsim/navsim_results.json
153
- bash run_infer.sh
154
- ```
155
-
156
- Per-benchmark scripts are available in `scripts/`:
157
-
158
- ```bash
159
- bash scripts/infer_navsim.sh
160
- bash scripts/infer_ar1.sh
161
- bash scripts/infer_roadwork.sh
162
- bash scripts/infer_impromptu.sh
163
- ```
164
-
165
- For full documentation, evaluation scripts, and data format details, see the [GitHub repository](https://github.com/xiaomi-research/onevl).
166
-
167
  ---
168
 
169
  ## Open-Source Status
@@ -195,4 +137,4 @@ For full documentation, evaluation scripts, and data format details, see the [Gi
195
 
196
  Released under the [Apache 2.0 License](https://www.apache.org/licenses/LICENSE-2.0).
197
 
198
- Model weights are built on [Qwen3-VL-4B-Instruct](https://huggingface.co/Qwen/Qwen3-VL-4B-Instruct) and the visual tokenizer is from [Emu3.5-VisionTokenizer](https://huggingface.co/BAAI/Emu3.5-VisionTokenizer); please refer to their respective licenses as well.
 
1
  ---
2
+ base_model:
3
+ - Qwen/Qwen3-VL-4B-Instruct
4
  language:
5
  - en
6
+ license: apache-2.0
7
+ pipeline_tag: image-text-to-text
8
+ library_name: transformers
9
  tags:
10
  - autonomous-driving
11
  - vision-language-action
12
  - chain-of-thought
13
  - trajectory-prediction
14
  - VLA
 
 
 
15
  ---
16
 
17
  # OneVL: One-Step Latent Reasoning and Planning with Vision-Language Explanation
 
38
 
39
  OneVL augments **Qwen3-VL-4B-Instruct** with three components:
40
 
41
+ - **Latent Token Interface** — 4 visual latent tokens + 2 language latent tokens are inserted in the assistant response before the answer, using existing vocabulary tokens (no new special tokens added).
42
+ - **Visual Auxiliary Decoder** — Predicts future-frame visual tokens at t+0.5s and t+1.0s from visual latent hidden states (using the Emu3.5 IBQ 131k codebook). Acts as a **world model** supervision signal that forces the latent space to encode genuine physical scene dynamics.
43
+ - **Language Auxiliary Decoder** — Reconstructs explicit CoT reasoning text from language latent hidden states, conditioned on ViT visual features.
44
+ - **Prefill Inference** — Both decoders are discarded at inference time. All latent tokens are processed in a single parallel prefill pass; only the trajectory answer is generated autoregressively. This achieves **1.5× speedup over explicit CoT on NAVSIM** and **2.3× on ROADWork**.
 
 
 
 
 
 
 
 
 
 
 
 
45
 
46
  ---
47
 
48
  ## Results
49
 
50
+ ### NAVSIM Performance
51
 
52
  | Method | Model Size | PDM-score ↑ | Latency (s) ↓ | Interpretability |
53
  |---|:---:|:---:|:---:|:---:|
54
  | AR Answer | 4B | 87.47 | 4.49 | — |
55
  | AR CoT+Answer | 4B | 88.29 | 6.58 | Language |
 
 
 
56
  | **OneVL** | **4B** | **88.84** | **4.46** | **Vision + Language** |
57
 
58
+ OneVL is the first latent CoT method to outperform explicit CoT on driving benchmarks while maintaining answer-only latency.
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
59
 
60
  ---
61
 
 
72
  pip install -r requirements.txt
73
  ```
74
 
75
+ ### Inference (Trajectory Prediction)
76
+
77
+ To run inference for trajectory prediction:
78
 
79
  ```bash
80
  python infer_onevl.py \
 
87
  --max_new_tokens 1024 --answer_prefix "[" --prefix_k 0
88
  ```
89
 
90
+ ### Inference with Explanations
91
+
92
+ To generate both language and visual explanations:
93
 
94
  ```bash
95
  python infer_onevl.py \
 
106
  --c_thought_visual 4 --max_visual_tokens 2560
107
  ```
108
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
109
  ---
110
 
111
  ## Open-Source Status
 
137
 
138
  Released under the [Apache 2.0 License](https://www.apache.org/licenses/LICENSE-2.0).
139
 
140
+ Model weights are built on [Qwen3-VL-4B-Instruct](https://huggingface.co/Qwen/Qwen3-VL-4B-Instruct) and the visual tokenizer is from [Emu3.5-VisionTokenizer](https://huggingface.co/BAAI/Emu3.5-VisionTokenizer); please refer to their respective licenses as well.