Add library_name to metadata
Browse filesHi! I'm Niels from the community science team at Hugging Face.
This PR adds `library_name: transformers` to the YAML metadata of the model card. Given that the model uses the `Qwen3VLForConditionalGeneration` architecture and requires `transformers >= 4.57.0`, adding this tag will enable the "Use in Transformers" feature on the Hub, making it easier for researchers to load and use your model.
I've also kept the existing detailed documentation, results, and usage examples.
README.md
CHANGED
|
@@ -1,16 +1,17 @@
|
|
| 1 |
---
|
| 2 |
-
|
|
|
|
| 3 |
language:
|
| 4 |
- en
|
|
|
|
|
|
|
|
|
|
| 5 |
tags:
|
| 6 |
- autonomous-driving
|
| 7 |
- vision-language-action
|
| 8 |
- chain-of-thought
|
| 9 |
- trajectory-prediction
|
| 10 |
- VLA
|
| 11 |
-
base_model:
|
| 12 |
-
- Qwen/Qwen3-VL-4B-Instruct
|
| 13 |
-
pipeline_tag: image-text-to-text
|
| 14 |
---
|
| 15 |
|
| 16 |
# OneVL: One-Step Latent Reasoning and Planning with Vision-Language Explanation
|
|
@@ -37,67 +38,24 @@ OneVL is the **first latent CoT method to surpass explicit autoregressive CoT**
|
|
| 37 |
|
| 38 |
OneVL augments **Qwen3-VL-4B-Instruct** with three components:
|
| 39 |
|
| 40 |
-
**Latent Token Interface** — 4 visual latent tokens + 2 language latent tokens are inserted in the assistant response before the answer, using existing vocabulary tokens (no new special tokens added).
|
| 41 |
-
|
| 42 |
-
**
|
| 43 |
-
|
| 44 |
-
**Language Auxiliary Decoder** — Reconstructs explicit CoT reasoning text from language latent hidden states, conditioned on ViT visual features. Recovers 97% of explicit CoT text quality while running at answer-only speed.
|
| 45 |
-
|
| 46 |
-
**Prefill Inference** — Both decoders are discarded at inference time. All latent tokens are processed in a single parallel prefill pass; only the trajectory answer is generated autoregressively. This achieves **1.5× speedup over explicit CoT on NAVSIM** and **2.3× on ROADWork**.
|
| 47 |
-
|
| 48 |
-
### Three-Stage Training Pipeline
|
| 49 |
-
|
| 50 |
-
Training proceeds in three stages to ensure stable joint optimization:
|
| 51 |
-
- **Stage 0**: Main model warmup (trajectory prediction)
|
| 52 |
-
- **Stage 1**: Auxiliary decoder warmup (language + visual decoders independently)
|
| 53 |
-
- **Stage 2**: Joint end-to-end fine-tuning (all components together)
|
| 54 |
-
|
| 55 |
-
Staged training is essential — ablation shows that skipping it collapses PDM-score from 88.84 to 67.13.
|
| 56 |
|
| 57 |
---
|
| 58 |
|
| 59 |
## Results
|
| 60 |
|
| 61 |
-
### NAVSIM
|
| 62 |
|
| 63 |
| Method | Model Size | PDM-score ↑ | Latency (s) ↓ | Interpretability |
|
| 64 |
|---|:---:|:---:|:---:|:---:|
|
| 65 |
| AR Answer | 4B | 87.47 | 4.49 | — |
|
| 66 |
| AR CoT+Answer | 4B | 88.29 | 6.58 | Language |
|
| 67 |
-
| COCONUT | 4B | 84.84 | 5.93 | — |
|
| 68 |
-
| CODI | 4B | 83.92 | 8.62 | — |
|
| 69 |
-
| SIM-CoT | 4B | 84.21 | 10.86 | Language |
|
| 70 |
| **OneVL** | **4B** | **88.84** | **4.46** | **Vision + Language** |
|
| 71 |
|
| 72 |
-
|
| 73 |
-
|
| 74 |
-
| Method | ADE (px) ↓ | FDE (px) ↓ | Latency (s) ↓ |
|
| 75 |
-
|---|:---:|:---:|:---:|
|
| 76 |
-
| AR CoT+Answer | 13.18 | 29.98 | 10.74 |
|
| 77 |
-
| **OneVL** | **12.49** | **28.80** | **4.71** |
|
| 78 |
-
|
| 79 |
-
### Impromptu
|
| 80 |
-
|
| 81 |
-
| Method | ADE (m) ↓ | FDE (m) ↓ | Latency (s) ↓ |
|
| 82 |
-
|---|:---:|:---:|:---:|
|
| 83 |
-
| AR CoT+Answer | 1.42 | 3.96 | 6.84 |
|
| 84 |
-
| **OneVL** | **1.34** | **3.70** | **4.02** |
|
| 85 |
-
|
| 86 |
-
### APR1 (Alpamayo-R1)
|
| 87 |
-
|
| 88 |
-
| Method | ADE (m) ↓ | FDE (m) ↓ | Latency (s) ↓ |
|
| 89 |
-
|---|:---:|:---:|:---:|
|
| 90 |
-
| AR CoT+Answer | 2.99 | 8.54 | 3.51 |
|
| 91 |
-
| **OneVL** | **2.62** | 7.53 | **3.26** |
|
| 92 |
-
|
| 93 |
-
### CoT Text Quality (NAVSIM)
|
| 94 |
-
|
| 95 |
-
| Method | Meta Action Acc. ↑ | STS Score ↑ | LLM Judge ↑ | Latency (s) ↓ |
|
| 96 |
-
|---|:---:|:---:|:---:|:---:|
|
| 97 |
-
| AR CoT+Answer | 73.20 | 79.75 | 81.86 | 6.58 |
|
| 98 |
-
| **OneVL** | 71.00 | 78.26 | 79.13 | **4.46** |
|
| 99 |
-
|
| 100 |
-
OneVL's language auxiliary decoder recovers 97% of explicit CoT quality at answer-only inference speed.
|
| 101 |
|
| 102 |
---
|
| 103 |
|
|
@@ -114,7 +72,9 @@ source venv/onevl/bin/activate
|
|
| 114 |
pip install -r requirements.txt
|
| 115 |
```
|
| 116 |
|
| 117 |
-
### Inference (Trajectory Prediction
|
|
|
|
|
|
|
| 118 |
|
| 119 |
```bash
|
| 120 |
python infer_onevl.py \
|
|
@@ -127,7 +87,9 @@ python infer_onevl.py \
|
|
| 127 |
--max_new_tokens 1024 --answer_prefix "[" --prefix_k 0
|
| 128 |
```
|
| 129 |
|
| 130 |
-
### Inference with
|
|
|
|
|
|
|
| 131 |
|
| 132 |
```bash
|
| 133 |
python infer_onevl.py \
|
|
@@ -144,26 +106,6 @@ python infer_onevl.py \
|
|
| 144 |
--c_thought_visual 4 --max_visual_tokens 2560
|
| 145 |
```
|
| 146 |
|
| 147 |
-
### Multi-GPU Inference
|
| 148 |
-
|
| 149 |
-
```bash
|
| 150 |
-
export MODEL_PATH=/path/to/OneVL-checkpoint
|
| 151 |
-
export TEST_SET_PATH=test_data/navsim_test.json
|
| 152 |
-
export OUTPUT_PATH=output/navsim/navsim_results.json
|
| 153 |
-
bash run_infer.sh
|
| 154 |
-
```
|
| 155 |
-
|
| 156 |
-
Per-benchmark scripts are available in `scripts/`:
|
| 157 |
-
|
| 158 |
-
```bash
|
| 159 |
-
bash scripts/infer_navsim.sh
|
| 160 |
-
bash scripts/infer_ar1.sh
|
| 161 |
-
bash scripts/infer_roadwork.sh
|
| 162 |
-
bash scripts/infer_impromptu.sh
|
| 163 |
-
```
|
| 164 |
-
|
| 165 |
-
For full documentation, evaluation scripts, and data format details, see the [GitHub repository](https://github.com/xiaomi-research/onevl).
|
| 166 |
-
|
| 167 |
---
|
| 168 |
|
| 169 |
## Open-Source Status
|
|
@@ -195,4 +137,4 @@ For full documentation, evaluation scripts, and data format details, see the [Gi
|
|
| 195 |
|
| 196 |
Released under the [Apache 2.0 License](https://www.apache.org/licenses/LICENSE-2.0).
|
| 197 |
|
| 198 |
-
Model weights are built on [Qwen3-VL-4B-Instruct](https://huggingface.co/Qwen/Qwen3-VL-4B-Instruct) and the visual tokenizer is from [Emu3.5-VisionTokenizer](https://huggingface.co/BAAI/Emu3.5-VisionTokenizer); please refer to their respective licenses as well.
|
|
|
|
| 1 |
---
|
| 2 |
+
base_model:
|
| 3 |
+
- Qwen/Qwen3-VL-4B-Instruct
|
| 4 |
language:
|
| 5 |
- en
|
| 6 |
+
license: apache-2.0
|
| 7 |
+
pipeline_tag: image-text-to-text
|
| 8 |
+
library_name: transformers
|
| 9 |
tags:
|
| 10 |
- autonomous-driving
|
| 11 |
- vision-language-action
|
| 12 |
- chain-of-thought
|
| 13 |
- trajectory-prediction
|
| 14 |
- VLA
|
|
|
|
|
|
|
|
|
|
| 15 |
---
|
| 16 |
|
| 17 |
# OneVL: One-Step Latent Reasoning and Planning with Vision-Language Explanation
|
|
|
|
| 38 |
|
| 39 |
OneVL augments **Qwen3-VL-4B-Instruct** with three components:
|
| 40 |
|
| 41 |
+
- **Latent Token Interface** — 4 visual latent tokens + 2 language latent tokens are inserted in the assistant response before the answer, using existing vocabulary tokens (no new special tokens added).
|
| 42 |
+
- **Visual Auxiliary Decoder** — Predicts future-frame visual tokens at t+0.5s and t+1.0s from visual latent hidden states (using the Emu3.5 IBQ 131k codebook). Acts as a **world model** supervision signal that forces the latent space to encode genuine physical scene dynamics.
|
| 43 |
+
- **Language Auxiliary Decoder** — Reconstructs explicit CoT reasoning text from language latent hidden states, conditioned on ViT visual features.
|
| 44 |
+
- **Prefill Inference** — Both decoders are discarded at inference time. All latent tokens are processed in a single parallel prefill pass; only the trajectory answer is generated autoregressively. This achieves **1.5× speedup over explicit CoT on NAVSIM** and **2.3× on ROADWork**.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 45 |
|
| 46 |
---
|
| 47 |
|
| 48 |
## Results
|
| 49 |
|
| 50 |
+
### NAVSIM Performance
|
| 51 |
|
| 52 |
| Method | Model Size | PDM-score ↑ | Latency (s) ↓ | Interpretability |
|
| 53 |
|---|:---:|:---:|:---:|:---:|
|
| 54 |
| AR Answer | 4B | 87.47 | 4.49 | — |
|
| 55 |
| AR CoT+Answer | 4B | 88.29 | 6.58 | Language |
|
|
|
|
|
|
|
|
|
|
| 56 |
| **OneVL** | **4B** | **88.84** | **4.46** | **Vision + Language** |
|
| 57 |
|
| 58 |
+
OneVL is the first latent CoT method to outperform explicit CoT on driving benchmarks while maintaining answer-only latency.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 59 |
|
| 60 |
---
|
| 61 |
|
|
|
|
| 72 |
pip install -r requirements.txt
|
| 73 |
```
|
| 74 |
|
| 75 |
+
### Inference (Trajectory Prediction)
|
| 76 |
+
|
| 77 |
+
To run inference for trajectory prediction:
|
| 78 |
|
| 79 |
```bash
|
| 80 |
python infer_onevl.py \
|
|
|
|
| 87 |
--max_new_tokens 1024 --answer_prefix "[" --prefix_k 0
|
| 88 |
```
|
| 89 |
|
| 90 |
+
### Inference with Explanations
|
| 91 |
+
|
| 92 |
+
To generate both language and visual explanations:
|
| 93 |
|
| 94 |
```bash
|
| 95 |
python infer_onevl.py \
|
|
|
|
| 106 |
--c_thought_visual 4 --max_visual_tokens 2560
|
| 107 |
```
|
| 108 |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 109 |
---
|
| 110 |
|
| 111 |
## Open-Source Status
|
|
|
|
| 137 |
|
| 138 |
Released under the [Apache 2.0 License](https://www.apache.org/licenses/LICENSE-2.0).
|
| 139 |
|
| 140 |
+
Model weights are built on [Qwen3-VL-4B-Instruct](https://huggingface.co/Qwen/Qwen3-VL-4B-Instruct) and the visual tokenizer is from [Emu3.5-VisionTokenizer](https://huggingface.co/BAAI/Emu3.5-VisionTokenizer); please refer to their respective licenses as well.
|