Add library_name to metadata and improve model card links
Browse filesHi! I'm Niels from the community science team at Hugging Face.
This PR improves the model card by adding `library_name: transformers` to the YAML metadata. Based on the model configuration and the provided documentation, this model is compatible with the `transformers` library (version 4.57.0 or higher). Adding this metadata enables the "Use in Transformers" button and automated code snippets on the Hub.
The rest of the model card remains highly detailed, providing excellent documentation for the paper, architecture, and usage.
README.md
CHANGED
|
@@ -1,16 +1,17 @@
|
|
| 1 |
---
|
| 2 |
-
|
|
|
|
| 3 |
language:
|
| 4 |
- en
|
|
|
|
|
|
|
|
|
|
| 5 |
tags:
|
| 6 |
- autonomous-driving
|
| 7 |
- vision-language-action
|
| 8 |
- chain-of-thought
|
| 9 |
- trajectory-prediction
|
| 10 |
- VLA
|
| 11 |
-
base_model:
|
| 12 |
-
- Qwen/Qwen3-VL-4B-Instruct
|
| 13 |
-
pipeline_tag: image-text-to-text
|
| 14 |
---
|
| 15 |
|
| 16 |
# OneVL: One-Step Latent Reasoning and Planning with Vision-Language Explanation
|
|
@@ -90,15 +91,6 @@ Staged training is essential — ablation shows that skipping it collapses PDM-s
|
|
| 90 |
| AR CoT+Answer | 2.99 | 8.54 | 3.51 |
|
| 91 |
| **OneVL** | **2.62** | 7.53 | **3.26** |
|
| 92 |
|
| 93 |
-
### CoT Text Quality (NAVSIM)
|
| 94 |
-
|
| 95 |
-
| Method | Meta Action Acc. ↑ | STS Score ↑ | LLM Judge ↑ | Latency (s) ↓ |
|
| 96 |
-
|---|:---:|:---:|:---:|:---:|
|
| 97 |
-
| AR CoT+Answer | 73.20 | 79.75 | 81.86 | 6.58 |
|
| 98 |
-
| **OneVL** | 71.00 | 78.26 | 79.13 | **4.46** |
|
| 99 |
-
|
| 100 |
-
OneVL's language auxiliary decoder recovers 97% of explicit CoT quality at answer-only inference speed.
|
| 101 |
-
|
| 102 |
---
|
| 103 |
|
| 104 |
## Usage
|
|
@@ -144,25 +136,7 @@ python infer_onevl.py \
|
|
| 144 |
--c_thought_visual 4 --max_visual_tokens 2560
|
| 145 |
```
|
| 146 |
|
| 147 |
-
|
| 148 |
-
|
| 149 |
-
```bash
|
| 150 |
-
export MODEL_PATH=/path/to/OneVL-checkpoint
|
| 151 |
-
export TEST_SET_PATH=test_data/navsim_test.json
|
| 152 |
-
export OUTPUT_PATH=output/navsim/navsim_results.json
|
| 153 |
-
bash run_infer.sh
|
| 154 |
-
```
|
| 155 |
-
|
| 156 |
-
Per-benchmark scripts are available in `scripts/`:
|
| 157 |
-
|
| 158 |
-
```bash
|
| 159 |
-
bash scripts/infer_navsim.sh
|
| 160 |
-
bash scripts/infer_ar1.sh
|
| 161 |
-
bash scripts/infer_roadwork.sh
|
| 162 |
-
bash scripts/infer_impromptu.sh
|
| 163 |
-
```
|
| 164 |
-
|
| 165 |
-
For full documentation, evaluation scripts, and data format details, see the [GitHub repository](https://github.com/xiaomi-research/onevl).
|
| 166 |
|
| 167 |
---
|
| 168 |
|
|
@@ -195,4 +169,4 @@ For full documentation, evaluation scripts, and data format details, see the [Gi
|
|
| 195 |
|
| 196 |
Released under the [Apache 2.0 License](https://www.apache.org/licenses/LICENSE-2.0).
|
| 197 |
|
| 198 |
-
Model weights are built on [Qwen3-VL-4B-Instruct](https://huggingface.co/Qwen/Qwen3-VL-4B-Instruct) and the visual tokenizer is from [Emu3.5-VisionTokenizer](https://huggingface.co/BAAI/Emu3.5-VisionTokenizer); please refer to their respective licenses as well.
|
|
|
|
| 1 |
---
|
| 2 |
+
base_model:
|
| 3 |
+
- Qwen/Qwen3-VL-4B-Instruct
|
| 4 |
language:
|
| 5 |
- en
|
| 6 |
+
license: apache-2.0
|
| 7 |
+
pipeline_tag: image-text-to-text
|
| 8 |
+
library_name: transformers
|
| 9 |
tags:
|
| 10 |
- autonomous-driving
|
| 11 |
- vision-language-action
|
| 12 |
- chain-of-thought
|
| 13 |
- trajectory-prediction
|
| 14 |
- VLA
|
|
|
|
|
|
|
|
|
|
| 15 |
---
|
| 16 |
|
| 17 |
# OneVL: One-Step Latent Reasoning and Planning with Vision-Language Explanation
|
|
|
|
| 91 |
| AR CoT+Answer | 2.99 | 8.54 | 3.51 |
|
| 92 |
| **OneVL** | **2.62** | 7.53 | **3.26** |
|
| 93 |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 94 |
---
|
| 95 |
|
| 96 |
## Usage
|
|
|
|
| 136 |
--c_thought_visual 4 --max_visual_tokens 2560
|
| 137 |
```
|
| 138 |
|
| 139 |
+
For full documentation, evaluation scripts, and data format details, see the [official GitHub repository](https://github.com/xiaomi-research/onevl).
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 140 |
|
| 141 |
---
|
| 142 |
|
|
|
|
| 169 |
|
| 170 |
Released under the [Apache 2.0 License](https://www.apache.org/licenses/LICENSE-2.0).
|
| 171 |
|
| 172 |
+
Model weights are built on [Qwen3-VL-4B-Instruct](https://huggingface.co/Qwen/Qwen3-VL-4B-Instruct) and the visual tokenizer is from [Emu3.5-VisionTokenizer](https://huggingface.co/BAAI/Emu3.5-VisionTokenizer); please refer to their respective licenses as well.
|