Add library_name to metadata and improve model card links

#1
by nielsr HF Staff - opened
Files changed (1)
  1. README.md +7 -33
README.md CHANGED
@@ -1,16 +1,17 @@
  ---
- license: apache-2.0
  language:
  - en
  tags:
  - autonomous-driving
  - vision-language-action
  - chain-of-thought
  - trajectory-prediction
  - VLA
- base_model:
- - Qwen/Qwen3-VL-4B-Instruct
- pipeline_tag: image-text-to-text
  ---

  # OneVL: One-Step Latent Reasoning and Planning with Vision-Language Explanation
@@ -90,15 +91,6 @@ Staged training is essential — ablation shows that skipping it collapses PDM-s
  | AR CoT+Answer | 2.99 | 8.54 | 3.51 |
  | **OneVL** | **2.62** | 7.53 | **3.26** |

- ### CoT Text Quality (NAVSIM)
-
- | Method | Meta Action Acc. ↑ | STS Score ↑ | LLM Judge ↑ | Latency (s) ↓ |
- |---|:---:|:---:|:---:|:---:|
- | AR CoT+Answer | 73.20 | 79.75 | 81.86 | 6.58 |
- | **OneVL** | 71.00 | 78.26 | 79.13 | **4.46** |
-
- OneVL's language auxiliary decoder recovers 97% of explicit CoT quality at answer-only inference speed.
-
  ---

  ## Usage
@@ -144,25 +136,7 @@ python infer_onevl.py \
  --c_thought_visual 4 --max_visual_tokens 2560
  ```

- ### Multi-GPU Inference
-
- ```bash
- export MODEL_PATH=/path/to/OneVL-checkpoint
- export TEST_SET_PATH=test_data/navsim_test.json
- export OUTPUT_PATH=output/navsim/navsim_results.json
- bash run_infer.sh
- ```
-
- Per-benchmark scripts are available in `scripts/`:
-
- ```bash
- bash scripts/infer_navsim.sh
- bash scripts/infer_ar1.sh
- bash scripts/infer_roadwork.sh
- bash scripts/infer_impromptu.sh
- ```
-
- For full documentation, evaluation scripts, and data format details, see the [GitHub repository](https://github.com/xiaomi-research/onevl).

  ---

@@ -195,4 +169,4 @@ For full documentation, evaluation scripts, and data format details, see the [Gi

  Released under the [Apache 2.0 License](https://www.apache.org/licenses/LICENSE-2.0).

- Model weights are built on [Qwen3-VL-4B-Instruct](https://huggingface.co/Qwen/Qwen3-VL-4B-Instruct) and the visual tokenizer is from [Emu3.5-VisionTokenizer](https://huggingface.co/BAAI/Emu3.5-VisionTokenizer); please refer to their respective licenses as well.
 
  ---
+ base_model:
+ - Qwen/Qwen3-VL-4B-Instruct
  language:
  - en
+ license: apache-2.0
+ pipeline_tag: image-text-to-text
+ library_name: transformers
  tags:
  - autonomous-driving
  - vision-language-action
  - chain-of-thought
  - trajectory-prediction
  - VLA
  ---

  # OneVL: One-Step Latent Reasoning and Planning with Vision-Language Explanation
 
  | AR CoT+Answer | 2.99 | 8.54 | 3.51 |
  | **OneVL** | **2.62** | 7.53 | **3.26** |

  ---

  ## Usage
 
  --c_thought_visual 4 --max_visual_tokens 2560
  ```

+ For full documentation, evaluation scripts, and data format details, see the [official GitHub repository](https://github.com/xiaomi-research/onevl).

  ---

 

  Released under the [Apache 2.0 License](https://www.apache.org/licenses/LICENSE-2.0).

+ Model weights are built on [Qwen3-VL-4B-Instruct](https://huggingface.co/Qwen/Qwen3-VL-4B-Instruct) and the visual tokenizer is from [Emu3.5-VisionTokenizer](https://huggingface.co/BAAI/Emu3.5-VisionTokenizer); please refer to their respective licenses as well.
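
As a quick sanity check of the metadata change, the sketch below parses the updated front matter from the first hunk and confirms that the new `library_name` field is present alongside the reordered keys. The helper `parse_front_matter` is hypothetical (it is not part of this repository or of any Hugging Face library) and only handles the flat scalar and `- item` list shapes used in this card; a real check would more likely rely on a full YAML parser.

```python
def parse_front_matter(text):
    """Parse a flat front-matter block: 'key: value' scalars and '- item' lists only.

    Hypothetical helper for illustration; not a general YAML parser.
    """
    lines = text.strip().splitlines()
    if not lines or lines[0].strip() != "---":
        raise ValueError("front matter must start with '---'")

    meta, key = {}, None
    for line in lines[1:]:
        stripped = line.strip()
        if stripped == "---":  # closing delimiter reached
            return meta
        if not stripped:
            continue
        if stripped.startswith("- "):
            # A list item belongs to the most recent key that had no inline value.
            if key is None or not isinstance(meta[key], list):
                raise ValueError(f"unexpected list item: {stripped!r}")
            meta[key].append(stripped[2:].strip())
        else:
            key, _, value = stripped.partition(":")
            key, value = key.strip(), value.strip()
            meta[key] = value if value else []
    raise ValueError("front matter is not closed with '---'")


# Front matter as it appears on the '+' side of the first hunk.
NEW_FRONT_MATTER = """---
base_model:
- Qwen/Qwen3-VL-4B-Instruct
language:
- en
license: apache-2.0
pipeline_tag: image-text-to-text
library_name: transformers
tags:
- autonomous-driving
- vision-language-action
- chain-of-thought
- trajectory-prediction
- VLA
---
"""

meta = parse_front_matter(NEW_FRONT_MATTER)
print(meta["library_name"], meta["pipeline_tag"], meta["base_model"])
```

The same function can be run on the old front matter to confirm that only `library_name` is new and that the other keys were merely reordered.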