luotingdan committed

Commit 303e8de · 1 Parent(s): 825bb1a

add generation config and update Readme

Files changed (2):

1. README.md +65 -7
2. generation_config.json +10 -0
README.md CHANGED
@@ -1,7 +1,7 @@
---
license: apache-2.0
base_model:
- - stepfun-ai/Step3-VL-10B-Base
+ - stepfun-ai/Step3-VL-10B-Base
pipeline_tag: image-text-to-text
---
@@ -19,6 +19,13 @@ pipeline_tag: image-text-to-text

</div>

+ ## 📢 News & Updates
+
+ - 🚀 **Online Demo**: Explore Step3-VL-10B on [Hugging Face Spaces](https://huggingface.co/spaces/stepfun-ai/Step3-VL-10B)!
+ - 📢 **[Notice] vLLM Support:** vLLM integration is now officially supported! (PR [#32329](https://github.com/vllm-project/vllm/pull/32329))
+ - ✅ **[Fixed] HF Inference:** Resolved the `eos_token_id` misconfiguration in `config.json` that caused infinite generation loops. (commit [abdf361](https://huggingface.co/stepfun-ai/Step3-VL-10B/commit/abdf3618e914a9e3de0ad74efacc8b7a10f06c10))
+ - ✅ **[In Progress] Metric Correction:** We sincerely apologize for inaccuracies in the Qwen3VL-8B benchmarks (e.g., AIME, HMMT, LCB). The errors were caused by an incorrect `max_tokens` setting (mistakenly set to 32k) during our large-scale evaluation. We are re-running the tests and will provide corrected numbers in the next version of the technical report.
+
## 🚀 Introduction

**STEP3-VL-10B** is a lightweight open-source foundation model designed to redefine the trade-off between compact efficiency and frontier-level multimodal intelligence. Despite its compact **10B parameter footprint**, STEP3-VL-10B excels in **visual perception**, **complex reasoning**, and **human-centric alignment**. It consistently outperforms models under the 10B scale and rivals or surpasses significantly larger open-weight models (**10×–20× its size**) such as GLM-4.6V (106B-A12B) and Qwen3-VL-Thinking (235B-A22B), as well as top-tier proprietary flagships like Gemini 2.5 Pro and Seed-1.5-VL.
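For older snapshots that still carry the `eos_token_id` misconfiguration mentioned in the news item above, a minimal workaround sketch is to pass the corrected stop tokens explicitly at generation time. The token IDs below come from the `generation_config.json` added in this commit; the `model` and `inputs` objects are assumed to be set up as in the README's transformers example:

```python
# Workaround sketch for the infinite-generation issue on stale snapshots:
# override the stop tokens instead of relying on a broken config.json.
# `model` and `inputs` are assumed from the README's transformers example.
from transformers import GenerationConfig

gen_config = GenerationConfig(
    temperature=1.0,
    top_p=1.0,
    top_k=0,
    eos_token_id=[151643, 151645, 151679],  # IDs from this commit's generation_config.json
)
generate_ids = model.generate(**inputs, generation_config=gen_config)
```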
@@ -48,8 +55,8 @@ STEP3-VL-10B delivers best-in-class performance across major multimodal benchmarks

| Benchmark | STEP3-VL-10B (SeRe) | STEP3-VL-10B (PaCoRe) | GLM-4.6V (106B-A12B) | Qwen3-VL (235B-A22B) | Gemini-2.5-Pro | Seed-1.5-VL |
| :---------------- | :-----------------: | :-------------------: | :------------------: | :------------------: | :------------: | :---------: |
- | **MMMU** | 78.11 | 80.11 | 75.20 | 78.70 | **83.89** | 79.11 |
- | **MathVista** | 83.97 | 85.50 | 83.51 | 85.10 | 83.88 | **85.60** |
+ | **MMMU** | 78.11 | 80.11 | 75.20 | 78.70 | **83.89** | 79.11 |
+ | **MathVista** | 83.97 | 85.50 | 83.51 | 85.10 | 83.88 | **85.60** |
| **MathVision** | 70.81 | **75.95** | 63.50 | 72.10 | 73.30 | 68.70 |
| **MMBench (EN)** | 92.05 | 92.38 | 92.75 | 92.70 | **93.19** | 92.11 |
| **MMStar** | 77.48 | 77.64 | 75.30 | 76.80 | **79.18** | 77.91 |
@@ -121,7 +128,7 @@ STEP3-VL-10B delivers best-in-class performance across major multimodal benchmarks

### Inference with Hugging Face Transformers

- This section describes how to run the model at inference time with the transformers library. We recommend python==3.10, torch>=2.1.0, and transformers==4.57.0 as the development environment. We currently only support bf16 inference, and multi-patch image preprocessing is enabled by default; this behavior is aligned with vllm and sglang.
+ This section describes how to run the model at inference time with the transformers library. We recommend python==3.10, torch>=2.1.0, and transformers==4.57.0 as the development environment. We currently only support bf16 inference, and multi-patch image preprocessing is enabled by default; this behavior is aligned with vllm.

**Note:** If you experience infinite generation issues, please check [Discussion #9](https://huggingface.co/stepfun-ai/Step3-VL-10B/discussions/9) for the fix.
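The diff shows only the tail of the README's transformers example (the `processor.decode` call in the next hunk's context). For orientation, here is a minimal bf16 inference sketch; the use of `AutoProcessor`/`AutoModelForCausalLM` with `trust_remote_code=True` and the chat-message schema are assumptions based on common remote-code VLM repos, not a verbatim copy of the README:

```python
# A minimal sketch, not the README's full example: model class and
# chat-template fields are assumptions for a trust_remote_code VLM.
import torch
from transformers import AutoModelForCausalLM, AutoProcessor

model_id = "stepfun-ai/Step3-VL-10B"
processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto", trust_remote_code=True
)

messages = [{"role": "user", "content": [
    {"type": "image", "url": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/bee.jpg"},
    {"type": "text", "text": "What's in this picture?"},
]}]
inputs = processor.apply_chat_template(
    messages, add_generation_prompt=True, tokenize=True,
    return_dict=True, return_tensors="pt",
).to(model.device)

generate_ids = model.generate(**inputs, max_new_tokens=512)
decoded = processor.decode(
    generate_ids[0, inputs["input_ids"].shape[-1]:], skip_special_tokens=True
)
print(decoded)
```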
@@ -169,6 +176,57 @@ decoded = processor.decode(generate_ids[0, inputs["input_ids"].shape[-1] :], skip_special_tokens=True)
print(decoded)
```

+ ## 🚀 Deployment with vLLM (OpenAI-compatible API)
+
+ For deployment, you can use vLLM to create an OpenAI-compatible API endpoint.
+
+ 1. Install the vLLM nightly build:
+
+ ```bash
+ pip install vllm --pre --extra-index-url https://wheels.vllm.ai/nightly
+ ```
+
+ **Requirements:** Python >= 3.10 and vLLM >= 0.14.0rc2.dev143+gc0a350ca7.
+
+ > **Note:** The official vLLM nightly Docker image is pending release. For now, please install from the nightly wheel index as shown above.
+
+ 2. Launch the server:
+
+ ```bash
+ vllm serve stepfun-ai/Step3-VL-10B -tp 1 --reasoning-parser deepseek_r1 --enable-auto-tool-choice --tool-call-parser hermes --trust-remote-code
+ ```
+
+ **Crucial step:** You must append the `--trust-remote-code` flag to your deployment command. This is mandatory for models that use custom code for their architecture.
+
+ 3. Call the endpoint using any OpenAI-compatible SDK (example in Python):
+
+ ```python
+ from openai import OpenAI
+
+ client = OpenAI(base_url="http://localhost:8000/v1", api_key="dummy")
+
+ resp = client.chat.completions.create(
+     model="stepfun-ai/Step3-VL-10B",
+     messages=[{
+         "role": "user",
+         "content": [
+             {
+                 "type": "image_url",
+                 "image_url": {
+                     "url": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/bee.jpg"
+                 },
+             },
+             {"type": "text", "text": "what's in this picture?"},
+         ],
+     }],
+ )
+
+ print(resp.choices[0].message.content)
+ ```
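As a usage note, the same endpoint also supports streaming through the standard OpenAI SDK interface; a small sketch, reusing the `client` created in the example above, might look like:

```python
# Streaming sketch: token deltas arrive incrementally over the same
# OpenAI-compatible endpoint; `client` is the instance created above.
stream = client.chat.completions.create(
    model="stepfun-ai/Step3-VL-10B",
    messages=[{"role": "user", "content": "Describe a honeybee in one sentence."}],
    stream=True,
)
for chunk in stream:
    delta = chunk.choices[0].delta.content
    if delta:
        print(delta, end="", flush=True)
print()
```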

## 📜 Citation

@@ -176,16 +234,16 @@ If you find this project useful in your research, please cite our technical report:

```tex
@misc{huang2026step3vl10btechnicalreport,
- title={STEP3-VL-10B Technical Report},
+ title={STEP3-VL-10B Technical Report},
author={Ailin Huang and Chengyuan Yao and Chunrui Han and Fanqi Wan and Hangyu Guo and Haoran Lv and Hongyu Zhou and Jia Wang and Jian Zhou and Jianjian Sun and Jingcheng Hu and Kangheng Lin and Liang Zhao and Mitt Huang and Song Yuan and Wenwen Qu and Xiangfeng Wang and Yanlin Lai and Yingxiu Zhao and Yinmin Zhang and Yukang Shi and Yuyang Chen and Zejia Weng and Ziyang Meng and Ang Li and Aobo Kong and Bo Dong and Changyi Wan and David Wang and Di Qi and Dingming Li and En Yu and Guopeng Li and Haiquan Yin and Han Zhou and Hanshan Zhang and Haolong Yan and Hebin Zhou and Hongbo Peng and Jiaran Zhang and Jiashu Lv and Jiayi Fu and Jie Cheng and Jie Zhou and Jisheng Yin and Jingjing Xie and Jingwei Wu and Jun Zhang and Junfeng Liu and Kaijun Tan and Kaiwen Yan and Liangyu Chen and Lina Chen and Mingliang Li and Qian Zhao and Quan Sun and Shaoliang Pang and Shengjie Fan and Shijie Shang and Siyuan Zhang and Tianhao You and Wei Ji and Wuxun Xie and Xiaobo Yang and Xiaojie Hou and Xiaoran Jiao and Xiaoxiao Ren and Xiangwen Kong and Xin Huang and Xin Wu and Xing Chen and Xinran Wang and Xuelin Zhang and Yana Wei and Yang Li and Yanming Xu and Yeqing Shen and Yuang Peng and Yue Peng and Yu Zhou and Yusheng Li and Yuxiang Yang and Yuyang Zhang and Zhe Xie and Zhewei Huang and Zhenyi Lu and Zhimin Fan and Zihui Cheng and Daxin Jiang and Qi Han and Xiangyu Zhang and Yibo Zhu and Zheng Ge},
year={2026},
eprint={2601.09668},
archivePrefix={arXiv},
primaryClass={cs.CV},
- url={https://arxiv.org/abs/2601.09668},
+ url={https://arxiv.org/abs/2601.09668},
}
```

## 📄 License

- This project is open-sourced under the [Apache 2.0 License](LICENSE).
+ This project is open-sourced under the [Apache 2.0 License](LICENSE).
generation_config.json ADDED
@@ -0,0 +1,10 @@
+ {
+   "temperature": 1.0,
+   "top_p": 1.0,
+   "top_k": 0,
+   "eos_token_id": [
+     151643,
+     151645,
+     151679
+   ]
+ }
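Once this file lands in the repo, transformers resolves it automatically when loading the model. A small sketch to confirm the sampling defaults and stop tokens that will be picked up (expected values are the ones in the file above):

```python
# Sketch: confirm the defaults transformers will load from the repo's
# generation_config.json (values expected to match the file above).
from transformers import GenerationConfig

gen_config = GenerationConfig.from_pretrained("stepfun-ai/Step3-VL-10B")
print(gen_config.temperature, gen_config.top_p, gen_config.top_k)  # 1.0 1.0 0
print(gen_config.eos_token_id)  # [151643, 151645, 151679]
```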