luotingdan
committed on
Commit · 303e8de
1 Parent(s): 825bb1a
add generation config and update Readme
Browse files
- README.md +65 -7
- generation_config.json +10 -0
README.md
CHANGED
@@ -1,7 +1,7 @@
---
license: apache-2.0
base_model:
-  - stepfun-ai/Step3-VL-10B-Base
pipeline_tag: image-text-to-text
---
@@ -19,6 +19,13 @@ pipeline_tag: image-text-to-text

</div>

## 📖 Introduction

**STEP3-VL-10B** is a lightweight open-source foundation model designed to redefine the trade-off between compact efficiency and frontier-level multimodal intelligence. Despite its compact **10B-parameter footprint**, STEP3-VL-10B excels in **visual perception**, **complex reasoning**, and **human-centric alignment**. It consistently outperforms models at or under the 10B scale and rivals or surpasses significantly larger open-weight models (**10×–20× its size**) such as GLM-4.6V (106B-A12B) and Qwen3-VL-Thinking (235B-A22B), as well as top-tier proprietary flagships like Gemini 2.5 Pro and Seed-1.5-VL.

@@ -48,8 +55,8 @@ STEP3-VL-10B delivers best-in-class performance across major multimodal benchmar

| Benchmark | STEP3-VL-10B (SeRe) | STEP3-VL-10B (PaCoRe) | GLM-4.6V (106B-A12B) | Qwen3-VL (235B-A22B) | Gemini-2.5-Pro | Seed-1.5-VL |
| :---------------- | :-----------------: | :-------------------: | :------------------: | :------------------: | :------------: | :---------: |
-| **MMMU**
-| **MathVista**
| **MathVision** | 70.81 | **75.95** | 63.50 | 72.10 | 73.30 | 68.70 |
| **MMBench (EN)** | 92.05 | 92.38 | 92.75 | 92.70 | **93.19** | 92.11 |
| **MMStar** | 77.48 | 77.64 | 75.30 | 76.80 | **79.18** | 77.91 |

@@ -121,7 +128,7 @@ STEP3-VL-10B delivers best-in-class performance across major multimodal benchmar

### Inference with Hugging Face Transformers

-We describe how to use our model at the inference stage with the transformers library. We recommend python=3.10, torch>=2.1.0, and transformers==4.57.0 as the development environment. We currently support only bf16 inference, and multi-patch image preprocessing is enabled by default; this behavior is aligned with vLLM

**Note:** If you experience infinite generation issues, please check [Discussion #9](https://huggingface.co/stepfun-ai/Step3-VL-10B/discussions/9) for the fix.

@@ -169,6 +176,57 @@ decoded = processor.decode(generate_ids[0, inputs["input_ids"].shape[-1] :], ski
print(decoded)
```

## 📝 Citation

@@ -176,16 +234,16 @@ If you find this project useful in your research, please cite our technical repo

```tex
@misc{huang2026step3vl10btechnicalreport,
-title={STEP3-VL-10B Technical Report},
author={Ailin Huang and Chengyuan Yao and Chunrui Han and Fanqi Wan and Hangyu Guo and Haoran Lv and Hongyu Zhou and Jia Wang and Jian Zhou and Jianjian Sun and Jingcheng Hu and Kangheng Lin and Liang Zhao and Mitt Huang and Song Yuan and Wenwen Qu and Xiangfeng Wang and Yanlin Lai and Yingxiu Zhao and Yinmin Zhang and Yukang Shi and Yuyang Chen and Zejia Weng and Ziyang Meng and Ang Li and Aobo Kong and Bo Dong and Changyi Wan and David Wang and Di Qi and Dingming Li and En Yu and Guopeng Li and Haiquan Yin and Han Zhou and Hanshan Zhang and Haolong Yan and Hebin Zhou and Hongbo Peng and Jiaran Zhang and Jiashu Lv and Jiayi Fu and Jie Cheng and Jie Zhou and Jisheng Yin and Jingjing Xie and Jingwei Wu and Jun Zhang and Junfeng Liu and Kaijun Tan and Kaiwen Yan and Liangyu Chen and Lina Chen and Mingliang Li and Qian Zhao and Quan Sun and Shaoliang Pang and Shengjie Fan and Shijie Shang and Siyuan Zhang and Tianhao You and Wei Ji and Wuxun Xie and Xiaobo Yang and Xiaojie Hou and Xiaoran Jiao and Xiaoxiao Ren and Xiangwen Kong and Xin Huang and Xin Wu and Xing Chen and Xinran Wang and Xuelin Zhang and Yana Wei and Yang Li and Yanming Xu and Yeqing Shen and Yuang Peng and Yue Peng and Yu Zhou and Yusheng Li and Yuxiang Yang and Yuyang Zhang and Zhe Xie and Zhewei Huang and Zhenyi Lu and Zhimin Fan and Zihui Cheng and Daxin Jiang and Qi Han and Xiangyu Zhang and Yibo Zhu and Zheng Ge},
year={2026},
eprint={2601.09668},
archivePrefix={arXiv},
primaryClass={cs.CV},
-url={https://arxiv.org/abs/2601.09668},
}
```

## 📄 License

-This project is open-sourced under the [Apache 2.0 License](LICENSE).

---
license: apache-2.0
base_model:
+  - stepfun-ai/Step3-VL-10B-Base
pipeline_tag: image-text-to-text
---

</div>

+## 📢 News & Updates
+
+- 🌐 **Online Demo**: Explore Step3-VL-10B on [Hugging Face Spaces](https://huggingface.co/spaces/stepfun-ai/Step3-VL-10B)!
+- 📢 **[Notice] vLLM Support:** vLLM integration is now officially supported! (PR [#32329](https://github.com/vllm-project/vllm/pull/32329))
+- ✅ **[Fixed] HF Inference:** Resolved the `eos_token_id` misconfiguration in `config.json` that caused infinite generation loops. (commit [abdf3](https://huggingface.co/stepfun-ai/Step3-VL-10B/commit/abdf3618e914a9e3de0ad74efacc8b7a10f06c10))
+- ✅ **[Fixing] Metric Correction:** We sincerely apologize for inaccuracies in the Qwen3VL-8B benchmark results (e.g., AIME, HMMT, LCB). The errors were caused by an incorrect max_tokens setting (mistakenly set to 32k) during our large-scale evaluation. We are re-running the tests and will provide corrected numbers in the next version of the technical report.
+
## 📖 Introduction

**STEP3-VL-10B** is a lightweight open-source foundation model designed to redefine the trade-off between compact efficiency and frontier-level multimodal intelligence. Despite its compact **10B-parameter footprint**, STEP3-VL-10B excels in **visual perception**, **complex reasoning**, and **human-centric alignment**. It consistently outperforms models at or under the 10B scale and rivals or surpasses significantly larger open-weight models (**10×–20× its size**) such as GLM-4.6V (106B-A12B) and Qwen3-VL-Thinking (235B-A22B), as well as top-tier proprietary flagships like Gemini 2.5 Pro and Seed-1.5-VL.

| Benchmark | STEP3-VL-10B (SeRe) | STEP3-VL-10B (PaCoRe) | GLM-4.6V (106B-A12B) | Qwen3-VL (235B-A22B) | Gemini-2.5-Pro | Seed-1.5-VL |
| :---------------- | :-----------------: | :-------------------: | :------------------: | :------------------: | :------------: | :---------: |
+| **MMMU** | 78.11 | 80.11 | 75.20 | 78.70 | **83.89** | 79.11 |
+| **MathVista** | 83.97 | 85.50 | 83.51 | 85.10 | 83.88 | **85.60** |
| **MathVision** | 70.81 | **75.95** | 63.50 | 72.10 | 73.30 | 68.70 |
| **MMBench (EN)** | 92.05 | 92.38 | 92.75 | 92.70 | **93.19** | 92.11 |
| **MMStar** | 77.48 | 77.64 | 75.30 | 76.80 | **79.18** | 77.91 |


### Inference with Hugging Face Transformers

+We describe how to use our model at the inference stage with the transformers library. We recommend python=3.10, torch>=2.1.0, and transformers==4.57.0 as the development environment. We currently support only bf16 inference, and multi-patch image preprocessing is enabled by default; this behavior is aligned with vLLM.

**Note:** If you experience infinite generation issues, please check [Discussion #9](https://huggingface.co/stepfun-ai/Step3-VL-10B/discussions/9) for the fix.

print(decoded)
```

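The body of the Transformers snippet is elided by the diff context above (only its final lines are shown). For reference, here is a minimal bf16 sketch consistent with that visible tail; the `AutoProcessor`/`AutoModelForCausalLM` loading path, the sample image URL, and the prompt are illustrative assumptions, not the README's verbatim example.

```python
# Minimal sketch, assuming standard trust_remote_code loading; the chat-template
# call is illustrative and should be checked against the full README example.
import torch
from transformers import AutoModelForCausalLM, AutoProcessor

model_id = "stepfun-ai/Step3-VL-10B"
processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,  # only bf16 inference is supported
    device_map="auto",
    trust_remote_code=True,
)

# Multi-patch image preprocessing is applied by the processor by default.
messages = [{
    "role": "user",
    "content": [
        {"type": "image",
         "url": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/bee.jpg"},
        {"type": "text", "text": "What's in this picture?"},
    ],
}]
inputs = processor.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device)

generate_ids = model.generate(**inputs, max_new_tokens=512)
decoded = processor.decode(
    generate_ids[0, inputs["input_ids"].shape[-1]:], skip_special_tokens=True
)
print(decoded)
```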
+## 🚀 Deployment with vLLM (OpenAI-compatible API)
+
+For deployment, you can use vLLM to create an OpenAI-compatible API endpoint.
+
+1. Install vLLM nightly:
+
+```bash
+pip install vllm --pre --extra-index-url https://wheels.vllm.ai/nightly
+```
+
+**Requirements:** Python ≥ 3.10 is required. Please ensure vLLM ≥ 0.14.0rc2.dev143+gc0a350ca7.
+
+> **Note:** The official vLLM nightly Docker image is pending release. For now, please install from the nightly wheel index as shown above.
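Before launching the server, it is worth confirming that the installed build actually meets this version floor; a quick check, using the version string vLLM exposes programmatically:

```bash
# Print the installed vLLM version; it should compare at or above
# 0.14.0rc2.dev143+gc0a350ca7 for this model.
python -c "import vllm; print(vllm.__version__)"
```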
+
+2. Launch the server:
+
+```bash
+vllm serve stepfun-ai/Step3-VL-10B -tp 1 --reasoning-parser deepseek_r1 --enable-auto-tool-choice --tool-call-parser hermes --trust-remote-code
+```
+
+**Crucial step:** You must append the `--trust-remote-code` flag to your deployment command, as shown above. This is mandatory for models that use custom code for their architecture.
+
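Once the server is up, a quick sanity check against the standard OpenAI-compatible `/v1/models` endpoint confirms the model is registered before you wire up a client:

```bash
# The returned model id is the value clients must pass as `model`.
curl http://localhost:8000/v1/models
```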
+3. Call the endpoint using any OpenAI-compatible SDK (example in Python):
+
+```python
+from openai import OpenAI
+
+client = OpenAI(base_url="http://localhost:8000/v1", api_key="dummy")
+
+resp = client.chat.completions.create(
+    model="stepfun-ai/Step3-VL-10B",
+    messages=[{
+        "role": "user",
+        "content": [
+            {
+                "type": "image_url",
+                "image_url": {
+                    "url": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/bee.jpg"
+                },
+            },
+            {"type": "text", "text": "what's in this picture?"},
+        ],
+    }],
+)
+
+print(resp.choices[0].message.content)
+```
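Because the server was launched with `--reasoning-parser deepseek_r1`, vLLM separates the model's reasoning trace from the final answer; in recent builds this surfaces as a `reasoning_content` field on the message. Whether your SDK and vLLM versions expose it is an assumption worth verifying, so this sketch reads it defensively:

```python
# Print the parsed reasoning trace if the running vLLM build provides it;
# getattr() returns None instead of raising when the field is absent.
print(getattr(resp.choices[0].message, "reasoning_content", None))
```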

## 📝 Citation

```tex
@misc{huang2026step3vl10btechnicalreport,
+title={STEP3-VL-10B Technical Report},
author={Ailin Huang and Chengyuan Yao and Chunrui Han and Fanqi Wan and Hangyu Guo and Haoran Lv and Hongyu Zhou and Jia Wang and Jian Zhou and Jianjian Sun and Jingcheng Hu and Kangheng Lin and Liang Zhao and Mitt Huang and Song Yuan and Wenwen Qu and Xiangfeng Wang and Yanlin Lai and Yingxiu Zhao and Yinmin Zhang and Yukang Shi and Yuyang Chen and Zejia Weng and Ziyang Meng and Ang Li and Aobo Kong and Bo Dong and Changyi Wan and David Wang and Di Qi and Dingming Li and En Yu and Guopeng Li and Haiquan Yin and Han Zhou and Hanshan Zhang and Haolong Yan and Hebin Zhou and Hongbo Peng and Jiaran Zhang and Jiashu Lv and Jiayi Fu and Jie Cheng and Jie Zhou and Jisheng Yin and Jingjing Xie and Jingwei Wu and Jun Zhang and Junfeng Liu and Kaijun Tan and Kaiwen Yan and Liangyu Chen and Lina Chen and Mingliang Li and Qian Zhao and Quan Sun and Shaoliang Pang and Shengjie Fan and Shijie Shang and Siyuan Zhang and Tianhao You and Wei Ji and Wuxun Xie and Xiaobo Yang and Xiaojie Hou and Xiaoran Jiao and Xiaoxiao Ren and Xiangwen Kong and Xin Huang and Xin Wu and Xing Chen and Xinran Wang and Xuelin Zhang and Yana Wei and Yang Li and Yanming Xu and Yeqing Shen and Yuang Peng and Yue Peng and Yu Zhou and Yusheng Li and Yuxiang Yang and Yuyang Zhang and Zhe Xie and Zhewei Huang and Zhenyi Lu and Zhimin Fan and Zihui Cheng and Daxin Jiang and Qi Han and Xiangyu Zhang and Yibo Zhu and Zheng Ge},
year={2026},
eprint={2601.09668},
archivePrefix={arXiv},
primaryClass={cs.CV},
+url={https://arxiv.org/abs/2601.09668},
}
```

## 📄 License

+This project is open-sourced under the [Apache 2.0 License](LICENSE).
generation_config.json
ADDED
@@ -0,0 +1,10 @@
+{
+  "temperature": 1.0,
+  "top_p": 1.0,
+  "top_k": 0,
+  "eos_token_id": [
+    151643,
+    151645,
+    151679
+  ]
+}