Add pipeline tag and library name to metadata

#6 opened by nielsr (HF Staff)

Files changed (1): README.md (+6 −6)
````diff
--- a/README.md
+++ b/README.md
@@ -1,7 +1,9 @@
 ---
-license: apache-2.0
 base_model:
-- stepfun-ai/Step3-VL-10B-Base
+- stepfun-ai/Step3-VL-10B-Base
+license: apache-2.0
+library_name: transformers
+pipeline_tag: image-text-to-text
 ---
 
 <div align="center">
@@ -57,8 +59,6 @@ STEP3-VL-10B delivers best-in-class performance across major multimodal benchmarks
 | **HMMT 2025** | 78.18 | **92.14** | 57.29 | 67.71 | 65.68 | 51.30 |
 | **LiveCodeBench** | 75.77 | **76.43** | 48.71 | 69.45 | 72.01 | 57.10 |
 
-<!-- > **Note:** **SeRe** (Sequential Reasoning) uses a max length of 64K tokens; **PaCoRe** (Parallel Coordinated Reasoning) synthesizes 16 SeRe rollouts with a max length of 128K tokens. -->
-
 > **Note on Inference Modes:**
 >
 > **SeRe (Sequential Reasoning):** The standard inference mode using sequential generation (Chain-of-Thought) with a max length of 64K tokens.
@@ -120,7 +120,7 @@ STEP3-VL-10B delivers best-in-class performance across major multimodal benchmarks
 
 ### Inference with Hugging Face Transformers
 
-We introduce how to use our model at inference stage using transformers library. It is recommended to use python=3.10, torch>=2.1.0, and transformers=4.57.0 as the development environment.We currently only support bf16 inference, and multi-patch for image preprocessing is supported by default. This behavior is aligned with vllm and sglang.
+We introduce how to use our model at inference stage using transformers library. It is recommended to use python=3.10, torch>=2.1.0, and transformers=4.57.0 as the development environment. We currently only support bf16 inference, and multi-patch for image preprocessing is supported by default. This behavior is aligned with vllm and sglang.
 
 ```python
 from transformers import AutoProcessor, AutoModelForCausalLM
@@ -185,4 +185,4 @@ If you find this project useful in your research, please cite our technical report:
 
 ## 📄 License
 
-This project is open-sourced under the [Apache 2.0 License](https://www.google.com/search?q=LICENSE).
+This project is open-sourced under the [Apache 2.0 License](https://www.google.com/search?q=LICENSE).
````
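The net effect of this PR on the README's YAML front matter can be sketched with a minimal check. The helper below is hypothetical (not part of the stepfun-ai repository) and uses naive line-based parsing rather than a YAML library, just to show which top-level metadata keys the Hub would now see:

```python
# Minimal sketch: list the top-level keys in a README's YAML front matter.
# Hypothetical helper using naive line parsing (no PyYAML dependency);
# assumes the README starts with a "---"-delimited YAML block.

def front_matter_keys(readme_text: str) -> list[str]:
    lines = readme_text.splitlines()
    if not lines or lines[0].strip() != "---":
        return []  # no front matter block
    keys = []
    for line in lines[1:]:
        if line.strip() == "---":
            break  # end of front matter
        # Top-level keys are unindented "key: value" pairs;
        # lines starting with "-" are list items under a key.
        if line and not line.startswith((" ", "-")) and ":" in line:
            keys.append(line.split(":", 1)[0].strip())
    return keys

# Front matter as it reads after this PR is merged.
new_header = """---
base_model:
- stepfun-ai/Step3-VL-10B-Base
license: apache-2.0
library_name: transformers
pipeline_tag: image-text-to-text
---
"""

print(front_matter_keys(new_header))
# ['base_model', 'license', 'library_name', 'pipeline_tag']
```

With `library_name: transformers` and `pipeline_tag: image-text-to-text` present, the Hub can render the correct "how to use" snippet and list the model under the image-text-to-text task filter.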