Add pipeline tag and library name

#1
by nielsr HF Staff
Files changed (1)
  1. README.md +10 -50
README.md CHANGED
@@ -1,5 +1,7 @@
---
license: apache-2.0
---

  <div align="center">
@@ -20,6 +22,8 @@ license: apache-2.0

**STEP3-VL-10B** is a lightweight open-source foundation model designed to redefine the trade-off between compact efficiency and frontier-level multimodal intelligence. Despite its compact **10B parameter footprint**, STEP3-VL-10B excels in **visual perception**, **complex reasoning**, and **human-centric alignment**. It consistently outperforms models at the sub-10B scale and rivals or surpasses significantly larger open-weights models (**10×–20× its size**), such as GLM-4.6V (106B-A12B), Qwen3-VL-Thinking (235B-A22B), and top-tier proprietary flagships like Gemini 2.5 Pro and Seed-1.5-VL.

<div align="center">
<img src="figures/performance.png" alt="Performance Comparison" width="800"/>
<p><i>Figure 1: Performance comparison of STEP3-VL-10B against SOTA multimodal foundation models. SeRe: Sequential Reasoning; PaCoRe: Parallel Coordinated Reasoning.</i></p>
@@ -27,8 +31,8 @@ license: apache-2.0

The success of STEP3-VL-10B is driven by two key strategic designs:

- 1. **Unified Pre-training on High-Quality Multimodal Corpus:** A single-stage, fully unfrozen training strategy on a 1.2T-token multimodal corpus, focusing on two foundational capabilities: **reasoning** (e.g., general knowledge and education-centric tasks) and **perception** (e.g., grounding, counting, OCR, and GUI interactions). By jointly optimizing the Perception Encoder and the Qwen3-8B decoder, STEP3-VL-10B establishes intrinsic vision-language synergy.
- 2. **Scaled Multimodal Reinforcement Learning and Parallel Reasoning:** Frontier capabilities are unlocked through a rigorous post-training pipeline comprising two-stage supervised finetuning (SFT) and **over 1,400 iterations of RL** with both verifiable rewards (RLVR) and human feedback (RLHF). Beyond sequential reasoning, we adopt **Parallel Coordinated Reasoning (PaCoRe)**, which allocates test-time compute to aggregate evidence from parallel visual exploration.

## πŸ“₯ Model Zoo

@@ -55,42 +59,11 @@ STEP3-VL-10B delivers best-in-class performance across major multimodal benchmar
| **HMMT 2025** | 78.18 | **92.14** | 57.29 | 67.71 | 65.68 | 51.30 |
| **LiveCodeBench** | 75.77 | **76.43** | 48.71 | 69.45 | 72.01 | 57.10 |

- <!-- > **Note:** **SeRe** (Sequential Reasoning) uses a max length of 64K tokens; **PaCoRe** (Parallel Coordinated Reasoning) synthesizes 16 SeRe rollouts with a max length of 128K tokens. -->
-
> **Note on Inference Modes:**
>
> **SeRe (Sequential Reasoning):** The standard inference mode using sequential generation (Chain-of-Thought) with a max length of 64K tokens.
>
- > **PaCoRe (Parallel Coordinated Reasoning):** An advanced mode that scales test-time compute. It aggregates evidence from **16 parallel rollouts** to synthesize a final answer, utilizing a max context length of 128K tokens.
- >
- > _Unless otherwise stated, scores below refer to the standard SeRe mode. Higher scores achieved via PaCoRe are explicitly marked._
-
- ### Comparison with Open-Source Models (7B–10B)
-
- | Category | Benchmark | STEP3-VL-10B | GLM-4.6V-Flash (9B) | Qwen3-VL-Thinking (8B) | InternVL-3.5 (8B) | MiMo-VL-RL-2508 (7B) |
- | :----------------- | :--------------- | :----------: | :-----------------: | :--------------------: | :---------------: | :------------------: |
- | **STEM Reasoning** | MMMU | **78.11** | 71.17 | 73.53 | 71.69 | 71.14 |
- | | MathVision | **70.81** | 54.05 | 59.60 | 52.05 | 59.65 |
- | | MathVista | **83.97** | 82.85 | 78.50 | 76.78 | 79.86 |
- | | PhyX | **59.45** | 52.28 | 57.67 | 50.51 | 56.00 |
- | **Recognition** | MMBench (EN) | **92.05** | 91.04 | 90.55 | 88.20 | 89.91 |
- | | MMStar | **77.48** | 74.26 | 73.58 | 69.83 | 72.93 |
- | | ReMI | **67.29** | 60.75 | 57.17 | 52.65 | 63.13 |
- | **OCR & Document** | OCRBench | **86.75** | 85.97 | 82.85 | 83.70 | 85.40 |
- | | AI2D | **89.35** | 88.93 | 83.32 | 82.34 | 84.96 |
- | **GUI Grounding** | ScreenSpot-V2 | 92.61 | 92.14 | **93.60** | 84.02 | 90.82 |
- | | ScreenSpot-Pro | **51.55** | 45.68 | 46.60 | 15.39 | 34.84 |
- | | OSWorld-G | **59.02** | 54.71 | 56.70 | 31.91 | 50.54 |
- | **Spatial** | BLINK | **66.79** | 64.90 | 62.78 | 55.40 | 62.57 |
- | | All-Angles-Bench | **57.21** | 53.24 | 45.88 | 45.29 | 51.62 |
- | **Code** | HumanEval-V | **66.05** | 29.26 | 26.94 | 24.31 | 31.96 |
-
- ### Key Capabilities
-
- - **STEM Reasoning:** Achieves **94.43%** on AIME 2025 and **75.95%** on MathVision (with PaCoRe), demonstrating exceptional complex reasoning capabilities that outperform models 10×–20× larger.
- - **Visual Perception:** Records **92.05%** on MMBench and **80.11%** on MMMU, establishing strong general visual understanding and multimodal reasoning.
- - **GUI & OCR:** Delivers state-of-the-art performance on ScreenSpot-V2 (**92.61%**), ScreenSpot-Pro (**51.55%**), and OCRBench (**86.75%**), optimized for agentic and document understanding tasks.
- - **Spatial Understanding:** Demonstrates emergent spatial awareness with **66.79%** on BLINK and **57.21%** on All-Angles-Bench, establishing strong potential for embodied intelligence applications.

  ## πŸ—οΈ Architecture & Training
 
@@ -101,24 +74,11 @@ STEP3-VL-10B delivers best-in-class performance across major multimodal benchmar
- **Projector:** Two consecutive stride-2 layers (resulting in 16× spatial downsampling).
- **Resolution:** Multi-crop strategy consisting of a 728×728 global view and multiple 504×504 local crops.

- ### Training Pipeline
-
- **Pre-training:** Single-stage, fully unfrozen strategy using AdamW optimizer (Total: 1.2T tokens, 370K iterations).
- Phase 1: 900B tokens.
- Phase 2: 300B tokens.
- **Supervised Finetuning (SFT):** Two-stage approach (Total: ~226B tokens).
- Stage 1: 9:1 text-to-multimodal ratio (~190B tokens).
- Stage 2: 1:1 text-to-multimodal ratio (~36B tokens).
- **Reinforcement Learning:** Total >1,400 iterations.
- **RLVR:** 600 iterations (Tasks: mathematics, geometry, physics, perception, grounding).
- **RLHF:** 300 iterations (Task: open-ended generation).
- **PaCoRe Training:** 500 iterations (Context length: 64K max sequence).
-
  ## πŸ› οΈ Quick Start

### Inference with Hugging Face Transformers

- We introduce how to use our model at the inference stage with the transformers library. We recommend python=3.10, torch>=2.1.0, and transformers==4.57.0 as the development environment. We currently only support bf16 inference, and multi-patch image preprocessing is enabled by default; this behavior is aligned with vLLM and SGLang.

```python
  from transformers import AutoProcessor, AutoModelForCausalLM
@@ -130,7 +90,7 @@ key_mapping = {
"vit_large_projector": "model.vit_large_projector",
}

- model_path = "stepfun-ai/Step3-VL-10B-Base"

processor = AutoProcessor.from_pretrained(model_path, trust_remote_code=True)
 
@@ -182,4 +142,4 @@ If you find this project useful in your research, please cite our technical repo

## πŸ“„ License

- This project is open-sourced under the [Apache 2.0 License](LICENSE).
 
---
license: apache-2.0
+ library_name: transformers
+ pipeline_tag: image-text-to-text
---

  <div align="center">
 
**STEP3-VL-10B** is a lightweight open-source foundation model designed to redefine the trade-off between compact efficiency and frontier-level multimodal intelligence. Despite its compact **10B parameter footprint**, STEP3-VL-10B excels in **visual perception**, **complex reasoning**, and **human-centric alignment**. It consistently outperforms models at the sub-10B scale and rivals or surpasses significantly larger open-weights models (**10×–20× its size**), such as GLM-4.6V (106B-A12B), Qwen3-VL-Thinking (235B-A22B), and top-tier proprietary flagships like Gemini 2.5 Pro and Seed-1.5-VL.

+ The model was presented in the paper [STEP3-VL-10B Technical Report](https://huggingface.co/papers/2601.09668).
+
<div align="center">
<img src="figures/performance.png" alt="Performance Comparison" width="800"/>
<p><i>Figure 1: Performance comparison of STEP3-VL-10B against SOTA multimodal foundation models. SeRe: Sequential Reasoning; PaCoRe: Parallel Coordinated Reasoning.</i></p>
 

The success of STEP3-VL-10B is driven by two key strategic designs:

+ 1. **Unified Pre-training on High-Quality Multimodal Corpus:** A single-stage, fully unfrozen training strategy on a 1.2T-token multimodal corpus, focusing on two foundational capabilities: **reasoning** and **perception**. By jointly optimizing the Perception Encoder and the Qwen3-8B decoder, STEP3-VL-10B establishes intrinsic vision-language synergy.
+ 2. **Scaled Multimodal Reinforcement Learning and Parallel Reasoning:** Frontier capabilities are unlocked through a rigorous post-training pipeline comprising two-stage supervised finetuning (SFT) and **over 1,400 iterations of RL**. Beyond sequential reasoning, we adopt **Parallel Coordinated Reasoning (PaCoRe)**, which allocates test-time compute to aggregate evidence from parallel visual exploration.

## πŸ“₯ Model Zoo
 
 
| **HMMT 2025** | 78.18 | **92.14** | 57.29 | 67.71 | 65.68 | 51.30 |
| **LiveCodeBench** | 75.77 | **76.43** | 48.71 | 69.45 | 72.01 | 57.10 |

> **Note on Inference Modes:**
>
> **SeRe (Sequential Reasoning):** The standard inference mode using sequential generation (Chain-of-Thought) with a max length of 64K tokens.
>
+ > **PaCoRe (Parallel Coordinated Reasoning):** An advanced mode that scales test-time compute. It aggregates evidence from **16 parallel rollouts** to synthesize a final answer.
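
To make the mechanism concrete, here is a toy sketch of the sample-then-synthesize pattern that parallel coordinated reasoning follows. It is an illustration only: the `generate` stub and the synthesis prompt are placeholders, not the model's actual PaCoRe interface.

```python
# Toy sketch of the parallel-rollout pattern: sample several independent
# reasoning traces, then feed their conclusions back for one synthesis pass.
from concurrent.futures import ThreadPoolExecutor

NUM_ROLLOUTS = 16  # the card reports 16 parallel rollouts

def generate(prompt: str, seed: int) -> str:
    """Placeholder for one sampled SeRe rollout (in practice, a model.generate call)."""
    return f"rollout-{seed}: candidate answer"

def pacore(question: str) -> str:
    # Stage 1: explore in parallel with independent sampling seeds.
    with ThreadPoolExecutor(max_workers=NUM_ROLLOUTS) as pool:
        rollouts = list(pool.map(lambda s: generate(question, s), range(NUM_ROLLOUTS)))
    # Stage 2: aggregate the parallel evidence into a single synthesis prompt.
    evidence = "\n".join(f"[Candidate {i}] {r}" for i, r in enumerate(rollouts))
    return generate(f"{question}\n{evidence}\nSynthesize a final answer.", seed=-1)

print(pacore("How many chairs are visible in the image?"))
```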
  ## πŸ—οΈ Architecture & Training

- **Projector:** Two consecutive stride-2 layers (resulting in 16× spatial downsampling).
- **Resolution:** Multi-crop strategy consisting of a 728×728 global view and multiple 504×504 local crops.
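
As a back-of-the-envelope illustration of how these crop resolutions interact with the 16× spatial downsampling, the sketch below counts visual tokens per view. The ViT patch size of 14 is an assumption chosen for illustration (it divides both 728 and 504 evenly), not a confirmed specification.

```python
# Rough visual-token count per view. PATCH = 14 is an assumed patch size
# (hypothetical); two stride-2 projector layers halve each spatial dimension
# twice, i.e., 4x per side and 16x fewer tokens overall.
PATCH = 14
STRIDE_FACTOR = 4  # two stride-2 layers -> 2 * 2 reduction per side

def tokens_per_view(side_px: int) -> int:
    patches_per_side = side_px // PATCH                   # 728 -> 52, 504 -> 36
    tokens_per_side = patches_per_side // STRIDE_FACTOR   # 52 -> 13, 36 -> 9
    return tokens_per_side ** 2

print(tokens_per_view(728))  # global view: 13 * 13 = 169 tokens
print(tokens_per_view(504))  # each local crop: 9 * 9 = 81 tokens
```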
  ## πŸ› οΈ Quick Start

### Inference with Hugging Face Transformers

+ We introduce how to use our model at the inference stage with the transformers library. We recommend python=3.10, torch>=2.1.0, and transformers==4.57.0 as the development environment.

```python
  from transformers import AutoProcessor, AutoModelForCausalLM
 
"vit_large_projector": "model.vit_large_projector",
}

+ model_path = "stepfun-ai/Step3-VL-10B"

  processor = AutoProcessor.from_pretrained(model_path, trust_remote_code=True)
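# NOTE: everything below is a hedged continuation sketch. The message schema
# and generation call follow standard transformers VLM usage and are
# assumptions, not this model's confirmed API.
import torch

model = AutoModelForCausalLM.from_pretrained(
    model_path,
    torch_dtype=torch.bfloat16,  # the card states only bf16 inference is supported
    trust_remote_code=True,
    key_mapping=key_mapping,     # remaps checkpoint keys as defined above
)

# Hypothetical chat message; "example.jpg" is a placeholder path.
messages = [
    {"role": "user", "content": [
        {"type": "image", "url": "example.jpg"},
        {"type": "text", "text": "Describe this image."},
    ]},
]
inputs = processor.apply_chat_template(
    messages, add_generation_prompt=True, tokenize=True,
    return_dict=True, return_tensors="pt",
).to(model.device)
output_ids = model.generate(**inputs, max_new_tokens=512)
print(processor.decode(output_ids[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))
```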
  ## πŸ“„ License

+ This project is open-sourced under the [Apache 2.0 License](LICENSE).