Papers
arxiv:2601.09668

STEP3-VL-10B Technical Report

Published on Jan 14
· Submitted by
Jingcheng Hu
on Jan 16
#1 Paper of the day
Authors:
,
,
,
,
,
,
,
,
,
,
,
,
,
,

Abstract

STEP3-VL-10B achieves superior multimodal performance through unified pre-training with a language-aligned Perception Encoder and Qwen3-8B decoder, combined with scaled post-training and Parallel Coordinated Reasoning for efficient large-scale visual reasoning.

AI-generated summary

We present STEP3-VL-10B, a lightweight open-source foundation model designed to redefine the trade-off between compact efficiency and frontier-level multimodal intelligence. STEP3-VL-10B is realized through two strategic shifts: first, a unified, fully unfrozen pre-training strategy on 1.2T multimodal tokens that integrates a language-aligned Perception Encoder with a Qwen3-8B decoder to establish intrinsic vision-language synergy; and second, a scaled post-training pipeline featuring over 1k iterations of reinforcement learning. Crucially, we implement Parallel Coordinated Reasoning (PaCoRe) to scale test-time compute, allocating resources to scalable perceptual reasoning that explores and synthesizes diverse visual hypotheses. Consequently, despite its compact 10B footprint, STEP3-VL-10B rivals or surpasses models 10times-20times larger (e.g., GLM-4.6V-106B, Qwen3-VL-235B) and top-tier proprietary flagships like Gemini 2.5 Pro and Seed-1.5-VL. Delivering best-in-class performance, it records 92.2% on MMBench and 80.11% on MMMU, while excelling in complex reasoning with 94.43% on AIME2025 and 75.95% on MathVision. We release the full model suite to provide the community with a powerful, efficient, and reproducible baseline.

Community

Paper author Paper submitter

🎉 Introducing Step3-VL-10B, Compact Yet Frontier Multimodal Intelligence, with best performance at 10B model scale, even matching 10x-20x size of open-source frontier models!

congrats!

Paper author Paper submitter

PaCoRe is also used in Step3-VL-10B, and this demonstrates that PaCoRe also works well for multimodal intelligence!
With massive test time scaling, small model can also achieve strong performance:-)

More details about pacore are in pacore's original link:-)
hf paper link: https://huggingface.co/papers/2601.05593
gh link: https://github.com/stepfun-ai/PaCoRe

arXiv explained breakdown of this paper 👉 https://arxivexplained.com/papers/step3-vl-10b-technical-report

This is an automated message from the Librarian Bot. I found the following papers similar to this paper.

The following papers were recommended by the Semantic Scholar API

Please give a thumbs up to this comment if you found it helpful!

If you want recommendations for any Paper on Hugging Face checkout this Space

You can directly ask Librarian Bot for paper recommendations by tagging it in a comment: @librarian-bot recommend

Sign up or log in to comment

Models citing this paper 2

Datasets citing this paper 0

No dataset linking this paper

Cite arxiv.org/abs/2601.09668 in a dataset README.md to link it from this page.

Spaces citing this paper 0

No Space linking this paper

Cite arxiv.org/abs/2601.09668 in a Space README.md to link it from this page.

Collections including this paper 1