---
license: apache-2.0
---
# S1-VL-32B: Scientific Multimodal Reasoning Model
[中文版](./README_zh.md) | [English](./README.md)
## 🔬 Introduction
**S1-VL-32B** is a multimodal large language model for scientific domains, developed by the ScienceOne team at the Chinese Academy of Sciences. It natively supports two reasoning paradigms, **Multimodal Reasoning** and **Thinking with Images**, and achieves state-of-the-art performance across multiple mainstream scientific multimodal evaluation benchmarks.
- **Multimodal Reasoning Mode**: Chain-of-thought-based multimodal scientific reasoning, designed for analyzing and solving complex, multi-step problems.
- **Thinking with Images Mode**: Enables the model to actively invoke code tools during the reasoning process to perform image operations (cropping, zooming, image enhancement, bounding box annotation, and keypoint marking) before generating responses.
We have established a **cross-disciplinary data processing pipeline** that conducts multi-dimensional utility evaluation and filtering of visual reasoning trajectories to ensure the quality of training data. A **multi-stage post-training procedure** is employed to progressively unlock the scientific reasoning capabilities of S1-VL-32B:
- **Stage 1**: Large-scale multimodal instruction data spanning multiple disciplines (**mathematics, physics, chemistry, astronomy, earth sciences, and biology**) is used for mixed training to enhance the model's scientific visual understanding and logical reasoning abilities, laying a solid foundation for academic figure Q&A, medical image analysis, chemical structure recognition, and related tasks.
- **Stage 2**: The **Thinking with Images** reasoning paradigm is introduced. Through high-quality **scientific reasoning data annealing**, the model acquires the ability to perform **image operations via code** during inference. This approach yields particularly outstanding performance in scenarios requiring fine-grained image analysis, with notable strengths in interpreting dense scientific charts, high-resolution remote sensing imagery, microscopic images, and complex visual scenes such as astronomical observation data.
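The image operations described above can be sketched in miniature. The following standard-library-only toy (the function names and the nearest-neighbor zoom are illustrative assumptions, not the model's actual tooling) shows the crop-then-magnify pattern on a 2-D pixel grid:

```python
def crop(image, top, left, height, width):
    """Crop a rectangular region of interest from a 2-D pixel grid (list of rows)."""
    return [row[left:left + width] for row in image[top:top + height]]

def zoom(image, factor):
    """Magnify by nearest-neighbor upsampling: repeat each pixel and row `factor` times."""
    return [
        [pixel for pixel in row for _ in range(factor)]
        for row in image
        for _ in range(factor)
    ]

# A tiny 4x4 "image"; the region of interest is the 2x2 block in the center.
img = [[0, 0, 0, 0],
       [0, 1, 2, 0],
       [0, 3, 4, 0],
       [0, 0, 0, 0]]

roi = crop(img, 1, 1, 2, 2)   # [[1, 2], [3, 4]]
enlarged = zoom(roi, 2)       # 4x4 magnified view of the ROI
```

In practice the model emits richer code against full-resolution images, but the principle is the same: isolate the region, enlarge it, and reason over the clearer view.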
## 📦 Model Weights
| Model | Parameters | HuggingFace | ModelScope |
|-------|-----------|-------------|------------|
| S1-VL-32B | 32B | 🤗 [Download](https://huggingface.co/ScienceOne-AI/S1-VL-32B) | 🤖 [Download](https://modelscope.cn/models/ScienceOne-AI/S1-VL-32B) |
## 📊 Evaluation Results
The evaluation covers **2 dimensions** and **13 benchmarks**. The **Scientific Multimodal Reasoning** dimension includes MMMU, SFE, MathVision, Physics, ScienceOlympiad, VRSBench-MINI, GMAI-MMBench, and Galaxy-10-DECaLS, spanning mathematics, physics, medicine, remote sensing, astronomy, and other professional fields. The **Image Manipulation Reasoning** dimension includes HRBench-4K, HRBench-8K, MME-RealWorld-CN, MME-RealWorld-Lite, and V*, focusing on high-resolution image understanding and real-world visual reasoning.
<div align="center">
<img src="./image/s1-vl-32b-benchmark.png"/>
</div>
S1-VL-32B demonstrates strong overall competitiveness across these evaluations. In **scientific multimodal reasoning** tasks, the model achieves significant advantages on multiple authoritative benchmarks, including MMMU, MathVision, and VRSBench-MINI, surpassing its base model Qwen3-VL-32B in overall performance while remaining highly competitive against open-source models with substantially larger parameter counts (e.g., Qwen3-VL-235B, Intern-S1) as well as closed-source flagship models (e.g., Gemini 2.5 Pro, GPT-5). In **image manipulation reasoning** tasks, S1-VL-32B ranks **first on all five benchmarks**, outperforming models of comparable and larger scale and surpassing dedicated "Thinking with Images" models such as Thyme-VL and Skywork-R1V4. These results validate its ability to deliver efficient, high-quality multimodal reasoning at the 32B parameter scale.
## 🧠 Case Study
The following presents reasoning examples of S1-VL-32B operating in **Thinking with Images** mode. When processing a low-resolution cervical CT image, S1-VL-32B proactively invokes code tools during its reasoning process to perform **cropping and magnification** on the region of interest. By obtaining a clearer local image, the model then combines the enhanced visual information with its internal knowledge to complete the reasoning.
<div align="center">
<img src="./image/s1-vl-32b-twi.png"/>
</div>
📖 More cases are available in [CASES.md](./CASES.md).
## 🚀 Quick Start
### 1. Install Dependencies
```bash
# Requires vLLM >= 0.11.0
pip install -U vllm
pip install qwen-vl-utils==0.0.14
```
### 2. Start the vLLM Service
```bash
vllm serve ScienceOne-AI/S1-VL-32B \
--tensor-parallel-size 4 \
--max-model-len 32768 \
--limit-mm-per-prompt image=15 \
--reasoning-parser deepseek_r1 \
--enable-prefix-caching \
--gpu-memory-utilization 0.95 \
--port 9200
```
### 3. Multimodal Reasoning Mode
```python
from openai import OpenAI
import base64

client = OpenAI(api_key="EMPTY", base_url="http://localhost:9200/v1")

with open("path/to/your/image.png", "rb") as f:
    image_data = base64.b64encode(f.read()).decode("utf-8")

response = client.chat.completions.create(
    model="ScienceOne-AI/S1-VL-32B",
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "image_url", "image_url": {"url": f"data:image/png;base64,{image_data}"}},
                {"type": "text", "text": "Please describe the physical phenomenon shown in the image and derive the relevant equations."},
            ],
        }
    ],
    temperature=0.6,
    top_p=0.95,
    max_tokens=16384,
)

# The reasoning process is in the reasoning_content field
print("Thinking process:\n", response.choices[0].message.reasoning_content)
print("\nFinal answer:\n", response.choices[0].message.content)
```
### 4. Thinking with Images Mode
Thinking with Images mode requires deploying a **code sandbox** to support the model invoking code tools during reasoning for image operations (cropping, zooming, enhancement, annotation, etc.).
#### Step 1: Deploy the Code Sandbox
We recommend deploying the AIO Sandbox with Docker:
```bash
git clone https://github.com/agent-infra/sandbox
cd sandbox
# Mount the host image directory into the container (host path -> sandbox path)
docker run -d \
--name twi-sandbox \
-p 18081:18081 \
-v /data/images:/mnt/data/images \
sandbox:latest
```
The mount path must match the path configuration in the FastAPI service.
#### Step 2: Start the Thinking with Images FastAPI Service
Download [twi_server.py](twi_server.py) and update the path configuration at the top of the file:
```python
CHAT_API = "http://localhost:9200/v1/chat/completions" # vLLM address
JUPYTER_API = "http://localhost:18081/v1/jupyter" # Sandbox address
HOST_IMG_DIR = "/data/images"                           # Host image directory (must match the docker -v mount)
```
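The reason the paths must line up is that the service refers to images by host path while the sandbox sees the same files under the mount point. A minimal sketch of that translation, assuming the `docker -v` mapping above (the helper name is illustrative, not part of twi_server.py):

```python
HOST_IMG_DIR = "/data/images"          # host side of the docker -v mount
SANDBOX_IMG_DIR = "/mnt/data/images"   # container side of the same mount

def to_sandbox_path(host_path: str) -> str:
    """Map an absolute host image path to the path the sandbox sees."""
    if not host_path.startswith(HOST_IMG_DIR):
        raise ValueError(f"{host_path} is outside the mounted image directory")
    return SANDBOX_IMG_DIR + host_path[len(HOST_IMG_DIR):]

print(to_sandbox_path("/data/images/your_image.png"))
# /mnt/data/images/your_image.png
```

If `HOST_IMG_DIR` in the config drifts from the `-v` flag, the sandbox will fail to find the images the service hands it.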
Start the service:
```bash
pip install fastapi uvicorn httpx pillow
python twi_server.py # Listens on port 10044
```
#### Step 3: Call the Thinking with Images Endpoint
```python
import httpx
import base64

with open("path/to/your/image.png", "rb") as f:
    image_b64 = base64.b64encode(f.read()).decode("utf-8")

messages = [
    {"type": "text", "text": "Please carefully analyze this scientific image."},
    {"type": "image_url", "image_url": {"url": f"data:image/png;base64,{image_b64}"}},
]

response = httpx.post(
    "http://localhost:10044/process",
    json={
        "messages": messages,
        "image_path_list": ["/data/images/your_image.png"],  # Absolute host path
    },
    timeout=300,
)
result = response.json()

# The final answer is the last message with role="assistant"
final = [m for m in result["messages"] if m["role"] == "assistant"][-1]
print(final["content"])
```
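Because the returned transcript interleaves reasoning steps, tool invocations, and answers, the last-assistant-message extraction shown above can be wrapped in a small helper. A sketch, with the message schema assumed from the example and a hypothetical transcript for illustration:

```python
def final_answer(messages):
    """Return the content of the last assistant message in a transcript."""
    answers = [m["content"] for m in messages if m.get("role") == "assistant"]
    if not answers:
        raise ValueError("no assistant message in transcript")
    return answers[-1]

# Hypothetical transcript shape: intermediate assistant turns hold tool code,
# tool turns hold results, and the final assistant turn holds the answer.
transcript = [
    {"role": "user", "content": "Please carefully analyze this scientific image."},
    {"role": "assistant", "content": "<code>crop_and_zoom(...)</code>"},
    {"role": "tool", "content": "cropped image returned"},
    {"role": "assistant", "content": "The region shows a lesion near C4."},
]

print(final_answer(transcript))  # The region shows a lesion near C4.
```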
## 📝 Citation
If you use S1-VL-32B in your research, please cite (the corresponding paper is coming soon):
```bibtex
@misc{s1vl2026,
title = {S1-VL-32B: Scientific Multimodal Reasoning Model},
author = {ScienceOne Team},
year = {2026},
howpublished = {\url{https://huggingface.co/ScienceOne-AI/S1-VL-32B}}
}
```
## 📄 License
This project is released under the Apache 2.0 License.
## 🙏 Acknowledgements
We thank the open-source communities and pioneering works of [Qwen3-VL](https://modelscope.cn/collections/Qwen3-VL-5c7a94c8cb144b) and [AIO Sandbox](https://github.com/agent-infra/sandbox) for laying the foundation for the scientific multimodal reasoning research behind S1-VL-32B. |