---
license: apache-2.0
---
# S1-VL-32B: Scientific Multimodal Reasoning Model

[中文版](./README_zh.md) | [English](./README.md)

## 🔬 Introduction

**S1-VL-32B** is a multimodal large language model for scientific domains, developed by the ScienceOne team at the Chinese Academy of Sciences. It natively supports two reasoning paradigms, **Multimodal Reasoning** and **Thinking with Images**, and achieves state-of-the-art performance on multiple mainstream scientific multimodal benchmarks.

- **Multimodal Reasoning Mode**: Chain-of-thought multimodal scientific reasoning, designed for analyzing and solving complex, multi-step problems.
- **Thinking with Images Mode**: Enables the model to actively invoke code tools during reasoning to perform image operations (cropping, zooming, image enhancement, bounding-box annotation, and keypoint marking) before generating its response.
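
To make the second paradigm concrete, the sketch below shows the kind of crop-and-magnify operation the model can request during reasoning. This is a hypothetical helper for illustration only (nearest-neighbor zoom on a plain 2D pixel grid), not the model's actual tool code:

```python
# Illustrative crop-and-zoom of the kind requested in Thinking with Images mode.
# Hypothetical helper for illustration; not the model's actual tool implementation.

def crop_and_zoom(image, box, factor):
    """Crop `box` = (left, top, right, bottom) from a 2D pixel grid,
    then magnify the patch `factor`x with nearest-neighbor upsampling."""
    left, top, right, bottom = box
    crop = [row[left:right] for row in image[top:bottom]]
    return [
        [crop[r // factor][c // factor]
         for c in range(len(crop[0]) * factor)]
        for r in range(len(crop) * factor)
    ]

# A tiny 4x4 "image"; zoom in on its 2x2 centre at 2x magnification.
img = [[r * 4 + c for c in range(4)] for r in range(4)]
patch = crop_and_zoom(img, (1, 1, 3, 3), 2)
print(len(patch), len(patch[0]))  # 4 4
```

In practice the model emits such operations as code, executes them in a sandbox (see Quick Start below), and feeds the magnified patch back into its reasoning.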

We have established a **cross-disciplinary data processing pipeline** that conducts multi-dimensional utility evaluation and filtering of visual reasoning trajectories to ensure the quality of training data. A **multi-stage post-training procedure** is employed to progressively unlock the scientific reasoning capabilities of S1-VL-32B:

- **Stage 1**: Large-scale multimodal instruction data spanning multiple disciplines (including **mathematics, physics, chemistry, astronomy, earth sciences, and biology**) is used for mixed training to enhance the model's scientific visual understanding and logical reasoning abilities, laying a solid foundation for academic figure Q&A, medical image analysis, chemical structure recognition, and related tasks.
- **Stage 2**: The **Thinking with Images** reasoning paradigm is introduced. Through high-quality **scientific reasoning data annealing**, the model acquires the ability to perform **image operations via code** during inference. This approach is particularly effective in scenarios requiring fine-grained image analysis, with notable strengths in interpreting dense scientific charts, high-resolution remote sensing imagery, microscopic images, and complex visual scenes such as astronomical observation data.

## 📂 Model Weights

| Model | Parameters | HuggingFace | ModelScope |
|-------|-----------|-------------|------------|
| S1-VL-32B | 32B | 🤗 [Download](https://huggingface.co/ScienceOne-AI/S1-VL-32B) | 🤖 [Download](https://modelscope.cn/models/ScienceOne-AI/S1-VL-32B) |

## πŸ† Evaluation Results

The evaluation covers **2 dimensions** and **13 benchmarks**. The **Scientific Multimodal Reasoning** dimension includes MMMU, SFE, MathVision, Physics, ScienceOlympiad, VRSBench-MINI, GMAI-MMBench, and Galaxy-10-DECaLS, spanning mathematics, physics, medicine, remote sensing, astronomy, and other professional fields. The **Image Manipulation Reasoning** dimension includes HRBench-4K, HRBench-8K, MME-RealWorld-CN, MME-RealWorld-Lite, and V*, focusing on high-resolution image understanding and real-world visual reasoning.

<div align="center">
<img src="./image/s1-vl-32b-benchmark.png"/>
</div>

S1-VL-32B demonstrates strong overall competitiveness across these evaluations. In **scientific multimodal reasoning** tasks, the model leads on multiple authoritative benchmarks, including MMMU, MathVision, and VRSBench-MINI, surpassing its base model Qwen3-VL-32B in overall performance while remaining highly competitive against open-source models with substantially larger parameter counts (e.g., Qwen3-VL-235B, Intern-S1) and closed-source flagship models (e.g., Gemini 2.5 Pro, GPT-5). In **image manipulation reasoning** tasks, S1-VL-32B ranks **first on all five benchmarks**, outperforming models of comparable and larger scale as well as dedicated Thinking-with-Images models such as Thyme-VL and Skywork-R1V4. These results validate its ability to deliver efficient, high-quality multimodal reasoning at the 32B parameter scale.

## 🧠 Case Study

The following example shows S1-VL-32B operating in **Thinking with Images** mode. When processing a low-resolution cervical CT image, S1-VL-32B proactively invokes code tools during its reasoning to **crop and magnify** the region of interest. Having obtained a clearer local view, the model combines the enhanced visual information with its internal knowledge to complete the reasoning.

<div align="center">
<img src="./image/s1-vl-32b-twi.png"/>
</div>

πŸ“ More cases are available in [CASES.md](./CASES.md).

## 🚀 Quick Start

### 1. Install Dependencies

```bash
# Requires vLLM >= 0.11.0
pip install -U vllm
pip install qwen-vl-utils==0.0.14
```

### 2. Start the vLLM Service

```bash
vllm serve ScienceOne-AI/S1-VL-32B \
    --tensor-parallel-size 4 \
    --max-model-len 32768 \
    --limit-mm-per-prompt image=15 \
    --reasoning-parser deepseek_r1 \
    --enable-prefix-caching \
    --gpu-memory-utilization 0.95 \
    --port 9200
```
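
Both examples below embed images as base64 data URLs. A small convenience helper for this (a sketch of our own, not part of the official client code, using only the standard library):

```python
import base64
import mimetypes

def to_data_url(image_path):
    """Read an image file and return it as a base64 data URL,
    as expected by the OpenAI-compatible `image_url` content type."""
    mime = mimetypes.guess_type(image_path)[0] or "image/png"
    with open(image_path, "rb") as f:
        encoded = base64.b64encode(f.read()).decode("utf-8")
    return f"data:{mime};base64,{encoded}"
```

The snippets below inline the same encoding step; the helper just avoids repeating it when sending many images.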

### 3. Multimodal Reasoning Mode

```python
from openai import OpenAI
import base64

client = OpenAI(api_key="EMPTY", base_url="http://localhost:9200/v1")

with open("path/to/your/image.png", "rb") as f:
    image_data = base64.b64encode(f.read()).decode("utf-8")

response = client.chat.completions.create(
    model="ScienceOne-AI/S1-VL-32B",
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "image_url", "image_url": {"url": f"data:image/png;base64,{image_data}"}},
                {"type": "text", "text": "Please describe the physical phenomenon shown in the image and derive the relevant equations."},
            ],
        }
    ],
    temperature=0.6,
    top_p=0.95,
    max_tokens=16384,
)

# The reasoning process is in the reasoning_content field
print("Thinking process:\n", response.choices[0].message.reasoning_content)
print("\nFinal answer:\n", response.choices[0].message.content)
```

### 4. Thinking with Images Mode

Thinking with Images mode requires deploying a **code sandbox** to support the model invoking code tools during reasoning for image operations (cropping, zooming, enhancement, annotation, etc.).

#### Step 1: Deploy the Code Sandbox

We recommend deploying the AIO Sandbox with Docker:

```bash
git clone https://github.com/agent-infra/sandbox
cd sandbox
# Mount the host image directory into the container (host path -> sandbox path)
docker run -d \
    --name twi-sandbox \
    -p 18081:18081 \
    -v /data/images:/mnt/data/images \
    sandbox:latest
```
The mount path must match the path configuration in the FastAPI service.

#### Step 2: Start the Thinking with Images FastAPI Service

Download [twi_server.py](twi_server.py) and update the path configuration at the top of the file:

```python
CHAT_API        = "http://localhost:9200/v1/chat/completions"  # vLLM address
JUPYTER_API     = "http://localhost:18081/v1/jupyter"          # Sandbox address
HOST_IMG_DIR    = "/data/images"     # ← Host image directory (must match docker -v mount)
```

Start the service:

```bash
pip install fastapi uvicorn httpx pillow
python twi_server.py   # Listens on port 10044
```

#### Step 3: Call the Thinking with Images Endpoint

```python
import httpx
import base64

with open("path/to/your/image.png", "rb") as f:
    image_b64 = base64.b64encode(f.read()).decode("utf-8")

messages = [
    {"type": "text", "text": "Please carefully analyze this scientific image."},
    {"type": "image_url", "image_url": {"url": f"data:image/png;base64,{image_b64}"}},
]

response = httpx.post(
    "http://localhost:10044/process",
    json={
        "messages": messages,
        "image_path_list": ["/data/images/your_image.png"],  # Absolute host path
    },
    timeout=300,
)

result = response.json()

# The final answer is the last message with role="assistant"
final = [m for m in result["messages"] if m["role"] == "assistant"][-1]
print(final["content"])
```
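
The returned trajectory also contains the intermediate turns (tool calls and their results). A minimal sketch for inspecting it, assuming each message is a dict with `role` and `content` keys; the exact schema of the tool messages is an assumption, and the demo trajectory below is hypothetical:

```python
def summarize_trajectory(messages):
    """Count turns per role and pick out the final assistant answer
    from a Thinking-with-Images trajectory (assumed role/content schema)."""
    counts = {}
    final_answer = None
    for msg in messages:
        role = msg.get("role", "unknown")
        counts[role] = counts.get(role, 0) + 1
        if role == "assistant":
            final_answer = msg.get("content")
    return counts, final_answer

# Hypothetical trajectory shape, for illustration only:
demo = [
    {"role": "user", "content": "Analyze this image."},
    {"role": "assistant", "content": "Let me crop the region of interest."},
    {"role": "tool", "content": "<cropped image returned>"},
    {"role": "assistant", "content": "The region shows ..."},
]
counts, answer = summarize_trajectory(demo)
print(counts, answer)
```

Applied to `result["messages"]`, this makes it easy to see how many image operations the model performed before answering.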

## 📄 Citation

If you use S1-VL-32B in your research, please cite (the corresponding paper is coming soon):

```bibtex
@misc{s1vl2026,
  title        = {S1-VL-32B: Scientific Multimodal Reasoning Model},
  author       = {ScienceOne Team},
  year         = {2026},
  howpublished = {\url{https://huggingface.co/ScienceOne-AI/S1-VL-32B}}
}
```

## 📜 License

This project is released under the Apache 2.0 License.

## πŸ™ Acknowledgements

We thank the open-source communities and pioneering works of [Qwen3-VL](https://modelscope.cn/collections/Qwen3-VL-5c7a94c8cb144b) and [AIO Sandbox](https://github.com/agent-infra/sandbox) for laying the foundation for the scientific multimodal reasoning research behind S1-VL-32B.