---
license: apache-2.0
---
# S1-VL-32B: Scientific Multimodal Reasoning Model

[中文版](./README_zh.md) | [English](./README.md)

## Introduction

**S1-VL-32B** is a multimodal large language model for scientific domains, developed by the ScienceOne team at the Chinese Academy of Sciences. It natively supports two reasoning paradigms, **Multimodal Reasoning** and **Thinking with Images**, and achieves state-of-the-art performance across multiple mainstream scientific multimodal evaluation benchmarks.

- **Multimodal Reasoning Mode**: Chain-of-thought multimodal scientific reasoning, designed for analyzing and solving complex, multi-step problems.
- **Thinking with Images Mode**: Enables the model to actively invoke code tools during reasoning to perform image operations (cropping, zooming, image enhancement, bounding box annotation, and keypoint marking) before generating responses.

We have established a **cross-disciplinary data processing pipeline** that performs multi-dimensional utility evaluation and filtering of visual reasoning trajectories to ensure training-data quality. A **multi-stage post-training procedure** progressively unlocks the scientific reasoning capabilities of S1-VL-32B:

- **Stage 1**: Large-scale multimodal instruction data spanning multiple disciplines (**mathematics, physics, chemistry, astronomy, earth sciences, and biology**) is used for mixed training to strengthen the model's scientific visual understanding and logical reasoning, laying a solid foundation for academic figure Q&A, medical image analysis, chemical structure recognition, and related tasks.
- **Stage 2**: The **Thinking with Images** reasoning paradigm is introduced. Through high-quality **scientific reasoning data annealing**, the model learns to perform **image operations via code** during inference. This is particularly effective in scenarios requiring fine-grained image analysis, with notable strengths in interpreting dense scientific charts, high-resolution remote sensing imagery, microscopic images, and complex visual scenes such as astronomical observation data.

## Model Weights

| Model | Parameters | HuggingFace | ModelScope |
|-------|-----------|-------------|------------|
| S1-VL-32B | 32B | [Download](https://huggingface.co/ScienceOne-AI/S1-VL-32B) | [Download](https://modelscope.cn/models/ScienceOne-AI/S1-VL-32B) |

## Evaluation Results

The evaluation covers **2 dimensions** and **13 benchmarks**. The **Scientific Multimodal Reasoning** dimension includes MMMU, SFE, MathVision, Physics, ScienceOlympiad, VRSBench-MINI, GMAI-MMBench, and Galaxy-10-DECaLS, spanning mathematics, physics, medicine, remote sensing, astronomy, and other professional fields. The **Image Manipulation Reasoning** dimension includes HRBench-4K, HRBench-8K, MME-RealWorld-CN, MME-RealWorld-Lite, and V*, focusing on high-resolution image understanding and real-world visual reasoning.

<div align="center">
<img src="./image/s1-vl-32b-benchmark.png"/>
</div>

S1-VL-32B demonstrates strong overall competitiveness across these evaluations. In **Scientific Multimodal Reasoning** tasks, it leads on multiple authoritative benchmarks, including MMMU, MathVision, and VRSBench-MINI, surpassing its base model Qwen3-VL-32B in overall performance while remaining highly competitive against open-source models with substantially larger parameter counts (e.g., Qwen3-VL-235B, Intern-S1) and closed-source flagship models (e.g., Gemini 2.5 Pro, GPT-5). In **Image Manipulation Reasoning** tasks, S1-VL-32B ranks **first on all five benchmarks**, outperforming models of comparable and larger scale and surpassing dedicated "Thinking with Images" models such as Thyme-VL and Skywork-R1V4. These results validate its ability to deliver efficient, high-quality multimodal reasoning at the 32B parameter scale.

## Case Study

The following shows S1-VL-32B reasoning in **Thinking with Images** mode. When processing a low-resolution cervical CT image, S1-VL-32B proactively invokes code tools during reasoning to **crop and magnify** the region of interest. With the clearer local view thus obtained, the model combines the enhanced visual information with its internal knowledge to complete the reasoning.

<div align="center">
<img src="./image/s1-vl-32b-twi.png"/>
</div>

More cases are available in [CASES.md](./CASES.md).

## Quick Start

### 1. Install Dependencies

```bash
# Requires vLLM >= 0.11.0
pip install -U vllm
pip install qwen-vl-utils==0.0.14
```

### 2. Start the vLLM Service

```bash
vllm serve ScienceOne-AI/S1-VL-32B \
  --tensor-parallel-size 4 \
  --max-model-len 32768 \
  --limit-mm-per-prompt image=15 \
  --reasoning-parser deepseek_r1 \
  --enable-prefix-caching \
  --gpu-memory-utilization 0.95 \
  --port 9200
```

### 3. Multimodal Reasoning Mode

```python
from openai import OpenAI
import base64

client = OpenAI(api_key="EMPTY", base_url="http://localhost:9200/v1")

with open("path/to/your/image.png", "rb") as f:
    image_data = base64.b64encode(f.read()).decode("utf-8")

response = client.chat.completions.create(
    model="ScienceOne-AI/S1-VL-32B",
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "image_url", "image_url": {"url": f"data:image/png;base64,{image_data}"}},
                {"type": "text", "text": "Please describe the physical phenomenon shown in the image and derive the relevant equations."},
            ],
        }
    ],
    temperature=0.6,
    top_p=0.95,
    max_tokens=16384,
)

# The reasoning process is in the reasoning_content field
print("Thinking process:\n", response.choices[0].message.reasoning_content)
print("\nFinal answer:\n", response.choices[0].message.content)
```
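The base64 data-URL boilerplate above can be wrapped in a small helper that also guesses the MIME type from the file extension. This is a sketch; `image_to_content` is an illustrative name, not part of the model's or vLLM's API:

```python
import base64
import mimetypes

def image_to_content(path: str) -> dict:
    """Build an OpenAI-style image_url content entry from a local image file."""
    mime, _ = mimetypes.guess_type(path)
    mime = mime or "image/png"  # assume PNG when the extension is unknown
    with open(path, "rb") as f:
        b64 = base64.b64encode(f.read()).decode("utf-8")
    return {"type": "image_url", "image_url": {"url": f"data:{mime};base64,{b64}"}}
```

The returned dict can be placed directly into the `content` list of the request above.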

### 4. Thinking with Images Mode

Thinking with Images mode requires deploying a **code sandbox** so the model can invoke code tools during reasoning to perform image operations (cropping, zooming, enhancement, annotation, etc.).
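For intuition, the operations the model emits are ordinary Python image code executed in the sandbox's Jupyter kernel. Below is a minimal Pillow-based sketch of one such crop-and-zoom step; the function name, coordinates, and paths are illustrative, not the model's actual output:

```python
from PIL import Image

def crop_and_zoom(img: Image.Image, box: tuple, factor: int = 2) -> Image.Image:
    """Crop the (left, top, right, bottom) box and magnify it by `factor`."""
    region = img.crop(box)
    return region.resize((region.width * factor, region.height * factor), Image.LANCZOS)

# e.g. zoom into a region of a CT slice at 2x before re-examining it:
# detail = crop_and_zoom(Image.open("ct_slice.png"), (100, 80, 400, 320))
```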

#### Step 1: Deploy the Code Sandbox

We recommend deploying the AIO Sandbox with Docker:

```bash
git clone https://github.com/agent-infra/sandbox
cd sandbox
# Mount the host image directory into the container (host path -> sandbox path)
docker run -d \
  --name twi-sandbox \
  -p 18081:18081 \
  -v /data/images:/mnt/data/images \
  sandbox:latest
```

The mount path must match the path configuration in the FastAPI service.
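Because the model's generated code runs inside the container, any host path handed to the sandbox must be rewritten to its container-side equivalent. A hypothetical helper illustrating the mapping implied by the `-v /data/images:/mnt/data/images` mount above (`to_sandbox_path` and both constants are our own names, not part of twi_server.py):

```python
HOST_IMG_DIR = "/data/images"         # host side of the docker -v mount
SANDBOX_IMG_DIR = "/mnt/data/images"  # container side of the mount

def to_sandbox_path(host_path: str) -> str:
    """Translate a host image path into the path the sandbox kernel sees."""
    if not host_path.startswith(HOST_IMG_DIR + "/"):
        raise ValueError(f"{host_path} is outside the mounted image directory")
    return SANDBOX_IMG_DIR + host_path[len(HOST_IMG_DIR):]
```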

#### Step 2: Start the Thinking with Images FastAPI Service

Download [twi_server.py](twi_server.py) and update the path configuration at the top of the file:

```python
CHAT_API = "http://localhost:9200/v1/chat/completions"  # vLLM address
JUPYTER_API = "http://localhost:18081/v1/jupyter"  # Sandbox address
HOST_IMG_DIR = "/data/images"  # Host image directory (must match the docker -v mount)
```

Start the service:

```bash
pip install fastapi uvicorn httpx pillow
python twi_server.py  # Listens on port 10044
```

#### Step 3: Call the Thinking with Images Endpoint

```python
import httpx
import base64

with open("path/to/your/image.png", "rb") as f:
    image_b64 = base64.b64encode(f.read()).decode("utf-8")

messages = [
    {"type": "text", "text": "Please carefully analyze this scientific image."},
    {"type": "image_url", "image_url": {"url": f"data:image/png;base64,{image_b64}"}},
]

response = httpx.post(
    "http://localhost:10044/process",
    json={
        "messages": messages,
        "image_path_list": ["/data/images/your_image.png"],  # Absolute host path
    },
    timeout=300,
)

result = response.json()

# The final answer is the last message with role="assistant"
final = [m for m in result["messages"] if m["role"] == "assistant"][-1]
print(final["content"])
```
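Beyond the final answer, the returned `messages` list records the whole reasoning trajectory, including intermediate tool invocations. Assuming each entry carries `role` and `content` keys as in the example above, a small sketch for inspecting the trajectory (`print_trajectory` is an illustrative helper, not part of the service API):

```python
def print_trajectory(messages: list) -> None:
    """Print each turn of a Thinking with Images trajectory in order."""
    for i, msg in enumerate(messages):
        role = msg.get("role", "unknown")
        content = msg.get("content", "")
        # Summarize non-string content (e.g. returned image payloads) instead of dumping it
        text = content if isinstance(content, str) else f"<{len(content)} content item(s)>"
        print(f"[{i}] {role}: {text[:200]}")
```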

## Citation

If you use S1-VL-32B in your research, please cite it as follows (the corresponding paper is coming soon):

```bibtex
@misc{s1vl2026,
  title = {S1-VL-32B: Scientific Multimodal Reasoning Model},
  author = {ScienceOne Team},
  year = {2026},
  howpublished = {\url{https://huggingface.co/ScienceOne-AI/S1-VL-32B}}
}
```

## License

This project is released under the Apache 2.0 License.

## Acknowledgements

We thank the open-source communities and pioneering work of [Qwen3-VL](https://modelscope.cn/collections/Qwen3-VL-5c7a94c8cb144b) and [AIO Sandbox](https://github.com/agent-infra/sandbox) for laying the foundation for the scientific multimodal reasoning research behind S1-VL-32B.