---
license: apache-2.0
---
# S1-VL-32B: Scientific Multimodal Reasoning Model

[中文版](./README_zh.md) | [English](./README.md)

## Introduction

**S1-VL-32B** is a multimodal large language model for scientific domains, developed by the ScienceOne team at the Chinese Academy of Sciences. It natively supports two reasoning paradigms, **Multimodal Reasoning** and **Thinking with Images**, and achieves state-of-the-art performance across multiple mainstream scientific multimodal evaluation benchmarks.

- **Multimodal Reasoning Mode**: Chain-of-thought multimodal scientific reasoning, designed for analyzing and solving complex, multi-step problems.
- **Thinking with Images Mode**: Enables the model to actively invoke code tools during reasoning to perform image operations (cropping, zooming, image enhancement, bounding box annotation, and keypoint marking) before generating responses.

We have established a **cross-disciplinary data processing pipeline** that performs multi-dimensional utility evaluation and filtering of visual reasoning trajectories to ensure training-data quality. A **multi-stage post-training procedure** progressively unlocks the scientific reasoning capabilities of S1-VL-32B:

- **Stage 1**: Large-scale multimodal instruction data spanning multiple disciplines (**mathematics, physics, chemistry, astronomy, earth sciences, and biology**) is used for mixed training to strengthen the model's scientific visual understanding and logical reasoning, laying a solid foundation for academic figure Q&A, medical image analysis, chemical structure recognition, and related tasks.
- **Stage 2**: The **Thinking with Images** reasoning paradigm is introduced. Through high-quality **scientific reasoning data annealing**, the model learns to perform **image operations via code** during inference. This is particularly effective in scenarios requiring fine-grained image analysis, with notable strengths in interpreting dense scientific charts, high-resolution remote sensing imagery, microscopic images, and complex visual scenes such as astronomical observation data.

## Model Weights

| Model | Parameters | HuggingFace | ModelScope |
|-------|-----------|-------------|------------|
| S1-VL-32B | 32B | [Download](https://huggingface.co/ScienceOne-AI/S1-VL-32B) | [Download](https://modelscope.cn/models/ScienceOne-AI/S1-VL-32B) |

## Evaluation Results

The evaluation covers **2 dimensions** and **13 benchmarks**. The **Scientific Multimodal Reasoning** dimension includes MMMU, SFE, MathVision, Physics, ScienceOlympiad, VRSBench-MINI, GMAI-MMBench, and Galaxy-10-DECaLS, spanning mathematics, physics, medicine, remote sensing, astronomy, and other professional fields. The **Image Manipulation Reasoning** dimension includes HRBench-4K, HRBench-8K, MME-RealWorld-CN, MME-RealWorld-Lite, and V*, focusing on high-resolution image understanding and real-world visual reasoning.

<div align="center">
<img src="./image/s1-vl-32b-benchmark.png"/>
</div>

S1-VL-32B demonstrates strong overall competitiveness across these evaluations. In **Scientific Multimodal Reasoning** tasks, it leads on multiple authoritative benchmarks, including MMMU, MathVision, and VRSBench-MINI, surpassing its base model Qwen3-VL-32B in overall performance while remaining highly competitive against open-source models with substantially larger parameter counts (e.g., Qwen3-VL-235B, Intern-S1) and closed-source flagship models (e.g., Gemini 2.5 Pro, GPT-5). In **Image Manipulation Reasoning** tasks, S1-VL-32B ranks **first on all five benchmarks**, outperforming models of comparable and larger scale and surpassing dedicated "Thinking with Images" models such as Thyme-VL and Skywork-R1V4. These results validate its ability to deliver efficient, high-quality multimodal reasoning at the 32B parameter scale.

## Case Study

The following shows S1-VL-32B reasoning in **Thinking with Images** mode. When processing a low-resolution cervical CT image, S1-VL-32B proactively invokes code tools during reasoning to **crop and magnify** the region of interest. With the clearer local view thus obtained, the model combines the enhanced visual information with its internal knowledge to complete the reasoning.

<div align="center">
<img src="./image/s1-vl-32b-twi.png"/>
</div>

More cases are available in [CASES.md](./CASES.md).

## Quick Start

### 1. Install Dependencies

```bash
# Requires vLLM >= 0.11.0
pip install -U vllm
pip install qwen-vl-utils==0.0.14
```

### 2. Start the vLLM Service

```bash
vllm serve ScienceOne-AI/S1-VL-32B \
  --tensor-parallel-size 4 \
  --max-model-len 32768 \
  --limit-mm-per-prompt image=15 \
  --reasoning-parser deepseek_r1 \
  --enable-prefix-caching \
  --gpu-memory-utilization 0.95 \
  --port 9200
```

### 3. Multimodal Reasoning Mode

```python
from openai import OpenAI
import base64

client = OpenAI(api_key="EMPTY", base_url="http://localhost:9200/v1")

with open("path/to/your/image.png", "rb") as f:
    image_data = base64.b64encode(f.read()).decode("utf-8")

response = client.chat.completions.create(
    model="ScienceOne-AI/S1-VL-32B",
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "image_url", "image_url": {"url": f"data:image/png;base64,{image_data}"}},
                {"type": "text", "text": "Please describe the physical phenomenon shown in the image and derive the relevant equations."},
            ],
        }
    ],
    temperature=0.6,
    top_p=0.95,
    max_tokens=16384,
)

# The reasoning process is in the reasoning_content field
print("Thinking process:\n", response.choices[0].message.reasoning_content)
print("\nFinal answer:\n", response.choices[0].message.content)
```
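The base64 data-URL boilerplate above can be wrapped in a small helper that also guesses the MIME type from the file extension. This is a sketch; `image_to_content` is an illustrative name, not part of the model's or vLLM's API:

```python
import base64
import mimetypes

def image_to_content(path: str) -> dict:
    """Build an OpenAI-style image_url content entry from a local image file."""
    mime, _ = mimetypes.guess_type(path)
    mime = mime or "image/png"  # assume PNG when the extension is unknown
    with open(path, "rb") as f:
        b64 = base64.b64encode(f.read()).decode("utf-8")
    return {"type": "image_url", "image_url": {"url": f"data:{mime};base64,{b64}"}}
```

The returned dict can be placed directly into the `content` list of the request above.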

### 4. Thinking with Images Mode

Thinking with Images mode requires deploying a **code sandbox** so the model can invoke code tools during reasoning to perform image operations (cropping, zooming, enhancement, annotation, etc.).
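For intuition, the operations the model emits are ordinary Python image code executed in the sandbox's Jupyter kernel. Below is a minimal Pillow-based sketch of one such crop-and-zoom step; the function name, coordinates, and paths are illustrative, not the model's actual output:

```python
from PIL import Image

def crop_and_zoom(img: Image.Image, box: tuple, factor: int = 2) -> Image.Image:
    """Crop the (left, top, right, bottom) box and magnify it by `factor`."""
    region = img.crop(box)
    return region.resize((region.width * factor, region.height * factor), Image.LANCZOS)

# e.g. zoom into a region of a CT slice at 2x before re-examining it:
# detail = crop_and_zoom(Image.open("ct_slice.png"), (100, 80, 400, 320))
```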

#### Step 1: Deploy the Code Sandbox

We recommend deploying the AIO Sandbox with Docker:

```bash
git clone https://github.com/agent-infra/sandbox
cd sandbox
# Mount the host image directory into the container (host path -> sandbox path)
docker run -d \
  --name twi-sandbox \
  -p 18081:18081 \
  -v /data/images:/mnt/data/images \
  sandbox:latest
```

The mount path must match the path configuration in the FastAPI service.
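Because the model's generated code runs inside the container, any host path handed to the sandbox must be rewritten to its container-side equivalent. A hypothetical helper illustrating the mapping implied by the `-v /data/images:/mnt/data/images` mount above (`to_sandbox_path` and both constants are our own names, not part of twi_server.py):

```python
HOST_IMG_DIR = "/data/images"         # host side of the docker -v mount
SANDBOX_IMG_DIR = "/mnt/data/images"  # container side of the mount

def to_sandbox_path(host_path: str) -> str:
    """Translate a host image path into the path the sandbox kernel sees."""
    if not host_path.startswith(HOST_IMG_DIR + "/"):
        raise ValueError(f"{host_path} is outside the mounted image directory")
    return SANDBOX_IMG_DIR + host_path[len(HOST_IMG_DIR):]
```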

#### Step 2: Start the Thinking with Images FastAPI Service

Download [twi_server.py](twi_server.py) and update the path configuration at the top of the file:

```python
CHAT_API = "http://localhost:9200/v1/chat/completions"  # vLLM address
JUPYTER_API = "http://localhost:18081/v1/jupyter"  # Sandbox address
HOST_IMG_DIR = "/data/images"  # Host image directory (must match the docker -v mount)
```

Start the service:

```bash
pip install fastapi uvicorn httpx pillow
python twi_server.py  # Listens on port 10044
```

#### Step 3: Call the Thinking with Images Endpoint

```python
import httpx
import base64

with open("path/to/your/image.png", "rb") as f:
    image_b64 = base64.b64encode(f.read()).decode("utf-8")

messages = [
    {"type": "text", "text": "Please carefully analyze this scientific image."},
    {"type": "image_url", "image_url": {"url": f"data:image/png;base64,{image_b64}"}},
]

response = httpx.post(
    "http://localhost:10044/process",
    json={
        "messages": messages,
        "image_path_list": ["/data/images/your_image.png"],  # Absolute host path
    },
    timeout=300,
)

result = response.json()

# The final answer is the last message with role="assistant"
final = [m for m in result["messages"] if m["role"] == "assistant"][-1]
print(final["content"])
```
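Beyond the final answer, the returned `messages` list records the whole reasoning trajectory, including intermediate tool invocations. Assuming each entry carries `role` and `content` keys as in the example above, a small sketch for inspecting the trajectory (`print_trajectory` is an illustrative helper, not part of the service API):

```python
def print_trajectory(messages: list) -> None:
    """Print each turn of a Thinking with Images trajectory in order."""
    for i, msg in enumerate(messages):
        role = msg.get("role", "unknown")
        content = msg.get("content", "")
        # Summarize non-string content (e.g. returned image payloads) instead of dumping it
        text = content if isinstance(content, str) else f"<{len(content)} content item(s)>"
        print(f"[{i}] {role}: {text[:200]}")
```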

## Citation

If you use S1-VL-32B in your research, please cite it as follows (the corresponding paper is coming soon):

```bibtex
@misc{s1vl2026,
  title = {S1-VL-32B: Scientific Multimodal Reasoning Model},
  author = {ScienceOne Team},
  year = {2026},
  howpublished = {\url{https://huggingface.co/ScienceOne-AI/S1-VL-32B}}
}
```

## License

This project is released under the Apache 2.0 License.

## Acknowledgements

We thank the open-source communities and pioneering work of [Qwen3-VL](https://modelscope.cn/collections/Qwen3-VL-5c7a94c8cb144b) and [AIO Sandbox](https://github.com/agent-infra/sandbox) for laying the foundation for the scientific multimodal reasoning research behind S1-VL-32B.