ScienceOne-AI committed on
Commit cd7b3a1 · verified · 1 Parent(s): c01d9c1
Files changed (2):
  1. README.md +262 -3
  2. README_zh.md +262 -0
README.md CHANGED
<div align="center">

![S1-Omni-Image-Preview Logo](assets/logo.jpg)

**S1-Omni-Image: A Unified Multimodal Model for Scientific Image Understanding and Generation**

[![License](https://img.shields.io/badge/License-Apache%202.0-blue.svg)](LICENSE)
[![GitHub](https://img.shields.io/badge/GitHub-Repository-blue.svg?logo=github)](https://github.com/ScienceOne-AI/S1-Omni-Image)
[![Hugging Face](https://img.shields.io/badge/%F0%9F%A4%97%20Hugging%20Face-Model-yellow.svg)](https://huggingface.co/ScienceOne-AI/S1-Omni-Image-Preview)
[![ModelScope](https://img.shields.io/badge/ModelScope-Model-blue.svg)](https://modelscope.cn/models/ScienceOne-AI/S1-Omni-Image-Preview)

English | [简体中文](./README_zh.md)

</div>

## 📖 Introduction

**S1-Omni-Image-Preview** is a unified end-to-end reasoning model for multimodal understanding and generation, developed by the ScienceOne team at the Chinese Academy of Sciences. Built around a unified **"think before generate"** paradigm, the model handles four types of tasks:

- **Text Generation (T2T)**: generates text responses from text input
- **Image-Text Understanding (TI2T)**: understands images and generates answers following instructions
- **Image Generation (T2I)**: generates images from text instructions
- **Image Editing (TI2I)**: edits images according to text instructions

The open-sourced **S1-Omni-Image-Preview** model has a total parameter count of ~30B and natively supports both text and image modalities for input and output. It represents the team's technical exploration of a unified model architecture for scientific multimodal content understanding and scientific image generation.

For **image generation**, and scientific figure generation in particular, the model has been optimized on large-scale **academic paper figure** data from **six disciplines**: mathematics, physics, chemistry, astronomy, geography, and biology. It covers common scientific figure types such as flowcharts, architecture diagrams, and schematic diagrams, making it the **first open-source** unified multimodal understanding and generation model with enhanced scientific figure generation capabilities.

## 📥 Model Weights Download

The model weights are open-sourced on Hugging Face and ModelScope. You are welcome to download and use them!

| Platform | Model URL |
|------|---------|
| **Hugging Face** | [S1-Omni-Image-Preview](https://huggingface.co/ScienceOne-AI/S1-Omni-Image-Preview) |
| **ModelScope** | [S1-Omni-Image-Preview](https://modelscope.cn/models/ScienceOne-AI/S1-Omni-Image-Preview) |

## 🧠 Model Architecture

The overall architecture of S1-Omni-Image-Preview is shown below. The model natively supports text and image modalities for both input and output.

<div align="center">

<img src="assets/figure_arch.png">

</div>

Specifically, the Text Embedding and Image Embedding layers encode text and visual tokens into vector representations, and the Image Encoder (VAE) encodes input images into image latents. The model first autoregressively generates a thinking process in the form `<think> {Chain of Thought} </think>`, and then generates the answer for the task specified by the user instruction.

For text generation and image understanding tasks, the answer is generated as plain text; for image generation tasks, the answer is `<image_gen> {Detailed prompt for image generation} </image_gen>`; for image editing tasks, the answer is `<image_edit> {Detailed prompt for image editing} </image_edit>`. The hidden states underlying this generated text are simultaneously fed as conditional inputs to the Diffusion Transformer for image generation. This strategy leverages the reasoning capabilities of the multimodal large language model to enrich fine-grained textual semantics during image generation and editing, thereby improving image quality.
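
Because every response follows this tagged format, downstream code can recover the chain of thought and the task-specific answer with simple string processing. The parser below is an illustrative sketch, not part of the released codebase:

```python
import re

# Tags used by the "think before generate" output format.
TAG_PATTERN = re.compile(r"<(think|image_gen|image_edit)>(.*?)</\1>", re.DOTALL)

def parse_response(raw: str) -> dict:
    """Split a raw model response into its chain of thought and answer parts."""
    parts = {name: body.strip() for name, body in TAG_PATTERN.findall(raw)}
    # Anything outside the tagged spans is the plain-text answer (T2T/TI2T).
    text = TAG_PATTERN.sub("", raw).strip()
    if text:
        parts["text"] = text
    return parts

raw = "<think>The user wants a figure.</think><image_gen>A labeled DNA double helix</image_gen>"
parsed = parse_response(raw)
```

For an image generation response, `parsed["image_gen"]` would then hold the detailed prompt whose hidden states condition the Diffusion Transformer.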

## ⚙️ Training Strategy

The model adopts a three-stage training strategy, as shown in the figure.

<div align="center">

<img src="assets/figure_stage.png">

</div>

- **Stage 1: Reasoning Paradigm Training.** Initialize the weights from the multimodal large language model [Qwen3-VL-8B-Instruct](https://huggingface.co/Qwen/Qwen3-VL-8B-Instruct), then fine-tune on training data for all four task types, each annotated with a reasoning process, so that the model learns to think before generating and to emit task-aware special tokens.
- **Stage 2: Diffusion Module Training.** Freeze the multimodal large language model, initialize the diffusion module from the MMDiT module of [Qwen-Image-Edit](https://huggingface.co/Qwen/Qwen-Image-Edit-2511), and jointly train on multi-task data, including self-constructed academic figure generation data, to strengthen the model's image generation and editing capabilities.
- **Stage 3: Alignment Module Training.** Freeze both the multimodal large language model and the MMDiT module, then add and train a projection layer between them that aligns the hidden states of the generated text with the input dimensionality of the diffusion module.
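
Conceptually, the Stage 3 alignment module is a small trainable map squeezed between two frozen components. The NumPy sketch below illustrates the idea; the dimensions are invented for illustration and are not the released model's actual sizes:

```python
import numpy as np

# Hypothetical dimensions, chosen for illustration only.
LLM_HIDDEN_DIM = 4096        # hidden size of the frozen multimodal LLM
DIFFUSION_COND_DIM = 3072    # conditioning dimension expected by MMDiT

rng = np.random.default_rng(0)

# The only parameters trained in Stage 3: a linear projection (plus bias)
# between the frozen LLM and the frozen MMDiT diffusion module.
W = rng.standard_normal((LLM_HIDDEN_DIM, DIFFUSION_COND_DIM)) * 0.02
b = np.zeros(DIFFUSION_COND_DIM)

def project(hidden_states: np.ndarray) -> np.ndarray:
    """Map LLM hidden states (seq_len, LLM_HIDDEN_DIM) to diffusion conditioning."""
    return hidden_states @ W + b

# Hidden states of 16 generated answer tokens -> conditioning for the DiT.
hidden = rng.standard_normal((16, LLM_HIDDEN_DIM))
cond = project(hidden)
```

During training, only `W` and `b` would receive gradients; both neighboring modules stay frozen.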

## 🎨 Case Showcase

The figure below shows two generation samples from the S1-Omni-Image-Preview model, one in English and one in Chinese. It demonstrates the complete workflow from the user's natural-language description, through the model's reasoning process, to the final professional scientific figure.

<div align="center">

<img src="assets/figure_case_en.png">
<img src="assets/figure_case_zh.png">

</div>

The next figure shows samples of S1-Omni-Image-Preview generating different types of academic-style figures across mathematics, physics, chemistry, astronomy, geography, and biology.

<div align="center">

<img src="assets/figure_case_subjects.png">

</div>

## 🚀 Quick Start

### 1. Environment Configuration

System requirements:
- **Python**: 3.10+
- **CUDA**: 12.6+ (recommended)
- **GPU**: NVIDIA GPU with 80GB+ VRAM (A100/H100 recommended)

### 2. Installation

Download the model weights to a local directory, then run the following commands to clone the project, create a virtual environment, and install the dependencies.

```bash
# Clone the project repository
git clone https://github.com/ScienceOne-AI/S1-Omni-Image.git
cd S1-Omni-Image

# Create a virtual environment (recommended)
conda create -n s1-omni-image-env python=3.10
conda activate s1-omni-image-env

# Install dependencies
pip install -r requirements.txt
```

For the complete list of dependencies, see [requirements.txt](requirements.txt).

### 3. Launch the Service

The project exposes **OpenAI-compatible** request and response formats, as well as a simple interactive web page.

#### Launch Command

```bash
# Single GPU
CUDA_VISIBLE_DEVICES=0 python server.py --model /path/to/S1-Omni-Image-Preview --port 8000

# Multiple GPUs
CUDA_VISIBLE_DEVICES=0,1 python server.py --model /path/to/S1-Omni-Image-Preview --port 8000
```

**Service configuration parameters:**
- `--config`: path to the model configuration file
- `--port`: service port (default 8000)
- `--host`: service address (default 0.0.0.0)

#### Access the Page

After the service starts, open the following address in a browser:

```
http://localhost:8000
```

<div align="center">

<img src="assets/figure_web.png">

</div>

#### API Call Example

**Image Generation Task:**

```python
import requests
import base64

url = "http://localhost:8000/v1/chat/completions"
headers = {"Content-Type": "application/json"}

data = {
    "model": "s1-omni-image-preview",
    "messages": [
        {
            "role": "user",
            "content": "Generate a scientific illustration showing the DNA double helix structure"
        }
    ],
    "height": 1024,
    "width": 1024,
    "num_inference_steps": 50
}

response = requests.post(url, headers=headers, json=data)
result = response.json()

content = result["choices"][0]["message"]["content"]

if isinstance(content, str):
    print("Text response:", content)
elif isinstance(content, list):
    for part in content:
        if part["type"] == "text":
            print("Text response:", part["text"])
        elif part["type"] == "image_url":
            # Extract the base64 payload from "data:image/png;base64,<actual_base64>"
            image_url = part["image_url"]["url"]
            base64_str = image_url.split(",", 1)[1]
            image_data = base64.b64decode(base64_str)
            output_path = "output.png"
            with open(output_path, "wb") as f:
                f.write(image_data)
            print(f"Image saved: {output_path}")
```

**Image Understanding Task:**

```python
import requests
import base64

url = "http://localhost:8000/v1/chat/completions"
headers = {"Content-Type": "application/json"}

# Encode the input image as base64
with open("input.png", "rb") as f:
    image_base64 = base64.b64encode(f.read()).decode("utf-8")

data = {
    "model": "s1-omni-image-preview",
    "messages": [
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "Please describe the content of this image"},
                {"type": "image_url", "image_url": {"url": f"data:image/png;base64,{image_base64}"}}
            ]
        }
    ]
}

response = requests.post(url, headers=headers, json=data)
print(response.json()["choices"][0]["message"]["content"])
```
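
The same chat endpoint presumably also serves the image editing (TI2I) task by combining an edit instruction with a source image, as in the understanding example above. The payload below is a hypothetical sketch: the request shape mirrors the documented examples, but the editing-specific fields are assumptions, and the final `requests.post` call is left out.

```python
import base64
import json

# Placeholder bytes stand in for a real image file; in practice, read the
# source image with open("input.png", "rb").
image_base64 = base64.b64encode(b"\x89PNG placeholder").decode("utf-8")

# Hypothetical TI2I (image editing) request, mirroring the examples above.
data = {
    "model": "s1-omni-image-preview",
    "messages": [
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "Change the arrows in this flowchart to blue"},
                {"type": "image_url", "image_url": {"url": f"data:image/png;base64,{image_base64}"}}
            ]
        }
    ],
    "height": 1024,
    "width": 1024
}

# Send with: requests.post("http://localhost:8000/v1/chat/completions",
#                          headers={"Content-Type": "application/json"}, json=data)
payload = json.dumps(data)
```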

For more API usage examples, see [docs/API_GUIDE.md](docs/API_GUIDE.md).

## ⚠️ Limitations

Although S1-Omni-Image-Preview makes meaningful progress toward unified multimodal understanding and generation, the current version still has the following limitations, which we will continue to address in subsequent releases:

- **Safety alignment**: The model may generate inaccurate, biased, or inappropriate text and image content. The current version has not undergone comprehensive safety alignment training; we recommend adding content moderation mechanisms when deploying it in open-ended scenarios.
- **General capabilities**: The model is specifically optimized for scientific figure generation (flowcharts, architecture diagrams, schematic diagrams, etc.), so its performance on general natural image generation and its support for multi-turn dialogue may fall short of general-purpose models; its image editing capability offers limited support for fine-grained local edits.
- **Image details**: Text rendering clarity in high-resolution outputs and the alignment of fine-grained elements in complex charts still have room for improvement; text in generated images may occasionally be blurred or incorrect.

We welcome feedback and suggestions from community users to help us continuously improve the model.

## 📄 License

This project is released under the Apache License 2.0.

## 🙏 Acknowledgements

The development of S1-Omni-Image-Preview would not have been possible without the following excellent open-source projects. We express our heartfelt thanks:

- [Transformers](https://github.com/huggingface/transformers): Developed by the Hugging Face team, this state-of-the-art natural language processing and multimodal model library provides the core model loading, inference, and service deployment infrastructure for this project.
- [Diffusers](https://github.com/huggingface/diffusers): Developed by the Hugging Face team, this diffusion model library provides key support for implementing and running the Diffusion Transformer module in this project.
- [Qwen3-VL](https://github.com/QwenLM/Qwen3-VL): Developed by the Alibaba Qwen team, this multimodal vision-language model family provides Qwen3-VL-8B-Instruct, which this project uses as the initialization weights for the multimodal language model; its strong capabilities laid a solid foundation for training task-aware reasoning in S1-Omni-Image-Preview.
- [Qwen-Image](https://github.com/QwenLM/Qwen-Image): Developed by the Alibaba Qwen team, this image generation model informed the design and training of this project's image generation and editing modules.

## 📚 Citation

If you use S1-Omni-Image-Preview in your research, please cite our work:

```bibtex
@software{s1-omni-image-2026,
  title = {S1-Omni-Image: A Unified Multimodal Model for Scientific Image Understanding and Generation},
  author = {ScienceOne Team},
  year = {2026},
  organization = {Institute of Automation, Chinese Academy of Sciences},
  url = {https://github.com/ScienceOne-AI/S1-Omni-Image}
}
```
README_zh.md ADDED