---
base_model:
- Qwen/Qwen2.5-VL-3B-Instruct
datasets:
- Alex11556666/Reason_Tuning
license: apache-2.0
pipeline_tag: text-to-image
arxiv: 2602.12205
tags:
- image-generation
- image-editing
- multimodal
---
# 💡 DeepGen 1.0: A Lightweight Unified Multimodal Model for Advancing Image Generation and Editing
<p align="left">
<a href="http://arxiv.org/abs/2602.12205">
<img
src="https://img.shields.io/badge/DeepGen 1.0-Paper-red?logo=arxiv&logoColor=red" style="display: inline-block; vertical-align: middle;"
alt="DeepGen 1.0 Paper on arXiv"
/>
</a>
<a href="https://github.com/deepgenteam/deepgen" target="_blank" style="margin: 2px;">
<img
src="https://img.shields.io/badge/DeepGen 1.0-Codebase-536af5?color=536af5&logo=github" style="display: inline-block; vertical-align: middle;"
alt="DeepGen 1.0 Codebase"
/>
</a>
<a href="https://deepgenteam.github.io/" target="_blank" style="margin: 2px;">
<img
src="https://img.shields.io/badge/Website-project page-orange" style="display: inline-block; vertical-align: middle;"
alt="DeepGen 1.0 page"
/>
</a>
</p>
DeepGen 1.0 is a lightweight unified multimodal model with only 5B parameters (3B VLM + 2B DiT). It integrates five core capabilities (general image generation, general image editing, reasoning image generation, reasoning image editing, and text rendering) within a single model. Across multiple authoritative benchmarks, DeepGen 1.0 matches or surpasses state-of-the-art unified multimodal models that are 3× to 16× larger, demonstrating that massive scaling is not the sole path to high-performance multimodal generation.
**Authors**: Dianyi Wang, Ruihang Li, Feng Han, Chaofan Ma, Wei Song, Siyuan Wang, Yibin Wang, Yi Xin, Hongjian Liu, Zhixiong Zhang, Shengyuan Ding, Tianhang Wang, Zhenglin Cheng, Tao Lin, Cheng Jin, Kaicheng Yu, Jingjing Chen, Wenjie Wang, Zhongyu Wei, Jiaqi Wang.
<p align="left"><img src="bubble_chart.png" width="80%"></p>
## 🧠 Method
Our core observation is that a lightweight model, when empowered by synergistic architecture design and data-centric training strategies, can achieve comprehensive capabilities competitive with or even surpassing much larger counterparts.
To overcome the limitations of lightweight models in semantic understanding and fine-grained control, we introduce **Stacked Channel Bridging (SCB)**, a deep alignment framework that extracts hierarchical features from multiple VLM layers and fuses them with learnable "think tokens" to provide the generative backbone with structured, reasoning-rich guidance.
We further design a data-centric training strategy spanning three progressive stages: (1) **Alignment Pre-training** on large-scale image-text pairs and editing triplets to synchronize VLM and DiT representations, (2) **Joint Supervised Fine-tuning** on a high-quality mixture of generation, editing, and reasoning tasks to foster omni-capabilities, and (3) **Reinforcement Learning with MR-GRPO**, which leverages a mixture of reward functions and supervision signals, resulting in substantial gains in generation quality and alignment with human preferences, while maintaining stable training progress and avoiding visual artifacts.
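As a rough illustration of stage (3), and not the paper's exact formulation, the sketch below assumes the standard GRPO group-relative advantage and a hypothetical weighted sum of $K$ reward functions. The weights $w_k$, the reward functions $r_k$, and the group size $G$ are illustrative symbols that do not appear in the description above.

$$
R_i = \sum_{k=1}^{K} w_k \, r_k(x_i), \qquad
\hat{A}_i = \frac{R_i - \operatorname{mean}(R_1, \dots, R_G)}{\operatorname{std}(R_1, \dots, R_G)}
$$

Here $x_1, \dots, x_G$ are the $G$ images sampled for a single prompt. Under this assumption, each sample is scored relative to the other samples in its own group, so no separate value model is needed to estimate a baseline.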
<p align="left"><img src="arch.png" width="80%"></p>
## Benchmarks
### 1. General Image Generation
| Model | Params | GenEval ↑ | DPGBench ↑ | UniGenBench ↑ |
| --------------------- | ----------- | ----------- | ------------ | ------------- |
| OmniGen2 | 3B + 4B | 0.80 | 83.57 | 63.09 |
| BAGEL | 14B | 0.82 | 85.10 | 61.53 |
| X-Omni | 7B + 12B | 0.83 | 87.65 🥉 | 53.77 |
| Lumina-DiMOO | 8B | 0.88 🥇 | 86.04 | 71.12 |
| Hunyuan-Image-3.0 | 80B | 0.72 | 86.10 | - |
| Qwen-Image | 7B + 20B | 0.87 🥈 | 88.32 🥇 | 78.81 🥇 |
| LongCat-Image | 7B + 6B | 0.87 🥈 | 86.80 | - |
| Z-Image-Turbo | 4B + 6B | 0.84 | 85.15 | 71.40 |
| GLM-Image | 9B + 7B | - | 84.78 | - |
| **DeepGen 1.0 (SFT)** | **3B + 2B** | 0.86 🥉 | 87.05 | 74.18 🥉 |
| **DeepGen 1.0 (RL)** | **3B + 2B** | 0.87 🥈 | 87.90 🥈 | 75.74 🥈 |
### 2. General Image Editing
| Model | Params | GEdit-EN ↑ | ImgEdit ↑ |
| :--- | :--- | :--- | :--- |
| BAGEL | 14B | 6.52 | 3.20 |
| Qwen-Image-Edit [2509] | 7B + 20B | 7.54 🥈 | 4.35 🥈 |
| LongCat-Image-Edit | 7B + 6B | 7.60 🥇 | 4.50 🥇 |
| Mammoth2 | 8B + 3B + 2B | 6.60 | 4.06 |
| **DeepGen 1.0 (SFT)** | **3B + 2B** | 7.12 | 4.09 |
| **DeepGen 1.0 (RL)** | **3B + 2B** | 7.17 🥉 | 4.14 🥉 |
### 3. Reasoning Image Generation
| Model | Params | WISE ↑ | T2I-CoREBench ↑ |
| :--- | :--- | :--- | :--- |
| OmniGen2 | 3B + 4B | 0.47 | 36.1 |
| BAGEL | 14B | 0.70 🥉 | 41.1 |
| Hunyuan-Image-3.0 | 80B | 0.57 | 46.0 |
| Qwen-Image | 7B + 20B | 0.62 | 46.3 🥉 |
| LongCat-Image | 7B + 6B | 0.65 | 52.2 🥇 |
| Z-Image-Turbo | 4B + 6B | - | 43.7 |
| **DeepGen 1.0 (SFT)** | **3B + 2B** | 0.72 🥈 | 45.7 |
| **DeepGen 1.0 (RL)** | **3B + 2B** | 0.73 🥇 | 46.5 🥈 |
### 4. Reasoning Image Editing
| Model | Params | RISE ↑ | UniREditBench ↑ |
| :--- | :--- | :--- | :--- |
| OmniGen2 | 3B + 4B | - | 43.4 |
| BAGEL | 14B | 11.9 🥈 | 51.0 |
| Qwen-Image-Edit [2509] | 7B + 20B | 8.9 | 56.5 🥉 |
| **DeepGen 1.0 (SFT)** | **3B + 2B** | 13.3 🥇 | 77.5 🥇 |
| **DeepGen 1.0 (RL)** | **3B + 2B** | 10.8 🥉 | 75.7 🥈 |
## 🎨 Quantitative results
<p align="left"><img src="teaser.png" width="80%"></p>
## 🛠️ Usage
### Merge ZIP Files
To use the DeepGen checkpoints, first merge the split archive parts into a single ZIP file and extract it. We release Pre-training, Supervised Fine-Tuning, and Reinforcement Learning checkpoints.
```bash
# Merge zip
cat DeepGen_CKPT.zip.part-* > DeepGen_CKPT.zip
# Unzip DeepGen checkpoints
unzip DeepGen_CKPT.zip
```
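If a part is missing or still downloading, the plain `cat` above will quietly produce a truncated archive. The helper below is a minimal sketch of a more defensive merge; the function name `merge_parts` is hypothetical and not part of the DeepGen codebase. Like the command above, it relies on the shell's glob ordering, so it assumes the part suffixes sort in the order they were split.

```bash
# merge_parts PREFIX OUTPUT  (hypothetical helper, not shipped with DeepGen)
# Concatenates every file matching PREFIX* in glob order into OUTPUT,
# and fails early when no part files are found.
merge_parts() {
  local prefix="$1" output="$2"
  local parts=("${prefix}"*)

  # An unmatched glob either stays literal (default) or expands to nothing
  # (nullglob), so cover both cases before touching OUTPUT.
  if [ "${#parts[@]}" -eq 0 ] || [ ! -e "${parts[0]}" ]; then
    echo "no files match ${prefix}*" >&2
    return 1
  fi

  cat "${parts[@]}" > "${output}"
  echo "merged ${#parts[@]} part(s) into ${output}"
}
```

With the function defined, `merge_parts DeepGen_CKPT.zip.part- DeepGen_CKPT.zip && unzip -t DeepGen_CKPT.zip` merges the parts and then asks `unzip` to test the archive's integrity before you extract it.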
## Citation
```bibtex
@article{wang2026deepgen10alightweightunified,
title = {DeepGen 1.0: A Lightweight Unified Multimodal Model for Advancing Image Generation and Editing},
author = {Dianyi Wang and Ruihang Li and Feng Han and Chaofan Ma and Wei Song and Siyuan Wang and Yibin Wang and Yi Xin and Hongjian Liu and Zhixiong Zhang and Shengyuan Ding and Tianhang Wang and Zhenglin Cheng and Tao Lin and Cheng Jin and Kaicheng Yu and Jingjing Chen and Wenjie Wang and Zhongyu Wei and Jiaqi Wang},
year = {2026},
journal = {arXiv preprint arXiv:2602.12205}
}
```