Commit · 2e012f2 · 0 Parent(s)
Duplicate from deepgenteam/DeepGen-1.0
Co-authored-by: Alex Wang(SII) <Alex11556666@users.noreply.huggingface.co>
- .gitattributes +52 -0
- README.md +116 -0
- docs/arch.png +3 -0
- docs/bubble_chart.png +3 -0
- docs/teaser.png +3 -0
.gitattributes
ADDED
|
@@ -0,0 +1,52 @@
| 1 |
+
*.7z filter=lfs diff=lfs merge=lfs -text
|
| 2 |
+
*.arrow filter=lfs diff=lfs merge=lfs -text
|
| 3 |
+
*.bin filter=lfs diff=lfs merge=lfs -text
|
| 4 |
+
*.bz2 filter=lfs diff=lfs merge=lfs -text
|
| 5 |
+
*.ckpt filter=lfs diff=lfs merge=lfs -text
|
| 6 |
+
*.ftz filter=lfs diff=lfs merge=lfs -text
|
| 7 |
+
*.gz filter=lfs diff=lfs merge=lfs -text
|
| 8 |
+
*.h5 filter=lfs diff=lfs merge=lfs -text
|
| 9 |
+
*.joblib filter=lfs diff=lfs merge=lfs -text
|
| 10 |
+
*.lfs.* filter=lfs diff=lfs merge=lfs -text
|
| 11 |
+
*.mlmodel filter=lfs diff=lfs merge=lfs -text
|
| 12 |
+
*.model filter=lfs diff=lfs merge=lfs -text
|
| 13 |
+
*.msgpack filter=lfs diff=lfs merge=lfs -text
|
| 14 |
+
*.npy filter=lfs diff=lfs merge=lfs -text
|
| 15 |
+
*.npz filter=lfs diff=lfs merge=lfs -text
|
| 16 |
+
*.onnx filter=lfs diff=lfs merge=lfs -text
|
| 17 |
+
*.ot filter=lfs diff=lfs merge=lfs -text
|
| 18 |
+
*.parquet filter=lfs diff=lfs merge=lfs -text
|
| 19 |
+
*.pb filter=lfs diff=lfs merge=lfs -text
|
| 20 |
+
*.pickle filter=lfs diff=lfs merge=lfs -text
|
| 21 |
+
*.pkl filter=lfs diff=lfs merge=lfs -text
|
| 22 |
+
*.pt filter=lfs diff=lfs merge=lfs -text
|
| 23 |
+
*.pth filter=lfs diff=lfs merge=lfs -text
|
| 24 |
+
*.rar filter=lfs diff=lfs merge=lfs -text
|
| 25 |
+
*.safetensors filter=lfs diff=lfs merge=lfs -text
|
| 26 |
+
saved_model/**/* filter=lfs diff=lfs merge=lfs -text
|
| 27 |
+
*.tar.* filter=lfs diff=lfs merge=lfs -text
|
| 28 |
+
*.tar filter=lfs diff=lfs merge=lfs -text
|
| 29 |
+
*.tflite filter=lfs diff=lfs merge=lfs -text
|
| 30 |
+
*.tgz filter=lfs diff=lfs merge=lfs -text
|
| 31 |
+
*.wasm filter=lfs diff=lfs merge=lfs -text
|
| 32 |
+
*.xz filter=lfs diff=lfs merge=lfs -text
|
| 33 |
+
*.zip filter=lfs diff=lfs merge=lfs -text
|
| 34 |
+
*.zst filter=lfs diff=lfs merge=lfs -text
|
| 35 |
+
*tfevents* filter=lfs diff=lfs merge=lfs -text
|
| 36 |
+
arch.png filter=lfs diff=lfs merge=lfs -text
|
| 37 |
+
bubble_chart.png filter=lfs diff=lfs merge=lfs -text
|
| 38 |
+
teaser.png filter=lfs diff=lfs merge=lfs -text
|
| 39 |
+
DeepGen_CKPT.zip.part-00000 filter=lfs diff=lfs merge=lfs -text
|
| 40 |
+
DeepGen_CKPT.zip.part-00001 filter=lfs diff=lfs merge=lfs -text
|
| 41 |
+
DeepGen_CKPT.zip.part-00002 filter=lfs diff=lfs merge=lfs -text
|
| 42 |
+
DeepGen_CKPT.zip.part-00003 filter=lfs diff=lfs merge=lfs -text
|
| 43 |
+
DeepGen_CKPT.zip.part-00004 filter=lfs diff=lfs merge=lfs -text
|
| 44 |
+
DeepGen_CKPT.zip.part-00005 filter=lfs diff=lfs merge=lfs -text
|
| 45 |
+
DeepGen_CKPT.zip.part-00006 filter=lfs diff=lfs merge=lfs -text
|
| 46 |
+
DeepGen_CKPT.zip.part-00007 filter=lfs diff=lfs merge=lfs -text
|
| 47 |
+
DeepGen_CKPT.zip.part-00008 filter=lfs diff=lfs merge=lfs -text
|
| 48 |
+
DeepGen_CKPT.zip.part-00009 filter=lfs diff=lfs merge=lfs -text
|
| 49 |
+
DeepGen_CKPT.zip.part-00010 filter=lfs diff=lfs merge=lfs -text
|
| 50 |
+
docs/bubble_chart.png filter=lfs diff=lfs merge=lfs -text
|
| 51 |
+
docs/arch.png filter=lfs diff=lfs merge=lfs -text
|
| 52 |
+
docs/teaser.png filter=lfs diff=lfs merge=lfs -text
|
README.md
ADDED
|
@@ -0,0 +1,116 @@
| 1 |
+
---
|
| 2 |
+
license: apache-2.0
|
| 3 |
+
datasets:
|
| 4 |
+
- Alex11556666/Reason_Tuning
|
| 5 |
+
base_model:
|
| 6 |
+
- Qwen/Qwen2.5-VL-3B-Instruct
|
| 7 |
+
pipeline_tag: image-to-image
|
| 8 |
+
---
|
| 9 |
+
|
| 10 |
+
> **DeepGen 1.0 Checkpoints**
|
| 11 |
+
>
|
| 12 |
+
> | Stage | Repository | Description |
|
| 13 |
+
> | :--- | :--- | :--- |
|
| 14 |
+
> | Pretrain | [deepgenteam/DeepGen-1.0-Pretrain](https://huggingface.co/deepgenteam/DeepGen-1.0-Pretrain) | Alignment pre-training checkpoint |
|
| 15 |
+
> | SFT | [deepgenteam/DeepGen-1.0-SFT](https://huggingface.co/deepgenteam/DeepGen-1.0-SFT) | Supervised fine-tuning checkpoint |
|
| 16 |
+
> | **RL** | **[deepgenteam/DeepGen-1.0](https://huggingface.co/deepgenteam/DeepGen-1.0)** | Reinforcement learning checkpoint (MR-GDPO) *(this repo)* |
|
| 17 |
+
|
| 18 |
+
# 💡 DeepGen 1.0: A Lightweight Unified Multimodal Model for Advancing Image Generation and Editing
|
| 19 |
+
<p align="left">
|
| 20 |
+
<a href="http://arxiv.org/abs/2602.12205">
|
| 21 |
+
<img
|
| 22 |
+
src="https://img.shields.io/badge/DeepGen%201.0-Paper-red?logo=arxiv&logoColor=red" style="display: inline-block; vertical-align: middle;"
|
| 23 |
+
alt="DeepGen 1.0 Paper on arXiv"
|
| 24 |
+
/>
|
| 25 |
+
</a>
|
| 26 |
+
<a href="https://github.com/deepgenteam/deepgen" target="_blank" style="margin: 2px;">
|
| 27 |
+
<img
|
| 28 |
+
src="https://img.shields.io/badge/DeepGen%201.0-Codebase-536af5?color=536af5&logo=github" style="display: inline-block; vertical-align: middle;"
|
| 29 |
+
alt="DeepGen 1.0 Codebase"
|
| 30 |
+
/>
|
| 31 |
+
</a>
|
| 32 |
+
<a href="https://deepgenteam.github.io/" target="_blank" style="margin: 2px;">
|
| 33 |
+
<img
|
| 34 |
+
src="https://img.shields.io/badge/Website-project%20page-orange" style="display: inline-block; vertical-align: middle;"
|
| 35 |
+
alt="DeepGen 1.0 page"
|
| 36 |
+
/>
|
| 37 |
+
</a>
|
| 38 |
+
</p>
|
| 39 |
+
DeepGen 1.0 is a lightweight unified multimodal model with only 5B parameters (3B VLM + 2B DiT). It integrates five core capabilities (general image generation, general image editing, reasoning image generation, reasoning image editing, and text rendering) within a single model. Across multiple authoritative benchmarks, DeepGen 1.0 is competitive with or surpasses state-of-the-art unified multimodal models that are 3× to 16× larger, demonstrating that massive scaling is not the sole path to high-performance multimodal generation.
|
| 40 |
+
<p align="left"><img src="docs/bubble_chart.png" width="80%"></p>
|
| 41 |
+
|
| 42 |
+
## 🧠 Method
|
| 43 |
+
Our core observation is that a lightweight model, when empowered by synergistic architecture design and data-centric training strategies, can achieve comprehensive capabilities competitive with or even surpassing much larger counterparts.
|
| 44 |
+
To overcome the limitations of lightweight models in semantic understanding and fine-grained control, we introduce **Stacked Channel Bridging (SCB)**, a deep alignment framework that extracts hierarchical features from multiple VLM layers and fuses them with learnable "think tokens" to provide the generative backbone with structured, reasoning-rich guidance.
|
| 45 |
+
We further design a data-centric training strategy spanning three progressive stages: (1) **Alignment Pre-training** on large-scale image-text pairs and editing triplets to synchronize VLM and DiT representations, (2) **Joint Supervised Fine-tuning** on a high-quality mixture of generation, editing, and reasoning tasks to foster omni-capabilities, and (3) **Reinforcement Learning with MR-GRPO**, which leverages a mixture of reward functions and supervision signals, resulting in substantial gains in generation quality and alignment with human preferences, while maintaining stable training progress and avoiding visual artifacts.
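To make the third stage more concrete, here is a minimal sketch of the generic GRPO-style computation that a mixture-of-rewards setup could build on: each sampled output gets a weighted sum of several reward scores, and rewards are then standardized within the sampling group to obtain advantages. This is not the authors' MR-GRPO implementation. The reward names, the weights, and the helper functions are hypothetical, and the README does not say which rewards are mixed or how.

```python
from statistics import mean, pstdev


def mixed_reward(scores: dict, weights: dict) -> float:
    """Weighted sum of the individual reward scores for one sample."""
    return sum(weights[name] * scores[name] for name in weights)


def group_relative_advantages(rewards, eps=1e-8):
    """GRPO-style advantages: standardize rewards within one sampling group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical reward names and weights, chosen only for illustration.
weights = {"aesthetic": 0.5, "text_alignment": 0.5}

# One group: four images sampled for the same prompt, each scored twice.
group = [
    {"aesthetic": 0.2, "text_alignment": 0.4},
    {"aesthetic": 0.6, "text_alignment": 0.4},
    {"aesthetic": 0.6, "text_alignment": 0.8},
    {"aesthetic": 1.0, "text_alignment": 0.8},
]

rewards = [mixed_reward(scores, weights) for scores in group]
advantages = group_relative_advantages(rewards)

if __name__ == "__main__":
    for r, a in zip(rewards, advantages):
        print(f"reward={r:.2f} advantage={a:+.2f}")
```

Samples that score above the group mean receive positive advantages and are reinforced; samples below the mean are pushed down.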
|
| 46 |
+
|
| 47 |
+
<p align="left"><img src="docs/arch.png" width="80%"></p>
|
| 48 |
+
|
| 49 |
+
## 📊 Benchmarks
|
| 50 |
+
|
| 51 |
+
### 1. General Image Generation
|
| 52 |
+
| Model | Params | Geneval ↑ | DPGBench ↑ | UniGenBench ↑ |
|
| 53 |
+
| --------------------- | ----------- | ----------- | ------------ | ------------- |
|
| 54 |
+
| OmniGen2 | 3B + 4B | 0.80 | 83.57 | 63.09 |
|
| 55 |
+
| BAGEL | 14B | 0.82 | 85.10 | 61.53 |
|
| 56 |
+
| X-Omni | 7B + 12B | 0.83 | 87.65 🥉 | 53.77 |
|
| 57 |
+
| Lumina-DiMOO | 8B | 0.88 🥇 | 86.04 | 71.12 |
|
| 58 |
+
| Hunyuan-Image-3.0 | 80B | 0.72 | 86.10 | - |
|
| 59 |
+
| Qwen-Image | 7B + 20B | 0.87 🥈 | 88.32 🥇 | 78.81 🥇 |
|
| 60 |
+
| LongCat-Image | 7B + 6B | 0.87 🥈 | 86.80 | - |
|
| 61 |
+
| Z-Image-Turbo | 4B + 6B | 0.84 | 85.15 | 71.40 |
|
| 62 |
+
| GLM-Image | 9B + 7B | - | 84.78 | - |
|
| 63 |
+
| **DeepGen 1.0 (SFT)** | **3B + 2B** | 0.86 🥉 | 87.05 | 74.18 🥉 |
|
| 64 |
+
| **DeepGen 1.0 (RL)** | **3B + 2B** | 0.87 🥈 | 87.90 🥈 | 75.74 🥈 |
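The size comparison behind this table can be reproduced from the Params column. The sketch below assumes that a spec such as `7B + 20B` means the component sizes add up, and it divides each baseline total by DeepGen 1.0's 3B + 2B. The helper `total_params_b` is hypothetical; the numbers are copied from the table.

```python
def total_params_b(spec: str) -> float:
    """Sum a spec such as '7B + 20B' into billions of parameters."""
    return sum(float(part.strip().rstrip("B")) for part in spec.split("+"))


# DeepGen 1.0: 3B VLM + 2B DiT, as stated in the README.
DEEPGEN = total_params_b("3B + 2B")

# Params column of the general image generation table above.
baselines = {
    "OmniGen2": "3B + 4B",
    "BAGEL": "14B",
    "X-Omni": "7B + 12B",
    "Lumina-DiMOO": "8B",
    "Hunyuan-Image-3.0": "80B",
    "Qwen-Image": "7B + 20B",
    "LongCat-Image": "7B + 6B",
    "Z-Image-Turbo": "4B + 6B",
    "GLM-Image": "9B + 7B",
}

# Size of each baseline relative to DeepGen 1.0.
ratios = {name: total_params_b(spec) / DEEPGEN for name, spec in baselines.items()}

if __name__ == "__main__":
    for name, ratio in sorted(ratios.items(), key=lambda kv: kv[1]):
        print(f"{name:<18} {ratio:.1f}x")
```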
|
| 65 |
+
|
| 66 |
+
|
| 67 |
+
|
| 68 |
+
### 2. General Image Editing
|
| 69 |
+
|
| 70 |
+
| Model | Params | GEdit-EN ↑ | ImgEdit ↑ |
|
| 71 |
+
| :--- | :--- | :--- | :--- |
|
| 72 |
+
| BAGEL | 14B | 6.52 | 3.20 |
|
| 73 |
+
| Qwen-Image-Edit [2509] | 7B + 20B | 7.54 🥈 | 4.35 🥈 |
|
| 74 |
+
| LongCat-Image-Edit | 7B + 6B | 7.60 🥇 | 4.50 🥇 |
|
| 75 |
+
| Mammoth2 | 8B + 3B + 2B | 6.60 | 4.06 |
|
| 76 |
+
| **DeepGen 1.0 (SFT)** | **3B + 2B** | 7.12 | 4.09 |
|
| 77 |
+
| **DeepGen 1.0 (RL)** | **3B + 2B** | 7.17 🥉 | 4.14 🥉 |
|
| 78 |
+
|
| 79 |
+
### 3. Reasoning Image Generation
|
| 80 |
+
| Model | Params | WISE ↑ | T2I-CoREBench ↑ |
|
| 81 |
+
| :--- | :--- | :--- | :--- |
|
| 82 |
+
| OmniGen2 | 3B + 4B | 0.47 | 36.1 |
|
| 83 |
+
| BAGEL | 14B | 0.70 🥉 | 41.1 |
|
| 84 |
+
| Hunyuan-Image-3.0 | 80B | 0.57 | 46.0 |
|
| 85 |
+
| Qwen-Image | 7B + 20B | 0.62 | 46.3 🥉 |
|
| 86 |
+
| LongCat-Image | 7B + 6B | 0.65 | 52.2 🥇 |
|
| 87 |
+
| Z-Image-Turbo | 4B + 6B | - | 43.7 |
|
| 88 |
+
| **DeepGen 1.0 (SFT)** | **3B + 2B** | 0.72 🥈 | 45.7 |
|
| 89 |
+
| **DeepGen 1.0 (RL)** | **3B + 2B** | 0.73 🥇 | 46.5 🥈 |
|
| 90 |
+
|
| 91 |
+
### 4. Reasoning Image Editing
|
| 92 |
+
|
| 93 |
+
| Model | Params | RISE ↑ | UniREditBench ↑ |
|
| 94 |
+
| :--- | :--- | :--- | :--- |
|
| 95 |
+
| OmniGen2 | 3B + 4B | - | 43.4 |
|
| 96 |
+
| BAGEL | 14B | 11.9 🥈 | 51.0 |
|
| 97 |
+
| Qwen-Image-Edit [2509] | 7B + 20B | 8.9 | 56.5 🥉 |
|
| 98 |
+
| **DeepGen 1.0 (SFT)** | **3B + 2B** | 13.3 🥇 | 77.5 🥇 |
|
| 99 |
+
| **DeepGen 1.0 (RL)** | **3B + 2B** | 10.8 🥉 | 75.7 🥈 |
|
| 100 |
+
|
| 101 |
+
## 🎨 Quantitative results
|
| 102 |
+
<p align="left"><img src="docs/teaser.png" width="80%"></p>
|
| 103 |
+
|
| 104 |
+
## 🛠️ Usage
|
| 105 |
+
|
| 106 |
+
### Download Checkpoint
|
| 107 |
+
This repository contains the **Reinforcement Learning (MR-GDPO)** checkpoint, the final release model.
|
| 108 |
+
|
| 109 |
+
```bash
|
| 110 |
+
# Using hf CLI
|
| 111 |
+
hf download deepgenteam/DeepGen-1.0 model.pt --local-dir .
|
| 112 |
+
|
| 113 |
+
# Or from Python (run the next two lines in a Python interpreter, not in the shell)
|
| 114 |
+
from huggingface_hub import hf_hub_download
|
| 115 |
+
hf_hub_download("deepgenteam/DeepGen-1.0", "model.pt", local_dir=".")
|
| 116 |
+
```
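Since the checkpoint is a large LFS-tracked file, it can be worth confirming that the download completed intact. Git LFS identifies each object by its SHA-256 hash, so one option is to hash the local file and compare it with the value shown in the file's LFS details on the Hub. The sketch below uses only the standard library; the helper names are hypothetical, and the expected hash is something you have to copy from the repository yourself.

```python
import hashlib
from pathlib import Path


def sha256_of(path, chunk_size=1 << 20) -> str:
    """Stream a file through SHA-256 without loading it fully into memory."""
    digest = hashlib.sha256()
    with Path(path).open("rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_expected(path, expected_hex: str) -> bool:
    """Compare a file's SHA-256 with an expected hex digest."""
    return sha256_of(path) == expected_hex.strip().lower()


if __name__ == "__main__":
    # Assumes model.pt was downloaded into the current directory as above.
    checkpoint = Path("model.pt")
    if checkpoint.exists():
        print(sha256_of(checkpoint))
```

Hashing a multi-gigabyte file takes a while; the chunked read keeps memory use constant regardless of file size.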
|
docs/arch.png
ADDED
|
Git LFS Details
|
docs/bubble_chart.png
ADDED
|
Git LFS Details
|
docs/teaser.png
ADDED
|
Git LFS Details
|