---
license: apache-2.0
datasets:
- Alex11556666/Reason_Tuning
base_model:
- Qwen/Qwen2.5-VL-3B-Instruct
pipeline_tag: image-to-image
---

# πŸ’‘ DeepGen 1.0: A Lightweight Unified Multimodal Model for Advancing Image Generation and Editing

<p align="left">
  <a href="http://arxiv.org/abs/2602.12205">
    <img
      src="https://img.shields.io/badge/DeepGen 1.0-Paper-red?logo=arxiv&logoColor=red" style="display: inline-block; vertical-align: middle;"
      alt="DeepGen 1.0 Paper on arXiv"
    />
  </a>
  <a href="https://github.com/deepgenteam/deepgen" target="_blank" style="margin: 2px;">
    <img
      src="https://img.shields.io/badge/DeepGen 1.0-Codebase-536af5?color=536af5&logo=github" style="display: inline-block; vertical-align: middle;"
      alt="DeepGen 1.0 Codebase"
    />
  </a>
  <a href="https://deepgenteam.github.io/" target="_blank" style="margin: 2px;">
    <img
      src="https://img.shields.io/badge/Website-project page-orange" style="display: inline-block; vertical-align: middle;"
      alt="DeepGen 1.0 page"
    />
  </a>
</p>

DeepGen 1.0 is a lightweight unified multimodal model with only 5B parameters (3B VLM + 2B DiT). It integrates five core capabilities within a single model: general image generation, general image editing, reasoning image generation, reasoning image editing, and text rendering. Across multiple authoritative benchmarks, DeepGen 1.0 is competitive with or surpasses state-of-the-art unified multimodal models that are 3Γ— to 16Γ— larger, demonstrating that massive scaling is not the sole path to high-performance multimodal generation.

<p align="left"><img src="bubble_chart.png" width="80%"></p>

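The 3Γ—-to-16Γ— scale comparison above can be sanity-checked with quick arithmetic, using total parameter counts as reported in the benchmark tables (treating a "7B + 20B" pipeline as 27B total):

```python
# Total parameter counts (in billions) from the benchmark tables.
deepgen = 3 + 2  # 3B VLM + 2B DiT

competitors = {
    "BAGEL": 14,
    "Qwen-Image": 7 + 20,
    "Hunyuan-Image-3.0": 80,
}

for name, size in competitors.items():
    print(f"{name}: {size / deepgen:.1f}x the size of DeepGen 1.0")
# BAGEL: 2.8x the size of DeepGen 1.0
# Qwen-Image: 5.4x the size of DeepGen 1.0
# Hunyuan-Image-3.0: 16.0x the size of DeepGen 1.0
```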
## 🧠 Method

Our core observation is that a lightweight model, when empowered by synergistic architecture design and data-centric training strategies, can achieve comprehensive capabilities competitive with or even surpassing those of much larger counterparts.

To overcome the limitations of lightweight models in semantic understanding and fine-grained control, we introduce **Stacked Channel Bridging (SCB)**, a deep alignment framework that extracts hierarchical features from multiple VLM layers and fuses them with learnable "think tokens" to provide the generative backbone with structured, reasoning-rich guidance.

We further design a data-centric training strategy spanning three progressive stages: (1) **Alignment Pre-training** on large-scale image-text pairs and editing triplets to synchronize VLM and DiT representations; (2) **Joint Supervised Fine-tuning** on a high-quality mixture of generation, editing, and reasoning tasks to foster omni-capabilities; and (3) **Reinforcement Learning with MR-GRPO**, which leverages a mixture of reward functions and supervision signals, yielding substantial gains in generation quality and alignment with human preferences while maintaining stable training and avoiding visual artifacts.

<p align="left"><img src="arch.png" width="80%"></p>

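The SCB fusion described above can be sketched in toy form. This is an illustrative NumPy mock-up, not the released implementation: the layer indices, feature widths, number of think tokens, and the single linear projection are hypothetical stand-ins for the actual learned modules.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical hidden states from several VLM layers
# (batch=1, seq_len=16, hidden=32), plus a few "think tokens"
# that would be learnable parameters in the real model.
layer_ids = (8, 16, 24)
vlm_layers = {i: rng.normal(size=(1, 16, 32)) for i in layer_ids}
think_tokens = rng.normal(size=(1, 4, 32))

# Stack the selected layers along the channel dimension ...
stacked = np.concatenate([vlm_layers[i] for i in layer_ids], axis=-1)  # (1, 16, 96)

# ... and project back to the DiT conditioning width with a learned map.
w_proj = rng.normal(size=(96, 32)) / np.sqrt(96)
bridged = stacked @ w_proj  # (1, 16, 32)

# Prepend the think tokens so the generative backbone receives
# structured, reasoning-rich guidance alongside the fused features.
condition = np.concatenate([think_tokens, bridged], axis=1)
print(condition.shape)  # (1, 20, 32)
```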
## πŸ“Š Benchmarks

### 1. General Image Generation

| Model | Params | GenEval ↑ | DPGBench ↑ | UniGenBench ↑ |
| :--- | :--- | :--- | :--- | :--- |
| OmniGen2 | 3B + 4B | 0.80 | 83.57 | 63.09 |
| BAGEL | 14B | 0.82 | 85.10 | 61.53 |
| X-Omni | 7B + 12B | 0.83 | 87.65 πŸ₯‰ | 53.77 |
| Lumina-DiMOO | 8B | 0.88 πŸ₯‡ | 86.04 | 71.12 |
| Hunyuan-Image-3.0 | 80B | 0.72 | 86.10 | β€” |
| Qwen-Image | 7B + 20B | 0.87 πŸ₯ˆ | 88.32 πŸ₯‡ | 78.81 πŸ₯‡ |
| LongCat-Image | 7B + 6B | 0.87 πŸ₯ˆ | 86.80 | β€” |
| Z-Image-Turbo | 4B + 6B | 0.84 | 85.15 | 71.40 |
| GLM-Image | 9B + 7B | β€” | 84.78 | β€” |
| **DeepGen 1.0 (SFT)** | **3B + 2B** | 0.86 πŸ₯‰ | 87.05 | 74.18 πŸ₯‰ |
| **DeepGen 1.0 (RL)** | **3B + 2B** | 0.87 πŸ₯ˆ | 87.90 πŸ₯ˆ | 75.74 πŸ₯ˆ |

### 2. General Image Editing

| Model | Params | GEdit-EN ↑ | ImgEdit ↑ |
| :--- | :--- | :--- | :--- |
| BAGEL | 14B | 6.52 | 3.20 |
| Qwen-Image-Edit [2509] | 7B + 20B | 7.54 πŸ₯ˆ | 4.35 πŸ₯ˆ |
| LongCat-Image-Edit | 7B + 6B | 7.60 πŸ₯‡ | 4.50 πŸ₯‡ |
| Mammoth2 | 8B + 3B + 2B | 6.60 | 4.06 |
| **DeepGen 1.0 (SFT)** | **3B + 2B** | 7.12 | 4.09 |
| **DeepGen 1.0 (RL)** | **3B + 2B** | 7.17 πŸ₯‰ | 4.14 πŸ₯‰ |

### 3. Reasoning Image Generation

| Model | Params | WISE ↑ | T2I-CoREBench ↑ |
| :--- | :--- | :--- | :--- |
| OmniGen2 | 3B + 4B | 0.47 | 36.1 |
| BAGEL | 14B | 0.70 πŸ₯‰ | 41.1 |
| Hunyuan-Image-3.0 | 80B | 0.57 | 46.0 |
| Qwen-Image | 7B + 20B | 0.62 | 46.3 πŸ₯‰ |
| LongCat-Image | 7B + 6B | 0.65 | 52.2 πŸ₯‡ |
| Z-Image-Turbo | 4B + 6B | β€” | 43.7 |
| **DeepGen 1.0 (SFT)** | **3B + 2B** | 0.72 πŸ₯ˆ | 45.7 |
| **DeepGen 1.0 (RL)** | **3B + 2B** | 0.73 πŸ₯‡ | 46.5 πŸ₯ˆ |

### 4. Reasoning Image Editing

| Model | Params | RISE ↑ | UniREditBench ↑ |
| :--- | :--- | :--- | :--- |
| OmniGen2 | 3B + 4B | β€” | 43.4 |
| BAGEL | 14B | 11.9 πŸ₯ˆ | 51.0 |
| Qwen-Image-Edit [2509] | 7B + 20B | 8.9 | 56.5 πŸ₯‰ |
| **DeepGen 1.0 (SFT)** | **3B + 2B** | 13.3 πŸ₯‡ | 77.5 πŸ₯‡ |
| **DeepGen 1.0 (RL)** | **3B + 2B** | 10.8 πŸ₯‰ | 75.7 πŸ₯ˆ |
## 🎨 Qualitative results

<p align="left"><img src="teaser.png" width="80%"></p>
## πŸ› οΈ Usage

### Merge ZIP Files

To use the DeepGen checkpoints, first merge the sharded model files. We release Pre-training, Supervised Fine-Tuning, and Reinforcement Learning checkpoints.

```bash
# Reassemble the zip archive from its parts
cat DeepGen_CKPT.zip.part-* > DeepGen_CKPT.zip
# Unzip the DeepGen checkpoints
unzip DeepGen_CKPT.zip
```
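On systems without `cat` and `unzip` (e.g. plain Windows), the same merge-and-extract flow can be reproduced in Python. The snippet below is a self-contained demo that fabricates two small shard files to show the logic; with the real checkpoints you would glob `DeepGen_CKPT.zip.part-*` instead of the demo names.

```python
import io
import zipfile
from pathlib import Path

# Demo setup: build a tiny zip and split it into two "part" files,
# mimicking the naming pattern of the released shards.
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as zf:
    zf.writestr("hello.txt", "checkpoint placeholder")
data = buf.getvalue()

half = len(data) // 2
Path("demo.zip.part-aa").write_bytes(data[:half])
Path("demo.zip.part-ab").write_bytes(data[half:])

# Merge: concatenate the parts in sorted order (equivalent of `cat ... >`).
parts = sorted(Path(".").glob("demo.zip.part-*"))
merged = b"".join(p.read_bytes() for p in parts)
Path("demo.zip").write_bytes(merged)

# Extract the merged archive (equivalent of `unzip`).
with zipfile.ZipFile("demo.zip") as zf:
    assert zf.read("hello.txt") == b"checkpoint placeholder"
print("merge OK")
```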