Revert to 713c1a6a33a2
- .gitattributes +0 -3
- DeepGen_CKPT.zip.part-00000 +3 -0
- DeepGen_CKPT.zip.part-00001 +3 -0
- DeepGen_CKPT.zip.part-00002 +3 -0
- DeepGen_CKPT.zip.part-00003 +3 -0
- DeepGen_CKPT.zip.part-00004 +3 -0
- DeepGen_CKPT.zip.part-00005 +3 -0
- DeepGen_CKPT.zip.part-00006 +3 -0
- DeepGen_CKPT.zip.part-00007 +3 -0
- DeepGen_CKPT.zip.part-00008 +3 -0
- DeepGen_CKPT.zip.part-00009 +3 -0
- DeepGen_CKPT.zip.part-00010 +3 -0
- README.md +10 -20
- docs/arch.png → arch.png +0 -0
- docs/bubble_chart.png → bubble_chart.png +0 -0
- docs/teaser.png → teaser.png +0 -0
.gitattributes
CHANGED
|
@@ -47,6 +47,3 @@ DeepGen_CKPT.zip.part-00007 filter=lfs diff=lfs merge=lfs -text
|
|
| 47 |
DeepGen_CKPT.zip.part-00008 filter=lfs diff=lfs merge=lfs -text
|
| 48 |
DeepGen_CKPT.zip.part-00009 filter=lfs diff=lfs merge=lfs -text
|
| 49 |
DeepGen_CKPT.zip.part-00010 filter=lfs diff=lfs merge=lfs -text
|
| 50 |
-
docs/bubble_chart.png filter=lfs diff=lfs merge=lfs -text
|
| 51 |
-
docs/arch.png filter=lfs diff=lfs merge=lfs -text
|
| 52 |
-
docs/teaser.png filter=lfs diff=lfs merge=lfs -text
|
|
|
|
| 47 |
DeepGen_CKPT.zip.part-00008 filter=lfs diff=lfs merge=lfs -text
|
| 48 |
DeepGen_CKPT.zip.part-00009 filter=lfs diff=lfs merge=lfs -text
|
| 49 |
DeepGen_CKPT.zip.part-00010 filter=lfs diff=lfs merge=lfs -text
|
DeepGen_CKPT.zip.part-00000
ADDED
|
@@ -0,0 +1,3 @@
|
| 1 |
+
version https://git-lfs.github.com/spec/v1
|
| 2 |
+
oid sha256:49f94464b6b16f559dc6b09e31c2abe020a7cd7f1de4c4d5b046e307c1814776
|
| 3 |
+
size 5368709120
|
DeepGen_CKPT.zip.part-00001
ADDED
|
@@ -0,0 +1,3 @@
|
| 1 |
+
version https://git-lfs.github.com/spec/v1
|
| 2 |
+
oid sha256:9ca0fd771ae4061dc2d0d36f62b50bca675beab9b3dca424119d828a1b23b28e
|
| 3 |
+
size 5368709120
|
DeepGen_CKPT.zip.part-00002
ADDED
|
@@ -0,0 +1,3 @@
|
| 1 |
+
version https://git-lfs.github.com/spec/v1
|
| 2 |
+
oid sha256:16671fcd9f92a879b831d2297a281757b473546ab758679a170de792abd787b6
|
| 3 |
+
size 5368709120
|
DeepGen_CKPT.zip.part-00003
ADDED
|
@@ -0,0 +1,3 @@
|
| 1 |
+
version https://git-lfs.github.com/spec/v1
|
| 2 |
+
oid sha256:b8022e2693aa690ad14e184616d7ef0d49dd519fc423a71582aa9c8d10b2136f
|
| 3 |
+
size 5368709120
|
DeepGen_CKPT.zip.part-00004
ADDED
|
@@ -0,0 +1,3 @@
|
| 1 |
+
version https://git-lfs.github.com/spec/v1
|
| 2 |
+
oid sha256:c8fd4a4c59212d45a35006f4b4dcd03f8dec684c46d852f3b720dff0bdddd9a4
|
| 3 |
+
size 5368709120
|
DeepGen_CKPT.zip.part-00005
ADDED
|
@@ -0,0 +1,3 @@
|
| 1 |
+
version https://git-lfs.github.com/spec/v1
|
| 2 |
+
oid sha256:bd0fba2bd69be06c705779a6c1a9a1d41366e0310ef696eae944872083109a2a
|
| 3 |
+
size 5368709120
|
DeepGen_CKPT.zip.part-00006
ADDED
|
@@ -0,0 +1,3 @@
|
| 1 |
+
version https://git-lfs.github.com/spec/v1
|
| 2 |
+
oid sha256:db9573c3dc0d980450fc8e51e19b9c36971bf16d91ea560a9a9edf54824b672b
|
| 3 |
+
size 5368709120
|
DeepGen_CKPT.zip.part-00007
ADDED
|
@@ -0,0 +1,3 @@
|
| 1 |
+
version https://git-lfs.github.com/spec/v1
|
| 2 |
+
oid sha256:5eb739f5b16b92057487b7e9055119fb532872caf9857c13f76ea49fa45832ea
|
| 3 |
+
size 5368709120
|
DeepGen_CKPT.zip.part-00008
ADDED
|
@@ -0,0 +1,3 @@
|
| 1 |
+
version https://git-lfs.github.com/spec/v1
|
| 2 |
+
oid sha256:ce4eac2c5010cc7067f556344eaa4e93e1b6df1986eeaa9ced1aa4f20e3411f0
|
| 3 |
+
size 5368709120
|
DeepGen_CKPT.zip.part-00009
ADDED
|
@@ -0,0 +1,3 @@
|
| 1 |
+
version https://git-lfs.github.com/spec/v1
|
| 2 |
+
oid sha256:c9cd1e26bcb9ca5dba536c1e86bde3619c3f8f9d0bb80fa705994bbdcb7c4f00
|
| 3 |
+
size 5368709120
|
DeepGen_CKPT.zip.part-00010
ADDED
|
@@ -0,0 +1,3 @@
|
| 1 |
+
version https://git-lfs.github.com/spec/v1
|
| 2 |
+
oid sha256:6ce439a0e4c6dea344e5736e2b09dd17e835047eda9757eb58ebca485f27f506
|
| 3 |
+
size 1993873384
|
README.md
CHANGED
|
@@ -7,14 +7,6 @@ base_model:
|
|
| 7 |
pipeline_tag: image-to-image
|
| 8 |
---
|
| 9 |
|
| 10 |
-
> **DeepGen 1.0 Checkpoints**
|
| 11 |
-
>
|
| 12 |
-
> | Stage | Repository | Description |
|
| 13 |
-
> | :--- | :--- | :--- |
|
| 14 |
-
> | Pretrain | [deepgenteam/DeepGen-1.0-Pretrain](https://huggingface.co/deepgenteam/DeepGen-1.0-Pretrain) | Alignment pre-training checkpoint |
|
| 15 |
-
> | SFT | [deepgenteam/DeepGen-1.0-SFT](https://huggingface.co/deepgenteam/DeepGen-1.0-SFT) | Supervised fine-tuning checkpoint |
|
| 16 |
-
> | **RL** | **[deepgenteam/DeepGen-1.0](https://huggingface.co/deepgenteam/DeepGen-1.0)** | Reinforcement learning checkpoint (MR-GDPO) *(this repo)* |
|
| 17 |
-
|
| 18 |
# 💡 DeepGen 1.0: A Lightweight Unified Multimodal Model for Advancing Image Generation and Editing
|
| 19 |
<p align="left">
|
| 20 |
<a href="http://arxiv.org/abs/2602.12205">
|
|
@@ -37,14 +29,14 @@ pipeline_tag: image-to-image
|
|
| 37 |
</a>
|
| 38 |
</p>
|
| 39 |
DeepGen 1.0 is a lightweight unified multimodal model with only 5B parameters (3B VLM + 2B DiT). It integrates five core capabilities (general image generation, general image editing, reasoning image generation, reasoning image editing, and text rendering) within a single model. Across multiple authoritative benchmarks, DeepGen 1.0 is competitive with or surpasses state-of-the-art unified multimodal models that are 3× to 16× larger, demonstrating that massive scaling is not the sole path to high-performance multimodal generation.
|
| 40 |
-
<p align="left"><img src="
|
| 41 |
|
| 42 |
## 🧠 Method
|
| 43 |
Our core observation is that a lightweight model, when empowered by synergistic architecture design and data-centric training strategies, can achieve comprehensive capabilities competitive with or even surpassing much larger counterparts.
|
| 44 |
To overcome the limitations of lightweight models in semantic understanding and fine-grained control, we introduce **Stacked Channel Bridging (SCB)**, a deep alignment framework that extracts hierarchical features from multiple VLM layers and fuses them with learnable ``think tokens'' to provide the generative backbone with structured, reasoning-rich guidance.
|
| 45 |
We further design a data-centric training strategy spanning three progressive stages: (1) **Alignment Pre-training** on large-scale image-text pairs and editing triplets to synchronize VLM and DiT representations, (2) **Joint Supervised Fine-tuning** on a high-quality mixture of generation, editing, and reasoning tasks to foster omni-capabilities, and (3) **Reinforcement Learning with MR-GRPO**, which leverages a mixture of reward functions and supervision signals, resulting in substantial gains in generation quality and alignment with human preferences, while maintaining stable training progress and avoiding visual artifacts.
|
| 46 |
|
| 47 |
-
<p align="left"><img src="
|
| 48 |
|
| 49 |
## 📊 Benchmarks
|
| 50 |
|
|
@@ -99,18 +91,16 @@ We further design a data-centric training strategy spanning three progressive st
|
|
| 99 |
| **DeepGen 1.0 (RL)** | **3B + 2B** | 10.8 🥉 | 75.7 🥈 |
|
| 100 |
|
| 101 |
## 🎨 Quantitative results
|
| 102 |
-
<p align="left"><img src="
|
| 103 |
|
| 104 |
## 🛠️ Usage
|
| 105 |
|
| 106 |
-
###
|
| 107 |
-
|
| 108 |
|
| 109 |
```bash
|
| 110 |
-
#
|
| 111 |
-
|
| 112 |
-
|
| 113 |
-
|
| 114 |
-
|
| 115 |
-
hf_hub_download("deepgenteam/DeepGen-1.0", "model.pt", local_dir=".")
|
| 116 |
-
```
|
|
|
|
| 7 |
pipeline_tag: image-to-image
|
| 8 |
---
|
| 9 |
|
| 10 |
# 💡 DeepGen 1.0: A Lightweight Unified Multimodal Model for Advancing Image Generation and Editing
|
| 11 |
<p align="left">
|
| 12 |
<a href="http://arxiv.org/abs/2602.12205">
|
|
|
|
| 29 |
</a>
|
| 30 |
</p>
|
| 31 |
DeepGen 1.0 is a lightweight unified multimodal model with only 5B parameters (3B VLM + 2B DiT). It integrates five core capabilities (general image generation, general image editing, reasoning image generation, reasoning image editing, and text rendering) within a single model. Across multiple authoritative benchmarks, DeepGen 1.0 is competitive with or surpasses state-of-the-art unified multimodal models that are 3× to 16× larger, demonstrating that massive scaling is not the sole path to high-performance multimodal generation.
|
| 32 |
+
<p align="left"><img src="bubble_chart.png" width="80%"></p>
|
| 33 |
|
| 34 |
## 🧠 Method
|
| 35 |
Our core observation is that a lightweight model, when empowered by synergistic architecture design and data-centric training strategies, can achieve comprehensive capabilities competitive with or even surpassing much larger counterparts.
|
| 36 |
To overcome the limitations of lightweight models in semantic understanding and fine-grained control, we introduce **Stacked Channel Bridging (SCB)**, a deep alignment framework that extracts hierarchical features from multiple VLM layers and fuses them with learnable ``think tokens'' to provide the generative backbone with structured, reasoning-rich guidance.
|
| 37 |
We further design a data-centric training strategy spanning three progressive stages: (1) **Alignment Pre-training** on large-scale image-text pairs and editing triplets to synchronize VLM and DiT representations, (2) **Joint Supervised Fine-tuning** on a high-quality mixture of generation, editing, and reasoning tasks to foster omni-capabilities, and (3) **Reinforcement Learning with MR-GRPO**, which leverages a mixture of reward functions and supervision signals, resulting in substantial gains in generation quality and alignment with human preferences, while maintaining stable training progress and avoiding visual artifacts.
|
| 38 |
|
| 39 |
+
<p align="left"><img src="arch.png" width="80%"></p>
|
| 40 |
|
| 41 |
## 📊 Benchmarks
|
| 42 |
|
|
|
|
| 91 |
| **DeepGen 1.0 (RL)** | **3B + 2B** | 10.8 🥉 | 75.7 🥈 |
|
| 92 |
|
| 93 |
## 🎨 Quantitative results
|
| 94 |
+
<p align="left"><img src="teaser.png" width="80%"></p>
|
| 95 |
|
| 96 |
## 🛠️ Usage
|
| 97 |
|
| 98 |
+
### Merge ZIP Files
|
| 99 |
+
To use the DeepGen checkpoints, merge the sharded archive parts first. We release Pre-training, Supervised Fine-Tuning, and Reinforcement Learning checkpoints.
|
| 100 |
|
| 101 |
```bash
|
| 102 |
+
# Merge zip
|
| 103 |
+
cat DeepGen_CKPT.zip.part-* > DeepGen_CKPT.zip
|
| 104 |
+
# Unzip DeepGen checkpoints
|
| 105 |
+
unzip DeepGen_CKPT.zip
|
| 106 |
+
```
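A truncated or missing part yields a corrupt archive, so it may help to confirm that the merged file is exactly as large as its parts combined. The sketch below is one possible way to do that; `file_size` and `merge_parts` are hypothetical helper names for illustration, not part of the DeepGen release.

```bash
# Hypothetical helpers (not shipped with DeepGen): merge split archive parts
# and check that the merged file's size equals the sum of the part sizes.

# Print the size of a file in bytes.
file_size() {
  wc -c < "$1" | tr -d '[:space:]'
}

# merge_parts PREFIX OUTPUT
# Concatenates every file matching PREFIX* (glob order, which is the correct
# order for zero-padded suffixes such as part-00000) into OUTPUT.
merge_parts() {
  prefix=$1
  output=$2
  set -- "$prefix"*
  # An unmatched glob stays literal, so test that the first match exists.
  if [ ! -e "$1" ]; then
    echo "no parts found for prefix: $prefix" >&2
    return 1
  fi
  expected=0
  for part in "$@"; do
    expected=$(( expected + $(file_size "$part") ))
  done
  cat "$@" > "$output" || return 1
  actual=$(file_size "$output")
  if [ "$actual" -ne "$expected" ]; then
    echo "size mismatch: expected $expected bytes, got $actual" >&2
    return 1
  fi
  echo "merged $# parts into $output ($actual bytes)"
}
```

A call such as `merge_parts DeepGen_CKPT.zip.part- DeepGen_CKPT.zip` would then replace the plain `cat` step. If the downloaded parts match the LFS pointer sizes listed in this commit, the merged archive should be 10 × 5368709120 + 1993873384 = 55680964584 bytes. Each pointer's `oid sha256:` value is the SHA-256 of the corresponding part, so a tool such as `sha256sum` can be used to check individual parts before merging.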
|
docs/arch.png → arch.png
RENAMED
|
File without changes
|
docs/bubble_chart.png → bubble_chart.png
RENAMED
|
File without changes
|
docs/teaser.png → teaser.png
RENAMED
|
File without changes
|