Image-to-Image
rhli committed on
Commit 816d11e · verified · 1 Parent(s): 00f3c04

Update README

Files changed (1)
  1. README.md +20 -10
README.md CHANGED
@@ -7,6 +7,14 @@ base_model:
7
  pipeline_tag: image-to-image
8
  ---
9
 
 
 
 
 
 
 
 
 
10
  # 💡 DeepGen 1.0: A Lightweight Unified Multimodal Model for Advancing Image Generation and Editing
11
  <p align="left">
12
  <a href="http://arxiv.org/abs/2602.12205">
@@ -29,14 +37,14 @@ pipeline_tag: image-to-image
29
  </a>
30
  </p>
31
  DeepGen 1.0 is a lightweight unified multimodal model with only 5B parameters (3B VLM + 2B DiT). It integrates five core capabilities (general image generation, general image editing, reasoning image generation, reasoning image editing, and text rendering) within a single model. Across multiple authoritative benchmarks, DeepGen 1.0 is competitive with or surpasses state-of-the-art unified multimodal models that are 3× to 16× larger, demonstrating that massive scaling is not the sole path to high-performance multimodal generation.
32
- <p align="left"><img src="bubble_chart.png" width="80%"></p>
33
 
34
  ## 🧠 Method
35
  Our core observation is that a lightweight model, when empowered by synergistic architecture design and data-centric training strategies, can achieve comprehensive capabilities competitive with or even surpassing much larger counterparts.
36
  To overcome the limitations of lightweight models in semantic understanding and fine-grained control, we introduce **Stacked Channel Bridging (SCB)**, a deep alignment framework that extracts hierarchical features from multiple VLM layers and fuses them with learnable "think tokens" to provide the generative backbone with structured, reasoning-rich guidance.
37
  We further design a data-centric training strategy spanning three progressive stages: (1) **Alignment Pre-training** on large-scale image-text pairs and editing triplets to synchronize VLM and DiT representations, (2) **Joint Supervised Fine-tuning** on a high-quality mixture of generation, editing, and reasoning tasks to foster omni-capabilities, and (3) **Reinforcement Learning with MR-GRPO**, which leverages a mixture of reward functions and supervision signals, resulting in substantial gains in generation quality and alignment with human preferences, while maintaining stable training progress and avoiding visual artifacts.
38
 
39
- <p align="left"><img src="arch.png" width="80%"></p>
40
 
41
  ## 📊 Benchmarks
42
 
@@ -91,16 +99,18 @@ We further design a data-centric training strategy spanning three progressive st
91
  | **DeepGen 1.0 (RL)** | **3B + 2B** | 10.8 🥉 | 75.7 🥈 |
92
 
93
  ## 🎨 Quantitative results
94
- <p align="left"><img src="teaser.png" width="80%"></p>
95
 
96
  ## 🛠️ Usage
97
 
98
- ### Merge ZIP Files
99
- To use the DeepGen checkpoints, please merge the sharded model files first. We release Pre-training, Supervised Fine-Tuning, and Reinforcement Learning checkpoints.
100
 
101
  ```bash
102
- # Merge zip
103
- cat DeepGen_CKPT.zip.part-* > DeepGen_CKPT.zip
104
- # Unzip DeepGen checkpoints
105
- unzip DeepGen_CKPT.zip
106
- ```
 
 
 
7
  pipeline_tag: image-to-image
8
  ---
9
 
10
+ > **DeepGen 1.0 Checkpoints**
11
+ >
12
+ > | Stage | Repository | Description |
13
+ > | :--- | :--- | :--- |
14
+ > | Pretrain | [deepgenteam/DeepGen-1.0-Pretrain](https://huggingface.co/deepgenteam/DeepGen-1.0-Pretrain) | Alignment pre-training checkpoint |
15
+ > | SFT | [deepgenteam/DeepGen-1.0-SFT](https://huggingface.co/deepgenteam/DeepGen-1.0-SFT) | Supervised fine-tuning checkpoint |
16
+ > | **RL** | **[deepgenteam/DeepGen-1.0](https://huggingface.co/deepgenteam/DeepGen-1.0)** | Reinforcement learning checkpoint (MR-GDPO) *(this repo)* |
17
+
18
  # 💡 DeepGen 1.0: A Lightweight Unified Multimodal Model for Advancing Image Generation and Editing
19
  <p align="left">
20
  <a href="http://arxiv.org/abs/2602.12205">
 
37
  </a>
38
  </p>
39
  DeepGen 1.0 is a lightweight unified multimodal model with only 5B parameters (3B VLM + 2B DiT). It integrates five core capabilities (general image generation, general image editing, reasoning image generation, reasoning image editing, and text rendering) within a single model. Across multiple authoritative benchmarks, DeepGen 1.0 is competitive with or surpasses state-of-the-art unified multimodal models that are 3× to 16× larger, demonstrating that massive scaling is not the sole path to high-performance multimodal generation.
40
+ <p align="left"><img src="docs/bubble_chart.png" width="80%"></p>
41
 
42
  ## 🧠 Method
43
  Our core observation is that a lightweight model, when empowered by synergistic architecture design and data-centric training strategies, can achieve comprehensive capabilities competitive with or even surpassing much larger counterparts.
44
  To overcome the limitations of lightweight models in semantic understanding and fine-grained control, we introduce **Stacked Channel Bridging (SCB)**, a deep alignment framework that extracts hierarchical features from multiple VLM layers and fuses them with learnable "think tokens" to provide the generative backbone with structured, reasoning-rich guidance.
45
  We further design a data-centric training strategy spanning three progressive stages: (1) **Alignment Pre-training** on large-scale image-text pairs and editing triplets to synchronize VLM and DiT representations, (2) **Joint Supervised Fine-tuning** on a high-quality mixture of generation, editing, and reasoning tasks to foster omni-capabilities, and (3) **Reinforcement Learning with MR-GRPO**, which leverages a mixture of reward functions and supervision signals, resulting in substantial gains in generation quality and alignment with human preferences, while maintaining stable training progress and avoiding visual artifacts.
46
 
47
+ <p align="left"><img src="docs/arch.png" width="80%"></p>
48
 
49
  ## 📊 Benchmarks
50
 
 
99
  | **DeepGen 1.0 (RL)** | **3B + 2B** | 10.8 🥉 | 75.7 🥈 |
100
 
101
  ## 🎨 Quantitative results
102
+ <p align="left"><img src="docs/teaser.png" width="80%"></p>
103
 
104
  ## 🛠️ Usage
105
 
106
+ ### Download Checkpoint
107
+ This repository contains the **Reinforcement Learning (MR-GDPO)** checkpoint, which is the final release model.
108
 
109
  ```bash
110
+ # Using hf CLI
111
+ hf download deepgenteam/DeepGen-1.0 model.pt --local-dir .
112
+
113
+ # Or using Python (run the next two lines in a Python session, not in the shell)
114
+ from huggingface_hub import hf_hub_download
115
+ hf_hub_download("deepgenteam/DeepGen-1.0", "model.pt", local_dir=".")
116
+ ```
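
The checkpoint table and the download snippet added in this commit can be combined into a small helper. The following is a minimal sketch, not part of the README: the names `CHECKPOINT_REPOS`, `resolve_repo`, and `download_checkpoint` are hypothetical, and the filename `model.pt` is only confirmed for the RL repository, so using it for the Pretrain and SFT repositories is an assumption.

```python
from pathlib import Path

# Stage -> repository id, copied from the checkpoint table in the README.
CHECKPOINT_REPOS = {
    "pretrain": "deepgenteam/DeepGen-1.0-Pretrain",
    "sft": "deepgenteam/DeepGen-1.0-SFT",
    "rl": "deepgenteam/DeepGen-1.0",
}


def resolve_repo(stage: str) -> str:
    """Return the repository id for a training stage (case-insensitive)."""
    key = stage.strip().lower()
    if key not in CHECKPOINT_REPOS:
        raise ValueError(
            f"unknown stage {stage!r}; expected one of {sorted(CHECKPOINT_REPOS)}"
        )
    return CHECKPOINT_REPOS[key]


def download_checkpoint(
    stage: str = "rl", filename: str = "model.pt", local_dir: str = "."
) -> Path:
    """Download one checkpoint file and return its local path.

    ASSUMPTION: `model.pt` is only confirmed for the RL repository; the
    Pretrain and SFT repositories may use a different filename.
    """
    # Imported lazily so the stage lookup works without huggingface_hub installed.
    from huggingface_hub import hf_hub_download

    return Path(hf_hub_download(resolve_repo(stage), filename, local_dir=local_dir))
```

Calling `download_checkpoint("rl")` performs the same `hf_hub_download` call shown in the README; nothing is downloaded until the function is called.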