Add pipeline tag and library name to model card, and include citation
#1, opened by nielsr (HF Staff)

README.md CHANGED
````diff
@@ -3,6 +3,8 @@ license: cc-by-4.0
 tags:
 - image-editing
 - diffusion
+pipeline_tag: image-to-image
+library_name: transformers
 ---
 
 # Draw-In-Mind: Learning Precise Image Editing via Chain-of-Thought Imagination
````
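The two added fields are what the Hub uses to surface the model under the image-to-image task filter and to pick a default usage snippet. For anyone who wants to make the same metadata change to another repo programmatically rather than through the web editor, `huggingface_hub` provides a `metadata_update` helper. A minimal sketch, assuming write access and a valid token; the repo id below is a placeholder, not the actual repository:

```python
# Sketch: apply the same frontmatter change via the Hub API.
# "user/DIM-4.6B-Edit" is a placeholder repo id, not the real repository.
from huggingface_hub import metadata_update

metadata_update(
    repo_id="user/DIM-4.6B-Edit",  # placeholder
    metadata={
        "pipeline_tag": "image-to-image",
        "library_name": "transformers",
    },
)
```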
````diff
@@ -18,18 +20,7 @@ tags:
 
 Unified models achieve strong results in text-to-image generation but remain weak in precise editing. This limitation
 arises from an *imbalanced division of responsibilities*. The understanding module is usually treated as a translator
-that encodes instructions into conditions, while the generation module must act as
-is that the generation module carries too much responsibility, even though it is not optimized for complex reasoning.
-
-To address this, we introduce **Draw-In-Mind (DIM)**, a dataset with two complementary parts:
-
-- **DIM-T2I**: 14M long-context image–text pairs that strengthen instruction comprehension.
-- **DIM-Edit**: 233K chain-of-thought imaginations from GPT-4o that provide explicit design blueprints.
-
-We connect a frozen **Qwen2.5-VL-3B** with a trainable **SANA1.5-1.6B** via a lightweight MLP, forming
-**DIM-4.6B-T2I/Edit**. With this setup, the understanding module takes on the *designer responsibility*, while the
-generation module focuses on rendering. Despite its modest size, DIM-4.6B-Edit achieves SOTA or competitive results on
-ImgEdit and GEdit-Bench, outperforming much larger models.
+that encodes user instructions into semantic conditions, while the generation module must simultaneously act as designer and painter, inferring the original layout, identifying the target editing region, and rendering the new content. This imbalance is counterintuitive because the understanding module is typically trained with several times more data on complex reasoning tasks than the generation module. To address this issue, we introduce Draw-In-Mind (DIM), a dataset comprising two complementary subsets: (i) DIM-T2I, containing 14M long-context image-text pairs to enhance complex instruction comprehension; and (ii) DIM-Edit, consisting of 233K chain-of-thought imaginations generated by GPT-4o, serving as explicit design blueprints for image edits. We connect a frozen Qwen2.5-VL-3B with a trainable SANA1.5-1.6B via a lightweight two-layer MLP, and train it on the proposed DIM dataset, resulting in DIM-4.6B-T2I/Edit. Despite its modest parameter scale, DIM-4.6B-Edit achieves SOTA or competitive performance on the ImgEdit and GEdit-Bench benchmarks, outperforming much larger models such as UniWorld-V1 and Step1X-Edit. These findings demonstrate that explicitly assigning the design responsibility to the understanding module provides significant benefits for image editing. Our dataset and models will be available at this https URL.
 
 ## Performance
 
````
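The new abstract also pins down the architecture: a frozen Qwen2.5-VL-3B understanding module feeding a trainable SANA1.5-1.6B generator through a lightweight two-layer MLP. Below is a minimal sketch of what such a connector could look like; this is an illustration, not the authors' code, and the hidden dimensions, activation choice, and module names are all assumptions:

```python
# Hypothetical sketch of a DIM-style bridge: a frozen understanding module
# emits condition embeddings, a two-layer MLP projects them into the
# generator's conditioning space, and only the MLP + generator are trained.
# All dimensions and names here are illustrative assumptions.
import torch
import torch.nn as nn

class MLPConnector(nn.Module):
    """Two-layer MLP mapping understanding-module hidden states
    (e.g. 2048-d for a 3B VLM) to the generator's condition
    dimension (e.g. 2240-d)."""
    def __init__(self, in_dim: int = 2048, out_dim: int = 2240):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(in_dim, out_dim),
            nn.GELU(),
            nn.Linear(out_dim, out_dim),
        )

    def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
        # hidden_states: (batch, seq_len, in_dim) from the frozen VLM
        return self.proj(hidden_states)

connector = MLPConnector()
vlm_features = torch.randn(1, 77, 2048)  # stand-in for frozen VLM output
condition = connector(vlm_features)      # (1, 77, 2240), consumed by the trainable generator
```

Per the abstract, the point of this split is that the understanding module carries the design responsibility (the chain-of-thought "imagination"), so the generator only has to render the resulting condition.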
````diff
@@ -350,4 +341,17 @@ bash scripts/eval_gedit_bench.sh
 The generated images will be saved to `cache/inference/DIM-4.6B-Edit/GEdit-Bench`. Please follow the guide
 in [GEdit-Bench](https://github.com/stepfun-ai/Step1X-Edit) official repo for metrics calculation.
 
-</details>
+</details>
+
+## Citation
+If you find our work useful or helpful for your R&D works, please feel free to cite our paper as below.
+```bibtex
+@article{zhou2025drawinmind,
+  title={Draw-In-Mind: Learning Precise Image Editing via Chain-of-Thought Imagination},
+  author={Yifei Zhou and Haozhe Liu and Songhua Liu and Peng Gao and Hongsheng Li and Yu Qiao},
+  year={2025},
+  journal={arXiv preprint arXiv:2509.01986},
+  archivePrefix={arXiv},
+  eprint={2509.01986},
+}
+```
````