Add pipeline tag and improve model card metadata

#1
by nielsr HF Staff - opened
Files changed (1)
  1. README.md +15 -380
README.md CHANGED
@@ -1,5 +1,6 @@
1
  ---
2
  license: cc-by-nc-4.0
3
  tags:
4
  - diffusion
5
  - image-editing
@@ -14,10 +15,12 @@ tags:
14
  [![Hugging Face Models](https://img.shields.io/badge/🤗%20%20Model-DIM--4.6B--T2I-orange.svg)](https://huggingface.co/stdKonjac/DIM-4.6B-T2I)
15
  [![Hugging Face Models](https://img.shields.io/badge/🤗%20%20Model-DIM--4.6B--Edit-orange.svg)](https://huggingface.co/stdKonjac/DIM-4.6B-Edit)
16
 
17
- ![DIM-Edit](assets/dim_edit.png)
18
 
19
  ## 📰 News
20
 
21
  **[2025-10-08]** We release the **DIM-Edit** dataset and the **DIM-4.6B-T2I** / **DIM-4.6B-Edit** models.
22
 
23
  **[2025-09-26]** We upload a new version of the paper, including more results across various designers.
@@ -26,146 +29,34 @@ tags:
26
 
27
  ## Introduction
28
 
29
- Unified models achieve strong results in text-to-image generation but remain weak in precise editing. This limitation
30
- arises from an *imbalanced division of responsibilities*. The understanding module is usually treated as a translator
31
- that encodes instructions into conditions, while the generation module must act as both designer and painter. The result
32
- is that the generation module carries too much responsibility, even though it is not optimized for complex reasoning.
33
 
34
- To address this, we introduce **Draw-In-Mind (DIM)**, a dataset with two complementary parts:
35
 
36
  - **DIM-T2I**: 14M long-context image–text pairs that strengthen instruction comprehension.
37
  - **DIM-Edit**: 233K chain-of-thought imaginations from GPT-4o that provide explicit design blueprints.
38
 
39
- We connect a frozen **Qwen2.5-VL-3B** with a trainable **SANA1.5-1.6B** via a lightweight MLP, forming
40
- **DIM-4.6B-T2I/Edit**. With this setup, the understanding module takes on the *designer responsibility*, while the
41
- generation module focuses on rendering. Despite its modest size, DIM-4.6B-Edit achieves SOTA or competitive results on
42
- ImgEdit and GEdit-Bench, outperforming much larger models.
43
 
44
  ## Performance
45
 
46
  <details>
47
-
48
- <summary><b>GenEval and MJHQ-30K</b></summary>
49
-
50
- *: <sup>†</sup> denotes using an LLM rewriter. For MJHQ(-30K), we report FID.
51
-
52
- | Model | Params | Sin. | Two | CT. | Colors | Pos. | Attr. | Overall | MJHQ |
53
- |----------------------------------------------------------------|:----------------:|:----:|:----:|:----:|:------:|:----:|:-----:|:-------:|:-----:|
54
- | <tr><td colspan="10" align="center"><b>Gen. Only</b></td></tr> |
55
- | PixArt-α | 0.6B🔥 | 0.98 | 0.50 | 0.44 | 0.80 | 0.08 | 0.07 | 0.48 | 6.14 |
56
- | SDXL | 2.6B🔥 | 0.98 | 0.74 | 0.39 | 0.85 | 0.15 | 0.23 | 0.55 | 8.76 |
57
- | DALL-E·3 | - | 0.96 | 0.87 | 0.47 | 0.83 | 0.43 | 0.45 | 0.67 | - |
58
- | SD3-Medium | 2.0B🔥 | 0.99 | 0.94 | 0.72 | 0.89 | 0.33 | 0.60 | 0.74 | 11.92 |
59
- | <tr><td colspan="10" align="center"><b>Unified</b></td></tr> |
60
- | Janus | 1.3B🔥 | 0.97 | 0.68 | 0.30 | 0.84 | 0.46 | 0.42 | 0.61 | 10.10 |
61
- | Emu3-Gen<sup>†</sup> | 8.0B🔥 | 0.99 | 0.81 | 0.42 | 0.80 | 0.49 | 0.45 | 0.66 | - |
62
- | Show-o | 1.3B🔥 | 0.98 | 0.80 | 0.66 | 0.84 | 0.31 | 0.50 | 0.68 | 15.18 |
63
- | Show-o2-7B | 7.0B🔥 | 1.00 | 0.87 | 0.58 | 0.92 | 0.52 | 0.62 | 0.76 | - |
64
- | Janus-Pro-7B | 7.0B🔥 | 0.99 | 0.89 | 0.59 | 0.90 | 0.79 | 0.66 | 0.80 | 13.48 |
65
- | BAGEL | 14.0B🔥 | 0.99 | 0.94 | 0.81 | 0.88 | 0.64 | 0.63 | 0.82 | - |
66
- | MetaQuery-L<sup>†</sup> | 3.0B❄️ \| 3.2B🔥 | - | - | - | - | - | - | 0.78 | 6.35 |
67
- | **DIM-4.6B-T2I<sup>†</sup>** | 3.0B❄️ \| 1.6B🔥 | 0.99 | 0.89 | 0.63 | 0.86 | 0.62 | 0.61 | 0.77 | 5.50 |
68
-
69
- </details>
70
-
71
- <details>
72
-
73
  <summary><b>ImgEdit Overall</b></summary>
74
 
75
- *: Q3/7B indicates using Qwen2.5-VL-3/7B as the external designer during inference. By default, GPT-4o is employed
76
- as the external designer to ensure the best performance. All models are evaluated using GPT-4.1.
77
-
78
  | Model | Add | Adj. | Ext. | Rep. | Rem. | Back. | Sty. | Hyb. | Act. | Overall |
79
  |-------------------|:----:|:----:|:----:|:----:|:----:|:-----:|:----:|:----:|:----:|:-------:|
80
  | MagicBrush | 2.84 | 1.58 | 1.51 | 1.97 | 1.58 | 1.75 | 2.38 | 1.62 | 1.22 | 1.83 |
81
- | Instruct-P2P | 2.45 | 1.83 | 1.44 | 2.01 | 1.50 | 1.44 | 3.55 | 1.20 | 1.46 | 1.88 |
82
- | AnyEdit | 3.18 | 2.95 | 1.88 | 2.47 | 2.23 | 2.24 | 2.85 | 1.56 | 2.65 | 2.45 |
83
- | UltraEdit | 3.44 | 2.81 | 2.13 | 2.96 | 1.45 | 2.83 | 3.76 | 1.91 | 2.98 | 2.70 |
84
  | Step1X-Edit | 3.88 | 3.14 | 1.76 | 3.40 | 2.41 | 3.16 | 4.63 | 2.64 | 2.52 | 3.06 |
85
- | BAGEL | 3.56 | 3.31 | 1.70 | 3.30 | 2.62 | 3.24 | 4.49 | 2.38 | 4.17 | 3.20 |
86
  | UniWorld-V1 | 3.82 | 3.64 | 2.27 | 3.47 | 3.24 | 2.99 | 4.21 | 2.96 | 2.74 | 3.26 |
87
- | Janus-4o | 3.35 | 3.35 | 2.25 | 3.01 | 2.18 | 3.32 | 4.71 | 2.49 | 4.04 | 3.19 |
88
- | GPT-4o-Image | 4.61 | 4.33 | 2.90 | 4.35 | 3.66 | 4.57 | 4.93 | 3.96 | 4.89 | 4.20 |
89
  | **DIM-4.6B-Edit** | 4.09 | 3.47 | 2.30 | 4.00 | 3.43 | 3.87 | 4.92 | 2.85 | 4.08 | 3.67 |
90
 
91
  </details>
92
 
93
- <details>
94
-
95
- <summary><b>ImgEdit Designer Ablation</b></summary>
96
-
97
- <sup>†</sup>: The default setting.
98
-
99
- | Designer | Add | Adj. | Ext. | Rep. | Rem. | Back. | Sty. | Hyb. | Act. | Overall |
100
- |:-------------------|:----:|:----:|:----:|:----:|:----:|:-----:|:----:|:----:|:----:|:-------:|
101
- | – | 3.53 | 3.23 | 2.01 | 3.49 | 1.47 | 3.42 | 4.79 | 2.35 | 3.64 | 3.10 |
102
- | Qwen2.5-VL-3B | 3.80 | 3.24 | 2.03 | 3.89 | 3.21 | 3.52 | 4.92 | 2.71 | 4.05 | 3.49 |
103
- | Qwen2.5-VL-7B | 3.95 | 3.35 | 2.25 | 3.85 | 3.31 | 3.57 | 4.88 | 2.81 | 4.02 | 3.55 |
104
- | MiMo-VL-7B | 3.95 | 3.32 | 2.20 | 3.75 | 2.46 | 3.82 | 4.88 | 2.52 | 3.93 | 3.43 |
105
- | InternVL3.5-8B | 3.98 | 3.40 | 2.05 | 4.14 | 3.30 | 3.84 | 4.94 | 2.77 | 3.89 | 3.59 |
106
- | GLM-4.1V-9B | 3.95 | 3.27 | 2.23 | 3.90 | 2.64 | 3.81 | 4.92 | 2.23 | 4.02 | 3.44 |
107
- | GPT-4o<sup>†</sup> | 4.09 | 3.47 | 2.30 | 4.00 | 3.43 | 3.87 | 4.92 | 2.85 | 4.08 | 3.67 |
108
-
109
- </details>
110
-
111
- <details>
112
-
113
- <summary><b>Visualization</b></summary>
114
-
115
- *:**Green** and **Blue** denote the edits of *Janus-4o* and *Step1X-Edit* respectively; **Red** denotes the edits of our
116
- models trained on different data corpora.
117
-
118
- ![Overall](assets/vis_overall.png)
119
- ![Add](assets/vis_add.png)
120
- ![Change](assets/vis_change.png)
121
- ![Remove](assets/vis_remove.png)
122
- ![Replace](assets/vis_replace.png)
123
- ![Transfer](assets/vis_transfer.png)
124
-
125
- </details>
126
-
127
  ## Dataset Usage
128
 
129
- ### DIM-T2I
130
-
131
- Not available yet.
132
-
133
  ### DIM-Edit
134
 
135
- Please first download [**DIM-Edit**](https://huggingface.co/datasets/stdKonjac/DIM-Edit) from our 🤗HF repo. You can use
136
- `huggingface-cli` to download it quickly:
137
-
138
- ```
139
- # 1. Install the huggingface hub tools (if not yet installed)
140
- pip install -U huggingface_hub
141
-
142
- # 2. Log in with your Hugging Face account token
143
- huggingface-cli login
144
-
145
- # 3. Download the dataset
146
- huggingface-cli download stdKonjac/DIM-Edit --repo-type dataset --local-dir ./DIM-Edit
147
- ```
148
-
149
- After downloading, navigate into the dataset folder, merge and extract the split archives using the following bash
150
- commands:
151
-
152
- ```
153
- cd DIM-Edit
154
- cat images.tar.gz.part* > images.tar.gz
155
- tar -xvzf images.tar.gz
156
- ```
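For readers without a POSIX shell, the merge step in the removed instructions above can be approximated in Python. This is a minimal sketch, not part of the repository: `merge_parts` is a hypothetical helper, and it assumes the part files sort into the correct order by name, which is also what the shell glob `images.tar.gz.part*` relies on.

```python
import shutil
from pathlib import Path


def merge_parts(folder, prefix="images.tar.gz.part", merged_name="images.tar.gz"):
    """Concatenate split archive parts in name order, like `cat prefix* > merged_name`.

    Hypothetical helper: the default names mirror the shell commands in the README.
    """
    folder = Path(folder)
    # Sort by file name so part00, part01, ... are appended in order.
    parts = sorted(
        (p for p in folder.iterdir() if p.name.startswith(prefix)),
        key=lambda p: p.name,
    )
    if not parts:
        raise FileNotFoundError(f"no files starting with {prefix!r} in {folder}")
    merged = folder / merged_name
    with merged.open("wb") as out:
        for part in parts:
            with part.open("rb") as src:
                # Stream each part so large archives never sit fully in memory.
                shutil.copyfileobj(src, out)
    return merged
```

The merged file can then be extracted with `tar -xvzf images.tar.gz` as in the original instructions.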
157
-
158
- In the meantime, you will find a JSONL file named `tos_dataset_edit.jsonl` in the root directory, which records all
159
- image editing samples. Each line in this file corresponds to a single sample containing four fields:
160
-
161
- | Field | Description |
162
- |:----------------------|:----------------------------------------------------------------------------------|
163
- | **id** | Unique identifier for each sample. |
164
- | **image_path** | Path to the **source** image, beginning with `image/`. |
165
- | **image_path_target** | Path to the **target** image, beginning with `image/`. |
166
- | **prompt** | The CoT-style instruction describing how to transform the source into the target. |
167
-
168
- We recommend using the huggingface `datasets` library to load the dataset efficiently:
169
 
170
  ```python
171
  from datasets import load_dataset, Features, Value
@@ -183,301 +74,45 @@ ds = load_dataset(
183
  features=features,
184
  split="train",
185
  )
186
-
187
- print(ds[0])
188
  ```
189
 
190
  ## Model Usage
191
 
192
  ### Environment Setup
193
 
194
- Run the following script to set up the Python environment.
195
-
196
- ```
197
  pip install -r requirements.txt
198
  ```
199
 
200
- ### 🦙 Model Zoo
201
-
202
- Please first create a `checkpoints` folder in the root directory:
203
-
204
- ```
205
- mkdir checkpoints
206
- ```
207
-
208
- Then download the models from our 🤗HF repo below, and move them to the `checkpoints` folder.
209
-
210
- *: To facilitate reproducibility, we release [**DIM-4.6B-Edit-Stage1**](https://huggingface.co/stdKonjac/DIM-4.6B-Edit-Stage1), which is trained solely on the **UltraEdit** dataset.
211
- By fine-tuning this checkpoint on our proposed [**DIM-Edit**](https://huggingface.co/datasets/stdKonjac/DIM-Edit) dataset, you should obtain [**DIM-4.6B-Edit**](https://huggingface.co/stdKonjac/DIM-4.6B-Edit).
212
-
213
- | Model | Task | Training Data | ImgEdit | Parameters |
214
- |:----------------------------------------------------------------------------------|:-------------:|:--------------------------:|:-------:|:---------------:|
215
- | [**DIM-4.6B-T2I**](https://huggingface.co/stdKonjac/DIM-4.6B-T2I) | Text-to-Image | DIM-T2I + 6.9M Public Data | – | 3.0B❄️ + 1.6B🔥 |
216
- | [**DIM-4.6B-Edit-Stage1**](https://huggingface.co/stdKonjac/DIM-4.6B-Edit-Stage1) | Image Editing | UltraEdit | 2.76 | 3.0B❄️ + 1.6B🔥 |
217
- | [**DIM-4.6B-Edit**](https://huggingface.co/stdKonjac/DIM-4.6B-Edit) | Image Editing | UltraEdit → DIM-Edit | 3.67 | 3.0B❄️ + 1.6B🔥 |
218
-
219
- The checkpoints should be organized like:
220
-
221
- ```
222
- DIM/
223
- └── checkpoints/
224
- ├── DIM-4.6B-T2I/
225
- │ ├── model.safetensors
226
- │ └── ...
227
- ├── DIM-4.6B-Edit-Stage1/
228
- │ ├── model.safetensors
229
- │ └── ...
230
- └── DIM-4.6B-Edit/
231
- ├── model.safetensors
232
- └── ...
233
- ```
234
-
235
  ### Inference
236
 
237
- <details>
238
-
239
- <summary><b>T2I Generation</b></summary>
240
-
241
- The demo T2I instructions are provided in `cache/demo/tos_dataset_demo.jsonl`, where each line is an instruction in json
242
- format like:
243
-
244
- ```
245
- {"id": "0000", "image_path": "./cache/demo/edit_demo_0000.png", "prompt": "A yummy cupcake floating in the air dark background"}
246
- ```
247
-
248
- The `image_path` is just a placeholder, and you can modify `prompt` to create your own image.
249
-
250
- To generate images from the jsonl file, run the following script:
251
-
252
- ```
253
- bash scripts/demo_t2i.sh
254
- ```
255
-
256
- For each instruction, the generated image will be saved at `cache/inference/demo/DIM-4.6B-T2I/{id}_gen.jpg`.
257
-
258
- </details>
259
-
260
- <details>
261
-
262
- <summary><b>Image Editing</b></summary>
263
-
264
- The demo edit instructions are provided in `cache/demo/tos_dataset_edit_demo.jsonl`, where each line is an instruction
265
- in json
266
- format like:
267
-
268
- ```
269
- {"id": "0", "image_path": "./cache/demo/edit_demo_0000.png", "prompt": "Remove the lemons on the table.", "image_path_target": "./cache/demo/edit_demo_0000.png"}
270
- ```
271
-
272
- The `image_path` corresponds to the source image, and the `prompt` is the edit instruction. The `image_path_target` is
273
- just a placeholder.
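To edit your own images, an instruction file in the same shape as the demo line above can be generated programmatically. This is a rough sketch, not a script from the repository: `build_edit_instruction` and `write_jsonl` are hypothetical helpers, and reusing the source path for `image_path_target` simply mirrors what the demo file does with its placeholder.

```python
import json


def build_edit_instruction(sample_id, source_image, prompt):
    """Build one record with the same keys as the demo edit instruction."""
    return {
        "id": str(sample_id),
        "image_path": source_image,
        "prompt": prompt,
        # Placeholder only: the demo file repeats the source path here.
        "image_path_target": source_image,
    }


def write_jsonl(path, records):
    """Write one JSON object per line, the layout the demo file uses."""
    with open(path, "w", encoding="utf-8") as handle:
        for record in records:
            handle.write(json.dumps(record, ensure_ascii=False) + "\n")
```

Whether `scripts/demo_edit.sh` picks up a file other than `cache/demo/tos_dataset_edit_demo.jsonl` depends on the script's own arguments, which are not shown here.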
274
-
275
- In `infer/demo_edit.py`, use the `set_designer_gpt` API with your own key to set GPT-4o as the external designer for
276
- optimal performance.
277
-
278
- ```python
279
- # GPT-4o as external designer
280
- model.set_designer_gpt(api_key='')
281
- ```
282
-
283
- You can also use the `set_designer_X` API to set various open-source VLMs as the external designer. The VLMs will be
284
- automatically downloaded to local disk.
285
 
286
  ```python
287
  # Qwen2.5-VL as external designer
288
  model.set_designer_qwen(version='Qwen/Qwen2.5-VL-3B-Instruct')
289
- model.set_designer_qwen(version='Qwen/Qwen2.5-VL-7B-Instruct')
290
 
291
- # InternVL3.5 as external designer (recommend using transformers==4.53.0)
292
  model.set_designer_internvl(version='OpenGVLab/InternVL3_5-8B-HF')
293
-
294
- # MiMo-VL as external designer
295
- model.set_designer_mimo(version='XiaomiMimo/MiMo-VL-7B-RL-2508')
296
-
297
- # GLM-4.1V as external designer (recommend using transformers==4.53.1)
298
- model.set_designer_glm(version='THUDM/GLM-4.1V-9B-Thinking')
299
  ```
300
 
301
- To generate edited images from the jsonl file, run the following script:
302
 
303
- ```
304
  bash scripts/demo_edit.sh
305
  ```
306
 
307
- The model will first generate a CoT-guided edit instruction for each prompt and save it to
308
- `cache/inference/demo/DIM-4.6B-Edit/tos_dataset_edit_cot_demo_gen.jsonl`. Then the generated images will be saved at
309
- `cache/inference/demo/DIM-4.6B-Edit/{id}_edited.jpg`.
310
-
311
- We also provide a sample GPT-4o generated CoT jsonl file at `cache/demo/tos_dataset_edit_cot_demo.jsonl` for reference.
312
-
313
- </details>
314
-
315
- ### Evaluation
316
-
317
- <details>
318
-
319
- <summary><b>GenEval</b></summary>
320
-
321
- We provide two evaluation jsonl files according to prompt types in `cache/GenEval`:
322
-
323
- 1. `tos_dataset.jsonl`: Original prompts.
324
- 2. `tos_dataset_rewritten.jsonl`: LLM-rewritten prompts.
325
-
326
- The `image_path` field in each line of the jsonl is just a
327
- placeholder, please replace it with a pseudo image on your local disk first.
328
-
329
- Run the following script to generate images:
330
-
331
- ```
332
- bash scripts/eval_geneval.sh
333
- ```
334
-
335
- The generated images will be saved to `cache/inference/DIM-4.6B-T2I/GenEval(_rewritten)`.
336
- Please follow the guide in [GenEval](https://github.com/djghosh13/geneval) official repo for metrics calculation.
337
-
338
- </details>
339
-
340
- <details>
341
-
342
- <summary><b>MJHQ-30K</b></summary>
343
-
344
- First download [MJHQ-30K](https://huggingface.co/datasets/playgroundai/MJHQ-30K) from the HF repo. You only need to
345
- download `mjhq30k_imgs.zip`. Then extract all images in
346
- the `cache` folder and organize them as follows:
347
-
348
- ```
349
- cache
350
- └── MJHQ-30K
351
- ├── animals
352
- │ ├── {id}.jpg
353
- │ ├── {id}.jpg
354
- │ └── ...
355
- ├── art
356
- ├── fashion
357
- ├── food
358
- ├── indoor
359
- ├── landscape
360
- ├── logo
361
- ├── people
362
- ├── plants
363
- └── vehicles
364
- ```
365
-
366
- We have provided all prompts of MJHQ-30K in `cache/MJHQ-30K/tos_dataset.jsonl`. Run the following script to
367
- generate images:
368
-
369
- ```
370
- bash scripts/eval_mjhq30k.sh
371
- ```
372
-
373
- The generated images will be saved to `cache/inference/DIM-4.6B-T2I/MJHQ-30K`. We
374
- use [pytorch-fid](https://github.com/mseitzer/pytorch-fid) to calculate the FID on MJHQ-30K.
375
-
376
- </details>
377
-
378
- <details>
379
-
380
- <summary><b>ImgEdit</b></summary>
381
-
382
- First download [ImgEdit](https://huggingface.co/datasets/sysuyy/ImgEdit/tree/main) from the HF repo. Put the dataset in
383
- the `cache` folder, and organize it as follows:
384
-
385
- ```
386
- cache
387
- └── ImgEdit
388
- └── Benchmark
389
- ├── hard
390
- ├── multiturn
391
- └── singleturn
392
- ├── animal
393
- │ ├── {id}.jpg
394
- │ └── ...
395
- ├── architecture
396
- ├── clothes
397
- ├── compose
398
- ├── daily object
399
- ├── for_add
400
- ├── human
401
- ├── style
402
- ├── transport
403
- ├── judge_prompt.json
404
- └── singleturn.json
405
- ```
406
-
407
- We provide four evaluation jsonl files according to prompt types in `cache/ImgEdit`:
408
-
409
- 1. `tos_dataset_edit.jsonl`: Original prompts.
410
- 2. `tos_dataset_edit_cot.jsonl`: CoT-style prompts generated by GPT-4o.
411
- 3. `tos_dataset_edit_cot_Qwen2.5-VL-3B-Instruct.jsonl`: CoT-style prompts generated by Qwen2.5-VL-3B.
412
- 4. `tos_dataset_edit_cot_Qwen2.5-VL-7B-Instruct.jsonl`: CoT-style prompts generated by Qwen2.5-VL-7B.
413
-
414
- Run the following script to generate images:
415
-
416
- ```
417
- bash scripts/eval_imgedit.sh
418
- ```
419
-
420
- The generated images will be saved to `cache/inference/DIM-4.6B-Edit/ImgEdit`. Please follow the guide
421
- in [ImgEdit](https://github.com/PKU-YuanGroup/ImgEdit) official repo for metrics calculation.
422
-
423
- </details>
424
-
425
- <details>
426
-
427
- <summary><b>GEdit-Bench-EN</b></summary>
428
-
429
- First download [GEdit-Bench](https://huggingface.co/datasets/stepfun-ai/GEdit-Bench) from the HF repo. Extract all raw
430
- images from the dataset and put them in the `cache` folder. Organize them as follows:
431
-
432
- ```
433
- cache
434
- └── GEdit-Bench
435
- └── input_image_raw
436
- ├── {id}.png
437
- ├── {id}.png
438
- ├── {id}.png
439
- ├── {id}.png
440
- └── ...
441
- ```
442
-
443
- We provide four evaluation jsonl files according to prompt types in `cache/GEdit-Bench`:
444
-
445
- 1. `tos_dataset_edit_en.jsonl`: Original prompts.
446
- 2. `tos_dataset_edit_en_cot.jsonl`: CoT-style prompts generated by GPT-4o.
447
- 3. `tos_dataset_edit_en_ot_Qwen2.5-VL-3B-Instruct.jsonl`: CoT-style prompts generated by Qwen2.5-VL-3B.
448
- 4. `tos_dataset_edit_en_cot_Qwen2.5-VL-7B-Instruct.jsonl`: CoT-style prompts generated by Qwen2.5-VL-7B.
449
-
450
- Run the following script to generate images:
451
-
452
- ```
453
- bash scripts/eval_gedit_bench.sh
454
- ```
455
-
456
- The generated images will be saved to `cache/inference/DIM-4.6B-Edit/GEdit-Bench`. Please follow the guide
457
- in [GEdit-Bench](https://github.com/stepfun-ai/Step1X-Edit) official repo for metrics calculation.
458
-
459
- </details>
460
-
461
  ## License
462
 
463
  ### Dataset
464
-
465
  The dataset is licensed under the [CC-BY-NC 4.0](https://creativecommons.org/licenses/by-nc/4.0/) license.
466
 
467
  ### Model
468
-
469
- The models are developed based on [Qwen2.5-VL-3B-Instruct](https://huggingface.co/Qwen/Qwen2.5-VL-3B-Instruct) (subject
470
- to [Qwen RESEARCH LICENSE AGREEMENT](https://huggingface.co/Qwen/Qwen2.5-VL-3B-Instruct/blob/main/LICENSE)) and
471
- [SANA1.5_1.6B_1024px](https://huggingface.co/Efficient-Large-Model/SANA1.5_1.6B_1024px) (subject
472
- to [NVIDIA License](https://huggingface.co/Efficient-Large-Model/SANA1.5_1.6B_1024px/blob/main/LICENSE.txt)). We retain
473
- ownership of all intellectual property rights in and to any
474
- derivative works and modifications that we made.
475
 
476
  ## Citation
477
 
478
- If you find our work useful or helpful for your R&D works, please feel free to cite our paper as below.
479
-
480
- ```
481
  @misc{zeng2025drawinmindrebalancingdesignerpainterroles,
482
  title={Draw-In-Mind: Rebalancing Designer-Painter Roles in Unified Multimodal Models Benefits Image Editing},
483
  author={Ziyun Zeng and Junhao Zhang and Wei Li and Mike Zheng Shou},
 
1
  ---
2
  license: cc-by-nc-4.0
3
+ pipeline_tag: image-to-image
4
  tags:
5
  - diffusion
6
  - image-editing
 
15
  [![Hugging Face Models](https://img.shields.io/badge/🤗%20%20Model-DIM--4.6B--T2I-orange.svg)](https://huggingface.co/stdKonjac/DIM-4.6B-T2I)
16
  [![Hugging Face Models](https://img.shields.io/badge/🤗%20%20Model-DIM--4.6B--Edit-orange.svg)](https://huggingface.co/stdKonjac/DIM-4.6B-Edit)
17
 
18
+ ![DIM-Edit](https://huggingface.co/stdKonjac/DIM-4.6B-Edit/resolve/main/assets/dim_edit.png)
19
 
20
  ## 📰 News
21
 
22
+ **[2026-01-26]** **DIM** is accepted to ICLR 2026 🎉🎉
23
+
24
  **[2025-10-08]** We release the **DIM-Edit** dataset and the **DIM-4.6B-T2I** / **DIM-4.6B-Edit** models.
25
 
26
  **[2025-09-26]** We upload a new version of the paper, including more results across various designers.
 
29
 
30
  ## Introduction
31
 
32
+ Unified models achieve strong results in text-to-image generation but remain weak in precise editing. This limitation arises from an *imbalanced division of responsibilities*. The understanding module is usually treated as a translator that encodes instructions into conditions, while the generation module must act as both designer and painter.
33
 
34
+ To address this, the paper introduces **Draw-In-Mind (DIM)**, a dataset with two complementary parts:
35
 
36
  - **DIM-T2I**: 14M long-context image–text pairs that strengthen instruction comprehension.
37
  - **DIM-Edit**: 233K chain-of-thought imaginations from GPT-4o that provide explicit design blueprints.
38
 
39
+ The authors connect a frozen **Qwen2.5-VL-3B** with a trainable **SANA1.5-1.6B** via a lightweight MLP, forming **DIM-4.6B-T2I/Edit**. With this setup, the understanding module takes on the *designer responsibility*, while the generation module focuses on rendering. Despite its modest size, DIM-4.6B-Edit achieves SOTA or competitive results on ImgEdit and GEdit-Bench.
40
 
41
  ## Performance
42
 
43
  <details>
44
  <summary><b>ImgEdit Overall</b></summary>
45
 
46
  | Model | Add | Adj. | Ext. | Rep. | Rem. | Back. | Sty. | Hyb. | Act. | Overall |
47
  |-------------------|:----:|:----:|:----:|:----:|:----:|:-----:|:----:|:----:|:----:|:-------:|
48
  | MagicBrush | 2.84 | 1.58 | 1.51 | 1.97 | 1.58 | 1.75 | 2.38 | 1.62 | 1.22 | 1.83 |
49
  | Step1X-Edit | 3.88 | 3.14 | 1.76 | 3.40 | 2.41 | 3.16 | 4.63 | 2.64 | 2.52 | 3.06 |
50
  | UniWorld-V1 | 3.82 | 3.64 | 2.27 | 3.47 | 3.24 | 2.99 | 4.21 | 2.96 | 2.74 | 3.26 |
51
  | **DIM-4.6B-Edit** | 4.09 | 3.47 | 2.30 | 4.00 | 3.43 | 3.87 | 4.92 | 2.85 | 4.08 | 3.67 |
52
 
53
  </details>
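As a sanity check on the table above, the four rows that remain in the trimmed README are consistent with the Overall column being the unweighted mean of the nine category scores, rounded to two decimals. That reading is an observation from the numbers, not a definition given in the README, and `mean_score` is a hypothetical helper.

```python
# Category scores (Add, Adj., Ext., Rep., Rem., Back., Sty., Hyb., Act.)
# and the reported Overall value, copied from the table above.
IMGEDIT_ROWS = {
    "MagicBrush": ([2.84, 1.58, 1.51, 1.97, 1.58, 1.75, 2.38, 1.62, 1.22], 1.83),
    "Step1X-Edit": ([3.88, 3.14, 1.76, 3.40, 2.41, 3.16, 4.63, 2.64, 2.52], 3.06),
    "UniWorld-V1": ([3.82, 3.64, 2.27, 3.47, 3.24, 2.99, 4.21, 2.96, 2.74], 3.26),
    "DIM-4.6B-Edit": ([4.09, 3.47, 2.30, 4.00, 3.43, 3.87, 4.92, 2.85, 4.08], 3.67),
}


def mean_score(scores):
    """Unweighted mean of the category scores, rounded like the table."""
    return round(sum(scores) / len(scores), 2)


for model, (scores, reported) in IMGEDIT_ROWS.items():
    status = "matches" if abs(mean_score(scores) - reported) < 1e-9 else "differs"
    print(f"{model}: computed {mean_score(scores):.2f}, reported {reported:.2f} ({status})")
```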
54
 
55
  ## Dataset Usage
56
 
57
  ### DIM-Edit
58
 
59
+ You can load the [DIM-Edit dataset](https://huggingface.co/datasets/stdKonjac/DIM-Edit) using the `datasets` library:
60
 
61
  ```python
62
  from datasets import load_dataset, Features, Value
 
74
  features=features,
75
  split="train",
76
  )
77
  ```
78
 
79
  ## Model Usage
80
 
81
  ### Environment Setup
82
 
83
+ ```bash
84
  pip install -r requirements.txt
85
  ```
86
 
87
  ### Inference
88
 
89
+ The model uses a Chain-of-Thought (CoT) approach where an external "designer" generates a blueprint. In `infer/demo_edit.py`, you can set various open-source VLMs as the external designer:
90
 
91
  ```python
92
  # Qwen2.5-VL as external designer
93
  model.set_designer_qwen(version='Qwen/Qwen2.5-VL-3B-Instruct')
94
 
95
+ # InternVL3.5 as external designer
96
  model.set_designer_internvl(version='OpenGVLab/InternVL3_5-8B-HF')
97
  ```
98
 
99
+ To generate edited images from a jsonl file, run:
100
 
101
+ ```bash
102
  bash scripts/demo_edit.sh
103
  ```
104
 
105
  ## License
106
 
107
  ### Dataset
108
  The dataset is licensed under the [CC-BY-NC 4.0](https://creativecommons.org/licenses/by-nc/4.0/) license.
109
 
110
  ### Model
111
+ The models are developed based on [Qwen2.5-VL-3B-Instruct](https://huggingface.co/Qwen/Qwen2.5-VL-3B-Instruct) and [SANA1.5_1.6B_1024px](https://huggingface.co/Efficient-Large-Model/SANA1.5_1.6B_1024px). Please refer to their respective licenses for usage constraints.
112
 
113
  ## Citation
114
 
115
+ ```bibtex
116
  @misc{zeng2025drawinmindrebalancingdesignerpainterroles,
117
  title={Draw-In-Mind: Rebalancing Designer-Painter Roles in Unified Multimodal Models Benefits Image Editing},
118
  author={Ziyun Zeng and Junhao Zhang and Wei Li and Mike Zheng Shou},