Improve model card: Add pipeline tag, library name, paper link, and usage example

#1
by nielsr HF Staff - opened
Files changed (1)
  1. README.md +92 -6
README.md CHANGED
@@ -1,14 +1,100 @@
  ---
  license: mit
  ---
- This repository contains pretrained checkpoints for [IMG](https://github.com/SHI-Labs/IMG-Multimodal-Diffusion-Alignment).
  
- > **IMG: Calibrating Diffusion Models via Implicit Multimodal Guidance, ICCV 2025**
- >
- > [Jiayi Guo](https://www.jiayiguo.net)\*,
- > [Chuanhao Yan](https://openreview.net/profile?id=~Chuanhao_Yan1)\*,
  > [Xingqian Xu](https://scholar.google.com/citations?user=s1X82zMAAAAJ&hl=zh-CN&oi=ao),
  > [Yulin Wang](https://openreview.net/profile?id=~Yulin_Wang1),
  > [Kai Wang](https://kaiwang.com),
  > [Gao Huang](https://www.gaohuang.net),
- > [Humphrey Shi](https://www.humphreyshi.com)
  ---
  license: mit
+ pipeline_tag: text-to-image
+ library_name: transformers
  ---
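
The metadata block above is plain YAML front matter. As a rough illustration of how the new `pipeline_tag` and `library_name` fields could be read back, here is a minimal sketch that assumes only flat `key: value` pairs (anything nested would need a real YAML parser). `read_front_matter` is a hypothetical helper, not part of any Hugging Face library.

```python
def read_front_matter(text: str) -> dict[str, str]:
    """Parse flat `key: value` pairs from a leading `---` block."""
    lines = text.splitlines()
    if not lines or lines[0].strip() != "---":
        return {}  # the text does not start with a front matter block
    meta: dict[str, str] = {}
    for line in lines[1:]:
        if line.strip() == "---":
            return meta  # closing delimiter: the block is complete
        key, sep, value = line.partition(":")
        if sep:
            meta[key.strip()] = value.strip()
    return {}  # no closing delimiter, so treat the block as invalid


# Sample card text mirroring the metadata added in this change.
card = """---
license: mit
pipeline_tag: text-to-image
library_name: transformers
---

# IMG: Calibrating Diffusion Models via Implicit Multimodal Guidance
"""

metadata = read_front_matter(card)
print(metadata)
```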
 
+ # IMG: Calibrating Diffusion Models via Implicit Multimodal Guidance
+
+ This repository contains pretrained checkpoints for [IMG](https://huggingface.co/papers/2509.26231) (ICCV 2025).
+
+ **Paper:** [IMG: Calibrating Diffusion Models via Implicit Multimodal Guidance](https://huggingface.co/papers/2509.26231)
+ **Code:** The official PyTorch implementation can be found on GitHub: [https://github.com/SHI-Labs/IMG-Multimodal-Diffusion-Alignment](https://github.com/SHI-Labs/IMG-Multimodal-Diffusion-Alignment)
+
+ > **IMG: Calibrating Diffusion Models via Implicit Multimodal Guidance.**
+ >
+ > [Jiayi Guo](https://www.jiayiguo.net)*,
+ > [Chuanhao Yan](https://openreview.net/profile?id=~Chuanhao_Yan1)*,
  > [Xingqian Xu](https://scholar.google.com/citations?user=s1X82zMAAAAJ&hl=zh-CN&oi=ao),
  > [Yulin Wang](https://openreview.net/profile?id=~Yulin_Wang1),
  > [Kai Wang](https://kaiwang.com),
  > [Gao Huang](https://www.gaohuang.net),
+ > [Humphrey Shi](https://www.humphreyshi.com)
+
+ ## Abstract
+ Ensuring precise multimodal alignment between diffusion-generated images and input prompts has been a long-standing challenge. Earlier works finetune diffusion weights using high-quality preference data, which tends to be limited and difficult to scale up. Recent editing-based methods further refine local regions of generated images but may compromise overall image quality. In this work, we propose Implicit Multimodal Guidance (IMG), a novel re-generation-based multimodal alignment framework that requires no extra data or editing operations. Specifically, given a generated image and its prompt, IMG a) utilizes a multimodal large language model (MLLM) to identify misalignments; b) introduces an Implicit Aligner that manipulates diffusion conditioning features to reduce misalignments and enable re-generation; and c) formulates the re-alignment goal into a trainable objective, namely Iteratively Updated Preference Objective. Extensive qualitative and quantitative evaluations on SDXL, SDXL-DPO, and FLUX show that IMG outperforms existing alignment methods. Furthermore, IMG acts as a flexible plug-and-play adapter, seamlessly enhancing prior finetuning-based alignment methods.
+
+ <p align="center">
+ <img src="https://github.com/SHI-Labs/IMG-Multimodal-Diffusion-Alignment/raw/main/assets/teaser.png" width="1080px"/>
+ Our proposed Implicit Multimodal Guidance (IMG) framework mitigates the prompt-image misalignment issues in various aspects such as concept comprehension, aesthetic quality, object addition, and correction. In each case, both images are generated with the same random seed for fair comparison.
+ </p>
+
+ ## News
+ - [2025.09.30] Paper and code released!
+ - [2025.06.26] IMG is accepted to ICCV 2025!
+
+ ## Overview
+
+ <p align="center">
+ <img src="https://github.com/SHI-Labs/IMG-Multimodal-Diffusion-Alignment/raw/main/assets/overview.png" width="1080px"/>
+ Given an initial image with its prompt, IMG begins by conducting an MLLM-driven misalignment analysis. Following this, IMG utilizes an Implicit Aligner to translate the initial image features into better-aligned features according to the MLLM's guidance. Finally, these aligned image features are incorporated as new conditions to re-generate images with improved prompt-image alignment.
+ </p>
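
The three stages described in the overview can be sketched as a single function. This is only a schematic of the control flow, not the repository's API: `encode`, `analyze`, `align`, and `regenerate` are hypothetical callables standing in for an image encoder, the MLLM, the Implicit Aligner, and the diffusion model, and the separate `encode` step is an assumption about where the initial image features come from.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Regeneration:
    prompt: str
    guidance: str                  # misalignment analysis from the MLLM stand-in
    aligned_features: list[float]  # output of the aligner stand-in
    image: str                     # placeholder for the re-generated image


def img_regenerate(
    prompt: str,
    initial_image: str,
    encode: Callable[[str], list[float]],
    analyze: Callable[[str, str], str],
    align: Callable[[list[float], str], list[float]],
    regenerate: Callable[[str, list[float]], str],
) -> Regeneration:
    # 1. MLLM-driven misalignment analysis of the initial image and its prompt.
    guidance = analyze(initial_image, prompt)
    # 2. Translate the initial image features into better-aligned features.
    features = encode(initial_image)
    aligned = align(features, guidance)
    # 3. Re-generate, using the aligned features as an extra condition.
    image = regenerate(prompt, aligned)
    return Regeneration(prompt, guidance, aligned, image)


# Toy stand-ins so the sketch runs end to end; real models would replace these.
result = img_regenerate(
    prompt="a red cube on a blue sphere",
    initial_image="initial.png",
    encode=lambda image: [0.0, 1.0],
    analyze=lambda image, prompt: "cube is green, expected red",
    align=lambda feats, guidance: [f + 0.5 for f in feats],
    regenerate=lambda prompt, feats: f"regenerated({prompt})",
)
print(result.image)
```

Passing the stages in as callables keeps the sketch independent of any particular backbone, which loosely mirrors the plug-and-play framing in the abstract.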
+
+ ## Quick Start
+
+ - Checkpoints
+ ```python
+ from huggingface_hub import snapshot_download
+
+ save_dir = "ckpts"
+ repo_id = "shi-labs/IMG"
+ cache_dir = save_dir + "/cache"
+
+ snapshot_download(cache_dir=cache_dir,
+   local_dir=save_dir,
+   repo_id=repo_id,
+   local_dir_use_symlinks=False,
+   resume_download=True,
+ )
+ ```
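
After the download finishes, it can be useful to check what actually landed in `ckpts`. The following is a minimal standard-library sketch; `summarize_checkpoints` is a hypothetical helper, and it assumes nothing about the checkpoint file names. It skips the `cache` directory used in the snippet above, plus `.cache`, which recent `huggingface_hub` versions may create inside `local_dir` (an assumption worth verifying for your installed version).

```python
from pathlib import Path


def summarize_checkpoints(
    save_dir: str, skip: tuple[str, ...] = ("cache", ".cache")
) -> dict[str, int]:
    """Map each downloaded file (path relative to save_dir) to its size in bytes."""
    root = Path(save_dir)
    sizes: dict[str, int] = {}
    for path in sorted(root.rglob("*")):
        rel = path.relative_to(root)
        # Ignore download-cache directories and anything that is not a regular file.
        if rel.parts[0] in skip or not path.is_file():
            continue
        sizes[rel.as_posix()] = path.stat().st_size
    return sizes


# Print a short listing only if the download directory exists.
if Path("ckpts").is_dir():
    for name, size in summarize_checkpoints("ckpts").items():
        print(f"{size / 1e6:10.1f} MB  {name}")
```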
+
+ - For SDXL
+ ```bash
+ # packages
+ conda create -n imgsdxl python=3.10 -y
+ conda activate imgsdxl
+ pip install --no-deps -r requirements_sdxl.txt
+ # gradio demo
+ python demo_sdxl.py
+ python demo_sdxl_dpo.py
+ ```
+
+ - For FLUX
+ ```bash
+ # packages
+ conda create -n imgflux python=3.10 -y
+ conda activate imgflux
+ pip install --no-deps -r requirements_flux.txt
+ # gradio demo (requires an A100 80GB or multiple GPUs)
+ python demo_flux.py
+ ```
+
+ ## Training
+ - Coming soon
+
+ ## Acknowledgements
+
+ Our code is built on top of [Diffusers](https://github.com/huggingface/diffusers), [LLaVA](https://github.com/haotian-liu/LLaVA), [IP-Adapter](https://github.com/tencent-ailab/IP-Adapter), and [x-FLUX](https://github.com/XLabs-AI/x-flux).
+
+ ## Citation
+
+ If you find this repo helpful, please consider citing us.
+
+ ```bibtex
+ @inproceedings{guo2025img,
+   title={IMG: Calibrating Diffusion Models via Implicit Multimodal Guidance},
+   author={Jiayi Guo and Chuanhao Yan and Xingqian Xu and Yulin Wang and Kai Wang and Gao Huang and Humphrey Shi},
+   booktitle={International Conference on Computer Vision},
+   year={2025},
+ }
+ ```