Improve model card: Add pipeline tag, library name, paper link, and usage example
#1, opened by nielsr (HF Staff)

README.md CHANGED
---
license: mit
pipeline_tag: text-to-image
library_name: diffusers
---

# IMG: Calibrating Diffusion Models via Implicit Multimodal Guidance

This repository contains pretrained checkpoints for [IMG](https://huggingface.co/papers/2509.26231) (ICCV 2025).

**Paper:** [IMG: Calibrating Diffusion Models via Implicit Multimodal Guidance](https://huggingface.co/papers/2509.26231)

**Code:** Official PyTorch implementation: [https://github.com/SHI-Labs/IMG-Multimodal-Diffusion-Alignment](https://github.com/SHI-Labs/IMG-Multimodal-Diffusion-Alignment)

> **IMG: Calibrating Diffusion Models via Implicit Multimodal Guidance**
>
> [Jiayi Guo](https://www.jiayiguo.net)\*,
> [Chuanhao Yan](https://openreview.net/profile?id=~Chuanhao_Yan1)\*,
> [Xingqian Xu](https://scholar.google.com/citations?user=s1X82zMAAAAJ&hl=zh-CN&oi=ao),
> [Yulin Wang](https://openreview.net/profile?id=~Yulin_Wang1),
> [Kai Wang](https://kaiwang.com),
> [Gao Huang](https://www.gaohuang.net),
> [Humphrey Shi](https://www.humphreyshi.com)
## Abstract

Ensuring precise multimodal alignment between diffusion-generated images and input prompts has been a long-standing challenge. Earlier works finetune diffusion weights using high-quality preference data, which tends to be limited and difficult to scale up. Recent editing-based methods further refine local regions of generated images but may compromise overall image quality. In this work, we propose Implicit Multimodal Guidance (IMG), a novel re-generation-based multimodal alignment framework that requires no extra data or editing operations. Specifically, given a generated image and its prompt, IMG (a) utilizes a multimodal large language model (MLLM) to identify misalignments; (b) introduces an Implicit Aligner that manipulates diffusion conditioning features to reduce misalignments and enable re-generation; and (c) formulates the re-alignment goal as a trainable objective, namely the Iteratively Updated Preference Objective. Extensive qualitative and quantitative evaluations on SDXL, SDXL-DPO, and FLUX show that IMG outperforms existing alignment methods. Furthermore, IMG acts as a flexible plug-and-play adapter, seamlessly enhancing prior finetuning-based alignment methods.

<p align="center">
<img src="https://github.com/SHI-Labs/IMG-Multimodal-Diffusion-Alignment/raw/main/assets/teaser.png" width="1080px"/>
<br>
Our proposed Implicit Multimodal Guidance (IMG) framework mitigates prompt-image misalignment in aspects such as concept comprehension, aesthetic quality, object addition, and correction. In each case, both images are generated with the same random seed for a fair comparison.
</p>

## News

- [2025.09.30] Paper and code released!
- [2025.06.26] IMG was accepted to ICCV 2025!

## Overview

<p align="center">
<img src="https://github.com/SHI-Labs/IMG-Multimodal-Diffusion-Alignment/raw/main/assets/overview.png" width="1080px"/>
<br>
Given an initial image and its prompt, IMG first conducts an MLLM-driven misalignment analysis. It then uses an Implicit Aligner to translate the initial image features into better-aligned features according to the MLLM's guidance. Finally, the aligned image features are incorporated as new conditions to re-generate images with improved prompt-image alignment.
</p>
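The three-stage flow above can be sketched in plain Python. Everything below is an illustrative stand-in: the function names and toy logic are ours, not the repo's API; the real MLLM critic, Implicit Aligner, and diffusion re-generation live in the official GitHub implementation.

```python
# Illustrative sketch only; all names and logic are hypothetical stand-ins.

def analyze_misalignment(prompt: str, caption: str) -> list:
    """Stand-in for the MLLM critic: prompt words absent from the image caption."""
    present = set(caption.lower().split())
    return [w for w in prompt.lower().split() if w not in present]

def implicit_aligner(features, misalignments):
    """Stand-in for the Implicit Aligner: shift conditioning features
    in proportion to how much was found misaligned."""
    shift = 0.1 * len(misalignments)
    return [f + shift for f in features]

def regenerate(features):
    """Stand-in for diffusion re-generation from the new conditions."""
    return "image(conditions=%s)" % (features,)

prompt = "a red cube on a table"
caption = "a cube on a table"   # what the MLLM "sees" in the initial image
features = [0.0, 0.5, 1.0]      # stand-in conditioning features

issues = analyze_misalignment(prompt, caption)   # a) identify misalignments
aligned = implicit_aligner(features, issues)     # b) manipulate conditioning features
image = regenerate(aligned)                      # c) re-generate with new conditions
```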

## Quick Start

- Checkpoints

```python
from huggingface_hub import snapshot_download

save_dir = "ckpts"
repo_id = "shi-labs/IMG"

# Recent versions of huggingface_hub resume interrupted downloads automatically;
# the deprecated resume_download / local_dir_use_symlinks flags are not needed.
snapshot_download(repo_id=repo_id, local_dir=save_dir)
```
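After the snapshot completes, the demos below expect the checkpoints under `ckpts/`. A quick, dependency-free way to sanity-check what was downloaded (this assumes nothing about the exact file names, which are not listed here):

```python
import os

save_dir = "ckpts"

# Walk the download directory and collect (path, size) for every file;
# an empty listing means the snapshot_download step did not run or failed.
files = []
for root, _, names in os.walk(save_dir):
    for name in names:
        path = os.path.join(root, name)
        files.append((path, os.path.getsize(path)))

for path, size in sorted(files):
    print(f"{size:>12}  {path}")
```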

- For SDXL

```bash
# packages
conda create -n imgsdxl python=3.10 -y
conda activate imgsdxl
pip install --no-deps -r requirements_sdxl.txt

# gradio demo
python demo_sdxl.py
python demo_sdxl_dpo.py
```

- For FLUX

```bash
# packages
conda create -n imgflux python=3.10 -y
conda activate imgflux
pip install --no-deps -r requirements_flux.txt

# gradio demo (requires an A100 80GB or multiple GPUs)
python demo_flux.py
```

## Training

Coming soon.

## Acknowledgements

Our code is developed on top of [Diffusers](https://github.com/huggingface/diffusers), [LLaVA](https://github.com/haotian-liu/LLaVA), [IP-Adapter](https://github.com/tencent-ailab/IP-Adapter), and [x-FLUX](https://github.com/XLabs-AI/x-flux).

## Citation

If you find this repo helpful, please consider citing us.

```bibtex
@inproceedings{guo2025img,
  title={IMG: Calibrating Diffusion Models via Implicit Multimodal Guidance},
  author={Guo, Jiayi and Yan, Chuanhao and Xu, Xingqian and Wang, Yulin and Wang, Kai and Huang, Gao and Shi, Humphrey},
  booktitle={International Conference on Computer Vision},
  year={2025},
}
```