---
license: apache-2.0
library_name: diffusers
datasets:
- VisualCloze/Graph200K
pipeline_tag: image-to-image
tags:
- text-to-image
- image-to-image
- flux
- lora
- in-context-learning
- universal-image-generation
- ai-tools
---
# VisualCloze: A Universal Image Generation Framework via Visual In-Context Learning (Implementation with <strong><span style="color:red">Diffusers</span></strong>)
<div align="center">
[[Paper](https://arxiv.org/abs/2504.07960)]   [[Project Page](https://visualcloze.github.io/)]   [[Github](https://github.com/lzyhha/VisualCloze)]
</div>
<div align="center">
[[🤗 <strong><span style="color:hotpink">Diffusers</span></strong> Implementation](https://github.com/huggingface/diffusers/tree/main/src/diffusers/pipelines/visualcloze)]   [[🤗 LoRA Model Card for Diffusers](https://huggingface.co/VisualCloze/VisualClozePipeline-LoRA-384)]
</div>
<div align="center">
[[🤗 Online Demo](https://huggingface.co/spaces/VisualCloze/VisualCloze)]   [[🤗 Dataset Card](https://huggingface.co/datasets/VisualCloze/Graph200K)]
</div>

If you find VisualCloze helpful, please consider starring ⭐ the [<strong><span style="color:hotpink">Github Repo</span></strong>](https://github.com/lzyhha/VisualCloze). Thanks!
## 📰 News
- [2025-5-15] 🤗🤗🤗 VisualCloze has been merged into the [<strong><span style="color:hotpink">official pipelines of diffusers</span></strong>](https://github.com/huggingface/diffusers/tree/main/src/diffusers/pipelines/visualcloze).
- [2025-5-18] 🥳🥳🥳 We have released the LoRA weights supporting diffusers at [LoRA Model Card 384](https://huggingface.co/VisualCloze/VisualClozePipeline-LoRA-384) and [LoRA Model Card 512](https://huggingface.co/VisualCloze/VisualClozePipeline-LoRA-512).
## Key Features
An in-context learning based universal image generation framework.
1. Supports various in-domain tasks.
2. Generalizes to <strong><span style="color:hotpink">unseen tasks</span></strong> through in-context learning.
3. Unifies multiple tasks into one step, generating both the target image and intermediate results.
4. Supports reverse-engineering a set of conditions from a target image.
🔥 Examples are shown on the [project page](https://visualcloze.github.io/).
## 🔧 Installation
<strong><span style="color:hotpink">Install the official</span></strong> [diffusers](https://github.com/huggingface/diffusers.git) from source:
```bash
pip install git+https://github.com/huggingface/diffusers.git
```
### 💻 Diffusers Usage
This model card provides the full parameters of VisualCloze.
If the download size is too large, you can use the [LoRA version](https://huggingface.co/VisualCloze/VisualClozePipeline-LoRA-384)
with FLUX.1-Fill-dev as the base model.
This model uses a `resolution` of 384. Models trained with a `resolution` of 512 are released at [Full Model Card 512](https://huggingface.co/VisualCloze/VisualClozePipeline-512) and [LoRA Model Card 512](https://huggingface.co/VisualCloze/VisualClozePipeline-LoRA-512).
The `resolution` is the size each image is resized to before the images are concatenated, which keeps memory usage low enough to avoid out-of-memory errors.
To generate high-resolution images, we upsample the generated results with SDEdit.
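To see why `resolution` affects memory, here is a rough back-of-the-envelope sketch. It assumes every image in the grid is resized to about `resolution` x `resolution` pixels, which is a simplification and may not match the pipeline's actual preprocessing. The helper `grid_pixels` is hypothetical and not part of diffusers.

```python
def grid_pixels(rows: int, cols: int, resolution: int) -> int:
    """Approximate pixel count of a rows x cols grid of square images.

    Assumption: each image is resized to roughly resolution x resolution.
    """
    return rows * cols * resolution * resolution


# Depth-to-image layout: one in-context row plus one query row, two images each.
for resolution in (384, 512):
    print(resolution, grid_pixels(2, 2, resolution))
# prints:
# 384 589824
# 512 1048576
```

Under this assumption, the 512 grid has about 1.78 times as many pixels as the 384 grid, so the concatenated input grows quickly with both `resolution` and the number of in-context examples.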
#### Example with Depth-to-Image:
<img src="./visualcloze_diffusers_example_depthtoimage.jpg" width="60%" height="50%" alt="Example with Depth-to-Image"/>
```python
import torch
from diffusers import VisualClozePipeline
from diffusers.utils import load_image

# Load in-context images (make sure the paths are correct and accessible)
image_paths = [
    # in-context examples
    [
        load_image('https://github.com/lzyhha/VisualCloze/raw/main/examples/examples/93bc1c43af2d6c91ac2fc966bf7725a2/93bc1c43af2d6c91ac2fc966bf7725a2_depth-anything-v2_Large.jpg'),
        load_image('https://github.com/lzyhha/VisualCloze/raw/main/examples/examples/93bc1c43af2d6c91ac2fc966bf7725a2/93bc1c43af2d6c91ac2fc966bf7725a2.jpg'),
    ],
    # query with the target image
    [
        load_image('https://github.com/lzyhha/VisualCloze/raw/main/examples/examples/79f2ee632f1be3ad64210a641c4e201b/79f2ee632f1be3ad64210a641c4e201b_depth-anything-v2_Large.jpg'),
        None,  # No image needed for the query in this case
    ],
]

# Task and content prompt
task_prompt = "Each row outlines a logical process, starting from [IMAGE1] gray-based depth map with detailed object contours, to achieve [IMAGE2] an image with flawless clarity."
content_prompt = """A serene portrait of a young woman with long dark hair, wearing a beige dress with intricate
gold embroidery, standing in a softly lit room. She holds a large bouquet of pale pink roses in a black box,
positioned in the center of the frame. The background features a tall green plant to the left and a framed artwork
on the wall to the right. A window on the left allows natural light to gently illuminate the scene.
The woman gazes down at the bouquet with a calm expression. Soft natural lighting, warm color palette,
high contrast, photorealistic, intimate, elegant, visually balanced, serene atmosphere."""

# Load the VisualClozePipeline
pipe = VisualClozePipeline.from_pretrained("VisualCloze/VisualClozePipeline-384", resolution=384, torch_dtype=torch.bfloat16)
pipe.to("cuda")

# Loading the VisualClozePipeline via LoRA
# pipe = VisualClozePipeline.from_pretrained("black-forest-labs/FLUX.1-Fill-dev", resolution=384, torch_dtype=torch.bfloat16)
# pipe.load_lora_weights('VisualCloze/VisualClozePipeline-LoRA-384', weight_name='visualcloze-lora-384.safetensors')
# pipe.to("cuda")

# Run the pipeline
image_result = pipe(
    task_prompt=task_prompt,
    content_prompt=content_prompt,
    image=image_paths,
    upsampling_width=1024,
    upsampling_height=1024,
    upsampling_strength=0.4,
    guidance_scale=30,
    num_inference_steps=30,
    max_sequence_length=512,
    generator=torch.Generator("cpu").manual_seed(0)
).images[0][0]

# Save the resulting image
image_result.save("visualcloze.png")
```
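The `image` argument in the example above is a list of rows: each in-context row holds a complete example, and the last row is the query, with `None` marking the image to generate. A minimal sketch of a sanity check for that layout follows. `check_grid` is a hypothetical helper, not part of diffusers, and its rules are assumptions drawn from the examples in this card: equal row lengths, no `None` in in-context rows, and at least one `None` in the query row.

```python
from typing import Any, List, Optional


def check_grid(grid: List[List[Optional[Any]]]) -> int:
    """Validate a VisualCloze-style image grid and return its column count.

    Hypothetical helper; the rules below are assumptions based on the
    examples in this card, not checks performed by diffusers itself.
    """
    if not grid:
        raise ValueError("grid must contain at least a query row")
    n_cols = len(grid[0])
    # Every row must hold the same number of image slots.
    for i, row in enumerate(grid):
        if len(row) != n_cols:
            raise ValueError(f"row {i} has {len(row)} images, expected {n_cols}")
    # In-context rows are complete examples, so they carry no placeholders.
    for i, row in enumerate(grid[:-1]):
        if any(img is None for img in row):
            raise ValueError(f"in-context row {i} must not contain None")
    # The query row needs a None slot for the image to be generated.
    if not any(img is None for img in grid[-1]):
        raise ValueError("query row needs at least one None slot")
    return n_cols


# Strings stand in for PIL images here; only None-ness and row lengths matter.
print(check_grid([["depth_a", "image_a"], ["depth_b", None]]))  # prints 2
```

Running such a check before calling the pipeline catches layout mistakes, such as a missing condition image, before any model weights are loaded.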
#### Example with Virtual Try-On:
<img src="./visualcloze_diffusers_example_tryon.jpg" width="60%" height="50%" alt="Example with Virtual Try-On"/>
```python
import torch
from diffusers import VisualClozePipeline
from diffusers.utils import load_image

# Load in-context images (make sure the paths are correct and accessible)
# The images are from the VITON-HD dataset at https://github.com/shadow2496/VITON-HD
image_paths = [
    # in-context examples
    [
        load_image('https://github.com/lzyhha/VisualCloze/raw/main/examples/examples/tryon/00700_00.jpg'),
        load_image('https://github.com/lzyhha/VisualCloze/raw/main/examples/examples/tryon/03673_00.jpg'),
        load_image('https://github.com/lzyhha/VisualCloze/raw/main/examples/examples/tryon/00700_00_tryon_catvton_0.jpg'),
    ],
    # query with the target image
    [
        load_image('https://github.com/lzyhha/VisualCloze/raw/main/examples/examples/tryon/00555_00.jpg'),
        load_image('https://github.com/lzyhha/VisualCloze/raw/main/examples/examples/tryon/12265_00.jpg'),
        None,
    ],
]

# Task and content prompt
task_prompt = "Each row shows a virtual try-on process that aims to put [IMAGE2] the clothing onto [IMAGE1] the person, producing [IMAGE3] the person wearing the new clothing."
content_prompt = None

# Load the VisualClozePipeline
pipe = VisualClozePipeline.from_pretrained("VisualCloze/VisualClozePipeline-384", resolution=384, torch_dtype=torch.bfloat16)
pipe.to("cuda")

# Loading the VisualClozePipeline via LoRA
# pipe = VisualClozePipeline.from_pretrained("black-forest-labs/FLUX.1-Fill-dev", resolution=384, torch_dtype=torch.bfloat16)
# pipe.load_lora_weights('VisualCloze/VisualClozePipeline-LoRA-384', weight_name='visualcloze-lora-384.safetensors')
# pipe.to("cuda")

# Run the pipeline
image_result = pipe(
    task_prompt=task_prompt,
    content_prompt=content_prompt,
    image=image_paths,
    upsampling_height=1632,
    upsampling_width=1232,
    upsampling_strength=0.3,
    guidance_scale=30,
    num_inference_steps=30,
    max_sequence_length=512,
    generator=torch.Generator("cpu").manual_seed(0)
).images[0][0]

# Save the resulting image
image_result.save("visualcloze.png")
```
### Citation
If you find VisualCloze useful for your research and applications, please cite using this BibTeX:
```bibtex
@article{li2025visualcloze,
  title={VisualCloze: A Universal Image Generation Framework via Visual In-Context Learning},
  author={Li, Zhong-Yu and Du, Ruoyi and Yan, Juncheng and Zhuo, Le and Wu, Qilong and Li, Zhen and Gao, Peng and Ma, Zhanyu and Cheng, Ming-Ming},
  journal={arXiv preprint arXiv:2504.07960},
  year={2025}
}
```