---
license: apache-2.0
library_name: diffusers
datasets:
- VisualCloze/Graph200K
pipeline_tag: image-to-image
tags:
- text-to-image
- image-to-image
- flux
- lora
- in-context-learning
- universal-image-generation
- ai-tools
---


# VisualCloze: A Universal Image Generation Framework via Visual In-Context Learning (Implementation with <strong><span style="color:red">Diffusers</span></strong>)

<div align="center">

[[Paper](https://arxiv.org/abs/2504.07960)] &emsp; [[Project Page](https://visualcloze.github.io/)] &emsp; [[Github](https://github.com/lzyhha/VisualCloze)]

</div>

<div align="center">

[[πŸ€— <strong><span style="color:hotpink">Diffusers</span></strong> Implementation](https://github.com/huggingface/diffusers/tree/main/src/diffusers/pipelines/visualcloze)] &emsp; [[πŸ€— LoRA Model Card for Diffusers](https://huggingface.co/VisualCloze/VisualClozePipeline-LoRA-512)]


</div>

<div align="center">

[[πŸ€— Online Demo](https://huggingface.co/spaces/VisualCloze/VisualCloze)] &emsp; [[πŸ€— Dataset Card](https://huggingface.co/datasets/VisualCloze/Graph200K)]

</div>

![Examples](https://github.com/lzyhha/VisualCloze/raw/main/figures/seen.jpg)

If you find VisualCloze helpful, please consider starring ⭐ the [<strong><span style="color:hotpink">Github Repo</span></strong>](https://github.com/lzyhha/VisualCloze). Thanks!

## πŸ“° News
- [2025-5-15] πŸ€—πŸ€—πŸ€— VisualCloze has been merged into the [<strong><span style="color:hotpink">official pipelines of diffusers</span></strong>](https://github.com/huggingface/diffusers/tree/main/src/diffusers/pipelines/visualcloze).
- [2025-5-18] πŸ₯³πŸ₯³πŸ₯³ We have released LoRA weights compatible with diffusers at [LoRA Model Card 384](https://huggingface.co/VisualCloze/VisualClozePipeline-LoRA-384) and [LoRA Model Card 512](https://huggingface.co/VisualCloze/VisualClozePipeline-LoRA-512).

## 🌠 Key Features

VisualCloze is a universal image generation framework based on visual in-context learning. It can:

1. Support various in-domain tasks.
2. Generalize to <strong><span style="color:hotpink">unseen tasks</span></strong> through in-context learning.
3. Unify multiple tasks into one step, generating both the target image and intermediate results.
4. Reverse-engineer a set of conditions from a target image.

πŸ”₯ Examples are shown on the [project page](https://visualcloze.github.io/).

## πŸ”§ Installation

<strong><span style="color:hotpink">You can install the official</span></strong> [diffusers](https://github.com/huggingface/diffusers.git) from source:

```bash
pip install git+https://github.com/huggingface/diffusers.git
```
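
After installing, you can verify that the build includes the VisualCloze pipeline; this only imports the class and downloads no weights:

```python
# Sanity check that the installed diffusers includes the VisualCloze pipeline.
from diffusers import VisualClozePipeline

print(VisualClozePipeline.__name__)
```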

### πŸ’» Diffusers Usage

[![Huggingface VisualCloze](https://img.shields.io/static/v1?label=Demo&message=Huggingface%20Gradio&color=orange)](https://huggingface.co/spaces/VisualCloze/VisualCloze)

This model provides the full parameters of VisualCloze. 
If the download size is too large, you can use the [LoRA version](https://huggingface.co/VisualCloze/VisualClozePipeline-LoRA-512) 
with FLUX.1-Fill-dev as the base model.

A model trained with a `resolution` of 384 is released at [Full Model Card 384](https://huggingface.co/VisualCloze/VisualClozePipeline-384) and [LoRA Model Card 384](https://huggingface.co/VisualCloze/VisualClozePipeline-LoRA-384), 
while this model uses a `resolution` of 512. The `resolution` is the size to which each image is resized before the images are concatenated, which avoids out-of-memory errors. To generate high-resolution images, we use SDEdit to upsample the generated results.
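
For intuition, here is a minimal sketch of the kind of resize that `resolution` implies. This is a conceptual illustration only; the actual preprocessing happens inside `VisualClozePipeline`, and its exact resizing rule may differ:

```python
from diffusers.utils import load_image

# Conceptual sketch, not the pipeline's internal code: assume an
# aspect-preserving resize so that the longer side matches `resolution`.
resolution = 512
image = load_image(
    "https://github.com/lzyhha/VisualCloze/raw/main/examples/examples/tryon/00700_00.jpg"
)
scale = resolution / max(image.size)
image = image.resize((round(image.width * scale), round(image.height * scale)))
print(image.size)  # both sides are now at most 512
```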

#### Example with Depth-to-Image:

<img src="./visualcloze_diffusers_example_depthtoimage.jpg" width="60%" height="50%" alt="Example with Depth-to-Image"/>

```python
import torch
from diffusers import VisualClozePipeline
from diffusers.utils import load_image


# Load in-context images (make sure the paths are correct and accessible)
image_paths = [
    # in-context examples
    [
        load_image('https://github.com/lzyhha/VisualCloze/raw/main/examples/examples/93bc1c43af2d6c91ac2fc966bf7725a2/93bc1c43af2d6c91ac2fc966bf7725a2_depth-anything-v2_Large.jpg'),
        load_image('https://github.com/lzyhha/VisualCloze/raw/main/examples/examples/93bc1c43af2d6c91ac2fc966bf7725a2/93bc1c43af2d6c91ac2fc966bf7725a2.jpg'),
    ],
    # query with the target image
    [
        load_image('https://github.com/lzyhha/VisualCloze/raw/main/examples/examples/79f2ee632f1be3ad64210a641c4e201b/79f2ee632f1be3ad64210a641c4e201b_depth-anything-v2_Large.jpg'),
        None,  # Placeholder for the target image to be generated
    ],
]

# Task and content prompt
task_prompt = "Each row outlines a logical process, starting from [IMAGE1] gray-based depth map with detailed object contours, to achieve [IMAGE2] an image with flawless clarity."
content_prompt = """A serene portrait of a young woman with long dark hair, wearing a beige dress with intricate 
gold embroidery, standing in a softly lit room. She holds a large bouquet of pale pink roses in a black box, 
positioned in the center of the frame. The background features a tall green plant to the left and a framed artwork 
on the wall to the right. A window on the left allows natural light to gently illuminate the scene. 
The woman gazes down at the bouquet with a calm expression. Soft natural lighting, warm color palette, 
high contrast, photorealistic, intimate, elegant, visually balanced, serene atmosphere."""

# Load the VisualClozePipeline
pipe = VisualClozePipeline.from_pretrained("VisualCloze/VisualClozePipeline-512", resolution=512, torch_dtype=torch.bfloat16)
pipe.to("cuda")

# Alternatively, load the pipeline with the LoRA weights on top of FLUX.1-Fill-dev
# pipe = VisualClozePipeline.from_pretrained("black-forest-labs/FLUX.1-Fill-dev", resolution=512, torch_dtype=torch.bfloat16)
# pipe.load_lora_weights('VisualCloze/VisualClozePipeline-LoRA-512', weight_name='visualcloze-lora-512.safetensors')
# pipe.to("cuda")

# Run the pipeline
image_result = pipe(
    task_prompt=task_prompt,
    content_prompt=content_prompt,
    image=image_paths,
    upsampling_width=1024,
    upsampling_height=1024,
    upsampling_strength=0.4,
    guidance_scale=30,
    num_inference_steps=30,
    max_sequence_length=512,
    generator=torch.Generator("cpu").manual_seed(0)
).images[0][0]

# Save the resulting image
image_result.save("visualcloze.png")
```
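
If the full model does not fit into GPU memory, the generic diffusers offloading helpers can replace `pipe.to("cuda")`. This sketch relies on the standard `DiffusionPipeline` API (`enable_model_cpu_offload`), not on anything VisualCloze-specific:

```python
import torch
from diffusers import VisualClozePipeline

pipe = VisualClozePipeline.from_pretrained(
    "VisualCloze/VisualClozePipeline-512", resolution=512, torch_dtype=torch.bfloat16
)
# Assumption: the generic DiffusionPipeline offloading API applies here.
# Submodules are moved to the GPU only while they run, lowering peak memory.
pipe.enable_model_cpu_offload()
```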

#### Example with Virtual Try-On:

<img src="./visualcloze_diffusers_example_tryon.jpg" width="60%" height="50%" alt="Example with Virtual Try-On"/>

```python
import torch
from diffusers import VisualClozePipeline
from diffusers.utils import load_image


# Load in-context images (make sure the paths are correct and accessible)
# The images are from the VITON-HD dataset at https://github.com/shadow2496/VITON-HD
image_paths = [
    # in-context examples
    [
        load_image('https://github.com/lzyhha/VisualCloze/raw/main/examples/examples/tryon/00700_00.jpg'),
        load_image('https://github.com/lzyhha/VisualCloze/raw/main/examples/examples/tryon/03673_00.jpg'),
        load_image('https://github.com/lzyhha/VisualCloze/raw/main/examples/examples/tryon/00700_00_tryon_catvton_0.jpg'),
    ],
    # query with the target image
    [
        load_image('https://github.com/lzyhha/VisualCloze/raw/main/examples/examples/tryon/00555_00.jpg'),
        load_image('https://github.com/lzyhha/VisualCloze/raw/main/examples/examples/tryon/12265_00.jpg'),
        None  # Placeholder for the target image to be generated
    ],
]

# Task and content prompt
task_prompt = "Each row shows a virtual try-on process that aims to put [IMAGE2] the clothing onto [IMAGE1] the person, producing [IMAGE3] the person wearing the new clothing."
content_prompt = None

# Load the VisualClozePipeline
pipe = VisualClozePipeline.from_pretrained("VisualCloze/VisualClozePipeline-512", resolution=512, torch_dtype=torch.bfloat16)
pipe.to("cuda")

# Alternatively, load the pipeline with the LoRA weights on top of FLUX.1-Fill-dev
# pipe = VisualClozePipeline.from_pretrained("black-forest-labs/FLUX.1-Fill-dev", resolution=512, torch_dtype=torch.bfloat16)
# pipe.load_lora_weights('VisualCloze/VisualClozePipeline-LoRA-512', weight_name='visualcloze-lora-512.safetensors')
# pipe.to("cuda")

# Run the pipeline
image_result = pipe(
    task_prompt=task_prompt,
    content_prompt=content_prompt,
    image=image_paths,
    upsampling_height=1632,
    upsampling_width=1232,
    upsampling_strength=0.3,
    guidance_scale=30,
    num_inference_steps=30,
    max_sequence_length=512,
    generator=torch.Generator("cpu").manual_seed(0)
).images[0][0]

# Save the resulting image
image_result.save("visualcloze.png")
```

## Citation

If you find VisualCloze useful for your research and applications, please cite using this BibTeX:

```bibtex
@article{li2025visualcloze,
  title={VisualCloze: A Universal Image Generation Framework via Visual In-Context Learning},
  author={Li, Zhong-Yu and Du, Ruoyi and Yan, Juncheng and Zhuo, Le and Wu, Qilong and Li, Zhen and Gao, Peng and Ma, Zhanyu and Cheng, Ming-Ming},
  journal={arXiv preprint arXiv:2504.07960},
  year={2025}
}
```