---
license: apache-2.0
library_name: diffusers
datasets:
- VisualCloze/Graph200K
pipeline_tag: image-to-image
tags:
- text-to-image
- image-to-image
- flux
- lora
- in-context-learning
- universal-image-generation
- ai-tools
---


# VisualCloze: A Universal Image Generation Framework via Visual In-Context Learning (Implementation with <strong><span style="color:red">Diffusers</span></strong>)

<div align="center">

[[Paper](https://arxiv.org/abs/2504.07960)] &emsp; [[Project Page](https://visualcloze.github.io/)] &emsp; [[Github](https://github.com/lzyhha/VisualCloze)]

</div>

<div align="center">

[[🤗 <strong><span style="color:hotpink">Diffusers</span></strong> Implementation](https://github.com/huggingface/diffusers/tree/main/src/diffusers/pipelines/visualcloze)] &emsp; [[🤗 LoRA Model Card for Diffusers]](https://huggingface.co/VisualCloze/VisualClozePipeline-LoRA-384)

</div>

<div align="center">

[[🤗 Online Demo](https://huggingface.co/spaces/VisualCloze/VisualCloze)] &emsp; [[🤗 Dataset Card](https://huggingface.co/datasets/VisualCloze/Graph200K)]

</div>

![Examples](https://github.com/lzyhha/VisualCloze/raw/main/figures/seen.jpg)

If you find VisualCloze helpful, please consider starring ⭐ the [<strong><span style="color:hotpink">Github Repo</span></strong>](https://github.com/lzyhha/VisualCloze). Thanks!

## 📰 News
- [2025-5-15] 🤗🤗🤗 VisualCloze has been merged into the [<strong><span style="color:hotpink">official pipelines of diffusers</span></strong>](https://github.com/huggingface/diffusers/tree/main/src/diffusers/pipelines/visualcloze).
- [2025-5-18] 🥳🥳🥳 We have released the LoRA weights supporting diffusers at [LoRA Model Card 384](https://huggingface.co/VisualCloze/VisualClozePipeline-LoRA-384) and [LoRA Model Card 512](https://huggingface.co/VisualCloze/VisualClozePipeline-LoRA-512).

## 🌠 Key Features

VisualCloze is a universal image generation framework based on in-context learning. It can:

1. Support a variety of in-domain tasks.
2. Generalize to <strong><span style="color:hotpink"> unseen tasks</span></strong> through in-context learning.
3. Unify multiple tasks into a single step, generating both the target image and the intermediate results.
4. Reverse-engineer a set of conditions from a target image.

🔥 Examples are shown on the [project page](https://visualcloze.github.io/).

## 🔧 Installation

<strong><span style="color:hotpink">Install the official </span></strong> [diffusers](https://github.com/huggingface/diffusers.git) from source:

```bash
pip install git+https://github.com/huggingface/diffusers.git
```

### 💻 Diffusers Usage

[![Huggingface VisualCloze](https://img.shields.io/static/v1?label=Demo&message=Huggingface%20Gradio&color=orange)](https://huggingface.co/spaces/VisualCloze/VisualCloze)

This repository provides the full model weights of VisualCloze.
If the download is too large, you can use the [LoRA version](https://huggingface.co/VisualCloze/VisualClozePipeline-LoRA-384)
with FLUX.1-Fill-dev as the base model.

This model uses a `resolution` of 384. A model trained with a `resolution` of 512 is available at [Full Model Card 512](https://huggingface.co/VisualCloze/VisualClozePipeline-512) and [LoRA Model Card 512](https://huggingface.co/VisualCloze/VisualClozePipeline-LoRA-512).
The `resolution` is the size to which each image is resized before the images are concatenated, which limits memory use and avoids out-of-memory errors.
To produce high-resolution images, the generated results are then upsampled with SDEdit.
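To make the memory argument concrete, the sketch below estimates how many latent tokens a grid of images contributes at each `resolution`. It assumes square `resolution` × `resolution` images and one token per 16×16 pixel block (8× VAE downsampling followed by 2×2 patching, as in FLUX). The actual pipeline may resize differently, for example by preserving aspect ratio, so treat the numbers as rough estimates. The function name `grid_tokens` is hypothetical and not part of diffusers.

```python
def grid_tokens(resolution: int, rows: int, cols: int, pixels_per_token: int = 16) -> int:
    """Approximate number of latent tokens for a rows x cols grid of square images.

    Assumption: each image is resolution x resolution pixels, and every
    pixels_per_token x pixels_per_token block of pixels becomes one token.
    """
    side = resolution // pixels_per_token  # tokens along one image edge
    return rows * cols * side * side


if __name__ == "__main__":
    for resolution in (384, 512):
        # 2 x 2 grid: one in-context example row plus the query row, two images each
        print(resolution, grid_tokens(resolution, rows=2, cols=2))
```

Under these assumptions a 2×2 grid contributes 2304 tokens at 384 and 4096 tokens at 512, which is why the lower resolution is the safer choice on memory-constrained GPUs.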

#### Example with Depth-to-Image:

<img src="./visualcloze_diffusers_example_depthtoimage.jpg" width="60%" height="50%" alt="Example with Depth-to-Image"/>

```python
import torch
from diffusers import VisualClozePipeline
from diffusers.utils import load_image


# Load in-context images (make sure the paths are correct and accessible)
image_paths = [
    # in-context examples
    [
        load_image('https://github.com/lzyhha/VisualCloze/raw/main/examples/examples/93bc1c43af2d6c91ac2fc966bf7725a2/93bc1c43af2d6c91ac2fc966bf7725a2_depth-anything-v2_Large.jpg'),
        load_image('https://github.com/lzyhha/VisualCloze/raw/main/examples/examples/93bc1c43af2d6c91ac2fc966bf7725a2/93bc1c43af2d6c91ac2fc966bf7725a2.jpg'),
    ],
    # query row: the condition image plus a placeholder for the target
    [
        load_image('https://github.com/lzyhha/VisualCloze/raw/main/examples/examples/79f2ee632f1be3ad64210a641c4e201b/79f2ee632f1be3ad64210a641c4e201b_depth-anything-v2_Large.jpg'),
        None,  # target image to be generated
    ],
]

# Task and content prompt
task_prompt = "Each row outlines a logical process, starting from [IMAGE1] gray-based depth map with detailed object contours, to achieve [IMAGE2] an image with flawless clarity."
content_prompt = """A serene portrait of a young woman with long dark hair, wearing a beige dress with intricate 
gold embroidery, standing in a softly lit room. She holds a large bouquet of pale pink roses in a black box, 
positioned in the center of the frame. The background features a tall green plant to the left and a framed artwork 
on the wall to the right. A window on the left allows natural light to gently illuminate the scene. 
The woman gazes down at the bouquet with a calm expression. Soft natural lighting, warm color palette, 
high contrast, photorealistic, intimate, elegant, visually balanced, serene atmosphere."""

# Load the VisualClozePipeline
pipe = VisualClozePipeline.from_pretrained("VisualCloze/VisualClozePipeline-384", resolution=384, torch_dtype=torch.bfloat16)
pipe.to("cuda")

# Alternative: load FLUX.1-Fill-dev as the base model and apply the VisualCloze LoRA weights
# pipe = VisualClozePipeline.from_pretrained("black-forest-labs/FLUX.1-Fill-dev", resolution=384, torch_dtype=torch.bfloat16)
# pipe.load_lora_weights('VisualCloze/VisualClozePipeline-LoRA-384', weight_name='visualcloze-lora-384.safetensors')
# pipe.to("cuda")

# Run the pipeline
image_result = pipe(
    task_prompt=task_prompt,
    content_prompt=content_prompt,
    image=image_paths,
    upsampling_width=1024,
    upsampling_height=1024,
    upsampling_strength=0.4,
    guidance_scale=30,
    num_inference_steps=30,
    max_sequence_length=512,
    generator=torch.Generator("cpu").manual_seed(0)
).images[0][0]

# Save the resulting image
image_result.save("visualcloze.png")
```
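The `image` argument in the example above is a nested list: each inner list is one row, the leading rows are complete in-context examples, and the last row is the query, with `None` marking the image to generate. The helper below is a minimal sketch that checks this layout before an expensive pipeline call. The rules it enforces are inferred from the examples in this card rather than from the library source, and `check_image_grid` is a hypothetical name, not a diffusers function.

```python
from typing import Any, List, Optional, Tuple


def check_image_grid(grid: List[List[Optional[Any]]]) -> Tuple[int, int, List[int]]:
    """Validate a nested image list and return (rows, cols, target_columns).

    Assumed convention (taken from the examples, not from the library source):
    - every row has the same number of entries;
    - all rows except the last are complete in-context examples (no None);
    - the last row is the query, with None marking each image to generate.
    """
    if not grid or not grid[0]:
        raise ValueError("grid must contain at least one non-empty row")
    cols = len(grid[0])
    for index, row in enumerate(grid):
        if len(row) != cols:
            raise ValueError(f"row {index} has {len(row)} entries, expected {cols}")
    for index, row in enumerate(grid[:-1]):
        if any(item is None for item in row):
            raise ValueError(f"in-context row {index} must not contain None")
    # Columns of the query row that the model is asked to fill in
    targets = [i for i, item in enumerate(grid[-1]) if item is None]
    if not targets:
        raise ValueError("the query row needs at least one None target")
    return len(grid), cols, targets
```

For a quick check without loading any images, strings can stand in for the image objects: `check_image_grid([["depth_a", "photo_a"], ["depth_b", None]])` returns `(2, 2, [1])`.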


#### Example with Virtual Try-On:

<img src="./visualcloze_diffusers_example_tryon.jpg" width="60%" height="50%" alt="Example with Virtual Try-On"/>

```python
import torch
from diffusers import VisualClozePipeline
from diffusers.utils import load_image


# Load in-context images (make sure the paths are correct and accessible)
# The images are from the VITON-HD dataset at https://github.com/shadow2496/VITON-HD
image_paths = [
    # in-context examples
    [
        load_image('https://github.com/lzyhha/VisualCloze/raw/main/examples/examples/tryon/00700_00.jpg'),
        load_image('https://github.com/lzyhha/VisualCloze/raw/main/examples/examples/tryon/03673_00.jpg'),
        load_image('https://github.com/lzyhha/VisualCloze/raw/main/examples/examples/tryon/00700_00_tryon_catvton_0.jpg'),
    ],
    # query row: person and clothing images plus a placeholder for the target
    [
        load_image('https://github.com/lzyhha/VisualCloze/raw/main/examples/examples/tryon/00555_00.jpg'),
        load_image('https://github.com/lzyhha/VisualCloze/raw/main/examples/examples/tryon/12265_00.jpg'),
        None,  # target image to be generated
    ],
]

# Task and content prompt
task_prompt = "Each row shows a virtual try-on process that aims to put [IMAGE2] the clothing onto [IMAGE1] the person, producing [IMAGE3] the person wearing the new clothing."
content_prompt = None

# Load the VisualClozePipeline
pipe = VisualClozePipeline.from_pretrained("VisualCloze/VisualClozePipeline-384", resolution=384, torch_dtype=torch.bfloat16)
pipe.to("cuda")

# Alternative: load FLUX.1-Fill-dev as the base model and apply the VisualCloze LoRA weights
# pipe = VisualClozePipeline.from_pretrained("black-forest-labs/FLUX.1-Fill-dev", resolution=384, torch_dtype=torch.bfloat16)
# pipe.load_lora_weights('VisualCloze/VisualClozePipeline-LoRA-384', weight_name='visualcloze-lora-384.safetensors')
# pipe.to("cuda")

# Run the pipeline
image_result = pipe(
    task_prompt=task_prompt,
    content_prompt=content_prompt,
    image=image_paths,
    upsampling_height=1632,
    upsampling_width=1232,
    upsampling_strength=0.3,
    guidance_scale=30,
    num_inference_steps=30,
    max_sequence_length=512,
    generator=torch.Generator("cpu").manual_seed(0)
).images[0][0]

# Save the resulting image
image_result.save("visualcloze.png")
```
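In both examples the task prompt refers to each grid column with an `[IMAGEn]` placeholder, so a two-column grid uses `[IMAGE1]` and `[IMAGE2]` and the three-column try-on grid also uses `[IMAGE3]`. The sketch below checks that a prompt names every column exactly once in that numbering. This one-placeholder-per-column convention is inferred from the examples here and is not something the library is known to enforce; the function names are hypothetical.

```python
import re
from typing import List


def image_placeholders(task_prompt: str) -> List[int]:
    """Return the sorted, de-duplicated indices used in [IMAGEn] placeholders."""
    return sorted({int(n) for n in re.findall(r"\[IMAGE(\d+)\]", task_prompt)})


def prompt_matches_columns(task_prompt: str, num_columns: int) -> bool:
    """True if the prompt mentions exactly [IMAGE1] .. [IMAGE<num_columns>]."""
    return image_placeholders(task_prompt) == list(range(1, num_columns + 1))


if __name__ == "__main__":
    tryon_prompt = (
        "Each row shows a virtual try-on process that aims to put [IMAGE2] the clothing "
        "onto [IMAGE1] the person, producing [IMAGE3] the person wearing the new clothing."
    )
    print(image_placeholders(tryon_prompt))
    print(prompt_matches_columns(tryon_prompt, num_columns=3))
```

Placeholders may appear in any order in the sentence, as in the try-on prompt, because only the set of indices is compared.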

### Citation

If you find VisualCloze useful for your research and applications, please cite using this BibTeX:

```bibtex
@InProceedings{Li_2025_ICCV,
    author    = {Li, Zhong-Yu and Du, Ruoyi and Yan, Juncheng and Zhuo, Le and Li, Zhen and Gao, Peng and Ma, Zhanyu and Cheng, Ming-Ming},
    title     = {VisualCloze: A Universal Image Generation Framework via Visual In-Context Learning},
    booktitle = {Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV)},
    month     = {October},
    year      = {2025},
    pages     = {18969-18979}
}
```