---
license: apache-2.0
library_name: diffusers
datasets:
- VisualCloze/Graph200K
pipeline_tag: image-to-image
tags:
- text-to-image
- image-to-image
- flux
- lora
- in-context-learning
- universal-image-generation
- ai-tools
---


# VisualCloze: A Universal Image Generation Framework via Visual In-Context Learning (Implementation with <strong><span style="color:red">Diffusers</span></strong>)

<div align="center">

[[Paper](https://arxiv.org/abs/2504.07960)] &emsp; [[Project Page](https://visualcloze.github.io/)] &emsp; [[Github](https://github.com/lzyhha/VisualCloze)]

</div>

<div align="center">

[[πŸ€— <strong><span style="color:hotpink">Diffusers</span></strong> Implementation](https://github.com/huggingface/diffusers/tree/main/src/diffusers/pipelines/visualcloze)] &emsp; [[πŸ€— LoRA Model Card for Diffusers](https://huggingface.co/VisualCloze/VisualClozePipeline-LoRA-512)]


</div>

<div align="center">

[[πŸ€— Online Demo](https://huggingface.co/spaces/VisualCloze/VisualCloze)] &emsp; [[πŸ€— Dataset Card](https://huggingface.co/datasets/VisualCloze/Graph200K)]

</div>

![Examples](https://github.com/lzyhha/VisualCloze/raw/main/figures/seen.jpg)

If you find VisualCloze helpful, please consider starring ⭐ the [<strong><span style="color:hotpink">Github Repo</span></strong>](https://github.com/lzyhha/VisualCloze). Thanks!

## πŸ“° News
- [2025-5-15] πŸ€—πŸ€—πŸ€— VisualCloze has been merged into the [<strong><span style="color:hotpink">official pipelines of diffusers</span></strong>](https://github.com/huggingface/diffusers/tree/main/src/diffusers/pipelines/visualcloze).
- [2025-5-18] πŸ₯³πŸ₯³πŸ₯³ We have released LoRA weights compatible with diffusers at [LoRA Model Card 384](https://huggingface.co/VisualCloze/VisualClozePipeline-LoRA-384) and [LoRA Model Card 512](https://huggingface.co/VisualCloze/VisualClozePipeline-LoRA-512).

## 🌠 Key Features

VisualCloze is a universal image generation framework based on visual in-context learning. It can:

1. Support various in-domain tasks.
2. Generalize to <strong><span style="color:hotpink">unseen tasks</span></strong> through in-context learning.
3. Unify multiple tasks into one step, generating both the target image and intermediate results.
4. Reverse-engineer a set of conditions from a target image.

πŸ”₯ Examples are shown on the [project page](https://visualcloze.github.io/).

## πŸ”§ Installation

<strong><span style="color:hotpink">You can install the official</span></strong> [diffusers](https://github.com/huggingface/diffusers.git) from source:

```bash
pip install git+https://github.com/huggingface/diffusers.git
```
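
After installing, you can verify that the build includes the VisualCloze pipeline; this only imports the class and downloads no weights:

```python
# Sanity check that the installed diffusers includes the VisualCloze pipeline.
from diffusers import VisualClozePipeline

print(VisualClozePipeline.__name__)
```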

### πŸ’» Diffusers Usage

[![Huggingface VisualCloze](https://img.shields.io/static/v1?label=Demo&message=Huggingface%20Gradio&color=orange)](https://huggingface.co/spaces/VisualCloze/VisualCloze)

This model provides the full parameters of VisualCloze. 
If the download size is too large, you can use the [LoRA version](https://huggingface.co/VisualCloze/VisualClozePipeline-LoRA-512) 
with FLUX.1-Fill-dev as the base model.

A model trained with a `resolution` of 384 is released at [Full Model Card 384](https://huggingface.co/VisualCloze/VisualClozePipeline-384) and [LoRA Model Card 384](https://huggingface.co/VisualCloze/VisualClozePipeline-LoRA-384), 
while this model uses a `resolution` of 512. The `resolution` is the size to which each image is resized before the images are concatenated, which avoids out-of-memory errors. To generate high-resolution images, we use SDEdit to upsample the generated results.
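
For intuition, here is a minimal sketch of the kind of resize that `resolution` implies. This is a conceptual illustration only; the actual preprocessing happens inside `VisualClozePipeline`, and its exact resizing rule may differ:

```python
from diffusers.utils import load_image

# Conceptual sketch, not the pipeline's internal code: assume an
# aspect-preserving resize so that the longer side matches `resolution`.
resolution = 512
image = load_image(
    "https://github.com/lzyhha/VisualCloze/raw/main/examples/examples/tryon/00700_00.jpg"
)
scale = resolution / max(image.size)
image = image.resize((round(image.width * scale), round(image.height * scale)))
print(image.size)  # both sides are now at most 512
```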

#### Example with Depth-to-Image:

<img src="./visualcloze_diffusers_example_depthtoimage.jpg" width="60%" height="50%" alt="Example with Depth-to-Image"/>

```python
import torch
from diffusers import VisualClozePipeline
from diffusers.utils import load_image


# Load in-context images (make sure the paths are correct and accessible)
image_paths = [
    # in-context examples
    [
        load_image('https://github.com/lzyhha/VisualCloze/raw/main/examples/examples/93bc1c43af2d6c91ac2fc966bf7725a2/93bc1c43af2d6c91ac2fc966bf7725a2_depth-anything-v2_Large.jpg'),
        load_image('https://github.com/lzyhha/VisualCloze/raw/main/examples/examples/93bc1c43af2d6c91ac2fc966bf7725a2/93bc1c43af2d6c91ac2fc966bf7725a2.jpg'),
    ],
    # query with the target image
    [
        load_image('https://github.com/lzyhha/VisualCloze/raw/main/examples/examples/79f2ee632f1be3ad64210a641c4e201b/79f2ee632f1be3ad64210a641c4e201b_depth-anything-v2_Large.jpg'),
        None,  # Placeholder for the target image to be generated
    ],
]

# Task and content prompt
task_prompt = "Each row outlines a logical process, starting from [IMAGE1] gray-based depth map with detailed object contours, to achieve [IMAGE2] an image with flawless clarity."
content_prompt = """A serene portrait of a young woman with long dark hair, wearing a beige dress with intricate 
gold embroidery, standing in a softly lit room. She holds a large bouquet of pale pink roses in a black box, 
positioned in the center of the frame. The background features a tall green plant to the left and a framed artwork 
on the wall to the right. A window on the left allows natural light to gently illuminate the scene. 
The woman gazes down at the bouquet with a calm expression. Soft natural lighting, warm color palette, 
high contrast, photorealistic, intimate, elegant, visually balanced, serene atmosphere."""

# Load the VisualClozePipeline
pipe = VisualClozePipeline.from_pretrained("VisualCloze/VisualClozePipeline-512", resolution=512, torch_dtype=torch.bfloat16)
pipe.to("cuda")

# Alternatively, load the pipeline with the LoRA weights on top of FLUX.1-Fill-dev
# pipe = VisualClozePipeline.from_pretrained("black-forest-labs/FLUX.1-Fill-dev", resolution=512, torch_dtype=torch.bfloat16)
# pipe.load_lora_weights('VisualCloze/VisualClozePipeline-LoRA-512', weight_name='visualcloze-lora-512.safetensors')
# pipe.to("cuda")

# Run the pipeline
image_result = pipe(
    task_prompt=task_prompt,
    content_prompt=content_prompt,
    image=image_paths,
    upsampling_width=1024,
    upsampling_height=1024,
    upsampling_strength=0.4,
    guidance_scale=30,
    num_inference_steps=30,
    max_sequence_length=512,
    generator=torch.Generator("cpu").manual_seed(0)
).images[0][0]

# Save the resulting image
image_result.save("visualcloze.png")
```
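
If the full model does not fit into GPU memory, the generic diffusers offloading helpers can replace `pipe.to("cuda")`. This sketch relies on the standard `DiffusionPipeline` API (`enable_model_cpu_offload`), not on anything VisualCloze-specific:

```python
import torch
from diffusers import VisualClozePipeline

pipe = VisualClozePipeline.from_pretrained(
    "VisualCloze/VisualClozePipeline-512", resolution=512, torch_dtype=torch.bfloat16
)
# Assumption: the generic DiffusionPipeline offloading API applies here.
# Submodules are moved to the GPU only while they run, lowering peak memory.
pipe.enable_model_cpu_offload()
```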

#### Example with Virtual Try-On:

<img src="./visualcloze_diffusers_example_tryon.jpg" width="60%" height="50%" alt="Example with Virtual Try-On"/>

```python
import torch
from diffusers import VisualClozePipeline
from diffusers.utils import load_image


# Load in-context images (make sure the paths are correct and accessible)
# The images are from the VITON-HD dataset at https://github.com/shadow2496/VITON-HD
image_paths = [
    # in-context examples
    [
        load_image('https://github.com/lzyhha/VisualCloze/raw/main/examples/examples/tryon/00700_00.jpg'),
        load_image('https://github.com/lzyhha/VisualCloze/raw/main/examples/examples/tryon/03673_00.jpg'),
        load_image('https://github.com/lzyhha/VisualCloze/raw/main/examples/examples/tryon/00700_00_tryon_catvton_0.jpg'),
    ],
    # query with the target image
    [
        load_image('https://github.com/lzyhha/VisualCloze/raw/main/examples/examples/tryon/00555_00.jpg'),
        load_image('https://github.com/lzyhha/VisualCloze/raw/main/examples/examples/tryon/12265_00.jpg'),
        None  # Placeholder for the target image to be generated
    ],
]

# Task and content prompt
task_prompt = "Each row shows a virtual try-on process that aims to put [IMAGE2] the clothing onto [IMAGE1] the person, producing [IMAGE3] the person wearing the new clothing."
content_prompt = None

# Load the VisualClozePipeline
pipe = VisualClozePipeline.from_pretrained("VisualCloze/VisualClozePipeline-512", resolution=512, torch_dtype=torch.bfloat16)
pipe.to("cuda")

# Alternatively, load the pipeline with the LoRA weights on top of FLUX.1-Fill-dev
# pipe = VisualClozePipeline.from_pretrained("black-forest-labs/FLUX.1-Fill-dev", resolution=512, torch_dtype=torch.bfloat16)
# pipe.load_lora_weights('VisualCloze/VisualClozePipeline-LoRA-512', weight_name='visualcloze-lora-512.safetensors')
# pipe.to("cuda")

# Run the pipeline
image_result = pipe(
    task_prompt=task_prompt,
    content_prompt=content_prompt,
    image=image_paths,
    upsampling_height=1632,
    upsampling_width=1232,
    upsampling_strength=0.3,
    guidance_scale=30,
    num_inference_steps=30,
    max_sequence_length=512,
    generator=torch.Generator("cpu").manual_seed(0)
).images[0][0]

# Save the resulting image
image_result.save("visualcloze.png")
```

## Citation

If you find VisualCloze useful for your research and applications, please cite using this BibTeX:

```bibtex
@article{li2025visualcloze,
  title={VisualCloze: A Universal Image Generation Framework via Visual In-Context Learning},
  author={Li, Zhong-Yu and Du, Ruoyi and Yan, Juncheng and Zhuo, Le and Wu, Qilong and Li, Zhen and Gao, Peng and Ma, Zhanyu and Cheng, Ming-Ming},
  journal={arXiv preprint arXiv:2504.07960},
  year={2025}
}
```