---
language:
- en
pipeline_tag: image-to-image
tags:
- image-editing
- text-guided-editing
- diffusion
- sana
- qwen-vl
- multimodal
base_model:
- Efficient-Large-Model/SANA1.5_1.6B_1024px
- Qwen/Qwen3-VL-2B-Instruct
library_name: diffusers
---

# VIBE: Visual Instruction Based Editor

<div align="center">
  <img src="VIBE.png" width="800" alt="VIBE"/>
</div>

<p align="center">
  <a href="https://riko0.github.io/VIBE">🌐 Project Page</a> |
  <a href="https://arxiv.org/abs/2601.02242">📜 Paper on arXiv</a> |
  <a href="https://github.com/ai-forever/vibe">GitHub</a> |
  <a href="https://huggingface.co/spaces/iitolstykh/VIBE-Image-Edit-DEMO">🤗 Space</a>
</p>

**VIBE** is an open-source framework for text-guided image editing. It combines the efficient [Sana1.5-1.6B](https://github.com/NVlabs/Sana) diffusion model with the visual understanding of [Qwen3-VL-2B-Instruct](https://github.com/QwenLM/Qwen3-VL) to provide **exceptionally fast**, high-quality instruction-based image editing.

## Model Details

- **Name:** VIBE
- **Task:** Text-Guided Image Editing
- **Architecture:**
  - **Diffusion Backbone:** Sana1.5 (1.6B parameters) with Linear Attention.
  - **Condition Encoder:** Qwen3-VL (2B parameters) for multimodal understanding.
- **Framework:** Built on `diffusers` and `transformers`.
- **Model precision:** `torch.bfloat16` (BF16).
- **Model resolution:** Edits images up to 2048px, with multi-scale height and width.
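
The 2048px limit above suggests downscaling larger inputs before editing. The helper below is a minimal sketch of one way to do that, not part of the `vibe` library: the function name `fit_resolution` is hypothetical, and rounding each side down to a multiple of 32 is an assumption about what the model accepts, so check the library's own preprocessing before relying on it.

```python
from typing import Tuple


def fit_resolution(width: int, height: int,
                   max_side: int = 2048, multiple: int = 32) -> Tuple[int, int]:
    """Return a (width, height) whose longer side is at most `max_side`.

    The aspect ratio is preserved up to rounding. Each side is rounded down
    to a multiple of `multiple` (an assumed constraint, not a documented one).
    """
    if width <= 0 or height <= 0:
        raise ValueError("width and height must be positive")

    longest = max(width, height)
    if longest > max_side:
        # Integer arithmetic avoids floating-point rounding surprises.
        width = width * max_side // longest
        height = height * max_side // longest

    # Round down to the nearest multiple, but never below one multiple.
    new_w = max(multiple, width // multiple * multiple)
    new_h = max(multiple, height // multiple * multiple)
    return new_w, new_h


print(fit_resolution(4096, 2048))  # (2048, 1024)
```

The result can be passed to `PIL.Image.Image.resize` before handing the image to the editor.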

## Features

- **Text-Guided Editing:** Edit images using natural language instructions (e.g., "Add a cat on the sofa").
- **Compact & Efficient:** Combines a 1.6B parameter diffusion model with a 2B parameter encoder for a lightweight footprint.
- **High-Speed Inference:** Utilizes Sana1.5's linear attention mechanism for rapid generation.
- **Multimodal Understanding:** Qwen3-VL ensures strong alignment between visual content and text instructions.


## Inference Requirements

- Install the `vibe` library:
```bash
pip install git+https://github.com/ai-forever/VIBE
```
- Install the pinned dependencies of the `vibe` library:
```bash
pip install transformers==4.57.1 torchvision==0.21.0 torch==2.6.0 diffusers==0.33.1 loguru==0.7.3
```
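
Because the dependencies above are pinned to exact versions, it can help to confirm what is actually installed before loading the model. The snippet below is a sketch using only the standard library; the `check_pins` helper is hypothetical, and ignoring a local build suffix such as `+cu124` is an assumption about how strictly the pins need to match.

```python
from importlib.metadata import PackageNotFoundError, version

# Versions copied from the pip command above.
PINNED = {
    "transformers": "4.57.1",
    "torchvision": "0.21.0",
    "torch": "2.6.0",
    "diffusers": "0.33.1",
    "loguru": "0.7.3",
}


def check_pins(pins, get_version=version):
    """Return {package: installed_version_or_None} for every mismatch."""
    problems = {}
    for name, wanted in pins.items():
        try:
            found = get_version(name)
        except PackageNotFoundError:
            found = None  # package is not installed at all
        # Drop a local build tag such as "+cu124" before comparing.
        if found is None or found.split("+")[0] != wanted:
            problems[name] = found
    return problems


if __name__ == "__main__":
    for package, found in check_pins(PINNED).items():
        print(f"{package}: expected {PINNED[package]}, found {found}")
```

An empty result means every pinned package matches.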

## Quick Start

```python
from PIL import Image
import requests
from io import BytesIO
from huggingface_hub import snapshot_download

from vibe.editor import ImageEditor

# Download model
model_path = snapshot_download(
    repo_id="iitolstykh/VIBE-Image-Edit",
    repo_type="model",
)

# Load model
editor = ImageEditor(
    checkpoint_path=model_path,
    image_guidance_scale=1.2,
    guidance_scale=4.5,
    num_inference_steps=20,
    device="cuda:0",
)

# Download test image (fail early if the request does not succeed)
resp = requests.get('https://image.civitai.com/xG1nkqKTMzGDvpLrqFT7WA/3f58a82a-b4b4-40c3-a318-43f9350fcd02/original=true,quality=90/115610275.jpeg', timeout=60)
resp.raise_for_status()
image = Image.open(BytesIO(resp.content))

# Generate edited image
edited_image = editor.generate_edited_image(
    instruction="let this case swim in the river",
    conditioning_image=image,
    num_images_per_prompt=1,
)[0]

edited_image.save("edited_image.jpg", quality=100)
```
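
To try several instructions on the same image, the call shown above can be wrapped in a small loop. This is a sketch only: `edit_with_instructions` is a hypothetical helper, and it assumes nothing about the editor beyond the `generate_edited_image` keyword arguments used in the quick start.

```python
from typing import Any, Dict, Iterable


def edit_with_instructions(editor: Any, image: Any,
                           instructions: Iterable[str]) -> Dict[str, Any]:
    """Apply each instruction to the same source image.

    `editor` is expected to expose `generate_edited_image` with the keyword
    arguments from the quick start. Returns {instruction: first edited image}.
    """
    results = {}
    for instruction in instructions:
        outputs = editor.generate_edited_image(
            instruction=instruction,
            conditioning_image=image,
            num_images_per_prompt=1,
        )
        results[instruction] = outputs[0]  # one image requested per prompt
    return results
```

With the `editor` and `image` objects from the quick start, `edit_with_instructions(editor, image, ["Add a cat on the sofa"])` returns a dictionary whose values can be saved with `.save(...)`.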

## License

This project is built upon SANA. Please refer to the original SANA license for usage terms:
[SANA License](https://huggingface.co/Efficient-Large-Model/SANA1.5_4.8B_1024px_diffusers/blob/main/LICENSE.txt)

## Citation

If you use this model in your research or applications, please acknowledge the original projects:

- [SANA 1.5: Efficient Scaling of Training-Time and Inference-Time Compute in Linear Diffusion Transformer](https://github.com/NVlabs/Sana)
- [Qwen3-VL](https://github.com/QwenLM/Qwen3-VL)

```bibtex
@misc{vibe2026,
  Author = {Grigorii Alekseenko and Aleksandr Gordeev and Irina Tolstykh and Bulat Suleimanov and Vladimir Dokholyan and Georgii Fedorov and Sergey Yakubson and Aleksandra Tsybina and Mikhail Chernyshov and Maksim Kuprashevich},
  Title = {VIBE: Visual Instruction Based Editor},
  Year = {2026},
  Eprint = {arXiv:2601.02242},
}
```