---
language:
- en
pipeline_tag: image-to-image
tags:
- image-editing
- text-guided-editing
- diffusion
- sana
- qwen-vl
- multimodal
- distilled
- cfg-distillation
base_model:
- iitolstykh/VIBE-Image-Edit
library_name: diffusers
---
# VIBE: Visual Instruction Based Editor
<div align="center">
<img src="VIBE.png" width="800" alt="VIBE"/>
</div>
<p align="center">
<a href="https://riko0.github.io/VIBE">🌐 Project Page</a> |
<a href="https://arxiv.org/abs/2601.02242">📜 Paper on arXiv</a> |
<a href="https://github.com/ai-forever/vibe">GitHub</a> |
<a href="https://huggingface.co/spaces/iitolstykh/VIBE-Image-Edit-DEMO">🤗 Space</a> |
<a href="https://huggingface.co/iitolstykh/VIBE-Image-Edit">🤗 VIBE-Image-Edit</a>
</p>
**VIBE-DistilledCFG** is a specialized version of the original [VIBE-Image-Edit](https://huggingface.co/iitolstykh/VIBE-Image-Edit) model.
This model can be run without classifier-free guidance (CFG), substantially reducing image generation time while maintaining high-quality outputs.
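
To see why dropping CFG saves time, recall that a guided sampler runs the denoiser more than once per step and blends the predictions. The sketch below shows the textbook blend on plain Python lists. The function name `cfg_combine` and the toy values are illustrative assumptions, not part of the `vibe` API, and the scale convention differs: in this textbook form a scale of 1.0 reduces to the conditional prediction, whereas the `vibe` parameters use 0.0 to disable guidance.

```python
def cfg_combine(uncond, cond, scale):
    """Textbook CFG blend: uncond + scale * (cond - uncond), element-wise."""
    return [u + scale * (c - u) for u, c in zip(uncond, cond)]

# Toy noise predictions from the two passes a guided sampler needs per step.
eps_uncond = [0.0, 1.0]
eps_cond = [1.0, 3.0]

guided = cfg_combine(eps_uncond, eps_cond, scale=2.0)
print(guided)  # [2.0, 5.0]
```

A CFG-distilled model is trained so that a single pass approximates the guided result, which removes the extra pass from every sampling step.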
## Performance Comparison
Below is a comparison of total inference time between the original VIBE model (with CFG) and this DistilledCFG model (without CFG). The distillation process yields a speedup of approximately **1.8x to 2x**.
| Resolution | Original Model (with CFG) | DistilledCFG Model (No CFG) |
| :--- | :--- | :--- |
| **1024x1024** | 1.1453s | **0.6389s** |
| **2048x2048** | 4.0837s | **1.9687s** |
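
The speedup per resolution can be checked directly from the table. This is a small arithmetic sketch; the dictionary layout and the `speedup` helper are illustrative names, and the timings are copied from the table above.

```python
# Total inference times in seconds: (original with CFG, distilled without CFG).
timings = {
    "1024x1024": (1.1453, 0.6389),
    "2048x2048": (4.0837, 1.9687),
}

def speedup(original_s, distilled_s):
    """Ratio of original to distilled inference time."""
    return original_s / distilled_s

speedups = {res: round(speedup(*pair), 2) for res, pair in timings.items()}
print(speedups)  # {'1024x1024': 1.79, '2048x2048': 2.07}
```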
## Model Details
- **Name:** VIBE-DistilledCFG
- **Parent Model:** [iitolstykh/VIBE-Image-Edit](https://huggingface.co/iitolstykh/VIBE-Image-Edit)
- **Task:** Text-Guided Image Editing
- **Architecture:**
  - **Diffusion Backbone:** Sana1.5 (1.6B parameters) with Linear Attention.
  - **Condition Encoder:** Qwen3-VL (2B parameters).
- **Technique:** Classifier-Free Guidance (CFG) Distillation.
- **Model Precision:** torch.bfloat16 (BF16)
- **Model Resolution:** Optimized for up to 2048px images.
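
Because the model is optimized for images up to 2048px, it may help to downscale larger inputs before editing. The helper below is a minimal sketch: the name `fit_within` is hypothetical, it assumes the 2048px limit applies to the longer side, and this card does not say whether the `vibe` editor already resizes inputs internally.

```python
def fit_within(width, height, max_side=2048):
    """Return (width, height) scaled down so the longer side is at most max_side."""
    longest = max(width, height)
    if longest <= max_side:
        return width, height  # already small enough; never upscale
    scale = max_side / longest
    return max(1, round(width * scale)), max(1, round(height * scale))

print(fit_within(4096, 3072))  # (2048, 1536)
print(fit_within(1024, 768))   # (1024, 768)

# With Pillow, the result can be passed straight to resize:
#   image = image.resize(fit_within(*image.size))
```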
## Features
- **Blazing Fast Inference:** Runs approximately 2x faster than the original model by skipping the guidance pass.
- **Text-Guided Editing:** Edit images using natural language instructions.
- **Compact & Efficient:** Retains the lightweight footprint of the original 1.6B/2B architecture.
- **Multimodal Understanding:** Powered by Qwen3-VL for precise instruction following.
- **Text-to-Image:** Also supports text-to-image generation.
## Inference Requirements
- `vibe` library
```bash
pip install git+https://github.com/ai-forever/VIBE
```
- Requirements for the `vibe` library:
```bash
pip install transformers==4.57.1 torchvision==0.21.0 torch==2.6.0 diffusers==0.33.1 loguru==0.7.3
```
## Quick Start
**Note:** When using this distilled model, please set `image_guidance_scale` and `guidance_scale` to 0.0 to disable CFG.
```python
from PIL import Image
import requests
from io import BytesIO
from huggingface_hub import snapshot_download
from vibe.editor import ImageEditor
# Download the model weights from the Hugging Face Hub
model_path = snapshot_download(
    repo_id="iitolstykh/VIBE-Image-Edit-DistilledCFG",
    repo_type="model",
)

# Load the model
# Note: both guidance scales are set to 0.0 to disable CFG for the distilled version
editor = ImageEditor(
    checkpoint_path=model_path,
    num_inference_steps=20,
    image_guidance_scale=0.0,
    guidance_scale=0.0,
    device="cuda:0",
)

# Download a test image
resp = requests.get('https://image.civitai.com/xG1nkqKTMzGDvpLrqFT7WA/3f58a82a-b4b4-40c3-a318-43f9350fcd02/original=true,quality=90/115610275.jpeg')
resp.raise_for_status()  # fail early if the download did not succeed
image = Image.open(BytesIO(resp.content))

# Generate the edited image
edited_image = editor.generate_edited_image(
    instruction="let this case swim in the river",
    conditioning_image=image,
    num_images_per_prompt=1,
)[0]

edited_image.save("edited_image.jpg", quality=100)
```
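
To compare latency on your own hardware against the Performance Comparison table, a simple wall-clock harness is enough. This is a sketch under assumptions: `time_call` is a hypothetical helper, and this card does not describe how the published timings were measured, so absolute numbers may not match.

```python
import time

def time_call(fn, warmup=1, repeats=3):
    """Return the mean wall-clock seconds of fn() over `repeats` runs after warm-up."""
    for _ in range(warmup):
        fn()  # warm-up runs are excluded from the measurement
    start = time.perf_counter()
    for _ in range(repeats):
        fn()
    return (time.perf_counter() - start) / repeats

# Stand-in workload so the sketch runs anywhere. Replace it with a call such as
#   lambda: editor.generate_edited_image(
#       instruction="...", conditioning_image=image, num_images_per_prompt=1)
mean_s = time_call(lambda: sum(range(100_000)))
print(f"mean latency: {mean_s:.4f}s")
```

The warm-up run matters on GPU, where the first call typically pays one-time initialization costs that would otherwise inflate the average.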
## License
This project is built upon SANA. Please refer to the original SANA license for usage terms:
[SANA License](https://huggingface.co/Efficient-Large-Model/SANA1.5_4.8B_1024px_diffusers/blob/main/LICENSE.txt)
## Citation
If you use this model in your research or applications, please acknowledge the original projects:
- [SANA 1.5: Efficient Scaling of Training-Time and Inference-Time Compute in Linear Diffusion Transformer](https://github.com/NVlabs/Sana)
- [Qwen3-VL](https://github.com/QwenLM/Qwen3-VL)
```bibtex
@misc{vibe2026,
Author = {Grigorii Alekseenko and Aleksandr Gordeev and Irina Tolstykh and Bulat Suleimanov and Vladimir Dokholyan and Georgii Fedorov and Sergey Yakubson and Aleksandra Tsybina and Mikhail Chernyshov and Maksim Kuprashevich},
Title = {VIBE: Visual Instruction Based Editor},
Year = {2026},
Eprint = {arXiv:2601.02242},
}
```