---
language:
- en
pipeline_tag: image-to-image
tags:
- image-editing
- text-guided-editing
- diffusion
- sana
- qwen-vl
- multimodal
- distilled
- cfg-distillation
base_model:
- iitolstykh/VIBE-Image-Edit
library_name: diffusers
---

# VIBE: Visual Instruction Based Editor

<div align="center">
<img src="VIBE.png" width="800" alt="VIBE"/>
</div>

<p align="center">
<a href="https://riko0.github.io/VIBE">Project Page</a> |
<a href="https://arxiv.org/abs/2601.02242">Paper on arXiv</a> |
<a href="https://github.com/ai-forever/vibe">GitHub</a> |
<a href="https://huggingface.co/spaces/iitolstykh/VIBE-Image-Edit-DEMO">🤗 Space</a> |
<a href="https://huggingface.co/iitolstykh/VIBE-Image-Edit">🤗 VIBE-Image-Edit</a>
</p>

**VIBE-DistilledCFG** is a classifier-free guidance (CFG) distilled version of the original [VIBE-Image-Edit](https://huggingface.co/iitolstykh/VIBE-Image-Edit) model.

This model runs without classifier-free guidance, which substantially reduces image generation time while maintaining high-quality outputs.

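
To illustrate where the saving comes from, here is a minimal sketch of a guided denoising step versus a distilled one. It uses a toy `CountingDenoiser` class, which is a hypothetical stand-in and not the VIBE backbone, and it shows only the generic single-scale form of CFG. The actual VIBE pipeline exposes both `guidance_scale` and `image_guidance_scale`, so its guided path may combine predictions differently or batch them; treat this as an illustration under those assumptions.

```python
from typing import List, Sequence


class CountingDenoiser:
    """Toy stand-in for a diffusion backbone that counts its forward passes."""

    def __init__(self) -> None:
        self.calls = 0

    def __call__(self, latent: Sequence[float], condition: float) -> List[float]:
        self.calls += 1
        # Arbitrary toy prediction; a real model would run a transformer here.
        return [x * 0.5 + condition for x in latent]


def cfg_step(model, latent: Sequence[float], condition: float,
             guidance_scale: float) -> List[float]:
    """Generic CFG: one unconditional and one conditional pass, then extrapolate."""
    uncond = model(latent, 0.0)  # 0.0 plays the role of the "empty" condition
    cond = model(latent, condition)
    return [u + guidance_scale * (c - u) for u, c in zip(uncond, cond)]


def distilled_step(model, latent: Sequence[float], condition: float) -> List[float]:
    """Guidance-distilled model: a single conditional pass, no extrapolation."""
    return model(latent, condition)


guided = CountingDenoiser()
cfg_step(guided, [1.0, 2.0], 0.25, guidance_scale=4.5)

distilled = CountingDenoiser()
distilled_step(distilled, [1.0, 2.0], 0.25)

print(guided.calls, distilled.calls)  # prints "2 1"
```

In this toy setting the guided step costs two forward passes and the distilled step costs one, which matches the order of magnitude of the speedup reported for this model.
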
## Performance Comparison

The table below compares total inference time for the original VIBE model (with CFG) and this DistilledCFG model (without CFG). Distillation yields a speedup of approximately **1.8x at 1024x1024 and 2.1x at 2048x2048**.

| Resolution | Original Model (with CFG) | DistilledCFG Model (no CFG) |
| :--- | :--- | :--- |
| **1024x1024** | 1.1453s | **0.6389s** |
| **2048x2048** | 4.0837s | **1.9687s** |

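
The speedup factors follow directly from the timings in the table. A quick check, using only the numbers listed above (the variable and function names are illustrative):

```python
# Total inference time in seconds: (original with CFG, distilled without CFG).
timings_s = {
    "1024x1024": (1.1453, 0.6389),
    "2048x2048": (4.0837, 1.9687),
}


def speedup(original_s: float, distilled_s: float) -> float:
    """Return how many times faster the distilled model is."""
    return original_s / distilled_s


speedups = {res: round(speedup(orig, dist), 2) for res, (orig, dist) in timings_s.items()}
for res, factor in speedups.items():
    print(f"{res}: {factor}x")
# 1024x1024: 1.79x
# 2048x2048: 2.07x
```
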
## Model Details

- **Name:** VIBE-DistilledCFG
- **Parent Model:** [iitolstykh/VIBE-Image-Edit](https://huggingface.co/iitolstykh/VIBE-Image-Edit)
- **Task:** Text-Guided Image Editing
- **Architecture:**
  - **Diffusion Backbone:** Sana1.5 (1.6B parameters) with Linear Attention.
  - **Condition Encoder:** Qwen3-VL (2B parameters).
- **Technique:** Classifier-Free Guidance (CFG) Distillation.
- **Model precision:** torch.bfloat16 (BF16)
- **Model resolution:** Optimized for images up to 2048px.

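
Since the model is optimized for images up to 2048px, it can help to compute a target size before editing very large inputs. The helper below is a sketch under stated assumptions: `fit_within` is a hypothetical function that is not part of the `vibe` library, and snapping to a multiple of 32 is an assumption chosen for illustration, not a documented requirement. Whether `vibe` already resizes inputs internally is not stated here.

```python
from typing import Tuple


def fit_within(width: int, height: int, max_side: int = 2048,
               multiple: int = 32) -> Tuple[int, int]:
    """Scale (width, height) down so the longest side is at most max_side.

    Aspect ratio is preserved up to rounding, and each side is rounded down
    to a multiple of `multiple` (an assumed, not documented, constraint).
    """
    if width <= 0 or height <= 0:
        raise ValueError("width and height must be positive")

    longest = max(width, height)
    if longest > max_side:
        # Integer arithmetic avoids floating-point rounding surprises.
        width = width * max_side // longest
        height = height * max_side // longest

    def snap(value: int) -> int:
        return max(multiple, value // multiple * multiple)

    return snap(width), snap(height)


print(fit_within(4096, 2048))  # (2048, 1024)
print(fit_within(1024, 1024))  # (1024, 1024)
```

The returned size can then be passed to `PIL.Image.Image.resize` before the image is handed to the editor.
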
## Features

- **Blazing Fast Inference:** Runs approximately 2x faster than the original model by skipping the guidance pass.
- **Text-Guided Editing:** Edit images using natural language instructions.
- **Compact & Efficient:** Retains the lightweight footprint of the original 1.6B/2B architecture.
- **Multimodal Understanding:** Powered by Qwen3-VL for precise instruction following.
- **Text-to-Image** support.

## Inference Requirements

- The `vibe` library:
```bash
pip install git+https://github.com/ai-forever/VIBE
```
- Requirements for the `vibe` library:
```bash
pip install transformers==4.57.1 torchvision==0.21.0 torch==2.6.0 diffusers==0.33.1 loguru==0.7.3
```

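
Before running inference, it can be useful to confirm that the environment matches the pinned versions. The snippet below is a small sketch using the standard-library `importlib.metadata` module; `PINNED` and `find_mismatches` are illustrative names, and ignoring local version suffixes such as `+cu124` is a design choice of this sketch, not a rule from the `vibe` library.

```python
from importlib.metadata import PackageNotFoundError, version
from typing import Dict, Optional, Tuple

# Versions copied from the pip command above.
PINNED = {
    "transformers": "4.57.1",
    "torchvision": "0.21.0",
    "torch": "2.6.0",
    "diffusers": "0.33.1",
    "loguru": "0.7.3",
}


def find_mismatches(pinned: Dict[str, str]) -> Dict[str, Tuple[str, Optional[str]]]:
    """Map each package whose installed version differs to (wanted, installed)."""
    mismatches = {}
    for name, wanted in pinned.items():
        try:
            installed = version(name)
        except PackageNotFoundError:
            installed = None  # package is not installed at all
        # Compare only the part before a local suffix such as "+cu124".
        if installed is None or installed.split("+")[0] != wanted:
            mismatches[name] = (wanted, installed)
    return mismatches


for name, (wanted, installed) in find_mismatches(PINNED).items():
    print(f"{name}: expected {wanted}, found {installed}")
```

An empty result means every pinned package is installed at the expected version.
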
## Quick start

**Note:** When using this distilled model, set both `image_guidance_scale` and `guidance_scale` to 0.0 to disable CFG.

```python
from io import BytesIO

import requests
from huggingface_hub import snapshot_download
from PIL import Image

from vibe.editor import ImageEditor

# Download the model weights from the Hugging Face Hub
model_path = snapshot_download(
    repo_id="iitolstykh/VIBE-Image-Edit-DistilledCFG",
    repo_type="model",
)

# Load the model
# Note: both guidance scales are set to 0.0 to disable CFG for the distilled version
editor = ImageEditor(
    checkpoint_path=model_path,
    num_inference_steps=20,
    image_guidance_scale=0.0,
    guidance_scale=0.0,
    device="cuda:0",
)

# Download a test image
resp = requests.get(
    "https://image.civitai.com/xG1nkqKTMzGDvpLrqFT7WA/3f58a82a-b4b4-40c3-a318-43f9350fcd02/original=true,quality=90/115610275.jpeg",
    timeout=60,
)
resp.raise_for_status()  # fail early if the download did not succeed
image = Image.open(BytesIO(resp.content))

# Generate the edited image (the call returns a list; take the first result)
edited_image = editor.generate_edited_image(
    instruction="let this case swim in the river",
    conditioning_image=image,
    num_images_per_prompt=1,
)[0]

edited_image.save("edited_image.jpg", quality=100)
```

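
To apply several edits in one session, the same `generate_edited_image` call can be wrapped in a loop. The helper below is a sketch: `edit_many` is a hypothetical function that is not part of the `vibe` library, and it relies only on the call signature shown in the quick start above.

```python
from typing import Any, Iterable, List, Tuple


def edit_many(editor: Any, jobs: Iterable[Tuple[Any, str]]) -> List[Any]:
    """Run one edit per (image, instruction) pair and collect the first result of each.

    `editor` is expected to expose `generate_edited_image` with the keyword
    arguments used in the quick start example.
    """
    results = []
    for image, instruction in jobs:
        outputs = editor.generate_edited_image(
            instruction=instruction,
            conditioning_image=image,
            num_images_per_prompt=1,
        )
        results.append(outputs[0])
    return results
```

For example, `edit_many(editor, [(image, "add snow"), (image, "make it night")])` would return two edited images, reusing the already loaded `editor` so the model is not reloaded between edits.
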
## License

This project is built upon SANA. Please refer to the original SANA license for usage terms:
[SANA License](https://huggingface.co/Efficient-Large-Model/SANA1.5_4.8B_1024px_diffusers/blob/main/LICENSE.txt)

## Citation

If you use this model in your research or applications, please acknowledge the original projects:

- [SANA 1.5: Efficient Scaling of Training-Time and Inference-Time Compute in Linear Diffusion Transformer](https://github.com/NVlabs/Sana)
- [Qwen3-VL](https://github.com/QwenLM/Qwen3-VL)

```bibtex
@misc{vibe2026,
  Author = {Grigorii Alekseenko and Aleksandr Gordeev and Irina Tolstykh and Bulat Suleimanov and Vladimir Dokholyan and Georgii Fedorov and Sergey Yakubson and Aleksandra Tsybina and Mikhail Chernyshov and Maksim Kuprashevich},
  Title = {VIBE: Visual Instruction Based Editor},
  Year = {2026},
  Eprint = {arXiv:2601.02242},
}
```