+---
+language:
+- en
+pipeline_tag: image-to-image
+tags:
+- image-editing
+- text-guided-editing
+- diffusion
+- sana
+- qwen-vl
+- multimodal
+- distilled
+- cfg-distillation
+base_model:
+- iitolstykh/VIBE-Image-Edit
+- Efficient-Large-Model/SANA1.5_1.6B_1024px
+- Qwen/Qwen3-VL-2B-Instruct
+library_name: diffusers
+---
+# VIBE: Visual Instruction Based Editor
+<div align="center">
+  <img src="VIBE.png" width="800" alt="VIBE"/>
+</div>
+<p align="center">
+  <a href="https://riko0.github.io/VIBE">🌐 Project Page</a> |
+  <a href="https://arxiv.org/abs/2601.02242">📜 Paper on arXiv</a> |
+  <a href="https://github.com/ai-forever/vibe">GitHub</a> |
+  <a href="https://huggingface.co/spaces/iitolstykh/VIBE-Image-Edit-DEMO">🤗 Space</a> |
+  <a href="https://huggingface.co/iitolstykh/VIBE-Image-Edit">🤗 VIBE-Image-Edit</a>
+</p>
+**VIBE-DistilledCFG** is a CFG-distilled version of the original [VIBE-Image-Edit](https://huggingface.co/iitolstykh/VIBE-Image-Edit) model.
+It runs without classifier-free guidance (CFG), which substantially reduces image generation time while maintaining high-quality outputs.
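
To illustrate why dropping classifier-free guidance saves time, here is a toy sketch of the standard CFG update, which combines an unconditional and a conditional prediction at every step. The names `toy_denoiser`, `cfg_step`, and `distilled_step` are hypothetical and are not part of the `vibe` library, the guidance scale of 4.5 is an arbitrary value, and the exact guidance formulation used by the original VIBE model is not described in this card.

```python
# Toy illustration only: `toy_denoiser` stands in for a diffusion backbone
# and is NOT part of the vibe library.
calls = 0


def toy_denoiser(latent, condition):
    """Pretend noise prediction that counts how many forward passes are made."""
    global calls
    calls += 1
    shift = 0.0 if condition is None else 0.5
    return [x * 0.1 + shift for x in latent]


def cfg_step(latent, condition, guidance_scale):
    """Standard CFG: two forward passes, then extrapolate between them."""
    uncond = toy_denoiser(latent, None)
    cond = toy_denoiser(latent, condition)
    return [u + guidance_scale * (c - u) for u, c in zip(uncond, cond)]


def distilled_step(latent, condition):
    """A guidance-distilled model needs only the single conditional pass."""
    return toy_denoiser(latent, condition)


latent = [1.0, -2.0, 0.5]

cfg_step(latent, "make it snowy", guidance_scale=4.5)
cfg_calls = calls

calls = 0
distilled_step(latent, "make it snowy")
distilled_calls = calls

print(cfg_calls, distilled_calls)  # 2 1
```

Halving the number of backbone evaluations per step is what makes a speedup of up to roughly 2x plausible; fixed costs such as encoding the instruction and decoding the image are not halved.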
+## Performance Comparison
+Below is a comparison of total inference time between the original VIBE model (using CFG) and this DistilledCFG model (without CFG). Distillation yields an approximately **1.8x to 2.1x speedup** (1.79x at 1024x1024, 2.07x at 2048x2048).
+| Resolution | Original Model (with CFG) | DistilledCFG Model (No CFG) |
+| :--- | :--- | :--- |
+| **1024x1024** | 1.1453s | **0.6389s** |
+| **2048x2048** | 4.0837s | **1.9687s** |
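
The speedup can be recomputed directly from the timings in the table. The short sketch below divides the original time by the distilled time for each resolution; the variable and function names are illustrative only.

```python
# Timings in seconds, copied from the table above: (original, distilled).
timings = {
    "1024x1024": (1.1453, 0.6389),
    "2048x2048": (4.0837, 1.9687),
}


def speedup(original_s, distilled_s):
    """Ratio of original inference time to distilled inference time."""
    return original_s / distilled_s


speedups = {res: speedup(*pair) for res, pair in timings.items()}

for res, ratio in speedups.items():
    print(f"{res}: {ratio:.2f}x")
# 1024x1024: 1.79x
# 2048x2048: 2.07x
```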
+## Model Details
+- **Name:** VIBE-DistilledCFG
+- **Parent Model:** [iitolstykh/VIBE-Image-Edit](https://huggingface.co/iitolstykh/VIBE-Image-Edit)
+- **Task:** Text-Guided Image Editing
+- **Architecture:**
+  - **Diffusion Backbone:** Sana1.5 (1.6B parameters) with Linear Attention.
+  - **Condition Encoder:** Qwen3-VL (2B parameters).
+- **Technique:** Classifier-Free Guidance (CFG) Distillation.
+- **Model precision:** `torch.bfloat16` (BF16)
+- **Model resolution:** Optimized for images up to 2048px.
+## Features
+- **Blazing Fast Inference:** Runs approximately 2x faster than the original model by skipping the guidance pass.
+- **Text-Guided Editing:** Edit images using natural language instructions.
+- **Compact & Efficient:** Retains the lightweight footprint of the original 1.6B/2B architecture.
+- **Multimodal Understanding:** Powered by Qwen3-VL for precise instruction following.
+- **Text-to-Image:** Also supports text-to-image generation.
+## Inference Requirements
+- Install the `vibe` library:
+```bash
+pip install git+https://github.com/ai-forever/VIBE
+```
+- Install the pinned dependencies required by the `vibe` library:
+```bash
+pip install transformers==4.57.1 torchvision==0.21.0 torch==2.6.0 diffusers==0.33.1 loguru==0.7.3
+```
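
Before loading the model, it can help to confirm that the environment matches the pinned versions. The following is a minimal sketch using only the standard library; `PINS`, `matches`, and `check_pins` are hypothetical helpers and not part of `vibe`. It assumes that a local build suffix such as `+cu124` on an installed version should be ignored when comparing.

```python
from importlib.metadata import PackageNotFoundError, version

# Pins copied from the install command above.
PINS = {
    "transformers": "4.57.1",
    "torchvision": "0.21.0",
    "torch": "2.6.0",
    "diffusers": "0.33.1",
    "loguru": "0.7.3",
}


def matches(expected, installed):
    """True if installed equals expected, ignoring a local suffix like '+cu124'."""
    if installed is None:
        return False
    return installed.split("+")[0] == expected


def check_pins(pins):
    """Map each package to (expected, installed); installed is None if missing."""
    report = {}
    for name, expected in pins.items():
        try:
            installed = version(name)
        except PackageNotFoundError:
            installed = None
        report[name] = (expected, installed)
    return report


for name, (expected, installed) in check_pins(PINS).items():
    status = "ok" if matches(expected, installed) else "MISMATCH"
    print(f"{name}: expected {expected}, found {installed} [{status}]")
```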
+## Quick Start
+**Note:** When using this distilled model, you do not need to provide `guidance_scale` or `image_guidance_scale`.
+```python
+from PIL import Image
+import requests
+from io import BytesIO
+from huggingface_hub import snapshot_download
+from vibe.editor import ImageEditor
+# Download model
+model_path = snapshot_download(
+    repo_id="iitolstykh/VIBE-Image-Edit-DistilledCFG",
+    repo_type="model",
+)
+# Load model
+# Note: Guidance scales are removed for the distilled version
+editor = ImageEditor(
+    checkpoint_path=model_path,
+    num_inference_steps=20,
+    device="cuda:0",
+)
+# Download test image
+resp = requests.get(
+    'https://image.civitai.com/xG1nkqKTMzGDvpLrqFT7WA/3f58a82a-b4b4-40c3-a318-43f9350fcd02/original=true,quality=90/115610275.jpeg',
+    timeout=30,
+)
+resp.raise_for_status()  # fail early on a bad HTTP response
+image = Image.open(BytesIO(resp.content))
+# Generate edited image
+edited_image = editor.generate_edited_image(
+    instruction="let this case swim in the river",
+    conditioning_image=image,
+    num_images_per_prompt=1,
+)[0]
+edited_image.save("edited_image.jpg", quality=100)
+```
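
Since the model is described as optimized for images up to 2048px, very large inputs may benefit from being downscaled first. The sketch below computes a target size that preserves aspect ratio; `fit_within` is a hypothetical helper, and whether `ImageEditor` already resizes inputs internally, or requires particular dimension multiples, is not stated in this card.

```python
def fit_within(width, height, max_side=2048):
    """Return (width, height) scaled down so the longer side is <= max_side."""
    longest = max(width, height)
    if longest <= max_side:
        # Already small enough: leave the size unchanged.
        return width, height
    scale = max_side / longest
    return max(1, round(width * scale)), max(1, round(height * scale))


print(fit_within(4096, 3072))  # (2048, 1536)
print(fit_within(1024, 768))   # (1024, 768)
```

With Pillow, the result could be applied as `image = image.resize(fit_within(*image.size))` before passing `image` as `conditioning_image`.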
+## License
+This project is built upon SANA. Please refer to the original SANA license for usage terms:
+[SANA License](https://huggingface.co/Efficient-Large-Model/SANA1.5_4.8B_1024px_diffusers/blob/main/LICENSE.txt)
+## Citation
+If you use this model in your research or applications, please acknowledge the original projects:
+- [SANA 1.5: Efficient Scaling of Training-Time and Inference-Time Compute in Linear Diffusion Transformer](https://github.com/NVlabs/Sana)
+- [Qwen3-VL](https://github.com/QwenLM/Qwen3-VL)
+```bibtex
+@misc{vibe2026,
+  Author = {Grigorii Alekseenko and Aleksandr Gordeev and Irina Tolstykh and Bulat Suleimanov and Vladimir Dokholyan and Georgii Fedorov and Sergey Yakubson and Aleksandra Tsybina and Mikhail Chernyshov and Maksim Kuprashevich},
+  Title = {VIBE: Visual Instruction Based Editor},
+  Year = {2026},
+  Eprint = {arXiv:2601.02242},
+}
+```