	---
	language:
	- en
	pipeline_tag: image-to-image
	tags:
	- image-editing
	- text-guided-editing
	- diffusion
	- sana
	- qwen-vl
	- multimodal
	- distilled
	- cfg-distillation
	base_model:
	- iitolstykh/VIBE-Image-Edit
	library_name: diffusers
	---

	# VIBE: Visual Instruction Based Editor

	<div align="center">
	<img src="VIBE.png" width="800" alt="VIBE"/>
	</div>

	<p align="center">
	<a href="https://riko0.github.io/VIBE">🌐 Project Page</a> |
	<a href="https://arxiv.org/abs/2601.02242">📜 Paper on arXiv</a> |
	<a href="https://github.com/ai-forever/vibe">GitHub</a> |
	<a href="https://huggingface.co/spaces/iitolstykh/VIBE-Image-Edit-DEMO">🤗 Space</a> |
	<a href="https://huggingface.co/iitolstykh/VIBE-Image-Edit">🤗 VIBE-Image-Edit</a>
	</p>

	VIBE-DistilledCFG is a CFG-distilled version of the original [VIBE-Image-Edit](https://huggingface.co/iitolstykh/VIBE-Image-Edit) model.

	It runs without classifier-free guidance (CFG), which substantially reduces image generation time while maintaining high-quality outputs.
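To illustrate why dropping CFG saves time, here is a minimal sketch of the textbook two-pass CFG formulation next to a single-pass distilled step. It is a toy model, not the `vibe` implementation: `CountingDenoiser`, `cfg_step`, `distilled_step`, and the scale value `4.5` are all hypothetical, and the real pipeline may batch its passes or use a different guidance formula and scale convention.

```python
from typing import List, Optional

Vector = List[float]


class CountingDenoiser:
    """Toy stand-in for a diffusion backbone that counts forward passes."""

    def __init__(self) -> None:
        self.calls = 0

    def __call__(self, latent: Vector, condition: Optional[float]) -> Vector:
        self.calls += 1
        shift = 0.0 if condition is None else condition
        return [x + shift for x in latent]


def cfg_step(denoiser: CountingDenoiser, latent: Vector,
             condition: float, scale: float) -> Vector:
    """Textbook CFG: one unconditional and one conditional pass, then extrapolate."""
    uncond = denoiser(latent, None)
    cond = denoiser(latent, condition)
    return [u + scale * (c - u) for u, c in zip(uncond, cond)]


def distilled_step(denoiser: CountingDenoiser, latent: Vector,
                   condition: float) -> Vector:
    """Guidance-distilled model: a single conditional pass per step."""
    return denoiser(latent, condition)


steps = 20  # same step count as the quick start example
with_cfg = CountingDenoiser()
without_cfg = CountingDenoiser()
latent = [0.0, 1.0]
for _ in range(steps):
    cfg_step(with_cfg, latent, condition=0.5, scale=4.5)  # hypothetical scale
    distilled_step(without_cfg, latent, condition=0.5)

print(with_cfg.calls, without_cfg.calls)  # 40 20
```

In this toy setting the backbone is evaluated half as often. End-to-end speedup can differ from the ratio of forward passes, because fixed costs such as encoding the instruction and decoding the image are paid either way.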

	## Performance Comparison

	Below is a comparison of total inference time between the original VIBE model (with CFG) and this DistilledCFG model (without CFG). Distillation yields a speedup of approximately 1.8x at 1024x1024 and 2.1x at 2048x2048.

	| Resolution | Original Model (with CFG) | DistilledCFG Model (No CFG) |
	| :--- | :--- | :--- |
	| 1024x1024 | 1.1453s | 0.6389s |
	| 2048x2048 | 4.0837s | 1.9687s |
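The speedup factors follow directly from the timings in the table. A small sketch of the arithmetic (the dictionary layout and the `speedup` helper are illustrative names, not part of the `vibe` library):

```python
# Total inference times in seconds, copied from the table above.
timings = {
    "1024x1024": {"original_cfg": 1.1453, "distilled": 0.6389},
    "2048x2048": {"original_cfg": 4.0837, "distilled": 1.9687},
}


def speedup(original_s: float, distilled_s: float) -> float:
    """Ratio of original to distilled inference time."""
    return original_s / distilled_s


speedups = {
    resolution: speedup(t["original_cfg"], t["distilled"])
    for resolution, t in timings.items()
}

for resolution, factor in speedups.items():
    print(f"{resolution}: {factor:.2f}x")
# 1024x1024: 1.79x
# 2048x2048: 2.07x
```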

	## Model Details

	- Name: VIBE-DistilledCFG
	- Parent Model: [iitolstykh/VIBE-Image-Edit](https://huggingface.co/iitolstykh/VIBE-Image-Edit)
	- Task: Text-Guided Image Editing
	- Architecture:
	  - Diffusion Backbone: Sana1.5 (1.6B parameters) with Linear Attention.
	  - Condition Encoder: Qwen3-VL (2B parameters).
	- Technique: Classifier-Free Guidance (CFG) Distillation.
	- Model precision: torch.bfloat16 (BF16)
	- Model resolution: Optimized for up to 2048px images.
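Since the model is optimized for images up to 2048px, it may help to downscale larger inputs before editing. The sketch below only computes the target size; `fit_within` is a hypothetical helper, and whether the `vibe` library already resizes inputs internally is not stated here.

```python
from typing import Tuple


def fit_within(width: int, height: int, max_side: int = 2048) -> Tuple[int, int]:
    """Return (width, height) scaled down so the longer side is at most max_side.

    The aspect ratio is preserved and images that already fit are left unchanged.
    """
    if width <= 0 or height <= 0:
        raise ValueError("width and height must be positive")
    longest = max(width, height)
    if longest <= max_side:
        return width, height
    scale = max_side / longest
    return max(1, round(width * scale)), max(1, round(height * scale))


print(fit_within(4096, 3072))  # (2048, 1536)
print(fit_within(1024, 768))   # (1024, 768)
```

With Pillow, the result can be passed straight to `image.resize(fit_within(*image.size))`, since `image.size` is a `(width, height)` tuple.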

	## Features

	- Blazing Fast Inference: Runs approximately 2x faster than the original model by skipping the guidance pass.
	- Text-Guided Editing: Edit images using natural language instructions.
	- Compact & Efficient: Retains the lightweight footprint of the original 1.6B/2B architecture.
	- Multimodal Understanding: Powered by Qwen3-VL for precise instruction following.
	- Text-to-Image support.

	## Inference Requirements

	- The `vibe` library:
	```bash
	pip install git+https://github.com/ai-forever/VIBE
	```
	- Pinned dependencies for the `vibe` library:
	```bash
	pip install transformers==4.57.1 torchvision==0.21.0 torch==2.6.0 diffusers==0.33.1 loguru==0.7.3
	```
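Because the dependencies are pinned to exact versions, a quick check of the active environment can catch mismatches before loading the model. This is a minimal sketch using only the standard library; `PINNED`, `installed_version`, and `find_mismatches` are illustrative names, and stripping a local build suffix such as `+cu124` is an assumption about how the installed packages report their versions.

```python
from importlib.metadata import PackageNotFoundError, version
from typing import Callable, Dict, Optional

# Versions copied from the pip command above.
PINNED: Dict[str, str] = {
    "transformers": "4.57.1",
    "torchvision": "0.21.0",
    "torch": "2.6.0",
    "diffusers": "0.33.1",
    "loguru": "0.7.3",
}


def installed_version(package: str) -> Optional[str]:
    """Return the installed version of a package, or None if it is missing."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


def find_mismatches(
    pins: Dict[str, str],
    lookup: Callable[[str], Optional[str]] = installed_version,
) -> Dict[str, Optional[str]]:
    """Map each package whose installed version differs from its pin to what was found."""
    mismatches: Dict[str, Optional[str]] = {}
    for package, wanted in pins.items():
        found = lookup(package)
        # Drop a local build suffix such as "+cu124" before comparing.
        if found is None or found.split("+")[0] != wanted:
            mismatches[package] = found
    return mismatches


for package, found in find_mismatches(PINNED).items():
    print(f"{package}: expected {PINNED[package]}, found {found or 'not installed'}")
```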

	## Quick Start

	Note: When using this distilled model, please set `image_guidance_scale` and `guidance_scale` to 0.0 to disable CFG.

	```python
	from PIL import Image
	import requests
	from io import BytesIO
	from huggingface_hub import snapshot_download

	from vibe.editor import ImageEditor

	# Download the model weights from the Hugging Face Hub
	model_path = snapshot_download(
	    repo_id="iitolstykh/VIBE-Image-Edit-DistilledCFG",
	    repo_type="model",
	)

	# Load the model
	# Note: both guidance scales are set to 0.0 to disable CFG for the distilled version
	editor = ImageEditor(
	    checkpoint_path=model_path,
	    num_inference_steps=20,
	    image_guidance_scale=0.0,
	    guidance_scale=0.0,
	    device="cuda:0",
	)

	# Download a test image and fail early on HTTP errors
	resp = requests.get('https://image.civitai.com/xG1nkqKTMzGDvpLrqFT7WA/3f58a82a-b4b4-40c3-a318-43f9350fcd02/original=true,quality=90/115610275.jpeg')
	resp.raise_for_status()
	image = Image.open(BytesIO(resp.content))

	# Generate the edited image (the call returns a list; take the first result)
	edited_image = editor.generate_edited_image(
	    instruction="let this case swim in the river",
	    conditioning_image=image,
	    num_images_per_prompt=1,
	)[0]

	edited_image.save("edited_image.jpg", quality=100)
	```
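When editing several images with the same instruction, the loaded editor can be reused rather than reloaded for each image. A minimal sketch, assuming `editor` is the `ImageEditor` instance from the snippet above and that `generate_edited_image` accepts the same keyword arguments shown there; `edit_many` is a hypothetical helper, not part of the `vibe` library.

```python
from typing import Any, Iterable, List


def edit_many(editor: Any, images: Iterable[Any], instruction: str) -> List[Any]:
    """Apply one instruction to several images, reusing a single loaded editor."""
    results = []
    for image in images:
        outputs = editor.generate_edited_image(
            instruction=instruction,
            conditioning_image=image,
            num_images_per_prompt=1,
        )
        # Each call returns a list of images; keep the first one.
        results.append(outputs[0])
    return results
```

For example, `edit_many(editor, [image_a, image_b], "make it snowy")` would return one edited image per input, in the same order.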

	## License

	This project is built upon SANA. Please refer to the original SANA license for usage terms:
	[SANA License](https://huggingface.co/Efficient-Large-Model/SANA1.5_4.8B_1024px_diffusers/blob/main/LICENSE.txt)

	## Citation

	If you use this model in your research or applications, please acknowledge the original projects:

	- [SANA 1.5: Efficient Scaling of Training-Time and Inference-Time Compute in Linear Diffusion Transformer](https://github.com/NVlabs/Sana)
	- [Qwen3-VL](https://github.com/QwenLM/Qwen3-VL)

	```bibtex
	@misc{vibe2026,
	  Author = {Grigorii Alekseenko and Aleksandr Gordeev and Irina Tolstykh and Bulat Suleimanov and Vladimir Dokholyan and Georgii Fedorov and Sergey Yakubson and Aleksandra Tsybina and Mikhail Chernyshov and Maksim Kuprashevich},
	  Title = {VIBE: Visual Instruction Based Editor},
	  Year = {2026},
	  Eprint = {arXiv:2601.02242},
	}
	```