	---
	language:
	- en
	pipeline_tag: image-to-image
	tags:
	- image-editing
	- text-guided-editing
	- diffusion
	- sana
	- qwen-vl
	- multimodal
	base_model:
	- Efficient-Large-Model/SANA1.5_1.6B_1024px
	- Qwen/Qwen3-VL-2B-Instruct
	library_name: diffusers
	---

	# VIBE: Visual Instruction Based Editor

	<p align="center">
	<a href="https://riko0.github.io/VIBE"> 🌐 Project Page </a> \|
	<a href="https://arxiv.org/abs/2601.02242"> 📜 Paper on arXiv </a> \|
	<a href="https://github.com/ai-forever/vibe"> GitHub </a> \|
	<a href="https://huggingface.co/spaces/iitolstykh/VIBE-Image-Edit-DEMO"> 🤗 Space </a>
	</p>

	VIBE is an open-source framework for text-guided image editing. It pairs the efficient [Sana1.5-1.6B](https://github.com/NVlabs/Sana) diffusion model with the visual understanding of [Qwen3-VL-2B-Instruct](https://github.com/QwenLM/Qwen3-VL) to deliver fast, high-quality, instruction-based image manipulation.

	## Model Details

	- Name: VIBE
	- Task: Text-Guided Image Editing
	- Architecture:
	  - Diffusion Backbone: Sana1.5 (1.6B parameters) with Linear Attention.
	  - Condition Encoder: Qwen3-VL (2B parameters) for multimodal understanding.
	- Framework: Built on `diffusers` and `transformers`.
	- Model precision: torch.bfloat16 (BF16)
	- Model resolution: Edits images up to 2048px, with multi-scale height and width.
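
To illustrate the 2048px limit, here is a minimal sketch that computes a target size with the longer side capped at 2048 and the aspect ratio approximately preserved. The helper name `fit_within` and the rounding to a multiple of 32 are assumptions made for illustration; they are not part of the `vibe` API, and the library may already resize inputs internally.

```python
def fit_within(width: int, height: int, max_side: int = 2048, multiple: int = 32) -> tuple[int, int]:
    """Return (width, height) scaled so the longer side is at most max_side.

    Both sides are rounded down to a multiple of `multiple` (an assumed
    alignment, not a documented VIBE requirement), with a floor of one
    `multiple`. Images that already fit are only rounded, never enlarged
    beyond that floor.
    """
    if width <= 0 or height <= 0:
        raise ValueError("width and height must be positive")

    longest = max(width, height)
    if longest > max_side:
        # Integer arithmetic avoids floating-point rounding surprises.
        width = width * max_side // longest
        height = height * max_side // longest

    new_width = max(multiple, width // multiple * multiple)
    new_height = max(multiple, height // multiple * multiple)
    return new_width, new_height
```

With Pillow, the result could be applied as `image.resize(fit_within(*image.size))` before passing the image to the editor.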

	## Features

	- Text-Guided Editing: Edit images using natural language instructions (e.g., "Add a cat on the sofa").
	- Compact & Efficient: Combines a 1.6B parameter diffusion model with a 2B parameter encoder for a lightweight footprint.
	- High-Speed Inference: Utilizes Sana1.5's linear attention mechanism for rapid generation.
	- Multimodal Understanding: Qwen3-VL ensures strong alignment between visual content and text instructions.
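
As a back-of-the-envelope check on the "lightweight footprint" claim, the sketch below estimates the memory needed for the weights alone in BF16. It uses the rounded parameter counts quoted in this README and ignores everything else that occupies GPU memory (the image autoencoder, activations, framework overhead), so treat the result as a lower bound rather than a measured figure.

```python
BYTES_PER_BF16 = 2  # bfloat16 stores each parameter in 16 bits

# Approximate parameter counts as stated in this README.
PARAM_COUNTS = {
    "Sana1.5 diffusion backbone": 1.6e9,
    "Qwen3-VL condition encoder": 2.0e9,
}


def weight_memory_gib(param_counts: dict[str, float], bytes_per_param: int = BYTES_PER_BF16) -> float:
    """Return the total weight size in GiB for the given parameter counts."""
    total_bytes = sum(param_counts.values()) * bytes_per_param
    return total_bytes / 2**30


if __name__ == "__main__":
    # 3.6e9 parameters * 2 bytes = 7.2e9 bytes, roughly 6.7 GiB
    print(f"Approximate BF16 weight memory: {weight_memory_gib(PARAM_COUNTS):.1f} GiB")
```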


	## Inference Requirements

	- `vibe` library
	```bash
	pip install git+https://github.com/ai-forever/VIBE
	```
	- Requirements for the `vibe` library:
	```bash
	pip install transformers==4.57.1 torchvision==0.21.0 torch==2.6.0 diffusers==0.33.1 loguru==0.7.3
	```
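
Because the requirements are pinned to exact versions, it can help to confirm that the active environment matches them before loading the model. The following is a minimal sketch using only the standard library; the helper names `installed_version` and `find_mismatches` are hypothetical, and ignoring local build tags such as `+cu124` is an assumption about how PyTorch wheels are typically labeled.

```python
from importlib.metadata import PackageNotFoundError, version

# Pins copied from the pip command above.
PINNED = {
    "transformers": "4.57.1",
    "torchvision": "0.21.0",
    "torch": "2.6.0",
    "diffusers": "0.33.1",
    "loguru": "0.7.3",
}


def installed_version(package):
    """Return the installed version string, or None if the package is missing."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


def find_mismatches(pinned, lookup=installed_version):
    """Map each package that is missing or off-pin to (expected, found)."""
    mismatches = {}
    for package, expected in pinned.items():
        found = lookup(package)
        # Strip a local build tag such as "+cu124" before comparing.
        if found is None or found.split("+")[0] != expected:
            mismatches[package] = (expected, found)
    return mismatches


if __name__ == "__main__":
    for package, (expected, found) in find_mismatches(PINNED).items():
        print(f"{package}: expected {expected}, found {found or 'not installed'}")
```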

	## Quick Start

	```python
	from io import BytesIO

	import requests
	from huggingface_hub import snapshot_download
	from PIL import Image

	from vibe.editor import ImageEditor

	# Download the model weights from the Hugging Face Hub
	model_path = snapshot_download(
	    repo_id="iitolstykh/VIBE-Image-Edit",
	    repo_type="model",
	)

	# Load the editor and set the sampling parameters
	editor = ImageEditor(
	    checkpoint_path=model_path,
	    image_guidance_scale=1.2,
	    guidance_scale=4.5,
	    num_inference_steps=20,
	    device="cuda:0",
	)

	# Download a test image and fail early on HTTP errors
	resp = requests.get('https://image.civitai.com/xG1nkqKTMzGDvpLrqFT7WA/3f58a82a-b4b4-40c3-a318-43f9350fcd02/original=true,quality=90/115610275.jpeg', timeout=60)
	resp.raise_for_status()
	image = Image.open(BytesIO(resp.content))

	# Generate one edited image from the instruction
	edited_image = editor.generate_edited_image(
	    instruction="let this case swim in the river",
	    conditioning_image=image,
	    num_images_per_prompt=1,
	)[0]

	edited_image.save("edited_image.jpg", quality=100)
	```
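
To try several instructions on the same source image, the call shown in the quick start can be wrapped in a small loop. This is a sketch only: `edit_with_instructions` is a hypothetical helper, it uses nothing beyond the `generate_edited_image` arguments shown above, and it assumes that one `ImageEditor` instance can be reused across calls.

```python
def edit_with_instructions(editor, image, instructions):
    """Apply each instruction to the same source image.

    Returns a dict mapping every instruction to its edited image.
    """
    results = {}
    for instruction in instructions:
        # Same call as in the quick start; [0] takes the single generated image.
        results[instruction] = editor.generate_edited_image(
            instruction=instruction,
            conditioning_image=image,
            num_images_per_prompt=1,
        )[0]
    return results
```

For example, `edit_with_instructions(editor, image, ["Add a cat on the sofa", "let this case swim in the river"])` would return two edited images keyed by their instructions.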

	## License

	This project is built upon SANA. Please refer to the original SANA license for usage terms:
	[SANA License](https://huggingface.co/Efficient-Large-Model/SANA1.5_4.8B_1024px_diffusers/blob/main/LICENSE.txt)

	## Citation

	If you use this model in your research or applications, please cite VIBE using the BibTeX entry at the end of this section and acknowledge the original projects:

	- [SANA 1.5: Efficient Scaling of Training-Time and Inference-Time Compute in Linear Diffusion Transformer](https://github.com/NVlabs/Sana)
	- [Qwen3-VL](https://github.com/QwenLM/Qwen3-VL)

	```bibtex
	@misc{vibe2026,
	  Author = {Grigorii Alekseenko and Aleksandr Gordeev and Irina Tolstykh and Bulat Suleimanov and Vladimir Dokholyan and Georgii Fedorov and Sergey Yakubson and Aleksandra Tsybina and Mikhail Chernyshov and Maksim Kuprashevich},
	  Title = {VIBE: Visual Instruction Based Editor},
	  Year = {2026},
	  Eprint = {arXiv:2601.02242},
	}
	```