
language:
  - en
pipeline_tag: image-to-image
tags:
  - image-editing
  - text-guided-editing
  - diffusion
  - sana
  - qwen-vl
  - multimodal
base_model:
  - Efficient-Large-Model/SANA1.5_1.6B_1024px
  - Qwen/Qwen3-VL-2B-Instruct
library_name: diffusers

VIBE: Visual Instruction Based Editor

🌐 Project Page | 📜 Paper on arXiv | GitHub | 🤗 Space

VIBE is an open-source framework for text-guided image editing. It pairs the efficient Sana1.5-1.6B diffusion model with the visual understanding capabilities of Qwen3-VL-2B-Instruct to deliver fast, high-quality, instruction-based image manipulation.

Model Details

Name: VIBE
Task: Text-Guided Image Editing
Architecture:
- Diffusion Backbone: Sana1.5 (1.6B parameters) with Linear Attention.
- Condition Encoder: Qwen3-VL (2B parameters) for multimodal understanding.
Framework: Built on diffusers and transformers.
Model precision: torch.bfloat16 (BF16)
Model resolution: The model is designed to edit images up to 2048 px, supporting a range of heights and widths (multi-scale).
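As a rough, back-of-the-envelope check on the figures above: BF16 stores each parameter in 2 bytes, so the 1.6B + 2B parameters come to about 7.2 GB (about 6.71 GiB) of weights alone, before the autoencoder, activations, and any caches. The sketch below computes that estimate and also includes a hypothetical helper, `fit_to_max_side`, for downscaling large inputs to the 2048 px limit while keeping the aspect ratio. Rounding each side to a multiple of 32 is an assumption (chosen because Sana's DC-AE autoencoder downsamples by 32x), not a documented VIBE requirement, and the vibe library may already resize inputs on its own.

```python
BYTES_PER_BF16 = 2  # bfloat16 stores each parameter in 16 bits


def weight_memory_gib(param_counts, bytes_per_param=BYTES_PER_BF16):
    """Rough size of the model weights alone, in GiB.

    Excludes the autoencoder, activations, and caches, so real GPU usage is higher."""
    return sum(param_counts) * bytes_per_param / 1024**3


def fit_to_max_side(width, height, max_side=2048, multiple=32):
    """Hypothetical pre-processing helper (not part of the vibe library).

    Scales (width, height) down so the longer side is at most `max_side`,
    keeps the aspect ratio, and rounds each side down to a multiple of
    `multiple` (32 is an assumption, see the note above)."""
    longest = max(width, height)
    if longest > max_side:
        # Integer arithmetic avoids floating-point rounding surprises
        width = width * max_side // longest
        height = height * max_side // longest
    new_w = max(multiple, width // multiple * multiple)
    new_h = max(multiple, height // multiple * multiple)
    return new_w, new_h


if __name__ == "__main__":
    # 1.6B (Sana1.5) + 2B (Qwen3-VL), as stated in the model card
    print(f"{weight_memory_gib([1.6e9, 2.0e9]):.2f} GiB")  # → 6.71 GiB
    print(fit_to_max_side(4000, 3000))  # → (2048, 1536)
```

The resulting size could then be passed to PIL's `Image.resize` before calling the editor, if you prefer to control resizing explicitly.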

Features

Text-Guided Editing: Edit images using natural language instructions (e.g., "Add a cat on the sofa").
Compact & Efficient: Combines a 1.6B parameter diffusion model with a 2B parameter encoder for a lightweight footprint.
High-Speed Inference: Utilizes Sana1.5's linear attention mechanism for rapid generation.
Multimodal Understanding: Qwen3-VL ensures strong alignment between visual content and text instructions.
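To illustrate why linear attention is fast: Sana replaces softmax attention with a ReLU-kernel linear attention, so the keys and values can be summarized once into a small d x d_v matrix instead of building an N x N score matrix. The NumPy sketch below is a simplified single-head version of that idea (no projections, multi-head split, or other details of the real Sana block) and checks that it matches the explicit quadratic computation.

```python
import numpy as np


def relu_linear_attention(q, k, v, eps=1e-6):
    """ReLU-kernel linear attention, cost O(N * d * d_v) instead of O(N^2).

    q, k: (N, d) queries and keys; v: (N, d_v) values."""
    phi_q, phi_k = np.maximum(q, 0), np.maximum(k, 0)  # ReLU feature map
    kv = phi_k.T @ v             # (d, d_v): all keys/values summarized once
    k_sum = phi_k.sum(axis=0)    # (d,): normalizer term
    numer = phi_q @ kv           # (N, d_v)
    denom = phi_q @ k_sum + eps  # (N,); eps guards against all-zero rows
    return numer / denom[:, None]


def relu_quadratic_attention(q, k, v, eps=1e-6):
    """The same attention computed through the explicit N x N score matrix."""
    scores = np.maximum(q, 0) @ np.maximum(k, 0).T  # (N, N)
    return (scores @ v) / (scores.sum(axis=1) + eps)[:, None]


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    q, k, v = (rng.standard_normal((256, 16)) for _ in range(3))
    # Both orders of multiplication give the same result; only the cost differs
    print(np.allclose(relu_linear_attention(q, k, v), relu_quadratic_attention(q, k, v)))
```

The two functions differ only in the order of matrix multiplications: (phi(Q) phi(K)^T) V versus phi(Q) (phi(K)^T V). The second order never materializes the N x N matrix, which is what keeps high-resolution generation tractable.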

Inference Requirements

vibe library

pip install git+https://github.com/ai-forever/VIBE

Pinned dependencies for the vibe library:

pip install transformers==4.57.1 torchvision==0.21.0 torch==2.6.0 diffusers==0.33.1 loguru==0.7.3
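Because the dependencies above are pinned to exact versions, it can help to check the active environment before running inference. The following is a small sketch using the standard-library `importlib.metadata`; the `find_mismatches` helper is hypothetical and not part of the vibe library.

```python
from importlib.metadata import PackageNotFoundError, version

# Pins copied from the pip command above
PINS = {
    "transformers": "4.57.1",
    "torchvision": "0.21.0",
    "torch": "2.6.0",
    "diffusers": "0.33.1",
    "loguru": "0.7.3",
}


def find_mismatches(pins, get_version=version):
    """Return {package: (expected, installed_or_None)} for every unmet pin."""
    mismatches = {}
    for name, expected in pins.items():
        try:
            installed = get_version(name)
        except PackageNotFoundError:
            installed = None
        # torch wheels may carry a local suffix such as "+cu124"; compare the public part
        if installed is None or installed.split("+")[0] != expected:
            mismatches[name] = (expected, installed)
    return mismatches


if __name__ == "__main__":
    for name, (expected, installed) in find_mismatches(PINS).items():
        print(f"{name}: expected {expected}, found {installed or 'not installed'}")
```

An empty output means every pinned package is installed at the expected version.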

Quick start

from PIL import Image
import requests
from io import BytesIO
from huggingface_hub import snapshot_download

from vibe.editor import ImageEditor

# Download model
model_path = snapshot_download(
    repo_id="iitolstykh/VIBE-Image-Edit",
    repo_type="model",
)

# Load model
editor = ImageEditor(
    checkpoint_path=model_path,
    image_guidance_scale=1.2,
    guidance_scale=4.5,
    num_inference_steps=20,
    device="cuda:0",
)

# Download test image
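resp = requests.get('https://image.civitai.com/xG1nkqKTMzGDvpLrqFT7WA/3f58a82a-b4b4-40c3-a318-43f9350fcd02/original=true,quality=90/115610275.jpeg', timeout=30)
resp.raise_for_status()  # fail early on HTTP errors instead of passing bad bytes to PIL
image = Image.open(BytesIO(resp.content)).convert("RGB")  # ensure a 3-channel RGB input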

# Generate edited image
edited_image = editor.generate_edited_image(
    instruction="let this case swim in the river",
    conditioning_image=image,
    num_images_per_prompt=1,
)[0]

edited_image.save("edited_image.jpg", quality=100)
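The Quick start exposes two guidance settings, `image_guidance_scale` (1.2) and `guidance_scale` (4.5). This model card does not state how VIBE combines them; a common scheme for instruction-based editing is the two-way classifier-free guidance introduced by InstructPix2Pix, sketched below as an illustration of the general technique under that assumption, not as VIBE's confirmed implementation. The function names are hypothetical, and the inputs stand for the model's noise (or velocity) predictions under three conditioning setups.

```python
import numpy as np


def combine_guidance(eps_uncond, eps_image, eps_full, image_guidance_scale, guidance_scale):
    """InstructPix2Pix-style two-way classifier-free guidance (assumed, see above).

    eps_uncond: prediction with neither the image nor the text condition
    eps_image:  prediction with the image condition only
    eps_full:   prediction with both the image and the text instruction
    """
    return (
        eps_uncond
        + image_guidance_scale * (eps_image - eps_uncond)  # pull toward the input image
        + guidance_scale * (eps_full - eps_image)          # pull toward the instruction
    )


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    u, i, f = (rng.standard_normal((4, 4)) for _ in range(3))
    guided = combine_guidance(u, i, f, image_guidance_scale=1.2, guidance_scale=4.5)
    # With both scales at 1.0 the terms telescope back to the fully conditioned prediction
    print(np.allclose(combine_guidance(u, i, f, 1.0, 1.0), f))
```

Under this scheme, raising `image_guidance_scale` keeps the result closer to the input image, while raising `guidance_scale` makes the edit follow the instruction more strongly.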

License

This project is built upon SANA. Please refer to the original SANA license for usage terms: SANA License.

Citation

If you use this model in your research or applications, please cite VIBE and acknowledge the original projects it builds on:

@misc{vibe2026,
  Author = {Grigorii Alekseenko and Aleksandr Gordeev and Irina Tolstykh and Bulat Suleimanov and Vladimir Dokholyan and Georgii Fedorov and Sergey Yakubson and Aleksandra Tsybina and Mikhail Chernyshov and Maksim Kuprashevich},
  Title = {VIBE: Visual Instruction Based Editor},
  Year = {2026},
  Eprint = {arXiv:2601.02242},
}