---
language:
- en
pipeline_tag: image-to-image
tags:
- image-editing
- text-guided-editing
- diffusion
- sana
- qwen-vl
- multimodal
- distilled
- cfg-distillation
base_model:
- iitolstykh/VIBE-Image-Edit
- Efficient-Large-Model/SANA1.5_1.6B_1024px
- Qwen/Qwen3-VL-2B-Instruct
library_name: diffusers
---

# VIBE: Visual Instruction Based Editor

<div align="center">
<img src="VIBE.png" width="800" alt="VIBE"/>
</div>

<p align="center">
<a href="https://riko0.github.io/VIBE">🌐 Project Page</a> |
<a href="https://arxiv.org/abs/2601.02242">📜 Paper on arXiv</a> |
<a href="https://github.com/ai-forever/vibe">GitHub</a> |
<a href="https://huggingface.co/spaces/iitolstykh/VIBE-Image-Edit-DEMO">🤗 Space</a> |
<a href="https://huggingface.co/iitolstykh/VIBE-Image-Edit">🤗 VIBE-Image-Edit</a>
</p>
+
38
+ **VIBE-DistilledCFG** is a specialized version of the original [VIBE-Image-Edit](https://huggingface.co/iitolstykh/VIBE-Image-Edit) model.
39
+
40
+ This model can be run without classifier-free guidance, substantially reducing image generation time while maintaining high quality outputs.
41
+
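The speedup comes from how classifier-free guidance works: standard CFG runs two denoiser forward passes per sampling step (conditional and unconditional) and blends them, while a guidance-distilled model is trained to emit the blended prediction in a single pass. A toy sketch of that difference (the `denoise` stand-in and guidance scale are illustrative, not the VIBE API):

```python
# Toy illustration of why CFG doubles per-step cost and how
# distillation removes the extra pass. Not the VIBE API.

calls = {"n": 0}

def denoise(x, cond):
    """Stand-in for one diffusion-model forward pass."""
    calls["n"] += 1
    return 0.5 * x if cond is not None else 0.9 * x

def step_with_cfg(x, cond, scale=4.5):
    # Classifier-free guidance: two forward passes, then a blend.
    eps_cond = denoise(x, cond)
    eps_uncond = denoise(x, None)
    return eps_uncond + scale * (eps_cond - eps_uncond)

def step_distilled(x, cond):
    # A CFG-distilled model outputs the blended prediction
    # directly, so one pass suffices.
    return denoise(x, cond)

calls["n"] = 0
step_with_cfg(1.0, "edit instruction")
cfg_calls = calls["n"]          # 2 forward passes per step

calls["n"] = 0
step_distilled(1.0, "edit instruction")
distilled_calls = calls["n"]    # 1 forward pass per step
```

Halving the per-step forward passes is what makes the near-2x end-to-end speedup below possible.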
## Performance Comparison

Below is a comparison of total inference time between the original VIBE model (using CFG) and this DistilledCFG model (without CFG). The distillation yields an approximately **1.8x-2x** speedup.

| Resolution | Original Model (with CFG) | DistilledCFG Model (no CFG) |
| :--- | :--- | :--- |
| **1024x1024** | 1.1453 s | **0.6389 s** |
| **2048x2048** | 4.0837 s | **1.9687 s** |

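As a sanity check, the quoted speedup range follows directly from the measured times in the table:

```python
# Speedup implied by the timings in the table above.
times = {
    "1024x1024": (1.1453, 0.6389),  # (with CFG, distilled)
    "2048x2048": (4.0837, 1.9687),
}

speedups = {res: round(cfg / distilled, 2) for res, (cfg, distilled) in times.items()}
print(speedups)  # {'1024x1024': 1.79, '2048x2048': 2.07}
```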
## Model Details

- **Name:** VIBE-DistilledCFG
- **Parent Model:** [iitolstykh/VIBE-Image-Edit](https://huggingface.co/iitolstykh/VIBE-Image-Edit)
- **Task:** Text-Guided Image Editing
- **Architecture:**
  - **Diffusion Backbone:** Sana1.5 (1.6B parameters) with Linear Attention.
  - **Condition Encoder:** Qwen3-VL (2B parameters).
- **Technique:** Classifier-Free Guidance (CFG) Distillation.
- **Model precision:** torch.bfloat16 (BF16)
- **Model resolution:** Optimized for images up to 2048px.

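The linear attention in the Sana backbone is what keeps high-resolution editing tractable: instead of materializing an N×N softmax score matrix, a kernel feature map lets attention be computed as phi(Q) @ (phi(K)^T @ V), which is linear in the token count N. A minimal NumPy sketch of the idea (the ReLU feature map and shapes are illustrative, not Sana's exact layer):

```python
import numpy as np

def linear_attention(q, k, v, eps=1e-6):
    """O(N) attention via phi(Q) @ (phi(K)^T @ V), with ReLU as the
    kernel feature map. Illustrative only, not Sana's exact layer."""
    q, k = np.maximum(q, 0), np.maximum(k, 0)  # phi = ReLU
    kv = k.T @ v                               # (d, d) summary, independent of N
    z = q @ k.sum(axis=0) + eps                # per-query normalizer, shape (N,)
    return (q @ kv) / z[:, None]

rng = np.random.default_rng(0)
n, d = 256, 32
q, k, v = rng.normal(size=(3, n, d))
out = linear_attention(q, k, v)
print(out.shape)  # (256, 32)
```

The key point is the fixed-size `(d, d)` key-value summary: doubling the image resolution quadruples N but leaves that summary the same size, so cost grows linearly rather than quadratically.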
## Features

- **Blazing-Fast Inference:** Runs approximately 2x faster than the original model by skipping the guidance pass.
- **Text-Guided Editing:** Edit images using natural language instructions.
- **Compact & Efficient:** Retains the lightweight footprint of the original 1.6B/2B architecture.
- **Multimodal Understanding:** Powered by Qwen3-VL for precise instruction following.
- **Text-to-Image** support.

## Inference Requirements

- `vibe` library:

  ```bash
  pip install git+https://github.com/ai-forever/VIBE
  ```

- dependencies for the `vibe` library:

  ```bash
  pip install transformers==4.57.1 torchvision==0.21.0 torch==2.6.0 diffusers==0.33.1 loguru==0.7.3
  ```

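If inference fails with import or API errors, it is worth confirming that the pinned versions above are what is actually installed. A small sketch (package names are taken from the pins above; the `check_pins` helper is hypothetical, not part of `vibe`):

```python
from importlib.metadata import version

# Pinned versions from the install command above.
pins = {
    "transformers": "4.57.1",
    "torchvision": "0.21.0",
    "torch": "2.6.0",
    "diffusers": "0.33.1",
    "loguru": "0.7.3",
}

def check_pins(pins):
    """Return {package: (pinned, installed)} for every mismatched pin.

    `installed` is None when the package is missing entirely.
    """
    mismatched = {}
    for pkg, pinned in pins.items():
        try:
            installed = version(pkg)
        except Exception:
            installed = None
        if installed != pinned:
            mismatched[pkg] = (pinned, installed)
    return mismatched

print(check_pins(pins))  # empty dict when the environment matches the pins
```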
## Quick Start

**Note:** When using this distilled model, you do not need to provide `guidance_scale` or `image_guidance_scale`.

```python
from io import BytesIO

import requests
from PIL import Image
from huggingface_hub import snapshot_download

from vibe.editor import ImageEditor

# Download the model checkpoint
model_path = snapshot_download(
    repo_id="iitolstykh/VIBE-Image-Edit-DistilledCFG",
    repo_type="model",
)

# Load the model
# Note: guidance scales are removed for the distilled version
editor = ImageEditor(
    checkpoint_path=model_path,
    num_inference_steps=20,
    device="cuda:0",
)

# Download a test image
resp = requests.get("https://image.civitai.com/xG1nkqKTMzGDvpLrqFT7WA/3f58a82a-b4b4-40c3-a318-43f9350fcd02/original=true,quality=90/115610275.jpeg")
image = Image.open(BytesIO(resp.content))

# Generate the edited image
edited_image = editor.generate_edited_image(
    instruction="let this case swim in the river",
    conditioning_image=image,
    num_images_per_prompt=1,
)[0]

edited_image.save("edited_image.jpg", quality=100)
```

## License

This project is built upon SANA. Please refer to the original SANA license for usage terms:
[SANA License](https://huggingface.co/Efficient-Large-Model/SANA1.5_4.8B_1024px_diffusers/blob/main/LICENSE.txt)

## Citation

If you use this model in your research or applications, please acknowledge the original projects:

- [SANA 1.5: Efficient Scaling of Training-Time and Inference-Time Compute in Linear Diffusion Transformer](https://github.com/NVlabs/Sana)
- [Qwen3-VL](https://github.com/QwenLM/Qwen3-VL)

```bibtex
@misc{vibe2026,
  author = {Grigorii Alekseenko and Aleksandr Gordeev and Irina Tolstykh and Bulat Suleimanov and Vladimir Dokholyan and Georgii Fedorov and Sergey Yakubson and Aleksandra Tsybina and Mikhail Chernyshov and Maksim Kuprashevich},
  title  = {VIBE: Visual Instruction Based Editor},
  year   = {2026},
  eprint = {arXiv:2601.02242},
}
```