aikx committed on Commit a5c45bd · verified · 1 Parent(s): aceec64

Update README.md

Files changed (1):
  1. README.md +173 -13

README.md CHANGED
@@ -6,10 +6,13 @@ license: apache-2.0
 
 Emu3.5 Team, BAAI
 
- [Project Page](https://emu.world/) | [🤗HF Models](https://huggingface.co/collections/BAAI/emu35) | [Paper](https://arxiv.org/pdf/2510.26583)
 </div>
 
 
 <div align='center'>
 <img src="https://github.com/baaivision/Emu3.5/blob/main/assets/arch.png?raw=True" class="interpolation-image" alt="arch." height="100%" width="100%" />
 </div>
@@ -33,13 +36,21 @@ Emu3.5 Team, BAAI
 | 🏆 | **Performance Benchmark** | Matches **Gemini 2.5 Flash Image (Nano Banana)** on **image generation/editing**, and **outperforms** on **interleaved generation tasks**. |
 
 
 ## Table of Contents
 
 1. [Model & Weights](#1-model--weights)
 2. [Quick Start](#2-quick-start)
- 3. [Schedule](#3-schedule)
- 4. [Citation](#4-citation)
 
 ## 1. Model & Weights
 
@@ -49,14 +60,28 @@ Emu3.5 Team, BAAI
 | Emu3.5-Image | [🤗 HF link](https://huggingface.co/BAAI/Emu3.5-Image/tree/main) |
 | Emu3.5-VisionTokenizer | [🤗 HF link](https://huggingface.co/BAAI/Emu3.5-VisionTokenizer/tree/main) |
 
 ## 2. Quick Start
 
 ### Environment Setup
 
 ```bash
 git clone https://github.com/baaivision/Emu3.5
 cd Emu3.5
- pip install -r requirements.txt
 pip install flash_attn==2.8.3 --no-build-isolation
 ```
 ### Configuration
@@ -64,8 +89,10 @@ pip install flash_attn==2.8.3 --no-build-isolation
 Edit `configs/config.py` to set:
 
 - Paths: `model_path`, `vq_path`
- - Task template: `task_type in {t2i, x2i, howto, story, explore, vla}`, `use_image` controls `<|IMAGE|>` usage (set to true when reference images are provided)
 - Sampling: `sampling_params` (classifier_free_guidance, temperature, top_k/top_p, etc.)
 
 ### Run Inference
 
@@ -73,24 +100,158 @@ Edit `configs/config.py` to set:
 ```bash
 python inference.py --cfg configs/config.py
 ```
 
 Protobuf outputs are written to `outputs/<exp_name>/proto/`. For better throughput, we recommend ≥2 GPUs.
 
 ### Visualize Protobuf Outputs
 
- To visualize generated protobuf files:
 
 ```bash
- python src/utils/vis_proto.py --input <input_proto_file> --output <output_dir>
 ```
 
- ## 3. Schedule
 
- - [x] Inference Code
 - [ ] Advanced Image Decoder
- - [ ] Discrete Diffusion Adaptation(DiDA)
 
 
- ## 4. Citation
 
 ```bibtex
 @misc{cui2025emu35nativemultimodalmodels,
@@ -102,5 +263,4 @@ python src/utils/vis_proto.py --input <input_proto_file> --output <output_dir>
 primaryClass={cs.CV},
 url={https://arxiv.org/abs/2510.26583},
 }
- ```
-
 
 Emu3.5 Team, BAAI
 
+ [Project Page](https://emu.world/pages/web/landingPage) | [🤗HF Models](https://huggingface.co/collections/BAAI/emu35) | [Paper](https://arxiv.org/pdf/2510.26583) | [App](https://emu.world/pages/web/home?route=index)
 </div>
 
 
+ > 🔔 **Latest**: Emu3.5 Web & Mobile Apps and vLLM offline inference are live; see [🔥 News](#news) for details.
+
+
 <div align='center'>
 <img src="https://github.com/baaivision/Emu3.5/blob/main/assets/arch.png?raw=True" class="interpolation-image" alt="arch." height="100%" width="100%" />
 </div>
 
 | 🏆 | **Performance Benchmark** | Matches **Gemini 2.5 Flash Image (Nano Banana)** on **image generation/editing**, and **outperforms** on **interleaved generation tasks**. |
 
 
+ <a id="news"></a>
+
+ ## 🔥 News
+
+ - **2025-11-28 · 🌐 Emu3.5 Web & Mobile Apps Live**: the official product experience is **now available** on the web at [zh.emu.world](https://zh.emu.world) (Mainland China) and [emu.world](https://emu.world) (global) 🎉 The new homepage highlights featured cases and a "Get Started" entry, while the workspace and mobile apps bring together creation, an inspiration feed, history, a personal profile, and language switching across web, Android APK, and H5. *([See more details](#official-web--mobile-apps) below.)*
+ - **2025-11-19 · 🚀 vLLM Offline Inference Released**: `inference_vllm.py` ships with a new cond/uncond batch scheduler, delivering **4–5× faster end-to-end generation** on vLLM 0.11.0 across Emu3.5 tasks. Jump to [Run Inference with vLLM](#run-inference-with-vllm) for setup guidance and see PR [#47](https://github.com/baaivision/Emu3.5/pull/47) for full details.
+ - **2025-11-17 · 🎛️ Gradio Demo (Transformers Backend)**: introduced `gradio_demo_image.py` and `gradio_demo_interleave.py` presets for the standard Transformers runtime, providing turnkey T2I/X2I and interleaved generation with streaming output. Try the commands in [Gradio Demo](#3-gradio-demo) to launch both UIs locally.
 
 ## Table of Contents
 
 1. [Model & Weights](#1-model--weights)
 2. [Quick Start](#2-quick-start)
+ 3. [Gradio Demo](#3-gradio-demo)
+ 4. [Schedule](#4-schedule)
+ 5. [Citation](#5-citation)
 
 ## 1. Model & Weights
 
 | Emu3.5-Image | [🤗 HF link](https://huggingface.co/BAAI/Emu3.5-Image/tree/main) |
 | Emu3.5-VisionTokenizer | [🤗 HF link](https://huggingface.co/BAAI/Emu3.5-VisionTokenizer/tree/main) |
 
+
+ *Note:*
+ - **Emu3.5** supports general-purpose multimodal prediction, including interleaved image-text generation and single-image generation (T2I/X2I) tasks.
+ - **Emu3.5-Image** focuses on T2I/X2I tasks for the best performance in those scenarios.
+ - Both models are pure next-token predictors without DiDA acceleration, so each image may take several minutes to generate.
+ - ⚡ **Stay tuned for DiDA-accelerated weights.**
+
+ > 💡 **Usage tip:**
+ > For **interleaved image-text generation**, use **Emu3.5**.
+ > For **single-image generation** (T2I and X2I), use **Emu3.5-Image** for the best quality.
+
+
 ## 2. Quick Start
 
 ### Environment Setup
 
 ```bash
+ # Requires Python 3.12 or higher.
 git clone https://github.com/baaivision/Emu3.5
 cd Emu3.5
+ pip install -r requirements/transformers.txt
 pip install flash_attn==2.8.3 --no-build-isolation
 ```
 ### Configuration
 
 Edit `configs/config.py` to set:
 
 - Paths: `model_path`, `vq_path`
+ - Task template: `task_type in {t2i, x2i, howto, story, explore, vla}`
+ - Input image: `use_image` (set to True when reference images are provided; controls the `<|IMAGE|>` token). Set `reference_image` in each prompt to specify the image path. For the x2i task, we recommend passing `reference_image` as a list of one or more image paths so that multi-image input is handled consistently.
 - Sampling: `sampling_params` (classifier_free_guidance, temperature, top_k/top_p, etc.)
+ - Aspect ratio (for the t2i task): `aspect_ratio` ("4:3", "21:9", "1:1", "auto", etc.)
 
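Pulling these options together, a minimal `configs/config.py` might look like the sketch below. This is an illustrative assumption based only on the fields described above; `exp_name` is a hypothetical field (suggested by the `outputs/<exp_name>/proto/` output path), and the repository's actual config schema may differ.

```python
# Hypothetical sketch of configs/config.py -- field names beyond the
# documented ones (model_path, vq_path, task_type, use_image,
# reference_image, sampling_params, aspect_ratio) are assumptions.

config = dict(
    exp_name="x2i_demo",                    # hypothetical; outputs land in outputs/<exp_name>/proto/
    model_path="BAAI/Emu3.5-Image",         # main model weights
    vq_path="BAAI/Emu3.5-VisionTokenizer",  # vision tokenizer weights
    task_type="x2i",                        # one of: t2i, x2i, howto, story, explore, vla
    use_image=True,                         # True when reference images are provided (<|IMAGE|>)
    reference_image=["inputs/ref.png"],     # list form recommended for x2i (multi-image capable)
    aspect_ratio="auto",                    # t2i only: "4:3", "21:9", "1:1", "auto", ...
    sampling_params=dict(
        classifier_free_guidance=3.0,
        temperature=1.0,
        top_k=2048,
        top_p=1.0,
    ),
)
```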
 ### Run Inference
 
 ```bash
 python inference.py --cfg configs/config.py
 ```
 
+
+ #### Example Configurations by Task
+ Below are example commands for different tasks.
+ Make sure to set `CUDA_VISIBLE_DEVICES` according to your available GPUs.
+
+ ```bash
+ # 🖼️ Text-to-Image (T2I) task
+ CUDA_VISIBLE_DEVICES=0 python inference.py --cfg configs/example_config_t2i.py
+
+ # 🔄 Any-to-Image (X2I) task
+ CUDA_VISIBLE_DEVICES=0,1 python inference.py --cfg configs/example_config_x2i.py
+
+ # 🎯 Visual Guidance task
+ CUDA_VISIBLE_DEVICES=0,1 python inference.py --cfg configs/example_config_visual_guidance.py
+
+ # 📖 Visual Narrative task
+ CUDA_VISIBLE_DEVICES=0,1 python inference.py --cfg configs/example_config_visual_narrative.py
+ ```
+
+ After running inference, the model writes results in protobuf format (`.pb` files) for each input prompt.
+
 Protobuf outputs are written to `outputs/<exp_name>/proto/`. For better throughput, we recommend ≥2 GPUs.
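Since the `.pb` schema itself is defined by the repository's proto files and not reproduced here, the following is only a hedged sketch for enumerating the generated protobuf files; the `outputs/<exp_name>/proto/` layout is the one assumption taken from the text above, and the helper name is ours.

```python
from pathlib import Path


def list_proto_outputs(exp_name: str, root: str = "outputs") -> list:
    """Enumerate the .pb files produced for one experiment, in sorted order.

    Assumes the outputs/<exp_name>/proto/ layout described above.
    """
    proto_dir = Path(root) / exp_name / "proto"
    return sorted(proto_dir.glob("*.pb"))
```

This is handy for feeding every generated file to `src/utils/vis_proto.py` in a loop.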
 
 
+ ### Run Inference with vLLM
+
+ #### vLLM Environment Setup
+
+ 1. (Optional but recommended) Create a fresh virtual environment for the vLLM backend.
+ ```bash
+ conda create -n Emu3p5 python=3.12
+ ```
+
+ 2. Install vLLM and apply the patch files.
+ ```bash
+ # Requires Python 3.12 or higher.
+ # Recommended: CUDA 12.8.
+ cd Emu3.5
+ pip install -r requirements/vllm.txt
+ pip install flash_attn==2.8.3 --no-build-isolation
+
+ python src/patch/apply.py
+ ```
+
+ #### Example Configurations by Task
+
+ ```bash
+ # 🖼️ Text-to-Image (T2I) task
+ CUDA_VISIBLE_DEVICES=0,1 python inference_vllm.py --cfg configs/example_config_t2i.py
+
+ # 🔄 Any-to-Image (X2I) task
+ CUDA_VISIBLE_DEVICES=0,1 python inference_vllm.py --cfg configs/example_config_x2i.py
+
+ # 🎯 Visual Guidance task
+ CUDA_VISIBLE_DEVICES=0,1 python inference_vllm.py --cfg configs/example_config_visual_guidance.py
+
+ # 📖 Visual Narrative task
+ CUDA_VISIBLE_DEVICES=0,1 python inference_vllm.py --cfg configs/example_config_visual_narrative.py
+ ```
+
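The cond/uncond batch scheduler mentioned in the news exists to feed classifier-free guidance, which needs two logit streams (conditional and unconditional) per decoding step. Below is a generic sketch of the combination step only; it is not the repository's implementation, and the function name is ours.

```python
def cfg_combine(cond_logits, uncond_logits, scale):
    """Classifier-free guidance combination (generic sketch).

    Moves the conditional logits further away from the unconditional ones
    as `scale` grows; scale == 1.0 recovers the plain conditional logits.
    """
    return [u + scale * (c - u) for c, u in zip(cond_logits, uncond_logits)]
```

Batching the conditional and unconditional requests together, as the scheduler does, lets both streams share one forward pass per step instead of two.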
 ### Visualize Protobuf Outputs
 
+ To visualize generated protobuf files (pass `--video` to also render a video visualization for interleaved output):
 
 ```bash
+ python src/utils/vis_proto.py --input <input_proto_path> [--output <output_dir>] [--video]
 ```
 
+ - `--input`: supports a single `.pb` file or a directory; directories are scanned recursively.
+ - `--output`: optional; defaults to `<input_dir>/results/<file_stem>` for files, or `<parent_dir_of_input>/results` for directories.
+
+ Expected output directory layout (example):
+
+ ```text
+ results/<pb_name>/
+ ├── 000_question.txt
+ ├── 000_global_cot.txt
+ ├── 001_text.txt
+ ├── 001_00_image.png
+ ├── 001_00_image_cot.txt
+ ├── 002_text.txt
+ ├── 002_00_image.png
+ ├── ...
+ └── video.mp4   # only when --video is enabled
+ ```
+
+ Each `*_text.txt` stores decoded text segments, each `*_image.png` stores a generated frame, and a matching `*_image_cot.txt` keeps image-level chain-of-thought notes when available.
+
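Downstream scripts often need those files regrouped by their numeric segment prefix. A small sketch under the layout assumptions above (our own helper, not part of the repository):

```python
from pathlib import Path


def collect_segments(results_dir):
    """Group vis_proto output files by numeric segment prefix.

    Assumes the layout shown above: NNN_text.txt, NNN_MM_image.png,
    NNN_MM_image_cot.txt, plus the 000_question / 000_global_cot files.
    Non-numeric entries such as video.mp4 are skipped.
    """
    segments = {}
    for path in sorted(Path(results_dir).iterdir()):
        prefix = path.name.split("_", 1)[0]
        if prefix.isdigit():
            segments.setdefault(int(prefix), []).append(path.name)
    return segments
```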
+ ## 3. Gradio Demo
+
+ We provide two Gradio demos for different application scenarios.
+
+ Emu3.5-Image Demo: an interactive interface optimized for Text-to-Image (T2I) and Any-to-Image (X2I) tasks:
+
+ ```bash
+ CUDA_VISIBLE_DEVICES=0,1 python gradio_demo_image.py --host 0.0.0.0 --port 7860
+ ```
 
+ Emu3.5-Interleave Demo: launches the Gradio demo for the Emu3.5 interleaved tasks (Visual Guidance and Visual Narrative):
+ ```bash
+ CUDA_VISIBLE_DEVICES=0,1 python gradio_demo_interleave.py --host 0.0.0.0 --port 7860
+ ```
+
+ ### Features
+
+ - Image generation: supports text-to-image generation and multimodal image generation
+ - Interleaved generation: supports long-sequence creation with alternating image and text generation
+ - Multiple aspect ratios for T2I: 9 preset aspect ratios (4:3, 16:9, 1:1, etc.) plus an auto mode
+ - Chain-of-thought display: automatically parses and formats the model's internal thinking process
+ - Real-time streaming: streams text and image generation with live updates
+
+ ### Official Web & Mobile Apps
+
+ - **Web**: The production-ready Emu3.5 experience is available at [zh.emu.world](https://zh.emu.world) (Mainland China) and [emu.world](https://emu.world) (global), featuring a curated homepage, a "Create" workspace, an inspiration feed, history, a personal profile, and language switching.
+ - **Mobile (Android APK & H5)**: The mobile clients provide the same core flows (prompt-based creation, an "inspiration" gallery, a personal center, and feedback & privacy entry points), with automatic UI language selection based on system settings.
+ - **Docs**: For product usage details, see the user guides:
+   - CN: [Emu3.5 AI User Guide (Chinese)](https://jwolpxeehx.feishu.cn/wiki/BKuKwkzZOi4pdRkVV13csI0FnIg?from=from_copylink)
+   - EN: [Emu3.5 AI User Guide (English)](https://jwolpxeehx.feishu.cn/wiki/Gcxtw9XHhisUu8kBEaac6s6xnhc?from=from_copylink)
+
+ #### Mobile App Download (QR Codes)
+
+ <div align='center'>
+ <table>
+ <tr>
+ <td align="center">
+ <img src="https://github.com/baaivision/Emu3.5/blob/main/assets/qr_zh.png?raw=True" alt="Emu3.5 Mobile App (Mainland China)" width="220" />
+ <br />
+ <sub><b>Emu3.5 Mobile · Mainland China</b></sub>
+ </td>
+ <td align="center">
+ <img src="https://github.com/baaivision/Emu3.5/blob/main/assets/qr.png?raw=True" alt="Emu3.5 Mobile App (Global)" width="220" />
+ <br />
+ <sub><b>Emu3.5 Mobile · Global</b></sub>
+ </td>
+ </tr>
+ </table>
+ </div>
+
+
+ ## 4. Schedule
+
+ - [x] Inference Code (NTP Version)
 - [ ] Advanced Image Decoder
+ - [ ] Discrete Diffusion Adaptation (DiDA) Inference & Weights
 
 
+ ## 5. Citation
 
 ```bibtex
 @misc{cui2025emu35nativemultimodalmodels,
 
 primaryClass={cs.CV},
 url={https://arxiv.org/abs/2510.26583},
 }
+ ```