aikx committed (verified) · Commit b94be7d · 1 Parent(s): e0489a1

Update README.md

Files changed (1):
  1. README.md (+171 -15)

README.md CHANGED
@@ -1,18 +1,20 @@
  ---
  license: apache-2.0
- pipeline_tag: text-to-image
  library_name: transformers
  ---
-
  <div align='center'>
  <h1>Emu3.5: Native Multimodal Models are World Learners</h1>

  Emu3.5 Team, BAAI

- [Project Page](https://emu.world/) | [🤗HF Models](https://huggingface.co/collections/BAAI/emu35) | [Paper](https://arxiv.org/pdf/2510.26583) | [Code](https://github.com/baaivision/Emu3.5)
  </div>


  <div align='center'>
  <img src="https://github.com/baaivision/Emu3.5/blob/main/assets/arch.png?raw=True" class="interpolation-image" alt="arch." height="100%" width="100%" />
  </div>
@@ -32,17 +34,25 @@ Emu3.5 Team, BAAI
  | 🎯 | **RL Post-Training** | Large-scale **reinforcement learning** enhances **reasoning**, **compositionality**, and **generation quality**. |
  | ⚡ | **Discrete Diffusion Adaptation (DiDA)** | Converts **sequential decoding → bidirectional parallel prediction**, achieving **≈20× faster inference without performance loss**. |
  | 🖼️ | **Versatile Generation** | Excels in **long-horizon vision–language generation**, **any-to-image (X2I)** synthesis, and **text-rich image creation**. |
- | 🌐 | **Generalizable World Modeling** | Enables **spatiotemporally consistent world exploration**, and **open-world embodied manipulation** across diverse scenarios. |\
  | 🏆 | **Performance Benchmark** | Matches **Gemini 2.5 Flash Image (Nano Banana)** on **image generation/editing**, and **outperforms** on **interleaved generation tasks**. |


  ## Table of Contents

  1. [Model & Weights](#1-model--weights)
  2. [Quick Start](#2-quick-start)
- 3. [Schedule](#3-schedule)
- 4. [Citation](#4-citation)


  ## 1. Model & Weights

@@ -52,7 +62,17 @@ Emu3.5 Team, BAAI
  | Emu3.5-Image | [🤗 HF link](https://huggingface.co/BAAI/Emu3.5-Image/tree/main) |
  | Emu3.5-VisionTokenizer | [🤗 HF link](https://huggingface.co/BAAI/Emu3.5-VisionTokenizer/tree/main) |

- **Emu3.5** handles general tasks(including interleaved generation and image generation/editing), while **Emu3.5-Image** focuses on high-quality image generation/editing.


  ## 2. Quick Start
@@ -60,9 +80,10 @@ Emu3.5 Team, BAAI
  ### Environment Setup

  ```bash
  git clone https://github.com/baaivision/Emu3.5
  cd Emu3.5
- pip install -r requirements.txt
  pip install flash_attn==2.8.3 --no-build-isolation
  ```
  ### Configuration
@@ -71,8 +92,9 @@ Edit `configs/config.py` to set:
  - Paths: `model_path`, `vq_path`
  - Task template: `task_type in {t2i, x2i, howto, story, explore, vla}`
- - Input image: `use_image` (True to provide reference images, controls <|IMAGE|> token); set `reference_image` in each prompt to specify the image path.
  - Sampling: `sampling_params` (classifier_free_guidance, temperature, top_k/top_p, etc.)

  ### Run Inference

@@ -80,24 +102,158 @@ Edit `configs/config.py` to set:
  ```bash
  python inference.py --cfg configs/config.py
  ```

  Protobuf outputs are written to `outputs/<exp_name>/proto/`. For better throughput, we recommend ≥2 GPUs.

  ### Visualize Protobuf Outputs

- To visualize generated protobuf files:

  ```bash
- python src/utils/vis_proto.py --input <input_proto_file> --output <output_dir>
  ```

- ## 3. Schedule

- - [x] Inference Code(auto-regressive version)
  - [ ] Advanced Image Decoder
- - [ ] Discrete Diffusion Adaptation(DiDA) Inference & Weights


- ## 4. Citation

  ```bibtex
  @misc{cui2025emu35nativemultimodalmodels,
  ---
  license: apache-2.0
+ pipeline_tag: image-text-to-image
  library_name: transformers
  ---
  <div align='center'>
  <h1>Emu3.5: Native Multimodal Models are World Learners</h1>

  Emu3.5 Team, BAAI

+ [Project Page](https://emu.world/pages/web/landingPage) | [🤗HF Models](https://huggingface.co/collections/BAAI/emu35) | [Paper](https://arxiv.org/pdf/2510.26583) | [App](https://emu.world/pages/web/home?route=index)
  </div>


+ > 🔔 **Latest**: Emu3.5 Web & Mobile Apps and vLLM offline inference are live — see [🔥 News](#news) for details.
+
+
  <div align='center'>
  <img src="https://github.com/baaivision/Emu3.5/blob/main/assets/arch.png?raw=True" class="interpolation-image" alt="arch." height="100%" width="100%" />
  </div>
 
  | 🎯 | **RL Post-Training** | Large-scale **reinforcement learning** enhances **reasoning**, **compositionality**, and **generation quality**. |
  | ⚡ | **Discrete Diffusion Adaptation (DiDA)** | Converts **sequential decoding → bidirectional parallel prediction**, achieving **≈20× faster inference without performance loss**. |
  | 🖼️ | **Versatile Generation** | Excels in **long-horizon vision–language generation**, **any-to-image (X2I)** synthesis, and **text-rich image creation**. |
+ | 🌐 | **Generalizable World Modeling** | Enables **spatiotemporally consistent world exploration** and **open-world embodied manipulation** across diverse scenarios. |
  | 🏆 | **Performance Benchmark** | Matches **Gemini 2.5 Flash Image (Nano Banana)** on **image generation/editing**, and **outperforms** on **interleaved generation tasks**. |


+ <a id="news"></a>
+
+ ## 🔥 News
+
+ - **2025-11-28 · 🌐 Emu3.5 Web & Mobile Apps Live** — The official product experience is **now available** on the web at [zh.emu.world](https://zh.emu.world) (Mainland China) and [emu.world](https://emu.world) (global) 🎉 The new homepage highlights featured cases and a "Get Started" entry, while the workspace and mobile apps bring together creation, an inspiration feed, history, profile, and language switching across web, Android APK, and H5. *([See more details](#official-web--mobile-apps) below.)*
+ - **2025-11-19 · 🚀 vLLM Offline Inference Released** — `inference_vllm.py` ships with a new cond/uncond batch scheduler, delivering **4–5× faster end-to-end generation** on vLLM 0.11.0 across Emu3.5 tasks. Jump to [Run Inference with vLLM](#run-inference-with-vllm) for setup guidance and see PR [#47](https://github.com/baaivision/Emu3.5/pull/47) for full details.
+ - **2025-11-17 · 🎛️ Gradio Demo (Transformers Backend)** — Introduced `gradio_demo_image.py` and `gradio_demo_interleave.py` presets for the standard Transformers runtime, providing turnkey T2I/X2I and interleaved generation experiences with streaming output. Try the commands in [Gradio Demo](#3-gradio-demo) to launch both UIs locally.

  ## Table of Contents

  1. [Model & Weights](#1-model--weights)
  2. [Quick Start](#2-quick-start)
+ 3. [Gradio Demo](#3-gradio-demo)
+ 4. [Schedule](#4-schedule)
+ 5. [Citation](#5-citation)

  ## 1. Model & Weights

 
  | Emu3.5-Image | [🤗 HF link](https://huggingface.co/BAAI/Emu3.5-Image/tree/main) |
  | Emu3.5-VisionTokenizer | [🤗 HF link](https://huggingface.co/BAAI/Emu3.5-VisionTokenizer/tree/main) |

+
+ *Note:*
+ - **Emu3.5** supports general-purpose multimodal prediction, including interleaved image-text generation and single-image generation (T2I/X2I) tasks.
+ - **Emu3.5-Image** focuses on T2I/X2I tasks and delivers the best performance in these scenarios.
+ - Both models are pure next-token predictors without DiDA acceleration (each image may take several minutes to generate).
+ - ⚡ **Stay tuned for DiDA-accelerated weights.**
+
+ > 💡 **Usage tip:**
+ > For **interleaved image-text generation**, use **Emu3.5**.
+ > For **single-image generation** (T2I and X2I), use **Emu3.5-Image** for the best quality.
+


  ## 2. Quick Start
 
  ### Environment Setup

  ```bash
+ # Requires Python 3.12 or higher.
  git clone https://github.com/baaivision/Emu3.5
  cd Emu3.5
+ pip install -r requirements/transformers.txt
  pip install flash_attn==2.8.3 --no-build-isolation
  ```
  ### Configuration
 

  - Paths: `model_path`, `vq_path`
  - Task template: `task_type in {t2i, x2i, howto, story, explore, vla}`
+ - Input image: `use_image` (True to provide reference images; controls the <|IMAGE|> token); set `reference_image` in each prompt to specify the image path. For the x2i task, we recommend passing `reference_image` as a list of one or more image paths to support multi-image input.
  - Sampling: `sampling_params` (classifier_free_guidance, temperature, top_k/top_p, etc.)
+ - Aspect ratio (for t2i task): `aspect_ratio` ("4:3", "21:9", "1:1", "auto", etc.)
 
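For reference, the fields above can be sketched as a minimal, hypothetical `configs/config.py`. The field names follow the list above, but the overall layout (in particular the `prompts` list and the exact sampling keys) is an assumption, not the repository's authoritative format; consult the shipped `configs/example_config_*.py` files for the real structure.

```python
# Hypothetical sketch of a config, based on the fields listed above.
# The `prompts` structure and values here are illustrative assumptions.

model_path = "BAAI/Emu3.5-Image"   # or BAAI/Emu3.5 for interleaved tasks
vq_path = "BAAI/Emu3.5-VisionTokenizer"

task_type = "x2i"                  # one of: t2i, x2i, howto, story, explore, vla
use_image = True                   # True -> reference images via the <|IMAGE|> token

prompts = [
    {
        "text": "Turn this photo into a watercolor painting.",
        # For x2i, a list supports single- or multi-image input.
        "reference_image": ["examples/input.png"],
    },
]

sampling_params = dict(
    classifier_free_guidance=3.0,
    temperature=1.0,
    top_k=2048,
    top_p=1.0,
)

aspect_ratio = "auto"              # t2i only: "4:3", "21:9", "1:1", "auto", ...
```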
  ### Run Inference

  ```bash
  python inference.py --cfg configs/config.py
  ```

+ #### Example Configurations by Task
+ Below are example commands for different tasks.
+ Make sure to set `CUDA_VISIBLE_DEVICES` according to your available GPUs.
+
+ ```bash
+ # 🖼️ Text-to-Image (T2I) task
+ CUDA_VISIBLE_DEVICES=0 python inference.py --cfg configs/example_config_t2i.py
+
+ # 🔄 Any-to-Image (X2I) task
+ CUDA_VISIBLE_DEVICES=0,1 python inference.py --cfg configs/example_config_x2i.py
+
+ # 🎯 Visual Guidance task
+ CUDA_VISIBLE_DEVICES=0,1 python inference.py --cfg configs/example_config_visual_guidance.py
+
+ # 📖 Visual Narrative task
+ CUDA_VISIBLE_DEVICES=0,1 python inference.py --cfg configs/example_config_visual_narrative.py
+
+ # After running inference, the model generates results in protobuf format (.pb files) for each input prompt.
+ ```

  Protobuf outputs are written to `outputs/<exp_name>/proto/`. For better throughput, we recommend ≥2 GPUs.

+ ### Run Inference with vLLM
+
+ #### vLLM Environment Setup
+
+ 1. (Recommended) Create a fresh virtual environment for the vLLM backend.
+ ```bash
+ conda create -n Emu3p5 python=3.12
+ ```
+
+ 2. Install vLLM and apply the patch files.
+ ```bash
+ # Requires Python 3.12 or higher.
+ # Recommended: CUDA 12.8.
+ pip install -r requirements/vllm.txt
+ pip install flash_attn==2.8.3 --no-build-isolation
+
+ cd Emu3.5
+ python src/patch/apply.py
+ ```
+
+ #### Example Configurations by Task
+
+ ```bash
+ # 🖼️ Text-to-Image (T2I) task
+ CUDA_VISIBLE_DEVICES=0,1 python inference_vllm.py --cfg configs/example_config_t2i.py
+
+ # 🔄 Any-to-Image (X2I) task
+ CUDA_VISIBLE_DEVICES=0,1 python inference_vllm.py --cfg configs/example_config_x2i.py
+
+ # 🎯 Visual Guidance task
+ CUDA_VISIBLE_DEVICES=0,1 python inference_vllm.py --cfg configs/example_config_visual_guidance.py
+
+ # 📖 Visual Narrative task
+ CUDA_VISIBLE_DEVICES=0,1 python inference_vllm.py --cfg configs/example_config_visual_narrative.py
+ ```

  ### Visualize Protobuf Outputs

+ To visualize generated protobuf files (pass `--video` to also render a video for interleaved output):

  ```bash
+ python src/utils/vis_proto.py --input <input_proto_path> [--output <output_dir>] [--video]
+ ```
+
+ - `--input`: supports a single `.pb` file or a directory; directories are scanned recursively.
+ - `--output`: optional; defaults to `<input_dir>/results/<file_stem>` for files, or `<parent_dir_of_input>/results` for directories.
+
+ Expected output directory layout (example):
+
+ ```text
+ results/<pb_name>/
+ ├── 000_question.txt
+ ├── 000_global_cot.txt
+ ├── 001_text.txt
+ ├── 001_00_image.png
+ ├── 001_00_image_cot.txt
+ ├── 002_text.txt
+ ├── 002_00_image.png
+ ├── ...
+ └── video.mp4   # only when --video is enabled
  ```

+ Each `*_text.txt` stores decoded segments, `*_image.png` stores generated frames, and a matching `*_image_cot.txt` keeps image-level chain-of-thought notes when available.
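The flat layout above can be regrouped by segment index with a short helper. This is only a sketch: it assumes the `NNN_*` / `NNN_MM_*` naming shown in the example tree, and `group_segments` is a hypothetical name, not a utility shipped with the repository.

```python
import pathlib
from collections import defaultdict

def group_segments(results_dir):
    """Group vis_proto output files by their leading segment index,
    assuming the NNN_*.txt / NNN_MM_image.png naming shown above."""
    groups = defaultdict(list)
    for path in sorted(pathlib.Path(results_dir).iterdir()):
        prefix = path.name.split("_", 1)[0]
        if prefix.isdigit():          # skip non-segment files such as video.mp4
            groups[prefix].append(path.name)
    return dict(groups)
```

For the example layout, `group_segments("results/<pb_name>")` would return one entry per segment index, e.g. mapping `"001"` to its text, image, and image-CoT files.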

+ ## 3. Gradio Demo
+
+ We provide two Gradio demos for different application scenarios.
+
+ Emu3.5-Image Demo — an interactive interface optimized for Text-to-Image (T2I) and Any-to-Image (X2I) tasks:
+
+ ```bash
+ CUDA_VISIBLE_DEVICES=0,1 python gradio_demo_image.py --host 0.0.0.0 --port 7860
+ ```
+
+ Emu3.5-Interleave Demo — an interactive interface for the interleaved tasks (Visual Guidance and Visual Narrative):
+ ```bash
+ CUDA_VISIBLE_DEVICES=0,1 python gradio_demo_interleave.py --host 0.0.0.0 --port 7860
+ ```
+
+ ### Features
+
+ - Image Generation: supports text-to-image generation and multimodal image generation
+ - Interleaved Generation: supports long-sequence creation with alternating image and text generation
+ - Multiple Aspect Ratios for T2I: 9 preset aspect ratios (4:3, 16:9, 1:1, etc.) plus an auto mode
+ - Chain-of-Thought Display: automatically parses and formats the model's internal thinking process
+ - Real-time Streaming: streams text and image generation with live updates
+
+ ### Official Web & Mobile Apps
+
+ - **Web**: The production Emu3.5 experience is available at [zh.emu.world](https://zh.emu.world) (Mainland China) and [emu.world](https://emu.world) (global), featuring a curated homepage, a "Create" workspace, an inspiration feed, history, a personal profile, and language switching.
+ - **Mobile (Android APK & H5)**: The mobile clients provide the same core flows — prompt-based creation, an "inspiration" gallery, a personal center, and feedback & privacy entry points — with automatic UI language selection based on system settings.
+ - **Docs**: For product usage details, see the **Emu3.5 AI 使用指南** (Chinese) and the **Emu3.5 AI User Guide** (English):
+   - CN: [Emu3.5 AI 使用指南](https://jwolpxeehx.feishu.cn/wiki/BKuKwkzZOi4pdRkVV13csI0FnIg?from=from_copylink)
+   - EN: [Emu3.5 AI User Guide](https://jwolpxeehx.feishu.cn/wiki/Gcxtw9XHhisUu8kBEaac6s6xnhc?from=from_copylink)
+
+ #### Mobile App Download (QR Codes)
+
+ <div align='center'>
+ <table>
+ <tr>
+ <td align="center">
+ <img src="https://github.com/baaivision/Emu3.5/blob/main/assets/qr_zh.png?raw=True" alt="Emu3.5 Mobile App (Mainland China)" width="220" />
+ <br />
+ <sub><b>Emu3.5 Mobile · Mainland China</b></sub>
+ </td>
+ <td align="center">
+ <img src="https://github.com/baaivision/Emu3.5/blob/main/assets/qr.png?raw=True" alt="Emu3.5 Mobile App (Global)" width="220" />
+ <br />
+ <sub><b>Emu3.5 Mobile · Global</b></sub>
+ </td>
+ </tr>
+ </table>
+ </div>
+
+ ## 4. Schedule

+ - [x] Inference Code (NTP Version)
  - [ ] Advanced Image Decoder
+ - [ ] Discrete Diffusion Adaptation (DiDA) Inference & Weights


+ ## 5. Citation

  ```bibtex
  @misc{cui2025emu35nativemultimodalmodels,