---
license: apache-2.0
pipeline_tag: image-text-to-image
library_name: transformers
---

<div align='center'>
<h1>Emu3.5: Native Multimodal Models are World Learners</h1>

Emu3.5 Team, BAAI

[Project Page](https://emu.world/pages/web/landingPage) | [🤗 HF Models](https://huggingface.co/collections/BAAI/emu35) | [Paper](https://arxiv.org/pdf/2510.26583) | [App](https://emu.world/pages/web/home?route=index)
</div>

> 🚀 **Latest**: Emu3.5 Web & Mobile Apps and vLLM offline inference are live — see [🔥 News](#news) for details.

<div align='center'>
<img src="https://github.com/baaivision/Emu3.5/blob/main/assets/arch.png?raw=True" class="interpolation-image" alt="arch." height="100%" width="100%" />
</div>
| 🎯 | **RL Post-Training** | Large-scale **reinforcement learning** enhances **reasoning**, **compositionality**, and **generation quality**. |
| ⚡ | **Discrete Diffusion Adaptation (DiDA)** | Converts **sequential decoding → bidirectional parallel prediction**, achieving **≈20× faster inference without performance loss**. |
| 🖼️ | **Versatile Generation** | Excels in **long-horizon vision–language generation**, **any-to-image (X2I)** synthesis, and **text-rich image creation**. |
| 🌍 | **Generalizable World Modeling** | Enables **spatiotemporally consistent world exploration** and **open-world embodied manipulation** across diverse scenarios. |
| 🏆 | **Performance Benchmark** | Matches **Gemini 2.5 Flash Image (Nano Banana)** on **image generation/editing**, and **outperforms it** on **interleaved generation tasks**. |

<a id="news"></a>

## 🔥 News

- **2025-11-28 · 🚀 Emu3.5 Web & Mobile Apps Live** — The official product experience is **now available** on the web at [zh.emu.world](https://zh.emu.world) (Mainland China) and [emu.world](https://emu.world) (global). The new homepage highlights featured cases and a “Get Started” entry, while the workspace and mobile apps bring together creation, an inspiration feed, history, profile, and language switching across web, Android APK, and H5. *([See more details](#official-web--mobile-apps) below.)*
- **2025-11-19 · ⚡ vLLM Offline Inference Released** — Meet `inference_vllm.py` with a new cond/uncond batch scheduler, delivering **4–5× faster end-to-end generation** on vLLM 0.11.0 across Emu3.5 tasks. Jump to [Run Inference with vLLM](#run-inference-with-vllm) for setup guidance and see PR [#47](https://github.com/baaivision/Emu3.5/pull/47) for full details.
- **2025-11-17 · 🎛️ Gradio Demo (Transformers Backend)** — Introduced `gradio_demo_image.py` and `gradio_demo_interleave.py` presets for the standard Transformers runtime, providing turnkey T2I/X2I and interleaved generation with streaming output. Try the commands in [Gradio Demo](#3-gradio-demo) to launch both UIs locally.

## Table of Contents

1. [Model & Weights](#1-model--weights)
2. [Quick Start](#2-quick-start)
3. [Gradio Demo](#3-gradio-demo)
4. [Schedule](#4-schedule)
5. [Citation](#5-citation)

## 1. Model & Weights

| Emu3.5-Image | [🤗 HF link](https://huggingface.co/BAAI/Emu3.5-Image/tree/main) |
| Emu3.5-VisionTokenizer | [🤗 HF link](https://huggingface.co/BAAI/Emu3.5-VisionTokenizer/tree/main) |

*Note:*

- **Emu3.5** supports general-purpose multimodal prediction, including interleaved image-text generation and single-image generation (T2I/X2I) tasks.
- **Emu3.5-Image** is focused on T2I/X2I tasks and delivers the best performance in these scenarios.
- Both models are pure next-token predictors without DiDA acceleration (each image may take several minutes to generate).
- ⚡ **Stay tuned for DiDA-accelerated weights.**

> 💡 **Usage tip:**
> For **interleaved image-text generation**, use **Emu3.5**.
> For **single-image generation** (T2I and X2I), use **Emu3.5-Image** for the best quality.

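The usage tip above maps tasks to checkpoints mechanically; a minimal sketch of that choice (`BAAI/Emu3.5-Image` is the repo id from the weights table, while `BAAI/Emu3.5` is an assumed id for the base model):

```python
# Choose a checkpoint per the usage tip above.
# "BAAI/Emu3.5-Image" appears in the weights table; "BAAI/Emu3.5" is an
# assumed repo id for the base interleaved model.
SINGLE_IMAGE_TASKS = {"t2i", "x2i"}

def pick_checkpoint(task_type: str) -> str:
    if task_type in SINGLE_IMAGE_TASKS:
        return "BAAI/Emu3.5-Image"  # best quality for single-image generation
    return "BAAI/Emu3.5"            # interleaved image-text generation

print(pick_checkpoint("x2i"))  # → BAAI/Emu3.5-Image
```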
## 2. Quick Start
|
|
|
|
| 80 |
### Environment Setup
|
| 81 |
|
| 82 |
```bash
|
| 83 |
+
# Requires Python 3.12 or higher.
|
| 84 |
git clone https://github.com/baaivision/Emu3.5
|
| 85 |
cd Emu3.5
|
| 86 |
+
pip install -r requirements/transformers.txt
|
| 87 |
pip install flash_attn==2.8.3 --no-build-isolation
|
| 88 |
```
|
### Configuration

Edit `configs/config.py` to set:

- Paths: `model_path`, `vq_path`
- Task template: `task_type in {t2i, x2i, howto, story, explore, vla}`
- Input image: `use_image` (True to provide reference images; controls the `<|IMAGE|>` token); set `reference_image` in each prompt to specify the image path. For the x2i task, we recommend passing `reference_image` as a list of one or more image paths, so multi-image input is handled uniformly.
- Sampling: `sampling_params` (classifier_free_guidance, temperature, top_k/top_p, etc.)
- Aspect ratio (for the t2i task): `aspect_ratio` ("4:3", "21:9", "1:1", "auto", etc.)

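The options above can be collected in a config sketch. Field names follow the bullets, but the exact structure of `configs/config.py` may differ; treat this as an illustrative fragment with assumed values, not the repo's actual file:

```python
# Illustrative config fragment mirroring the options listed above.
# The real configs/config.py in the repo may organize these differently.
model_path = "path/to/Emu3.5-Image"        # main checkpoint
vq_path = "path/to/Emu3.5-VisionTokenizer" # vision tokenizer
task_type = "x2i"          # one of: t2i, x2i, howto, story, explore, vla
use_image = True           # emit the <|IMAGE|> token for reference images
aspect_ratio = "4:3"       # t2i only; "auto" lets the model decide
sampling_params = dict(
    classifier_free_guidance=3.0,  # assumed value; tune per task
    temperature=1.0,
    top_k=2048,
    top_p=1.0,
)
prompts = [
    {"text": "replace the sofa with a blue one",
     "reference_image": ["assets/room.png"]},  # list form for multi-image x2i
]
```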
### Run Inference

```bash
python inference.py --cfg configs/config.py
```

#### Example Configurations by Task

Below are example commands for different tasks. Make sure to set `CUDA_VISIBLE_DEVICES` according to your available GPUs.

```bash
# 🖼️ Text-to-Image (T2I) task
CUDA_VISIBLE_DEVICES=0 python inference.py --cfg configs/example_config_t2i.py

# 🎨 Any-to-Image (X2I) task
CUDA_VISIBLE_DEVICES=0,1 python inference.py --cfg configs/example_config_x2i.py

# 🎯 Visual Guidance task
CUDA_VISIBLE_DEVICES=0,1 python inference.py --cfg configs/example_config_visual_guidance.py

# 📖 Visual Narrative task
CUDA_VISIBLE_DEVICES=0,1 python inference.py --cfg configs/example_config_visual_narrative.py
```

After running inference, the model generates results in protobuf format (`.pb` files), one per input prompt. Protobuf outputs are written to `outputs/<exp_name>/proto/`. For better throughput, we recommend ≥2 GPUs.

### Run Inference with vLLM

#### vLLM Environment Setup

1. *(Optional, recommended)* Create a fresh virtual environment for the vLLM backend.

   ```bash
   conda create -n Emu3p5 python=3.12
   ```

2. Install vLLM and apply the patch files.

   ```bash
   # Requires Python 3.12 or higher.
   # Recommended: CUDA 12.8.
   pip install -r requirements/vllm.txt
   pip install flash_attn==2.8.3 --no-build-isolation

   cd Emu3.5
   python src/patch/apply.py
   ```

#### Example Configurations by Task

```bash
# 🖼️ Text-to-Image (T2I) task
CUDA_VISIBLE_DEVICES=0,1 python inference_vllm.py --cfg configs/example_config_t2i.py

# 🎨 Any-to-Image (X2I) task
CUDA_VISIBLE_DEVICES=0,1 python inference_vllm.py --cfg configs/example_config_x2i.py

# 🎯 Visual Guidance task
CUDA_VISIBLE_DEVICES=0,1 python inference_vllm.py --cfg configs/example_config_visual_guidance.py

# 📖 Visual Narrative task
CUDA_VISIBLE_DEVICES=0,1 python inference_vllm.py --cfg configs/example_config_visual_narrative.py
```

### Visualize Protobuf Outputs

To visualize generated protobuf files (pass `--video` to also render a video for interleaved output):

```bash
python src/utils/vis_proto.py --input <input_proto_path> [--output <output_dir>] [--video]
```

- `--input`: supports a single `.pb` file or a directory; directories are scanned recursively.
- `--output`: optional; defaults to `<input_dir>/results/<file_stem>` for files, or `<parent_dir_of_input>/results` for directories.

Expected output directory layout (example):

```text
results/<pb_name>/
├── 000_question.txt
├── 000_global_cot.txt
├── 001_text.txt
├── 001_00_image.png
├── 001_00_image_cot.txt
├── 002_text.txt
├── 002_00_image.png
├── ...
└── video.mp4   # only when --video is enabled
```

Each `*_text.txt` stores decoded segments, `*_image.png` stores generated frames, and a matching `*_image_cot.txt` keeps image-level chain-of-thought notes when available.

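The naming scheme above (a three-digit step index, then a per-image index) makes the outputs easy to post-process; a minimal sketch that groups files by generation step, assuming only the filename convention shown in the layout:

```python
import re
from collections import defaultdict

# Filenames follow "<step>_<kind>.txt" or "<step>_<img_idx>_image.png"
# as in the layout above; group them by generation step.
PATTERN = re.compile(r"^(\d{3})_(.+)$")

def group_by_step(filenames):
    steps = defaultdict(list)
    for name in filenames:
        m = PATTERN.match(name)
        if m:  # skip non-matching entries such as video.mp4
            steps[int(m.group(1))].append(name)
    return {k: sorted(v) for k, v in sorted(steps.items())}

files = ["000_question.txt", "001_text.txt", "001_00_image.png",
         "001_00_image_cot.txt", "video.mp4"]
grouped = group_by_step(files)
print(grouped)
```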
## 3. Gradio Demo

We provide two Gradio demos for different application scenarios:

Emu3.5-Image Demo — an interactive interface optimized for Text-to-Image (T2I) and Any-to-Image (X2I) tasks:

```bash
CUDA_VISIBLE_DEVICES=0,1 python gradio_demo_image.py --host 0.0.0.0 --port 7860
```

Emu3.5-Interleave Demo — an interactive interface for the interleaved tasks (Visual Guidance and Visual Narrative):

```bash
CUDA_VISIBLE_DEVICES=0,1 python gradio_demo_interleave.py --host 0.0.0.0 --port 7860
```

### Features

- Image generation: supports Text-to-Image and multimodal image generation
- Interleaved generation: supports long-sequence creation with alternating image and text generation
- Multiple aspect ratios for T2I: 9 preset aspect ratios (4:3, 16:9, 1:1, etc.) plus an auto mode
- Chain-of-thought display: automatically parses and formats the model's internal thinking process
- Real-time streaming: streams text and image generation with live updates

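The aspect-ratio presets above resolve a ratio string to concrete image dimensions. A rough sketch of the idea under an assumed ~1-megapixel budget with 64-pixel snapping; the demo's real preset table lives in the repository and may use different numbers:

```python
import math

# Illustrative: pick width/height for a ratio string under a fixed pixel
# budget, snapped to a multiple of 64. The actual presets used by the
# Gradio demo are defined in the repo and may differ.
def dims_for_ratio(ratio: str, area: int = 1024 * 1024, multiple: int = 64):
    w_r, h_r = (int(x) for x in ratio.split(":"))
    height = math.sqrt(area * h_r / w_r)
    width = height * w_r / h_r
    snap = lambda v: max(multiple, round(v / multiple) * multiple)
    return snap(width), snap(height)

print(dims_for_ratio("1:1"))   # → (1024, 1024)
print(dims_for_ratio("16:9"))
```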
### Official Web & Mobile Apps

- **Web**: The production Emu3.5 experience is available at [zh.emu.world](https://zh.emu.world) (Mainland China) and [emu.world](https://emu.world) (global), featuring a curated homepage, a “Create” workspace, an inspiration feed, history, a personal profile, and language switching.
- **Mobile (Android APK & H5)**: The mobile clients provide the same core flows — prompt-based creation, an “inspiration” gallery, a personal center, and feedback & privacy entry points — with automatic UI language selection based on system settings.
- **Docs**: For product usage details, see the **Emu3.5 AI User Guide** (Chinese and English):
  - CN: [Emu3.5 AI User Guide (Chinese)](https://jwolpxeehx.feishu.cn/wiki/BKuKwkzZOi4pdRkVV13csI0FnIg?from=from_copylink)
  - EN: [Emu3.5 AI User Guide](https://jwolpxeehx.feishu.cn/wiki/Gcxtw9XHhisUu8kBEaac6s6xnhc?from=from_copylink)

#### Mobile App Download (QR Codes)

<div align='center'>
<table>
<tr>
<td align="center">
<img src="https://github.com/baaivision/Emu3.5/blob/main/assets/qr_zh.png?raw=True" alt="Emu3.5 Mobile App (Mainland China)" width="220" />
<br />
<sub><b>Emu3.5 Mobile · Mainland China</b></sub>
</td>
<td align="center">
<img src="https://github.com/baaivision/Emu3.5/blob/main/assets/qr.png?raw=True" alt="Emu3.5 Mobile App (Global)" width="220" />
<br />
<sub><b>Emu3.5 Mobile · Global</b></sub>
</td>
</tr>
</table>
</div>

+
## 4. Schedule
|
| 250 |
|
| 251 |
+
- [x] Inference Code (NTP Version)
|
| 252 |
- [ ] Advanced Image Decoder
|
| 253 |
+
- [ ] Discrete Diffusion Adaptation (DiDA) Inference & Weights
|
| 254 |
|
| 255 |
|
| 256 |
+
## 5. Citation
|
| 257 |
|
| 258 |
```bibtex
|
| 259 |
@misc{cui2025emu35nativemultimodalmodels,
|