Commit 524b8cf by David0219 (verified)
1 Parent(s): 807f74a

Upload 7 files

.gitattributes CHANGED
@@ -35,3 +35,4 @@ saved_model/**/* filter=lfs diff=lfs merge=lfs -text
35
  *tfevents* filter=lfs diff=lfs merge=lfs -text
36
  siglip/tokenizer.json filter=lfs diff=lfs merge=lfs -text
37
  flux/dev_grid.jpg filter=lfs diff=lfs merge=lfs -text
38
+ uniworld/tokenizer.json filter=lfs diff=lfs merge=lfs -text
uniworld/README.md ADDED
@@ -0,0 +1,377 @@
1
+ ---
2
+ license: mit
3
+ ---
4
+
5
+ <p align="center">
6
+ <img src="https://s21.ax1x.com/2025/06/03/pVCBdw8.png" width="200"/>
7
+ </p>
8
+ <h2 align="center">
9
+ <a href="https://github.com/PKU-YuanGroup/UniWorld-V1/">
10
+ UniWorld: High-Resolution Semantic Encoders for <br> Unified Visual Understanding and Generation
11
+ </a>
12
+ </h2>
13
+
14
+
15
+
16
+ <h5 align="left">
17
+
18
+ [![arXiv](https://img.shields.io/badge/Arxiv-Report%20-b31b1b.svg?logo=arXiv)](https://github.com/user-attachments/files/20573816/report.pdf)
19
+ [![model](https://img.shields.io/badge/🤗-Model-blue.svg)](https://huggingface.co/LanguageBind/UniWorld-V1)
20
+ [![data](https://img.shields.io/badge/🤗-Dataset-blue.svg)](https://huggingface.co/datasets/LanguageBind/UniWorld-V1)
21
+ [![License](https://img.shields.io/badge/License-Apache-yellow)](https://github.com/PKU-YuanGroup/UniWorld/blob/main/LICENSE)
22
+ [![Twitter](https://img.shields.io/badge/-Twitter@LinBin46984-black?logo=twitter&logoColor=1D9BF0)](https://x.com/LinBin46984/status/1929905024349679682) <br>
23
+ [![demo0](https://img.shields.io/badge/🤗-Demo0-blue.svg)](http://8.130.165.159:8800/)
24
+ [![demo1](https://img.shields.io/badge/🤗-Demo1-blue.svg)](http://8.130.165.159:8801/)
25
+ [![demo2](https://img.shields.io/badge/🤗-Demo2-blue.svg)](http://8.130.165.159:8802/)
26
+ [![demo3](https://img.shields.io/badge/🤗-Demo3-blue.svg)](http://8.130.165.159:8803/)
27
+ [![demo4](https://img.shields.io/badge/🤗-Demo4-blue.svg)](http://8.130.165.159:8804/)
28
+ [![demo5](https://img.shields.io/badge/🤗-Demo5-blue.svg)](http://8.130.165.159:8805/)
29
+ [![demo6](https://img.shields.io/badge/🤗-Demo6-blue.svg)](http://8.130.165.159:8806/)
30
+ [![demo7](https://img.shields.io/badge/🤗-Demo7-blue.svg)](http://8.130.165.159:8807/) <br>
31
+ [![GitHub repo stars](https://img.shields.io/github/stars/PKU-YuanGroup/UniWorld-V1?style=flat&logo=github&logoColor=whitesmoke&label=Stars)](https://github.com/PKU-YuanGroup/UniWorld-V1/stargazers)&#160;
32
+ [![GitHub repo forks](https://img.shields.io/github/forks/PKU-YuanGroup/UniWorld-V1?style=flat&logo=github&logoColor=whitesmoke&label=Forks)](https://github.com/PKU-YuanGroup/UniWorld-V1/network)&#160;
33
+ [![GitHub repo watchers](https://img.shields.io/github/watchers/PKU-YuanGroup/UniWorld-V1?style=flat&logo=github&logoColor=whitesmoke&label=Watchers)](https://github.com/PKU-YuanGroup/UniWorld-V1/watchers)&#160;
34
+ [![GitHub repo size](https://img.shields.io/github/repo-size/PKU-YuanGroup/UniWorld-V1?style=flat&logo=github&logoColor=whitesmoke&label=Repo%20Size)](https://github.com/PKU-YuanGroup/UniWorld-V1/archive/refs/heads/main.zip) <br>
35
+ [![GitHub repo contributors](https://img.shields.io/github/contributors-anon/PKU-YuanGroup/UniWorld-V1?style=flat&label=Contributors)](https://github.com/PKU-YuanGroup/UniWorld-V1/graphs/contributors)
36
+ [![GitHub Commit](https://img.shields.io/github/commit-activity/m/PKU-YuanGroup/UniWorld-V1?label=Commit)](https://github.com/PKU-YuanGroup/UniWorld-V1/commits/main/)
37
+ [![Pr](https://img.shields.io/github/issues-pr-closed-raw/PKU-YuanGroup/UniWorld-V1.svg?label=Merged+PRs&color=green)](https://github.com/PKU-YuanGroup/UniWorld-V1/pulls)
38
+ [![GitHub issues](https://img.shields.io/github/issues/PKU-YuanGroup/UniWorld-V1?color=critical&label=Issues)](https://github.com/PKU-YuanGroup/UniWorld-V1/issues?q=is%3Aopen+is%3Aissue)
39
+ [![GitHub closed issues](https://img.shields.io/github/issues-closed/PKU-YuanGroup/UniWorld-V1?color=success&label=Issues)](https://github.com/PKU-YuanGroup/UniWorld-V1/issues?q=is%3Aissue+is%3Aclosed)
40
+ </h5>
41
+
42
+
43
+
44
+ # 📣 News
45
+
46
+ * **[2025.06.03]** 🤗 We release UniWorld, a unified framework for understanding, generation, and editing. All [data](https://huggingface.co/datasets/LanguageBind/UniWorld-V1), [models](https://huggingface.co/LanguageBind/UniWorld-V1), [training code](https://github.com/PKU-YuanGroup/UniWorld-V1?tab=readme-ov-file#%EF%B8%8F-training), and [evaluation code](https://github.com/PKU-YuanGroup/UniWorld-V1?tab=readme-ov-file#%EF%B8%8F-evaluation) are open-sourced. Check our [report](https://github.com/user-attachments/files/20573816/report.pdf) for more details. Feel free to **watch** 👀 this repository for the latest updates.
47
+
48
+
49
+ # 😍 Gallery
50
+
51
+ UniWorld shows excellent performance in **20+** tasks.
52
+
53
+ UniWorld, trained on only 2.7M samples, consistently outperforms [BAGEL](https://github.com/ByteDance-Seed/Bagel) (trained on 2665M samples) on ImgEdit-Bench for image manipulation. It also surpasses the specialized image editing model [Step1X-Edit](https://github.com/stepfun-ai/Step1X-Edit) across multiple ImgEdit-Bench dimensions, including add, adjust, and extract.
54
+
55
+ **Click to play**
56
+
57
+ <p align="left">
58
+ <a href="https://www.youtube.com/watch?v=77U0PKH7uxs" target="_blank">
59
+ <img src="https://github.com/user-attachments/assets/dbb2acf7-3a54-44b5-9bca-b30cb3385056" width="850" style="margin-bottom: 0.2;"/>
60
+ </a>
61
+ </p>
62
+
63
+
64
+ <p align="left">
65
+ <img src="https://s21.ax1x.com/2025/06/03/pVCB6ln.png" width="850" style="margin-bottom: 0.2;"/>
66
+ </p>
67
+
68
+ # 😮 Highlights
69
+
70
+ ### 1. All Resources Fully Open-Sourced
71
+ - We fully open-source the models, data, training and evaluation code to facilitate rapid community exploration of unified architectures.
72
+
73
+ - We curate 10+ CV downstream tasks, including canny, depth, sketch, MLSD, segmentation and so on.
74
+
75
+ - We annotate 286K long-caption samples using [Qwen2-VL-72B](https://huggingface.co/Qwen/Qwen2-VL-72B-Instruct). We use GPT-4o to filter [ImgEdit](https://github.com/PKU-YuanGroup/ImgEdit), resulting in 724K high-quality editing samples (all with a short edge ≥ 1024 pixels). Additionally, we organize and filter existing open-source datasets. The details can be found [here](https://github.com/PKU-YuanGroup/UniWorld/tree/main?tab=readme-ov-file#data-details).
76
+
77
+ ### 2. Contrastive Semantic Encoders as Reference Control Signals
78
+ - Unlike prior approaches that use VAE-encoded reference images for low-level control, we advocate using contrastive visual encoders as control signals for reference images.
79
+
80
+ - For such encoders, we observe that as resolution increases, global features approach saturation and model capacity shifts toward preserving fine details, which is crucial for maintaining fidelity in non-edited regions.
81
+
82
+ ### 3. Image Priors via VLM Encoding Without Learnable Tokens
83
+
84
+ - We find that multimodal features encoded by VLMs can interpret instructions while retaining image priors. Due to causal attention, the format `<instruction><image>` is particularly important.
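
The ordering point can be illustrated with a toy causal mask. This is a minimal sketch with hypothetical token labels, not UniWorld's code: under causal attention, position `i` attends only to positions `j <= i`, so image tokens can see the instruction only when the instruction comes first.

```python
def causal_mask(n):
    """Lower-triangular mask: mask[i][j] is True when position i may attend to j."""
    return [[j <= i for j in range(n)] for i in range(n)]


def image_sees_instruction(sequence):
    """Return True if every image token can attend to every instruction token."""
    mask = causal_mask(len(sequence))
    image_pos = [i for i, tok in enumerate(sequence) if tok == "image"]
    instr_pos = [i for i, tok in enumerate(sequence) if tok == "instruction"]
    return all(mask[i][j] for i in image_pos for j in instr_pos)


# Hypothetical sequences: 3 instruction tokens and 4 image tokens.
instruction_first = ["instruction"] * 3 + ["image"] * 4
image_first = ["image"] * 4 + ["instruction"] * 3

print(image_sees_instruction(instruction_first))  # True
print(image_sees_instruction(image_first))        # False
```

With the instruction placed first, the image-token features can be conditioned on the instruction; with the image first, they cannot.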
85
+
86
+
87
+ <p align="left">
88
+ <img src="https://s21.ax1x.com/2025/06/03/pVCB5Y4.jpg" width="850" style="margin-bottom: 0.2;"/>
89
+ </p>
90
+
91
+ # 🤗 Demo
92
+
93
+ ### Gradio Web UI
94
+
95
+ We highly recommend trying our web demo with the following command.
96
+
97
+ ```bash
98
+ MODEL_PATH="path/to/model"
99
+ FLUX_PATH="path/to/flux"
100
+ SIGLIP_PATH="path/to/siglip"
101
+ CUDA_VISIBLE_DEVICES=0 python -m univa.serve.gradio_web_server \
102
+ --model_path ${MODEL_PATH} \
103
+ --flux_path ${FLUX_PATH} \
104
+ --siglip_path ${SIGLIP_PATH}
105
+ ```
106
+
107
+ ### CLI Inference
108
+
109
+ ```bash
110
+ MODEL_PATH="path/to/model"
111
+ FLUX_PATH="path/to/flux"
112
+ SIGLIP_PATH="path/to/siglip"
113
+ CUDA_VISIBLE_DEVICES=1 python -m univa.serve.cli \
114
+ --model_path ${MODEL_PATH} \
115
+ --flux_path ${FLUX_PATH} \
116
+ --siglip_path ${SIGLIP_PATH}
117
+ ```
118
+
119
+ ### ComfyUI
120
+
121
+ Coming soon...
122
+
123
+ # ⚙️ Requirements and Installation
124
+
125
+ 1. Clone this repository and navigate to the UniWorld folder
126
+ ```
127
+ git clone https://github.com/PKU-YuanGroup/UniWorld
128
+ cd UniWorld
129
+ ```
130
+ 2. Install required packages
131
+
132
+ ```
133
+ conda create -n univa python=3.10 -y
134
+ conda activate univa
135
+ pip install -r requirements.txt
136
+ ```
137
+
138
+ # 🗝️ Training
139
+
140
+ ### Data preparation
141
+
142
+ Download the data from [LanguageBind/UniWorld-V1](https://huggingface.co/datasets/LanguageBind/UniWorld-V1). The dataset consists of two parts: source images and annotation JSON files.
143
+
144
+ Prepare a `data.txt` file in the following format:
145
+
146
+ 1. The first column is the root path to the image.
147
+
148
+ 2. The second column is the corresponding annotation JSON file.
149
+
150
+ 3. The third column indicates whether to enable the region-weighting strategy. We recommend setting it to `true` for editing data and `false` for all other data.
151
+
152
+ ```
153
+ data/BLIP3o-60k,json/blip3o_t2i_58859.json,false
154
+ data/coco2017_caption_canny-236k,coco2017_canny_236574.json,false
155
+ data/imgedit,json/imgedit/laion_add_part0_edit.json,true
156
+ ```
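
As a rough illustration of the three-column format, the sketch below parses `data.txt` content into typed records. The `DataEntry` type and `parse_data_txt` helper are hypothetical names introduced here; they are not part of the UniWorld codebase, and the repository's own loader may behave differently.

```python
import csv
import io
from typing import NamedTuple


class DataEntry(NamedTuple):
    image_root: str         # column 1: root path to the images
    annotation_json: str    # column 2: annotation JSON file
    region_weighting: bool  # column 3: enable region weighting


def parse_data_txt(text):
    """Parse comma-separated data.txt content; raise ValueError on malformed rows."""
    entries = []
    for row in csv.reader(io.StringIO(text)):
        if not any(cell.strip() for cell in row):
            continue  # skip blank lines
        if len(row) != 3:
            raise ValueError(f"expected 3 columns, got {len(row)}: {row}")
        root, annotation, flag = (cell.strip() for cell in row)
        if flag.lower() not in ("true", "false"):
            raise ValueError(f"third column must be true/false, got {flag!r}")
        entries.append(DataEntry(root, annotation, flag.lower() == "true"))
    return entries


sample = """\
data/BLIP3o-60k,json/blip3o_t2i_58859.json,false
data/coco2017_caption_canny-236k,coco2017_canny_236574.json,false
data/imgedit,json/imgedit/laion_add_part0_edit.json,true
"""
entries = parse_data_txt(sample)
for entry in entries:
    print(entry)
```

Only the editing dataset in the sample has region weighting enabled, matching the recommendation above.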
157
+
158
+ We provide a simple online verification tool to check whether the paths in `data.txt` are set correctly.
159
+ ```
160
+ python univa/serve/check_data.py
161
+ ```
162
+
163
+ <p align="left">
164
+ <img src="https://s21.ax1x.com/2025/05/30/pV9iP8f.png" width="850" style="margin-bottom: 0.2;"/>
165
+ </p>
166
+
167
+ ### Data details
168
+
169
+ <details><summary>Text-to-Image Generation</summary><p>
170
+
171
+ - [BLIP3o-60k](https://huggingface.co/datasets/BLIP3o/BLIP3o-60k): We add text-to-image instructions to half of the data. [108 GB storage usage.]
172
+ - [OSP1024-286k](https://huggingface.co/datasets/LanguageBind/UniWorld-V1/tree/main/data/OSP1024-286k): Sourced from internal data of the [Open-Sora Plan](https://github.com/PKU-YuanGroup/Open-Sora-Plan), with captions generated using [Qwen2-VL-72B](https://huggingface.co/Qwen/Qwen2-VL-72B-Instruct). Images have an aspect ratio between 3:4 and 4:3, aesthetic score ≥ 6, and a short side ≥ 1024 pixels. [326 GB storage usage.]
173
+
174
+ </p></details>
175
+
176
+ <details><summary>Image Editing</summary><p>
177
+
178
+ - [imgedit-724k](https://huggingface.co/datasets/sysuyy/ImgEdit/tree/main): Data is filtered using GPT-4o, retaining approximately half. [2.1T storage usage.]
179
+ - [OmniEdit-368k](https://huggingface.co/datasets/TIGER-Lab/OmniEdit-Filtered-1.2M): For image editing data, samples with edited regions smaller than 1/100 were filtered out; images have a short side ≥ 1024 pixels. [204 GB storage usage.]
180
+ - [SEED-Data-Edit-Part1-Openimages-65k](https://huggingface.co/datasets/AILab-CVC/SEED-Data-Edit-Part1-Openimages): For image editing data, samples with edited regions smaller than 1/100 were filtered out. Images have a short side ≥ 1024 pixels. [10 GB storage usage.]
181
+ - [SEED-Data-Edit-Part2-3-12k](https://huggingface.co/datasets/AILab-CVC/SEED-Data-Edit-Part2-3): For image editing data, samples with edited regions smaller than 1/100 were filtered out. Images have a short side ≥ 1024 pixels. [10 GB storage usage.]
182
+ - [PromptfixData-18k](https://huggingface.co/datasets/yeates/PromptfixData): For image restoration data and some editing data, samples with edited regions smaller than 1/100 were filtered out. Images have a short side ≥ 1024 pixels. [9 GB storage usage.]
183
+ - [StyleBooth-11k](https://huggingface.co/scepter-studio/stylebooth): For transfer style data, images have a short side ≥ 1024 pixels. [4 GB storage usage.]
184
+ - [Ghibli-36k](https://huggingface.co/datasets/LanguageBind/UniWorld-V1/tree/main/data/Ghibli-36k): For transfer style data, images have a short side ≥ 1024 pixels. **Warning: This data has not been quality filtered.** [170 GB storage usage.]
185
+ </p></details>
186
+
187
+ <details><summary>Extract & Try-on</summary><p>
188
+
189
+ - [viton_hd-23k](https://huggingface.co/datasets/forgeml/viton_hd): Converted from the source data into an instruction dataset for product extraction. [1 GB storage usage.]
190
+ - [deepfashion-27k](https://huggingface.co/datasets/lirus18/deepfashion): Converted from the source data into an instruction dataset for product extraction. [1 GB storage usage.]
191
+ - [shop_product-23k](https://huggingface.co/datasets/LanguageBind/UniWorld-V1/tree/main/data/shop_product-23k): Sourced from internal data of the [Open-Sora Plan](https://github.com/PKU-YuanGroup/Open-Sora-Plan), focusing on product extraction and virtual try-on, with images having a short side ≥ 1024 pixels. [12 GB storage usage.]
192
+
193
+ </p></details>
194
+
195
+ <details><summary>Image Perception</summary><p>
196
+
197
+ - [coco2017_caption_canny-236k](https://huggingface.co/datasets/gebinhui/coco2017_caption_canny): img->canny & canny->img [25 GB storage usage.]
198
+ - [coco2017_caption_depth-236k](https://huggingface.co/datasets/gebinhui/coco2017_caption_depth): img->depth & depth->img [8 GB storage usage.]
199
+ - [coco2017_caption_hed-236k](https://huggingface.co/datasets/gebinhui/coco2017_caption_hed): img->hed & hed->img [13 GB storage usage.]
200
+ - [coco2017_caption_mlsd-236k](https://huggingface.co/datasets/gebinhui/coco2017_caption_mlsd): img->mlsd & mlsd->img [ GB storage usage.]
201
+ - [coco2017_caption_normal-236k](https://huggingface.co/datasets/gebinhui/coco2017_caption_normal): img->normal & normal->img [10 GB storage usage.]
202
+ - [coco2017_caption_openpose-62k](https://huggingface.co/datasets/wangherr/coco2017_caption_openpose): img->pose & pose->img [2 GB storage usage.]
203
+ - [coco2017_caption_sketch-236k](https://huggingface.co/datasets/wangherr/coco2017_caption_sketch): img->sketch & sketch->img [15 GB storage usage.]
204
+ - [unsplash_canny-20k](https://huggingface.co/datasets/wtcherr/unsplash_10k_canny): img->canny & canny->img [2 GB storage usage.]
205
+ - [open_pose-40k](https://huggingface.co/datasets/raulc0399/open_pose_controlnet): img->pose & pose->img [4 GB storage usage.]
206
+ - [mscoco-controlnet-canny-less-colors-236k](https://huggingface.co/datasets/hazal-karakus/mscoco-controlnet-canny-less-colors): img->canny & canny->img [13 GB storage usage.]
207
+ - [coco2017_seg_box-448k](https://huggingface.co/datasets/LanguageBind/UniWorld-V1/tree/main/data/coco2017_seg_box-448k): img->detection & img->segmentation (mask), instances with regions smaller than 1/100 were filtered out. We visualise masks on the original image as gt-image. [39 GB storage usage.]
208
+ - [viton_hd-11k](https://huggingface.co/datasets/forgeml/viton_hd): img->pose [1 GB storage usage.]
209
+ - [deepfashion-13k](https://huggingface.co/datasets/lirus18/deepfashion): img->pose [1 GB storage usage.]
210
+
211
+ </p></details>
212
+
213
+
214
+ ### Training
215
+
216
+ #### Prepare pretrained weights
217
+ Download [black-forest-labs/FLUX.1-dev](https://huggingface.co/black-forest-labs/FLUX.1-dev) to `$FLUX_PATH`.
218
+ Download [Qwen/Qwen2.5-VL-7B-Instruct](https://huggingface.co/Qwen/Qwen2.5-VL-7B-Instruct) to `$QWENVL_PATH`. We also support other sizes of Qwen2.5-VL.
219
+
220
+ ```
221
+ SAVE_PATH="path/to/save/UniWorld-Qwen2.5-VL-7B-Instruct-FLUX.1-dev-fp32"
222
+ python scripts/make_univa_qwen2p5vl_weight.py \
223
+ --origin_flux_ckpt_path $FLUX_PATH \
224
+ --origin_qwenvl_ckpt_path $QWENVL_PATH \
225
+ --save_path ${SAVE_PATH}
226
+ ```
227
+
228
+ ```
229
+ # stage1
230
+ bash scripts/denoiser/flux_qwen2p5vl_7b_vlm_stage1_512.sh
231
+ ```
232
+
233
+ Download [flux-redux-siglipv2-512.bin](https://huggingface.co/LanguageBind/UniWorld-V1/resolve/main/flux-redux-siglipv2-512.bin?download=true) and set its path as `pretrained_siglip_mlp_path` in `stage2.yaml`. The weights are sourced from [ostris/Flex.1-alpha-Redux](https://huggingface.co/ostris/Flex.1-alpha-Redux); we only reorganized them.
234
+ ```
235
+ # stage2
236
+ bash scripts/denoiser/flux_qwen2p5vl_7b_vlm_stage2_512.sh
237
+ ```
238
+
239
+ # ⚡️ Evaluation
240
+
241
+ ### Text-to-Image Generation
242
+
243
+ <details><summary>GenEval</summary><p>
244
+
245
+ ```
246
+ cd univa/eval/geneval
247
+ # follow the instruction in univa/eval/geneval/README.md
248
+ ```
249
+ </p></details>
250
+
251
+ <details><summary>WISE</summary><p>
252
+
253
+ ```
254
+ cd univa/eval/wise
255
+ # follow the instruction in univa/eval/wise/README.md
256
+ ```
257
+
258
+ </p></details>
259
+
260
+ <details><summary>GenAI-Bench</summary><p>
261
+
262
+ ```
263
+ cd univa/eval/genai
264
+ # follow the instruction in univa/eval/genai/README.md
265
+ ```
266
+
267
+ </p></details>
268
+
269
+ <details><summary>DPG-Bench</summary><p>
270
+
271
+ ```
272
+ cd univa/eval/dpgbench
273
+ # follow the instruction in univa/eval/dpgbench/README.md
274
+ ```
275
+
276
+ </p></details>
277
+
278
+ ### Image Editing
279
+
280
+ <details><summary>ImgEdit</summary><p>
281
+
282
+ ```
283
+ cd univa/eval/imgedit
284
+ # follow the instruction in univa/eval/imgedit/README.md
285
+ ```
286
+
287
+ </p></details>
288
+
289
+ <details><summary>GEdit</summary><p>
290
+
291
+ ```
292
+ cd univa/eval/gdit
293
+ # follow the instruction in univa/eval/gdit/README.md
294
+ ```
295
+
296
+ </p></details>
297
+
298
+ # 📊 Benchmarks
299
+
300
+
301
+
302
+ <p align="left">
303
+ <img src="https://s21.ax1x.com/2025/06/03/pVPFuTJ.png" width="850" style="margin-bottom: 0.2;"/>
304
+ </p>
305
+
306
+
307
+ # 💡 How to Contribute
308
+ We greatly appreciate your contributions to the UniWorld open-source community and your help in making it even better!
309
+
310
+ For more details, please refer to the [Contribution Guidelines](docs/Contribution_Guidelines.md).
311
+
312
+ # 👍 Acknowledgement and Related Work
313
+ * [ImgEdit](https://github.com/PKU-YuanGroup/ImgEdit): ImgEdit is a large-scale, high-quality image-editing dataset comprising 1.2 million carefully curated edit pairs.
314
+ * [Open-Sora Plan](https://github.com/PKU-YuanGroup/Open-Sora-Plan): An open‑source text-to-image/video foundation model, which provides a lot of caption data.
315
+ * [SEED-Data-Edit](https://huggingface.co/datasets/AILab-CVC/SEED-Data-Edit): A hybrid dataset for instruction-guided image editing.
316
+ * [Qwen2.5-VL](https://huggingface.co/Qwen/Qwen2.5-VL-7B-Instruct): The new flagship vision-language model of Qwen.
317
+ * [FLUX.1-Redux-dev](https://huggingface.co/black-forest-labs/FLUX.1-Redux-dev): Given an input image, FLUX.1 Redux can reproduce the image with slight variation, allowing you to refine a given image.
318
+ * [SigLIP 2](https://github.com/google-research/big_vision/blob/main/big_vision/configs/proj/image_text/README_siglip2.md): New multilingual vision-language encoders.
319
+ * [Step1X-Edit](https://github.com/stepfun-ai/Step1X-Edit): A state-of-the-art image editing model.
320
+ * [BLIP3-o](https://github.com/JiuhaiChen/BLIP3o): A unified multimodal model that combines the reasoning and instruction following strength of autoregressive models with the generative power of diffusion models.
321
+ * [BAGEL](https://github.com/ByteDance-Seed/Bagel): An open‑source multimodal foundation model with 7B active parameters (14B total) trained on large‑scale interleaved multimodal data.
322
+
323
+
324
+ # 🔒 License
325
+ * See [LICENSE](LICENSE) for details. The FLUX weights fall under the [FLUX.1 [dev] Non-Commercial License](https://huggingface.co/black-forest-labs/FLUX.1-dev/blob/main/LICENSE.md).
326
+
327
+
328
+ ## ✨ Star History
329
+
330
+ [![Star History](https://api.star-history.com/svg?repos=PKU-YuanGroup/UniWorld)](https://star-history.com/#PKU-YuanGroup/UniWorld&Date)
331
+
332
+
333
+
334
+ # ✏️ Citing
335
+
336
+
337
+
338
+ ```bibtex
339
+ @misc{lin2025uniworldhighresolutionsemanticencoders,
340
+ title={UniWorld: High-Resolution Semantic Encoders for Unified Visual Understanding and Generation},
341
+ author={Bin Lin and Zongjian Li and Xinhua Cheng and Yuwei Niu and Yang Ye and Xianyi He and Shenghai Yuan and Wangbo Yu and Shaodong Wang and Yunyang Ge and Yatian Pang and Li Yuan},
342
+ year={2025},
343
+ eprint={2506.03147},
344
+ archivePrefix={arXiv},
345
+ primaryClass={cs.CV},
346
+ url={https://arxiv.org/abs/2506.03147},
347
+ }
348
+ ```
349
+
350
+
351
+ ```bibtex
352
+ @article{niu2025wise,
353
+ title={Wise: A world knowledge-informed semantic evaluation for text-to-image generation},
354
+ author={Niu, Yuwei and Ning, Munan and Zheng, Mengren and Lin, Bin and Jin, Peng and Liao, Jiaqi and Ning, Kunpeng and Zhu, Bin and Yuan, Li},
355
+ journal={arXiv preprint arXiv:2503.07265},
356
+ year={2025}
357
+ }
358
+ ```
359
+
360
+ ```bibtex
361
+ @article{lin2024open,
362
+ title={Open-Sora Plan: Open-Source Large Video Generation Model},
363
+ author={Lin, Bin and Ge, Yunyang and Cheng, Xinhua and Li, Zongjian and Zhu, Bin and Wang, Shaodong and He, Xianyi and Ye, Yang and Yuan, Shenghai and Chen, Liuhan and others},
364
+ journal={arXiv preprint arXiv:2412.00131},
365
+ year={2024}
366
+ }
367
+ ```
368
+
369
+
370
+ # 🤝 Community contributors
371
+
372
+ <a href="https://github.com/PKU-YuanGroup/UniWorld-V1/graphs/contributors">
373
+ <img src="https://contrib.rocks/image?repo=PKU-YuanGroup/UniWorld-V1" />
374
+ </a>
375
+
376
+ This model is presented in the paper: [UniWorld: High-Resolution Semantic Encoders for Unified Visual Understanding and Generation](https://huggingface.co/papers/2506.03147)
377
+
uniworld/preprocessor_config.json ADDED
@@ -0,0 +1,29 @@
1
+ {
2
+ "do_convert_rgb": true,
3
+ "do_normalize": true,
4
+ "do_rescale": true,
5
+ "do_resize": true,
6
+ "image_mean": [
7
+ 0.48145466,
8
+ 0.4578275,
9
+ 0.40821073
10
+ ],
11
+ "image_processor_type": "Qwen2VLImageProcessor",
12
+ "image_std": [
13
+ 0.26862954,
14
+ 0.26130258,
15
+ 0.27577711
16
+ ],
17
+ "max_pixels": 12845056,
18
+ "merge_size": 2,
19
+ "min_pixels": 3136,
20
+ "patch_size": 14,
21
+ "processor_class": "Qwen2_5_VLProcessor",
22
+ "resample": 3,
23
+ "rescale_factor": 0.00392156862745098,
24
+ "size": {
25
+ "longest_edge": 12845056,
26
+ "shortest_edge": 3136
27
+ },
28
+ "temporal_patch_size": 2
29
+ }
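
To make the numbers in this config concrete, here is a small sketch of the arithmetic they imply. It assumes, as in Qwen2-VL-style processors, that `merge_size` × `merge_size` patches are merged into one visual token; the helper names are hypothetical and this is not the `Qwen2VLImageProcessor` implementation.

```python
PATCH_SIZE = 14
MERGE_SIZE = 2
MIN_PIXELS = 3136
MAX_PIXELS = 12845056
IMAGE_MEAN = (0.48145466, 0.4578275, 0.40821073)
IMAGE_STD = (0.26862954, 0.26130258, 0.27577711)
RESCALE_FACTOR = 0.00392156862745098  # 1 / 255


def normalize_pixel(rgb):
    """Rescale 0-255 channel values to [0, 1], then apply per-channel mean/std."""
    return tuple(
        (value * RESCALE_FACTOR - mean) / std
        for value, mean, std in zip(rgb, IMAGE_MEAN, IMAGE_STD)
    )


def merged_token_budget(pixels):
    """Pixels divided by the area covered by one merged token (assumption: 28x28)."""
    cell_area = (PATCH_SIZE * MERGE_SIZE) ** 2  # 28 * 28 = 784
    return pixels // cell_area


print(merged_token_budget(MIN_PIXELS))  # 4
print(merged_token_budget(MAX_PIXELS))  # 16384
print(normalize_pixel((255, 255, 255)))
```

Under that assumption, `min_pixels` and `max_pixels` correspond to budgets of 4 and 16384 merged visual tokens per image.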
uniworld/special_tokens_map.json ADDED
@@ -0,0 +1,31 @@
1
+ {
2
+ "additional_special_tokens": [
3
+ "<|im_start|>",
4
+ "<|im_end|>",
5
+ "<|object_ref_start|>",
6
+ "<|object_ref_end|>",
7
+ "<|box_start|>",
8
+ "<|box_end|>",
9
+ "<|quad_start|>",
10
+ "<|quad_end|>",
11
+ "<|vision_start|>",
12
+ "<|vision_end|>",
13
+ "<|vision_pad|>",
14
+ "<|image_pad|>",
15
+ "<|video_pad|>"
16
+ ],
17
+ "eos_token": {
18
+ "content": "<|im_end|>",
19
+ "lstrip": false,
20
+ "normalized": false,
21
+ "rstrip": false,
22
+ "single_word": false
23
+ },
24
+ "pad_token": {
25
+ "content": "<|endoftext|>",
26
+ "lstrip": false,
27
+ "normalized": false,
28
+ "rstrip": false,
29
+ "single_word": false
30
+ }
31
+ }
uniworld/task_head_final.pt ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:de8c72b715f1a9f37a36f218b3481e8dbce5fe6f57a7890b660d5da0d611efce
3
+ size 146925712
uniworld/tokenizer.json ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:ba0c439f7be467bf47d12a7e6f9adc6116201056fc60c67f431c679b7c16afc8
3
+ size 11422064
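
The three lines added for `tokenizer.json` and `task_head_final.pt` are Git LFS pointer files, not the binary contents. A minimal sketch of reading such a pointer follows; `parse_lfs_pointer` is a hypothetical helper written for illustration, not a Git LFS API.

```python
def parse_lfs_pointer(text):
    """Split a Git LFS pointer file into its version, hash algorithm, digest, and size."""
    fields = dict(line.split(" ", 1) for line in text.strip().splitlines())
    hash_algo, digest = fields["oid"].split(":", 1)
    return {
        "version": fields["version"],
        "hash_algo": hash_algo,
        "digest": digest,
        "size": int(fields["size"]),  # size of the real file in bytes
    }


pointer_text = """\
version https://git-lfs.github.com/spec/v1
oid sha256:ba0c439f7be467bf47d12a7e6f9adc6116201056fc60c67f431c679b7c16afc8
size 11422064
"""
pointer = parse_lfs_pointer(pointer_text)
print(pointer["hash_algo"], pointer["size"])  # sha256 11422064
```

The `size` field is what lets a client report the download size before fetching the object identified by the digest.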
uniworld/tokenizer_config.json ADDED
@@ -0,0 +1,213 @@
1
+ {
2
+ "add_bos_token": false,
3
+ "add_prefix_space": false,
4
+ "added_tokens_decoder": {
5
+ "151643": {
6
+ "content": "<|endoftext|>",
7
+ "lstrip": false,
8
+ "normalized": false,
9
+ "rstrip": false,
10
+ "single_word": false,
11
+ "special": true
12
+ },
13
+ "151644": {
14
+ "content": "<|im_start|>",
15
+ "lstrip": false,
16
+ "normalized": false,
17
+ "rstrip": false,
18
+ "single_word": false,
19
+ "special": true
20
+ },
21
+ "151645": {
22
+ "content": "<|im_end|>",
23
+ "lstrip": false,
24
+ "normalized": false,
25
+ "rstrip": false,
26
+ "single_word": false,
27
+ "special": true
28
+ },
29
+ "151646": {
30
+ "content": "<|object_ref_start|>",
31
+ "lstrip": false,
32
+ "normalized": false,
33
+ "rstrip": false,
34
+ "single_word": false,
35
+ "special": true
36
+ },
37
+ "151647": {
38
+ "content": "<|object_ref_end|>",
39
+ "lstrip": false,
40
+ "normalized": false,
41
+ "rstrip": false,
42
+ "single_word": false,
43
+ "special": true
44
+ },
45
+ "151648": {
46
+ "content": "<|box_start|>",
47
+ "lstrip": false,
48
+ "normalized": false,
49
+ "rstrip": false,
50
+ "single_word": false,
51
+ "special": true
52
+ },
53
+ "151649": {
54
+ "content": "<|box_end|>",
55
+ "lstrip": false,
56
+ "normalized": false,
57
+ "rstrip": false,
58
+ "single_word": false,
59
+ "special": true
60
+ },
61
+ "151650": {
62
+ "content": "<|quad_start|>",
63
+ "lstrip": false,
64
+ "normalized": false,
65
+ "rstrip": false,
66
+ "single_word": false,
67
+ "special": true
68
+ },
69
+ "151651": {
70
+ "content": "<|quad_end|>",
71
+ "lstrip": false,
72
+ "normalized": false,
73
+ "rstrip": false,
74
+ "single_word": false,
75
+ "special": true
76
+ },
77
+ "151652": {
78
+ "content": "<|vision_start|>",
79
+ "lstrip": false,
80
+ "normalized": false,
81
+ "rstrip": false,
82
+ "single_word": false,
83
+ "special": true
84
+ },
85
+ "151653": {
86
+ "content": "<|vision_end|>",
87
+ "lstrip": false,
88
+ "normalized": false,
89
+ "rstrip": false,
90
+ "single_word": false,
91
+ "special": true
92
+ },
93
+ "151654": {
94
+ "content": "<|vision_pad|>",
95
+ "lstrip": false,
96
+ "normalized": false,
97
+ "rstrip": false,
98
+ "single_word": false,
99
+ "special": true
100
+ },
101
+ "151655": {
102
+ "content": "<|image_pad|>",
103
+ "lstrip": false,
104
+ "normalized": false,
105
+ "rstrip": false,
106
+ "single_word": false,
107
+ "special": true
108
+ },
109
+ "151656": {
110
+ "content": "<|video_pad|>",
111
+ "lstrip": false,
112
+ "normalized": false,
113
+ "rstrip": false,
114
+ "single_word": false,
115
+ "special": true
116
+ },
117
+ "151657": {
118
+ "content": "<tool_call>",
119
+ "lstrip": false,
120
+ "normalized": false,
121
+ "rstrip": false,
122
+ "single_word": false,
123
+ "special": false
124
+ },
125
+ "151658": {
126
+ "content": "</tool_call>",
127
+ "lstrip": false,
128
+ "normalized": false,
129
+ "rstrip": false,
130
+ "single_word": false,
131
+ "special": false
132
+ },
133
+ "151659": {
134
+ "content": "<|fim_prefix|>",
135
+ "lstrip": false,
136
+ "normalized": false,
137
+ "rstrip": false,
138
+ "single_word": false,
139
+ "special": false
140
+ },
141
+ "151660": {
142
+ "content": "<|fim_middle|>",
143
+ "lstrip": false,
144
+ "normalized": false,
145
+ "rstrip": false,
146
+ "single_word": false,
147
+ "special": false
148
+ },
149
+ "151661": {
150
+ "content": "<|fim_suffix|>",
151
+ "lstrip": false,
152
+ "normalized": false,
153
+ "rstrip": false,
154
+ "single_word": false,
155
+ "special": false
156
+ },
157
+ "151662": {
158
+ "content": "<|fim_pad|>",
159
+ "lstrip": false,
160
+ "normalized": false,
161
+ "rstrip": false,
162
+ "single_word": false,
163
+ "special": false
164
+ },
165
+ "151663": {
166
+ "content": "<|repo_name|>",
167
+ "lstrip": false,
168
+ "normalized": false,
169
+ "rstrip": false,
170
+ "single_word": false,
171
+ "special": false
172
+ },
173
+ "151664": {
174
+ "content": "<|file_sep|>",
175
+ "lstrip": false,
176
+ "normalized": false,
177
+ "rstrip": false,
178
+ "single_word": false,
179
+ "special": false
180
+ }
181
+ },
182
+ "additional_special_tokens": [
183
+ "<|im_start|>",
184
+ "<|im_end|>",
185
+ "<|object_ref_start|>",
186
+ "<|object_ref_end|>",
187
+ "<|box_start|>",
188
+ "<|box_end|>",
189
+ "<|quad_start|>",
190
+ "<|quad_end|>",
191
+ "<|vision_start|>",
192
+ "<|vision_end|>",
193
+ "<|vision_pad|>",
194
+ "<|image_pad|>",
195
+ "<|video_pad|>"
196
+ ],
197
+ "bos_token": null,
198
+ "chat_template": "{%- if tools %}\n {{- '<|im_start|>system\\n' }}\n {%- if messages[0]['role'] == 'system' %}\n {{- messages[0]['content'] }}\n {%- else %}\n {{- 'You are a helpful assistant.' }}\n {%- endif %}\n {{- \"\\n\\n# Tools\\n\\nYou may call one or more functions to assist with the user query.\\n\\nYou are provided with function signatures within <tools></tools> XML tags:\\n<tools>\" }}\n {%- for tool in tools %}\n {{- \"\\n\" }}\n {{- tool | tojson }}\n {%- endfor %}\n {{- \"\\n</tools>\\n\\nFor each function call, return a json object with function name and arguments within <tool_call></tool_call> XML tags:\\n<tool_call>\\n{\\\"name\\\": <function-name>, \\\"arguments\\\": <args-json-object>}\\n</tool_call><|im_end|>\\n\" }}\n{%- else %}\n {%- if messages[0]['role'] == 'system' %}\n {{- '<|im_start|>system\\n' + messages[0]['content'] + '<|im_end|>\\n' }}\n {%- else %}\n {{- '<|im_start|>system\\nYou are a helpful assistant.<|im_end|>\\n' }}\n {%- endif %}\n{%- endif %}\n{%- for message in messages %}\n {%- if (message.role == \"user\") or (message.role == \"system\" and not loop.first) or (message.role == \"assistant\" and not message.tool_calls) %}\n {{- '<|im_start|>' + message.role + '\\n' + message.content + '<|im_end|>' + '\\n' }}\n {%- elif message.role == \"assistant\" %}\n {{- '<|im_start|>' + message.role }}\n {%- if message.content %}\n {{- '\\n' + message.content }}\n {%- endif %}\n {%- for tool_call in message.tool_calls %}\n {%- if tool_call.function is defined %}\n {%- set tool_call = tool_call.function %}\n {%- endif %}\n {{- '\\n<tool_call>\\n{\"name\": \"' }}\n {{- tool_call.name }}\n {{- '\", \"arguments\": ' }}\n {{- tool_call.arguments | tojson }}\n {{- '}\\n</tool_call>' }}\n {%- endfor %}\n {{- '<|im_end|>\\n' }}\n {%- elif message.role == \"tool\" %}\n {%- if (loop.index0 == 0) or (messages[loop.index0 - 1].role != \"tool\") %}\n {{- '<|im_start|>user' }}\n {%- endif %}\n {{- '\\n<tool_response>\\n' }}\n {{- message.content }}\n {{- '\\n</tool_response>' }}\n {%- if loop.last or (messages[loop.index0 + 1].role != \"tool\") %}\n {{- '<|im_end|>\\n' }}\n {%- endif %}\n {%- endif %}\n{%- endfor %}\n{%- if add_generation_prompt %}\n {{- '<|im_start|>assistant\\n' }}\n{%- endif %}\n",
199
+ "clean_up_tokenization_spaces": false,
200
+ "eos_token": "<|im_end|>",
201
+ "errors": "replace",
202
+ "extra_special_tokens": {},
203
+ "max_length": null,
204
+ "model_max_length": 131072,
205
+ "pad_to_multiple_of": null,
206
+ "pad_token": "<|endoftext|>",
207
+ "pad_token_type_id": 0,
208
+ "padding_side": "right",
209
+ "processor_class": "Qwen2_5_VLProcessor",
210
+ "split_special_tokens": false,
211
+ "tokenizer_class": "Qwen2Tokenizer",
212
+ "unk_token": null
213
+ }
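
For plain text conversations without tools, the `chat_template` above reduces to a simple ChatML layout. The sketch below reimplements only that no-tools path in plain Python to show the resulting prompt; `render_chatml` is a hypothetical helper, and real usage would normally go through the tokenizer's own template rendering.

```python
DEFAULT_SYSTEM = "You are a helpful assistant."


def render_chatml(messages, add_generation_prompt=True):
    """Render messages the way the template's no-tools branch does (text content only)."""
    parts = []
    if messages and messages[0]["role"] == "system":
        # A leading system message replaces the default system prompt.
        parts.append(f"<|im_start|>system\n{messages[0]['content']}<|im_end|>\n")
        rest = messages[1:]
    else:
        parts.append(f"<|im_start|>system\n{DEFAULT_SYSTEM}<|im_end|>\n")
        rest = messages
    for message in rest:
        parts.append(f"<|im_start|>{message['role']}\n{message['content']}<|im_end|>\n")
    if add_generation_prompt:
        parts.append("<|im_start|>assistant\n")
    return "".join(parts)


prompt = render_chatml([{"role": "user", "content": "Hi"}])
print(prompt)
```

Generation then stops at `<|im_end|>`, which this config declares as `eos_token`.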
uniworld/vocab.json ADDED
The diff for this file is too large to render. See raw diff