ovedrive committed · Commit ff94f1e · verified · 1 Parent(s): 4756f49

Upload folder using huggingface_hub

.gitattributes CHANGED
@@ -33,3 +33,4 @@ saved_model/**/* filter=lfs diff=lfs merge=lfs -text
  *.zip filter=lfs diff=lfs merge=lfs -text
  *.zst filter=lfs diff=lfs merge=lfs -text
  *tfevents* filter=lfs diff=lfs merge=lfs -text
+ processor/tokenizer.json filter=lfs diff=lfs merge=lfs -text
README.md ADDED
@@ -0,0 +1,144 @@
---
license: apache-2.0
language:
- en
- zh
library_name: diffusers
pipeline_tag: image-to-image
---
<p align="center">
    <img src="https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen-Image/qwen_image_edit_logo.png" width="400"/>
</p>
<p align="center">
    💜 <a href="https://chat.qwen.ai/"><b>Qwen Chat</b></a>&nbsp;&nbsp;|&nbsp;&nbsp;🤗 <a href="https://huggingface.co/Qwen/Qwen-Image-Edit-2511">Hugging Face</a>&nbsp;&nbsp;|&nbsp;&nbsp;🤖 <a href="https://modelscope.cn/models/Qwen/Qwen-Image-Edit-2511">ModelScope</a>&nbsp;&nbsp;|&nbsp;&nbsp;📑 <a href="https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen-Image/Qwen_Image.pdf">Tech Report</a>&nbsp;&nbsp;|&nbsp;&nbsp;📑 <a href="https://qwenlm.github.io/blog/qwen-image-edit-2511/">Blog</a>
    <br>
    🖥️ <a href="https://huggingface.co/spaces/Qwen/Qwen-Image-Edit-2511">Demo</a>&nbsp;&nbsp;|&nbsp;&nbsp;💬 <a href="https://github.com/QwenLM/Qwen-Image/blob/main/assets/wechat.png">WeChat (微信)</a>&nbsp;&nbsp;|&nbsp;&nbsp;🫨 <a href="https://discord.gg/CV4E9rpNSD">Discord</a>&nbsp;&nbsp;|&nbsp;&nbsp;<a href="https://github.com/QwenLM/Qwen-Image">Github</a>
</p>

<p align="center">
    <img src="https://qianwen-res.oss-accelerate-overseas.aliyuncs.com/Qwen-Image/edit2511/edit2511big.JPG#center" width="1600"/>
</p>

# Introduction

We are excited to introduce Qwen-Image-Edit-2511, an enhanced version of Qwen-Image-Edit-2509 that brings multiple improvements, including notably better consistency. To try out the latest model, please visit [Qwen Chat](https://chat.qwen.ai/?inputFeature=image_edit) and select the Image Editing feature.

Key enhancements in Qwen-Image-Edit-2511 include mitigated image drift, improved character consistency, integrated LoRA capabilities, enhanced industrial design generation, and strengthened geometric reasoning.

## Quick Start

Install the latest version of diffusers:

```shell
pip install git+https://github.com/huggingface/diffusers
```

The following code snippet illustrates how to use `Qwen-Image-Edit-2511`:

```python
import os

import torch
from PIL import Image
from diffusers import QwenImageEditPlusPipeline

# Load the pipeline in bfloat16 and move it to the GPU
pipeline = QwenImageEditPlusPipeline.from_pretrained("Qwen/Qwen-Image-Edit-2511", torch_dtype=torch.bfloat16)
print("pipeline loaded")

pipeline.to("cuda")
pipeline.set_progress_bar_config(disable=None)

# Multiple reference images are passed as a list for multi-subject editing
image1 = Image.open("input1.png")
image2 = Image.open("input2.png")
prompt = "The magician bear is on the left, the alchemist bear is on the right, facing each other in the central park square."
inputs = {
    "image": [image1, image2],
    "prompt": prompt,
    "generator": torch.manual_seed(0),
    "true_cfg_scale": 4.0,
    "negative_prompt": " ",
    "num_inference_steps": 40,
    "guidance_scale": 1.0,
    "num_images_per_prompt": 1,
}

with torch.inference_mode():
    output = pipeline(**inputs)
    output_image = output.images[0]
    output_image.save("output_image_edit_2511.png")
    print("image saved at", os.path.abspath("output_image_edit_2511.png"))
```

## Showcase

**Qwen-Image-Edit-2511 Enhances Character Consistency**

In Qwen-Image-Edit-2511, character consistency has been significantly improved. The model can perform imaginative edits based on an input portrait while preserving the identity and visual characteristics of the subject.

![](https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen-Image/edit2511/幻灯片1.JPG#center)
![](https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen-Image/edit2511/幻灯片2.JPG#center)
![](https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen-Image/edit2511/幻灯片3.JPG#center)
![](https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen-Image/edit2511/幻灯片4.JPG#center)

**Improved Multi-Person Consistency**

While Qwen-Image-Edit-2509 already improved consistency for single-subject editing, Qwen-Image-Edit-2511 further enhances consistency in multi-person group photos, enabling high-fidelity fusion of two separate person images into a coherent group shot:

![](https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen-Image/edit2511/幻灯片5.JPG#center)
![](https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen-Image/edit2511/幻灯片6.JPG#center)

**Built-in Support for Community-Created LoRAs**

Since Qwen-Image-Edit's release, the community has developed many creative, high-quality LoRAs, greatly expanding the model's expressive potential. Qwen-Image-Edit-2511 integrates selected popular LoRAs directly into the base model, unlocking their effects without extra tuning.

For example, with the Lighting Enhancement LoRA built in, realistic lighting control is now achievable out of the box:

![](https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen-Image/edit2511/幻灯片7.JPG#center)

![](https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen-Image/edit2511/幻灯片8.JPG#center)

As another example, generating new viewpoints can now be done directly with the base model:

![](https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen-Image/edit2511/幻灯片9.JPG#center)

![](https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen-Image/edit2511/幻灯片10.JPG#center)

**Industrial Design Applications**

We've paid special attention to practical engineering scenarios, for instance batch industrial product design:

![](https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen-Image/edit2511/幻灯片11.JPG#center)

![](https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen-Image/edit2511/幻灯片12.JPG#center)

…and material replacement for industrial components:

![](https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen-Image/edit2511/幻灯片13.JPG#center)

![](https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen-Image/edit2511/幻灯片14.JPG#center)

**Enhanced Geometric Reasoning**

Qwen-Image-Edit-2511 introduces stronger geometric reasoning, e.g. directly generating auxiliary construction lines for design or annotation purposes:

![](https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen-Image/edit2511/幻灯片15.JPG#center)

![](https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen-Image/edit2511/幻灯片16.JPG#center)

That wraps up the major updates in Qwen-Image-Edit-2511. Enjoy exploring the new capabilities! 🎉

## License Agreement

Qwen-Image is licensed under Apache 2.0.

## Citation

We kindly encourage citation of our work if you find it useful.

```bibtex
@misc{wu2025qwenimagetechnicalreport,
      title={Qwen-Image Technical Report},
      author={Chenfei Wu and Jiahao Li and Jingren Zhou and Junyang Lin and Kaiyuan Gao and Kun Yan and Sheng-ming Yin and Shuai Bai and Xiao Xu and Yilei Chen and Yuxiang Chen and Zecheng Tang and Zekai Zhang and Zhengyi Wang and An Yang and Bowen Yu and Chen Cheng and Dayiheng Liu and Deqing Li and Hang Zhang and Hao Meng and Hu Wei and Jingyuan Ni and Kai Chen and Kuan Cao and Liang Peng and Lin Qu and Minggang Wu and Peng Wang and Shuting Yu and Tingkun Wen and Wensen Feng and Xiaoxiao Xu and Yi Wang and Yichang Zhang and Yongqiang Zhu and Yujia Wu and Yuxuan Cai and Zenan Liu},
      year={2025},
      eprint={2508.02324},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2508.02324},
}
```
model_index.json ADDED
@@ -0,0 +1,29 @@
{
  "_class_name": "QwenImageEditPlusPipeline",
  "_diffusers_version": "0.36.0.dev0",
  "_name_or_path": "./tmp_model",
  "processor": [
    "transformers",
    "Qwen2VLProcessor"
  ],
  "scheduler": [
    "diffusers",
    "FlowMatchEulerDiscreteScheduler"
  ],
  "text_encoder": [
    "transformers",
    "Qwen2_5_VLForConditionalGeneration"
  ],
  "tokenizer": [
    "transformers",
    "Qwen2Tokenizer"
  ],
  "transformer": [
    "diffusers",
    "QwenImageTransformer2DModel"
  ],
  "vae": [
    "diffusers",
    "AutoencoderKLQwenImage"
  ]
}
processor/added_tokens.json ADDED
@@ -0,0 +1,24 @@
{
  "</tool_call>": 151658,
  "<tool_call>": 151657,
  "<|box_end|>": 151649,
  "<|box_start|>": 151648,
  "<|endoftext|>": 151643,
  "<|file_sep|>": 151664,
  "<|fim_middle|>": 151660,
  "<|fim_pad|>": 151662,
  "<|fim_prefix|>": 151659,
  "<|fim_suffix|>": 151661,
  "<|im_end|>": 151645,
  "<|im_start|>": 151644,
  "<|image_pad|>": 151655,
  "<|object_ref_end|>": 151647,
  "<|object_ref_start|>": 151646,
  "<|quad_end|>": 151651,
  "<|quad_start|>": 151650,
  "<|repo_name|>": 151663,
  "<|video_pad|>": 151656,
  "<|vision_end|>": 151653,
  "<|vision_pad|>": 151654,
  "<|vision_start|>": 151652
}
processor/chat_template.jinja ADDED
@@ -0,0 +1,7 @@
{% set image_count = namespace(value=0) %}{% set video_count = namespace(value=0) %}{% for message in messages %}{% if loop.first and message['role'] != 'system' %}<|im_start|>system
You are a helpful assistant.<|im_end|>
{% endif %}<|im_start|>{{ message['role'] }}
{% if message['content'] is string %}{{ message['content'] }}<|im_end|>
{% else %}{% for content in message['content'] %}{% if content['type'] == 'image' or 'image' in content or 'image_url' in content %}{% set image_count.value = image_count.value + 1 %}{% if add_vision_id %}Picture {{ image_count.value }}: {% endif %}<|vision_start|><|image_pad|><|vision_end|>{% elif content['type'] == 'video' or 'video' in content %}{% set video_count.value = video_count.value + 1 %}{% if add_vision_id %}Video {{ video_count.value }}: {% endif %}<|vision_start|><|video_pad|><|vision_end|>{% elif 'text' in content %}{{ content['text'] }}{% endif %}{% endfor %}<|im_end|>
{% endif %}{% endfor %}{% if add_generation_prompt %}<|im_start|>assistant
{% endif %}
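For illustration, the template above can be approximated in plain Python. This is a hypothetical stdlib-only sketch handling only text and image content (the real rendering goes through the processor's Jinja engine, which also handles video and `add_vision_id`):

```python
def render_messages(messages, add_generation_prompt=True):
    # Minimal re-implementation of processor/chat_template.jinja for
    # text + image content; a sketch, not part of transformers.
    out = []
    for i, m in enumerate(messages):
        if i == 0 and m["role"] != "system":
            # The template injects a default system message when none is given
            out.append("<|im_start|>system\nYou are a helpful assistant.<|im_end|>\n")
        out.append(f"<|im_start|>{m['role']}\n")
        if isinstance(m["content"], str):
            out.append(m["content"] + "<|im_end|>\n")
        else:
            for c in m["content"]:
                if c.get("type") == "image" or "image" in c or "image_url" in c:
                    # Each image becomes a vision span around an image placeholder
                    out.append("<|vision_start|><|image_pad|><|vision_end|>")
                elif "text" in c:
                    out.append(c["text"])
            out.append("<|im_end|>\n")
    if add_generation_prompt:
        out.append("<|im_start|>assistant\n")
    return "".join(out)


rendered = render_messages(
    [{"role": "user",
      "content": [{"type": "image"}, {"type": "text", "text": "Describe this."}]}]
)
```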
processor/merges.txt ADDED
The diff for this file is too large to render. See raw diff
 
processor/preprocessor_config.json ADDED
@@ -0,0 +1,39 @@
{
  "crop_size": null,
  "data_format": "channels_first",
  "default_to_square": true,
  "device": null,
  "disable_grouping": null,
  "do_center_crop": null,
  "do_convert_rgb": true,
  "do_normalize": true,
  "do_pad": null,
  "do_rescale": true,
  "do_resize": true,
  "image_mean": [
    0.48145466,
    0.4578275,
    0.40821073
  ],
  "image_processor_type": "Qwen2VLImageProcessorFast",
  "image_std": [
    0.26862954,
    0.26130258,
    0.27577711
  ],
  "input_data_format": null,
  "max_pixels": 12845056,
  "merge_size": 2,
  "min_pixels": 3136,
  "pad_size": null,
  "patch_size": 14,
  "processor_class": "Qwen2VLProcessor",
  "resample": 3,
  "rescale_factor": 0.00392156862745098,
  "return_tensors": null,
  "size": {
    "longest_edge": 12845056,
    "shortest_edge": 3136
  },
  "temporal_patch_size": 2
}
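The `min_pixels`/`max_pixels`/`patch_size`/`merge_size` values above determine how input images are resized: dimensions are snapped to multiples of `patch_size * merge_size = 28`, and the total pixel count is kept within `[3136, 12845056]`. A rough sketch of this rule, modeled on the `smart_resize` helper used with Qwen2-VL (the authoritative logic lives in the transformers image processor):

```python
import math


def smart_resize(height: int, width: int, factor: int = 28,
                 min_pixels: int = 3136, max_pixels: int = 12845056):
    """Snap (height, width) to multiples of `factor`, keeping the area in bounds."""
    h_bar = max(factor, round(height / factor) * factor)
    w_bar = max(factor, round(width / factor) * factor)
    if h_bar * w_bar > max_pixels:
        # Shrink so the total area fits under max_pixels
        beta = math.sqrt((height * width) / max_pixels)
        h_bar = math.floor(height / beta / factor) * factor
        w_bar = math.floor(width / beta / factor) * factor
    elif h_bar * w_bar < min_pixels:
        # Grow so the total area reaches min_pixels
        beta = math.sqrt(min_pixels / (height * width))
        h_bar = math.ceil(height * beta / factor) * factor
        w_bar = math.ceil(width * beta / factor) * factor
    return h_bar, w_bar
```

For example, a 1000×1000 input snaps to 1008×1008 (36 × 28 per side), which stays comfortably within the pixel budget.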
processor/special_tokens_map.json ADDED
@@ -0,0 +1,31 @@
{
  "additional_special_tokens": [
    "<|im_start|>",
    "<|im_end|>",
    "<|object_ref_start|>",
    "<|object_ref_end|>",
    "<|box_start|>",
    "<|box_end|>",
    "<|quad_start|>",
    "<|quad_end|>",
    "<|vision_start|>",
    "<|vision_end|>",
    "<|vision_pad|>",
    "<|image_pad|>",
    "<|video_pad|>"
  ],
  "eos_token": {
    "content": "<|im_end|>",
    "lstrip": false,
    "normalized": false,
    "rstrip": false,
    "single_word": false
  },
  "pad_token": {
    "content": "<|endoftext|>",
    "lstrip": false,
    "normalized": false,
    "rstrip": false,
    "single_word": false
  }
}
processor/tokenizer.json ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:9c5ae00e602b8860cbd784ba82a8aa14e8feecec692e7076590d014d7b7fdafa
size 11421896
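Pointer files like this follow the git-lfs pointer format: one `key value` pair per line (`version`, `oid`, `size`), while the actual blob lives in LFS storage. A minimal parser sketch (a hypothetical helper, not part of huggingface_hub):

```python
def parse_lfs_pointer(text: str) -> dict:
    """Parse a git-lfs pointer file into a {key: value} dict."""
    fields = {}
    for line in text.strip().splitlines():
        # Each line is "<key> <value>"; the value may contain no further spaces
        key, _, value = line.partition(" ")
        fields[key] = value
    return fields


pointer = """version https://git-lfs.github.com/spec/v1
oid sha256:9c5ae00e602b8860cbd784ba82a8aa14e8feecec692e7076590d014d7b7fdafa
size 11421896"""
info = parse_lfs_pointer(pointer)
```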
processor/tokenizer_config.json ADDED
@@ -0,0 +1,208 @@
{
  "add_bos_token": false,
  "add_prefix_space": false,
  "added_tokens_decoder": {
    "151643": {
      "content": "<|endoftext|>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "151644": {
      "content": "<|im_start|>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "151645": {
      "content": "<|im_end|>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "151646": {
      "content": "<|object_ref_start|>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "151647": {
      "content": "<|object_ref_end|>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "151648": {
      "content": "<|box_start|>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "151649": {
      "content": "<|box_end|>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "151650": {
      "content": "<|quad_start|>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "151651": {
      "content": "<|quad_end|>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "151652": {
      "content": "<|vision_start|>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "151653": {
      "content": "<|vision_end|>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "151654": {
      "content": "<|vision_pad|>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "151655": {
      "content": "<|image_pad|>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "151656": {
      "content": "<|video_pad|>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "151657": {
      "content": "<tool_call>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": false
    },
    "151658": {
      "content": "</tool_call>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": false
    },
    "151659": {
      "content": "<|fim_prefix|>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": false
    },
    "151660": {
      "content": "<|fim_middle|>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": false
    },
    "151661": {
      "content": "<|fim_suffix|>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": false
    },
    "151662": {
      "content": "<|fim_pad|>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": false
    },
    "151663": {
      "content": "<|repo_name|>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": false
    },
    "151664": {
      "content": "<|file_sep|>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": false
    }
  },
  "additional_special_tokens": [
    "<|im_start|>",
    "<|im_end|>",
    "<|object_ref_start|>",
    "<|object_ref_end|>",
    "<|box_start|>",
    "<|box_end|>",
    "<|quad_start|>",
    "<|quad_end|>",
    "<|vision_start|>",
    "<|vision_end|>",
    "<|vision_pad|>",
    "<|image_pad|>",
    "<|video_pad|>"
  ],
  "bos_token": null,
  "clean_up_tokenization_spaces": false,
  "eos_token": "<|im_end|>",
  "errors": "replace",
  "extra_special_tokens": {},
  "model_max_length": 131072,
  "pad_token": "<|endoftext|>",
  "processor_class": "Qwen2VLProcessor",
  "split_special_tokens": false,
  "tokenizer_class": "Qwen2Tokenizer",
  "unk_token": null
}
processor/video_preprocessor_config.json ADDED
@@ -0,0 +1,43 @@
{
  "crop_size": null,
  "data_format": "channels_first",
  "default_to_square": true,
  "device": null,
  "do_center_crop": null,
  "do_convert_rgb": true,
  "do_normalize": true,
  "do_rescale": true,
  "do_resize": true,
  "do_sample_frames": false,
  "fps": null,
  "image_mean": [
    0.48145466,
    0.4578275,
    0.40821073
  ],
  "image_std": [
    0.26862954,
    0.26130258,
    0.27577711
  ],
  "input_data_format": null,
  "max_frames": 768,
  "max_pixels": 12845056,
  "merge_size": 2,
  "min_frames": 4,
  "min_pixels": 3136,
  "num_frames": null,
  "pad_size": null,
  "patch_size": 14,
  "processor_class": "Qwen2VLProcessor",
  "resample": 3,
  "rescale_factor": 0.00392156862745098,
  "return_metadata": false,
  "size": {
    "longest_edge": 12845056,
    "shortest_edge": 3136
  },
  "temporal_patch_size": 2,
  "video_metadata": null,
  "video_processor_type": "Qwen2VLVideoProcessor"
}
processor/vocab.json ADDED
The diff for this file is too large to render. See raw diff
 
quantization_info.json ADDED
@@ -0,0 +1,6 @@
{
  "quantization_method": "mixed_precision_nf4",
  "description": "First and last transformer blocks kept at bfloat16, middle layers quantized to NF4",
  "high_precision_layers_count": 30,
  "note": "Based on city96/Qwen-Image-gguf approach for better quality"
}
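The count of 30 high-precision layers can be accounted for: 14 linear sub-modules in each of the first (index 0) and last (index 59) transformer blocks, plus `norm_out.linear` and `proj_out`. A sketch reconstructing that list from the rule (the authoritative list is `llm_int8_skip_modules` in `text_encoder/config.json`; this helper is illustrative only):

```python
def high_precision_modules(block_indices=(0, 59)):
    """Build the list of sub-modules kept in bfloat16 for mixed-precision NF4."""
    # Sub-module names mirror the llm_int8_skip_modules entries for one block
    per_block = [
        "img_mod.1", "attn.to_q", "attn.to_k", "attn.to_v",
        "attn.add_k_proj", "attn.add_v_proj", "attn.add_q_proj",
        "attn.to_out.0", "attn.to_add_out",
        "img_mlp.net.0.proj", "img_mlp.net.2",
        "txt_mod.1", "txt_mlp.net.0.proj", "txt_mlp.net.2",
    ]
    names = [f"transformer_blocks.{i}.{m}" for i in block_indices for m in per_block]
    # The final norm and output projection also stay in high precision
    return names + ["norm_out.linear", "proj_out"]
```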
scheduler/scheduler_config.json ADDED
@@ -0,0 +1,18 @@
{
  "_class_name": "FlowMatchEulerDiscreteScheduler",
  "_diffusers_version": "0.36.0.dev0",
  "base_image_seq_len": 256,
  "base_shift": 0.5,
  "invert_sigmas": false,
  "max_image_seq_len": 8192,
  "max_shift": 0.9,
  "num_train_timesteps": 1000,
  "shift": 1.0,
  "shift_terminal": 0.02,
  "stochastic_sampling": false,
  "time_shift_type": "exponential",
  "use_beta_sigmas": false,
  "use_dynamic_shifting": true,
  "use_exponential_sigmas": false,
  "use_karras_sigmas": false
}
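With `use_dynamic_shifting` enabled, flow-matching pipelines in diffusers typically derive the timestep shift by linearly interpolating between `base_shift` and `max_shift` as the image token count grows from `base_image_seq_len` to `max_image_seq_len`. A sketch using the values from this config (the exact helper in diffusers may differ in detail):

```python
def calculate_shift(image_seq_len: int,
                    base_seq_len: int = 256, max_seq_len: int = 8192,
                    base_shift: float = 0.5, max_shift: float = 0.9) -> float:
    """Linearly interpolate the timestep shift from the image sequence length."""
    # Slope and intercept of the line through (base_seq_len, base_shift)
    # and (max_seq_len, max_shift)
    m = (max_shift - base_shift) / (max_seq_len - base_seq_len)
    b = base_shift - m * base_seq_len
    return image_seq_len * m + b
```

Larger images thus get a larger shift, concentrating denoising steps where higher-resolution latents need them.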
text_encoder/config.json ADDED
@@ -0,0 +1,178 @@
{
  "architectures": [
    "Qwen2_5_VLForConditionalGeneration"
  ],
  "attention_dropout": 0.0,
  "bos_token_id": 151643,
  "dtype": "bfloat16",
  "eos_token_id": 151645,
  "hidden_act": "silu",
  "hidden_size": 3584,
  "image_token_id": 151655,
  "initializer_range": 0.02,
  "intermediate_size": 18944,
  "max_position_embeddings": 128000,
  "max_window_layers": 28,
  "model_type": "qwen2_5_vl",
  "num_attention_heads": 28,
  "num_hidden_layers": 28,
  "num_key_value_heads": 4,
  "quantization_config": {
    "_load_in_4bit": true,
    "_load_in_8bit": false,
    "bnb_4bit_compute_dtype": "bfloat16",
    "bnb_4bit_quant_storage": "uint8",
    "bnb_4bit_quant_type": "nf4",
    "bnb_4bit_use_double_quant": true,
    "llm_int8_enable_fp32_cpu_offload": false,
    "llm_int8_has_fp16_weight": false,
    "llm_int8_skip_modules": [
      "transformer_blocks.0.img_mod.1",
      "transformer_blocks.0.attn.to_q",
      "transformer_blocks.0.attn.to_k",
      "transformer_blocks.0.attn.to_v",
      "transformer_blocks.0.attn.add_k_proj",
      "transformer_blocks.0.attn.add_v_proj",
      "transformer_blocks.0.attn.add_q_proj",
      "transformer_blocks.0.attn.to_out.0",
      "transformer_blocks.0.attn.to_add_out",
      "transformer_blocks.0.img_mlp.net.0.proj",
      "transformer_blocks.0.img_mlp.net.2",
      "transformer_blocks.0.txt_mod.1",
      "transformer_blocks.0.txt_mlp.net.0.proj",
      "transformer_blocks.0.txt_mlp.net.2",
      "transformer_blocks.59.img_mod.1",
      "transformer_blocks.59.attn.to_q",
      "transformer_blocks.59.attn.to_k",
      "transformer_blocks.59.attn.to_v",
      "transformer_blocks.59.attn.add_k_proj",
      "transformer_blocks.59.attn.add_v_proj",
      "transformer_blocks.59.attn.add_q_proj",
      "transformer_blocks.59.attn.to_out.0",
      "transformer_blocks.59.attn.to_add_out",
      "transformer_blocks.59.img_mlp.net.0.proj",
      "transformer_blocks.59.img_mlp.net.2",
      "transformer_blocks.59.txt_mod.1",
      "transformer_blocks.59.txt_mlp.net.0.proj",
      "transformer_blocks.59.txt_mlp.net.2",
      "norm_out.linear",
      "proj_out"
    ],
    "llm_int8_threshold": 6.0,
    "load_in_4bit": true,
    "load_in_8bit": false,
    "quant_method": "bitsandbytes"
  },
  "rms_norm_eps": 1e-06,
  "rope_scaling": {
    "mrope_section": [
      16,
      24,
      24
    ],
    "rope_type": "default",
    "type": "default"
  },
  "rope_theta": 1000000.0,
  "sliding_window": 32768,
  "text_config": {
    "_name_or_path": "./tmp_model/text_encoder",
    "architectures": [
      "Qwen2_5_VLForConditionalGeneration"
    ],
    "attention_dropout": 0.0,
    "bos_token_id": 151643,
    "dtype": "bfloat16",
    "eos_token_id": 151645,
    "hidden_act": "silu",
    "hidden_size": 3584,
    "initializer_range": 0.02,
    "intermediate_size": 18944,
    "layer_types": [
      "full_attention",
      "full_attention",
      "full_attention",
      "full_attention",
      "full_attention",
      "full_attention",
      "full_attention",
      "full_attention",
      "full_attention",
      "full_attention",
      "full_attention",
      "full_attention",
      "full_attention",
      "full_attention",
      "full_attention",
      "full_attention",
      "full_attention",
      "full_attention",
      "full_attention",
      "full_attention",
      "full_attention",
      "full_attention",
      "full_attention",
      "full_attention",
      "full_attention",
      "full_attention",
      "full_attention",
      "full_attention"
    ],
    "max_position_embeddings": 128000,
    "max_window_layers": 28,
    "model_type": "qwen2_5_vl_text",
    "num_attention_heads": 28,
    "num_hidden_layers": 28,
    "num_key_value_heads": 4,
    "rms_norm_eps": 1e-06,
    "rope_scaling": {
      "mrope_section": [
        16,
        24,
        24
      ],
      "rope_type": "default",
      "type": "default"
    },
    "rope_theta": 1000000.0,
    "sliding_window": null,
    "use_cache": true,
    "use_sliding_window": false,
    "vision_token_id": 151654,
    "vocab_size": 152064
  },
  "tie_word_embeddings": false,
  "transformers_version": "4.57.1",
  "use_cache": true,
  "use_sliding_window": false,
  "video_token_id": 151656,
  "vision_config": {
    "depth": 32,
    "dtype": "bfloat16",
    "fullatt_block_indexes": [
      7,
      15,
      23,
      31
    ],
    "hidden_act": "silu",
    "hidden_size": 1280,
    "in_channels": 3,
    "in_chans": 3,
    "initializer_range": 0.02,
    "intermediate_size": 3420,
    "model_type": "qwen2_5_vl",
    "num_heads": 16,
    "out_hidden_size": 3584,
    "patch_size": 14,
    "spatial_merge_size": 2,
    "spatial_patch_size": 14,
    "temporal_patch_size": 2,
    "tokens_per_second": 2,
    "window_size": 112
  },
  "vision_end_token_id": 151653,
  "vision_start_token_id": 151652,
  "vision_token_id": 151654,
  "vocab_size": 152064
}
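A quick consistency check on the rotary settings in this config: with `hidden_size` 3584 and 28 attention heads, each head is 128-dimensional, and `mrope_section` [16, 24, 24] partitions the 64 rotary frequency pairs across the three position axes of Qwen2.5-VL's multimodal RoPE (the temporal/height/width reading of the three groups follows the Qwen2-VL design):

```python
hidden_size = 3584
num_attention_heads = 28
mrope_section = [16, 24, 24]  # three frequency groups, one per position axis

head_dim = hidden_size // num_attention_heads
rotary_pairs = head_dim // 2  # each pair of channels shares one rotary frequency

# The three M-RoPE sections must exactly cover the rotary half of a head
assert head_dim == 128
assert sum(mrope_section) == rotary_pairs  # 16 + 24 + 24 == 64
```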
text_encoder/generation_config.json ADDED
@@ -0,0 +1,14 @@
{
  "bos_token_id": 151643,
  "do_sample": true,
  "eos_token_id": [
    151645,
    151643
  ],
  "pad_token_id": 151643,
  "repetition_penalty": 1.05,
  "temperature": 0.1,
  "top_k": 1,
  "top_p": 0.001,
  "transformers_version": "4.57.1"
}
text_encoder/model-00001-of-00002.safetensors ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:1905aae08b7a148c6f2a69939fe196c314f7a75652934e7b85fe222976a28663
size 4809612598
text_encoder/model-00002-of-00002.safetensors ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:2b727105e5dbd38583aee6a600cad89d2c911ed6592b728702fd4bb60f108704
size 281149198
text_encoder/model.safetensors.index.json ADDED
The diff for this file is too large to render. See raw diff
 
tokenizer/added_tokens.json ADDED
@@ -0,0 +1,24 @@
{
  "</tool_call>": 151658,
  "<tool_call>": 151657,
  "<|box_end|>": 151649,
  "<|box_start|>": 151648,
  "<|endoftext|>": 151643,
  "<|file_sep|>": 151664,
  "<|fim_middle|>": 151660,
  "<|fim_pad|>": 151662,
  "<|fim_prefix|>": 151659,
  "<|fim_suffix|>": 151661,
  "<|im_end|>": 151645,
  "<|im_start|>": 151644,
  "<|image_pad|>": 151655,
  "<|object_ref_end|>": 151647,
  "<|object_ref_start|>": 151646,
  "<|quad_end|>": 151651,
  "<|quad_start|>": 151650,
  "<|repo_name|>": 151663,
  "<|video_pad|>": 151656,
  "<|vision_end|>": 151653,
  "<|vision_pad|>": 151654,
  "<|vision_start|>": 151652
}
tokenizer/chat_template.jinja ADDED
@@ -0,0 +1,54 @@
{%- if tools %}
    {{- '<|im_start|>system\n' }}
    {%- if messages[0]['role'] == 'system' %}
        {{- messages[0]['content'] }}
    {%- else %}
        {{- 'You are a helpful assistant.' }}
    {%- endif %}
    {{- "\n\n# Tools\n\nYou may call one or more functions to assist with the user query.\n\nYou are provided with function signatures within <tools></tools> XML tags:\n<tools>" }}
    {%- for tool in tools %}
        {{- "\n" }}
        {{- tool | tojson }}
    {%- endfor %}
    {{- "\n</tools>\n\nFor each function call, return a json object with function name and arguments within <tool_call></tool_call> XML tags:\n<tool_call>\n{\"name\": <function-name>, \"arguments\": <args-json-object>}\n</tool_call><|im_end|>\n" }}
{%- else %}
    {%- if messages[0]['role'] == 'system' %}
        {{- '<|im_start|>system\n' + messages[0]['content'] + '<|im_end|>\n' }}
    {%- else %}
        {{- '<|im_start|>system\nYou are a helpful assistant.<|im_end|>\n' }}
    {%- endif %}
{%- endif %}
{%- for message in messages %}
    {%- if (message.role == "user") or (message.role == "system" and not loop.first) or (message.role == "assistant" and not message.tool_calls) %}
        {{- '<|im_start|>' + message.role + '\n' + message.content + '<|im_end|>' + '\n' }}
    {%- elif message.role == "assistant" %}
        {{- '<|im_start|>' + message.role }}
        {%- if message.content %}
            {{- '\n' + message.content }}
        {%- endif %}
        {%- for tool_call in message.tool_calls %}
            {%- if tool_call.function is defined %}
                {%- set tool_call = tool_call.function %}
            {%- endif %}
            {{- '\n<tool_call>\n{"name": "' }}
            {{- tool_call.name }}
            {{- '", "arguments": ' }}
            {{- tool_call.arguments | tojson }}
            {{- '}\n</tool_call>' }}
        {%- endfor %}
        {{- '<|im_end|>\n' }}
    {%- elif message.role == "tool" %}
        {%- if (loop.index0 == 0) or (messages[loop.index0 - 1].role != "tool") %}
            {{- '<|im_start|>user' }}
        {%- endif %}
        {{- '\n<tool_response>\n' }}
        {{- message.content }}
        {{- '\n</tool_response>' }}
        {%- if loop.last or (messages[loop.index0 + 1].role != "tool") %}
            {{- '<|im_end|>\n' }}
        {%- endif %}
    {%- endif %}
{%- endfor %}
{%- if add_generation_prompt %}
    {{- '<|im_start|>assistant\n' }}
{%- endif %}
The diff for this file is too large to render. See raw diff
 
tokenizer/special_tokens_map.json ADDED
@@ -0,0 +1,31 @@
{
  "additional_special_tokens": [
    "<|im_start|>",
    "<|im_end|>",
    "<|object_ref_start|>",
    "<|object_ref_end|>",
    "<|box_start|>",
    "<|box_end|>",
    "<|quad_start|>",
    "<|quad_end|>",
    "<|vision_start|>",
    "<|vision_end|>",
    "<|vision_pad|>",
    "<|image_pad|>",
    "<|video_pad|>"
  ],
  "eos_token": {
    "content": "<|im_end|>",
    "lstrip": false,
    "normalized": false,
    "rstrip": false,
    "single_word": false
  },
  "pad_token": {
    "content": "<|endoftext|>",
    "lstrip": false,
    "normalized": false,
    "rstrip": false,
    "single_word": false
  }
}
tokenizer/tokenizer_config.json ADDED
@@ -0,0 +1,207 @@
+ {
+ "add_bos_token": false,
+ "add_prefix_space": false,
+ "added_tokens_decoder": {
+ "151643": {
+ "content": "<|endoftext|>",
+ "lstrip": false,
+ "normalized": false,
+ "rstrip": false,
+ "single_word": false,
+ "special": true
+ },
+ "151644": {
+ "content": "<|im_start|>",
+ "lstrip": false,
+ "normalized": false,
+ "rstrip": false,
+ "single_word": false,
+ "special": true
+ },
+ "151645": {
+ "content": "<|im_end|>",
+ "lstrip": false,
+ "normalized": false,
+ "rstrip": false,
+ "single_word": false,
+ "special": true
+ },
+ "151646": {
+ "content": "<|object_ref_start|>",
+ "lstrip": false,
+ "normalized": false,
+ "rstrip": false,
+ "single_word": false,
+ "special": true
+ },
+ "151647": {
+ "content": "<|object_ref_end|>",
+ "lstrip": false,
+ "normalized": false,
+ "rstrip": false,
+ "single_word": false,
+ "special": true
+ },
+ "151648": {
+ "content": "<|box_start|>",
+ "lstrip": false,
+ "normalized": false,
+ "rstrip": false,
+ "single_word": false,
+ "special": true
+ },
+ "151649": {
+ "content": "<|box_end|>",
+ "lstrip": false,
+ "normalized": false,
+ "rstrip": false,
+ "single_word": false,
+ "special": true
+ },
+ "151650": {
+ "content": "<|quad_start|>",
+ "lstrip": false,
+ "normalized": false,
+ "rstrip": false,
+ "single_word": false,
+ "special": true
+ },
+ "151651": {
+ "content": "<|quad_end|>",
+ "lstrip": false,
+ "normalized": false,
+ "rstrip": false,
+ "single_word": false,
+ "special": true
+ },
+ "151652": {
+ "content": "<|vision_start|>",
+ "lstrip": false,
+ "normalized": false,
+ "rstrip": false,
+ "single_word": false,
+ "special": true
+ },
+ "151653": {
+ "content": "<|vision_end|>",
+ "lstrip": false,
+ "normalized": false,
+ "rstrip": false,
+ "single_word": false,
+ "special": true
+ },
+ "151654": {
+ "content": "<|vision_pad|>",
+ "lstrip": false,
+ "normalized": false,
+ "rstrip": false,
+ "single_word": false,
+ "special": true
+ },
+ "151655": {
+ "content": "<|image_pad|>",
+ "lstrip": false,
+ "normalized": false,
+ "rstrip": false,
+ "single_word": false,
+ "special": true
+ },
+ "151656": {
+ "content": "<|video_pad|>",
+ "lstrip": false,
+ "normalized": false,
+ "rstrip": false,
+ "single_word": false,
+ "special": true
+ },
+ "151657": {
+ "content": "<tool_call>",
+ "lstrip": false,
+ "normalized": false,
+ "rstrip": false,
+ "single_word": false,
+ "special": false
+ },
+ "151658": {
+ "content": "</tool_call>",
+ "lstrip": false,
+ "normalized": false,
+ "rstrip": false,
+ "single_word": false,
+ "special": false
+ },
+ "151659": {
+ "content": "<|fim_prefix|>",
+ "lstrip": false,
+ "normalized": false,
+ "rstrip": false,
+ "single_word": false,
+ "special": false
+ },
+ "151660": {
+ "content": "<|fim_middle|>",
+ "lstrip": false,
+ "normalized": false,
+ "rstrip": false,
+ "single_word": false,
+ "special": false
+ },
+ "151661": {
+ "content": "<|fim_suffix|>",
+ "lstrip": false,
+ "normalized": false,
+ "rstrip": false,
+ "single_word": false,
+ "special": false
+ },
+ "151662": {
+ "content": "<|fim_pad|>",
+ "lstrip": false,
+ "normalized": false,
+ "rstrip": false,
+ "single_word": false,
+ "special": false
+ },
+ "151663": {
+ "content": "<|repo_name|>",
+ "lstrip": false,
+ "normalized": false,
+ "rstrip": false,
+ "single_word": false,
+ "special": false
+ },
+ "151664": {
+ "content": "<|file_sep|>",
+ "lstrip": false,
+ "normalized": false,
+ "rstrip": false,
+ "single_word": false,
+ "special": false
+ }
+ },
+ "additional_special_tokens": [
+ "<|im_start|>",
+ "<|im_end|>",
+ "<|object_ref_start|>",
+ "<|object_ref_end|>",
+ "<|box_start|>",
+ "<|box_end|>",
+ "<|quad_start|>",
+ "<|quad_end|>",
+ "<|vision_start|>",
+ "<|vision_end|>",
+ "<|vision_pad|>",
+ "<|image_pad|>",
+ "<|video_pad|>"
+ ],
+ "bos_token": null,
+ "clean_up_tokenization_spaces": false,
+ "eos_token": "<|im_end|>",
+ "errors": "replace",
+ "extra_special_tokens": {},
+ "model_max_length": 131072,
+ "pad_token": "<|endoftext|>",
+ "split_special_tokens": false,
+ "tokenizer_class": "Qwen2Tokenizer",
+ "unk_token": null
+ }
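The `added_tokens_decoder` above appends the special tokens after the 151,643 base BPE entries, so their ids start at 151643. A minimal lookup sketch over a subset of that table (`decode_added` is a hypothetical helper for illustration, not a tokenizer API):

```python
# Subset of the added_tokens_decoder table above: id -> literal token string.
ADDED_TOKENS = {
    151643: "<|endoftext|>",
    151644: "<|im_start|>",
    151645: "<|im_end|>",
    151657: "<tool_call>",
    151658: "</tool_call>",
}

def decode_added(token_id):
    """Return the literal string for an added token id, or None if the id
    falls in the base BPE vocabulary (or is unknown to this subset)."""
    return ADDED_TOKENS.get(token_id)
```

Note that the ChatML markers (`<|im_start|>`, `<|im_end|>`) are flagged `"special": true`, while `<tool_call>` and the FIM tokens are not, so the latter survive `skip_special_tokens=True` decoding.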
tokenizer/vocab.json ADDED
The diff for this file is too large to render. See raw diff
 
transformer/config.json ADDED
@@ -0,0 +1,65 @@
+ {
+ "_class_name": "QwenImageTransformer2DModel",
+ "_diffusers_version": "0.36.0.dev0",
+ "_name_or_path": "./tmp_model/transformer",
+ "attention_head_dim": 128,
+ "axes_dims_rope": [
+ 16,
+ 56,
+ 56
+ ],
+ "guidance_embeds": false,
+ "in_channels": 64,
+ "joint_attention_dim": 3584,
+ "num_attention_heads": 24,
+ "num_layers": 60,
+ "out_channels": 16,
+ "patch_size": 2,
+ "quantization_config": {
+ "_load_in_4bit": true,
+ "_load_in_8bit": false,
+ "bnb_4bit_compute_dtype": "bfloat16",
+ "bnb_4bit_quant_storage": "uint8",
+ "bnb_4bit_quant_type": "nf4",
+ "bnb_4bit_use_double_quant": true,
+ "llm_int8_enable_fp32_cpu_offload": false,
+ "llm_int8_has_fp16_weight": false,
+ "llm_int8_skip_modules": [
+ "transformer_blocks.0.img_mod.1",
+ "transformer_blocks.0.attn.to_q",
+ "transformer_blocks.0.attn.to_k",
+ "transformer_blocks.0.attn.to_v",
+ "transformer_blocks.0.attn.add_k_proj",
+ "transformer_blocks.0.attn.add_v_proj",
+ "transformer_blocks.0.attn.add_q_proj",
+ "transformer_blocks.0.attn.to_out.0",
+ "transformer_blocks.0.attn.to_add_out",
+ "transformer_blocks.0.img_mlp.net.0.proj",
+ "transformer_blocks.0.img_mlp.net.2",
+ "transformer_blocks.0.txt_mod.1",
+ "transformer_blocks.0.txt_mlp.net.0.proj",
+ "transformer_blocks.0.txt_mlp.net.2",
+ "transformer_blocks.59.img_mod.1",
+ "transformer_blocks.59.attn.to_q",
+ "transformer_blocks.59.attn.to_k",
+ "transformer_blocks.59.attn.to_v",
+ "transformer_blocks.59.attn.add_k_proj",
+ "transformer_blocks.59.attn.add_v_proj",
+ "transformer_blocks.59.attn.add_q_proj",
+ "transformer_blocks.59.attn.to_out.0",
+ "transformer_blocks.59.attn.to_add_out",
+ "transformer_blocks.59.img_mlp.net.0.proj",
+ "transformer_blocks.59.img_mlp.net.2",
+ "transformer_blocks.59.txt_mod.1",
+ "transformer_blocks.59.txt_mlp.net.0.proj",
+ "transformer_blocks.59.txt_mlp.net.2",
+ "norm_out.linear",
+ "proj_out"
+ ],
+ "llm_int8_threshold": 6.0,
+ "load_in_4bit": true,
+ "load_in_8bit": false,
+ "quant_method": "bitsandbytes"
+ },
+ "zero_cond_t": true
+ }
transformer/diffusion_pytorch_model-00001-of-00002.safetensors ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:10ba6c1dab151ed3774f04d134c91ae53cd13feaa9ca7f0097d52c4fa8b43941
+ size 9990996243
transformer/diffusion_pytorch_model-00002-of-00002.safetensors ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:9d49861efe7e99e0de84f452e59d4b71fcb3b23a2a0603c32af07bd9244c294d
+ size 1595201038
transformer/diffusion_pytorch_model.safetensors.index.json ADDED
The diff for this file is too large to render. See raw diff
 
vae/config.json ADDED
@@ -0,0 +1,103 @@
+ {
+ "_class_name": "AutoencoderKLQwenImage",
+ "_diffusers_version": "0.36.0.dev0",
+ "_name_or_path": "./tmp_model/vae",
+ "attn_scales": [],
+ "base_dim": 96,
+ "dim_mult": [
+ 1,
+ 2,
+ 4,
+ 4
+ ],
+ "dropout": 0.0,
+ "latents_mean": [
+ -0.7571,
+ -0.7089,
+ -0.9113,
+ 0.1075,
+ -0.1745,
+ 0.9653,
+ -0.1517,
+ 1.5508,
+ 0.4134,
+ -0.0715,
+ 0.5517,
+ -0.3632,
+ -0.1922,
+ -0.9497,
+ 0.2503,
+ -0.2921
+ ],
+ "latents_std": [
+ 2.8184,
+ 1.4541,
+ 2.3275,
+ 2.6558,
+ 1.2196,
+ 1.7708,
+ 2.6052,
+ 2.0743,
+ 3.2687,
+ 2.1526,
+ 2.8652,
+ 1.5579,
+ 1.6382,
+ 1.1253,
+ 2.8251,
+ 1.916
+ ],
+ "num_res_blocks": 2,
+ "quantization_config": {
+ "_load_in_4bit": true,
+ "_load_in_8bit": false,
+ "bnb_4bit_compute_dtype": "bfloat16",
+ "bnb_4bit_quant_storage": "uint8",
+ "bnb_4bit_quant_type": "nf4",
+ "bnb_4bit_use_double_quant": true,
+ "llm_int8_enable_fp32_cpu_offload": false,
+ "llm_int8_has_fp16_weight": false,
+ "llm_int8_skip_modules": [
+ "transformer_blocks.0.img_mod.1",
+ "transformer_blocks.0.attn.to_q",
+ "transformer_blocks.0.attn.to_k",
+ "transformer_blocks.0.attn.to_v",
+ "transformer_blocks.0.attn.add_k_proj",
+ "transformer_blocks.0.attn.add_v_proj",
+ "transformer_blocks.0.attn.add_q_proj",
+ "transformer_blocks.0.attn.to_out.0",
+ "transformer_blocks.0.attn.to_add_out",
+ "transformer_blocks.0.img_mlp.net.0.proj",
+ "transformer_blocks.0.img_mlp.net.2",
+ "transformer_blocks.0.txt_mod.1",
+ "transformer_blocks.0.txt_mlp.net.0.proj",
+ "transformer_blocks.0.txt_mlp.net.2",
+ "transformer_blocks.59.img_mod.1",
+ "transformer_blocks.59.attn.to_q",
+ "transformer_blocks.59.attn.to_k",
+ "transformer_blocks.59.attn.to_v",
+ "transformer_blocks.59.attn.add_k_proj",
+ "transformer_blocks.59.attn.add_v_proj",
+ "transformer_blocks.59.attn.add_q_proj",
+ "transformer_blocks.59.attn.to_out.0",
+ "transformer_blocks.59.attn.to_add_out",
+ "transformer_blocks.59.img_mlp.net.0.proj",
+ "transformer_blocks.59.img_mlp.net.2",
+ "transformer_blocks.59.txt_mod.1",
+ "transformer_blocks.59.txt_mlp.net.0.proj",
+ "transformer_blocks.59.txt_mlp.net.2",
+ "norm_out.linear",
+ "proj_out"
+ ],
+ "llm_int8_threshold": 6.0,
+ "load_in_4bit": true,
+ "load_in_8bit": false,
+ "quant_method": "bitsandbytes"
+ },
+ "temperal_downsample": [
+ false,
+ true,
+ true
+ ],
+ "z_dim": 16
+ }
vae/diffusion_pytorch_model.safetensors ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:0c8bc8b758c649abef9ea407b95408389a3b2f610d0d10fcb054fe171d0a8344
+ size 253806966