# Qwen-Image-i2L (Image to LoRA)

## Model Introduction

The i2L (Image to LoRA) model is built on a crazy idea: it takes an image as input and outputs a LoRA model trained on that image.

We are open-sourcing four models:

* **Qwen-Image-i2L-Style**
  * **Introduction:** This is the first of our models that we consider successfully trained. Its detail preservation is very weak, but this actually allows it to extract style information from images effectively, so it can be used for style transfer.
  * **Image Encoder:** SigLIP2, DINOv3
  * **Parameters:** 2.4B

* **Qwen-Image-i2L-Coarse**
  * **Introduction:** This model is a scaled-up version of Qwen-Image-i2L-Style. The LoRA it produces preserves content information from the image, but the details are not perfect. If this model is used for style transfer, it requires more input images; otherwise it tends to reproduce the content of the input images. We do not recommend using this model alone.
  * **Image Encoder:** SigLIP2, DINOv3, Qwen-VL (Resolution: 224 x 224)
  * **Parameters:** 7.9B

* **Qwen-Image-i2L-Fine**
  * **Introduction:** This model is an incremental update of Qwen-Image-i2L-Coarse and must be used together with it. It raises the Qwen-VL image encoding resolution to 1024 x 1024, thereby capturing more detail.
  * **Image Encoder:** SigLIP2, DINOv3, Qwen-VL (Resolution: 1024 x 1024)
  * **Parameters:** 7.6B

* **Qwen-Image-i2L-Bias**
  * **Introduction:** This model is a static supplementary LoRA. Because the training data distribution of Qwen-Image-i2L-Coarse and Qwen-Image-i2L-Fine differs from that of the Qwen-Image base model, the images generated with their LoRAs are not consistent with Qwen-Image's preferences. Applying this LoRA brings the generated images closer to the Qwen-Image style.
  * **Image Encoder:** None
  * **Parameters:** 30M

**These models still have many limitations, with significant room for improvement in generalization and detail preservation. We are open-sourcing them in the hope of inspiring more innovative research.**

## Showcase

### Style

The Qwen-Image-i2L-Style model can be used to quickly generate style LoRAs by simply inputting a few images with a unified style. Below are our generated results; all random seeds are 0.

#### Style 1: Abstract Vector

Input Images:

|![](./assets/style/2/0.jpg)|![](./assets/style/2/1.jpg)|![](./assets/style/2/2.jpg)|![](./assets/style/2/3.jpg)|![](./assets/style/2/4.jpg)|![](./assets/style/2/5.jpg)|
|-|-|-|-|-|-|

Generated Images:

|a cat|a dog|a girl|
|-|-|-|
|![](./assets/style/2/image_0.jpg)|![](./assets/style/2/image_1.jpg)|![](./assets/style/2/image_2.jpg)|

#### Style 2: Black & White Sketch

Input Images:

|![](./assets/style/3/0.jpg)|![](./assets/style/3/1.jpg)|![](./assets/style/3/2.jpg)|![](./assets/style/3/3.jpg)|
|-|-|-|-|

Generated Images:

|a cat|a dog|a girl|
|-|-|-|
|![](./assets/style/3/image_0.jpg)|![](./assets/style/3/image_1.jpg)|![](./assets/style/3/image_2.jpg)|

#### Style 3: Rough Sketch

Input Images:

|![](./assets/style/1/0.jpg)|![](./assets/style/1/1.jpg)|![](./assets/style/1/2.jpg)|![](./assets/style/1/3.jpg)|![](./assets/style/1/4.jpg)|
|-|-|-|-|-|

Generated Images:

|a cat|a dog|a girl|
|-|-|-|
|![](./assets/style/1/image_0.jpg)|![](./assets/style/1/image_1.jpg)|![](./assets/style/1/image_2.jpg)|

#### Style 4: Blue Flat

Input Images:

|![](./assets/style/4/0.jpg)|![](./assets/style/4/1.jpg)|![](./assets/style/4/2.jpg)|![](./assets/style/4/3.jpg)|
|-|-|-|-|

Generated Images:

|a cat|a dog|a girl|
|-|-|-|
|![](./assets/style/4/image_0.jpg)|![](./assets/style/4/image_1.jpg)|![](./assets/style/4/image_2.jpg)|

### Coarse + Fine + Bias

Combining Qwen-Image-i2L-Coarse, Qwen-Image-i2L-Fine, and Qwen-Image-i2L-Bias generates LoRA weights that preserve the content and details of the input images. These weights can also serve as the initialization for LoRA training to accelerate convergence.

#### LoRA Dataset: Puppy Backpack

Training Data:

|![](assets/lora/1/0.jpg)|![](assets/lora/1/1.jpg)|![](assets/lora/1/2.jpg)|![](assets/lora/1/3.jpg)|![](assets/lora/1/4.jpg)|
|-|-|-|-|-|

Sample Generation During Training:

||Steps: 100|Steps: 200|Steps: 300|Steps: 400|Steps: 500|
|-|-|-|-|-|-|
|Random Init|![](assets/lora/1/image_original_100.jpg)|![](assets/lora/1/image_original_200.jpg)|![](assets/lora/1/image_original_300.jpg)|![](assets/lora/1/image_original_400.jpg)|![](assets/lora/1/image_original_500.jpg)|
|Image to LoRA Init|![](assets/lora/1/image_i2l_100.jpg)|![](assets/lora/1/image_i2l_200.jpg)|![](assets/lora/1/image_i2l_300.jpg)|![](assets/lora/1/image_i2l_400.jpg)|![](assets/lora/1/image_i2l_500.jpg)|

#### LoRA Dataset: Teddy Bear

Training Data:

|![](assets/lora/2/0.jpg)|![](assets/lora/2/1.jpg)|![](assets/lora/2/2.jpg)|![](assets/lora/2/3.jpg)|![](assets/lora/2/4.jpg)|
|-|-|-|-|-|

Sample Generation During Training:

||Steps: 100|Steps: 200|Steps: 300|Steps: 400|Steps: 500|
|-|-|-|-|-|-|
|Random Init|![](assets/lora/2/image_original_100.jpg)|![](assets/lora/2/image_original_200.jpg)|![](assets/lora/2/image_original_300.jpg)|![](assets/lora/2/image_original_400.jpg)|![](assets/lora/2/image_original_500.jpg)|
|Image to LoRA Init|![](assets/lora/2/image_i2l_100.jpg)|![](assets/lora/2/image_i2l_200.jpg)|![](assets/lora/2/image_i2l_300.jpg)|![](assets/lora/2/image_i2l_400.jpg)|![](assets/lora/2/image_i2l_500.jpg)|

#### LoRA Dataset: Blueberries in a Bowl

Training Data:

|![](assets/lora/3/0.jpg)|![](assets/lora/3/1.jpg)|![](assets/lora/3/2.jpg)|![](assets/lora/3/3.jpg)|![](assets/lora/3/4.jpg)|![](assets/lora/3/5.jpg)|
|-|-|-|-|-|-|

Sample Generation During Training:

||Steps: 100|Steps: 200|Steps: 300|Steps: 400|Steps: 500|
|-|-|-|-|-|-|
|Random Init|![](assets/lora/3/image_original_100.jpg)|![](assets/lora/3/image_original_200.jpg)|![](assets/lora/3/image_original_300.jpg)|![](assets/lora/3/image_original_400.jpg)|![](assets/lora/3/image_original_500.jpg)|
|Image to LoRA Init|![](assets/lora/3/image_i2l_100.jpg)|![](assets/lora/3/image_i2l_200.jpg)|![](assets/lora/3/image_i2l_300.jpg)|![](assets/lora/3/image_i2l_400.jpg)|![](assets/lora/3/image_i2l_500.jpg)|

## Inference Code

Install [DiffSynth-Studio](https://github.com/modelscope/DiffSynth-Studio):

```shell
git clone https://github.com/modelscope/DiffSynth-Studio.git
cd DiffSynth-Studio
pip install -e .
```

### Qwen-Image-i2L-Style

```python
from diffsynth.pipelines.qwen_image import (
    QwenImagePipeline, ModelConfig,
    QwenImageUnit_Image2LoRAEncode, QwenImageUnit_Image2LoRADecode
)
from modelscope import snapshot_download
from safetensors.torch import save_file
import torch
from PIL import Image

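# VRAM management config: weights stay offloaded on disk and are moved to the GPU
# in bfloat16 only for preparation and computation.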
vram_config_disk_offload = {
    "offload_dtype": "disk",
    "offload_device": "disk",
    "onload_dtype": "disk",
    "onload_device": "disk",
    "preparing_dtype": torch.bfloat16,
    "preparing_device": "cuda",
    "computation_dtype": torch.bfloat16,
    "computation_device": "cuda",
}

# Load models
pipe = QwenImagePipeline.from_pretrained(
    torch_dtype=torch.bfloat16,
    device="cuda",
    model_configs=[
        ModelConfig(model_id="DiffSynth-Studio/General-Image-Encoders", origin_file_pattern="SigLIP2-G384/model.safetensors", **vram_config_disk_offload),
        ModelConfig(model_id="DiffSynth-Studio/General-Image-Encoders", origin_file_pattern="DINOv3-7B/model.safetensors", **vram_config_disk_offload),
        ModelConfig(model_id="DiffSynth-Studio/Qwen-Image-i2L", origin_file_pattern="Qwen-Image-i2L-Style.safetensors", **vram_config_disk_offload),
    ],
    processor_config=ModelConfig(model_id="Qwen/Qwen-Image-Edit", origin_file_pattern="processor/"),
    vram_limit=torch.cuda.mem_get_info("cuda")[1] / (1024 ** 3) - 0.5,
)

# Load images
snapshot_download(
    model_id="DiffSynth-Studio/Qwen-Image-i2L",
    allow_file_pattern="assets/style/1/*",
    local_dir="data/examples"
)
images = [
    Image.open("data/examples/assets/style/1/0.jpg"),
    Image.open("data/examples/assets/style/1/1.jpg"),
    Image.open("data/examples/assets/style/1/2.jpg"),
    Image.open("data/examples/assets/style/1/3.jpg"),
    Image.open("data/examples/assets/style/1/4.jpg"),
]

# Model inference
with torch.no_grad():
    embs = QwenImageUnit_Image2LoRAEncode().process(pipe, image2lora_images=images)
    lora = QwenImageUnit_Image2LoRADecode().process(pipe, **embs)["lora"]

save_file(lora, "model_style.safetensors")
```
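
The exported `model_style.safetensors` is a plain `safetensors` state dict; the script in the last section shows how to apply it for generation. For a quick sanity check of what was produced, a minimal sketch using only `safetensors` (nothing DiffSynth-specific) is:

```python
# Inspect the exported LoRA: count the tensors and print a few names and shapes.
from safetensors.torch import load_file

lora_state = load_file("model_style.safetensors")
print(f"{len(lora_state)} tensors in the generated LoRA")
for name, tensor in list(lora_state.items())[:5]:
    print(name, tuple(tensor.shape), tensor.dtype)
```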

### Qwen-Image-i2L-Coarse, Qwen-Image-i2L-Fine, Qwen-Image-i2L-Bias

```python
from diffsynth.pipelines.qwen_image import (
    QwenImagePipeline, ModelConfig,
    QwenImageUnit_Image2LoRAEncode, QwenImageUnit_Image2LoRADecode
)
from diffsynth.utils.lora import merge_lora
from diffsynth import load_state_dict
from modelscope import snapshot_download
from safetensors.torch import save_file
import torch
from PIL import Image

vram_config_disk_offload = {
    "offload_dtype": "disk",
    "offload_device": "disk",
    "onload_dtype": "disk",
    "onload_device": "disk",
    "preparing_dtype": torch.bfloat16,
    "preparing_device": "cuda",
    "computation_dtype": torch.bfloat16,
    "computation_device": "cuda",
}

# Load models
pipe = QwenImagePipeline.from_pretrained(
    torch_dtype=torch.bfloat16,
    device="cuda",
    model_configs=[
        ModelConfig(model_id="Qwen/Qwen-Image", origin_file_pattern="text_encoder/model*.safetensors", **vram_config_disk_offload),
        ModelConfig(model_id="DiffSynth-Studio/General-Image-Encoders", origin_file_pattern="SigLIP2-G384/model.safetensors", **vram_config_disk_offload),
        ModelConfig(model_id="DiffSynth-Studio/General-Image-Encoders", origin_file_pattern="DINOv3-7B/model.safetensors", **vram_config_disk_offload),
        ModelConfig(model_id="DiffSynth-Studio/Qwen-Image-i2L", origin_file_pattern="Qwen-Image-i2L-Coarse.safetensors", **vram_config_disk_offload),
        ModelConfig(model_id="DiffSynth-Studio/Qwen-Image-i2L", origin_file_pattern="Qwen-Image-i2L-Fine.safetensors", **vram_config_disk_offload),
    ],
    processor_config=ModelConfig(model_id="Qwen/Qwen-Image-Edit", origin_file_pattern="processor/"),
    vram_limit=torch.cuda.mem_get_info("cuda")[1] / (1024 ** 3) - 0.5,
)

# Load images
snapshot_download(
    model_id="DiffSynth-Studio/Qwen-Image-i2L",
    allow_file_pattern="assets/lora/3/*",
    local_dir="data/examples"
)
images = [
    Image.open("data/examples/assets/lora/3/0.jpg"),
    Image.open("data/examples/assets/lora/3/1.jpg"),
    Image.open("data/examples/assets/lora/3/2.jpg"),
    Image.open("data/examples/assets/lora/3/3.jpg"),
    Image.open("data/examples/assets/lora/3/4.jpg"),
    Image.open("data/examples/assets/lora/3/5.jpg"),
]

# Model inference
with torch.no_grad():
    embs = QwenImageUnit_Image2LoRAEncode().process(pipe, image2lora_images=images)
    lora = QwenImageUnit_Image2LoRADecode().process(pipe, **embs)["lora"]

lora_bias = ModelConfig(model_id="DiffSynth-Studio/Qwen-Image-i2L", origin_file_pattern="Qwen-Image-i2L-Bias.safetensors")
lora_bias.download_if_necessary()
lora_bias = load_state_dict(lora_bias.path, torch_dtype=torch.bfloat16, device="cuda")

lora = merge_lora([lora, lora_bias])

save_file(lora, "model_coarse_fine_bias.safetensors")
```
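
As noted in the Showcase section, these generated weights can also initialize LoRA training to accelerate convergence. DiffSynth-Studio's training scripts are not covered in this README, so the following is only a minimal sketch of the idea in plain PyTorch; the `trainable` dict and the optimizer are placeholders for your own LoRA training setup, and attaching the parameters to the DiT is left to that setup.

```python
# Minimal sketch (plain PyTorch, your own LoRA training loop assumed):
# start training from the generated LoRA instead of a random initialization.
from safetensors.torch import load_file
import torch

init_lora = load_file("model_coarse_fine_bias.safetensors")  # produced by the script above

# Wrap each exported tensor as a trainable parameter, keeping the original key names.
trainable = {name: torch.nn.Parameter(t.clone().to(torch.float32)) for name, t in init_lora.items()}
optimizer = torch.optim.AdamW(trainable.values(), lr=1e-4)

# ... register `trainable` on the corresponding DiT modules and run your usual LoRA
# fine-tuning loop; as the tables above suggest, this initialization converges faster
# than starting from random weights.
```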

### Generate Images Using Generated LoRA

```python
from diffsynth.pipelines.qwen_image import QwenImagePipeline, ModelConfig
import torch

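# VRAM management config: offloaded weights go to disk, onloaded weights are kept
# in bfloat16 on the CPU, and computation runs in bfloat16 on the GPU.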
vram_config = {
    "offload_dtype": "disk",
    "offload_device": "disk",
    "onload_dtype": torch.bfloat16,
    "onload_device": "cpu",
    "preparing_dtype": torch.bfloat16,
    "preparing_device": "cuda",
    "computation_dtype": torch.bfloat16,
    "computation_device": "cuda",
}

pipe = QwenImagePipeline.from_pretrained(
    torch_dtype=torch.bfloat16,
    device="cuda",
    model_configs=[
        ModelConfig(model_id="Qwen/Qwen-Image", origin_file_pattern="transformer/diffusion_pytorch_model*.safetensors", **vram_config),
        ModelConfig(model_id="Qwen/Qwen-Image", origin_file_pattern="text_encoder/model*.safetensors", **vram_config),
        ModelConfig(model_id="Qwen/Qwen-Image", origin_file_pattern="vae/diffusion_pytorch_model.safetensors", **vram_config),
    ],
    tokenizer_config=ModelConfig(model_id="Qwen/Qwen-Image", origin_file_pattern="tokenizer/"),
    vram_limit=torch.cuda.mem_get_info("cuda")[1] / (1024 ** 3) - 0.5,
)

pipe.load_lora(pipe.dit, "model_style.safetensors")

image = pipe("a cat", seed=0, height=1024, width=1024, num_inference_steps=50)
image.save("image.jpg")
```
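
To reproduce a showcase-style grid, you can reuse the `pipe` constructed above and loop over the three prompts from the Showcase section; seeds are fixed to 0, matching the settings used for the images in this README:

```python
# Generate the three showcase prompts with the LoRA already loaded into `pipe`.
for i, prompt in enumerate(["a cat", "a dog", "a girl"]):
    image = pipe(prompt, seed=0, height=1024, width=1024, num_inference_steps=50)
    image.save(f"image_{i}.jpg")
```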