---
license: apache-2.0
---
# Qwen-Image-i2L (Image to LoRA)

## Model Introduction

The i2L (Image to LoRA) model is an architecture born from a somewhat crazy idea: it takes an image as input and outputs a LoRA model trained on that image.

We are open-sourcing four models:

*   **Qwen-Image-i2L-Style**
    *   **Introduction:** This is our first model that can be considered successfully trained. Its ability to preserve detail is very weak, but that weakness is precisely what allows it to extract style information from images effectively, which makes it well suited for style transfer.
    *   **Image Encoder:** SigLIP2, DINOv3
    *   **Parameters:** 2.4B

*   **Qwen-Image-i2L-Coarse**
    *   **Introduction:** This model is a scaled-up version of Qwen-Image-i2L-Style. The LoRA it produces can preserve content information from the image, but the details are not perfect. When used for style transfer, it requires more input images; otherwise it tends to reproduce the content of the inputs. We do not recommend using this model on its own.
    *   **Image Encoder:** SigLIP2, DINOv3, Qwen-VL (Resolution: 224 x 224)
    *   **Parameters:** 7.9B

*   **Qwen-Image-i2L-Fine**
    *   **Introduction:** This model is an incremental add-on to Qwen-Image-i2L-Coarse and must be used together with it. It raises the Qwen-VL image encoding resolution to 1024 x 1024, thereby capturing more detailed information.
    *   **Image Encoder:** SigLIP2, DINOv3, Qwen-VL (Resolution: 1024 x 1024)
    *   **Parameters:** 7.6B

*   **Qwen-Image-i2L-Bias**
    *   **Introduction:** This model is a static supplementary LoRA. Because the training data distribution of Qwen-Image-i2L-Coarse and Qwen-Image-i2L-Fine differs from that of the Qwen-Image base model, the images generated by their LoRAs do not fully match Qwen-Image's preferences. Applying this LoRA brings the generated images closer to the native Qwen-Image style.
    *   **Image Encoder:** None
    *   **Parameters:** 30M

**These models still have many limitations, with significant room for improvement in generalization and detail preservation. We are open-sourcing these models to inspire more innovative research.**

## Showcase

### Style

The Qwen-Image-i2L-Style model can quickly generate a style LoRA from just a few input images sharing a consistent style. The results below were all generated with random seed 0.

#### Style 1: Abstract Vector

Input Images:

|![](./assets/style/2/0.jpg)|![](./assets/style/2/1.jpg)|![](./assets/style/2/2.jpg)|![](./assets/style/2/3.jpg)|![](./assets/style/2/4.jpg)|![](./assets/style/2/5.jpg)|
|-|-|-|-|-|-|

Generated Images:

|a cat|a dog|a girl|
|-|-|-|
|![](./assets/style/2/image_0.jpg)|![](./assets/style/2/image_1.jpg)|![](./assets/style/2/image_2.jpg)|

#### Style 2: Black & White Sketch

Input Images:

|![](./assets/style/3/0.jpg)|![](./assets/style/3/1.jpg)|![](./assets/style/3/2.jpg)|![](./assets/style/3/3.jpg)|
|-|-|-|-|

Generated Images:

|a cat|a dog|a girl|
|-|-|-|
|![](./assets/style/3/image_0.jpg)|![](./assets/style/3/image_1.jpg)|![](./assets/style/3/image_2.jpg)|

#### Style 3: Rough Sketch

Input Images:

|![](./assets/style/1/0.jpg)|![](./assets/style/1/1.jpg)|![](./assets/style/1/2.jpg)|![](./assets/style/1/3.jpg)|![](./assets/style/1/4.jpg)|
|-|-|-|-|-|

Generated Images:

|a cat|a dog|a girl|
|-|-|-|
|![](./assets/style/1/image_0.jpg)|![](./assets/style/1/image_1.jpg)|![](./assets/style/1/image_2.jpg)|

#### Style 4: Blue Flat

Input Images:

|![](./assets/style/4/0.jpg)|![](./assets/style/4/1.jpg)|![](./assets/style/4/2.jpg)|![](./assets/style/4/3.jpg)|
|-|-|-|-|

Generated Images:

|a cat|a dog|a girl|
|-|-|-|
|![](./assets/style/4/image_0.jpg)|![](./assets/style/4/image_1.jpg)|![](./assets/style/4/image_2.jpg)|

### Coarse + Fine + Bias

The combination of Qwen-Image-i2L-Coarse, Qwen-Image-i2L-Fine, and Qwen-Image-i2L-Bias generates LoRA weights that preserve both the content and fine details of the input images. These weights can also serve as an initialization for LoRA training to accelerate convergence, as sketched below.
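
Below is a minimal, hypothetical sketch of what such an initialization could look like. It is not part of DiffSynth-Studio's training API: the `trainable_lora` module, the assumption that its parameter names match the keys of the generated safetensors file, and the file name `model_coarse_fine_bias.safetensors` (produced by the inference code further down) are all placeholders for whatever your own LoRA training setup uses.

```python
import torch
from safetensors.torch import load_file

def init_lora_from_i2l(trainable_lora: torch.nn.Module, i2l_path: str) -> int:
    """Copy i2L-predicted weights into matching trainable LoRA parameters.

    Returns the number of parameters that were initialized. Parameters whose
    names or shapes do not match are left at their original (random) values.
    """
    generated = load_file(i2l_path)  # LoRA produced by the inference code below
    params = dict(trainable_lora.named_parameters())
    copied = 0
    with torch.no_grad():
        for name, tensor in generated.items():
            if name in params and params[name].shape == tensor.shape:
                params[name].copy_(tensor.to(params[name].dtype))
                copied += 1
    return copied

# Hypothetical usage before starting a regular LoRA fine-tuning run:
# init_lora_from_i2l(my_lora_module, "model_coarse_fine_bias.safetensors")
```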

#### LoRA Dataset: Puppy Backpack

Training Data:

|![](assets/lora/1/0.jpg)|![](assets/lora/1/1.jpg)|![](assets/lora/1/2.jpg)|![](assets/lora/1/3.jpg)|![](assets/lora/1/4.jpg)|
|-|-|-|-|-|

Sample Generation During Training:

||Steps: 100|Steps: 200|Steps: 300|Steps: 400|Steps: 500|
|-|-|-|-|-|-|
|Random Init|![](assets/lora/1/image_original_100.jpg)|![](assets/lora/1/image_original_200.jpg)|![](assets/lora/1/image_original_300.jpg)|![](assets/lora/1/image_original_400.jpg)|![](assets/lora/1/image_original_500.jpg)|
|Image to LoRA Init|![](assets/lora/1/image_i2l_100.jpg)|![](assets/lora/1/image_i2l_200.jpg)|![](assets/lora/1/image_i2l_300.jpg)|![](assets/lora/1/image_i2l_400.jpg)|![](assets/lora/1/image_i2l_500.jpg)|

#### LoRA Dataset: Teddy Bear

Training Data:

|![](assets/lora/2/0.jpg)|![](assets/lora/2/1.jpg)|![](assets/lora/2/2.jpg)|![](assets/lora/2/3.jpg)|![](assets/lora/2/4.jpg)|
|-|-|-|-|-|

Sample Generation During Training:

||Steps: 100|Steps: 200|Steps: 300|Steps: 400|Steps: 500|
|-|-|-|-|-|-|
|Random Init|![](assets/lora/2/image_original_100.jpg)|![](assets/lora/2/image_original_200.jpg)|![](assets/lora/2/image_original_300.jpg)|![](assets/lora/2/image_original_400.jpg)|![](assets/lora/2/image_original_500.jpg)|
|Image to LoRA Init|![](assets/lora/2/image_i2l_100.jpg)|![](assets/lora/2/image_i2l_200.jpg)|![](assets/lora/2/image_i2l_300.jpg)|![](assets/lora/2/image_i2l_400.jpg)|![](assets/lora/2/image_i2l_500.jpg)|

#### LoRA Dataset: Blueberries in a Bowl

Training Data:

|![](assets/lora/3/0.jpg)|![](assets/lora/3/1.jpg)|![](assets/lora/3/2.jpg)|![](assets/lora/3/3.jpg)|![](assets/lora/3/4.jpg)|![](assets/lora/3/5.jpg)|
|-|-|-|-|-|-|

Sample Generation During Training:

||Steps: 100|Steps: 200|Steps: 300|Steps: 400|Steps: 500|
|-|-|-|-|-|-|
|Random Init|![](assets/lora/3/image_original_100.jpg)|![](assets/lora/3/image_original_200.jpg)|![](assets/lora/3/image_original_300.jpg)|![](assets/lora/3/image_original_400.jpg)|![](assets/lora/3/image_original_500.jpg)|
|Image to LoRA Init|![](assets/lora/3/image_i2l_100.jpg)|![](assets/lora/3/image_i2l_200.jpg)|![](assets/lora/3/image_i2l_300.jpg)|![](assets/lora/3/image_i2l_400.jpg)|![](assets/lora/3/image_i2l_500.jpg)|

## Inference Code

Install [DiffSynth-Studio](https://github.com/modelscope/DiffSynth-Studio):

```shell
git clone https://github.com/modelscope/DiffSynth-Studio.git
cd DiffSynth-Studio
pip install -e .
```

### Qwen-Image-i2L-Style

```python
from diffsynth.pipelines.qwen_image import (
    QwenImagePipeline, ModelConfig,
    QwenImageUnit_Image2LoRAEncode, QwenImageUnit_Image2LoRADecode
)
from modelscope import snapshot_download
from safetensors.torch import save_file
import torch
from PIL import Image

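# Disk-offload config (assumed semantics): keep weights on disk and load them
# to the GPU in bfloat16 only when a module actually computes.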
vram_config_disk_offload = {
    "offload_dtype": "disk",
    "offload_device": "disk",
    "onload_dtype": "disk",
    "onload_device": "disk",
    "preparing_dtype": torch.bfloat16,
    "preparing_device": "cuda",
    "computation_dtype": torch.bfloat16,
    "computation_device": "cuda",
}

# Load models
pipe = QwenImagePipeline.from_pretrained(
    torch_dtype=torch.bfloat16,
    device="cuda",
    model_configs=[
        ModelConfig(model_id="DiffSynth-Studio/General-Image-Encoders", origin_file_pattern="SigLIP2-G384/model.safetensors", **vram_config_disk_offload),
        ModelConfig(model_id="DiffSynth-Studio/General-Image-Encoders", origin_file_pattern="DINOv3-7B/model.safetensors", **vram_config_disk_offload),
        ModelConfig(model_id="DiffSynth-Studio/Qwen-Image-i2L", origin_file_pattern="Qwen-Image-i2L-Style.safetensors", **vram_config_disk_offload),
    ],
    processor_config=ModelConfig(model_id="Qwen/Qwen-Image-Edit", origin_file_pattern="processor/"),
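    # vram_limit: total GPU memory in GiB minus a 0.5 GiB safety margin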
    vram_limit=torch.cuda.mem_get_info("cuda")[1] / (1024 ** 3) - 0.5,
)

# Load images
snapshot_download(
    model_id="DiffSynth-Studio/Qwen-Image-i2L",
    allow_file_pattern="assets/style/1/*",
    local_dir="data/examples"
)
images = [
    Image.open("data/examples/assets/style/1/0.jpg"),
    Image.open("data/examples/assets/style/1/1.jpg"),
    Image.open("data/examples/assets/style/1/2.jpg"),
    Image.open("data/examples/assets/style/1/3.jpg"),
    Image.open("data/examples/assets/style/1/4.jpg"),
]

# Model inference
with torch.no_grad():
    embs = QwenImageUnit_Image2LoRAEncode().process(pipe, image2lora_images=images)
    lora = QwenImageUnit_Image2LoRADecode().process(pipe, **embs)["lora"]

save_file(lora, "model_style.safetensors")
```

### Qwen-Image-i2L-Coarse, Qwen-Image-i2L-Fine, Qwen-Image-i2L-Bias

```python
from diffsynth.pipelines.qwen_image import (
    QwenImagePipeline, ModelConfig,
    QwenImageUnit_Image2LoRAEncode, QwenImageUnit_Image2LoRADecode
)
from diffsynth.utils.lora import merge_lora
from diffsynth import load_state_dict
from modelscope import snapshot_download
from safetensors.torch import save_file
import torch
from PIL import Image

vram_config_disk_offload = {
    "offload_dtype": "disk",
    "offload_device": "disk",
    "onload_dtype": "disk",
    "onload_device": "disk",
    "preparing_dtype": torch.bfloat16,
    "preparing_device": "cuda",
    "computation_dtype": torch.bfloat16,
    "computation_device": "cuda",
}

# Load models
pipe = QwenImagePipeline.from_pretrained(
    torch_dtype=torch.bfloat16,
    device="cuda",
    model_configs=[
        ModelConfig(model_id="Qwen/Qwen-Image", origin_file_pattern="text_encoder/model*.safetensors", **vram_config_disk_offload),
        ModelConfig(model_id="DiffSynth-Studio/General-Image-Encoders", origin_file_pattern="SigLIP2-G384/model.safetensors", **vram_config_disk_offload),
        ModelConfig(model_id="DiffSynth-Studio/General-Image-Encoders", origin_file_pattern="DINOv3-7B/model.safetensors", **vram_config_disk_offload),
        ModelConfig(model_id="DiffSynth-Studio/Qwen-Image-i2L", origin_file_pattern="Qwen-Image-i2L-Coarse.safetensors", **vram_config_disk_offload),
        ModelConfig(model_id="DiffSynth-Studio/Qwen-Image-i2L", origin_file_pattern="Qwen-Image-i2L-Fine.safetensors", **vram_config_disk_offload),
    ],
    processor_config=ModelConfig(model_id="Qwen/Qwen-Image-Edit", origin_file_pattern="processor/"),
    vram_limit=torch.cuda.mem_get_info("cuda")[1] / (1024 ** 3) - 0.5,
)

# Load images
snapshot_download(
    model_id="DiffSynth-Studio/Qwen-Image-i2L",
    allow_file_pattern="assets/lora/3/*",
    local_dir="data/examples"
)
images = [
    Image.open("data/examples/assets/lora/3/0.jpg"),
    Image.open("data/examples/assets/lora/3/1.jpg"),
    Image.open("data/examples/assets/lora/3/2.jpg"),
    Image.open("data/examples/assets/lora/3/3.jpg"),
    Image.open("data/examples/assets/lora/3/4.jpg"),
    Image.open("data/examples/assets/lora/3/5.jpg"),
]

# Model inference
with torch.no_grad():
    embs = QwenImageUnit_Image2LoRAEncode().process(pipe, image2lora_images=images)
    lora = QwenImageUnit_Image2LoRADecode().process(pipe, **embs)["lora"]
    
    lora_bias = ModelConfig(model_id="DiffSynth-Studio/Qwen-Image-i2L", origin_file_pattern="Qwen-Image-i2L-Bias.safetensors")
    lora_bias.download_if_necessary()
    lora_bias = load_state_dict(lora_bias.path, torch_dtype=torch.bfloat16, device="cuda")
    
    lora = merge_lora([lora, lora_bias])

save_file(lora, "model_coarse_fine_bias.safetensors")
```

### Generate Images Using Generated LoRA

```python
from diffsynth.pipelines.qwen_image import QwenImagePipeline, ModelConfig
import torch

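# VRAM config (assumed semantics): weights offload to disk, are held in CPU
# memory in bfloat16, and run on the GPU in bfloat16.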
vram_config = {
    "offload_dtype": "disk",
    "offload_device": "disk",
    "onload_dtype": torch.bfloat16,
    "onload_device": "cpu",
    "preparing_dtype": torch.bfloat16,
    "preparing_device": "cuda",
    "computation_dtype": torch.bfloat16,
    "computation_device": "cuda",
}

pipe = QwenImagePipeline.from_pretrained(
    torch_dtype=torch.bfloat16,
    device="cuda",
    model_configs=[
        ModelConfig(model_id="Qwen/Qwen-Image", origin_file_pattern="transformer/diffusion_pytorch_model*.safetensors", **vram_config),
        ModelConfig(model_id="Qwen/Qwen-Image", origin_file_pattern="text_encoder/model*.safetensors", **vram_config),
        ModelConfig(model_id="Qwen/Qwen-Image", origin_file_pattern="vae/diffusion_pytorch_model.safetensors", **vram_config),
    ],
    tokenizer_config=ModelConfig(model_id="Qwen/Qwen-Image", origin_file_pattern="tokenizer/"),
    vram_limit=torch.cuda.mem_get_info("cuda")[1] / (1024 ** 3) - 0.5,
)

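# Load the LoRA generated above; for the Coarse + Fine + Bias pipeline,
# use "model_coarse_fine_bias.safetensors" instead.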
pipe.load_lora(pipe.dit, "model_style.safetensors")

image = pipe("a cat", seed=0, height=1024, width=1024, num_inference_steps=50)
image.save("image.jpg")
```