---
license: apache-2.0
---
# Qwen-Image-Layered-Control

## Model Introduction

This model is fine-tuned from [Qwen/Qwen-Image-Layered](https://modelscope.cn/models/Qwen/Qwen-Image-Layered) on the dataset [artplus/PrismLayersPro](https://modelscope.cn/datasets/artplus/PrismLayersPro), enabling text-controlled extraction of individual image layers.

For more details about training strategies and implementation, feel free to check our [technical blog](https://modelscope.cn/learn/4938).

## Usage Tips

* The model architecture has been changed from multi-image output to single-image output: it produces only the layer relevant to the provided text description.
* The model was trained exclusively on English text, but it retains the Chinese language understanding inherited from the base model.
* The native training resolution is 1024x1024; inference at other resolutions is also supported.
* The model struggles to separate entities that are heavily occluded or overlapping, such as the cartoon skeleton's head and hat in the examples below.
* The model excels at decomposing poster-like graphics but performs poorly on photographic images, especially those with complex lighting and shadows.
* The model supports negative prompts: content you wish to exclude from the extracted layer can be described in the negative prompt (see the sketch below).
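
A minimal sketch of negative-prompt usage, assuming the pipeline call accepts a `negative_prompt` keyword as other DiffSynth-Studio pipelines do; `pipe` and `input_image` are constructed as in the Inference Code section below:

```python
# Sketch only: `negative_prompt` is assumed to be supported by the pipeline
# call, mirroring other DiffSynth-Studio pipelines; `pipe` and `input_image`
# come from the Inference Code section below.
image = pipe(
    prompt="A cartoon skeleton character wearing a purple hat",
    negative_prompt="gift box, text",  # content to exclude from the layer
    seed=0,
    num_inference_steps=30, cfg_scale=4,
    height=1024, width=1024,
    layer_input_image=input_image,
    layer_num=0,
)[0]
image.save("image_negative_prompt.png")
```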

## Demo Examples

**Some images contain white text on light backgrounds. ModelScope users should click the "☀︎" icon in the top-right corner to switch to dark mode for better visibility.**

### Example 1

<div style="display: flex; justify-content: space-between;">

<div style="width: 30%;">

|Input Image|
|-|
|![](./assets/image_1_input.png)|

</div>

<div style="width: 66%;">

|Prompt|Output Image|Prompt|Output Image|
|-|-|-|-|
|A solid, uniform color with no distinguishable features or objects|![](./assets/image_1_0_0.png)|Text 'TRICK'|![](./assets/image_1_4_0.png)|
|Cloud|![](./assets/image_1_1_0.png)|Text 'TRICK OR TREAT'|![](./assets/image_1_3_0.png)|
|A cartoon skeleton character wearing a purple hat and holding a gift box|![](./assets/image_1_2_0.png)|Text 'TRICK OR'|![](./assets/image_1_7_0.png)|
|A purple hat and a head|![](./assets/image_1_5_0.png)|A gift box|![](./assets/image_1_6_0.png)|

</div>

</div>

### Example 2

<div style="display: flex; justify-content: space-between;">

<div style="width: 30%;">

|Input Image|
|-|
|![](./assets/image_2_input.png)|

</div>

<div style="width: 66%;">

|Prompt|Output Image|Prompt|Output Image|
|-|-|-|-|
|Blue sky, white clouds, a garden with colorful flowers|![](./assets/image_2_0_0.png)|Colorful, intricate floral wreath|![](./assets/image_2_2_0.png)|
|Girl, wreath, kitten|![](./assets/image_2_1_0.png)|Girl, kitten|![](./assets/image_2_3_0.png)|

</div>

</div>

### Example 3

<div style="display: flex; justify-content: space-between;">

<div style="width: 30%;">

|Input Image|
|-|
|![](./assets/image_3_input.png)|

</div>

<div style="width: 66%;">

|Prompt|Output Image|Prompt|Output Image|
|-|-|-|-|
|A clear blue sky and a turbulent sea|![](./assets/image_3_0_0.png)|Text "The Life I Long For"|![](./assets/image_3_2_0.png)|
|A seagull|![](./assets/image_3_1_0.png)|Text "Life"|![](./assets/image_3_3_0.png)|

</div>

</div>

## Inference Code

Install DiffSynth-Studio:

```bash
git clone https://github.com/modelscope/DiffSynth-Studio.git
cd DiffSynth-Studio
pip install -e .
```

Model inference:

```python
from diffsynth.pipelines.qwen_image import QwenImagePipeline, ModelConfig
from PIL import Image
import torch, requests

# Load the layered-control transformer together with the base model's
# text encoder, VAE, and processor from their respective repositories.
pipe = QwenImagePipeline.from_pretrained(
    torch_dtype=torch.bfloat16,
    device="cuda",
    model_configs=[
        ModelConfig(model_id="DiffSynth-Studio/Qwen-Image-Layered-Control", origin_file_pattern="transformer/diffusion_pytorch_model*.safetensors"),
        ModelConfig(model_id="Qwen/Qwen-Image", origin_file_pattern="text_encoder/model*.safetensors"),
        ModelConfig(model_id="Qwen/Qwen-Image-Layered", origin_file_pattern="vae/diffusion_pytorch_model.safetensors"),
    ],
    processor_config=ModelConfig(model_id="Qwen/Qwen-Image-Edit", origin_file_pattern="processor/"),
)

# Describe the layer to extract.
prompt = "A cartoon skeleton character wearing a purple hat and holding a gift box"

# Download the example input image and resize it to the native training resolution.
input_image = requests.get("https://modelscope.oss-cn-beijing.aliyuncs.com/resource/images/trick_or_treat.png", stream=True).raw
input_image = Image.open(input_image).convert("RGBA").resize((1024, 1024))
input_image.save("image_input.png")

# Extract the layer matching the prompt; the pipeline returns a list of images.
images = pipe(
    prompt,
    seed=0,
    num_inference_steps=30, cfg_scale=4,
    height=1024, width=1024,
    layer_input_image=input_image,
    layer_num=0,
)
images[0].save("image.png")
```
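
Because the model emits a single layer per call, a full decomposition like Example 1 above can be obtained by invoking the pipeline once per layer description. A minimal sketch, reusing `pipe` and `input_image` from above; the prompt list is illustrative, taken from the Example 1 table:

```python
# Illustrative layer descriptions from the Example 1 table above; run the
# pipeline once per description and save each extracted layer.
layer_prompts = [
    "A solid, uniform color with no distinguishable features or objects",
    "Cloud",
    "A cartoon skeleton character wearing a purple hat and holding a gift box",
    "Text 'TRICK OR TREAT'",
]
for i, layer_prompt in enumerate(layer_prompts):
    layer = pipe(
        layer_prompt,
        seed=0,
        num_inference_steps=30, cfg_scale=4,
        height=1024, width=1024,
        layer_input_image=input_image,
        layer_num=0,
    )[0]
    layer.save(f"layer_{i}.png")
```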