llmrobot committed · Commit 7111f55 · verified · 1 Parent(s): 496a3ef

Update README.md

Files changed (1): README.md (+105 -83)

README.md CHANGED
@@ -4,117 +4,139 @@ language:
  - en
  - zh
  base_model:
- - Qwen/Qwen-Image
+ - Qwen/Qwen-Image-Layered
  pipeline_tag: image-text-to-image
  library_name: diffusers
  ---
- <p align="center">
-     <img src="https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen-Image/layered/qwen-image-layered-logo.png" width="800"/>
- </p>
- <p align="center">&nbsp;&nbsp;🤗 <a href="https://huggingface.co/Qwen/Qwen-Image-Layered">HuggingFace</a>&nbsp;&nbsp;|&nbsp;&nbsp;🤖 <a href="https://modelscope.cn/models/Qwen/Qwen-Image-Layered">ModelScope</a>&nbsp;&nbsp;|&nbsp;&nbsp;📑 <a href="https://arxiv.org/abs/2512.15603">Research Paper</a>&nbsp;&nbsp;|&nbsp;&nbsp;📑 <a href="https://qwenlm.github.io/blog/qwen-image-layered/">Blog</a>&nbsp;&nbsp;|&nbsp;&nbsp;🤗 <a href="https://huggingface.co/spaces/Qwen/Qwen-Image-Layered">Demo</a>
- </p>
-
- <p align="center">
-     <img src="https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen-Image/layered/layered.JPG" width="1024"/>
- </p>
-
- ## Introduction
- We are excited to introduce **Qwen-Image-Layered**, a model that decomposes an image into multiple RGBA layers. This layered representation unlocks **inherent editability**: each layer can be manipulated independently without affecting the rest of the content. It also naturally supports **high-fidelity elementary operations** such as resizing, repositioning, and recoloring. By physically isolating semantic or structural components into distinct layers, our approach enables high-fidelity and consistent editing.
-
- ## Quick Start
-
- 1. Make sure transformers>=4.51.3 is installed (required for Qwen2.5-VL support).
-
- 2. Install the latest version of diffusers, along with python-pptx:
- ```
- pip install git+https://github.com/huggingface/diffusers
- pip install python-pptx
- ```
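
As a quick check of the requirement in step 1, the installed version can be compared with `packaging` (already available wherever pip is); a minimal sketch:

```python
# Verify the transformers version requirement from step 1.
import transformers
from packaging import version

assert version.parse(transformers.__version__) >= version.parse("4.51.3"), \
    f"transformers {transformers.__version__} found; >=4.51.3 is required for Qwen2.5-VL"
```
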
- ```python
- from diffusers import QwenImageLayeredPipeline
- import torch
- from PIL import Image
-
- pipeline = QwenImageLayeredPipeline.from_pretrained("Qwen/Qwen-Image-Layered")
- pipeline = pipeline.to("cuda", torch.bfloat16)
- pipeline.set_progress_bar_config(disable=None)
-
- image = Image.open("asserts/test_images/1.png").convert("RGBA")
- inputs = {
-     "image": image,
-     "generator": torch.Generator(device="cuda").manual_seed(777),
-     "true_cfg_scale": 4.0,
-     "negative_prompt": " ",
-     "num_inference_steps": 50,
-     "num_images_per_prompt": 1,
-     "layers": 4,
-     "resolution": 640,  # Resolution bucket (640 or 1024); 640 is recommended for this version
-     "cfg_normalize": True,  # Whether to enable CFG normalization
-     "use_en_prompt": True,  # Caption language used automatically when the user provides no caption
- }
-
- with torch.inference_mode():
-     output = pipeline(**inputs)
- output_image = output.images[0]
-
- for i, layer in enumerate(output_image):
-     layer.save(f"{i}.png")
- ```
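
Since each saved file is an RGBA layer, a quick way to sanity-check the decomposition is to composite the layers back together with PIL; a minimal sketch, assuming the four layers saved above are ordered back to front:

```python
from PIL import Image

# Recomposite the saved RGBA layers, back to front.
layers = [Image.open(f"{i}.png").convert("RGBA") for i in range(4)]
canvas = Image.new("RGBA", layers[0].size, (0, 0, 0, 0))
for layer in layers:
    canvas = Image.alpha_composite(canvas, layer)
canvas.save("recomposed.png")  # should closely match the input image
```
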
-
- ## Showcase
- ### Layered Decomposition in Application
- Given an image, Qwen-Image-Layered can decompose it into several RGBA layers:
- ![Example Image](https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen-Image/layered/幻灯片1.JPG)
-
- After decomposition, edits are applied exclusively to the target layer, physically isolating it from the rest of the content and thereby ensuring consistency across edits.
-
- For example, we can recolor the first layer while keeping all other content untouched:
- ![Example Image](https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen-Image/layered/幻灯片2.JPG)
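
A minimal sketch of such a recoloring with PIL (the layer index and target color are illustrative; only the chosen layer is touched):

```python
from PIL import Image

layers = [Image.open(f"{i}.png").convert("RGBA") for i in range(4)]

# Fill the first layer with a new color wherever it is non-transparent.
alpha = layers[0].getchannel("A")
tint = Image.new("RGBA", layers[0].size, (180, 40, 40, 255))
layers[0] = Image.composite(tint, layers[0], alpha)

# Recomposite; every other layer is untouched by construction.
canvas = Image.new("RGBA", layers[0].size, (0, 0, 0, 0))
for layer in layers:
    canvas = Image.alpha_composite(canvas, layer)
canvas.save("recolored.png")
```
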
-
- We can also replace the content of the second layer, swapping a girl for a boy (the target layer is edited with Qwen-Image-Edit):
- ![Example Image](https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen-Image/layered/幻灯片3.JPG)
-
- Here, we revise the text to "Qwen-Image" (the target layer is edited with Qwen-Image-Edit):
- ![Example Image](https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen-Image/layered/幻灯片4.JPG)
-
- Furthermore, the layered structure naturally supports elementary operations. For example, we can delete unwanted objects cleanly:
- ![Example Image](https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen-Image/layered/幻灯片5.JPG)
-
- We can also resize an object without distortion:
- ![Example Image](https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen-Image/layered/幻灯片6.JPG)
-
- After layer decomposition, we can move objects freely within the canvas:
- ![Example Image](https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen-Image/layered/幻灯片7.JPG)
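
Deletion, resizing, and repositioning all reduce to simple raster operations on individual layers before recompositing; a minimal PIL sketch (layer indices, scale, and offsets are illustrative):

```python
from PIL import Image

layers = [Image.open(f"{i}.png").convert("RGBA") for i in range(4)]
size = layers[0].size

# Delete: drop an unwanted layer from the stack.
del layers[3]

# Resize + move: shrink a layer's content and paste it at a new position.
obj = layers[2].resize((size[0] // 2, size[1] // 2))
repositioned = Image.new("RGBA", size, (0, 0, 0, 0))
repositioned.paste(obj, (50, 200), obj)  # the layer's own alpha acts as the paste mask
layers[2] = repositioned

canvas = Image.new("RGBA", size, (0, 0, 0, 0))
for layer in layers:
    canvas = Image.alpha_composite(canvas, layer)
canvas.save("edited.png")
```
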
- ### Flexible and Iterative Decomposition
- Qwen-Image-Layered is not limited to a fixed number of layers: it supports variable-layer decomposition. For example, an image can be decomposed into either 3 or 8 layers as needed:
-
- ![Example Image](https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen-Image/layered/幻灯片8.JPG)
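
With the Quick Start pipeline above, the layer count is just the `layers` entry of the inputs dict:

```python
# Reusing `pipeline` and `inputs` from the Quick Start example.
inputs["layers"] = 8  # or 3, or any other supported layer count
with torch.inference_mode():
    layers_8 = pipeline(**inputs).images[0]
```
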
-
- Moreover, decomposition can be applied recursively: any layer can itself be further decomposed, enabling arbitrarily deep decomposition.
-
- ![Example Image](https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen-Image/layered/幻灯片9.JPG)
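
Recursive decomposition is just feeding one output layer back in as a new input; a minimal sketch reusing the Quick Start pipeline (the chosen layer index is illustrative):

```python
# First pass: decompose the original image.
with torch.inference_mode():
    first_pass = pipeline(**inputs).images[0]

# Second pass: further decompose one of the resulting layers.
inputs["image"] = first_pass[1].convert("RGBA")
inputs["layers"] = 3
with torch.inference_mode():
    second_pass = pipeline(**inputs).images[0]
```
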
-
- ## License Agreement
-
- Qwen-Image-Layered is licensed under Apache 2.0.
-
- ## Citation
-
- We kindly encourage you to cite our work if you find it useful:
-
- ```bibtex
- @misc{yin2025qwenimagelayered,
-     title={Qwen-Image-Layered: Towards Inherent Editability via Layer Decomposition},
-     author={Shengming Yin and Zekai Zhang and Zecheng Tang and Kaiyuan Gao and Xiao Xu and Kun Yan and Jiahao Li and Yilei Chen and Yuxiang Chen and Heung-Yeung Shum and Lionel M. Ni and Jingren Zhou and Junyang Lin and Chenfei Wu},
-     year={2025},
-     eprint={2512.15603},
-     archivePrefix={arXiv},
-     primaryClass={cs.CV},
-     url={https://arxiv.org/abs/2512.15603},
- }
+ # Qwen-Image-Layered
+
+ ## Model Introduction
+
+ This model is fine-tuned from [Qwen/Qwen-Image-Layered](https://modelscope.cn/models/Qwen/Qwen-Image-Layered) on the dataset [artplus/PrismLayersPro](https://modelscope.cn/datasets/artplus/PrismLayersPro), enabling text-controlled extraction of individual image layers.
+
+ For more details on the training strategy and implementation, see our [technical blog](https://modelscope.cn/learn/4938).
+
+ ## Usage Tips
+
+ * The model architecture has been changed from multi-image output to single-image output: it produces only the layer relevant to the provided text description.
+ * The model was trained exclusively on English text, but it retains the Chinese-language understanding inherited from the base model.
+ * The native training resolution is 1024x1024, but inference at other resolutions is supported.
+ * The model struggles to separate entities that are heavily occluded or overlapping, such as the cartoon skeleton's head and hat in the examples below.
+ * The model excels at decomposing poster-like graphics but performs poorly on photographic images, especially those with complex lighting and shadows.
+ * The model supports negative prompts: content to exclude can be specified via a negative prompt description (see the sketch after the inference code below).
+
+ ## Demo Examples
+
+ **Some images contain white text on light backgrounds. ModelScope users should click the "☀︎" icon in the top-right corner to switch to dark mode for better visibility.**
+
+ ### Example 1
+
+ <div style="display: flex; justify-content: space-between;">
+
+ <div style="width: 30%;">
+
+ |Input Image|
+ |-|
+ |![](./assets/image_1_input.png)|
+
+ </div>
+
+ <div style="width: 66%;">
+
+ |Prompt|Output Image|Prompt|Output Image|
+ |-|-|-|-|
+ |A solid, uniform color with no distinguishable features or objects|![](./assets/image_1_0_0.png)|Text 'TRICK'|![](./assets/image_1_4_0.png)|
+ |Cloud|![](./assets/image_1_1_0.png)|Text 'TRICK OR TREAT'|![](./assets/image_1_3_0.png)|
+ |A cartoon skeleton character wearing a purple hat and holding a gift box|![](./assets/image_1_2_0.png)|Text 'TRICK OR'|![](./assets/image_1_7_0.png)|
+ |A purple hat and a head|![](./assets/image_1_5_0.png)|A gift box|![](./assets/image_1_6_0.png)|
+
+ </div>
+
+ </div>
+
+ ### Example 2
+
+ <div style="display: flex; justify-content: space-between;">
+
+ <div style="width: 30%;">
+
+ |Input Image|
+ |-|
+ |![](./assets/image_2_input.png)|
+
+ </div>
+
+ <div style="width: 66%;">
+
+ |Prompt|Output Image|Prompt|Output Image|
+ |-|-|-|-|
+ |Blue sky, white clouds, a garden with colorful flowers|![](./assets/image_2_0_0.png)|Colorful, intricate floral wreath|![](./assets/image_2_2_0.png)|
+ |Girl, wreath, kitten|![](./assets/image_2_1_0.png)|Girl, kitten|![](./assets/image_2_3_0.png)|
+
+ </div>
+
+ </div>
+
+ ### Example 3
+
+ <div style="display: flex; justify-content: space-between;">
+
+ <div style="width: 30%;">
+
+ |Input Image|
+ |-|
+ |![](./assets/image_3_input.png)|
+
+ </div>
+
+ <div style="width: 66%;">
+
+ |Prompt|Output Image|Prompt|Output Image|
+ |-|-|-|-|
+ |A clear blue sky and a turbulent sea|![](./assets/image_3_0_0.png)|Text "The Life I Long For"|![](./assets/image_3_2_0.png)|
+ |A seagull|![](./assets/image_3_1_0.png)|Text "Life"|![](./assets/image_3_3_0.png)|
+
+ </div>
+
+ </div>
+
+ ## Inference Code
+
+ Install DiffSynth-Studio:
+
+ ```
+ git clone https://github.com/modelscope/DiffSynth-Studio.git
+ cd DiffSynth-Studio
+ pip install -e .
+ ```
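
A quick way to confirm the editable install succeeded is to import the two classes the script below relies on:

```python
# Should run without errors after `pip install -e .`
from diffsynth.pipelines.qwen_image import QwenImagePipeline, ModelConfig
```
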
+
+ Model inference:
+
+ ```python
+ import torch, requests
+ from PIL import Image
+ from diffsynth.pipelines.qwen_image import QwenImagePipeline, ModelConfig
+
+ # Assemble the pipeline: fine-tuned transformer + base text encoder, VAE, and processor.
+ pipe = QwenImagePipeline.from_pretrained(
+     torch_dtype=torch.bfloat16,
+     device="cuda",
+     model_configs=[
+         ModelConfig(model_id="DiffSynth-Studio/Qwen-Image-Layered-Control", origin_file_pattern="transformer/diffusion_pytorch_model*.safetensors"),
+         ModelConfig(model_id="Qwen/Qwen-Image", origin_file_pattern="text_encoder/model*.safetensors"),
+         ModelConfig(model_id="Qwen/Qwen-Image-Layered", origin_file_pattern="vae/diffusion_pytorch_model.safetensors"),
+     ],
+     processor_config=ModelConfig(model_id="Qwen/Qwen-Image-Edit", origin_file_pattern="processor/"),
+ )
+
+ # Text description of the layer to extract.
+ prompt = "A cartoon skeleton character wearing a purple hat and holding a gift box"
+
+ # Download the example input and resize it to the native 1024x1024 resolution.
+ input_image = requests.get("https://modelscope.oss-cn-beijing.aliyuncs.com/resource/images/trick_or_treat.png", stream=True).raw
+ input_image = Image.open(input_image).convert("RGBA").resize((1024, 1024))
+ input_image.save("image_input.png")
+
+ # Generate the single layer matching the prompt.
+ images = pipe(
+     prompt,
+     seed=0,
+     num_inference_steps=30, cfg_scale=4,
+     height=1024, width=1024,
+     layer_input_image=input_image,
+     layer_num=0,
+ )
+ images[0].save("image.png")
  ```
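
As noted in the Usage Tips, the model accepts negative prompts and non-native resolutions. A minimal variant of the call above, assuming the pipeline exposes the usual `negative_prompt` argument (the prompt text and size here are illustrative):

```python
# Variant: exclude unwanted content and run below the native 1024x1024 resolution.
images = pipe(
    prompt,
    negative_prompt="text, letters",  # content to keep out of the extracted layer
    seed=0,
    num_inference_steps=30, cfg_scale=4,
    height=768, width=768,
    layer_input_image=input_image.resize((768, 768)),
    layer_num=0,
)
images[0].save("image_negative.png")
```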