llmrobot committed · Commit 7111f55 · verified · 1 Parent(s): 496a3ef

Update README.md

Files changed (1): README.md (+105 -83)

README.md CHANGED
@@ -4,117 +4,139 @@ language:
  - en
  - zh
  base_model:
- - Qwen/Qwen-Image
+ - Qwen/Qwen-Image-Layered
  pipeline_tag: image-text-to-image
  library_name: diffusers
  ---
- <p align="center">
-     <img src="https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen-Image/layered/qwen-image-layered-logo.png" width="800"/>
- </p>
- <p align="center">&nbsp;&nbsp;🤗 <a href="https://huggingface.co/Qwen/Qwen-Image-Layered">HuggingFace</a>&nbsp;&nbsp;|&nbsp;&nbsp;🤖 <a href="https://modelscope.cn/models/Qwen/Qwen-Image-Layered">ModelScope</a>&nbsp;&nbsp;|&nbsp;&nbsp;📑 <a href="https://arxiv.org/abs/2512.15603">Research Paper</a>&nbsp;&nbsp;|&nbsp;&nbsp;📑 <a href="https://qwenlm.github.io/blog/qwen-image-layered/">Blog</a>&nbsp;&nbsp;|&nbsp;&nbsp;🤗 <a href="https://huggingface.co/spaces/Qwen/Qwen-Image-Layered">Demo</a>
- </p>
-
- <p align="center">
-     <img src="https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen-Image/layered/layered.JPG" width="1024"/>
- </p>
-
- ## Introduction
- We are excited to introduce **Qwen-Image-Layered**, a model that decomposes an image into multiple RGBA layers. This layered representation unlocks **inherent editability**: each layer can be manipulated independently without affecting the rest of the content. It also naturally supports **high-fidelity elementary operations** such as resizing, repositioning, and recoloring. By physically isolating semantic or structural components into distinct layers, our approach enables high-fidelity and consistent editing.
-
- ## Quick Start
-
- 1. Make sure transformers>=4.51.3 is installed (required for Qwen2.5-VL support).
-
- 2. Install the latest version of diffusers, along with python-pptx:
- ```
- pip install git+https://github.com/huggingface/diffusers
- pip install python-pptx
- ```
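
As a quick check of the requirement in step 1, the installed version can be compared with `packaging` (already available wherever pip is); a minimal sketch:

```python
# Verify the transformers version requirement from step 1.
import transformers
from packaging import version

assert version.parse(transformers.__version__) >= version.parse("4.51.3"), \
    f"transformers {transformers.__version__} found; >=4.51.3 is required for Qwen2.5-VL"
```
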
- ```python
- from diffusers import QwenImageLayeredPipeline
- import torch
- from PIL import Image
-
- pipeline = QwenImageLayeredPipeline.from_pretrained("Qwen/Qwen-Image-Layered")
- pipeline = pipeline.to("cuda", torch.bfloat16)
- pipeline.set_progress_bar_config(disable=None)
-
- image = Image.open("asserts/test_images/1.png").convert("RGBA")
- inputs = {
-     "image": image,
-     "generator": torch.Generator(device="cuda").manual_seed(777),
-     "true_cfg_scale": 4.0,
-     "negative_prompt": " ",
-     "num_inference_steps": 50,
-     "num_images_per_prompt": 1,
-     "layers": 4,
-     "resolution": 640,  # Resolution bucket (640 or 1024); 640 is recommended for this version
-     "cfg_normalize": True,  # Whether to enable CFG normalization
-     "use_en_prompt": True,  # Caption language used automatically when the user provides no caption
- }
-
- with torch.inference_mode():
-     output = pipeline(**inputs)
- output_image = output.images[0]
-
- for i, layer in enumerate(output_image):
-     layer.save(f"{i}.png")
- ```
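
Since each saved file is an RGBA layer, a quick way to sanity-check the decomposition is to composite the layers back together with PIL; a minimal sketch, assuming the four layers saved above are ordered back to front:

```python
from PIL import Image

# Recomposite the saved RGBA layers, back to front.
layers = [Image.open(f"{i}.png").convert("RGBA") for i in range(4)]
canvas = Image.new("RGBA", layers[0].size, (0, 0, 0, 0))
for layer in layers:
    canvas = Image.alpha_composite(canvas, layer)
canvas.save("recomposed.png")  # should closely match the input image
```
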
-
- ## Showcase
- ### Layered Decomposition in Application
- Given an image, Qwen-Image-Layered can decompose it into several RGBA layers:
- ![Example Image](https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen-Image/layered/幻灯片1.JPG)
-
- After decomposition, edits are applied exclusively to the target layer, physically isolating it from the rest of the content and thereby ensuring consistency across edits.
-
- For example, we can recolor the first layer while keeping all other content untouched:
- ![Example Image](https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen-Image/layered/幻灯片2.JPG)
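
A minimal sketch of such a recoloring with PIL (the layer index and target color are illustrative; only the chosen layer is touched):

```python
from PIL import Image

layers = [Image.open(f"{i}.png").convert("RGBA") for i in range(4)]

# Fill the first layer with a new color wherever it is non-transparent.
alpha = layers[0].getchannel("A")
tint = Image.new("RGBA", layers[0].size, (180, 40, 40, 255))
layers[0] = Image.composite(tint, layers[0], alpha)

# Recomposite; every other layer is untouched by construction.
canvas = Image.new("RGBA", layers[0].size, (0, 0, 0, 0))
for layer in layers:
    canvas = Image.alpha_composite(canvas, layer)
canvas.save("recolored.png")
```
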
-
- We can also replace the content of the second layer, swapping a girl for a boy (the target layer is edited with Qwen-Image-Edit):
- ![Example Image](https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen-Image/layered/幻灯片3.JPG)
-
- Here, we revise the text to "Qwen-Image" (the target layer is edited with Qwen-Image-Edit):
- ![Example Image](https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen-Image/layered/幻灯片4.JPG)
-
- Furthermore, the layered structure naturally supports elementary operations. For example, we can delete unwanted objects cleanly:
- ![Example Image](https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen-Image/layered/幻灯片5.JPG)
-
- We can also resize an object without distortion:
- ![Example Image](https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen-Image/layered/幻灯片6.JPG)
-
- After layer decomposition, we can move objects freely within the canvas:
- ![Example Image](https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen-Image/layered/幻灯片7.JPG)
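
Deletion, resizing, and repositioning all reduce to simple raster operations on individual layers before recompositing; a minimal PIL sketch (layer indices, scale, and offsets are illustrative):

```python
from PIL import Image

layers = [Image.open(f"{i}.png").convert("RGBA") for i in range(4)]
size = layers[0].size

# Delete: drop an unwanted layer from the stack.
del layers[3]

# Resize + move: shrink a layer's content and paste it at a new position.
obj = layers[2].resize((size[0] // 2, size[1] // 2))
repositioned = Image.new("RGBA", size, (0, 0, 0, 0))
repositioned.paste(obj, (50, 200), obj)  # the layer's own alpha acts as the paste mask
layers[2] = repositioned

canvas = Image.new("RGBA", size, (0, 0, 0, 0))
for layer in layers:
    canvas = Image.alpha_composite(canvas, layer)
canvas.save("edited.png")
```
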
- ### Flexible and Iterative Decomposition
- Qwen-Image-Layered is not limited to a fixed number of layers: it supports variable-layer decomposition. For example, an image can be decomposed into either 3 or 8 layers as needed:
-
- ![Example Image](https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen-Image/layered/幻灯片8.JPG)
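
With the Quick Start pipeline above, the layer count is just the `layers` entry of the inputs dict:

```python
# Reusing `pipeline` and `inputs` from the Quick Start example.
inputs["layers"] = 8  # or 3, or any other supported layer count
with torch.inference_mode():
    layers_8 = pipeline(**inputs).images[0]
```
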
-
- Moreover, decomposition can be applied recursively: any layer can itself be further decomposed, enabling arbitrarily deep decomposition.
-
- ![Example Image](https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen-Image/layered/幻灯片9.JPG)
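
Recursive decomposition is just feeding one output layer back in as a new input; a minimal sketch reusing the Quick Start pipeline (the chosen layer index is illustrative):

```python
# First pass: decompose the original image.
with torch.inference_mode():
    first_pass = pipeline(**inputs).images[0]

# Second pass: further decompose one of the resulting layers.
inputs["image"] = first_pass[1].convert("RGBA")
inputs["layers"] = 3
with torch.inference_mode():
    second_pass = pipeline(**inputs).images[0]
```
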
-
- ## License Agreement
-
- Qwen-Image-Layered is licensed under Apache 2.0.
-
- ## Citation
-
- We kindly encourage you to cite our work if you find it useful:
-
- ```bibtex
- @misc{yin2025qwenimagelayered,
-     title={Qwen-Image-Layered: Towards Inherent Editability via Layer Decomposition},
-     author={Shengming Yin and Zekai Zhang and Zecheng Tang and Kaiyuan Gao and Xiao Xu and Kun Yan and Jiahao Li and Yilei Chen and Yuxiang Chen and Heung-Yeung Shum and Lionel M. Ni and Jingren Zhou and Junyang Lin and Chenfei Wu},
-     year={2025},
-     eprint={2512.15603},
-     archivePrefix={arXiv},
-     primaryClass={cs.CV},
-     url={https://arxiv.org/abs/2512.15603},
- }
+ # Qwen-Image-Layered
+
+ ## Model Introduction
+
+ This model is fine-tuned from [Qwen/Qwen-Image-Layered](https://modelscope.cn/models/Qwen/Qwen-Image-Layered) on the dataset [artplus/PrismLayersPro](https://modelscope.cn/datasets/artplus/PrismLayersPro), enabling text-controlled extraction of individual image layers.
+
+ For more details on the training strategy and implementation, see our [technical blog](https://modelscope.cn/learn/4938).
+
+ ## Usage Tips
+
+ * The model architecture has been changed from multi-image output to single-image output: it produces only the layer relevant to the provided text description.
+ * The model was trained exclusively on English text, but it retains the Chinese-language understanding inherited from the base model.
+ * The native training resolution is 1024x1024, but inference at other resolutions is supported.
+ * The model struggles to separate entities that are heavily occluded or overlapping, such as the cartoon skeleton's head and hat in the examples below.
+ * The model excels at decomposing poster-like graphics but performs poorly on photographic images, especially those with complex lighting and shadows.
+ * The model supports negative prompts: content to exclude can be specified via a negative prompt description (see the sketch after the inference code below).
+
+ ## Demo Examples
+
+ **Some images contain white text on light backgrounds. ModelScope users should click the "☀︎" icon in the top-right corner to switch to dark mode for better visibility.**
+
+ ### Example 1
+
+ <div style="display: flex; justify-content: space-between;">
+
+ <div style="width: 30%;">
+
+ |Input Image|
+ |-|
+ |![](./assets/image_1_input.png)|
+
+ </div>
+
+ <div style="width: 66%;">
+
+ |Prompt|Output Image|Prompt|Output Image|
+ |-|-|-|-|
+ |A solid, uniform color with no distinguishable features or objects|![](./assets/image_1_0_0.png)|Text 'TRICK'|![](./assets/image_1_4_0.png)|
+ |Cloud|![](./assets/image_1_1_0.png)|Text 'TRICK OR TREAT'|![](./assets/image_1_3_0.png)|
+ |A cartoon skeleton character wearing a purple hat and holding a gift box|![](./assets/image_1_2_0.png)|Text 'TRICK OR'|![](./assets/image_1_7_0.png)|
+ |A purple hat and a head|![](./assets/image_1_5_0.png)|A gift box|![](./assets/image_1_6_0.png)|
+
+ </div>
+
+ </div>
+
+ ### Example 2
+
+ <div style="display: flex; justify-content: space-between;">
+
+ <div style="width: 30%;">
+
+ |Input Image|
+ |-|
+ |![](./assets/image_2_input.png)|
+
+ </div>
+
+ <div style="width: 66%;">
+
+ |Prompt|Output Image|Prompt|Output Image|
+ |-|-|-|-|
+ |Blue sky, white clouds, a garden with colorful flowers|![](./assets/image_2_0_0.png)|Colorful, intricate floral wreath|![](./assets/image_2_2_0.png)|
+ |Girl, wreath, kitten|![](./assets/image_2_1_0.png)|Girl, kitten|![](./assets/image_2_3_0.png)|
+
+ </div>
+
+ </div>
+
+ ### Example 3
+
+ <div style="display: flex; justify-content: space-between;">
+
+ <div style="width: 30%;">
+
+ |Input Image|
+ |-|
+ |![](./assets/image_3_input.png)|
+
+ </div>
+
+ <div style="width: 66%;">
+
+ |Prompt|Output Image|Prompt|Output Image|
+ |-|-|-|-|
+ |A clear blue sky and a turbulent sea|![](./assets/image_3_0_0.png)|Text "The Life I Long For"|![](./assets/image_3_2_0.png)|
+ |A seagull|![](./assets/image_3_1_0.png)|Text "Life"|![](./assets/image_3_3_0.png)|
+
+ </div>
+
+ </div>
+
+ ## Inference Code
+
+ Install DiffSynth-Studio:
+
+ ```
+ git clone https://github.com/modelscope/DiffSynth-Studio.git
+ cd DiffSynth-Studio
+ pip install -e .
+ ```
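
A quick way to confirm the editable install succeeded is to import the two classes the script below relies on:

```python
# Should run without errors after `pip install -e .`
from diffsynth.pipelines.qwen_image import QwenImagePipeline, ModelConfig
```
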
+
+ Model inference:
+
+ ```python
+ import torch, requests
+ from PIL import Image
+ from diffsynth.pipelines.qwen_image import QwenImagePipeline, ModelConfig
+
+ # Assemble the pipeline: fine-tuned transformer + base text encoder, VAE, and processor.
+ pipe = QwenImagePipeline.from_pretrained(
+     torch_dtype=torch.bfloat16,
+     device="cuda",
+     model_configs=[
+         ModelConfig(model_id="DiffSynth-Studio/Qwen-Image-Layered-Control", origin_file_pattern="transformer/diffusion_pytorch_model*.safetensors"),
+         ModelConfig(model_id="Qwen/Qwen-Image", origin_file_pattern="text_encoder/model*.safetensors"),
+         ModelConfig(model_id="Qwen/Qwen-Image-Layered", origin_file_pattern="vae/diffusion_pytorch_model.safetensors"),
+     ],
+     processor_config=ModelConfig(model_id="Qwen/Qwen-Image-Edit", origin_file_pattern="processor/"),
+ )
+
+ # Text description of the layer to extract.
+ prompt = "A cartoon skeleton character wearing a purple hat and holding a gift box"
+
+ # Download the example input and resize it to the native 1024x1024 resolution.
+ input_image = requests.get("https://modelscope.oss-cn-beijing.aliyuncs.com/resource/images/trick_or_treat.png", stream=True).raw
+ input_image = Image.open(input_image).convert("RGBA").resize((1024, 1024))
+ input_image.save("image_input.png")
+
+ # Generate the single layer matching the prompt.
+ images = pipe(
+     prompt,
+     seed=0,
+     num_inference_steps=30, cfg_scale=4,
+     height=1024, width=1024,
+     layer_input_image=input_image,
+     layer_num=0,
+ )
+ images[0].save("image.png")
  ```
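
As noted in the Usage Tips, the model accepts negative prompts and non-native resolutions. A minimal variant of the call above, assuming the pipeline exposes the usual `negative_prompt` argument (the prompt text and size here are illustrative):

```python
# Variant: exclude unwanted content and run below the native 1024x1024 resolution.
images = pipe(
    prompt,
    negative_prompt="text, letters",  # content to keep out of the extracted layer
    seed=0,
    num_inference_steps=30, cfg_scale=4,
    height=768, width=768,
    layer_input_image=input_image.resize((768, 768)),
    layer_num=0,
)
images[0].save("image_negative.png")
```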