---
license: apache-2.0
---
# Qwen-Image-Layered-Control
## Model Introduction
This model is trained from the base model [Qwen/Qwen-Image-Layered](https://modelscope.cn/models/Qwen/Qwen-Image-Layered) on the dataset [artplus/PrismLayersPro](https://modelscope.cn/datasets/artplus/PrismLayersPro). Given an input image and a text description, it extracts the layer that matches the description.
For details on the training strategy and implementation, see our [technical blog](https://modelscope.cn/learn/4938).
## Usage Tips
* The model architecture has been changed from multi-image output to single-image output, producing only the layer relevant to the provided text description.
* The model was trained exclusively on English text, but retains Chinese language understanding capabilities inherited from the base model.
* The native training resolution is 1024x1024; however, inference at other resolutions is supported.
* The model struggles to separate entities that heavily occlude or overlap each other, such as the cartoon skeleton's head and hat in Example 1.
* The model excels at decomposing poster-like graphics but performs poorly on photographic images, especially those involving complex lighting and shadows.
* The model supports negative prompts: describe the content you want excluded from the output layer in the negative prompt.
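
The tips on resolution and negative prompts can be sketched as a small wrapper around the pipeline call. This is only an illustration: `extract_layer` is a hypothetical helper, not part of DiffSynth-Studio, and the `negative_prompt` keyword is an assumption about the pipeline's call signature. The remaining arguments mirror the inference script in this card.

```python
def extract_layer(pipe, input_image, prompt, negative_prompt="",
                  size=(1024, 1024), seed=0):
    """Hypothetical helper: extract the single layer described by `prompt`.

    `pipe` is a loaded QwenImagePipeline and `input_image` is a PIL image.
    """
    width, height = size
    # Resize the input so it matches the requested output resolution.
    resized = input_image.convert("RGBA").resize((width, height))
    images = pipe(
        prompt,
        negative_prompt=negative_prompt,  # assumed keyword name
        seed=seed,
        num_inference_steps=30, cfg_scale=4,
        height=height, width=width,
        layer_input_image=resized,
        layer_num=0,
    )
    # The model produces a single image: the layer matching the prompt.
    return images[0]
```

With a loaded pipeline, a call such as `extract_layer(pipe, input_image, "A gift box", negative_prompt="text", size=(768, 768))` would request the gift box layer at a non-native resolution while asking the model to leave text out. Which sizes the pipeline accepts is not verified here.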
## Demo Examples
**Some images contain white text on light backgrounds. ModelScope users should click the "☀︎" icon in the top-right corner to switch to dark mode for better visibility.**
### Example 1
|Input Image|
|-|
||
|Prompt|Output Image|Prompt|Output Image|
|-|-|-|-|
|A solid, uniform color with no distinguishable features or objects||Text 'TRICK'||
|Cloud||Text 'TRICK OR TREAT'||
|A cartoon skeleton character wearing a purple hat and holding a gift box||Text 'TRICK OR'||
|A purple hat and a head||A gift box||
### Example 2
|Input Image|
|-|
||
|Prompt|Output Image|Prompt|Output Image|
|-|-|-|-|
|Blue sky, white clouds, a garden with colorful flowers||Colorful, intricate floral wreath||
|Girl, wreath, kitten||Girl, kitten||
### Example 3
|Input Image|
|-|
||
|Prompt|Output Image|Prompt|Output Image|
|-|-|-|-|
|A clear blue sky and a turbulent sea||Text "The Life I Long For"||
|A seagull||Text "Life"||
## Inference Code
Install DiffSynth-Studio:
```shell
git clone https://github.com/modelscope/DiffSynth-Studio.git
cd DiffSynth-Studio
pip install -e .
```
Model inference:
```python
from diffsynth.pipelines.qwen_image import QwenImagePipeline, ModelConfig
from PIL import Image
import torch, requests

# Load the layered-control transformer together with the Qwen-Image text encoder
# and the Qwen-Image-Layered VAE.
pipe = QwenImagePipeline.from_pretrained(
    torch_dtype=torch.bfloat16,
    device="cuda",
    model_configs=[
        ModelConfig(model_id="DiffSynth-Studio/Qwen-Image-Layered-Control", origin_file_pattern="transformer/diffusion_pytorch_model*.safetensors"),
        ModelConfig(model_id="Qwen/Qwen-Image", origin_file_pattern="text_encoder/model*.safetensors"),
        ModelConfig(model_id="Qwen/Qwen-Image-Layered", origin_file_pattern="vae/diffusion_pytorch_model.safetensors"),
    ],
    processor_config=ModelConfig(model_id="Qwen/Qwen-Image-Edit", origin_file_pattern="processor/"),
)

# Text description of the layer to extract.
prompt = "A cartoon skeleton character wearing a purple hat and holding a gift box"

# Download the example image, convert it to RGBA, and resize it to the native training resolution.
input_image = requests.get("https://modelscope.oss-cn-beijing.aliyuncs.com/resource/images/trick_or_treat.png", stream=True).raw
input_image = Image.open(input_image).convert("RGBA").resize((1024, 1024))
input_image.save("image_input.png")

images = pipe(
    prompt,
    seed=0,
    num_inference_steps=30, cfg_scale=4,
    height=1024, width=1024,
    layer_input_image=input_image,
    layer_num=0,
)

# Save the extracted layer.
images[0].save("image.png")
```
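
Because each call returns only one layer, reproducing a table like those in the demo examples takes one call per prompt. A minimal sketch of such a loop follows. `layer_filename` and `extract_layers` are hypothetical helpers introduced here for illustration, and the call arguments are copied from the script above.

```python
import re


def layer_filename(prompt, index):
    """Hypothetical helper: derive a filesystem-friendly name from a prompt."""
    slug = re.sub(r"[^a-z0-9]+", "_", prompt.lower()).strip("_")
    return f"layer_{index:02d}_{slug[:40]}.png"


def extract_layers(pipe, input_image, prompts, seed=0):
    """Hypothetical helper: run the pipeline once per prompt.

    Returns a dict that maps an output filename to the extracted layer.
    """
    results = {}
    for index, prompt in enumerate(prompts):
        images = pipe(
            prompt,
            seed=seed,
            num_inference_steps=30, cfg_scale=4,
            height=1024, width=1024,
            layer_input_image=input_image,
            layer_num=0,
        )
        results[layer_filename(prompt, index)] = images[0]
    return results


# Example usage with `pipe` and `input_image` from the script above:
# for name, layer in extract_layers(pipe, input_image, ["Cloud", "A gift box"]).items():
#     layer.save(name)
```

Reusing the same seed for every prompt is a choice made in this sketch to keep runs reproducible. It is not a requirement stated by the model authors.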