---
license: apache-2.0
---
# Qwen-Image-Layered-Control
## Model Introduction
This model was trained from [Qwen/Qwen-Image-Layered](https://modelscope.cn/models/Qwen/Qwen-Image-Layered) on the [artplus/PrismLayersPro](https://modelscope.cn/datasets/artplus/PrismLayersPro) dataset, enabling text-controlled extraction of segmented layers.
For more details about training strategies and implementation, feel free to check our [technical blog](https://modelscope.cn/learn/4938).
## Usage Tips
* The model architecture has been changed from multi-image output to single-image output, producing only the layer relevant to the provided text description.
* The model was trained exclusively on English text, but retains Chinese language understanding capabilities inherited from the base model.
* The native training resolution is 1024x1024; however, inference at other resolutions is supported.
* The model struggles to separate entities that heavily occlude or overlap one another, such as the cartoon skeleton's head and hat in Example 1.
* The model excels at decomposing poster-like graphics but performs poorly on photographic images, especially those involving complex lighting and shadows.
* The model supports negative prompts: describe the content to exclude in the negative prompt.
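As a rough illustration of the resolution and negative-prompt tips above, the sketch below collects the call arguments for one layer extraction. `build_layer_call` is a hypothetical helper, not part of DiffSynth-Studio, and the `negative_prompt` keyword name is an assumption; the remaining names and defaults mirror the inference code in this README.

```python
def build_layer_call(prompt, negative_prompt="", height=1024, width=1024, seed=0):
    """Collect keyword arguments for one layer-extraction call (hypothetical helper).

    `negative_prompt` is an assumed keyword name; the other keys and default
    values are copied from the inference example in this README.
    """
    if height <= 0 or width <= 0:
        raise ValueError("height and width must be positive")
    return {
        "prompt": prompt,
        "negative_prompt": negative_prompt,
        "seed": seed,
        "num_inference_steps": 30,
        "cfg_scale": 4,
        "height": height,
        "width": width,
        "layer_num": 0,
    }


# Extract the gift box at a non-native resolution while excluding text.
kwargs = build_layer_call("A gift box", negative_prompt="text, letters", height=768, width=1280)

# With `pipe` and `input_image` prepared as in the inference code, and the
# input resized to (width, height), the call would look like:
# images = pipe(layer_input_image=input_image, **kwargs)
```

Keeping the arguments in a plain dictionary makes it easy to vary only the prompt, negative prompt, or resolution between runs.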
## Demo Examples
**Some images contain white text on light backgrounds. ModelScope users should click the "☀︎" icon in the top-right corner to switch to dark mode for better visibility.**
### Example 1
<div style="display: flex; justify-content: space-between;">
<div style="width: 30%;">
|Input Image|
|-|
||
</div>
<div style="width: 66%;">
|Prompt|Output Image|Prompt|Output Image|
|-|-|-|-|
|A solid, uniform color with no distinguishable features or objects||Text 'TRICK'||
|Cloud||Text 'TRICK OR TREAT'||
|A cartoon skeleton character wearing a purple hat and holding a gift box||Text 'TRICK OR'||
|A purple hat and a head||A gift box||
</div>
</div>
### Example 2
<div style="display: flex; justify-content: space-between;">
<div style="width: 30%;">
|Input Image|
|-|
||
</div>
<div style="width: 66%;">
|Prompt|Output Image|Prompt|Output Image|
|-|-|-|-|
|Blue sky, white clouds, a garden with colorful flowers||Colorful, intricate floral wreath||
|Girl, wreath, kitten||Girl, kitten||
</div>
</div>
### Example 3
<div style="display: flex; justify-content: space-between;">
<div style="width: 30%;">
|Input Image|
|-|
||
</div>
<div style="width: 66%;">
|Prompt|Output Image|Prompt|Output Image|
|-|-|-|-|
|A clear blue sky and a turbulent sea||Text "The Life I Long For"||
|A seagull||Text "Life"||
</div>
</div>
## Inference Code
Install DiffSynth-Studio:
```shell
git clone https://github.com/modelscope/DiffSynth-Studio.git
cd DiffSynth-Studio
pip install -e .
```
Model inference:
```python
from diffsynth.pipelines.qwen_image import QwenImagePipeline, ModelConfig
from PIL import Image
import torch, requests

# Load the layered-control transformer, the Qwen-Image text encoder, and the Qwen-Image-Layered VAE.
pipe = QwenImagePipeline.from_pretrained(
    torch_dtype=torch.bfloat16,
    device="cuda",
    model_configs=[
        ModelConfig(model_id="DiffSynth-Studio/Qwen-Image-Layered-Control", origin_file_pattern="transformer/diffusion_pytorch_model*.safetensors"),
        ModelConfig(model_id="Qwen/Qwen-Image", origin_file_pattern="text_encoder/model*.safetensors"),
        ModelConfig(model_id="Qwen/Qwen-Image-Layered", origin_file_pattern="vae/diffusion_pytorch_model.safetensors"),
    ],
    processor_config=ModelConfig(model_id="Qwen/Qwen-Image-Edit", origin_file_pattern="processor/"),
)

# The prompt describes the single layer to extract.
prompt = "A cartoon skeleton character wearing a purple hat and holding a gift box"

# Download the example image and convert it to RGBA at the native 1024x1024 resolution.
input_image = requests.get("https://modelscope.oss-cn-beijing.aliyuncs.com/resource/images/trick_or_treat.png", stream=True).raw
input_image = Image.open(input_image).convert("RGBA").resize((1024, 1024))
input_image.save("image_input.png")

# Run inference and save the extracted layer.
images = pipe(
    prompt,
    seed=0,
    num_inference_steps=30, cfg_scale=4,
    height=1024, width=1024,
    layer_input_image=input_image,
    layer_num=0,
)
images[0].save("image.png")
```
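Because each call returns only the layer that matches one prompt, decomposing an image as in the demo tables means running the pipeline once per prompt. A minimal sketch of that loop follows; `extract_layers` is a hypothetical helper, and it assumes `pipe` accepts the same arguments as in the inference code above.

```python
def extract_layers(pipe, input_image, prompts, **pipe_kwargs):
    """Run `pipe` once per prompt and return {prompt: first output image}.

    Hypothetical helper: `pipe` is any callable that takes the arguments used
    in the inference example and returns a list of images.
    """
    layers = {}
    for prompt in prompts:
        images = pipe(prompt, layer_input_image=input_image, layer_num=0, **pipe_kwargs)
        layers[prompt] = images[0]
    return layers


# Example usage with `pipe` and `input_image` from the inference code:
# layers = extract_layers(
#     pipe, input_image,
#     ["Cloud", "A gift box", "Text 'TRICK OR TREAT'"],
#     seed=0, num_inference_steps=30, cfg_scale=4, height=1024, width=1024,
# )
# for i, image in enumerate(layers.values()):
#     image.save(f"layer_{i}.png")
```

The pipeline is passed in as an argument so the model is loaded once and reused across prompts.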