|
|
--- |
|
|
license: apache-2.0 |
|
|
--- |
|
|
# Qwen-Image-Layered |
|
|
|
|
|
## Model Introduction |
|
|
|
|
|
This model is fine-tuned from [Qwen/Qwen-Image-Layered](https://modelscope.cn/models/Qwen/Qwen-Image-Layered) on the dataset [artplus/PrismLayersPro](https://modelscope.cn/datasets/artplus/PrismLayersPro), enabling text-controlled extraction of individual image layers.
|
|
|
|
|
For more details about training strategies and implementation, feel free to check our [technical blog](https://modelscope.cn/learn/4938). |
|
|
|
|
|
## Usage Tips |
|
|
|
|
|
* The model architecture has been changed from multi-image output to single-image output: it produces only the layer that matches the provided text description.
|
|
* The model was trained exclusively on English text, but retains Chinese language understanding capabilities inherited from the base model. |
|
|
* The native training resolution is 1024x1024; however, inference at other resolutions is supported. |
|
|
* The model struggles to separate multiple entities that are heavily occluded or overlapping, such as the cartoon skeleton head and hat in the examples. |
|
|
* The model excels at decomposing poster-like graphics but performs poorly on photographic images, especially those involving complex lighting and shadows. |
|
|
* The model supports negative prompts; users can exclude unwanted content by describing it in the negative prompt.
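On the resolution tip above: diffusion pipelines typically require the height and width to be divisible by the VAE/patch downsampling factor. A minimal sketch of a hypothetical helper that snaps an arbitrary target size to the nearest valid one (the divisibility-by-16 constraint is an assumption, not confirmed for this model):

```python
def snap_resolution(width: int, height: int, multiple: int = 16) -> tuple:
    """Round each dimension to the nearest multiple of `multiple`.

    The divisibility constraint is an assumption for illustration,
    not a documented requirement of this model.
    """
    def snap(x: int) -> int:
        return max(multiple, (x + multiple // 2) // multiple * multiple)
    return snap(width), snap(height)

print(snap_resolution(1920, 1080))  # -> (1920, 1088)
print(snap_resolution(1000, 700))   # -> (1008, 704)
```

Passing the snapped values as `height` and `width` to the pipeline avoids shape mismatches when deviating from the native 1024x1024 resolution.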
|
|
|
|
|
## Demo Examples |
|
|
|
|
|
**Some images contain white text on light backgrounds. ModelScope users should click the "☀︎" icon in the top-right corner to switch to dark mode for better visibility.** |
|
|
|
|
|
### Example 1 |
|
|
|
|
|
<div style="display: flex; justify-content: space-between;"> |
|
|
|
|
|
<div style="width: 30%;"> |
|
|
|
|
|
|Input Image| |
|
|
|-| |
|
|
|| |
|
|
|
|
|
</div> |
|
|
|
|
|
<div style="width: 66%;"> |
|
|
|
|
|
|Prompt|Output Image|Prompt|Output Image| |
|
|
|-|-|-|-| |
|
|
|A solid, uniform color with no distinguishable features or objects||Text 'TRICK'|| |
|
|
|Cloud||Text 'TRICK OR TREAT'|| |
|
|
|A cartoon skeleton character wearing a purple hat and holding a gift box||Text 'TRICK OR'|| |
|
|
|A purple hat and a head||A gift box|| |
|
|
|
|
|
</div> |
|
|
|
|
|
</div> |
|
|
|
|
|
### Example 2 |
|
|
|
|
|
<div style="display: flex; justify-content: space-between;"> |
|
|
|
|
|
<div style="width: 30%;"> |
|
|
|
|
|
|Input Image| |
|
|
|-| |
|
|
|| |
|
|
|
|
|
</div> |
|
|
|
|
|
<div style="width: 66%;"> |
|
|
|
|
|
|Prompt|Output Image|Prompt|Output Image| |
|
|
|-|-|-|-| |
|
|
|Blue sky, white clouds, a garden with colorful flowers||Colorful, intricate floral wreath|| |
|
|
|Girl, wreath, kitten||Girl, kitten|| |
|
|
|
|
|
</div> |
|
|
|
|
|
</div> |
|
|
|
|
|
### Example 3 |
|
|
|
|
|
<div style="display: flex; justify-content: space-between;"> |
|
|
|
|
|
<div style="width: 30%;"> |
|
|
|
|
|
|Input Image| |
|
|
|-| |
|
|
|| |
|
|
|
|
|
</div> |
|
|
|
|
|
<div style="width: 66%;"> |
|
|
|
|
|
|Prompt|Output Image|Prompt|Output Image| |
|
|
|-|-|-|-| |
|
|
|A clear blue sky and a turbulent sea||Text "The Life I Long For"|| |
|
|
|A seagull||Text "Life"|| |
|
|
|
|
|
</div> |
|
|
|
|
|
</div> |
|
|
|
|
|
## Inference Code |
|
|
|
|
|
Install DiffSynth-Studio: |
|
|
|
|
|
```shell
|
|
git clone https://github.com/modelscope/DiffSynth-Studio.git |
|
|
cd DiffSynth-Studio |
|
|
pip install -e . |
|
|
``` |
|
|
|
|
|
Model inference: |
|
|
|
|
|
```python |
|
|
from diffsynth.pipelines.qwen_image import QwenImagePipeline, ModelConfig |
|
|
from PIL import Image |
|
|
import torch, requests |
|
|
|
|
|
# Load the pipeline; model weights are downloaded from ModelScope on first run.
pipe = QwenImagePipeline.from_pretrained(
|
|
torch_dtype=torch.bfloat16, |
|
|
device="cuda", |
|
|
model_configs=[ |
|
|
ModelConfig(model_id="DiffSynth-Studio/Qwen-Image-Layered-Control", origin_file_pattern="transformer/diffusion_pytorch_model*.safetensors"), |
|
|
ModelConfig(model_id="Qwen/Qwen-Image", origin_file_pattern="text_encoder/model*.safetensors"), |
|
|
ModelConfig(model_id="Qwen/Qwen-Image-Layered", origin_file_pattern="vae/diffusion_pytorch_model.safetensors"), |
|
|
], |
|
|
processor_config=ModelConfig(model_id="Qwen/Qwen-Image-Edit", origin_file_pattern="processor/"), |
|
|
) |
|
|
prompt = "A cartoon skeleton character wearing a purple hat and holding a gift box" |
|
|
# Download the example input image.
input_image = requests.get("https://modelscope.oss-cn-beijing.aliyuncs.com/resource/images/trick_or_treat.png", stream=True).raw
|
|
# Convert to RGBA and resize to the native 1024x1024 training resolution.
input_image = Image.open(input_image).convert("RGBA").resize((1024, 1024))
|
|
input_image.save("image_input.png") |
|
|
# Extract the layer described by the prompt.
images = pipe(
|
|
prompt, |
|
|
seed=0, |
|
|
num_inference_steps=30, cfg_scale=4, |
|
|
height=1024, width=1024, |
|
|
layer_input_image=input_image, |
|
|
layer_num=0, |
|
|
) |
|
|
images[0].save("image.png") |
|
|
``` |
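Because each extracted layer is an RGBA image, separately extracted layers can be recomposited with Pillow. A minimal sketch using synthetic placeholder layers rather than actual model outputs:

```python
from PIL import Image

# Synthetic stand-ins for extracted layers; in practice these would be
# RGBA outputs of the pipeline above.
background = Image.new("RGBA", (64, 64), (255, 0, 0, 255))  # opaque red base layer
foreground = Image.new("RGBA", (64, 64), (0, 0, 0, 0))      # fully transparent layer
# Paint an opaque blue square into the foreground layer.
for x in range(16, 48):
    for y in range(16, 48):
        foreground.putpixel((x, y), (0, 0, 255, 255))

# Alpha-composite the layers back into a single image (bottom to top).
composite = Image.alpha_composite(background, foreground)
print(composite.getpixel((0, 0)))    # -> (255, 0, 0, 255): background shows through
print(composite.getpixel((32, 32)))  # -> (0, 0, 255, 255): foreground covers the center
```

Compositing the layers in extraction order approximately reconstructs the original image, subject to the occlusion limitations noted in the usage tips.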