---
license: apache-2.0
---
# Qwen-Image-Layered

## Model Introduction

This model is fine-tuned from [Qwen/Qwen-Image-Layered](https://modelscope.cn/models/Qwen/Qwen-Image-Layered) on the [artplus/PrismLayersPro](https://modelscope.cn/datasets/artplus/PrismLayersPro) dataset. Given an input image and a text description, it extracts the image layer that matches the description.

For details on the training strategy and implementation, see our [technical blog](https://modelscope.cn/learn/4938).

## Usage Tips

* The architecture was changed from multi-image output to single-image output: the model produces only the layer that matches the provided text description.
* The model was trained only on English text but retains the Chinese-language understanding inherited from the base model.
* The native training resolution is 1024x1024; inference at other resolutions is also supported.
* The model struggles to separate entities that heavily occlude or overlap one another, such as the cartoon skeleton's head and hat in the examples.
* The model works well on poster-like graphics but performs poorly on photographs, especially those with complex lighting and shadows.
* Negative prompts are supported: describe the content you want to exclude in the negative prompt.
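
The resolution tip above can be made concrete with a small helper. This is a minimal sketch and not part of DiffSynth-Studio: `fit_resolution` is a hypothetical name, and both the target area of 1024x1024 pixels and the rounding to multiples of 16 are assumptions chosen to stay close to the native training resolution.

```python
import math


def fit_resolution(width, height, target_area=1024 * 1024, multiple=16):
    """Pick an inference size that keeps the aspect ratio of the input.

    Assumptions (not stated in this model card): the total pixel count is
    kept near the 1024x1024 training area, and each side is rounded to a
    multiple of 16.
    """
    # Uniform scale factor that brings width * height close to target_area.
    scale = math.sqrt(target_area / (width * height))
    # Round each scaled side to the nearest multiple, never below one step.
    new_width = max(multiple, round(width * scale / multiple) * multiple)
    new_height = max(multiple, round(height * scale / multiple) * multiple)
    return new_width, new_height


print(fit_resolution(1024, 1024))  # square input stays at the native size
print(fit_resolution(1920, 1080))  # landscape input keeps roughly 16:9
```

To use the result, resize the input image to `(new_width, new_height)` and pass the same values as `width` and `height` in the pipeline call.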

## Demo Examples

**Some images contain white text on a light background. ModelScope users should click the "☀︎" icon in the top-right corner to switch to dark mode for better visibility.**

### Example 1

<div style="display: flex; justify-content: space-between;">

<div style="width: 30%;">

|Input Image|
|-|
||

</div>

<div style="width: 66%;">

|Prompt|Output Image|Prompt|Output Image|
|-|-|-|-|
|A solid, uniform color with no distinguishable features or objects||Text 'TRICK'||
|Cloud||Text 'TRICK OR TREAT'||
|A cartoon skeleton character wearing a purple hat and holding a gift box||Text 'TRICK OR'||
|A purple hat and a head||A gift box||

</div>

</div>

### Example 2

<div style="display: flex; justify-content: space-between;">

<div style="width: 30%;">

|Input Image|
|-|
||

</div>

<div style="width: 66%;">

|Prompt|Output Image|Prompt|Output Image|
|-|-|-|-|
|Blue sky, white clouds, a garden with colorful flowers||Colorful, intricate floral wreath||
|Girl, wreath, kitten||Girl, kitten||

</div>

</div>

### Example 3

<div style="display: flex; justify-content: space-between;">

<div style="width: 30%;">

|Input Image|
|-|
||

</div>

<div style="width: 66%;">

|Prompt|Output Image|Prompt|Output Image|
|-|-|-|-|
|A clear blue sky and a turbulent sea||Text "The Life I Long For"||
|A seagull||Text "Life"||

</div>

</div>

## Inference Code

Install DiffSynth-Studio:

```
git clone https://github.com/modelscope/DiffSynth-Studio.git
cd DiffSynth-Studio
pip install -e .
```

Model inference:

```python
from io import BytesIO

import requests
import torch
from PIL import Image
from diffsynth.pipelines.qwen_image import QwenImagePipeline, ModelConfig

# Load the transformer from this model, plus the text encoder, VAE, and processor
# from the Qwen repositories.
pipe = QwenImagePipeline.from_pretrained(
    torch_dtype=torch.bfloat16,
    device="cuda",
    model_configs=[
        ModelConfig(model_id="DiffSynth-Studio/Qwen-Image-Layered-Control", origin_file_pattern="transformer/diffusion_pytorch_model*.safetensors"),
        ModelConfig(model_id="Qwen/Qwen-Image", origin_file_pattern="text_encoder/model*.safetensors"),
        ModelConfig(model_id="Qwen/Qwen-Image-Layered", origin_file_pattern="vae/diffusion_pytorch_model.safetensors"),
    ],
    processor_config=ModelConfig(model_id="Qwen/Qwen-Image-Edit", origin_file_pattern="processor/"),
)

# Text description of the layer to extract.
prompt = "A cartoon skeleton character wearing a purple hat and holding a gift box"

# Download the example image, convert it to RGBA, and resize it to the native resolution.
response = requests.get("https://modelscope.oss-cn-beijing.aliyuncs.com/resource/images/trick_or_treat.png", timeout=60)
response.raise_for_status()
input_image = Image.open(BytesIO(response.content)).convert("RGBA").resize((1024, 1024))
input_image.save("image_input.png")

# Run the pipeline; it returns a list of images, and the extracted layer is the first one.
images = pipe(
    prompt,
    seed=0,
    num_inference_steps=30,
    cfg_scale=4,
    height=1024,
    width=1024,
    layer_input_image=input_image,
    layer_num=0,
)
images[0].save("image.png")
```
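
Because the model returns one layer per prompt, decomposing an image into several layers means calling the pipeline once per description. The sketch below only prepares the arguments for those calls, so it runs without a GPU. `slugify` and `build_layer_calls` are hypothetical helpers, the output file-name scheme is illustrative, and the `negative_prompt` keyword is an assumption about the pipeline's call signature that should be checked against your DiffSynth-Studio version.

```python
import re


def slugify(prompt, max_len=40):
    """Turn a prompt into a safe file-name stem (hypothetical helper)."""
    slug = re.sub(r"[^a-z0-9]+", "_", prompt.lower()).strip("_")
    return slug[:max_len] or "layer"


def build_layer_calls(prompts, negative_prompt="", seed=0, size=1024):
    """Return (file_name, kwargs) pairs, one per requested layer.

    The keyword values mirror the inference example in this README.
    """
    calls = []
    for index, prompt in enumerate(prompts):
        kwargs = dict(
            prompt=prompt,
            seed=seed,
            num_inference_steps=30,
            cfg_scale=4,
            height=size,
            width=size,
            layer_num=0,
        )
        if negative_prompt:
            # Assumption: the pipeline accepts a `negative_prompt` keyword.
            kwargs["negative_prompt"] = negative_prompt
        calls.append((f"layer_{index:02d}_{slugify(prompt)}.png", kwargs))
    return calls


calls = build_layer_calls(["Cloud", "A gift box", "Text 'TRICK OR TREAT'"])
for file_name, kwargs in calls:
    print(file_name, kwargs["prompt"])

# With `pipe` and `input_image` from the inference example, each layer
# could then be extracted and saved like this:
# for file_name, kwargs in calls:
#     pipe(layer_input_image=input_image, **kwargs)[0].save(file_name)
```

Reusing the same seed and resolution for every prompt keeps the extracted layers aligned with the same resized input image.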