|
|
--- |
|
|
license: apache-2.0 |
|
|
--- |
|
|
# Qwen-Image-Layered |
|
|
|
|
|
## Model Introduction |
|
|
|
|
|
This model is fine-tuned from [Qwen/Qwen-Image-Layered](https://modelscope.cn/models/Qwen/Qwen-Image-Layered) on the dataset [artplus/PrismLayersPro](https://modelscope.cn/datasets/artplus/PrismLayersPro), enabling text-controlled extraction of individual image layers.
|
|
|
|
|
For more details about training strategies and implementation, feel free to check our [technical blog](https://modelscope.cn/learn/4938). |
|
|
|
|
|
## Usage Tips |
|
|
|
|
|
* The model architecture has been changed from multi-image output to single-image output: it produces only the layer that matches the provided text description.
|
|
* The model was trained exclusively on English text, but retains Chinese language understanding capabilities inherited from the base model. |
|
|
* The native training resolution is 1024x1024; however, inference at other resolutions is supported. |
|
|
* The model struggles to separate multiple entities that are heavily occluded or overlapping, such as the cartoon skeleton head and hat in the examples. |
|
|
* The model excels at decomposing poster-like graphics but performs poorly on photographic images, especially those involving complex lighting and shadows. |
|
|
* The model supports negative prompts; users can exclude unwanted content by describing it in the negative prompt.
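On the resolution tip above: diffusion pipelines typically require the height and width to be divisible by the VAE/patch downsampling factor. A minimal sketch of a hypothetical helper that snaps an arbitrary target size to the nearest valid one (the divisibility-by-16 constraint is an assumption, not confirmed for this model):

```python
def snap_resolution(width: int, height: int, multiple: int = 16) -> tuple:
    """Round each dimension to the nearest multiple of `multiple`.

    The divisibility constraint is an assumption for illustration,
    not a documented requirement of this model.
    """
    def snap(x: int) -> int:
        return max(multiple, (x + multiple // 2) // multiple * multiple)
    return snap(width), snap(height)

print(snap_resolution(1920, 1080))  # -> (1920, 1088)
print(snap_resolution(1000, 700))   # -> (1008, 704)
```

Passing the snapped values as `height` and `width` to the pipeline avoids shape mismatches when deviating from the native 1024x1024 resolution.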
|
|
|
|
|
## Demo Examples |
|
|
|
|
|
**Some images contain white text on light backgrounds. ModelScope users should click the "☀︎" icon in the top-right corner to switch to dark mode for better visibility.** |
|
|
|
|
|
### Example 1 |
|
|
|
|
|
<div style="display: flex; justify-content: space-between;"> |
|
|
|
|
|
<div style="width: 30%;"> |
|
|
|
|
|
|Input Image| |
|
|
|-| |
|
|
|| |
|
|
|
|
|
</div> |
|
|
|
|
|
<div style="width: 66%;"> |
|
|
|
|
|
|Prompt|Output Image|Prompt|Output Image| |
|
|
|-|-|-|-| |
|
|
|A solid, uniform color with no distinguishable features or objects||Text 'TRICK'|| |
|
|
|Cloud||Text 'TRICK OR TREAT'|| |
|
|
|A cartoon skeleton character wearing a purple hat and holding a gift box||Text 'TRICK OR'|| |
|
|
|A purple hat and a head||A gift box|| |
|
|
|
|
|
</div> |
|
|
|
|
|
</div> |
|
|
|
|
|
### Example 2 |
|
|
|
|
|
<div style="display: flex; justify-content: space-between;"> |
|
|
|
|
|
<div style="width: 30%;"> |
|
|
|
|
|
|Input Image| |
|
|
|-| |
|
|
|| |
|
|
|
|
|
</div> |
|
|
|
|
|
<div style="width: 66%;"> |
|
|
|
|
|
|Prompt|Output Image|Prompt|Output Image| |
|
|
|-|-|-|-| |
|
|
|Blue sky, white clouds, a garden with colorful flowers||Colorful, intricate floral wreath|| |
|
|
|Girl, wreath, kitten||Girl, kitten|| |
|
|
|
|
|
</div> |
|
|
|
|
|
</div> |
|
|
|
|
|
### Example 3 |
|
|
|
|
|
<div style="display: flex; justify-content: space-between;"> |
|
|
|
|
|
<div style="width: 30%;"> |
|
|
|
|
|
|Input Image| |
|
|
|-| |
|
|
|| |
|
|
|
|
|
</div> |
|
|
|
|
|
<div style="width: 66%;"> |
|
|
|
|
|
|Prompt|Output Image|Prompt|Output Image| |
|
|
|-|-|-|-| |
|
|
|A clear blue sky and a turbulent sea||Text "The Life I Long For"|| |
|
|
|A seagull||Text "Life"|| |
|
|
|
|
|
</div> |
|
|
|
|
|
</div> |
|
|
|
|
|
## Inference Code |
|
|
|
|
|
Install DiffSynth-Studio: |
|
|
|
|
|
```shell
|
|
git clone https://github.com/modelscope/DiffSynth-Studio.git |
|
|
cd DiffSynth-Studio |
|
|
pip install -e . |
|
|
``` |
|
|
|
|
|
Model inference: |
|
|
|
|
|
```python |
|
|
from diffsynth.pipelines.qwen_image import QwenImagePipeline, ModelConfig |
|
|
from PIL import Image |
|
|
import torch, requests |
|
|
|
|
|
# Load the pipeline; model weights are downloaded from ModelScope on first run.
pipe = QwenImagePipeline.from_pretrained(
|
|
torch_dtype=torch.bfloat16, |
|
|
device="cuda", |
|
|
model_configs=[ |
|
|
ModelConfig(model_id="DiffSynth-Studio/Qwen-Image-Layered-Control", origin_file_pattern="transformer/diffusion_pytorch_model*.safetensors"), |
|
|
ModelConfig(model_id="Qwen/Qwen-Image", origin_file_pattern="text_encoder/model*.safetensors"), |
|
|
ModelConfig(model_id="Qwen/Qwen-Image-Layered", origin_file_pattern="vae/diffusion_pytorch_model.safetensors"), |
|
|
], |
|
|
processor_config=ModelConfig(model_id="Qwen/Qwen-Image-Edit", origin_file_pattern="processor/"), |
|
|
) |
|
|
prompt = "A cartoon skeleton character wearing a purple hat and holding a gift box" |
|
|
# Download the example input image.
input_image = requests.get("https://modelscope.oss-cn-beijing.aliyuncs.com/resource/images/trick_or_treat.png", stream=True).raw
|
|
# Convert to RGBA and resize to the native 1024x1024 training resolution.
input_image = Image.open(input_image).convert("RGBA").resize((1024, 1024))
|
|
input_image.save("image_input.png") |
|
|
# Extract the layer described by the prompt.
images = pipe(
|
|
prompt, |
|
|
seed=0, |
|
|
num_inference_steps=30, cfg_scale=4, |
|
|
height=1024, width=1024, |
|
|
layer_input_image=input_image, |
|
|
layer_num=0, |
|
|
) |
|
|
images[0].save("image.png") |
|
|
``` |
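Because each extracted layer is an RGBA image, separately extracted layers can be recomposited with Pillow. A minimal sketch using synthetic placeholder layers rather than actual model outputs:

```python
from PIL import Image

# Synthetic stand-ins for extracted layers; in practice these would be
# RGBA outputs of the pipeline above.
background = Image.new("RGBA", (64, 64), (255, 0, 0, 255))  # opaque red base layer
foreground = Image.new("RGBA", (64, 64), (0, 0, 0, 0))      # fully transparent layer
# Paint an opaque blue square into the foreground layer.
for x in range(16, 48):
    for y in range(16, 48):
        foreground.putpixel((x, y), (0, 0, 255, 255))

# Alpha-composite the layers back into a single image (bottom to top).
composite = Image.alpha_composite(background, foreground)
print(composite.getpixel((0, 0)))    # -> (255, 0, 0, 255): background shows through
print(composite.getpixel((32, 32)))  # -> (0, 0, 255, 255): foreground covers the center
```

Compositing the layers in extraction order approximately reconstructs the original image, subject to the occlusion limitations noted in the usage tips.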