---
frameworks:
- Pytorch
tasks:
- text-to-image-synthesis
license: apache-2.0
base_model:
- Qwen/Qwen-Image
base_model_relation: adapter
---

# Qwen-Image Image Structure Control Model

![](assets/title.png)

## Model Introduction

This model is a LoRA for image structure control, trained on top of [Qwen-Image](https://www.modelscope.cn/models/Qwen/Qwen-Image) using the In-Context technical approach. It supports multiple control conditions: canny, depth, lineart, softedge, normal, and openpose. The training framework is built upon [DiffSynth-Studio](https://github.com/modelscope/DiffSynth-Studio), and the training dataset is [Qwen-Image-Self-Generated-Dataset](https://www.modelscope.cn/datasets/DiffSynth-Studio/Qwen-Image-Self-Generated-Dataset).

It is recommended to start the input prompt with "Context_Control. ". Please note that openpose control, due to the nature of pose conditioning, cannot achieve the same "point-to-point" control precision as the other control types.

## Effect Demonstration

|Control Condition|Control Image|Generated Image 1|Generated Image 2|
|-|-|-|-|
|canny|![](./assets/1_canny.png)|![](./assets/canny_image_seed_1_blue.png)|![](./assets/canny_image_seed_3_pink.png)|
|depth|![](./assets/1_depth.png)|![](./assets/depth_image_seed_2_blue.png)|![](./assets/depth_image_seed_2_pink_1.png)|
|lineart|![](./assets/1_lineart.png)|![](./assets/lineart_image_seed_1_blue.png)|![](./assets/lineart_image_seed_2_pink_1.png)|
|softedge|![](./assets/1_softedge.png)|![](./assets/softedge_image_seed_2_blue.png)|![](./assets/softedge_image_seed_2_pink_1.png)|
|normal|![](./assets/1_normal.png)|![](./assets/normal_image_seed_2_blue.png)|![](./assets/normal_image_seed_2_pink_1.png)|
|openpose|![](./assets/1_openpose.png)|![](./assets/openpose_image_seed_1_blue.png)|![](./assets/openpose_image_seed_4_pink.png)|

## Inference Code

Install DiffSynth-Studio from source:

```shell
git clone https://github.com/modelscope/DiffSynth-Studio.git
cd DiffSynth-Studio
pip install -e .
```
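Before running the full pipeline, you can optionally preview a single control condition on your own image. The snippet below is a minimal sketch, assuming the annotator weights have already been downloaded to `models/Annotators` (as in the full script that follows); `input.jpg` and the output filename are placeholder paths.

```python
from PIL import Image
from diffsynth.controlnets.processors import Annotator

# Minimal sketch: preview one control condition (canny) on a local image.
# "input.jpg" is a placeholder path; most annotators also require the
# weights downloaded to models/Annotators in the full script below.
image = Image.open("input.jpg").resize((1024, 1024))
annotator = Annotator(processor_id="canny", device="cuda")
control_image = annotator(image)
control_image.save("canny_preview.png")
```

The full inference script below downloads the annotator weights, loads the base Qwen-Image pipeline together with this LoRA, and generates one image per control condition: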
```python
from PIL import Image
import torch
from modelscope import dataset_snapshot_download, snapshot_download
from diffsynth.pipelines.qwen_image import QwenImagePipeline, ModelConfig
from diffsynth.controlnets.processors import Annotator

# Download the annotator (preprocessor) weights used to extract control maps.
allow_file_pattern = [
    "sk_model.pth", "sk_model2.pth", "dpt_hybrid-midas-501f0c75.pt", "ControlNetHED.pth",
    "body_pose_model.pth", "hand_pose_model.pth", "facenet.pth", "scannet.pt",
]
snapshot_download("lllyasviel/Annotators", local_dir="models/Annotators", allow_file_pattern=allow_file_pattern)

# Load the base Qwen-Image pipeline (DiT, text encoder, VAE, and tokenizer).
pipe = QwenImagePipeline.from_pretrained(
    torch_dtype=torch.bfloat16,
    device="cuda",
    model_configs=[
        ModelConfig(model_id="Qwen/Qwen-Image", origin_file_pattern="transformer/diffusion_pytorch_model*.safetensors"),
        ModelConfig(model_id="Qwen/Qwen-Image", origin_file_pattern="text_encoder/model*.safetensors"),
        ModelConfig(model_id="Qwen/Qwen-Image", origin_file_pattern="vae/diffusion_pytorch_model.safetensors"),
    ],
    tokenizer_config=ModelConfig(model_id="Qwen/Qwen-Image", origin_file_pattern="tokenizer/"),
)

# Download and apply the structure-control LoRA.
snapshot_download("DiffSynth-Studio/Qwen-Image-In-Context-Control-Union", local_dir="models/DiffSynth-Studio/Qwen-Image-In-Context-Control-Union", allow_file_pattern="model.safetensors")
pipe.load_lora(pipe.dit, "models/DiffSynth-Studio/Qwen-Image-In-Context-Control-Union/model.safetensors")

# Download the example input image.
dataset_snapshot_download(
    dataset_id="DiffSynth-Studio/examples_in_diffsynth",
    local_dir="./",
    allow_file_pattern="data/examples/qwen-image-context-control/image.jpg",
)
origin_image = Image.open("data/examples/qwen-image-context-control/image.jpg").resize((1024, 1024))

# Generate one image per control condition.
annotator_ids = ["openpose", "canny", "depth", "lineart", "softedge", "normal"]
for annotator_id in annotator_ids:
    # Extract the control map from the source image.
    annotator = Annotator(processor_id=annotator_id, device="cuda")
    control_image = annotator(origin_image)
    control_image.save(f"{annotator.processor_id}.png")

    # Prefix the prompt with "Context_Control. " as recommended above.
    control_prompt = "Context_Control. "
    prompt = f"{control_prompt}A beautiful girl in light blue is dancing against a dreamy starry sky with interweaving light and shadow and exquisite details."
    negative_prompt = "Mesh, regular grid, blurry, low resolution, low quality, distorted, deformed, wrong anatomy, distorted hands, distorted body, distorted face, distorted hair, distorted eyes, distorted mouth"
    image = pipe(prompt, seed=1, negative_prompt=negative_prompt, context_image=control_image, height=1024, width=1024)
    image.save(f"image_{annotator.processor_id}.png")
```
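If you already have a preprocessed control map (for example, an edge or depth image), you can pass it to the pipeline directly instead of running an annotator. The snippet below is a minimal sketch, reusing the `pipe` object constructed in the script above; `my_control_map.png` is a placeholder path.

```python
from PIL import Image

# Minimal sketch: reuse `pipe` from the script above and supply a
# precomputed control map directly. "my_control_map.png" is a placeholder.
control_image = Image.open("my_control_map.png").resize((1024, 1024))
prompt = "Context_Control. A beautiful girl in light blue is dancing against a dreamy starry sky."
image = pipe(prompt, seed=1, context_image=control_image, height=1024, width=1024)
image.save("image_custom_control.png")
```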