---
base_model: stabilityai/stable-diffusion-2-1-base
library_name: diffusers
license: creativeml-openrail-m
inference: true
tags:
- stable-diffusion
- stable-diffusion-diffusers
- text-to-image
- diffusers
- controlnet
- diffusers-training
---

# controlnet-DharunSN/model_out

These are ControlNet weights trained on stabilityai/stable-diffusion-2-1-base with a new type of conditioning. You can find some example images below.

NOTE: This is a low-precision model, so image quality and characteristics may lag behind a full-precision checkpoint.

prompt: a white hoodie shirt on a size four model in a beach setting

DensePose Condition:
![images_actual](./MEN-Denim-id_00000089-18_4_full_densepose.png)
Image Generated:
![images_4](./images_4)

prompt: a green jumper shirt and white pants with a green overcoat on top

DensePose Condition:
![images_actual](./MEN-Denim-id_00000089-15_4_full_densepose.png)
Image Generated:
![images_5](./images_6.png)

## Intended uses & limitations

#### How to use

```python
from diffusers import StableDiffusionControlNetPipeline, ControlNetModel
from PIL import Image
import torch

# Load the trained ControlNet and pair it with the base model it was trained on.
# SD 2.1's own tokenizer and text encoder are loaded automatically from the base
# checkpoint, so they do not need to be constructed separately.
controlnet = ControlNetModel.from_pretrained(
    "path/to/your/controlnet-model", torch_dtype=torch.float16
)
pipe = StableDiffusionControlNetPipeline.from_pretrained(
    "stabilityai/stable-diffusion-2-1-base",
    controlnet=controlnet,
    torch_dtype=torch.float16,
).to("cuda")

# Example: generate an image of a red jacket conditioned on a pose map
pose_image = Image.open("path/to/pose.png").convert("RGB").resize((512, 512))
prompt = "a red leather jacket with silver zippers, worn on a casual street-style model"

image = pipe(prompt=prompt, image=pose_image, num_inference_steps=30).images[0]
image.save("output.png")
```
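The conditioning image passed as `image` should match the kind of pose map the model was trained on. As a minimal sketch of that preprocessing step, assuming the `controlnet_aux` package and its `lllyasviel/Annotators` OpenPose checkpoint (neither is part of this repository), a pose map can be extracted from a reference photo like this; since the checkpoints above appear to use DensePose conditioning, a DensePose annotator (e.g. detectron2's DensePose project) would be the closer match, and OpenPose here only illustrates the general flow:

```python
from controlnet_aux import OpenposeDetector
from PIL import Image

# Assumption: controlnet_aux is installed (pip install controlnet-aux); the
# OpenPose annotator weights are fetched from the lllyasviel/Annotators repo.
openpose = OpenposeDetector.from_pretrained("lllyasviel/Annotators")

source = Image.open("path/to/person_photo.png").convert("RGB")
pose_image = openpose(source)               # PIL image with the detected skeleton drawn
pose_image = pose_image.resize((512, 512))  # match the pipeline's generation resolution
pose_image.save("path/to/pose.png")         # used as `image=` in the pipeline call above
```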
#### Limitations and bias

- Pose Alignment Errors: the model may fail to align garments accurately with extremely dynamic or occluded body poses.
- Fabric Simulation: lacks realistic physical behavior of fabrics such as wrinkles, folds, or flowing movement.
- Resolution Constraints: default generation is 512×512; upscaling may lose fidelity unless further post-processing is applied.
- Model Drift in Edge Cases: struggles with rare combinations of garment types and unconventional descriptions.
- Dataset Bias: DeepFashion and related datasets often overrepresent certain body types, genders, and skin tones, which can skew model generalization.
- Style Bias: high-fashion and Western clothing styles are more common in the training data, leading to poorer performance on traditional or niche designs.

Recommendations:

- Augment training data with underrepresented demographics and clothing styles
- Use reinforcement or adversarial training to improve physics realism and fairness
- Apply domain adaptation for traditional clothing categories

## Training details

Training Data:

- DeepFashion: 400k images with clothing types, poses, and attributes (https://mmlab.ie.cuhk.edu.hk/projects/DeepFashion.html)
- DensePose / OpenPose: for extracting skeletal keypoints and human pose conditioning maps
- Text Descriptions: generated or curated captions describing clothing, structure, and materials
- Fashion-Design-10K, Fabric-Texture-2K, Fashion-Model-5K: supporting datasets for garment diversity, material realism, and body types

Training Setup:

- Hardware: NVIDIA L4 GPU (32GB VRAM)
- Precision: mixed precision (torch.float16)
- Optimizer: 8-bit AdamW with learning rate 1e-5
- Gradient Accumulation: 8–16 steps to emulate a larger batch (see the sketch after this list)
- Epochs: 1–3 depending on overfitting trends
- Batch Size: effective size of 32 (with accumulation)
- Training Duration: approx. 12–15 hours for one full epoch on 400k images
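The optimizer and accumulation settings above map directly onto Accelerate plus bitsandbytes. The loop below is a minimal, self-contained sketch of those mechanics, not the actual training script: a `torch.nn.Linear` stands in for the ControlNet and an MSE loss stands in for the diffusion noise-prediction objective, while the fp16 precision, 8 accumulation steps, and 1e-5 learning rate mirror the values listed above (bitsandbytes' 8-bit AdamW requires a CUDA device):

```python
import bitsandbytes as bnb
import torch
import torch.nn.functional as F
from accelerate import Accelerator

# Mixed precision + gradient accumulation, as in the setup above.
accelerator = Accelerator(mixed_precision="fp16", gradient_accumulation_steps=8)

# Stand-in for the trainable ControlNet (frozen UNet/VAE/text encoder omitted).
model = torch.nn.Linear(64, 64)

# 8-bit AdamW from bitsandbytes with the learning rate listed above.
optimizer = bnb.optim.AdamW8bit(model.parameters(), lr=1e-5)

model, optimizer = accelerator.prepare(model, optimizer)

for step in range(16):
    # Dummy tensors stand in for (image, pose map, caption) batches.
    batch = torch.randn(2, 64, device=accelerator.device)
    with accelerator.accumulate(model):  # Accelerate skips the real step on non-sync iterations
        loss = F.mse_loss(model(batch), torch.zeros_like(batch))
        accelerator.backward(loss)
        optimizer.step()
        optimizer.zero_grad()
```

If the card's `diffusers-training` tag reflects how this checkpoint was produced, the diffusers example script `examples/controlnet/train_controlnet.py` exposes the same knobs via `--mixed_precision fp16`, `--use_8bit_adam`, `--gradient_accumulation_steps`, and `--learning_rate`.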