---
base_model: stabilityai/stable-diffusion-2-1-base
library_name: diffusers
license: creativeml-openrail-m
inference: true
tags:
- stable-diffusion
- stable-diffusion-diffusers
- text-to-image
- diffusers
- controlnet
- diffusers-training
---
# controlnet-DharunSN/model_out
These are ControlNet weights trained on stabilityai/stable-diffusion-2-1-base with a new type of conditioning (DensePose maps).
You can find some example images below.
NOTE: This model was trained at reduced precision, so image quality and characteristics may lag behind full-precision results.
prompt: a white hoodie shirt on a size four model in a beach setting
DensePose Condition:

Image Generated:

prompt: a green jumper shirt and white pants with a green overcoat on top

Image Generated:

## Intended uses & limitations
#### How to use
```python
from diffusers import StableDiffusionControlNetPipeline, ControlNetModel
from PIL import Image
import torch

# Load the trained ControlNet weights
controlnet = ControlNetModel.from_pretrained(
    "path/to/your/controlnet-model", torch_dtype=torch.float16
)

# Build the pipeline on top of the base model this ControlNet was trained on;
# the tokenizer, text encoder, VAE, and UNet are loaded from that checkpoint automatically
pipe = StableDiffusionControlNetPipeline.from_pretrained(
    "stabilityai/stable-diffusion-2-1-base",
    controlnet=controlnet,
    torch_dtype=torch.float16,
).to("cuda")

# Example: generate an image of a red jacket conditioned on a DensePose map
pose_image = Image.open("path/to/pose.png").convert("RGB").resize((512, 512))
prompt = "a red leather jacket with silver zippers, worn on a casual street-style model"
image = pipe(prompt=prompt, image=pose_image, num_inference_steps=30).images[0]
image.save("output.png")
```
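The pipeline loads the tokenizer, text encoder, VAE, and UNet from the stabilityai/stable-diffusion-2-1-base checkpoint automatically; only the ControlNet weights need to be supplied separately. The conditioning image is resized to the 512×512 resolution noted under Limitations.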
#### Limitations and bias
- Pose Alignment Errors: The model may fail to accurately align garments with extremely dynamic or occluded body poses.
- Fabric Simulation: Lacks realistic physical behavior of fabrics like wrinkles, folds, or flowing movement.
- Resolution Constraints: Default generation is 512×512. Upscaling may lose fidelity unless further post-processing is used (one possible post-processing step is sketched after this list).
- Model Drift in Edge Cases: Struggles with rare combinations of garment types and unconventional descriptions.
- Dataset Bias: DeepFashion and related datasets often overrepresent certain body types, genders, and skin tones, which can skew model generalization.
- Style Bias: High fashion or Western clothing styles are more common in training data, leading to poorer performance for traditional or niche designs.
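As one possible post-processing step for the resolution limitation above, a 512×512 output can be passed through a public latent upscaler. This is an illustrative sketch only; the upscaler is a separate checkpoint and is not part of this model's training or release.

```python
import torch
from PIL import Image
from diffusers import StableDiffusionLatentUpscalePipeline

# Load the public 2x latent upscaler (a separate checkpoint, shown only as an example)
upscaler = StableDiffusionLatentUpscalePipeline.from_pretrained(
    "stabilityai/sd-x2-latent-upscaler", torch_dtype=torch.float16
).to("cuda")

# Upscale the 512x512 ControlNet output produced earlier to 1024x1024
low_res = Image.open("output.png").convert("RGB")
prompt = "a red leather jacket with silver zippers, worn on a casual street-style model"
upscaled = upscaler(prompt=prompt, image=low_res, num_inference_steps=20).images[0]
upscaled.save("output_1024.png")
```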
Recommendations:
- Augment training data with underrepresented demographics and clothing styles.
- Use reinforcement or adversarial training to improve physics realism and fairness.
- Apply domain adaptation for traditional clothing categories.
## Training details
Training Data:
- DeepFashion: 400k images with clothing types, poses, and attributes (https://mmlab.ie.cuhk.edu.hk/projects/DeepFashion.html)
- DensePose / OpenPose: for extracting skeletal keypoints and human pose conditioning maps
- Text Descriptions: generated or curated captions describing clothing, structure, and materials
- Fashion-Design-10K, Fabric-Texture-2K, Fashion-Model-5K: supporting datasets for garment diversity, material realism, and body types
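To make the data layout above concrete, the sketch below shows one way a training example could pair a garment image, its DensePose/OpenPose conditioning map, and a caption. The directory layout, file names, and the `FashionControlNetDataset` class are illustrative assumptions, not the actual preprocessing used for this model.

```python
import json
from pathlib import Path
from PIL import Image
from torch.utils.data import Dataset

class FashionControlNetDataset(Dataset):
    """Pairs each garment image with its pose conditioning map and caption.

    Assumed layout (illustrative only):
        data/images/000001.jpg          # DeepFashion garment photo
        data/conditioning/000001.png    # DensePose/OpenPose map for the same photo
        data/captions.json              # {"000001": "a white hoodie shirt ...", ...}
    """

    def __init__(self, root="data", resolution=512):
        self.root = Path(root)
        self.captions = json.loads((self.root / "captions.json").read_text())
        self.ids = sorted(self.captions)
        self.resolution = resolution

    def __len__(self):
        return len(self.ids)

    def __getitem__(self, idx):
        sample_id = self.ids[idx]
        size = (self.resolution, self.resolution)
        image = Image.open(self.root / "images" / f"{sample_id}.jpg").convert("RGB").resize(size)
        conditioning = Image.open(self.root / "conditioning" / f"{sample_id}.png").convert("RGB").resize(size)
        # Tensor conversion and normalization are omitted here for brevity.
        return {
            "pixel_values": image,                       # target image
            "conditioning_pixel_values": conditioning,   # pose map fed to the ControlNet
            "caption": self.captions[sample_id],
        }
```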
Training Setup:
- Hardware: NVIDIA L4 GPU (32GB VRAM)
- Precision: mixed precision (torch.float16)
- Optimizer: 8-bit AdamW with learning rate 1e-5
- Gradient Accumulation: 8–16 steps to support large batch emulation
- Epochs: 1–3 depending on overfitting trends
- Batch Size: effective size of 32 (with accumulation)
- Training Duration: approx. 12–15 hours for one full epoch on 400k images
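As a minimal sketch of the optimizer and accumulation setup listed above (assuming bitsandbytes for the 8-bit AdamW and diffusers for the models; `compute_controlnet_loss` and `train_dataloader` are hypothetical placeholders, and this is not the exact training script that was used):

```python
import torch
import bitsandbytes as bnb
from diffusers import ControlNetModel, UNet2DConditionModel

# Initialize the ControlNet from the base model's UNet (standard ControlNet practice);
# only the ControlNet is trained, the UNet stays frozen.
unet = UNet2DConditionModel.from_pretrained(
    "stabilityai/stable-diffusion-2-1-base", subfolder="unet"
).to("cuda")
unet.requires_grad_(False)
controlnet = ControlNetModel.from_unet(unet).to("cuda")

# 8-bit AdamW at the learning rate listed above
optimizer = bnb.optim.AdamW8bit(controlnet.parameters(), lr=1e-5)

# 8-16 accumulation steps emulate the effective batch size of 32 listed above
gradient_accumulation_steps = 16

for step, batch in enumerate(train_dataloader):  # train_dataloader: hypothetical DataLoader
    with torch.autocast("cuda", dtype=torch.float16):  # mixed precision
        loss = compute_controlnet_loss(controlnet, unet, batch)  # hypothetical loss helper
    (loss / gradient_accumulation_steps).backward()
    if (step + 1) % gradient_accumulation_steps == 0:
        optimizer.step()
        optimizer.zero_grad()
```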