---
base_model: stabilityai/stable-diffusion-2-1-base
library_name: diffusers
license: creativeml-openrail-m
inference: true
tags:
- stable-diffusion
- stable-diffusion-diffusers
- text-to-image
- diffusers
- controlnet
- diffusers-training
---

<!-- This model card has been generated automatically according to the information the training script had access to. You
should probably proofread and complete it, then remove this comment. -->


# controlnet-DharunSN/model_out

These are ControlNet weights trained on stabilityai/stable-diffusion-2-1-base with a new type of conditioning.
You can find some example images below.
NOTE: This model was trained at low precision, so image quality and characteristics may lag behind full-precision results.

prompt: a white hoodie shirt on a size four model in a beach setting
DensePose Condition:
![images_actual](./MEN-Denim-id_00000089-18_4_full_densepose.png)
Image Generated:
![images_4](./images_4)

prompt: a green jumper shirt and white pants with a green overcoat on top
DensePose Condition:
![images_actual](./MEN-Denim-id_00000089-15_4_full_densepose.png)
Image Generated:
![images_5](./images_6.png)



## Intended uses & limitations

#### How to use

```python
from diffusers import StableDiffusionControlNetPipeline, ControlNetModel
from PIL import Image
import torch

# Load the trained ControlNet weights (fp16 to match the pipeline below)
controlnet = ControlNetModel.from_pretrained(
    "path/to/your/controlnet-model", torch_dtype=torch.float16
)

# Build the pipeline on the same base model these weights were trained on;
# the base checkpoint already bundles the matching tokenizer and text encoder.
pipe = StableDiffusionControlNetPipeline.from_pretrained(
    "stabilityai/stable-diffusion-2-1-base",
    controlnet=controlnet,
    torch_dtype=torch.float16,
).to("cuda")

# Example: generate a red jacket conditioned on a DensePose map
pose_image = Image.open("path/to/pose.png").convert("RGB").resize((512, 512))
prompt = "a red leather jacket with silver zippers, worn on a casual street-style model"

# StableDiffusionControlNetPipeline takes the conditioning map as `image`
image = pipe(prompt=prompt, image=pose_image, num_inference_steps=30).images[0]
image.save("output.png")
```

#### Limitations and bias

- **Pose Alignment Errors:** The model may fail to accurately align garments with extremely dynamic or occluded body poses.
- **Fabric Simulation:** Lacks realistic physical behavior of fabric, such as wrinkles, folds, or flowing movement.
- **Resolution Constraints:** Default generation is 512×512; upscaling may lose fidelity unless further post-processing is applied (see the sketch after this list).
- **Model Drift in Edge Cases:** Struggles with rare combinations of garment types and unconventional descriptions.
- **Dataset Bias:** DeepFashion and related datasets often overrepresent certain body types, genders, and skin tones, which can skew model generalization.
- **Style Bias:** High fashion and Western clothing styles are more common in the training data, leading to poorer performance on traditional or niche designs.
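
As one possible post-processing route around the 512×512 constraint, a diffusion upscaler can be chained after generation. This is a minimal sketch assuming the publicly available stabilityai/stable-diffusion-x4-upscaler checkpoint; it is not part of this model's pipeline:

```python
from diffusers import StableDiffusionUpscalePipeline
from PIL import Image
import torch

# Hypothetical post-processing step: 4x super-resolution of the 512x512 output.
upscaler = StableDiffusionUpscalePipeline.from_pretrained(
    "stabilityai/stable-diffusion-x4-upscaler", torch_dtype=torch.float16
).to("cuda")

low_res = Image.open("output.png").convert("RGB")
# Re-using the generation prompt tends to preserve garment detail.
upscaled = upscaler(
    prompt="a red leather jacket with silver zippers", image=low_res
).images[0]
upscaled.save("output_2048.png")
```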

Recommendations:

- Augment training data with underrepresented demographics and clothing styles.
- Use reinforcement or adversarial training to improve physics realism and fairness.
- Apply domain adaptation for traditional clothing categories.

## Training details

Training Data:

- **DeepFashion:** 400k images with clothing types, poses, and attributes (https://mmlab.ie.cuhk.edu.hk/projects/DeepFashion.html)
- **DensePose / OpenPose:** used for extracting skeletal keypoints and human pose conditioning maps
- **Text Descriptions:** generated or curated captions describing clothing, structure, and materials (one example triplet is sketched after this list)
- **Fashion-Design-10K, Fabric-Texture-2K, Fashion-Model-5K:** supporting datasets for garment diversity, material realism, and body types
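
A minimal sketch of one such image/condition/caption triplet, assuming the default column names (`image`, `conditioning_image`, `text`) of the diffusers ControlNet training example; the photo filename is hypothetical, while the conditioning map and caption come from the samples above:

```python
from datasets import Dataset

# One illustrative training record: the photo (hypothetical filename), its
# DensePose conditioning map, and a caption describing the clothing.
records = {
    "image": ["MEN-Denim-id_00000089-18_4_full.png"],
    "conditioning_image": ["MEN-Denim-id_00000089-18_4_full_densepose.png"],
    "text": ["a white hoodie shirt on a size four model in a beach setting"],
}
dataset = Dataset.from_dict(records)
```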

Training Setup:

- **Hardware:** NVIDIA L4 GPU (32GB VRAM)
- **Precision:** mixed precision (torch.float16)
- **Optimizer:** 8-bit AdamW with learning rate 1e-5 (see the sketch after this list)
- **Gradient Accumulation:** 8–16 steps to emulate a larger batch
- **Epochs:** 1–3, depending on overfitting trends
- **Batch Size:** effective size of 32 (with accumulation)
- **Training Duration:** approx. 12–15 hours for one full epoch on 400k images
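
A minimal sketch of how the optimizer and effective batch size above could be wired up; the micro-batch of 4 is an assumption, while the 8-bit AdamW, the 1e-5 learning rate, and the accumulation factor come from the setup listed:

```python
import bitsandbytes as bnb
from diffusers import ControlNetModel, UNet2DConditionModel

# Initialize the trainable ControlNet branch from the frozen base UNet
# (the standard ControlNet training setup).
unet = UNet2DConditionModel.from_pretrained(
    "stabilityai/stable-diffusion-2-1-base", subfolder="unet"
)
controlnet = ControlNetModel.from_unet(unet)

# 8-bit AdamW at lr 1e-5 over the ControlNet parameters only.
optimizer = bnb.optim.AdamW8bit(controlnet.parameters(), lr=1e-5)

# Effective batch size of 32 via gradient accumulation:
micro_batch_size = 4             # assumed per-step micro-batch
gradient_accumulation_steps = 8  # 4 * 8 = 32 samples per optimizer update
```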