|
|
--- |
|
|
base_model: stabilityai/stable-diffusion-2-1-base |
|
|
library_name: diffusers |
|
|
license: creativeml-openrail-m |
|
|
inference: true |
|
|
tags: |
|
|
- stable-diffusion |
|
|
- stable-diffusion-diffusers |
|
|
- text-to-image |
|
|
- diffusers |
|
|
- controlnet |
|
|
- diffusers-training |
|
|
--- |
|
|
|
|
|
|
|
|
|
|
|
|
|
# controlnet-DharunSN/model_out |
|
|
|
|
|
These are ControlNet weights trained on stabilityai/stable-diffusion-2-1-base with DensePose conditioning.

You can find some example images below.

NOTE: This model was trained at reduced precision, so image quality and output characteristics may lag behind full-precision models.
|
|
|
|
|
Prompt: a white hoodie shirt on a size four model in a beach setting
|
|
DensePose Condition: |
|
|
 |
|
|
Image Generated: |
|
|
 |
|
|
Prompt: a green jumper shirt and white pants with a green overcoat on top

DensePose Condition:


|
|
Image Generated: |
|
|
 |
|
|
|
|
|
|
|
|
|
|
|
## Intended uses & limitations |
|
|
|
|
|
#### How to use |
|
|
|
|
|
```python
from diffusers import StableDiffusionControlNetPipeline, ControlNetModel
from PIL import Image
import torch

# Load the trained ControlNet weights
controlnet = ControlNetModel.from_pretrained(
    "path/to/your/controlnet-model", torch_dtype=torch.float16
)

# Load the base pipeline; stable-diffusion-2-1-base bundles its own
# tokenizer and text encoder, so they do not need to be loaded separately
pipe = StableDiffusionControlNetPipeline.from_pretrained(
    "stabilityai/stable-diffusion-2-1-base",
    controlnet=controlnet,
    torch_dtype=torch.float16,
).to("cuda")

# Example: generate an image of a red jacket conditioned on a pose map
pose_image = Image.open("path/to/pose.png").convert("RGB").resize((512, 512))
prompt = "a red leather jacket with silver zippers, worn on a casual street-style model"

image = pipe(prompt=prompt, image=pose_image, num_inference_steps=30).images[0]
image.save("output.png")
```
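
If GPU memory is tight, diffusers' CPU offloading (which requires accelerate and replaces the `.to("cuda")` call above) is an optional alternative:

```python
# Optional: offload submodules to CPU between forward passes to save VRAM.
pipe.enable_model_cpu_offload()
```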
|
|
|
|
|
#### Limitations and bias |
|
|
|
|
|
- Pose Alignment Errors: The model may fail to accurately align garments with extremely dynamic or occluded body poses.

- Fabric Simulation: Lacks realistic physical behavior of fabrics such as wrinkles, folds, or flowing movement.

- Resolution Constraints: Default generation is 512×512; upscaling may lose fidelity unless further post-processing is applied (see the sketch after this list).

- Model Drift in Edge Cases: Struggles with rare combinations of garment types and unconventional descriptions.

- Dataset Bias: DeepFashion and related datasets often overrepresent certain body types, genders, and skin tones, which can skew model generalization.

- Style Bias: High-fashion and Western clothing styles are more common in the training data, leading to poorer performance on traditional or niche designs.
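
A minimal post-processing sketch for the resolution limitation, assuming the stabilityai/stable-diffusion-x4-upscaler checkpoint is used (an assumption, not part of this model's training); note that upscaling a full 512×512 output is memory-intensive:

```python
from diffusers import StableDiffusionUpscalePipeline
import torch

# Hypothetical post-processing: 4x upscaling of the 512x512 output above.
upscaler = StableDiffusionUpscalePipeline.from_pretrained(
    "stabilityai/stable-diffusion-x4-upscaler", torch_dtype=torch.float16
).to("cuda")

# `prompt` and `image` are the prompt and PIL image from the
# ControlNet pipeline example in "How to use" above.
upscaled = upscaler(prompt=prompt, image=image).images[0]
upscaled.save("output_upscaled.png")
```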
|
|
|
|
|
Recommendations:

- Augment training data with underrepresented demographics and clothing styles

- Use reinforcement or adversarial training to improve physics realism and fairness

- Apply domain adaptation for traditional clothing categories
|
|
|
|
|
## Training details |
|
|
|
|
|
Training Data:

- DeepFashion: 400k images with clothing types, poses, and attributes (https://mmlab.ie.cuhk.edu.hk/projects/DeepFashion.html)

- DensePose / OpenPose: for extracting skeletal keypoints and human pose conditioning maps (one way to produce such maps is sketched after this list)

- Text Descriptions: generated or curated captions describing clothing, structure, and materials

- Fashion-Design-10K, Fabric-Texture-2K, Fashion-Model-5K: supporting datasets for garment diversity, material realism, and body types
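
As an illustration of how pose conditioning maps can be produced (the controlnet_aux package and the lllyasviel/Annotators checkpoint are assumptions, not artifacts of this training run; DensePose maps specifically require detectron2's DensePose project, so the simpler OpenPose variant is shown):

```python
from controlnet_aux import OpenposeDetector
from PIL import Image

# One way to extract a pose conditioning map from a fashion photo.
detector = OpenposeDetector.from_pretrained("lllyasviel/Annotators")
photo = Image.open("path/to/fashion_photo.jpg").convert("RGB")
pose_map = detector(photo)
pose_map.save("pose_condition.png")
```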
|
|
|
|
|
Training Setup:

- Hardware: NVIDIA L4 GPU (32GB VRAM)

- Precision: mixed precision (torch.float16)

- Optimizer: 8-bit AdamW with learning rate 1e-5 (sketched below)

- Gradient Accumulation: 8–16 steps to emulate a larger batch size

- Epochs: 1–3, depending on overfitting trends

- Batch Size: effective size of 32 (with accumulation)

- Training Duration: approx. 12–15 hours for one full epoch on 400k images
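
In practice, a run like this typically goes through diffusers' train_controlnet.py example script; the loop below is only a minimal sketch of the optimizer and accumulation mechanics described above, assuming bitsandbytes for the 8-bit AdamW, with `dataloader` and `compute_loss` as hypothetical placeholders:

```python
import torch
import bitsandbytes as bnb

# Sketch of the setup above: 8-bit AdamW at lr 1e-5, fp16 autocast,
# and gradient accumulation to emulate an effective batch size of 32.
optimizer = bnb.optim.AdamW8bit(controlnet.parameters(), lr=1e-5)
accumulation_steps = 8  # 8-16 in the actual runs

optimizer.zero_grad()
for step, batch in enumerate(dataloader):  # hypothetical DataLoader
    with torch.autocast("cuda", dtype=torch.float16):
        loss = compute_loss(controlnet, batch)  # hypothetical loss fn
    (loss / accumulation_steps).backward()
    if (step + 1) % accumulation_steps == 0:
        optimizer.step()
        optimizer.zero_grad()
```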