|
|
--- |
|
|
base_model: stabilityai/stable-diffusion-2-1-base |
|
|
library_name: diffusers |
|
|
license: creativeml-openrail-m |
|
|
inference: true |
|
|
tags: |
|
|
- stable-diffusion |
|
|
- stable-diffusion-diffusers |
|
|
- text-to-image |
|
|
- diffusers |
|
|
- controlnet |
|
|
- diffusers-training |
|
|
--- |
|
|
|
|
|
|
|
|
|
|
|
|
|
# controlnet-DharunSN/model_out |
|
|
|
|
|
These are ControlNet weights trained on stabilityai/stable-diffusion-2-1-base with DensePose conditioning.

You can find some example images below.

NOTE: This model was trained at reduced precision, so image quality and output characteristics may lag behind full-precision models.
|
|
|
|
|
Prompt: a white hoodie shirt on a size four model in a beach setting
|
|
DensePose Condition: |
|
|
 |
|
|
Image Generated: |
|
|
 |
|
|
Prompt: a green jumper shirt and white pants with a green overcoat on top

DensePose Condition:


|
|
Image Generated: |
|
|
 |
|
|
|
|
|
|
|
|
|
|
|
## Intended uses & limitations |
|
|
|
|
|
#### How to use |
|
|
|
|
|
```python
from diffusers import StableDiffusionControlNetPipeline, ControlNetModel
from PIL import Image
import torch

# Load the trained ControlNet weights
controlnet = ControlNetModel.from_pretrained(
    "path/to/your/controlnet-model", torch_dtype=torch.float16
)

# Load the base pipeline; stable-diffusion-2-1-base bundles its own
# tokenizer and text encoder, so they do not need to be loaded separately
pipe = StableDiffusionControlNetPipeline.from_pretrained(
    "stabilityai/stable-diffusion-2-1-base",
    controlnet=controlnet,
    torch_dtype=torch.float16,
).to("cuda")

# Example: generate an image of a red jacket conditioned on a pose map
pose_image = Image.open("path/to/pose.png").convert("RGB").resize((512, 512))
prompt = "a red leather jacket with silver zippers, worn on a casual street-style model"

image = pipe(prompt=prompt, image=pose_image, num_inference_steps=30).images[0]
image.save("output.png")
```
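
If GPU memory is tight, diffusers' CPU offloading (which requires accelerate and replaces the `.to("cuda")` call above) is an optional alternative:

```python
# Optional: offload submodules to CPU between forward passes to save VRAM.
pipe.enable_model_cpu_offload()
```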
|
|
|
|
|
#### Limitations and bias |
|
|
|
|
|
- Pose Alignment Errors: The model may fail to accurately align garments with extremely dynamic or occluded body poses.

- Fabric Simulation: Lacks realistic physical behavior of fabrics such as wrinkles, folds, or flowing movement.

- Resolution Constraints: Default generation is 512×512; upscaling may lose fidelity unless further post-processing is applied (see the sketch after this list).

- Model Drift in Edge Cases: Struggles with rare combinations of garment types and unconventional descriptions.

- Dataset Bias: DeepFashion and related datasets often overrepresent certain body types, genders, and skin tones, which can skew model generalization.

- Style Bias: High-fashion and Western clothing styles are more common in the training data, leading to poorer performance on traditional or niche designs.
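
A minimal post-processing sketch for the resolution limitation, assuming the stabilityai/stable-diffusion-x4-upscaler checkpoint is used (an assumption, not part of this model's training); note that upscaling a full 512×512 output is memory-intensive:

```python
from diffusers import StableDiffusionUpscalePipeline
import torch

# Hypothetical post-processing: 4x upscaling of the 512x512 output above.
upscaler = StableDiffusionUpscalePipeline.from_pretrained(
    "stabilityai/stable-diffusion-x4-upscaler", torch_dtype=torch.float16
).to("cuda")

# `prompt` and `image` are the prompt and PIL image from the
# ControlNet pipeline example in "How to use" above.
upscaled = upscaler(prompt=prompt, image=image).images[0]
upscaled.save("output_upscaled.png")
```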
|
|
|
|
|
Recommendations:

- Augment training data with underrepresented demographics and clothing styles

- Use reinforcement or adversarial training to improve physics realism and fairness

- Apply domain adaptation for traditional clothing categories
|
|
|
|
|
## Training details |
|
|
|
|
|
Training Data:

- DeepFashion: 400k images with clothing types, poses, and attributes (https://mmlab.ie.cuhk.edu.hk/projects/DeepFashion.html)

- DensePose / OpenPose: for extracting skeletal keypoints and human pose conditioning maps (one way to produce such maps is sketched after this list)

- Text Descriptions: generated or curated captions describing clothing, structure, and materials

- Fashion-Design-10K, Fabric-Texture-2K, Fashion-Model-5K: supporting datasets for garment diversity, material realism, and body types
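
As an illustration of how pose conditioning maps can be produced (the controlnet_aux package and the lllyasviel/Annotators checkpoint are assumptions, not artifacts of this training run; DensePose maps specifically require detectron2's DensePose project, so the simpler OpenPose variant is shown):

```python
from controlnet_aux import OpenposeDetector
from PIL import Image

# One way to extract a pose conditioning map from a fashion photo.
detector = OpenposeDetector.from_pretrained("lllyasviel/Annotators")
photo = Image.open("path/to/fashion_photo.jpg").convert("RGB")
pose_map = detector(photo)
pose_map.save("pose_condition.png")
```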
|
|
|
|
|
Training Setup:

- Hardware: NVIDIA L4 GPU (32GB VRAM)

- Precision: mixed precision (torch.float16)

- Optimizer: 8-bit AdamW with learning rate 1e-5 (sketched below)

- Gradient Accumulation: 8–16 steps to emulate a larger batch size

- Epochs: 1–3, depending on overfitting trends

- Batch Size: effective size of 32 (with accumulation)

- Training Duration: approx. 12–15 hours for one full epoch on 400k images
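
In practice, a run like this typically goes through diffusers' train_controlnet.py example script; the loop below is only a minimal sketch of the optimizer and accumulation mechanics described above, assuming bitsandbytes for the 8-bit AdamW, with `dataloader` and `compute_loss` as hypothetical placeholders:

```python
import torch
import bitsandbytes as bnb

# Sketch of the setup above: 8-bit AdamW at lr 1e-5, fp16 autocast,
# and gradient accumulation to emulate an effective batch size of 32.
optimizer = bnb.optim.AdamW8bit(controlnet.parameters(), lr=1e-5)
accumulation_steps = 8  # 8-16 in the actual runs

optimizer.zero_grad()
for step, batch in enumerate(dataloader):  # hypothetical DataLoader
    with torch.autocast("cuda", dtype=torch.float16):
        loss = compute_loss(controlnet, batch)  # hypothetical loss fn
    (loss / accumulation_steps).backward()
    if (step + 1) % accumulation_steps == 0:
        optimizer.step()
        optimizer.zero_grad()
```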