	---
	license: mit
	tags:
	- stable-diffusion-xl
	- sdxl
	- ip-adapter
	- product-photography
	- e-commerce
	- text-to-image
	base_model: stabilityai/stable-diffusion-xl-base-1.0
	library_name: diffusers
	---

	# StudioDiffusion IP-Adapter (Shopify / Etsy / eBay)

	Three IP-Adapter weight sets trained on top of [Stable Diffusion XL](https://huggingface.co/stabilityai/stable-diffusion-xl-base-1.0), each targeting a distinct e-commerce platform aesthetic:

	- Shopify: clean white / neutral backgrounds, studio lighting, minimal props, high-contrast subject separation.
	- Etsy: warm color temperature, lifestyle / craft props, natural light, textured surfaces, artisanal hand-crafted feel.
	- eBay: bright even lighting, plain or gradient background, sharp focus on subject, utilitarian clarity.

	Companion code and training pipeline: https://github.com/s-zx/StudioDiffusion

	## Repository layout

	| Path | Contents |
	|---|---|
	| `shopify/final/{image_proj_model,ip_attn_processors}.pt` | Shopify checkpoint @ step 3000 |
	| `shopify/train.log` | Shopify validation loss, logged every 250 steps |
	| `etsy/final/{image_proj_model,ip_attn_processors}.pt` | Etsy checkpoint @ step 3000 |
	| `etsy/checkpoint-500/{image_proj_model,ip_attn_processors}.pt` | Recommended Etsy checkpoint: closest saved checkpoint to the val-loss optimum (step 750), before mild overfit |
	| `etsy/train.log` | Etsy validation loss, logged every 250 steps |
	| `ebay/final/{image_proj_model,ip_attn_processors}.pt` | eBay checkpoint @ step 3000 |
	| `ebay/train.log` | eBay validation loss, logged every 250 steps |

	Each checkpoint follows the `IPAdapterSDXL.save_pretrained` format defined in [`adapters/ip_adapter/model.py`](https://github.com/s-zx/StudioDiffusion/blob/main/adapters/ip_adapter/model.py). Two files per checkpoint: `image_proj_model.pt` (CLIP-embed → token projection) and `ip_attn_processors.pt` (injected K/V weights for every cross-attention block of the SDXL UNet).
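Before loading, it can help to confirm that a downloaded checkpoint directory actually holds both files. The following is a minimal sketch: the file names come from the layout table above, while the helper name `missing_checkpoint_files` and the example path are illustrative assumptions, not part of the companion repo.

```python
from pathlib import Path

# File names taken from the repository layout table above.
CHECKPOINT_FILES = ("image_proj_model.pt", "ip_attn_processors.pt")


def missing_checkpoint_files(checkpoint_dir):
    """Return the expected checkpoint files that are absent from checkpoint_dir.

    Hypothetical helper for illustration; it only checks file presence and
    does not open or validate the weights.
    """
    root = Path(checkpoint_dir)
    return [name for name in CHECKPOINT_FILES if not (root / name).is_file()]


# Example path, assuming the download location used elsewhere in this card.
missing = missing_checkpoint_files("checkpoints/ip_adapter/shopify/final")
print("missing files:", missing or "none")
```

An empty list means both files are present; anything else usually points at an incomplete or filtered download.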

	## Usage

	### Download

	```python
	from huggingface_hub import snapshot_download

	# Full set (~5.6 GB)
	snapshot_download(
	    repo_id="jasonshen8848/StudioDiffusion-ip-adapter",
	    local_dir="checkpoints/ip_adapter",
	)

	# Single platform (~1.4 GB)
	snapshot_download(
	    repo_id="jasonshen8848/StudioDiffusion-ip-adapter",
	    local_dir="checkpoints/ip_adapter",
	    allow_patterns=["shopify/final/*", "shopify/train.log"],
	)
	```
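The same `allow_patterns` approach can target any platform or checkpoint directory, for example the recommended `etsy/checkpoint-500/`. The sketch below builds the pattern list and does a local dry run with `fnmatch` against the paths from the layout table. The helper name `platform_patterns` is hypothetical, and the dry run assumes the Hub's matching behaves like `fnmatch` for these simple glob patterns.

```python
from fnmatch import fnmatch


def platform_patterns(platform, checkpoint="final"):
    """Build allow_patterns for one platform's checkpoint plus its training log.

    Hypothetical helper; directory names follow the repository layout table.
    """
    return [f"{platform}/{checkpoint}/*", f"{platform}/train.log"]


# Paths listed in the repository layout table.
repo_files = [
    "shopify/final/image_proj_model.pt",
    "shopify/final/ip_attn_processors.pt",
    "shopify/train.log",
    "etsy/final/image_proj_model.pt",
    "etsy/final/ip_attn_processors.pt",
    "etsy/checkpoint-500/image_proj_model.pt",
    "etsy/checkpoint-500/ip_attn_processors.pt",
    "etsy/train.log",
    "ebay/final/image_proj_model.pt",
    "ebay/final/ip_attn_processors.pt",
    "ebay/train.log",
]

# Dry run: which files would the recommended Etsy checkpoint download select?
patterns = platform_patterns("etsy", checkpoint="checkpoint-500")
selected = [f for f in repo_files if any(fnmatch(f, p) for p in patterns)]
print(selected)
```

The resulting list can be passed as `allow_patterns=patterns` to `snapshot_download`.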

	### Generate — minimal inference example

	A complete working example is at [`inference/smoke.py`](https://github.com/s-zx/StudioDiffusion/blob/main/inference/smoke.py). Core pattern:

	```python
	import torch
	from diffusers import StableDiffusionXLPipeline, AutoencoderKL
	from PIL import Image
	from torchvision import transforms

	from adapters.ip_adapter.model import IPAdapterSDXL  # from the GitHub repo

	# fp16 inference on MPS; for CUDA, switch device to "cuda" and keep the dtype.
	device, dtype = "mps", torch.float16

	# SDXL base pipeline with the fp16-safe VAE.
	pipe = StableDiffusionXLPipeline.from_pretrained(
	    "stabilityai/stable-diffusion-xl-base-1.0",
	    vae=AutoencoderKL.from_pretrained(
	        "madebyollin/sdxl-vae-fp16-fix", torch_dtype=dtype,
	    ),
	    torch_dtype=dtype,
	).to(device)

	# Load the adapter weights and attach them to the pipeline's UNet.
	adapter = IPAdapterSDXL.load_pretrained(
	    unet=pipe.unet,
	    load_directory="checkpoints/ip_adapter/shopify/final",
	    image_encoder_id="openai/clip-vit-large-patch14-336",
	    num_tokens=16,
	    adapter_scale=1.0,
	).to(device=device, dtype=dtype)

	# CLIP preprocessing for the 336x336 image encoder.
	clip_transform = transforms.Compose([
	    transforms.Resize(336, interpolation=transforms.InterpolationMode.BICUBIC),
	    transforms.CenterCrop(336),
	    transforms.ToTensor(),
	    transforms.Normalize(
	        mean=[0.48145466, 0.4578275, 0.40821073],
	        std=[0.26862954, 0.26130258, 0.27577711],
	    ),
	])

	# Encode the reference image into conditional / unconditional image tokens.
	ref = Image.open("my_product.jpg").convert("RGB")
	clip_input = clip_transform(ref).unsqueeze(0).to(device=device, dtype=dtype)
	with torch.no_grad():
	    cond_ip, uncond_ip = adapter.encode_image(clip_input)
	    ip_hidden_states = torch.cat([uncond_ip, cond_ip], dim=0)  # [uncond, cond] for CFG

	image = pipe(
	    prompt="a professional product photograph",
	    negative_prompt="blurry, low quality, distorted, artifacts",
	    num_inference_steps=30,
	    guidance_scale=7.5,
	    height=512, width=512,
	    cross_attention_kwargs={"ip_hidden_states": ip_hidden_states},
	).images[0]
	image.save("out.png")
	```
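The `[uncond, cond]` ordering matters because, with classifier-free guidance enabled, the diffusers SDXL pipeline batches the negative and positive branches together and treats the first half of the batch as unconditional. The toy sketch below uses plain Python lists instead of tensors to illustrate the guidance formula and what happens if the two halves are swapped; the function name `cfg_combine` and the numbers are illustrative only.

```python
def cfg_combine(uncond, cond, guidance_scale):
    """Classifier-free guidance on toy vectors.

    Starts from the unconditional prediction and moves toward the
    conditional one by a factor of guidance_scale.
    """
    return [u + guidance_scale * (c - u) for u, c in zip(uncond, cond)]


# Toy "batch" in the [uncond, cond] order used for ip_hidden_states.
batch = [[0.0, 1.0], [1.0, 3.0]]
uncond, cond = batch  # first half is treated as the unconditional branch

guided = cfg_combine(uncond, cond, guidance_scale=7.5)
swapped = cfg_combine(cond, uncond, guidance_scale=7.5)  # wrong order

print("correct order:", guided)
print("swapped order:", swapped)
```

With the halves swapped, guidance pushes away from the reference image instead of toward it, which is why the concatenation order should match the pipeline's convention.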

	## Training summary

	| | Shopify | Etsy | eBay |
	|---|---|---|---|
	| Train images | 353 | 325 | 518 |
	| Val images | 88 | 81 | 129 |
	| Start val loss (step 250) | 0.073747 | 0.131454 | 0.058868 |
	| End val loss (step 3000) | 0.072500 | 0.132335 | 0.055920 |
	| Best val loss | 0.072463 @ step 2000 | 0.131412 @ step 750 | 0.055920 @ step 3000 |
	| Δ val loss | −1.7% ↓ | +0.7% ↑ (mild overfit) | −5.0% ↓ |
	| Wall-clock | ~9 h | ~9 h | ~9 h |
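The Δ val loss row can be reproduced from the start (step 250) and end (step 3000) values in the table as a relative change. A short sketch, with the numbers copied from the table and the helper name `relative_change_pct` chosen for illustration:

```python
# (start val loss at step 250, end val loss at step 3000), from the table above.
val_loss = {
    "Shopify": (0.073747, 0.072500),
    "Etsy": (0.131454, 0.132335),
    "eBay": (0.058868, 0.055920),
}


def relative_change_pct(start, end):
    """Relative change from start to end, in percent (negative = improvement)."""
    return 100.0 * (end - start) / start


for platform, (start, end) in val_loss.items():
    print(f"{platform}: {relative_change_pct(start, end):+.1f}%")
```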

	Hyperparameters (identical across platforms):

	- Base: `stabilityai/stable-diffusion-xl-base-1.0`
	- VAE: `madebyollin/sdxl-vae-fp16-fix`
	- Image encoder: `openai/clip-vit-large-patch14-336` (frozen)
	- Optimizer: AdamW, lr=1e-4, (β₁, β₂)=(0.9, 0.999), wd=0.01
	- LR schedule: cosine with 200-step warmup
	- Mixed precision: "no" (pure fp32) — required for MPS stability
	- Image size: 512×512 diffusion path; 336×336 CLIP-branch (fixed by encoder)
	- Effective batch: 2 micro × 4 grad-accum = 8
	- Steps: 3000 (≈68 epochs on Shopify, ≈74 on Etsy, ≈46 on eBay at an effective batch of 8)
	- Gradient checkpointing: enabled (required on 48 GB M4 Pro)
	- Seed: 42
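For intuition about the learning-rate schedule listed above, here is a minimal sketch of a cosine schedule with linear warmup, using the stated values (lr=1e-4, 200 warmup steps, 3000 total steps). It assumes the common half-cosine form that decays to zero and counts optimizer steps; the training script in the companion repo may differ in those details.

```python
import math

BASE_LR = 1e-4       # peak learning rate from the hyperparameter list
WARMUP_STEPS = 200   # linear warmup length
TOTAL_STEPS = 3000   # total training steps


def lr_at(step):
    """Learning rate at a given step (assumed form: linear warmup, half-cosine decay to 0)."""
    if step < WARMUP_STEPS:
        return BASE_LR * step / WARMUP_STEPS
    progress = (step - WARMUP_STEPS) / (TOTAL_STEPS - WARMUP_STEPS)
    return BASE_LR * 0.5 * (1.0 + math.cos(math.pi * progress))


# Print the rate at the validation interval used in training (every 250 steps).
for step in range(0, TOTAL_STEPS + 1, 250):
    print(f"step {step:4d}: lr = {lr_at(step):.2e}")
```

Under these assumptions the rate peaks at step 200 and is already well below its peak in the later part of training, when the Etsy validation loss starts to drift upward.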

	Training data: curated via `data/curate_platform.py` in the companion repo. Sources: Amazon Berkeley Objects (ABO), LAION-Aesthetics, DeepFashion2. Images were selected per platform by CLIP platform-prompt similarity plus category balancing, giving roughly 400 to 650 images per platform (train + val, see the table above). The 80/20 train/val split is recorded in manifest CSVs.
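Combining the split sizes from the training summary with the step count and effective batch size listed above gives the dataset totals, validation fractions, and approximate epoch counts. The sketch assumes each of the 3000 steps is one optimizer update over 8 images; the function names are illustrative.

```python
EFFECTIVE_BATCH = 2 * 4  # micro-batch x gradient accumulation
STEPS = 3000

# (train images, val images) per platform, from the training summary table.
splits = {"Shopify": (353, 88), "Etsy": (325, 81), "eBay": (518, 129)}


def epochs(train_images):
    """Approximate passes over the training set (assumes 8 images per step)."""
    return STEPS * EFFECTIVE_BATCH / train_images


def val_fraction(train_images, val_images):
    """Share of the curated images held out for validation."""
    return val_images / (train_images + val_images)


for name, (train, val) in splits.items():
    print(
        f"{name}: {train + val} images, "
        f"val share {val_fraction(train, val):.1%}, "
        f"~{epochs(train):.0f} epochs"
    )
```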

	Hardware: Apple MacBook Pro M4 Pro, 48 GB unified memory, PyTorch MPS backend.

	## Known limitations

	- Captions are constant placeholders. Training used `"a product photo"` for every sample (BLIP-2 caption generation was deferred). Text conditioning therefore carries no per-sample information, so all platform aesthetic signal flows through the IP-Adapter image branch.
	- Shopify adapter may over-desaturate color. In qualitative spot checks, the Shopify adapter can push outputs towards white even when the reference product has a distinct color. If color fidelity matters, try `adapter_scale=0.5–0.75` at inference.
	- Etsy is mildly overfit after step 750. Val loss rose ~0.7% from step 750 → 3000. The `final/` checkpoint is stylistically the strongest but diverges more from the reference content. For content-preserving generation, prefer `etsy/checkpoint-500/` (closest available to the val-loss optimum).
	- fp32 training was forced by MPS. On Apple Silicon, autocast fp16/bf16 for SDXL + IP-Adapter raises an MPS `NDArrayMatrixMultiplication` assertion on the first forward pass. These weights are architecturally compatible with fp16 inference (verified on MPS — see the example above), but fp16 / bf16 training of this adapter configuration on CUDA has not been tested here.
	- No ControlNet / segmentation integration in these weights. The companion repo plans a SAM2 + seg-trained ControlNet path; these checkpoints were trained without any spatial conditioning signal.

	## License

	MIT — matches the parent project.

	Individual dataset licenses (ABO CC BY-NC 4.0, DeepFashion2 gated, LAION CC BY 4.0) apply to the training data, not to these weight files. Please consult those upstream licenses before commercial use.

	## Citation

	If you use these checkpoints, please cite the parent project:

	```bibtex
	@misc{studiodiffusion2026,
	  title = {StudioDiffusion: Training Platform-Specific Aesthetic Adapters for Product
	           Photography Using Segmentation-Conditioned Diffusion Models},
	  author = {Shen, Jason and contributors},
	  year = {2026},
	  howpublished = {\url{https://github.com/s-zx/StudioDiffusion}},
	  note = {CS 7643 Deep Learning final project, Georgia Tech}
	}
	```