jasonshen8848 committed
Commit 3b4d78e · verified · 1 Parent(s): 7579b2b

Add model card

Files changed (1)
  1. README.md +173 -0
README.md ADDED
@@ -0,0 +1,173 @@
+ ---
+ license: mit
+ tags:
+ - stable-diffusion-xl
+ - sdxl
+ - ip-adapter
+ - product-photography
+ - e-commerce
+ - text-to-image
+ base_model: stabilityai/stable-diffusion-xl-base-1.0
+ library_name: diffusers
+ ---
+
+ # StudioDiffusion IP-Adapter (Shopify / Etsy / eBay)
+
+ Three **IP-Adapter** weight sets trained on top of [Stable Diffusion XL](https://huggingface.co/stabilityai/stable-diffusion-xl-base-1.0), each targeting a distinct e-commerce platform aesthetic:
+
+ - **Shopify**: clean white / neutral backgrounds, studio lighting, minimal props, high-contrast subject separation.
+ - **Etsy**: warm color temperature, lifestyle / craft props, natural light, textured surfaces, artisanal hand-crafted feel.
+ - **eBay**: bright even lighting, plain or gradient background, sharp focus on subject, utilitarian clarity.
+
+ Companion code and training pipeline: **https://github.com/s-zx/StudioDiffusion**
+
+ ## Repository layout
+
+ | Path | Contents |
+ |---|---|
+ | `shopify/final/{image_proj_model,ip_attn_processors}.pt` | Shopify checkpoint @ step 3000 |
+ | `shopify/train.log` | Shopify val-loss per 250 steps |
+ | `etsy/final/{image_proj_model,ip_attn_processors}.pt` | Etsy checkpoint @ step 3000 |
+ | `etsy/checkpoint-500/{image_proj_model,ip_attn_processors}.pt` | **Recommended** Etsy checkpoint: closest saved checkpoint to the best val loss (step 750), before mild overfit |
+ | `etsy/train.log` | Etsy val-loss per 250 steps |
+ | `ebay/final/{image_proj_model,ip_attn_processors}.pt` | eBay checkpoint @ step 3000 |
+ | `ebay/train.log` | eBay val-loss per 250 steps |
+
+ Each checkpoint follows the `IPAdapterSDXL.save_pretrained` format defined in [`adapters/ip_adapter/model.py`](https://github.com/s-zx/StudioDiffusion/blob/main/adapters/ip_adapter/model.py). There are two files per checkpoint: `image_proj_model.pt` (CLIP-embed → token projection) and `ip_attn_processors.pt` (injected K/V weights for every cross-attention block of the SDXL UNet).
+
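As a quick sanity check after downloading, the two-file layout described above can be verified with a few lines of standard-library Python. This is a minimal sketch: the helper name `missing_checkpoint_files` is hypothetical and not part of the companion repo; only the two file names come from the layout table.

```python
from pathlib import Path

# File names taken from the repository layout table; the helper itself is illustrative.
EXPECTED_FILES = ("image_proj_model.pt", "ip_attn_processors.pt")


def missing_checkpoint_files(ckpt_dir):
    """Return the expected checkpoint files that are absent from ckpt_dir."""
    root = Path(ckpt_dir)
    return [name for name in EXPECTED_FILES if not (root / name).is_file()]


if __name__ == "__main__":
    # Path assumes the local_dir used in the download example of this card.
    missing = missing_checkpoint_files("checkpoints/ip_adapter/shopify/final")
    print("missing files:", missing or "none")
```

An empty result suggests the directory is complete enough to pass to `IPAdapterSDXL.load_pretrained`; it does not validate the tensors inside the files.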
+ ## Usage
+
+ ### Download
+
+ ```python
+ from huggingface_hub import snapshot_download
+
+ # Full set (~5.6 GB)
+ snapshot_download(
+     repo_id="jasonshen8848/StudioDiffusion-ip-adapter",
+     local_dir="checkpoints/ip_adapter",
+ )
+
+ # Single platform (~1.4 GB)
+ snapshot_download(
+     repo_id="jasonshen8848/StudioDiffusion-ip-adapter",
+     local_dir="checkpoints/ip_adapter",
+     allow_patterns=["shopify/final/*", "shopify/train.log"],
+ )
+ ```
+
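The single-platform pattern generalizes to any of the published checkpoint folders. The helper below is a small sketch that builds the `allow_patterns` list from a platform name and checkpoint folder; `patterns_for` is a hypothetical name introduced here for illustration, and the `snapshot_download` call is left commented out so the snippet runs without network access.

```python
def patterns_for(platform, checkpoint="final"):
    """Build allow_patterns for one platform's checkpoint plus its training log."""
    return [f"{platform}/{checkpoint}/*", f"{platform}/train.log"]


# Example: fetch only the recommended Etsy checkpoint.
# from huggingface_hub import snapshot_download
# snapshot_download(
#     repo_id="jasonshen8848/StudioDiffusion-ip-adapter",
#     local_dir="checkpoints/ip_adapter",
#     allow_patterns=patterns_for("etsy", "checkpoint-500"),
# )
print(patterns_for("etsy", "checkpoint-500"))
```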
+ ### Generate: minimal inference example
+
+ A complete working example is at [`inference/smoke.py`](https://github.com/s-zx/StudioDiffusion/blob/main/inference/smoke.py). Core pattern:
+
+ ```python
+ import torch
+ from diffusers import StableDiffusionXLPipeline, AutoencoderKL
+ from PIL import Image
+ from torchvision import transforms
+
+ from adapters.ip_adapter.model import IPAdapterSDXL  # from the GitHub repo
+
+ device, dtype = "mps", torch.float16  # also works on CUDA with these
+
+ pipe = StableDiffusionXLPipeline.from_pretrained(
+     "stabilityai/stable-diffusion-xl-base-1.0",
+     vae=AutoencoderKL.from_pretrained(
+         "madebyollin/sdxl-vae-fp16-fix", torch_dtype=dtype,
+     ),
+     torch_dtype=dtype,
+ ).to(device)
+
+ adapter = IPAdapterSDXL.load_pretrained(
+     unet=pipe.unet,
+     load_directory="checkpoints/ip_adapter/shopify/final",
+     image_encoder_id="openai/clip-vit-large-patch14-336",
+     num_tokens=16,
+     adapter_scale=1.0,
+ ).to(device=device, dtype=dtype)
+
+ clip_transform = transforms.Compose([
+     transforms.Resize(336, interpolation=transforms.InterpolationMode.BICUBIC),
+     transforms.CenterCrop(336),
+     transforms.ToTensor(),
+     transforms.Normalize(
+         mean=[0.48145466, 0.4578275, 0.40821073],
+         std=[0.26862954, 0.26130258, 0.27577711],
+     ),
+ ])
+
+ ref = Image.open("my_product.jpg").convert("RGB")
+ clip_input = clip_transform(ref).unsqueeze(0).to(device=device, dtype=dtype)
+ with torch.no_grad():
+     cond_ip, uncond_ip = adapter.encode_image(clip_input)
+     ip_hidden_states = torch.cat([uncond_ip, cond_ip], dim=0)  # [uncond, cond] for CFG
+
+ image = pipe(
+     prompt="a professional product photograph",
+     negative_prompt="blurry, low quality, distorted, artifacts",
+     num_inference_steps=30,
+     guidance_scale=7.5,
+     height=512, width=512,
+     cross_attention_kwargs={"ip_hidden_states": ip_hidden_states},
+ ).images[0]
+ image.save("out.png")
+ ```
+
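The `[uncond, cond]` ordering of `ip_hidden_states` matters because, with classifier-free guidance enabled, the diffusers SDXL pipeline runs both branches in one batch and splits the noise prediction into an unconditional half followed by a conditional half. The sketch below shows the guidance formula on plain Python lists; `cfg_combine` is an illustrative name, not a diffusers function, and the two toy vectors stand in for the two halves of the batch.

```python
def cfg_combine(uncond, cond, guidance_scale):
    """Classifier-free guidance: move the prediction from uncond toward cond."""
    return [u + guidance_scale * (c - u) for u, c in zip(uncond, cond)]


# Toy 2-element "noise predictions" for the two halves of the batch.
uncond_pred = [0.0, 1.0]
cond_pred = [1.0, 3.0]
print(cfg_combine(uncond_pred, cond_pred, guidance_scale=7.5))  # [7.5, 16.0]
```

If the two halves were concatenated in the opposite order, the same formula would push the result away from the reference image rather than toward it.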
+ ## Training summary
+
+ | Metric | Shopify | Etsy | eBay |
+ |---|---|---|---|
+ | Train images | 353 | 325 | 518 |
+ | Val images | 88 | 81 | 129 |
+ | Start val loss (step 250) | 0.073747 | 0.131454 | 0.058868 |
+ | End val loss (step 3000) | 0.072500 | 0.132335 | 0.055920 |
+ | Best val loss | 0.072463 @ step 2000 | **0.131412 @ step 750** | 0.055920 @ step 3000 |
+ | Δ val loss (start → end) | **−1.7%** ↓ | **+0.7%** ↑ (mild overfit) | **−5.0%** ↓ |
+ | Wall-clock | ~9 h | ~9 h | ~9 h |
+
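The Δ val loss row can be reproduced from the start and end values in the same table. The snippet below is a minimal sketch of that arithmetic, assuming the percentage is the relative change from the step-250 value to the step-3000 value; the function name is illustrative.

```python
def relative_change_pct(start, end):
    """Relative change from start to end, in percent."""
    return (end - start) / start * 100.0


# (start val loss @ step 250, end val loss @ step 3000) from the table.
val_loss = {
    "Shopify": (0.073747, 0.072500),
    "Etsy": (0.131454, 0.132335),
    "eBay": (0.058868, 0.055920),
}

for name, (start, end) in val_loss.items():
    print(f"{name}: {relative_change_pct(start, end):+.1f}%")
```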
+ **Hyperparameters** (identical across platforms):
+
+ - Base: `stabilityai/stable-diffusion-xl-base-1.0`
+ - VAE: `madebyollin/sdxl-vae-fp16-fix`
+ - Image encoder: `openai/clip-vit-large-patch14-336` (frozen)
+ - Optimizer: AdamW, lr=1e-4, (β₁, β₂)=(0.9, 0.999), wd=0.01
+ - LR schedule: cosine with 200-step warmup
+ - **Mixed precision: "no" (pure fp32)**, required for MPS stability
+ - Image size: 512×512 diffusion path; 336×336 CLIP branch (fixed by encoder)
+ - Effective batch: 2 micro × 4 grad-accum = 8
+ - Steps: 3000 (≈ 68 epochs on Shopify, ≈ 74 on Etsy, ≈ 46 on eBay, at 8 images per step)
+ - Gradient checkpointing: enabled (required on 48 GB M4 Pro)
+ - Seed: 42
+
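The step budget and learning-rate schedule listed above can be sketched numerically. Two assumptions are made here and are not confirmed by the card: that each of the 3000 steps consumes one effective batch of 8 images, and that the schedule is a linear warmup from zero followed by a cosine decay to zero at the final step. Epoch counts computed this way are estimates and may differ slightly from rounded figures quoted elsewhere.

```python
import math

BASE_LR = 1e-4
WARMUP_STEPS = 200
TOTAL_STEPS = 3000
EFFECTIVE_BATCH = 2 * 4  # micro-batch size x gradient-accumulation steps


def lr_at(step):
    """Linear warmup to BASE_LR, then cosine decay to zero (assumed shape)."""
    if step < WARMUP_STEPS:
        return BASE_LR * step / WARMUP_STEPS
    progress = (step - WARMUP_STEPS) / (TOTAL_STEPS - WARMUP_STEPS)
    return BASE_LR * 0.5 * (1.0 + math.cos(math.pi * progress))


def approx_epochs(train_images):
    """Passes over the training set, assuming one effective batch per step."""
    return TOTAL_STEPS * EFFECTIVE_BATCH / train_images


for name, n in {"Shopify": 353, "Etsy": 325, "eBay": 518}.items():
    print(f"{name}: ~{approx_epochs(n):.0f} epochs")
print(f"lr at step 100: {lr_at(100):.2e}, at step 1600: {lr_at(1600):.2e}")
```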
+ **Training data**: curated via `data/curate_platform.py` in the companion repo. Sources: Amazon Berkeley Objects (ABO), LAION-Aesthetics, DeepFashion2. Roughly 400 or more images per platform (train + val totals: 441 Shopify, 406 Etsy, 647 eBay) were selected by CLIP platform-prompt similarity plus category balancing; the 80/20 train/val split is recorded in manifest CSVs.
+
+ **Hardware**: Apple MacBook Pro M4 Pro, 48 GB unified memory, PyTorch MPS backend.
+
+ ## Known limitations
+
+ - **Captions are identical placeholders.** Training used `"a product photo"` for every sample (BLIP-2 caption generation was deferred). Text conditioning therefore provides minimal per-sample variance; all platform aesthetic signal flows through the IP-Adapter image branch.
+ - **Shopify adapter may over-desaturate color.** In qualitative spot checks, the Shopify adapter can push outputs towards white even when the reference product has a distinct color. If color fidelity matters, try an `adapter_scale` between 0.5 and 0.75 at inference.
+ - **Etsy is mildly overfit after step 750.** Val loss rose ~0.7% from step 750 → 3000. The `final/` checkpoint is stylistically the strongest but diverges more from the reference content. **For content-preserving generation, prefer `etsy/checkpoint-500/`** (closest available to the val-loss optimum).
+ - **fp32 training was forced by MPS.** On Apple Silicon, autocast fp16/bf16 for SDXL + IP-Adapter raises an MPS `NDArrayMatrixMultiplication` assertion on the first forward pass. These weights are architecturally compatible with fp16 inference (verified on MPS; see the example above), but **fp16 / bf16 training** of this adapter configuration on CUDA has not been tested here.
+ - **No ControlNet / segmentation integration in these weights.** The companion repo plans a SAM2 + seg-trained ControlNet path; these checkpoints were trained without any spatial conditioning signal.
+
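The Etsy checkpoint recommendation can be traced with a short calculation: find the step with the lowest reported val loss, then pick the published checkpoint nearest to it. The sketch below uses only the three Etsy val-loss values quoted in the training summary and the two Etsy checkpoint folders in this repo (`checkpoint-500/` and `final/` at step 3000); the helper names are illustrative, and the full per-250-step curve in `etsy/train.log` is not reproduced here.

```python
# (step, val_loss) pairs quoted in the training summary for Etsy.
etsy_val = [(250, 0.131454), (750, 0.131412), (3000, 0.132335)]
# Steps with published Etsy weights: checkpoint-500/ and final/ (step 3000).
available_steps = [500, 3000]


def best_step(val_points):
    """Step with the lowest validation loss."""
    return min(val_points, key=lambda p: p[1])[0]


def nearest_available(target_step, steps):
    """Published checkpoint step closest to target_step."""
    return min(steps, key=lambda s: abs(s - target_step))


optimum = best_step(etsy_val)
print(optimum, nearest_available(optimum, available_steps))  # 750 500
```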
+ ## License
+
+ MIT, matching the parent project.
+
+ Individual dataset licenses (ABO CC BY-NC 4.0, DeepFashion2 gated, LAION CC BY 4.0) apply to the *training data*, not to these weight files. Please consult those upstream licenses before commercial use.
+
+ ## Citation
+
+ If you use these checkpoints, please cite the parent project:
+
+ ```bibtex
+ @misc{studiodiffusion2026,
+   title        = {StudioDiffusion: Training Platform-Specific Aesthetic Adapters for Product
+                   Photography Using Segmentation-Conditioned Diffusion Models},
+   author       = {Shen, Jason and contributors},
+   year         = {2026},
+   howpublished = {\url{https://github.com/s-zx/StudioDiffusion}},
+   note         = {CS 7643 Deep Learning final project, Georgia Tech}
+ }
+ ```