Add model card
README.md
ADDED
@@ -0,0 +1,173 @@
---
license: mit
tags:
- stable-diffusion-xl
- sdxl
- ip-adapter
- product-photography
- e-commerce
- text-to-image
base_model: stabilityai/stable-diffusion-xl-base-1.0
library_name: diffusers
---

# StudioDiffusion IP-Adapter (Shopify / Etsy / eBay)

Three **IP-Adapter** weight sets trained on top of [Stable Diffusion XL](https://huggingface.co/stabilityai/stable-diffusion-xl-base-1.0), each targeting a distinct e-commerce platform aesthetic:

- **Shopify**: clean white / neutral backgrounds, studio lighting, minimal props, high-contrast subject separation.
- **Etsy**: warm color temperature, lifestyle / craft props, natural light, textured surfaces, artisanal hand-crafted feel.
- **eBay**: bright even lighting, plain or gradient background, sharp focus on subject, utilitarian clarity.

Companion code and training pipeline: **https://github.com/s-zx/StudioDiffusion**

## Repository layout

| Path | Contents |
|---|---|
| `shopify/final/{image_proj_model,ip_attn_processors}.pt` | Shopify checkpoint @ step 3000 |
| `shopify/train.log` | Shopify val-loss per 250 steps |
| `etsy/final/{image_proj_model,ip_attn_processors}.pt` | Etsy checkpoint @ step 3000 |
| `etsy/checkpoint-500/{image_proj_model,ip_attn_processors}.pt` | **Recommended** Etsy checkpoint: closest saved checkpoint to the best val loss (step 750), before mild overfit |
| `etsy/train.log` | Etsy val-loss per 250 steps |
| `ebay/final/{image_proj_model,ip_attn_processors}.pt` | eBay checkpoint @ step 3000 |
| `ebay/train.log` | eBay val-loss per 250 steps |

Each checkpoint follows the `IPAdapterSDXL.save_pretrained` format defined in [`adapters/ip_adapter/model.py`](https://github.com/s-zx/StudioDiffusion/blob/main/adapters/ip_adapter/model.py). Each checkpoint consists of two files: `image_proj_model.pt` (CLIP-embed → token projection) and `ip_attn_processors.pt` (injected K/V weights for every cross-attention block of the SDXL UNet).
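A quick way to sanity-check a downloaded checkpoint is to load both files and count their parameters. The sketch below assumes each `.pt` file holds a flat mapping from parameter names to tensors (a plain state dict); the on-disk layout is not spelled out here, so treat that as an assumption. `load_checkpoint` and `count_parameters` are hypothetical helper names, not part of the companion repo.

```python
from pathlib import Path

# File stems listed in the repository layout table.
CHECKPOINT_FILES = ("image_proj_model", "ip_attn_processors")


def count_parameters(state_dict):
    """Sum element counts over a name -> tensor mapping."""
    return sum(tensor.numel() for tensor in state_dict.values())


def load_checkpoint(directory):
    """Load both checkpoint files onto the CPU.

    Assumption: each file is a flat state dict of tensors.
    """
    import torch  # imported lazily so count_parameters works without torch

    directory = Path(directory)
    return {
        stem: torch.load(directory / f"{stem}.pt", map_location="cpu")
        for stem in CHECKPOINT_FILES
    }
```

If the flat-state-dict assumption holds, `{name: count_parameters(sd) for name, sd in load_checkpoint("checkpoints/ip_adapter/shopify/final").items()}` reports how many parameters each file contributes.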

## Usage

### Download

```python
from huggingface_hub import snapshot_download

# Full set (~5.6 GB)
snapshot_download(
    repo_id="jasonshen8848/StudioDiffusion-ip-adapter",
    local_dir="checkpoints/ip_adapter",
)

# Single platform (~1.4 GB)
snapshot_download(
    repo_id="jasonshen8848/StudioDiffusion-ip-adapter",
    local_dir="checkpoints/ip_adapter",
    allow_patterns=["shopify/final/*", "shopify/train.log"],
)
```
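The same `allow_patterns` mechanism can fetch any single checkpoint directory from the layout table, for instance the recommended `etsy/checkpoint-500`. The following is a small sketch; `checkpoint_patterns` and `download_checkpoint` are hypothetical helper names introduced for illustration, and only `snapshot_download` with its `repo_id`, `local_dir`, and `allow_patterns` arguments comes from `huggingface_hub`.

```python
REPO_ID = "jasonshen8848/StudioDiffusion-ip-adapter"


def checkpoint_patterns(platform, checkpoint="final", include_log=True):
    """Build glob patterns for one platform's checkpoint directory."""
    patterns = [f"{platform}/{checkpoint}/*"]
    if include_log:
        patterns.append(f"{platform}/train.log")
    return patterns


def download_checkpoint(platform, checkpoint="final",
                        local_dir="checkpoints/ip_adapter"):
    """Download only the files matching the selected checkpoint."""
    # Imported lazily so the pattern helper works without huggingface_hub.
    from huggingface_hub import snapshot_download

    return snapshot_download(
        repo_id=REPO_ID,
        local_dir=local_dir,
        allow_patterns=checkpoint_patterns(platform, checkpoint),
    )
```

Calling `download_checkpoint("etsy", "checkpoint-500")` would then request only `etsy/checkpoint-500/*` and `etsy/train.log`.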

### Generate: minimal inference example

A complete working example is at [`inference/smoke.py`](https://github.com/s-zx/StudioDiffusion/blob/main/inference/smoke.py). Core pattern:

```python
import torch
from diffusers import StableDiffusionXLPipeline, AutoencoderKL
from PIL import Image
from torchvision import transforms

from adapters.ip_adapter.model import IPAdapterSDXL  # from the GitHub repo

device, dtype = "mps", torch.float16  # also works on CUDA with these

pipe = StableDiffusionXLPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0",
    vae=AutoencoderKL.from_pretrained(
        "madebyollin/sdxl-vae-fp16-fix", torch_dtype=dtype,
    ),
    torch_dtype=dtype,
).to(device)

adapter = IPAdapterSDXL.load_pretrained(
    unet=pipe.unet,
    load_directory="checkpoints/ip_adapter/shopify/final",
    image_encoder_id="openai/clip-vit-large-patch14-336",
    num_tokens=16,
    adapter_scale=1.0,
).to(device=device, dtype=dtype)

clip_transform = transforms.Compose([
    transforms.Resize(336, interpolation=transforms.InterpolationMode.BICUBIC),
    transforms.CenterCrop(336),
    transforms.ToTensor(),
    transforms.Normalize(
        mean=[0.48145466, 0.4578275, 0.40821073],
        std=[0.26862954, 0.26130258, 0.27577711],
    ),
])

ref = Image.open("my_product.jpg").convert("RGB")
clip_input = clip_transform(ref).unsqueeze(0).to(device=device, dtype=dtype)
with torch.no_grad():
    cond_ip, uncond_ip = adapter.encode_image(clip_input)
ip_hidden_states = torch.cat([uncond_ip, cond_ip], dim=0)  # [uncond, cond] for CFG

image = pipe(
    prompt="a professional product photograph",
    negative_prompt="blurry, low quality, distorted, artifacts",
    num_inference_steps=30,
    guidance_scale=7.5,
    height=512, width=512,
    cross_attention_kwargs={"ip_hidden_states": ip_hidden_states},
).images[0]
image.save("out.png")
```
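The `[uncond, cond]` ordering matters because the diffusers SDXL pipeline batches the unconditional and conditional branches in that order and splits the UNet output in half before applying classifier-free guidance. Below is a minimal pure-Python sketch of that combination step, with plain lists standing in for tensors; `cfg_combine` is an illustrative name, not a diffusers function.

```python
def cfg_combine(batch, guidance_scale):
    """Apply classifier-free guidance to a batch ordered [uncond..., cond...].

    Each entry of `batch` is a list of floats standing in for one noise
    prediction. The first half is unconditional, the second half conditional.
    """
    half = len(batch) // 2
    uncond, cond = batch[:half], batch[half:]
    return [
        [u + guidance_scale * (c - u) for u, c in zip(u_row, c_row)]
        for u_row, c_row in zip(uncond, cond)
    ]


# One sample: unconditional prediction first, conditional second.
guided = cfg_combine([[0.0, 1.0], [1.0, 3.0]], guidance_scale=7.5)
```

Swapping the two halves would push the result away from the conditional prediction instead of toward it, which is why the image tokens are concatenated in the same order as the prompt embeddings.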

## Training summary

| Metric | Shopify | Etsy | eBay |
|---|---|---|---|
| Train images | 353 | 325 | 518 |
| Val images | 88 | 81 | 129 |
| Start val loss (step 250) | 0.073747 | 0.131454 | 0.058868 |
| End val loss (step 3000) | 0.072500 | 0.132335 | 0.055920 |
| Best val loss | 0.072463 @ step 2000 | **0.131412 @ step 750** | 0.055920 @ step 3000 |
| Δ val loss (step 250 → 3000) | **−1.7%** | **+0.7%** (mild overfit) | **−5.0%** |
| Wall-clock | ~9 h | ~9 h | ~9 h |
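The relative val-loss change in the table is the end loss minus the start loss, divided by the start loss. A short sketch that reproduces the row from the step-250 and step-3000 values (the helper name `relative_change_percent` is illustrative):

```python
# Validation losses copied from the table: (step 250, step 3000).
VAL_LOSS = {
    "shopify": (0.073747, 0.072500),
    "etsy": (0.131454, 0.132335),
    "ebay": (0.058868, 0.055920),
}


def relative_change_percent(start, end):
    """Relative change from start to end, in percent."""
    return 100.0 * (end - start) / start


for platform, (start, end) in VAL_LOSS.items():
    print(f"{platform}: {relative_change_percent(start, end):+.1f}%")
```

A negative value means the validation loss fell over training; only Etsy ends higher than it started.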

**Hyperparameters** (identical across platforms):

- Base: `stabilityai/stable-diffusion-xl-base-1.0`
- VAE: `madebyollin/sdxl-vae-fp16-fix`
- Image encoder: `openai/clip-vit-large-patch14-336` (frozen)
- Optimizer: AdamW, lr=1e-4, (β₁, β₂)=(0.9, 0.999), wd=0.01
- LR schedule: cosine with 200-step warmup
- **Mixed precision: "no" (pure fp32)**, required for MPS stability
- Image size: 512×512 diffusion path; 336×336 CLIP-branch (fixed by encoder)
- Effective batch: 2 micro × 4 grad-accum = 8
- Steps: 3000 (≈68 epochs on Shopify, ≈74 on Etsy, ≈46 on eBay)
- Gradient checkpointing: enabled (required on 48 GB M4 Pro)
- Seed: 42
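For intuition about the learning-rate schedule, here is a sketch of a cosine schedule with warmup using the values listed above. It assumes linear warmup from zero followed by a half-cosine decay to zero, which is the common convention; the exact scheduler used in the companion repo is not stated here, and `lr_at` is an illustrative name.

```python
import math

BASE_LR = 1e-4
WARMUP_STEPS = 200
TOTAL_STEPS = 3000


def lr_at(step):
    """Learning rate at an optimizer step (assumed linear warmup + cosine decay)."""
    if step < WARMUP_STEPS:
        # Linear ramp from 0 up to BASE_LR.
        return BASE_LR * step / WARMUP_STEPS
    # Fraction of the post-warmup phase completed, in [0, 1].
    progress = (step - WARMUP_STEPS) / (TOTAL_STEPS - WARMUP_STEPS)
    return BASE_LR * 0.5 * (1.0 + math.cos(math.pi * progress))
```

Under these assumptions the rate peaks at step 200, is back to half the peak at step 1600 (the midpoint of the decay phase), and reaches zero at step 3000.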

**Training data**: curated via `data/curate_platform.py` in the companion repo. Sources: Amazon Berkeley Objects (ABO), LAION-Aesthetics, DeepFashion2. ~400 images per platform selected by CLIP platform-prompt similarity + category balancing; 80/20 train/val split recorded in manifest CSVs.
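The split sizes and step count determine how many passes each adapter made over its training set. A short arithmetic sketch, using only numbers from the training summary and hyperparameter list (the function names are illustrative):

```python
STEPS = 3000
EFFECTIVE_BATCH = 2 * 4  # micro-batch size x gradient-accumulation steps

# (train images, val images) per platform, from the training summary table.
SPLITS = {"shopify": (353, 88), "etsy": (325, 81), "ebay": (518, 129)}


def epochs(train_images):
    """Number of passes over the training set implied by the step count."""
    return STEPS * EFFECTIVE_BATCH / train_images


def val_fraction(train_images, val_images):
    """Share of all curated images held out for validation."""
    return val_images / (train_images + val_images)


for platform, (train, val) in SPLITS.items():
    print(f"{platform}: {epochs(train):.0f} epochs, "
          f"val fraction {val_fraction(train, val):.3f}")
```

Each platform's held-out share comes out at roughly 0.20, consistent with the stated 80/20 split.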

**Hardware**: Apple MacBook Pro M4 Pro, 48 GB unified memory, PyTorch MPS backend.

## Known limitations

- **Captions are identity placeholders.** Training used `"a product photo"` for every sample (BLIP-2 caption generation was deferred). Text conditioning therefore provides minimal per-sample variance; all platform aesthetic signal flows through the IP-Adapter image branch.
- **Shopify adapter may over-desaturate color.** In qualitative spot checks, the Shopify adapter can push outputs towards white even when the reference product has a distinct color. If color fidelity matters, try an `adapter_scale` between 0.5 and 0.75 at inference.
- **Etsy is mildly overfit after step 750.** Val loss rose ~0.7% from step 750 to step 3000. The `final/` checkpoint is stylistically the strongest but diverges more from the reference content. **For content-preserving generation, prefer `etsy/checkpoint-500/`** (closest available to the val-loss optimum).
- **fp32 training was forced by MPS.** On Apple Silicon, autocast fp16/bf16 for SDXL + IP-Adapter raises an MPS `NDArrayMatrixMultiplication` assertion on the first forward pass. These weights are architecturally compatible with fp16 inference (verified on MPS; see the example above), but **fp16 / bf16 training** of this adapter configuration on CUDA has not been tested here.
- **No ControlNet / segmentation integration in these weights.** The companion repo plans a SAM2 + seg-trained ControlNet path; these checkpoints were trained without any spatial conditioning signal.
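To see why lowering `adapter_scale` can help with color fidelity: in the standard IP-Adapter formulation, each cross-attention layer adds the image-branch attention output to the text-branch output, weighted by the scale. The sketch below assumes the companion repo follows that formulation, which is not confirmed here; `decoupled_attention_output` is an illustrative name and plain lists stand in for tensors.

```python
def decoupled_attention_output(text_out, image_out, adapter_scale):
    """Blend text-branch and image-branch attention outputs.

    Assumed formulation: output = text_attention + adapter_scale * image_attention.
    """
    return [t + adapter_scale * i for t, i in zip(text_out, image_out)]


text_out = [1.0, 2.0]    # stand-in for the text cross-attention output
image_out = [0.5, -1.0]  # stand-in for the image (IP) cross-attention output

full = decoupled_attention_output(text_out, image_out, adapter_scale=1.0)
half = decoupled_attention_output(text_out, image_out, adapter_scale=0.5)
```

At `adapter_scale=0.0` the image branch contributes nothing, so intermediate values trade platform style against the rest of the conditioning.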

## License

MIT, matching the parent project.

Individual dataset licenses (ABO CC BY-NC 4.0, DeepFashion2 gated, LAION CC BY 4.0) apply to the *training data*, not to these weight files. Please consult those upstream licenses before commercial use.

## Citation

If you use these checkpoints, please cite the parent project:

```bibtex
@misc{studiodiffusion2026,
  title = {StudioDiffusion: Training Platform-Specific Aesthetic Adapters for Product
           Photography Using Segmentation-Conditioned Diffusion Models},
  author = {Shen, Jason and contributors},
  year = {2026},
  howpublished = {\url{https://github.com/s-zx/StudioDiffusion}},
  note = {CS 7643 Deep Learning final project, Georgia Tech}
}
```