	---
	license: creativeml-openrail-m
	library_name: onnx
	tags:
	- onnx
	- stable-diffusion
	- ip-adapter
	- text-to-image
	- mobile
	inference: false
	base_model:
	- runwayml/stable-diffusion-v1-5
	- h94/IP-Adapter
	---

	# Sona Forge — SD 1.5 IP-Adapter UNet (ONNX FP16)

	A single fused ONNX FP16 graph combining the SD 1.5 UNet with IP-Adapter image-conditioning weights baked into the cross-attention layers. Used by the Sona Forge Android app for identity-preserving avatar generation. Pair with [`sona-forge/clip-vit-h-14-image-fp16`](https://huggingface.co/sona-forge/clip-vit-h-14-image-fp16).

	Revision 1.1.0 (2026-05-01) adds 13 optional ControlNet residual inputs (12 down-block residuals + 1 mid-block residual), so the same UNet drives both app phases:

	- Phase 6 (IP-Adapter only): pass zero-filled residuals, or rely on the residual-aware export's pass-through-when-empty semantics.
	- Phase 7 (IP-Adapter + ControlNet Canny): pass the residuals from [`sona-forge/sd15-controlnet-canny-fp16`](https://huggingface.co/sona-forge/sd15-controlnet-canny-fp16).

	## ONNX shape

	| Input | Shape | dtype | Notes |
	|---|---|---|---|
	| `sample` | `[batch, 4, 64, 64]` | FP16 | latent state at step t |
	| `timestep` | `[batch]` | FP16 | scheduler timestep |
	| `encoder_hidden_states` | `[batch, 77, 768]` | FP16 | text embeds (e.g. from CLIP text encoder) |
	| `image_embeds` | `[batch, num_images, 1024]` | FP16 | rank-3 per diffusers 0.27.2's `MultiIPAdapterImageProjection`. On-device path uses `num_images=1`. |
	| `down_residual_0..11` | 12 tensors (shapes listed below) | FP16 | ControlNet down-block residuals (canonical SD 1.5 shapes). Pass zeros for Phase-6-only inference. |
	| `mid_residual` | `[batch, 1280, 8, 8]` | FP16 | ControlNet mid-block residual. Pass zeros for Phase-6-only inference. |

	| Output | Shape | dtype |
	|---|---|---|
	| `noise_pred` | `[batch, 4, 64, 64]` | FP16 |

	Down-block residual canonical shapes (SD 1.5):
	`[batch, 320, 64, 64]` ×3, `[batch, 320, 32, 32]`, `[batch, 640, 32, 32]` ×2, `[batch, 640, 16, 16]`, `[batch, 1280, 16, 16]` ×2, `[batch, 1280, 8, 8]` ×3.
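
As a quick cross-check of the list above, the sketch below expands it into one shape per input name and totals the FP16 payload. It assumes that `down_residual_i` follows the order in which the shapes are listed (the usual diffusers ControlNet output order). `residual_shapes` and `fp16_bytes` are hypothetical helper names, not part of the export.

```python
# (channels, spatial size, repeat count), in the order listed above.
# Assumption: down_residual_i follows this listing order.
DOWN_SPEC = [
    (320, 64, 3), (320, 32, 1),
    (640, 32, 2), (640, 16, 1),
    (1280, 16, 2), (1280, 8, 3),
]
MID_SHAPE = (1280, 8, 8)


def residual_shapes(batch):
    """Map each residual input name to its full shape for a given batch size."""
    shapes = {}
    index = 0
    for channels, size, repeat in DOWN_SPEC:
        for _ in range(repeat):
            shapes[f"down_residual_{index}"] = (batch, channels, size, size)
            index += 1
    shapes["mid_residual"] = (batch, *MID_SHAPE)
    return shapes


def fp16_bytes(shapes):
    """Total payload in bytes when every tensor is FP16 (2 bytes per element)."""
    total = 0
    for shape in shapes.values():
        count = 1
        for dim in shape:
            count *= dim
        total += 2 * count
    return total


shapes = residual_shapes(batch=2)
print(len(shapes), "residual inputs,", fp16_bytes(shapes), "bytes at batch=2")
```

The same dictionary of shapes can drive the allocation of zero-filled arrays for Phase-6-only inference.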

	## How it was made

	Pinned conversion environment:

	| Package | Version |
	|---|---|
	| diffusers | 0.27.2 |
	| transformers | 4.40.0 |
	| torch | 2.3.0 |
	| onnx | 1.16.0 |
	| onnxruntime | 1.18.0 |
	| numpy | <2 (ABI compat) |

	Conversion sequence:
	1. Load `runwayml/stable-diffusion-v1-5` UNet at FP16.
	2. Download `h94/IP-Adapter`'s `models/ip-adapter_sd15.bin` checkpoint (image-projection MLP + cross-attn K/V).
	3. Apply weights via `unet._load_ip_adapter_weights([state_dict])` (the diffusers 0.27.2 internal — public `unet.load_ip_adapter()` doesn't exist on `UNet2DConditionModel` in this version).
	4. Set `attn_processor.scale = [1.0, ...]` on each `IPAdapterAttnProcessor` / `IPAdapterAttnProcessor2_0`.
	5. Wrap the UNet so that `image_embeds` is a positional input, forwarded as `added_cond_kwargs={"image_embeds": [image_embeds]}`, and so that `down_block_additional_residuals` / `mid_block_additional_residual` flow through to the UNet forward call.
	6. Run `torch.onnx.export` at opset 17 with FP16 dummy inputs at canonical SD 1.5 shapes.
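
The argument plumbing in the wrapping step can be sketched without any framework dependency. The snippet below is an illustration, not the wrapper used for this export: `INPUT_NAMES` and `to_unet_kwargs` are hypothetical names, and a real wrapper would presumably be a `torch.nn.Module` whose `forward` performs the same regrouping before calling the UNet.

```python
# Flat, ordered input list as the exported graph exposes it
# (names taken from the ONNX shape table).
INPUT_NAMES = (
    ["sample", "timestep", "encoder_hidden_states", "image_embeds"]
    + [f"down_residual_{i}" for i in range(12)]
    + ["mid_residual"]
)


def to_unet_kwargs(sample, timestep, encoder_hidden_states, image_embeds, *residuals):
    """Regroup 17 flat positional tensors into the keyword layout of the UNet forward call."""
    if len(residuals) != 13:
        raise ValueError(f"expected 12 down + 1 mid residual, got {len(residuals)}")
    return {
        "sample": sample,
        "timestep": timestep,
        "encoder_hidden_states": encoder_hidden_states,
        # diffusers expects a list with one entry per loaded IP-Adapter.
        "added_cond_kwargs": {"image_embeds": [image_embeds]},
        "down_block_additional_residuals": tuple(residuals[:12]),
        "mid_block_additional_residual": residuals[12],
    }


# Stand-in strings instead of tensors, just to show the regrouping.
flat = [f"<{name}>" for name in INPUT_NAMES]
kwargs = to_unet_kwargs(*flat)
print(kwargs["added_cond_kwargs"], len(kwargs["down_block_additional_residuals"]))
```

Keeping every input positional and flat matters because `torch.onnx.export` traces tensor arguments in order; nested dicts and tuples are rebuilt inside the wrapper rather than passed across the export boundary.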

	Re-running the conversion from the same pinned environment produces byte-identical output (same sha256). Conversion artefacts include a spike report with full TracerWarning output, validation metrics, and round-trip checks.

	## Files

	| File | Size | sha256 | Revision |
	|---|---|---|---|
	| `model.onnx` | 1,764,924,739 B (1683 MiB) | `a0287f119d85b8028d9673850322247b5978ed9b504077bc04d433f4c9fadcb7` | 1.1.0 (current; residual-accepting, Phase 7) |
	| ~~`model.onnx`~~ | ~~1,764,923,048 B (1683 MiB)~~ | ~~`29e749b2c8dfdd6953a9165eca42e11489f8f90d43fac66c333cfdf6aae0014f`~~ | 1.0.0 (Phase 6; superseded by 1.1.0; signature was the 4-input subset) |

	No external-data sidecar — graph + weights fit under the 2 GB protobuf single-file limit.
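
Because the export is a single file with a published size and sha256, a download can be checked before it is handed to a runtime. The following is a minimal sketch using only the standard library; `file_sha256` and `verify_model` are hypothetical helper names, and the local path `model.onnx` is an assumption about where the file was saved.

```python
import hashlib
import os

# Values copied from the Files table for revision 1.1.0.
EXPECTED_SIZE = 1_764_924_739
EXPECTED_SHA256 = "a0287f119d85b8028d9673850322247b5978ed9b504077bc04d433f4c9fadcb7"


def file_sha256(path, chunk_size=1 << 20):
    """Hash a file in 1 MiB chunks so the model is never fully held in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_model(path, expected_size=EXPECTED_SIZE, expected_sha256=EXPECTED_SHA256):
    """Return True only if both the byte size and the sha256 digest match."""
    if os.path.getsize(path) != expected_size:
        return False
    return file_sha256(path) == expected_sha256


# Assumed local path; the check is skipped when the file is not present.
if os.path.exists("model.onnx"):
    print("model.onnx matches revision 1.1.0:", verify_model("model.onnx"))
```

Checking the size first is a cheap way to reject truncated downloads before paying for a full hash pass.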

	The 1.1.0 export is a strict superset of the 1.0.0 input signature: zero-filled residual inputs reproduce the 1.0.0 numerical output (verified during the Phase 7 spike).

	## Licence

	The fused ONNX is a composite of two upstream artefacts:

	- SD 1.5 base UNet — [CreativeML OpenRAIL-M](https://huggingface.co/spaces/CompVis/stable-diffusion-license) (use-based restrictions; permits redistribution + modification).
	- IP-Adapter weights — [Apache-2.0](https://github.com/tencent-ailab/IP-Adapter/blob/main/LICENSE).

	The composite is distributed under the most restrictive of these terms — CreativeML OpenRAIL-M.

	## Memory footprint

	ORT's CPU EP promotes FP16 to FP32 at session load (the Phase 6 spike measured ~3.5 GB resident for this UNet alone). On Android, NNAPI / XNNPACK execute FP16 natively, so the on-device working set is closer to the FP16 disk size plus activation buffers. Sona Forge gates this pack to Tier B+ devices (≥ 7 GB total RAM).
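
The ~3.5 GB figure is consistent with a simple back-of-envelope estimate: storing every FP16 weight as FP32 doubles its size. The sketch below assumes the file is almost entirely FP16 weights and ignores activations and runtime overhead, so it is a rough estimate rather than a measurement.

```python
MODEL_BYTES = 1_764_924_739  # model.onnx size, revision 1.1.0

# Assumption: nearly all bytes are FP16 weights (2 bytes each);
# promotion to FP32 stores each weight in 4 bytes instead.
fp32_bytes = MODEL_BYTES * 4 // 2

print(f"FP16 on disk: {MODEL_BYTES / 1e9:.2f} GB")
print(f"FP32 weights: {fp32_bytes / 1e9:.2f} GB")
```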

	## Usage

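	```python
	import numpy as np
	import onnxruntime as ort

	session = ort.InferenceSession("model.onnx", providers=["CPUExecutionProvider"])

	# CFG batch=2; uncond at index 0, cond at index 1.
	batch = 2
	sample = np.random.randn(batch, 4, 64, 64).astype(np.float16)
	timestep = np.array([999.0, 999.0], dtype=np.float16)
	encoder_hidden_states = np.random.randn(batch, 77, 768).astype(np.float16)

	# image_embeds: zeros for uncond branch, scaled CLIP embeds for cond branch.
	clip_emb = np.random.randn(1, 1024).astype(np.float16)  # one reference image
	ip_scale = 0.7
	image_embeds = np.stack([
	    np.zeros_like(clip_emb),
	    clip_emb * ip_scale,
	]).astype(np.float16)  # shape (2, 1, 1024)

	# Revision 1.1.0 also takes the ControlNet residuals; zeros mean IP-Adapter only.
	down_shapes = (
	    [(320, 64)] * 3 + [(320, 32)] + [(640, 32)] * 2
	    + [(640, 16)] + [(1280, 16)] * 2 + [(1280, 8)] * 3
	)  # (channels, height == width) for down_residual_0..11
	residuals = {
	    f"down_residual_{i}": np.zeros((batch, c, hw, hw), dtype=np.float16)
	    for i, (c, hw) in enumerate(down_shapes)
	}
	residuals["mid_residual"] = np.zeros((batch, 1280, 8, 8), dtype=np.float16)

	noise_pred = session.run(None, {
	    "sample": sample,
	    "timestep": timestep,
	    "encoder_hidden_states": encoder_hidden_states,
	    "image_embeds": image_embeds,
	    **residuals,
	})[0]
	```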
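
With the batch layout used above, the two halves of `noise_pred` are typically merged with classifier-free guidance before the scheduler step. The following is a minimal sketch with a random stand-in for `noise_pred`; `apply_cfg` is a hypothetical helper, and the guidance scale of 7.5 is a common SD 1.5 default, not a value taken from this model card.

```python
import numpy as np


def apply_cfg(noise_pred, guidance_scale=7.5):
    """Combine a [2, 4, H, W] batch (uncond at 0, cond at 1) into one guided prediction."""
    uncond, cond = noise_pred[0:1], noise_pred[1:2]
    # Compute in FP32 to limit FP16 rounding error, then return to the input dtype.
    guided = uncond.astype(np.float32) + guidance_scale * (
        cond.astype(np.float32) - uncond.astype(np.float32)
    )
    return guided.astype(noise_pred.dtype)


# Stand-in for the session output so the snippet runs without the model file.
noise_pred = np.random.randn(2, 4, 64, 64).astype(np.float16)
guided = apply_cfg(noise_pred)
print(guided.shape, guided.dtype)
```

The result keeps a leading batch dimension of 1, so it can be passed to a scheduler step alongside the single latent being denoised.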

	## Provenance

	- Original SD 1.5 weights: [`runwayml/stable-diffusion-v1-5`](https://huggingface.co/runwayml/stable-diffusion-v1-5). (The original RunwayML repo was removed from HF in 2024; community mirrors persist.)
	- Original IP-Adapter checkpoint: [`h94/IP-Adapter/models/ip-adapter_sd15.bin`](https://huggingface.co/h94/IP-Adapter/blob/main/models/ip-adapter_sd15.bin).