Sona Forge β€” SD 1.5 IP-Adapter UNet (ONNX FP16)

A single fused ONNX FP16 graph combining the SD 1.5 UNet with IP-Adapter image-conditioning weights baked into the cross-attention layers. Used by the Sona Forge Android app for identity-preserving avatar generation. Pair with sona-forge/clip-vit-h-14-image-fp16.

Revision 1.1.0 (2026-05-01) adds 13 optional ControlNet residual inputs (12 down-block residuals + 1 mid-block residual) so the same UNet drives both Phase 6 (IP-Adapter only β€” pass zero-filled residuals or rely on the residual-aware export's pass-through-when-empty semantics) and Phase 7 (IP-Adapter + ControlNet Canny β€” pass the residuals from sona-forge/sd15-controlnet-canny-fp16).

ONNX signature

| Input | Shape | dtype | Notes |
|---|---|---|---|
| `sample` | `[batch, 4, 64, 64]` | FP16 | latent state at step t |
| `timestep` | `[batch]` | FP16 | scheduler timestep |
| `encoder_hidden_states` | `[batch, 77, 768]` | FP16 | text embeds (e.g. from the CLIP text encoder) |
| `image_embeds` | `[batch, num_images, 1024]` | FP16 | rank-3 per diffusers 0.27.2's `MultiIPAdapterImageProjection`; the on-device path uses `num_images=1` |
| `down_residual_0`..`11` | 12 tensors (shapes below) | FP16 | ControlNet down-block residuals (canonical SD 1.5 shapes); pass zeros for Phase-6-only inference |
| `mid_residual` | `[batch, 1280, 8, 8]` | FP16 | ControlNet mid-block residual; pass zeros for Phase-6-only inference |

| Output | Shape | dtype |
|---|---|---|
| `noise_pred` | `[batch, 4, 64, 64]` | FP16 |

Down-block residual canonical shapes (SD 1.5): [batch, 320, 64, 64] Γ—3, [batch, 320, 32, 32], [batch, 640, 32, 32] Γ—2, [batch, 640, 16, 16], [batch, 1280, 16, 16] Γ—2, [batch, 1280, 8, 8] Γ—3.
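For Phase-6-only inference the residual inputs still need zero-filled feeds at these canonical shapes. A minimal sketch of building them (the `zero_residuals` helper name is ours, not part of the model):

```python
import numpy as np

# Canonical SD 1.5 down-block residual shapes, matching the list above.
DOWN_SHAPES = [
    (320, 64, 64), (320, 64, 64), (320, 64, 64),
    (320, 32, 32),
    (640, 32, 32), (640, 32, 32),
    (640, 16, 16),
    (1280, 16, 16), (1280, 16, 16),
    (1280, 8, 8), (1280, 8, 8), (1280, 8, 8),
]

def zero_residuals(batch):
    """Return a feed dict of zero-filled FP16 residuals for a given batch size."""
    feeds = {f"down_residual_{i}": np.zeros((batch, *shape), dtype=np.float16)
             for i, shape in enumerate(DOWN_SHAPES)}
    feeds["mid_residual"] = np.zeros((batch, 1280, 8, 8), dtype=np.float16)
    return feeds
```

The resulting dict can be merged into the `session.run` feeds alongside `sample`, `timestep`, `encoder_hidden_states`, and `image_embeds`.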

How it was made

Pinned conversion environment:

| Package | Version |
|---|---|
| diffusers | 0.27.2 |
| transformers | 4.40.0 |
| torch | 2.3.0 |
| onnx | 1.16.0 |
| onnxruntime | 1.18.0 |
| numpy | <2 (ABI compat) |
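Since byte-identical re-export depends on these exact versions, the pins can be asserted before converting. A small sketch (the `check_pins` helper is ours; numpy is range-pinned and easiest to check separately via `numpy.__version__`):

```python
import importlib.metadata as md

# Exact pins from the table above (numpy's "<2" range pin is handled separately).
PINS = {"diffusers": "0.27.2", "transformers": "4.40.0",
        "torch": "2.3.0", "onnx": "1.16.0", "onnxruntime": "1.18.0"}

def check_pins(get_version=md.version):
    # Returns {package: installed_version} for every mismatch; empty dict == clean.
    return {pkg: get_version(pkg) for pkg, want in PINS.items()
            if get_version(pkg) != want}
```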

Conversion sequence:

  1. Load runwayml/stable-diffusion-v1-5 UNet at FP16.
  2. Download h94/IP-Adapter's models/ip-adapter_sd15.bin checkpoint (image-projection MLP + cross-attn K/V).
  3. Apply weights via unet._load_ip_adapter_weights([state_dict]) (the diffusers 0.27.2 internal β€” public unet.load_ip_adapter() doesn't exist on UNet2DConditionModel in this version).
  4. Set attn_processor.scale = [1.0, ...] on each IPAdapterAttnProcessor / IPAdapterAttnProcessor2_0.
  5. Wrap the UNet so added_cond_kwargs={"image_embeds": [image_embeds]} is positional and down_block_additional_residuals / mid_block_additional_residual flow through to the UNet forward call. Then torch.onnx.export at opset 17 with FP16 dummy inputs at canonical SD 1.5 shapes.
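The wrapping in step 5 can be sketched roughly as follows. `ExportWrapper` is our illustrative name, the kwargs are those named in the step above, and this is a sketch rather than the exact conversion script:

```python
import torch

class ExportWrapper(torch.nn.Module):
    """Flattens the UNet's keyword arguments into positional ONNX inputs."""
    def __init__(self, unet):
        super().__init__()
        self.unet = unet

    def forward(self, sample, timestep, encoder_hidden_states, image_embeds,
                *residuals):  # 12 down-block residuals followed by the mid residual
        return self.unet(
            sample, timestep, encoder_hidden_states,
            added_cond_kwargs={"image_embeds": [image_embeds]},
            down_block_additional_residuals=list(residuals[:12]),
            mid_block_additional_residual=residuals[12],
            return_dict=False,
        )[0]

# torch.onnx.export(ExportWrapper(unet), (sample, t, text, img, *down, mid),
#                   "model.onnx", opset_version=17, ...)
```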

Re-running the conversion from the same pinned environment produces byte-identical output (same sha256). Conversion artefacts include a spike report with full TracerWarning output, validation metrics, and round-trip checks.

Files

| File | Size | sha256 | Revision |
|---|---|---|---|
| model.onnx | 1,764,924,739 B (1683 MB) | `a0287f119d85b8028d9673850322247b5978ed9b504077bc04d433f4c9fadcb7` | 1.1.0 (current; residual-accepting, Phase 7) |
| model.onnx | 1,764,923,048 B (1683 MB) | `29e749b2c8dfdd6953a9165eca42e11489f8f90d43fac66c333cfdf6aae0014f` | 1.0.0 (Phase 6; superseded by 1.1.0; signature was the 4-input subset) |

No external-data sidecar β€” graph + weights fit under the 2 GB protobuf single-file limit.

The 1.1.0 export is a strict superset of the 1.0.0 input signature: zero-filled residual inputs reproduce the 1.0.0 numerical output (verified during the Phase 7 spike).

Licence

The fused ONNX is a composite of two upstream artefacts:

  - runwayml/stable-diffusion-v1-5 (UNet weights), licensed under CreativeML OpenRAIL-M
  - h94/IP-Adapter (image-projection and cross-attention weights), licensed under Apache 2.0

The composite is distributed under the most restrictive of these terms, CreativeML OpenRAIL-M.

Memory footprint

ORT's CPU EP promotes FP16 to FP32 at session load (the Phase 6 spike measured ~3.5 GB resident for this UNet alone). On Android, NNAPI / XNNPACK execute FP16 natively, so the on-device working set is closer to the FP16 disk size plus activation buffers. Sona Forge gates this pack to Tier B+ devices (β‰₯ 7 GB total RAM).

Usage

import onnxruntime as ort
import numpy as np

session = ort.InferenceSession("model.onnx", providers=["CPUExecutionProvider"])

# CFG batch=2; uncond at index 0, cond at index 1.
sample = np.random.randn(2, 4, 64, 64).astype(np.float16)
timestep = np.array([999.0, 999.0], dtype=np.float16)
encoder_hidden_states = np.random.randn(2, 77, 768).astype(np.float16)

# image_embeds: zeros for uncond branch, scaled CLIP embeds for cond branch.
clip_emb = np.random.randn(1, 1024).astype(np.float16)  # one reference image
ip_scale = 0.7
image_embeds = np.stack([
    np.zeros_like(clip_emb),
    clip_emb * ip_scale,
]).astype(np.float16)  # shape (2, 1, 1024)

# Revision 1.1.0 also expects the 13 ControlNet residual inputs; for
# Phase-6-only (IP-Adapter without ControlNet) inference, feed zero-filled
# tensors at the canonical SD 1.5 shapes listed above.
down_shapes = ([(320, 64, 64)] * 3 + [(320, 32, 32)] + [(640, 32, 32)] * 2
               + [(640, 16, 16)] + [(1280, 16, 16)] * 2 + [(1280, 8, 8)] * 3)
feeds = {
    "sample": sample,
    "timestep": timestep,
    "encoder_hidden_states": encoder_hidden_states,
    "image_embeds": image_embeds,
}
for i, shape in enumerate(down_shapes):
    feeds[f"down_residual_{i}"] = np.zeros((2, *shape), dtype=np.float16)
feeds["mid_residual"] = np.zeros((2, 1280, 8, 8), dtype=np.float16)

noise_pred = session.run(None, feeds)[0]
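The two batch rows then combine into the guided prediction in the usual classifier-free-guidance way. A short sketch (the `cfg_combine` helper name and the guidance scale value are illustrative, not part of the model):

```python
import numpy as np

def cfg_combine(noise_pred, guidance_scale=7.5):
    # Row 0 is the unconditional branch, row 1 the conditional branch,
    # matching the batch layout in the snippet above.
    uncond, cond = noise_pred[0], noise_pred[1]
    return uncond + guidance_scale * (cond - uncond)
```

The scheduler then steps the latent with this guided prediction as usual.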

