# Sona Forge – SD 1.5 IP-Adapter UNet (ONNX FP16)
A single fused ONNX FP16 graph combining the SD 1.5 UNet with IP-Adapter image-conditioning weights baked into the cross-attention layers. Used by the Sona Forge Android app for identity-preserving avatar generation. Pair with `sona-forge/clip-vit-h-14-image-fp16`.
Revision 1.1.0 (2026-05-01) adds 13 optional ControlNet residual inputs (12 down-block residuals + 1 mid-block residual), so the same UNet drives both Phase 6 (IP-Adapter only: pass zero-filled residuals or rely on the residual-aware export's pass-through-when-empty semantics) and Phase 7 (IP-Adapter + ControlNet Canny: pass the residuals from `sona-forge/sd15-controlnet-canny-fp16`).
## ONNX shape
| Input | Shape | dtype | Notes |
|---|---|---|---|
| `sample` | `[batch, 4, 64, 64]` | FP16 | latent state at step t |
| `timestep` | `[batch]` | FP16 | scheduler timestep |
| `encoder_hidden_states` | `[batch, 77, 768]` | FP16 | text embeds (e.g. from the CLIP text encoder) |
| `image_embeds` | `[batch, num_images, 1024]` | FP16 | rank-3 per diffusers 0.27.2's `MultiIPAdapterImageProjection`. On-device path uses `num_images=1`. |
| `down_residual_0..11` | 12 tensors | FP16 | ControlNet down-block residuals (canonical SD 1.5 shapes). Pass zeros for Phase-6-only inference. |
| `mid_residual` | `[batch, 1280, 8, 8]` | FP16 | ControlNet mid-block residual. Pass zeros for Phase-6-only inference. |
| Output | Shape | dtype |
|---|---|---|
| `noise_pred` | `[batch, 4, 64, 64]` | FP16 |
Down-block residual canonical shapes (SD 1.5):
`[batch, 320, 64, 64]` ×3, `[batch, 320, 32, 32]`, `[batch, 640, 32, 32]` ×2, `[batch, 640, 16, 16]`, `[batch, 1280, 16, 16]` ×2, `[batch, 1280, 8, 8]` ×3.
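For Phase-6-only inference the residual inputs can be zero-filled per the shapes above. A minimal helper sketch (the function name is illustrative; the input names follow this card's table, with `batch=2` matching the CFG layout used in Usage):

```python
import numpy as np

def zero_residual_feeds(batch=2, dtype=np.float16):
    """Build zero-filled ControlNet residual feeds at canonical SD 1.5 shapes."""
    down_shapes = (
        [(320, 64, 64)] * 3 + [(320, 32, 32)] +
        [(640, 32, 32)] * 2 + [(640, 16, 16)] +
        [(1280, 16, 16)] * 2 + [(1280, 8, 8)] * 3
    )
    feeds = {
        f"down_residual_{i}": np.zeros((batch, *shape), dtype=dtype)
        for i, shape in enumerate(down_shapes)
    }
    feeds["mid_residual"] = np.zeros((batch, 1280, 8, 8), dtype=dtype)
    return feeds
```

Merge the returned dict into the `session.run` feeds when the runtime requires every declared input.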
## How it was made
Pinned conversion environment:
| Package | Version |
|---|---|
| diffusers | 0.27.2 |
| transformers | 4.40.0 |
| torch | 2.3.0 |
| onnx | 1.16.0 |
| onnxruntime | 1.18.0 |
| numpy | <2 (ABI compat) |
Conversion sequence:
- Load the `runwayml/stable-diffusion-v1-5` UNet at FP16.
- Download `h94/IP-Adapter`'s `models/ip-adapter_sd15.bin` checkpoint (image-projection MLP + cross-attn K/V).
- Apply the weights via `unet._load_ip_adapter_weights([state_dict])` (a diffusers 0.27.2 internal; the public `unet.load_ip_adapter()` doesn't exist on `UNet2DConditionModel` in this version).
- Set `attn_processor.scale = [1.0, ...]` on each `IPAdapterAttnProcessor` / `IPAdapterAttnProcessor2_0`.
- Wrap the UNet so `added_cond_kwargs={"image_embeds": [image_embeds]}` is positional and `down_block_additional_residuals` / `mid_block_additional_residual` flow through to the UNet forward call. Then `torch.onnx.export` at opset 17 with FP16 dummy inputs at canonical SD 1.5 shapes.
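The wrapper step can be sketched roughly as follows (a hedged illustration, not the actual export script; the class name and residual-argument handling are assumptions):

```python
import torch

class UNetExportWrapper(torch.nn.Module):
    """Flattens keyword-only UNet conditioning into positional ONNX inputs."""

    def __init__(self, unet):
        super().__init__()
        self.unet = unet

    def forward(self, sample, timestep, encoder_hidden_states, image_embeds,
                *residuals):
        # residuals: 12 down-block tensors followed by 1 mid-block tensor.
        down = list(residuals[:12]) if len(residuals) >= 12 else None
        mid = residuals[12] if len(residuals) == 13 else None
        return self.unet(
            sample,
            timestep,
            encoder_hidden_states=encoder_hidden_states,
            added_cond_kwargs={"image_embeds": [image_embeds]},
            down_block_additional_residuals=down,
            mid_block_additional_residual=mid,
            return_dict=False,
        )[0]
```

`torch.onnx.export` then traces an instance of this wrapper with the FP16 dummy inputs.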
Re-running the conversion from the same pinned environment produces byte-identical output (same sha256). Conversion artefacts include a spike report with full TracerWarning output, validation metrics, and round-trip checks.
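A re-run can be checked against the published digests by streaming the file through SHA-256; a small helper (the function name is illustrative):

```python
import hashlib

def sha256_of(path, chunk_size=1 << 20):
    """Hash a file in 1 MiB chunks without loading it fully into memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for block in iter(lambda: f.read(chunk_size), b""):
            digest.update(block)
    return digest.hexdigest()
```

Compare the result against the sha256 column in the Files table.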
## Files
| File | Size | sha256 | Revision |
|---|---|---|---|
| `model.onnx` | 1,764,924,739 B (1683 MB) | `a0287f119d85b8028d9673850322247b5978ed9b504077bc04d433f4c9fadcb7` | 1.1.0 (current: residual-accepting, Phase 7) |
| `model.onnx` | | `29e749b2c8dfdd6953a9165eca42e11489f8f90d43fac66c333cfdf6aae0014f` | 1.0.0 (Phase 6; superseded by 1.1.0; signature was the 4-input subset) |
No external-data sidecar: the graph and weights fit under the 2 GB protobuf single-file limit.
The 1.1.0 export is a strict superset of the 1.0.0 input signature: zero-filled residual inputs reproduce the 1.0.0 numerical output (verified during the Phase 7 spike).
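That equivalence can be spot-checked by running both revisions on identical inputs (1.1.0 with zero-filled residuals) and comparing outputs; a minimal comparison helper, with the tolerance bookkeeping done in FP32 to avoid FP16 rounding in the metric itself:

```python
import numpy as np

def max_abs_diff(a, b):
    """Largest element-wise difference between two FP16 tensors, in FP32."""
    return float(np.max(np.abs(a.astype(np.float32) - b.astype(np.float32))))
```

A byte-identical graph should give `max_abs_diff == 0.0` for the zero-residual case.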
## Licence
The fused ONNX is a composite of two upstream artefacts:
- SD 1.5 base UNet: CreativeML OpenRAIL-M (use-based restrictions; permits redistribution and modification).
- IP-Adapter weights: Apache-2.0.

The composite is distributed under the more restrictive of the two, CreativeML OpenRAIL-M.
## Memory footprint
The ORT CPU EP promotes FP16 to FP32 at session load (the Phase 6 spike measured ~3.5 GB resident for this UNet alone). On Android, NNAPI / XNNPACK execute FP16 natively, so the on-device working set is closer to the FP16 disk size plus activation buffers. Sona Forge gates this pack to Tier B+ devices (≥ 7 GB total RAM).
## Usage
```python
import numpy as np
import onnxruntime as ort

session = ort.InferenceSession("model.onnx", providers=["CPUExecutionProvider"])

# CFG batch=2; uncond at index 0, cond at index 1.
sample = np.random.randn(2, 4, 64, 64).astype(np.float16)
timestep = np.array([999.0, 999.0], dtype=np.float16)
encoder_hidden_states = np.random.randn(2, 77, 768).astype(np.float16)

# image_embeds: zeros for the uncond branch, scaled CLIP embeds for the cond branch.
clip_emb = np.random.randn(1, 1024).astype(np.float16)  # one reference image
ip_scale = 0.7
image_embeds = np.stack([
    np.zeros_like(clip_emb),
    clip_emb * ip_scale,
]).astype(np.float16)  # shape (2, 1, 1024)

# Phase 6 path: down_residual_0..11 / mid_residual are omitted here; pass
# zero-filled tensors at the canonical shapes above if every input must be fed.
noise_pred = session.run(None, {
    "sample": sample,
    "timestep": timestep,
    "encoder_hidden_states": encoder_hidden_states,
    "image_embeds": image_embeds,
})[0]
```
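The two-row output can then be combined with classifier-free guidance; a minimal sketch (the default guidance scale here is an assumption, not a documented app setting):

```python
import numpy as np

def apply_cfg(noise_pred, guidance_scale=7.5):
    """Combine a [uncond, cond] batch-2 prediction: uncond + s * (cond - uncond)."""
    uncond = noise_pred[0].astype(np.float32)
    cond = noise_pred[1].astype(np.float32)
    return (uncond + guidance_scale * (cond - uncond)).astype(np.float16)
```

The guided prediction then feeds the scheduler step as usual.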
## Provenance
- Original SD 1.5 weights: `runwayml/stable-diffusion-v1-5`. (The official repo was removed from HF in early 2024; community mirrors persist.)
- Original IP-Adapter checkpoint: `h94/IP-Adapter`, `models/ip-adapter_sd15.bin`.
## Model tree for sona-forge/sd15-ipadapter-fp16

Base model: `h94/IP-Adapter`