---
license: creativeml-openrail-m
library_name: onnx
tags:
- onnx
- stable-diffusion
- ip-adapter
- text-to-image
- mobile
inference: false
base_model:
- runwayml/stable-diffusion-v1-5
- h94/IP-Adapter
---

# Sona Forge: SD 1.5 IP-Adapter UNet (ONNX FP16)

A single fused ONNX FP16 graph combining the SD 1.5 UNet with IP-Adapter image-conditioning weights baked into the cross-attention layers. Used by the Sona Forge Android app for identity-preserving avatar generation. Pair with [`sona-forge/clip-vit-h-14-image-fp16`](https://huggingface.co/sona-forge/clip-vit-h-14-image-fp16).

**Revision 1.1.0 (2026-05-01)** adds 13 optional ControlNet residual inputs (12 down-block residuals + 1 mid-block residual) so the same UNet drives both phases:

- Phase 6 (IP-Adapter only): pass zero-filled residuals, or rely on the residual-aware export's pass-through-when-empty semantics.
- Phase 7 (IP-Adapter + ControlNet Canny): pass the residuals from [`sona-forge/sd15-controlnet-canny-fp16`](https://huggingface.co/sona-forge/sd15-controlnet-canny-fp16).

## ONNX shape

| Input | Shape | dtype | Notes |
|---|---|---|---|
| `sample` | `[batch, 4, 64, 64]` | FP16 | latent state at step t |
| `timestep` | `[batch]` | FP16 | scheduler timestep |
| `encoder_hidden_states` | `[batch, 77, 768]` | FP16 | text embeddings (e.g. from the CLIP text encoder) |
| `image_embeds` | `[batch, num_images, 1024]` | FP16 | **rank-3** per diffusers 0.27.2's `MultiIPAdapterImageProjection`. On-device path uses `num_images=1`. |
| `down_residual_0..11` | 12 tensors (shapes listed below) | FP16 | ControlNet down-block residuals (canonical SD 1.5 shapes). Pass zeros for Phase-6-only inference. |
| `mid_residual` | `[batch, 1280, 8, 8]` | FP16 | ControlNet mid-block residual. Pass zeros for Phase-6-only inference. |

| Output | Shape | dtype |
|---|---|---|
| `noise_pred` | `[batch, 4, 64, 64]` | FP16 |

Down-block residual canonical shapes (SD 1.5):
`[batch, 320, 64, 64]` ×3, `[batch, 320, 32, 32]`, `[batch, 640, 32, 32]` ×2, `[batch, 640, 16, 16]`, `[batch, 1280, 16, 16]` ×2, `[batch, 1280, 8, 8]` ×3.
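For Phase-6-only inference the residual inputs are all zeros, so they can be generated from the shape list above. The following is a minimal sketch, assuming the 13 graph inputs are named `down_residual_0` through `down_residual_11` plus `mid_residual` (expanded from the table's `down_residual_0..11` shorthand) and that the listed shapes map to those indices in order. `zero_residuals` is a hypothetical helper, not something shipped with this repository.

```python
import numpy as np

# (channels, spatial size) for each down-block residual, in the order listed
# above. Mapping list position to input index is an assumption.
DOWN_RESIDUAL_SHAPES = [
    (320, 64), (320, 64), (320, 64),
    (320, 32), (640, 32), (640, 32),
    (640, 16), (1280, 16), (1280, 16),
    (1280, 8), (1280, 8), (1280, 8),
]
MID_RESIDUAL_SHAPE = (1280, 8)


def zero_residuals(batch: int) -> dict:
    """Build zero-filled FP16 ControlNet residual inputs for the given batch size."""
    feed = {
        f"down_residual_{i}": np.zeros((batch, c, s, s), dtype=np.float16)
        for i, (c, s) in enumerate(DOWN_RESIDUAL_SHAPES)
    }
    c, s = MID_RESIDUAL_SHAPE
    feed["mid_residual"] = np.zeros((batch, c, s, s), dtype=np.float16)
    return feed
```

The returned dictionary can be merged into the input feed passed to `session.run`. Checking the names against `session.get_inputs()` before relying on them is advisable.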

## How it was made

Pinned conversion environment:

| Package | Version |
|---|---|
| diffusers | 0.27.2 |
| transformers | 4.40.0 |
| torch | 2.3.0 |
| onnx | 1.16.0 |
| onnxruntime | 1.18.0 |
| numpy | <2 (ABI compat) |

Conversion sequence:
1. Load the `runwayml/stable-diffusion-v1-5` UNet at FP16.
2. Download the `models/ip-adapter_sd15.bin` checkpoint from `h94/IP-Adapter` (image-projection MLP + cross-attn K/V).
3. Apply the weights via `unet._load_ip_adapter_weights([state_dict])`. This is a diffusers 0.27.2 internal; the public `unet.load_ip_adapter()` does not exist on `UNet2DConditionModel` in this version.
4. Set `attn_processor.scale = [1.0, ...]` on each `IPAdapterAttnProcessor` / `IPAdapterAttnProcessor2_0`.
5. Wrap the UNet so that `image_embeds` becomes a positional input (forwarded as `added_cond_kwargs={"image_embeds": [image_embeds]}`) and `down_block_additional_residuals` / `mid_block_additional_residual` flow through to the UNet forward call. Then run `torch.onnx.export` at opset 17 with FP16 dummy inputs at canonical SD 1.5 shapes.

Re-running the conversion from the same pinned environment produces byte-identical output (same sha256). Conversion artefacts include a spike report with full TracerWarning output, validation metrics, and round-trip checks.
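The wrapper described in step 5 could look roughly like the sketch below. It is an illustration under stated assumptions, not the actual conversion script: the class name `IPAdapterUNetWrapper` is hypothetical, and the input order (12 down-block residuals followed by the mid-block residual) is assumed. The keyword arguments passed to the UNet are those named in step 5, plus `return_dict=False` so the call returns a plain tuple.

```python
from torch import nn


class IPAdapterUNetWrapper(nn.Module):
    """Hypothetical export wrapper: flat tensor inputs in, noise prediction out."""

    def __init__(self, unet: nn.Module):
        super().__init__()
        self.unet = unet

    def forward(self, sample, timestep, encoder_hidden_states, image_embeds, *residuals):
        # Assumed order: 12 down-block residuals, then the mid-block residual.
        *down_residuals, mid_residual = residuals
        return self.unet(
            sample,
            timestep,
            encoder_hidden_states,
            # The IP-Adapter projection expects a list with one tensor per adapter.
            added_cond_kwargs={"image_embeds": [image_embeds]},
            down_block_additional_residuals=tuple(down_residuals),
            mid_block_additional_residual=mid_residual,
            return_dict=False,
        )[0]


# Export sketch; every argument except opset 17 is an assumption:
# torch.onnx.export(IPAdapterUNetWrapper(unet), dummy_inputs, "model.onnx",
#                   input_names=input_names, output_names=["noise_pred"],
#                   opset_version=17)
```

Flattening the residuals into separate positional tensors gives each one its own named graph input, which is what the input table in this card describes.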

## Files

| File | Size | sha256 | Revision |
|---|---|---|---|
| `model.onnx` | 1,764,924,739 B (1683 MiB) | `a0287f119d85b8028d9673850322247b5978ed9b504077bc04d433f4c9fadcb7` | 1.1.0 (current; residual-accepting, Phase 7) |
| ~~`model.onnx`~~ | ~~1,764,923,048 B (1683 MiB)~~ | ~~`29e749b2c8dfdd6953a9165eca42e11489f8f90d43fac66c333cfdf6aae0014f`~~ | 1.0.0 (Phase 6; superseded by 1.1.0; signature was the 4-input subset) |

No external-data sidecar: graph and weights fit under the 2 GB protobuf single-file limit.

The 1.1.0 export is a strict superset of the 1.0.0 input signature: zero-filled residual inputs reproduce the 1.0.0 numerical output (verified during the Phase 7 spike).
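A downloaded copy can be checked against the sha256 in the table above. The sketch below uses only the Python standard library; the helper name `sha256_of` and the 1 MiB chunk size are illustrative choices, and the expected digest is the 1.1.0 value copied from the table.

```python
import hashlib

# sha256 of the current (1.1.0) model.onnx, copied from the Files table.
EXPECTED_SHA256 = "a0287f119d85b8028d9673850322247b5978ed9b504077bc04d433f4c9fadcb7"


def sha256_of(path: str, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so the ~1.7 GB model is never fully held in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

Comparing `sha256_of("model.onnx")` with `EXPECTED_SHA256` then confirms whether the local file matches the published revision.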

## Licence

The fused ONNX is a composite of two upstream artefacts:

- SD 1.5 base UNet: [CreativeML OpenRAIL-M](https://huggingface.co/spaces/CompVis/stable-diffusion-license) (use-based restrictions; permits redistribution and modification).
- IP-Adapter weights: [Apache-2.0](https://github.com/tencent-ailab/IP-Adapter/blob/main/LICENSE).

The composite is distributed under the most restrictive of these terms: **CreativeML OpenRAIL-M**.

## Memory footprint

The ONNX Runtime CPU execution provider promotes FP16 to FP32 at session load; the Phase 6 spike measured ~3.5 GB resident for this UNet alone. On Android, NNAPI / XNNPACK execute FP16 natively, so the on-device working set is closer to the FP16 disk size plus activation buffers. Sona Forge gates this pack to Tier B+ devices (≥ 7 GB total RAM).
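The ~3.5 GB figure is consistent with simple arithmetic on the file size. As a rough sketch, assuming the file consists almost entirely of FP16 weights (graph structure is treated as negligible), promoting every 2-byte value to 4 bytes doubles the footprint:

```python
# model.onnx size in bytes for revision 1.1.0, from the Files table.
FP16_FILE_BYTES = 1_764_924_739

# FP16 -> FP32 promotion doubles the bytes per weight (2 -> 4).
fp32_weight_bytes = FP16_FILE_BYTES * 2

print(f"FP32 weights: {fp32_weight_bytes / 1e9:.2f} GB")  # FP32 weights: 3.53 GB
```

This estimate covers weights only; activation buffers and runtime overhead come on top of it.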

## Usage

```python
import onnxruntime as ort
import numpy as np

session = ort.InferenceSession("model.onnx", providers=["CPUExecutionProvider"])

# CFG batch=2; uncond at index 0, cond at index 1.
sample = np.random.randn(2, 4, 64, 64).astype(np.float16)
timestep = np.array([999.0, 999.0], dtype=np.float16)
encoder_hidden_states = np.random.randn(2, 77, 768).astype(np.float16)

# image_embeds: zeros for uncond branch, scaled CLIP embeds for cond branch.
clip_emb = np.random.randn(1, 1024).astype(np.float16)  # one reference image
ip_scale = 0.7
image_embeds = np.stack([
    np.zeros_like(clip_emb),
    clip_emb * ip_scale,
]).astype(np.float16)  # shape (2, 1, 1024)

# Phase-6-only inference on revision 1.1.0: zero-filled ControlNet residuals.
# (channels, spatial size) per down-block residual, from the shape list above.
# Input names follow the table; confirm them with session.get_inputs().
down_shapes = (
    [(320, 64)] * 3 + [(320, 32)] + [(640, 32)] * 2
    + [(640, 16)] + [(1280, 16)] * 2 + [(1280, 8)] * 3
)
residuals = {
    f"down_residual_{i}": np.zeros((2, c, s, s), dtype=np.float16)
    for i, (c, s) in enumerate(down_shapes)
}
residuals["mid_residual"] = np.zeros((2, 1280, 8, 8), dtype=np.float16)

noise_pred = session.run(None, {
    "sample": sample,
    "timestep": timestep,
    "encoder_hidden_states": encoder_hidden_states,
    "image_embeds": image_embeds,
    **residuals,
})[0]
```
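With the batch laid out as uncond at index 0 and cond at index 1, the two halves of `noise_pred` are typically merged with classifier-free guidance before the scheduler step. The sketch below shows the standard CFG formula; the helper name `apply_cfg` and the default `guidance_scale=7.5` are assumptions, not values taken from the Sona Forge app.

```python
import numpy as np


def apply_cfg(noise_pred: np.ndarray, guidance_scale: float = 7.5) -> np.ndarray:
    """Combine a [uncond, cond] batch into a single guided noise prediction."""
    # Work in FP32 so the scaled difference does not lose precision in FP16.
    uncond = noise_pred[0:1].astype(np.float32)
    cond = noise_pred[1:2].astype(np.float32)
    return uncond + guidance_scale * (cond - uncond)
```

The result has shape `(1, 4, 64, 64)` and is what a scheduler would consume to compute the next latent.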

## Provenance

- Original SD 1.5 weights: [`runwayml/stable-diffusion-v1-5`](https://huggingface.co/runwayml/stable-diffusion-v1-5). (The official repo was removed from HF in 2024; community mirrors persist.)
- Original IP-Adapter checkpoint: [`h94/IP-Adapter/models/ip-adapter_sd15.bin`](https://huggingface.co/h94/IP-Adapter/blob/main/models/ip-adapter_sd15.bin).