---
license: apache-2.0
library_name: coreml
base_model: google/siglip2-base-patch16-224
tags:
- siglip2
- siglip
- core-ml
- coreml
- apple-silicon
- ane
- neural-engine
- image-encoder
- image-embedding
- semantic-search
- on-device
language:
- en
pipeline_tag: zero-shot-image-classification
---
# ViT-B-16-SigLIP2: Image Encoder, Apple Core ML
Apple Neural Engine (ANE) acceleration for SigLIP2's image branch on M-series
Macs and iOS 17+ devices. Apple's [`apple/coreml-mobileclip`](https://huggingface.co/apple/coreml-mobileclip)
ships the older MobileCLIP family in Core ML form but stops there; this repo
fills the gap for the higher-accuracy SigLIP2.
**Numerically matches PyTorch at fp16** (cosine 1.0000 vs the `open_clip` reference,
verified end-to-end; see "Verifying the conversion yourself" below). Three
precision tiers ship in this repo so you can pick the right size/quality
trade-off.

All `.mlpackage` files are stored via Git LFS: clone with
`git lfs install && git clone …`, or download individual files via the HF Hub web UI.
## Quick start
```python
import coremltools as ct
from PIL import Image
model = ct.models.MLModel(
    "ViT-B-16-SigLIP2_image_8bit.mlpackage",
    compute_units=ct.ComputeUnit.CPU_AND_NE,  # ANE-accelerated
)
# Resize to 224×224 with a quality resampler. Core ML's image input bakes in
# the [-1, 1] normalization but does NOT do its own resize.
img = Image.open("photo.jpg").convert("RGB").resize((224, 224), Image.BICUBIC)
out = model.predict({"image": img})
embedding = out["embedding"][0] # (768,) float32, L2-normalized
```
The model **outputs L2-normalized embeddings** so cosine similarity is just a
dot product. **Channel normalization** (the [-1, 1] mapping SigLIP2 expects)
is baked into the Core ML graph; you only need to deliver a 224×224 RGB image.
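Because the embeddings are unit-length, the denominator of the cosine formula is 1, which is why a plain dot product suffices. A minimal pure-Python sketch of that identity, using made-up 3-dimensional vectors in place of real 768-dimensional embeddings (the helper names `l2_normalize`, `dot`, and `cosine` are illustrative and not part of this repo):

```python
import math

def l2_normalize(v):
    """Scale a vector to unit L2 norm."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def dot(a, b):
    """Plain dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def cosine(a, b):
    """Textbook cosine similarity for vectors of any length."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))

# Toy stand-ins for two embeddings (real ones have 768 dimensions).
raw_a = [3.0, 4.0, 0.0]
raw_b = [1.0, 2.0, 2.0]

unit_a = l2_normalize(raw_a)
unit_b = l2_normalize(raw_b)

# For unit vectors, the dot product equals the full cosine formula.
assert math.isclose(dot(unit_a, unit_b), cosine(raw_a, raw_b))
print(dot(unit_a, unit_b))
```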
## Verifying the conversion yourself
Don't trust the cosine claims; reproduce them in 30 seconds:
```python
import open_clip, torch, numpy as np, coremltools as ct
from PIL import Image, ImageDraw
pt_model, _, pt_pre = open_clip.create_model_and_transforms(
    "ViT-B-16-SigLIP2", pretrained="webli")
pt_model.eval()

# Synthetic test image: a green disc on a dark gray background
img = Image.new("RGB", (224, 224), (40, 40, 40))
ImageDraw.Draw(img).ellipse([40, 40, 184, 184], fill=(0, 255, 0))

# PyTorch reference
with torch.no_grad():
    pt_emb = pt_model.visual(pt_pre(img).unsqueeze(0))
    pt_emb = (pt_emb / pt_emb.norm(dim=-1, keepdim=True))[0].numpy()

# Core ML
cm = ct.models.MLModel("ViT-B-16-SigLIP2_image_fp16.mlpackage",
                       compute_units=ct.ComputeUnit.CPU_AND_NE)
cm_emb = next(iter(cm.predict({"image": img}).values()))[0].astype(np.float32)
cm_emb /= np.linalg.norm(cm_emb)
print(f"cosine: {float(np.dot(pt_emb, cm_emb)):.6f}")  # ≈ 1.000000 for fp16
```
## Available variants
| File | Size | Cosine vs PyTorch | Best for |
|---|---|---|---|
| `ViT-B-16-SigLIP2_image_fp16.mlpackage` | 185 MB | **1.0000** | Reference / benchmark reproduction |
| `ViT-B-16-SigLIP2_image_8bit.mlpackage` | 93 MB | **0.9942** | **Default**: near-perfect, 2× smaller |
| `ViT-B-16-SigLIP2_image_6bit.mlpackage` | 70 MB | 0.9629 | When download size matters more than ranking precision |

Cosine measured on a synthetic test image; rankings on real photo collections
are essentially identical between fp16 and 8-bit. 6-bit shows minor reordering
of close-scoring results. 4-bit was tested (0.79 cosine) and deliberately
excluded as too lossy for retrieval ranking.
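The file sizes in the table track bits per weight fairly closely. A rough back-of-the-envelope sketch, assuming every weight is palettized and that lookup tables and package metadata are negligible (both are simplifications, and `estimated_size_mb` is an illustrative helper, not part of the conversion script):

```python
def estimated_size_mb(fp16_size_mb: float, bits_per_weight: int) -> float:
    """Scale the fp16 package size by the ratio of bits per weight.

    Assumption: all weights are quantized and lookup-table / metadata
    overhead is ignored, so real packages will differ slightly.
    """
    return fp16_size_mb * bits_per_weight / 16

FP16_SIZE_MB = 185  # fp16 package size taken from the table above

for bits in (16, 8, 6):
    print(f"{bits:>2}-bit: ~{estimated_size_mb(FP16_SIZE_MB, bits):.1f} MB")
```

The estimates for 8-bit and 6-bit come out at about 92.5 MB and 69.4 MB, in line with the 93 MB and 70 MB packages shipped here.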
## Performance (Apple M-series, CPU+ANE)
Measured on a single image at a time; published Core ML packages (Apple's, and
the ones in this repo) are compiled at batch=1. See "Limitations" for why.

| Variant | Throughput | vs PyTorch+MPS baseline |
|---|---|---|
| fp16 Core ML | ~133 img/s | **2.6× faster** |
| 8-bit Core ML | ~138 img/s | **2.7× faster** |
| 6-bit Core ML | ~139 img/s | **2.7× faster** |
| (PyTorch+MPS reference) | ~51 img/s | 1.0× |

End-to-end with disk loading + parallel PIL preprocessing, throughput caps at
~110 img/s for the 8-bit variant: PIL decode becomes the bottleneck before the
ANE does, which is why throughput differences between palettization levels
vanish in practice.
**Expected ANE utilization is low (~5%)** when you profile this in Instruments.
That's not a bug: the per-call dispatch overhead (Python → Objective-C →
Apple's compute scheduler) dominates the actual ANE compute time per inference.
Throughput still beats the PyTorch+MPS baseline because the dispatch + ANE
pipeline keeps moving. The fix would be a batched model, which Apple's stack
doesn't currently support for ViT-B at runtime (see Limitations).
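The throughput figures above are per-image rates at batch=1. A minimal sketch of how such a number can be measured follows; it is not the benchmark code used for the table. `images_per_second` and `fake_predict` are hypothetical names, and the stand-in callable would be swapped for something like `lambda img: model.predict({"image": img})` on a real model.

```python
import time
from typing import Any, Callable, Sequence

def images_per_second(
    predict: Callable[[Any], Any],
    inputs: Sequence[Any],
    warmup: int = 3,
) -> float:
    """Time one predict() call per input and return calls per second."""
    # Warm-up calls so one-time setup costs do not skew the timing.
    for item in inputs[:warmup]:
        predict(item)
    start = time.perf_counter()
    for item in inputs:
        predict(item)
    elapsed = time.perf_counter() - start
    # Guard against a zero reading on very coarse timers.
    return len(inputs) / max(elapsed, 1e-9)

def fake_predict(_image: Any) -> int:
    """Hypothetical stand-in for a Core ML predict call."""
    return sum(range(10_000))

rate = images_per_second(fake_predict, list(range(200)))
print(f"{rate:.0f} calls/s")
```

Timing only the `predict` loop isolates model dispatch from image decoding, which matters here because PIL decode is the end-to-end bottleneck.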
## Limitations
- **Image branch only.** The text encoder *can* be converted using the
  `convert_text_encoder.py` script in this repo (it works around coremltools'
  lack of a converter for PyTorch's fused `_native_multi_head_attention` by
  replacing `nn.MultiheadAttention` with manual matmul attention before
  tracing; the output matches PyTorch at cosine 0.999996). **However**,
  SigLIP2 uses Gemma2's 256K-token vocabulary; the token embedding alone is
  393 MB at fp16 (256,000 tokens × 768 dims × 2 bytes), pushing the full text
  encoder to ~565 MB. Too large to bundle here. Recommended: run the text
  encoder via [`open_clip_torch`](https://github.com/mlfoundations/open_clip)
  on demand. It's a one-shot per query, ~50 ms on PyTorch+MPS or CPU.
- **Batch=1 only.** Two separate constraints combine here:
  - Apple's `ct.ImageType` is hardcoded to require batch=1
    ([source: `coremltools/converters/mil/backend/backend_helper.py`](https://github.com/apple/coremltools/blob/main/coremltools/converters/mil/backend/backend_helper.py),
    line ~63; the validator rejects any shape where `shape[0] != 1`).
  - Switching to `ct.TensorType` (raw float input) bypasses that validator
    but caused `ct.convert()` to stall indefinitely in Apple's native ANE
    compiler on the two ViT-B models I tried (MobileCLIP2-B at batch=16,
    SigLIP2-B at batch=8; both hung past 2-minute timeouts). I haven't
    isolated whether this is a fundamental ANE limit or specific to
    certain architectures; treat as "empirically blocked at batch>1 for
    these ViT-B image encoders, root cause unconfirmed."

  For high-throughput indexing within batch=1, dispatch many images via
  `model.predict([{"image": img1}, {"image": img2}, ...])`; coremltools
  loops over them in C and amortizes the Python ↔ Objective-C overhead.
- **iOS 17+ / macOS 14+.** Compiled with `minimum_deployment_target=macOS14`
  to enable modern Core ML features. Older OS versions need re-conversion
  via `convert_to_coreml.py`.
## Pairing with a text encoder
The standard hybrid pattern: Core ML image encoder for the heavy per-photo
work, PyTorch text encoder for one-shot per query.
```python
import open_clip, torch
import coremltools as ct
import numpy as np
from PIL import Image
# Image side: Core ML on ANE
img_model = ct.models.MLModel("ViT-B-16-SigLIP2_image_8bit.mlpackage",
                              compute_units=ct.ComputeUnit.CPU_AND_NE)
img = Image.open("cat.jpg").convert("RGB").resize((224, 224), Image.BICUBIC)
img_emb = next(iter(img_model.predict({"image": img}).values()))[0]
img_emb /= np.linalg.norm(img_emb)

# Text side: PyTorch on CPU/MPS, one-time per query
pt_model, _, _ = open_clip.create_model_and_transforms(
    "ViT-B-16-SigLIP2", pretrained="webli")
tokenizer = open_clip.get_tokenizer("ViT-B-16-SigLIP2")
pt_model.eval()
with torch.no_grad():
    tokens = tokenizer(["a photo of a cat", "a photo of a dog"])
    txt_emb = pt_model.encode_text(tokens)
    txt_emb = txt_emb / txt_emb.norm(dim=-1, keepdim=True)
txt_emb = txt_emb.numpy()
similarities = txt_emb @ img_emb # shape (2,)
# similarities[i] = cosine(prompt_i, image). Higher = better match.
```
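For search over a whole collection, the same dot product is applied to a matrix of stored image embeddings instead of a single vector. A minimal sketch using random unit vectors as placeholders for real embeddings; `top_k` is an illustrative helper, not something this repo ships, and a brute-force scan like this is only one of several ways to do the lookup:

```python
import numpy as np

def top_k(query_emb: np.ndarray, image_embs: np.ndarray, k: int = 5):
    """Return (index, score) pairs for the k best-matching images.

    Assumes query_emb has shape (D,) and image_embs has shape (N, D),
    all rows L2-normalized so the dot product is the cosine similarity.
    """
    scores = image_embs @ query_emb          # shape (N,)
    order = np.argsort(-scores)[:k]          # highest score first
    return [(int(i), float(scores[i])) for i in order]

# Placeholder data: 100 random unit vectors standing in for image embeddings.
rng = np.random.default_rng(0)
image_embs = rng.normal(size=(100, 768)).astype(np.float32)
image_embs /= np.linalg.norm(image_embs, axis=1, keepdims=True)

# Use one stored vector as the query, so it should rank itself first.
hits = top_k(image_embs[42], image_embs, k=3)
print(hits)
```

In real use the query would be a row of `txt_emb` from the snippet above, and `image_embs` would be the stacked Core ML outputs for the indexed photos.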
## How this was made
The `convert_to_coreml.py` script in this repo reproduces the image variants;
`convert_text_encoder.py` does the (large) text encoder.
```bash
pip install coremltools open_clip_torch torch torchvision pillow numpy transformers
# fp16 only
python convert_to_coreml.py ViT-B-16-SigLIP2
# fp16 + 8-bit (run once with --palettize 6 separately for the 6-bit variant)
python convert_to_coreml.py ViT-B-16-SigLIP2 --palettize 8
```
The image converter:
1. Loads PyTorch weights via `open_clip.create_model_and_transforms`
2. Wraps with an L2-normalize head so output is search-ready
3. Traces with `torch.jit.trace` at the model's expected input shape (224×224)
4. Reads `mean`/`std` from the preprocess transform and derives Core ML
   `scale`/`bias` from them. **This step is load-bearing for SigLIP2 specifically**
   because its preprocess uses `Normalize(mean=0.5, std=0.5)` (mapping pixels
   to [-1, 1]); hardcoding the more common [0, 1] mapping silently degrades
   cosine vs PyTorch by ~0.024. The effect is model-specific: for a model with
   no Normalize transform, like MobileCLIP2-B, the [0, 1] mapping is already
   correct, so hardcoding it is harmless.
5. Converts via `ct.convert` with `compute_units=CPU_AND_NE`
6. Optionally palettizes via `coremltools.optimize.coreml.palettize_weights`
7. Verifies cosine vs PyTorch + benchmarks ANE throughput
Conversion timings: ~15s for fp16, +20s for 6-bit palettization, +50s for
8-bit (palettization scales with model depth and bit precision).
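Step 4 of the list above comes down to a short piece of algebra. Core ML's image input applies `scale * pixel + bias` to raw 0..255 pixels, while the PyTorch preprocess computes `(pixel / 255 - mean) / std`. A minimal sketch of the mapping between the two; `coreml_scale_bias` is a hypothetical helper written for illustration, and whether `convert_to_coreml.py` structures the computation this way is an assumption:

```python
from typing import List, Sequence, Tuple

def coreml_scale_bias(
    mean: Sequence[float], std: Sequence[float]
) -> Tuple[float, List[float]]:
    """Map Normalize(mean, std) on [0, 1] pixels to (scale, bias) on 0..255.

    (p / 255 - mean) / std  ==  p * (1 / (255 * std))  +  (-mean / std)
    """
    if len(set(std)) != 1:
        # A single scalar scale cannot express different per-channel std values.
        raise ValueError("a scalar scale needs identical per-channel std")
    scale = 1.0 / (255.0 * std[0])
    bias = [-m / std[0] for m in mean]
    return scale, bias

# SigLIP2's preprocess: Normalize(mean=0.5, std=0.5)
scale, bias = coreml_scale_bias(mean=(0.5, 0.5, 0.5), std=(0.5, 0.5, 0.5))

# Check the endpoints: pixel 0 and pixel 255 should land on -1 and +1.
low = 0 * scale + bias[0]
high = 255 * scale + bias[0]
print(scale, bias, (low, high))
```

The resulting values would be passed as the `scale` and `bias` arguments of `ct.ImageType`. For SigLIP2 this gives a scale of 1/127.5 and a bias of -1 per channel, which is the [-1, 1] mapping described in step 4.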
## Attribution
- **Weights**: SigLIP2 by Google ([paper](https://arxiv.org/abs/2502.14786)).
  The `webli` pretrained tag in `open_clip` corresponds to Google's release
  trained on the WebLI dataset. Model loaded via the
  [`mlfoundations/open_clip`](https://github.com/mlfoundations/open_clip)
  community library; `open_clip` is `mlfoundations`' work, not Google's or Apple's.
- **Conversion tooling**: `coremltools` (Apple), `open_clip_torch` (mlfoundations).
- **Pattern reference**:
  [`apple/coreml-mobileclip`](https://huggingface.co/apple/coreml-mobileclip)
  established the convention of Core ML CLIP-family image encoders shipped
  alongside PyTorch text encoders.
## License
Apache 2.0, matching the upstream SigLIP2 weights' license. See `LICENSE`.