---
license: apache-2.0
library_name: coreml
base_model: google/siglip2-base-patch16-224
tags:
- siglip2
- siglip
- core-ml
- coreml
- apple-silicon
- ane
- neural-engine
- image-encoder
- image-embedding
- semantic-search
- on-device
language:
- en
pipeline_tag: zero-shot-image-classification
---

# ViT-B-16-SigLIP2: Image Encoder, Apple Core ML

Apple Neural Engine (ANE) acceleration for SigLIP2's image branch on M-series
Macs and iOS 17+ devices. Apple's [`apple/coreml-mobileclip`](https://huggingface.co/apple/coreml-mobileclip)
ships the older MobileCLIP family in Core ML form but stops there; this repo
fills the gap for the higher-accuracy SigLIP2.

**Matches PyTorch at fp16** (cosine 1.0000 vs the `open_clip` reference,
verified end-to-end; see "Verifying the conversion yourself" below). Three
precision tiers ship in this repo so you can pick the right size/quality
trade-off.

All `.mlpackage` files are stored via Git LFS: clone with `git lfs install &&
git clone …`, or download individual files via the HF Hub web UI.

## Quick start

```python
import coremltools as ct
from PIL import Image

model = ct.models.MLModel(
    "ViT-B-16-SigLIP2_image_8bit.mlpackage",
    compute_units=ct.ComputeUnit.CPU_AND_NE,  # ANE-accelerated
)

# Resize to 224×224 with a quality resampler. Core ML's image input bakes in
# the [-1, 1] normalization but does NOT do its own resize.
img = Image.open("photo.jpg").convert("RGB").resize((224, 224), Image.BICUBIC)
out = model.predict({"image": img})
embedding = out["embedding"][0]   # (768,) float32, L2-normalized
```

The model **outputs L2-normalized embeddings** so cosine similarity is just a
dot product. **Channel normalization** (the [-1, 1] mapping SigLIP2 expects)
is baked into the Core ML graph; you only need to deliver a 224×224 RGB image.
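
To illustrate why unit-length embeddings make ranking cheap, here is a minimal pure-Python sketch. The 3-d vectors and the `gallery` names are made-up stand-ins for real 768-d embeddings, not outputs of this model.

```python
import math

def l2_normalize(vec):
    """Return vec scaled to unit Euclidean length."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]

def dot(a, b):
    """Plain dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def cosine(a, b):
    """Full cosine similarity with explicit norms, for comparison."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))

# Toy vectors (illustrative values only).
query = l2_normalize([3.0, 4.0, 0.0])
gallery = {
    "a": l2_normalize([6.0, 8.0, 0.0]),  # same direction as the query
    "b": l2_normalize([0.0, 0.0, 5.0]),  # orthogonal to the query
    "c": l2_normalize([4.0, 3.0, 0.0]),  # close, but not identical
}

# For unit vectors the dot product already is the cosine similarity.
scores = {name: dot(query, emb) for name, emb in gallery.items()}
ranked = sorted(scores, key=scores.get, reverse=True)
print(ranked)
```

With a large index the same idea becomes a single matrix-vector product over the stored embeddings, with no per-item norm computation.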

## Verifying the conversion yourself

Don't trust the cosine claims; reproduce them in 30 seconds:

```python
import open_clip, torch, numpy as np, coremltools as ct
from PIL import Image, ImageDraw

pt_model, _, pt_pre = open_clip.create_model_and_transforms(
    "ViT-B-16-SigLIP2", pretrained="webli")
pt_model.eval()
img = Image.new("RGB", (224, 224), (40, 40, 40))
ImageDraw.Draw(img).ellipse([40, 40, 184, 184], fill=(0, 255, 0))

# PyTorch reference
with torch.no_grad():
    pt_emb = pt_model.visual(pt_pre(img).unsqueeze(0))
    pt_emb = (pt_emb / pt_emb.norm(dim=-1, keepdim=True))[0].numpy()

# Core ML
cm = ct.models.MLModel("ViT-B-16-SigLIP2_image_fp16.mlpackage",
                       compute_units=ct.ComputeUnit.CPU_AND_NE)
cm_emb = next(iter(cm.predict({"image": img}).values()))[0].astype(np.float32)
cm_emb /= np.linalg.norm(cm_emb)

print(f"cosine: {float(np.dot(pt_emb, cm_emb)):.6f}")  # → 1.000000 for fp16
```

## Available variants

| File | Size | Cosine vs PyTorch | Best for |
|---|---|---|---|
| `ViT-B-16-SigLIP2_image_fp16.mlpackage` | 185 MB | **1.0000** | Reference / benchmark reproduction |
| `ViT-B-16-SigLIP2_image_8bit.mlpackage` | 93 MB | **0.9942** | **Default**: near-perfect, 2× smaller |
| `ViT-B-16-SigLIP2_image_6bit.mlpackage` | 70 MB | 0.9629 | When download size matters more than ranking precision |

Cosine measured on a synthetic test image; rankings on real photo collections
are essentially identical between fp16 and 8-bit. 6-bit shows minor reordering
of close-scoring results. 4-bit was tested (0.79 cosine) and deliberately
excluded as too lossy for retrieval ranking.
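
As a sanity check on the file sizes in the table, the sketch below scales the fp16 size by the bit-width ratio. It assumes the package size is dominated by weights and that palettization stores each weight in `n` bits; lookup tables and metadata are ignored, which plausibly accounts for the small gap to the listed sizes.

```python
FP16_SIZE_MB = 185  # fp16 package size from the table above

def estimated_size_mb(bits_per_weight, fp16_size_mb=FP16_SIZE_MB):
    """Scale the fp16 size by bits/16 (assumption: weights dominate the file)."""
    return fp16_size_mb * bits_per_weight / 16

for bits in (8, 6):
    print(f"{bits}-bit: ~{estimated_size_mb(bits):.1f} MB")
# 8-bit: ~92.5 MB  (table lists 93 MB)
# 6-bit: ~69.4 MB  (table lists 70 MB)
```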

## Performance (Apple M-series, CPU+ANE)

Measured one image at a time; this package, like Apple's published Core ML
packages, is compiled at batch=1 (see "Limitations" for why).

| Variant | Throughput | vs PyTorch+MPS baseline |
|---|---|---|
| fp16 Core ML | ~133 img/s | **2.6× faster** |
| 8-bit Core ML | ~138 img/s | **2.7× faster** |
| 6-bit Core ML | ~139 img/s | **2.7× faster** |
| (PyTorch+MPS reference) | ~51 img/s | 1.0× |

End-to-end, with disk loading and parallel PIL preprocessing, throughput
caps at ~110 img/s for the 8-bit variant. PIL decode becomes the
bottleneck before the ANE does, which is why throughput differences between
palettization levels vanish in practice.
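
If you want to take comparable measurements yourself, a minimal timing harness might look like the sketch below. It is not the benchmark used for the table: `images_per_second` and `speedup` are hypothetical helpers, and `predict_fn` stands for whatever callable wraps your model (for example a lambda around `model.predict`).

```python
import time

def images_per_second(predict_fn, inputs, warmup=3):
    """Call predict_fn once per input and return throughput in images/second."""
    for item in inputs[:warmup]:  # warm-up calls are not timed
        predict_fn(item)
    start = time.perf_counter()
    for item in inputs:
        predict_fn(item)
    elapsed = time.perf_counter() - start
    return len(inputs) / elapsed

def speedup(candidate_ips, baseline_ips):
    """Ratio of two throughputs, as in the table's last column."""
    return candidate_ips / baseline_ips

# Recomputing the ratio column from the table's own throughput numbers.
print(f"{speedup(133, 51):.1f}x")  # 2.6x
print(f"{speedup(138, 51):.1f}x")  # 2.7x
```

Timing only the `predict` call isolates model throughput; including image decode and resize in `predict_fn` gives the end-to-end figure instead.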

**Expected ANE utilization is low (~5%)** when you profile this in Instruments.
That's not a bug: the per-call dispatch overhead (Python ↔ Objective-C ↔
Apple's compute scheduler) dominates the actual ANE compute time per inference.
Throughput improves anyway because the dispatch + ANE pipeline keeps moving.
The fix would be a batched model, which I could not get to convert for these
ViT-B image encoders (see Limitations).

## Limitations

- **Image branch only.** The text encoder *can* be converted using the
  `convert_text_encoder.py` script in this repo. It works around coremltools'
  lack of a converter for PyTorch's fused `_native_multi_head_attention` by
  replacing `nn.MultiheadAttention` with manual matmul attention before
  tracing, and its output matches PyTorch at cosine 0.999996. **However**,
  SigLIP2 uses Gemma2's 256K-token vocabulary; the token embedding alone is
  393 MB at fp16, pushing the full text encoder to ~565 MB. Too large to
  bundle here. Recommended: run the text encoder via
  [`open_clip_torch`](https://github.com/mlfoundations/open_clip) on demand.
  It runs once per query, ~50ms on PyTorch+MPS or CPU.

- **Batch=1 only.** Two separate constraints combine here:
  - Apple's `ct.ImageType` is hardcoded to require batch=1
    ([source: `coremltools/converters/mil/backend/backend_helper.py`](https://github.com/apple/coremltools/blob/main/coremltools/converters/mil/backend/backend_helper.py),
    line ~63: the validator rejects any shape where `shape[0] != 1`).
  - Switching to `ct.TensorType` (raw float input) bypasses that validator
    but caused `ct.convert()` to stall in Apple's native ANE compiler on
    the two ViT-B models I tried (MobileCLIP2-B at batch=16, SigLIP2-B at
    batch=8; both hung past 2-minute timeouts). I haven't
    isolated whether this is a fundamental ANE limit or specific to
    certain architectures; treat as "empirically blocked at batch>1 for
    these ViT-B image encoders, root cause unconfirmed."

  For high-throughput indexing within batch=1, dispatch many images via
  `model.predict([{"image": img1}, {"image": img2}, ...])`. coremltools
  loops them in C and amortizes the Python ↔ Objective-C overhead.

- **iOS 17+ / macOS 14+.** Compiled with `minimum_deployment_target=macOS14`
  to enable modern Core ML features. Older OS versions need re-conversion
  via `convert_to_coreml.py`.
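
The text-encoder size quoted in the first bullet can be checked with simple arithmetic. This sketch assumes "256K" means exactly 256,000 tokens and that the text embedding width is 768, the same as the output embedding; both are assumptions rather than values read from the checkpoint.

```python
VOCAB_SIZE = 256_000   # assumed exact size of the 256K-token vocabulary
EMBED_DIM = 768        # assumed text embedding width
BYTES_PER_FP16 = 2

embedding_bytes = VOCAB_SIZE * EMBED_DIM * BYTES_PER_FP16
embedding_mb = embedding_bytes / 1_000_000  # decimal megabytes

print(f"token embedding: {embedding_mb:.0f} MB")               # 393 MB
print(f"rest of text encoder: ~{565 - embedding_mb:.0f} MB")   # ~172 MB
```

Under these assumptions the embedding table accounts for roughly 70% of the ~565 MB text encoder, so quantizing only the transformer layers would not shrink it much.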

## Pairing with a text encoder

The standard hybrid pattern: Core ML image encoder for the heavy per-photo
work, PyTorch text encoder for one-shot per query.

```python
import open_clip, torch
import coremltools as ct
import numpy as np
from PIL import Image

# Image side: Core ML on ANE
img_model = ct.models.MLModel("ViT-B-16-SigLIP2_image_8bit.mlpackage",
                              compute_units=ct.ComputeUnit.CPU_AND_NE)
img = Image.open("cat.jpg").convert("RGB").resize((224, 224), Image.BICUBIC)
img_emb = next(iter(img_model.predict({"image": img}).values()))[0]
img_emb /= np.linalg.norm(img_emb)

# Text side: PyTorch on CPU/MPS, run once per query
pt_model, _, _ = open_clip.create_model_and_transforms(
    "ViT-B-16-SigLIP2", pretrained="webli")
tokenizer = open_clip.get_tokenizer("ViT-B-16-SigLIP2")
pt_model.eval()

with torch.no_grad():
    tokens = tokenizer(["a photo of a cat", "a photo of a dog"])
    txt_emb = pt_model.encode_text(tokens)
    txt_emb = txt_emb / txt_emb.norm(dim=-1, keepdim=True)
txt_emb = txt_emb.numpy()

similarities = txt_emb @ img_emb        # shape (2,)
# similarities[i] = cosine(prompt_i, image). Higher = better match.
```

## How this was made

The `convert_to_coreml.py` script in this repo reproduces the image variants;
`convert_text_encoder.py` does the (large) text encoder.

```bash
pip install coremltools open_clip_torch torch torchvision pillow numpy transformers

# fp16 only
python convert_to_coreml.py ViT-B-16-SigLIP2

# fp16 + 8-bit  (run once with --palettize 6 separately for the 6-bit variant)
python convert_to_coreml.py ViT-B-16-SigLIP2 --palettize 8
```

The image converter:
1. Loads PyTorch weights via `open_clip.create_model_and_transforms`
2. Wraps with an L2-normalize head so output is search-ready
3. Traces with `torch.jit.trace` at the model's expected input shape (224×224)
4. Reads `mean`/`std` from the preprocess transform and derives Core ML
   `scale`/`bias`. **This step is load-bearing for SigLIP2 specifically**
   because its preprocess uses `Normalize(mean=0.5, std=0.5)` (mapping pixels
   to [-1, 1]); hardcoding the more common [0, 1] mapping silently degrades
   cosine vs PyTorch by ~0.024 (model-specific: for a model with no Normalize
   transform like MobileCLIP2-B, the hardcoded mapping is harmless).
5. Converts via `ct.convert` with `compute_units=CPU_AND_NE`
6. Optionally palettizes via `coremltools.optimize.coreml.palettize_weights`
7. Verifies cosine vs PyTorch + benchmarks ANE throughput
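
Step 4 can be sketched as follows. This is not the code from `convert_to_coreml.py`; it is a minimal illustration that assumes Core ML image preprocessing computes `scale * pixel + bias` on 0-255 pixel values, with one scalar `scale` and a per-channel `bias`. The helper names are hypothetical.

```python
def coreml_scale_bias(mean, std):
    """Translate Normalize(mean, std) on [0, 1] pixels into scale/bias on [0, 255] pixels.

    (p / 255 - mean) / std  ==  p * 1 / (255 * std)  +  (-mean / std)
    Assumes a single std shared by all channels, since scale is one scalar.
    """
    scale = 1.0 / (255.0 * std)
    bias = [-m / std for m in mean]
    return scale, bias

def apply_preprocess(pixel, scale, channel_bias):
    """Apply the assumed Core ML mapping to one 0-255 pixel value."""
    return scale * pixel + channel_bias

# SigLIP2: Normalize(mean=0.5, std=0.5) maps pixels to [-1, 1].
scale, bias = coreml_scale_bias(mean=[0.5, 0.5, 0.5], std=0.5)
lo, hi = apply_preprocess(0, scale, bias[0]), apply_preprocess(255, scale, bias[0])
print(f"{lo:.3f} {hi:.3f}")  # -1.000 1.000

# The hardcoded [0, 1] mapping corresponds to mean=0, std=1.
wrong_scale, wrong_bias = coreml_scale_bias(mean=[0.0, 0.0, 0.0], std=1.0)
print(f"{apply_preprocess(255, wrong_scale, wrong_bias[0]):.3f}")  # 1.000, but black maps to 0 instead of -1
```

Reading `mean` and `std` from the transform instead of hardcoding them keeps the same converter correct for models with different normalization.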

Conversion timings: ~15s for fp16, +20s for 6-bit palettization, +50s for
8-bit (palettization scales with model depth and bit precision).

## Attribution

- **Weights**: SigLIP2 by Google ([paper](https://arxiv.org/abs/2502.14786)).
  The `webli` pretrained tag in `open_clip` corresponds to Google's release
  trained on the WebLI dataset. Model loaded via the
  [`mlfoundations/open_clip`](https://github.com/mlfoundations/open_clip)
  community library; `open_clip` is `mlfoundations`' work, not Google's or Apple's.
- **Conversion tooling**: `coremltools` (Apple), `open_clip_torch` (mlfoundations).
- **Pattern reference**:
  [`apple/coreml-mobileclip`](https://huggingface.co/apple/coreml-mobileclip)
  established the convention of Core ML CLIP-family image encoders shipped
  alongside PyTorch text encoders.

## License

Apache 2.0, matching the upstream SigLIP2 weights' license. See `LICENSE`.