Upload folder using huggingface_hub
README.md
CHANGED
|
@@ -22,10 +22,17 @@ pipeline_tag: zero-shot-image-classification
|
|
| 22 |
# ViT-B-16-SigLIP2: Image Encoder, Apple Core ML
|
| 23 |
|
| 24 |
Apple Neural Engine (ANE) acceleration for SigLIP2's image branch on M-series
|
| 25 |
-
Macs and
|
| 26 |
-
|
| 27 |
-
|
| 28 |
-
|
| 29 |
|
| 30 |
## Quick start
|
| 31 |
|
|
@@ -33,21 +40,49 @@ trade-off for your use case.
|
|
| 33 |
import coremltools as ct
|
| 34 |
from PIL import Image
|
| 35 |
|
| 36 |
-
# Load any of the three variants (see "Available variants" below)
|
| 37 |
model = ct.models.MLModel(
|
| 38 |
"ViT-B-16-SigLIP2_image_8bit.mlpackage",
|
| 39 |
compute_units=ct.ComputeUnit.CPU_AND_NE, # ANE-accelerated
|
| 40 |
)
|
| 41 |
|
| 42 |
-
|
| 43 |
out = model.predict({"image": img})
|
| 44 |
embedding = out["embedding"][0] # (768,) float32, L2-normalized
|
| 45 |
```
|
| 46 |
|
| 47 |
The model **outputs L2-normalized embeddings** so cosine similarity is just a
|
| 48 |
-
dot product. **
|
| 49 |
-
is baked into the Core ML graph
|
| 50 |
-
|
| 51 |
|
| 52 |
## Available variants
|
| 53 |
|
|
@@ -59,13 +94,13 @@ done.
|
|
| 59 |
|
| 60 |
Cosine measured on a synthetic test image; rankings on real photo collections
|
| 61 |
are essentially identical between fp16 and 8-bit. 6-bit shows minor reordering
|
| 62 |
-
of close-scoring results.
|
| 63 |
|
| 64 |
## Performance (Apple M-series, CPU+ANE)
|
| 65 |
|
| 66 |
-
|
| 67 |
-
|
| 68 |
-
for ViT-B-shaped models, regardless of architecture).
|
| 69 |
|
| 70 |
| Variant | Throughput | vs PyTorch+MPS baseline |
|
| 71 |
|---|---|---|
|
|
@@ -74,23 +109,30 @@ for ViT-B-shaped models, regardless of architecture).
|
|
| 74 |
| 6-bit Core ML | ~139 img/s | **2.7× faster** |
|
| 75 |
| (PyTorch+MPS reference) | ~51 img/s | 1.0× |
|
| 76 |
|
| 77 |
-
|
| 78 |
-
~110 img/s for the 8-bit variant
|
| 79 |
-
ANE does,
|
| 80 |
|
| 81 |
## Limitations
|
| 82 |
|
| 83 |
-
- **Image branch only
|
| 84 |
-
|
| 85 |
-
|
| 86 |
-
|
| 87 |
-
|
| 88 |
-
256K-token vocabulary
|
| 89 |
-
pushing the full text encoder to ~565 MB.
|
| 90 |
-
|
| 91 |
-
|
| 92 |
-
|
| 93 |
-
can build the text encoder if you specifically need it.
|
| 94 |
|
| 95 |
- **Batch=1 only.** Two separate constraints conflate here:
|
| 96 |
- Apple's `ct.ImageType` is hardcoded to require batch=1
|
|
@@ -102,19 +144,20 @@ ANE does, so palettization throughput differences vanish in practice.
|
|
| 102 |
SigLIP2-B at batch=8; both hung past 2-minute timeouts). I haven't
|
| 103 |
isolated whether this is a fundamental ANE limit or specific to
|
| 104 |
certain architectures; treat as "empirically blocked at batch>1 for
|
| 105 |
-
ViT-B image encoders, root cause unconfirmed."
|
| 106 |
|
| 107 |
For high-throughput indexing within batch=1, dispatch many images via
|
| 108 |
`model.predict([{"image": img1}, {"image": img2}, ...])`; coremltools
|
| 109 |
loops them in C and amortizes the Python-to-Objective-C overhead.
|
| 110 |
|
| 111 |
- **iOS 17+ / macOS 14+.** Compiled with `minimum_deployment_target=macOS14`
|
| 112 |
-
to enable modern Core ML features. Older OS versions need re-conversion
|
| 113 |
|
| 114 |
-
##
|
| 115 |
|
| 116 |
-
|
| 117 |
-
|
| 118 |
|
| 119 |
```python
|
| 120 |
import open_clip, torch
|
|
@@ -125,13 +168,13 @@ from PIL import Image
|
|
| 125 |
# Image side: Core ML on ANE
|
| 126 |
img_model = ct.models.MLModel("ViT-B-16-SigLIP2_image_8bit.mlpackage",
|
| 127 |
compute_units=ct.ComputeUnit.CPU_AND_NE)
|
| 128 |
-
|
| 129 |
-
|
| 130 |
-
}).values()))[0]
|
| 131 |
img_emb /= np.linalg.norm(img_emb)
|
| 132 |
|
| 133 |
-
# Text side: PyTorch on CPU/MPS
|
| 134 |
-
pt_model, _, _ = open_clip.create_model_and_transforms(
|
| 135 |
tokenizer = open_clip.get_tokenizer("ViT-B-16-SigLIP2")
|
| 136 |
pt_model.eval()
|
| 137 |
|
|
@@ -141,71 +184,55 @@ with torch.no_grad():
|
|
| 141 |
txt_emb = txt_emb / txt_emb.norm(dim=-1, keepdim=True)
|
| 142 |
txt_emb = txt_emb.numpy()
|
| 143 |
|
| 144 |
-
similarities = txt_emb @ img_emb
|
| 145 |
-
# similarities[
|
| 146 |
```
|
| 147 |
|
| 148 |
## How this was made
|
| 149 |
|
| 150 |
-
The `convert_to_coreml.py` in this repo reproduces
|
| 151 |
-
|
| 152 |
|
| 153 |
```bash
|
| 154 |
pip install coremltools open_clip_torch torch torchvision pillow numpy transformers
|
| 155 |
python convert_to_coreml.py ViT-B-16-SigLIP2 --palettize 8
|
| 156 |
-
# fp16 + 8-bit variants in current dir; rerun with --palettize 6 for that one
|
| 157 |
```
|
| 158 |
|
| 159 |
-
The
|
| 160 |
1. Loads PyTorch weights via `open_clip.create_model_and_transforms`
|
| 161 |
2. Wraps with an L2-normalize head so output is search-ready
|
| 162 |
3. Traces with `torch.jit.trace` at the model's expected input shape (224×224)
|
| 163 |
4. Reads `mean`/`std` from the preprocess transform → derives Core ML
|
| 164 |
-
`scale`/`bias`
|
| 165 |
-
|
| 166 |
5. Converts via `ct.convert` with `compute_units=CPU_AND_NE`
|
| 167 |
6. Optionally palettizes via `coremltools.optimize.coreml.palettize_weights`
|
| 168 |
7. Verifies cosine vs PyTorch + benchmarks ANE throughput
|
| 169 |
|
| 170 |
-
Conversion
|
| 171 |
|
| 172 |
## Attribution
|
| 173 |
|
| 174 |
-
- **Weights**: SigLIP2 by Google
|
| 175 |
-
|
| 176 |
-
|
| 177 |
-
|
| 178 |
-
|
| 179 |
-
- **
|
| 180 |
[`apple/coreml-mobileclip`](https://huggingface.co/apple/coreml-mobileclip)
|
| 181 |
-
established the convention of Core ML
|
| 182 |
alongside PyTorch text encoders.
|
| 183 |
|
| 184 |
## License
|
| 185 |
|
| 186 |
Apache 2.0, matching the upstream SigLIP2 weights' license. See `LICENSE`.
|
| 187 |
-
|
| 188 |
-
## Verifying the conversion yourself
|
| 189 |
-
|
| 190 |
-
```python
|
| 191 |
-
import open_clip, torch, numpy as np, coremltools as ct
|
| 192 |
-
from PIL import Image, ImageDraw
|
| 193 |
-
|
| 194 |
-
pt_model, _, pt_pre = open_clip.create_model_and_transforms("ViT-B-16-SigLIP2", pretrained="webli")
|
| 195 |
-
pt_model.eval()
|
| 196 |
-
img = Image.new("RGB", (224, 224), (40, 40, 40))
|
| 197 |
-
ImageDraw.Draw(img).ellipse([40, 40, 184, 184], fill=(0, 255, 0))
|
| 198 |
-
|
| 199 |
-
# PyTorch reference
|
| 200 |
-
with torch.no_grad():
|
| 201 |
-
pt_emb = pt_model.visual(pt_pre(img).unsqueeze(0))
|
| 202 |
-
pt_emb = (pt_emb / pt_emb.norm(dim=-1, keepdim=True))[0].numpy()
|
| 203 |
-
|
| 204 |
-
# Core ML
|
| 205 |
-
cm = ct.models.MLModel("ViT-B-16-SigLIP2_image_fp16.mlpackage",
|
| 206 |
-
compute_units=ct.ComputeUnit.CPU_AND_NE)
|
| 207 |
-
cm_emb = next(iter(cm.predict({"image": img}).values()))[0].astype(np.float32)
|
| 208 |
-
cm_emb /= np.linalg.norm(cm_emb)
|
| 209 |
-
|
| 210 |
-
print(f"cosine: {float(np.dot(pt_emb, cm_emb)):.6f}") # expect ~1.000000 for fp16
|
| 211 |
-
```
|
|
|
|
| 22 |
# ViT-B-16-SigLIP2: Image Encoder, Apple Core ML
|
| 23 |
|
| 24 |
Apple Neural Engine (ANE) acceleration for SigLIP2's image branch on M-series
|
| 25 |
+
Macs and iOS 17+ devices. Apple's [`apple/coreml-mobileclip`](https://huggingface.co/apple/coreml-mobileclip)
|
| 26 |
+
ships the older MobileCLIP family in Core ML form but stops there; this repo
|
| 27 |
+
fills the gap for the higher-accuracy SigLIP2.
|
| 28 |
+
|
| 29 |
+
**Numerically matches PyTorch at fp16** (cosine 1.0000 vs `open_clip` reference,
|
| 30 |
+
verified end-to-end; see "Verifying the conversion yourself" below). Three
|
| 31 |
+
precision tiers ship in this repo so you can pick the right size/quality
|
| 32 |
+
trade-off.
|
| 33 |
+
|
| 34 |
+
All `.mlpackage` files are stored via Git LFS; clone with `git lfs install &&
|
| 35 |
+
git clone …` or download individual files via the HF Hub web UI.
|
| 36 |
|
| 37 |
## Quick start
|
| 38 |
|
|
|
|
| 40 |
import coremltools as ct
|
| 41 |
from PIL import Image
|
| 42 |
|
|
|
|
| 43 |
model = ct.models.MLModel(
|
| 44 |
"ViT-B-16-SigLIP2_image_8bit.mlpackage",
|
| 45 |
compute_units=ct.ComputeUnit.CPU_AND_NE, # ANE-accelerated
|
| 46 |
)
|
| 47 |
|
| 48 |
+
# Resize to 224×224 with a quality resampler. Core ML's image input bakes in
|
| 49 |
+
# the [-1, 1] normalization but does NOT do its own resize.
|
| 50 |
+
img = Image.open("photo.jpg").convert("RGB").resize((224, 224), Image.BICUBIC)
|
| 51 |
out = model.predict({"image": img})
|
| 52 |
embedding = out["embedding"][0] # (768,) float32, L2-normalized
|
| 53 |
```
|
| 54 |
|
| 55 |
The model **outputs L2-normalized embeddings** so cosine similarity is just a
|
| 56 |
+
dot product. **Channel normalization** (the [-1, 1] mapping SigLIP2 expects)
|
| 57 |
+
is baked into the Core ML graph; you only need to deliver a 224×224 RGB image.
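
To illustrate why a dot product is enough once embeddings are L2-normalized, here is a minimal pure-Python sketch. The helper names (`l2_normalize`, `dot`, `cosine`) and the toy 3-dimensional vectors are illustrative assumptions, not part of this repo; real embeddings are 768-dimensional.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def dot(a, b):
    """Plain dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def cosine(a, b):
    """Textbook cosine similarity, valid for vectors of any length."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))


# Toy stand-ins for two embeddings (hypothetical values).
a = l2_normalize([3.0, 4.0, 0.0])
b = l2_normalize([1.0, 2.0, 2.0])

# Both norms are 1 after normalization, so the denominator in cosine()
# is 1 and the two quantities agree up to floating-point rounding.
assert abs(dot(a, b) - cosine(a, b)) < 1e-12
```

In practice this means a whole photo index can be scored against a query with a single matrix-vector product.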
|
| 58 |
+
|
| 59 |
+
## Verifying the conversion yourself
|
| 60 |
+
|
| 61 |
+
Don't trust the cosine claims; reproduce them in 30 seconds:
|
| 62 |
+
|
| 63 |
+
```python
|
| 64 |
+
import open_clip, torch, numpy as np, coremltools as ct
|
| 65 |
+
from PIL import Image, ImageDraw
|
| 66 |
+
|
| 67 |
+
pt_model, _, pt_pre = open_clip.create_model_and_transforms(
|
| 68 |
+
"ViT-B-16-SigLIP2", pretrained="webli")
|
| 69 |
+
pt_model.eval()
|
| 70 |
+
img = Image.new("RGB", (224, 224), (40, 40, 40))
|
| 71 |
+
ImageDraw.Draw(img).ellipse([40, 40, 184, 184], fill=(0, 255, 0))
|
| 72 |
+
|
| 73 |
+
# PyTorch reference
|
| 74 |
+
with torch.no_grad():
|
| 75 |
+
pt_emb = pt_model.visual(pt_pre(img).unsqueeze(0))
|
| 76 |
+
pt_emb = (pt_emb / pt_emb.norm(dim=-1, keepdim=True))[0].numpy()
|
| 77 |
+
|
| 78 |
+
# Core ML
|
| 79 |
+
cm = ct.models.MLModel("ViT-B-16-SigLIP2_image_fp16.mlpackage",
|
| 80 |
+
compute_units=ct.ComputeUnit.CPU_AND_NE)
|
| 81 |
+
cm_emb = next(iter(cm.predict({"image": img}).values()))[0].astype(np.float32)
|
| 82 |
+
cm_emb /= np.linalg.norm(cm_emb)
|
| 83 |
+
|
| 84 |
+
print(f"cosine: {float(np.dot(pt_emb, cm_emb)):.6f}") # expect ~1.000000 for fp16
|
| 85 |
+
```
|
| 86 |
|
| 87 |
## Available variants
|
| 88 |
|
|
|
|
| 94 |
|
| 95 |
Cosine measured on a synthetic test image; rankings on real photo collections
|
| 96 |
are essentially identical between fp16 and 8-bit. 6-bit shows minor reordering
|
| 97 |
+
of close-scoring results. 4-bit was tested (0.79 cosine) and deliberately
|
| 98 |
+
excluded β too lossy for retrieval ranking.
|
| 99 |
|
| 100 |
## Performance (Apple M-series, CPU+ANE)
|
| 101 |
|
| 102 |
+
Measured on a single image at a time; Apple's published Core ML packages
|
| 103 |
+
(including this one) are compiled at batch=1 β see "Limitations" for why.
|
|
|
|
| 104 |
|
| 105 |
| Variant | Throughput | vs PyTorch+MPS baseline |
|
| 106 |
|---|---|---|
|
|
|
|
| 109 |
| 6-bit Core ML | ~139 img/s | **2.7× faster** |
|
| 110 |
| (PyTorch+MPS reference) | ~51 img/s | 1.0× |
|
| 111 |
|
| 112 |
+
End-to-end with disk loading + parallel PIL preprocessing the throughput
|
| 113 |
+
caps at ~110 img/s for the 8-bit variant; PIL decode becomes the
|
| 114 |
+
bottleneck before ANE does, which is why palettization throughput differences
|
| 115 |
+
vanish in practice.
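
One way to arrange the "parallel preprocessing, serial inference" pattern described above is sketched below. This is an assumption about how such a pipeline could look, not the benchmark code behind the numbers in this README. `embed_all`, `load_image`, and `embed` are hypothetical names supplied by the caller.

```python
from concurrent.futures import ThreadPoolExecutor


def embed_all(paths, load_image, embed, workers=8):
    """Decode images on a thread pool while inference runs one image at a time.

    load_image(path) -> image and embed(image) -> embedding are
    caller-supplied (hypothetical) callables.
    """
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # pool.map yields results in input order; later images keep
        # decoding in the background while earlier ones are embedded.
        return [embed(img) for img in pool.map(load_image, paths)]
```

With this repo's model, `load_image` could wrap the `Image.open(...).convert("RGB").resize((224, 224), Image.BICUBIC)` call from the quick start and `embed` could wrap `model.predict({"image": img})`. Whether 8 workers is a good default depends on the machine.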
|
| 116 |
+
|
| 117 |
+
**Expected ANE utilization is low (~5%)** when you profile this in Instruments.
|
| 118 |
+
That's not a bug: the per-call dispatch overhead (Python → Objective-C →
|
| 119 |
+
Apple's compute scheduler) dominates the actual ANE compute time per inference.
|
| 120 |
+
Throughput improves anyway because the dispatch + ANE pipeline keeps moving.
|
| 121 |
+
The fix would be a batched model, which Apple's stack doesn't currently support
|
| 122 |
+
for ViT-B at runtime (see Limitations).
|
| 123 |
|
| 124 |
## Limitations
|
| 125 |
|
| 126 |
+
- **Image branch only.** The text encoder *can* be converted using the
|
| 127 |
+
`convert_text_encoder.py` script in this repo (it works around coremltools'
|
| 128 |
+
lack of converter for PyTorch's fused `_native_multi_head_attention` by
|
| 129 |
+
replacing `nn.MultiheadAttention` with manual matmul attention before
|
| 130 |
+
tracing; output matches PyTorch at cosine 0.999996). **However**,
|
| 131 |
+
SigLIP2 uses Gemma2's 256K-token vocabulary; the token embedding alone is
|
| 132 |
+
393 MB at fp16, pushing the full text encoder to ~565 MB. Too large to
|
| 133 |
+
bundle here. Recommended: run the text encoder via
|
| 134 |
+
[`open_clip_torch`](https://github.com/mlfoundations/open_clip) on demand;
|
| 135 |
+
it's a one-shot per query, ~50ms on PyTorch+MPS or CPU.
|
|
|
|
| 136 |
|
| 137 |
- **Batch=1 only.** Two separate constraints conflate here:
|
| 138 |
- Apple's `ct.ImageType` is hardcoded to require batch=1
|
|
|
|
| 144 |
SigLIP2-B at batch=8; both hung past 2-minute timeouts). I haven't
|
| 145 |
isolated whether this is a fundamental ANE limit or specific to
|
| 146 |
certain architectures; treat as "empirically blocked at batch>1 for
|
| 147 |
+
these ViT-B image encoders, root cause unconfirmed."
|
| 148 |
|
| 149 |
For high-throughput indexing within batch=1, dispatch many images via
|
| 150 |
`model.predict([{"image": img1}, {"image": img2}, ...])`; coremltools
|
| 151 |
loops them in C and amortizes the Python-to-Objective-C overhead.
|
| 152 |
|
| 153 |
- **iOS 17+ / macOS 14+.** Compiled with `minimum_deployment_target=macOS14`
|
| 154 |
+
to enable modern Core ML features. Older OS versions need re-conversion
|
| 155 |
+
via `convert_to_coreml.py`.
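
The 393 MB token-embedding figure in the image-branch limitation above can be sanity-checked with simple arithmetic. The sketch assumes a vocabulary of exactly 256,000 tokens and a text embedding width of 768; neither number is stated in this README, so treat both as assumptions. `embedding_table_mb` is a hypothetical helper.

```python
def embedding_table_mb(vocab_size, width, bytes_per_value=2):
    """Size of a token-embedding matrix in decimal megabytes."""
    return vocab_size * width * bytes_per_value / 1e6


# Assumed values: 256,000 tokens, 768-wide embeddings, fp16 (2 bytes each).
size_mb = embedding_table_mb(256_000, 768)
print(f"{size_mb:.0f} MB")  # prints "393 MB"
```

Under those assumptions the embedding table accounts for most of the ~565 MB text encoder.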
|
| 156 |
|
| 157 |
+
## Pairing with a text encoder
|
| 158 |
|
| 159 |
+
The standard hybrid pattern: Core ML image encoder for the heavy per-photo
|
| 160 |
+
work, PyTorch text encoder for one-shot per query.
|
| 161 |
|
| 162 |
```python
|
| 163 |
import open_clip, torch
|
|
|
|
| 168 |
# Image side: Core ML on ANE
|
| 169 |
img_model = ct.models.MLModel("ViT-B-16-SigLIP2_image_8bit.mlpackage",
|
| 170 |
compute_units=ct.ComputeUnit.CPU_AND_NE)
|
| 171 |
+
img = Image.open("cat.jpg").convert("RGB").resize((224, 224), Image.BICUBIC)
|
| 172 |
+
img_emb = next(iter(img_model.predict({"image": img}).values()))[0]
|
|
|
|
| 173 |
img_emb /= np.linalg.norm(img_emb)
|
| 174 |
|
| 175 |
+
# Text side: PyTorch on CPU/MPS, one-time per query
|
| 176 |
+
pt_model, _, _ = open_clip.create_model_and_transforms(
|
| 177 |
+
"ViT-B-16-SigLIP2", pretrained="webli")
|
| 178 |
tokenizer = open_clip.get_tokenizer("ViT-B-16-SigLIP2")
|
| 179 |
pt_model.eval()
|
| 180 |
|
|
|
|
| 184 |
txt_emb = txt_emb / txt_emb.norm(dim=-1, keepdim=True)
|
| 185 |
txt_emb = txt_emb.numpy()
|
| 186 |
|
| 187 |
+
similarities = txt_emb @ img_emb # shape (2,)
|
| 188 |
+
# similarities[i] = cosine(prompt_i, image). Higher = better match.
|
| 189 |
```
|
| 190 |
|
| 191 |
## How this was made
|
| 192 |
|
| 193 |
+
The `convert_to_coreml.py` script in this repo reproduces the image variants;
|
| 194 |
+
`convert_text_encoder.py` does the (large) text encoder.
|
| 195 |
|
| 196 |
```bash
|
| 197 |
pip install coremltools open_clip_torch torch torchvision pillow numpy transformers
|
| 198 |
+
|
| 199 |
+
# fp16 only
|
| 200 |
+
python convert_to_coreml.py ViT-B-16-SigLIP2
|
| 201 |
+
|
| 202 |
+
# fp16 + 8-bit (run once with --palettize 6 separately for the 6-bit variant)
|
| 203 |
python convert_to_coreml.py ViT-B-16-SigLIP2 --palettize 8
|
|
|
|
| 204 |
```
|
| 205 |
|
| 206 |
+
The image converter:
|
| 207 |
1. Loads PyTorch weights via `open_clip.create_model_and_transforms`
|
| 208 |
2. Wraps with an L2-normalize head so output is search-ready
|
| 209 |
3. Traces with `torch.jit.trace` at the model's expected input shape (224×224)
|
| 210 |
4. Reads `mean`/`std` from the preprocess transform → derives Core ML
|
| 211 |
+
`scale`/`bias`. **This step is load-bearing for SigLIP2 specifically**
|
| 212 |
+
because its preprocess uses `Normalize(mean=0.5, std=0.5)` (mapping pixels
|
| 213 |
+
to [-1, 1]); hardcoding the more common [0, 1] mapping silently degrades
|
| 214 |
+
cosine vs PyTorch by ~0.024 (model-specific β for a model with no Normalize
|
| 215 |
+
transform like MobileCLIP2-B, getting it wrong is a no-op).
|
| 216 |
5. Converts via `ct.convert` with `compute_units=CPU_AND_NE`
|
| 217 |
6. Optionally palettizes via `coremltools.optimize.coreml.palettize_weights`
|
| 218 |
7. Verifies cosine vs PyTorch + benchmarks ANE throughput
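
Step 4 can be sketched as a small derivation. It assumes the Core ML image input computes `scale * pixel + bias` on 0-255 pixel values, and that `Normalize(mean, std)` is applied to pixels already scaled to [0, 1]. `coreml_scale_bias` is a hypothetical helper, not the function used in `convert_to_coreml.py`, and it handles only a single `mean`/`std` shared by all channels.

```python
def coreml_scale_bias(mean, std):
    """Fold Normalize(mean, std) on [0, 1] pixels into scale * p + bias on [0, 255] pixels.

    (p / 255 - mean) / std  ==  p * (1 / (255 * std)) + (-mean / std)
    """
    scale = 1.0 / (255.0 * std)
    bias = -mean / std
    return scale, bias


# SigLIP2-style preprocessing: Normalize(mean=0.5, std=0.5).
scale, bias = coreml_scale_bias(0.5, 0.5)

# Black (0) maps to -1 and white (255) maps to +1.
assert abs(scale * 0 + bias - (-1.0)) < 1e-12
assert abs(scale * 255 + bias - 1.0) < 1e-12
```

The [0, 1] mapping mentioned in step 4 corresponds to `scale = 1 / 255` and `bias = 0`, which feeds the network inputs in the wrong range.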
|
| 219 |
|
| 220 |
+
Conversion timings: ~15s for fp16, +20s for 6-bit palettization, +50s for
|
| 221 |
+
8-bit (palettization scales with model depth and bit precision).
|
| 222 |
|
| 223 |
## Attribution
|
| 224 |
|
| 225 |
+
- **Weights**: SigLIP2 by Google ([paper](https://arxiv.org/abs/2502.14786)).
|
| 226 |
+
The `webli` pretrained tag in `open_clip` corresponds to Google's release
|
| 227 |
+
trained on the WebLI dataset. Model loaded via the
|
| 228 |
+
[`mlfoundations/open_clip`](https://github.com/mlfoundations/open_clip)
|
| 229 |
+
community library; `open_clip` is `mlfoundations`' work, not Google's or Apple's.
|
| 230 |
+
- **Conversion tooling**: `coremltools` (Apple), `open_clip_torch` (mlfoundations).
|
| 231 |
+
- **Pattern reference**:
|
| 232 |
[`apple/coreml-mobileclip`](https://huggingface.co/apple/coreml-mobileclip)
|
| 233 |
+
established the convention of Core ML CLIP-family image encoders shipped
|
| 234 |
alongside PyTorch text encoders.
|
| 235 |
|
| 236 |
## License
|
| 237 |
|
| 238 |
Apache 2.0, matching the upstream SigLIP2 weights' license. See `LICENSE`.