---
license: apache-2.0
library_name: coreml
base_model: google/siglip2-base-patch16-224
tags:
- siglip2
- siglip
- core-ml
- coreml
- apple-silicon
- ane
- neural-engine
- image-encoder
- image-embedding
- semantic-search
- on-device
language:
- en
pipeline_tag: zero-shot-image-classification
---

# ViT-B-16-SigLIP2: Image Encoder, Apple Core ML

Apple Neural Engine (ANE) acceleration for SigLIP2's image branch on M-series
Macs and iOS 17+ devices. Apple's [`apple/coreml-mobileclip`](https://huggingface.co/apple/coreml-mobileclip)
ships the older MobileCLIP family in Core ML form but stops there; this repo
fills the gap for the higher-accuracy SigLIP2.

**Matches the PyTorch reference at fp16** (cosine 1.0000 to four decimal places
vs `open_clip`, verified end-to-end; see "Verifying the conversion yourself"
below). Three precision tiers ship in this repo so you can pick the right
size/quality trade-off.

All `.mlpackage` files are stored via Git LFS. Clone with
`git lfs install && git clone …`, or download individual files via the HF Hub
web UI.

## Quick start

```python
import coremltools as ct
from PIL import Image

model = ct.models.MLModel(
    "ViT-B-16-SigLIP2_image_8bit.mlpackage",
    compute_units=ct.ComputeUnit.CPU_AND_NE,  # ANE-accelerated
)

# Resize to 224×224 with a quality resampler. Core ML's image input bakes in
# the [-1, 1] normalization but does NOT do its own resize.
img = Image.open("photo.jpg").convert("RGB").resize((224, 224), Image.BICUBIC)
out = model.predict({"image": img})
embedding = out["embedding"][0]  # (768,) float32, L2-normalized
```

The model **outputs L2-normalized embeddings**, so cosine similarity is just a
dot product. **Channel normalization** (the [-1, 1] mapping SigLIP2 expects)
is baked into the Core ML graph; you only need to deliver a 224×224 RGB image.
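
Because the embeddings are already unit length, ranking a gallery against a
query reduces to one matrix-vector product. The sketch below illustrates that
with toy 3-D vectors standing in for the real 768-D embeddings;
`rank_by_cosine` is a hypothetical helper, not something shipped in this repo.

```python
import numpy as np

def rank_by_cosine(query: np.ndarray, gallery: np.ndarray, top_k: int = 3):
    """Return (index, score) pairs for the top_k gallery rows closest to query.

    Assumes query has shape (D,), gallery has shape (N, D), and both are
    already L2-normalized, so the dot product equals cosine similarity.
    """
    scores = gallery @ query               # (N,) cosine similarities
    order = np.argsort(-scores)[:top_k]    # highest score first
    return [(int(i), float(scores[i])) for i in order]

# Toy stand-ins for real embeddings (3-D instead of 768-D), all unit length.
gallery = np.array([[1.0, 0.0, 0.0],
                    [0.0, 1.0, 0.0],
                    [0.6, 0.8, 0.0]], dtype=np.float32)
query = np.array([0.0, 1.0, 0.0], dtype=np.float32)

ranked = rank_by_cosine(query, gallery, top_k=2)
print(ranked)  # gallery row 1 first (exact match), then row 2
```

With real data, `gallery` would be the stacked Core ML outputs for your photo
collection and `query` either another image embedding or a text embedding.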

## Verifying the conversion yourself

Don't trust the cosine claims; reproduce them in 30 seconds:

```python
import open_clip, torch, numpy as np, coremltools as ct
from PIL import Image, ImageDraw

pt_model, _, pt_pre = open_clip.create_model_and_transforms(
    "ViT-B-16-SigLIP2", pretrained="webli")
pt_model.eval()
img = Image.new("RGB", (224, 224), (40, 40, 40))
ImageDraw.Draw(img).ellipse([40, 40, 184, 184], fill=(0, 255, 0))

# PyTorch reference
with torch.no_grad():
    pt_emb = pt_model.visual(pt_pre(img).unsqueeze(0))
    pt_emb = (pt_emb / pt_emb.norm(dim=-1, keepdim=True))[0].numpy()

# Core ML
cm = ct.models.MLModel("ViT-B-16-SigLIP2_image_fp16.mlpackage",
                       compute_units=ct.ComputeUnit.CPU_AND_NE)
cm_emb = next(iter(cm.predict({"image": img}).values()))[0].astype(np.float32)
cm_emb /= np.linalg.norm(cm_emb)

print(f"cosine: {float(np.dot(pt_emb, cm_emb)):.6f}")  # expect ~1.000000 for fp16
```

## Available variants

| File | Size | Cosine vs PyTorch | Best for |
|---|---|---|---|
| `ViT-B-16-SigLIP2_image_fp16.mlpackage` | 185 MB | **1.0000** | Reference / benchmark reproduction |
| `ViT-B-16-SigLIP2_image_8bit.mlpackage` | 93 MB | **0.9942** | **Default**: near-perfect, 2× smaller |
| `ViT-B-16-SigLIP2_image_6bit.mlpackage` | 70 MB | 0.9629 | When download size matters more than ranking precision |

Cosine was measured on a synthetic test image; rankings on real photo
collections are essentially identical between fp16 and 8-bit. 6-bit shows minor
reordering of close-scoring results. 4-bit was tested (0.79 cosine) and
deliberately excluded as too lossy for retrieval ranking.
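
The listed file sizes track bits per weight closely. As a rough check, the
sketch below scales the 185 MB fp16 size by `bits / 16`. It assumes the
package is almost entirely weights and that palette lookup tables add
negligible overhead; `estimated_size_mb` is a name made up for this example.

```python
FP16_SIZE_MB = 185   # fp16 package size from the variants table
FP16_BITS = 16

def estimated_size_mb(bits_per_weight: int) -> float:
    """Scale the fp16 size by bits per weight, ignoring lookup-table overhead."""
    return FP16_SIZE_MB * bits_per_weight / FP16_BITS

for bits, listed_mb in [(8, 93), (6, 70)]:
    print(f"{bits}-bit: estimated {estimated_size_mb(bits):.1f} MB, listed {listed_mb} MB")
# 8-bit: estimated 92.5 MB, listed 93 MB
# 6-bit: estimated 69.4 MB, listed 70 MB
```

Both estimates land within 1 MB of the shipped files, which is consistent with
palettization shrinking the weights and leaving everything else unchanged.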

## Performance (Apple M-series, CPU+ANE)

Measured one image at a time; this package, like Apple's published Core ML
packages, is compiled at batch=1 (see "Limitations" for why).

| Variant | Throughput | vs PyTorch+MPS baseline |
|---|---|---|
| fp16 Core ML | ~133 img/s | **2.6× faster** |
| 8-bit Core ML | ~138 img/s | **2.7× faster** |
| 6-bit Core ML | ~139 img/s | **2.7× faster** |
| (PyTorch+MPS reference) | ~51 img/s | 1.0× |

End-to-end, with disk loading and parallel PIL preprocessing, throughput caps
at ~110 img/s for the 8-bit variant. PIL decode becomes the bottleneck before
the ANE does, which is why the throughput differences between palettized
variants vanish in practice.
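
To see where your own pipeline bottlenecks, a small harness that decodes in a
thread pool and predicts serially is enough. This is a minimal sketch, not the
benchmark script behind the numbers above; `measure_throughput` and the two
`fake_*` callables are hypothetical stand-ins so it runs without a model or
image files.

```python
import time
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Iterable

def measure_throughput(preprocess: Callable, predict: Callable,
                       items: Iterable, workers: int = 4) -> float:
    """Return items/second for a decode-in-parallel, predict-serially loop."""
    items = list(items)
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # pool.map yields results in input order while workers run ahead,
        # so preprocessing overlaps with the serial predict calls below.
        for prepared in pool.map(preprocess, items):
            predict(prepared)
    elapsed = time.perf_counter() - start
    return len(items) / elapsed

# Stand-in callables (assumptions for illustration only).
def fake_preprocess(path):
    time.sleep(0.002)   # pretend to decode + resize one image
    return path

def fake_predict(img):
    time.sleep(0.001)   # pretend to run one batch=1 inference
    return img

rate = measure_throughput(fake_preprocess, fake_predict, range(50))
print(f"{rate:.0f} items/s")
```

In real use, `preprocess` would open and resize an image with PIL and
`predict` would call `model.predict({"image": img})`. Timing each stage
separately shows which one sets the ceiling.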

**Expected ANE utilization is low (~5%)** when you profile this in Instruments.
That's not a bug: the per-call dispatch overhead (Python → Objective-C →
Apple's compute scheduler) dominates the actual ANE compute time per inference.
Throughput improves anyway because the dispatch + ANE pipeline keeps moving.
The fix would be a batched model, which I could not get to convert for these
ViT-B encoders (see Limitations).

## Limitations

- **Image branch only.** The text encoder *can* be converted using the
  `convert_text_encoder.py` script in this repo (it works around coremltools'
  lack of a converter for PyTorch's fused `_native_multi_head_attention` by
  replacing `nn.MultiheadAttention` with manual matmul attention before
  tracing; the output matches PyTorch at cosine 0.999996). **However**,
  SigLIP2 uses Gemma2's 256K-token vocabulary; the token embedding alone is
  393 MB at fp16, pushing the full text encoder to ~565 MB. Too large to
  bundle here. Recommended: run the text encoder via
  [`open_clip_torch`](https://github.com/mlfoundations/open_clip) on demand.
  It's a one-shot per query, ~50 ms on PyTorch+MPS or CPU.

- **Batch=1 only.** Two separate constraints combine here:
  - Apple's `ct.ImageType` is hardcoded to require batch=1
    ([source: `coremltools/converters/mil/backend/backend_helper.py`](https://github.com/apple/coremltools/blob/main/coremltools/converters/mil/backend/backend_helper.py),
    line ~63; the validator rejects any shape where `shape[0] != 1`).
  - Switching to `ct.TensorType` (raw float input) bypasses that validator
    but caused `ct.convert()` to stall in Apple's native ANE compiler on the
    two ViT-B models I tried (MobileCLIP2-B at batch=16, SigLIP2-B at
    batch=8; both hung past 2-minute timeouts). I haven't isolated whether
    this is a fundamental ANE limit or specific to certain architectures;
    treat as "empirically blocked at batch>1 for these ViT-B image encoders,
    root cause unconfirmed."

  For high-throughput indexing within batch=1, dispatch many images via
  `model.predict([{"image": img1}, {"image": img2}, ...])`; coremltools
  loops over them in C and amortizes the Python-to-Objective-C overhead.

- **iOS 17+ / macOS 14+.** Compiled with `minimum_deployment_target=macOS14`
  to enable modern Core ML features. Older OS versions need re-conversion
  via `convert_to_coreml.py`.
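
The list-of-dicts `predict` workaround from the batch=1 item above can be
wrapped in a small helper. This is a sketch under assumptions: it presumes a
coremltools version whose `MLModel.predict` accepts a list of input dicts (as
described above), and `chunked`, `embed_many`, and `FakeModel` are
hypothetical names used only so the example runs without a `.mlpackage`.

```python
from typing import Any, Dict, Iterable, Iterator, List

def chunked(items: Iterable[Any], size: int) -> Iterator[List[Any]]:
    """Yield consecutive lists of at most `size` items."""
    chunk: List[Any] = []
    for item in items:
        chunk.append(item)
        if len(chunk) == size:
            yield chunk
            chunk = []
    if chunk:
        yield chunk

def embed_many(model: Any, images: Iterable[Any], chunk_size: int = 64) -> List[Any]:
    """Embed images by passing lists of input dicts to model.predict.

    Each list element is still a batch=1 input; only the number of
    Python-side calls shrinks. Returns the first output value per image.
    """
    embeddings: List[Any] = []
    for chunk in chunked(images, chunk_size):
        outputs = model.predict([{"image": img} for img in chunk])
        embeddings.extend(next(iter(out.values())) for out in outputs)
    return embeddings

class FakeModel:
    """Hypothetical stand-in that mimics the list-in, list-out predict shape."""
    def predict(self, inputs: List[Dict[str, Any]]) -> List[Dict[str, Any]]:
        return [{"embedding": f"emb({d['image']})"} for d in inputs]

result = embed_many(FakeModel(), ["a.jpg", "b.jpg", "c.jpg"], chunk_size=2)
print(result)  # ['emb(a.jpg)', 'emb(b.jpg)', 'emb(c.jpg)']
```

Chunking keeps memory bounded when indexing a large collection, since only
`chunk_size` decoded images are alive at once.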

## Pairing with a text encoder

The standard hybrid pattern: Core ML image encoder for the heavy per-photo
work, PyTorch text encoder for the one-shot per-query work.

```python
import open_clip, torch
import coremltools as ct
import numpy as np
from PIL import Image

# Image side: Core ML on ANE
img_model = ct.models.MLModel("ViT-B-16-SigLIP2_image_8bit.mlpackage",
                              compute_units=ct.ComputeUnit.CPU_AND_NE)
img = Image.open("cat.jpg").convert("RGB").resize((224, 224), Image.BICUBIC)
img_emb = next(iter(img_model.predict({"image": img}).values()))[0]
img_emb /= np.linalg.norm(img_emb)

# Text side: PyTorch on CPU/MPS, one-time per query
pt_model, _, _ = open_clip.create_model_and_transforms(
    "ViT-B-16-SigLIP2", pretrained="webli")
tokenizer = open_clip.get_tokenizer("ViT-B-16-SigLIP2")
pt_model.eval()

with torch.no_grad():
    tokens = tokenizer(["a photo of a cat", "a photo of a dog"])
    txt_emb = pt_model.encode_text(tokens)
    txt_emb = txt_emb / txt_emb.norm(dim=-1, keepdim=True)
    txt_emb = txt_emb.numpy()

similarities = txt_emb @ img_emb  # shape (2,)
# similarities[i] = cosine(prompt_i, image). Higher = better match.
```
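
The size argument for keeping the text side in PyTorch (see Limitations) can
be checked with back-of-envelope arithmetic. The sketch assumes the "256K"
vocabulary is exactly 256,000 entries and that the token embedding width is
768, matching the 768-d output; neither value is read from the checkpoint here.

```python
VOCAB_SIZE = 256_000     # assumed exact size behind the "256K-token" figure
EMBED_DIM = 768          # assumed token embedding width
BYTES_PER_FP16 = 2
FULL_TEXT_ENCODER_MB = 565   # approximate total quoted in Limitations

embedding_bytes = VOCAB_SIZE * EMBED_DIM * BYTES_PER_FP16
embedding_mb = embedding_bytes / 1e6
print(f"token embedding: {embedding_mb:.0f} MB")                         # 393 MB
print(f"everything else: ~{FULL_TEXT_ENCODER_MB - embedding_mb:.0f} MB")  # ~172 MB
```

Under these assumptions the embedding table accounts for roughly 70% of the
text encoder, so quantizing the transformer layers alone would not make it
small enough to bundle.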

## How this was made

The `convert_to_coreml.py` script in this repo reproduces the image variants;
`convert_text_encoder.py` does the (large) text encoder.

```bash
pip install coremltools open_clip_torch torch torchvision pillow numpy transformers

# fp16 only
python convert_to_coreml.py ViT-B-16-SigLIP2

# fp16 + 8-bit (run again with --palettize 6 for the 6-bit variant)
python convert_to_coreml.py ViT-B-16-SigLIP2 --palettize 8
```

The image converter:
1. Loads PyTorch weights via `open_clip.create_model_and_transforms`
2. Wraps with an L2-normalize head so output is search-ready
3. Traces with `torch.jit.trace` at the model's expected input shape (224×224)
4. Reads `mean`/`std` from the preprocess transform and derives Core ML
   `scale`/`bias`. **This step is load-bearing for SigLIP2 specifically**
   because its preprocess uses `Normalize(mean=0.5, std=0.5)` (mapping pixels
   to [-1, 1]); hardcoding the more common [0, 1] mapping silently degrades
   cosine vs PyTorch by ~0.024 (model-specific: for a model with no Normalize
   transform like MobileCLIP2-B, getting it wrong is a no-op).
5. Converts via `ct.convert` with `compute_units=CPU_AND_NE`
6. Optionally palettizes via `coremltools.optimize.coreml.palettize_weights`
7. Verifies cosine vs PyTorch + benchmarks ANE throughput
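
Step 4 is the easiest one to get wrong, so here is the arithmetic spelled out.
This is a sketch of the derivation, not an excerpt from `convert_to_coreml.py`:
`coreml_scale_bias` is a hypothetical helper, and it assumes Core ML applies
`y = scale * x + bias` to pixels in [0, 255] with a single scalar `scale`.

```python
from typing import List, Sequence, Tuple

def coreml_scale_bias(mean: Sequence[float],
                      std: Sequence[float]) -> Tuple[float, List[float]]:
    """Fold Normalize(mean, std) on [0, 1] pixels into y = scale * x + bias
    on [0, 255] pixels.

    (x / 255 - mean) / std  ==  x * (1 / (255 * std)) + (-mean / std)
    A single scalar scale only works if every channel shares the same std.
    """
    if len(set(std)) != 1:
        raise ValueError("per-channel std cannot be expressed as one scalar scale")
    scale = 1.0 / (255.0 * std[0])
    bias = [-m / std[0] for m in mean]
    return scale, bias

# SigLIP2 preprocessing as described above: mean = std = 0.5 on every channel.
scale, bias = coreml_scale_bias([0.5, 0.5, 0.5], [0.5, 0.5, 0.5])
print(scale, bias)   # scale is 2/255, bias is -1 per channel

# Sanity check: the darkest and brightest pixels map to -1 and +1.
lo = scale * 0 + bias[0]
hi = scale * 255 + bias[0]
print(lo, hi)
```

These values would then be passed as the `scale` and `bias` arguments of
`ct.ImageType`. For comparison, a plain [0, 1] mapping corresponds to
`scale = 1/255` and `bias = 0`, which is the mismatch the step warns about.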

Conversion timings: ~15 s for fp16, +20 s for 6-bit palettization, +50 s for
8-bit (palettization time scales with model depth and bit precision).

## Attribution

- **Weights**: SigLIP2 by Google ([paper](https://arxiv.org/abs/2502.14786)).
  The `webli` pretrained tag in `open_clip` corresponds to Google's release
  trained on the WebLI dataset. Model loaded via the
  [`mlfoundations/open_clip`](https://github.com/mlfoundations/open_clip)
  community library; `open_clip` is `mlfoundations`' work, not Google's or Apple's.
- **Conversion tooling**: `coremltools` (Apple), `open_clip_torch` (mlfoundations).
- **Pattern reference**:
  [`apple/coreml-mobileclip`](https://huggingface.co/apple/coreml-mobileclip)
  established the convention of Core ML CLIP-family image encoders shipped
  alongside PyTorch text encoders.

## License

Apache 2.0, matching the upstream SigLIP2 weights' license. See `LICENSE`.