---
license: apache-2.0
library_name: coreml
base_model: google/siglip2-base-patch16-224
tags:
- siglip2
- siglip
- core-ml
- coreml
- apple-silicon
- ane
- neural-engine
- image-encoder
- image-embedding
- semantic-search
- on-device
language:
- en
pipeline_tag: zero-shot-image-classification
---

# ViT-B-16-SigLIP2: Image Encoder, Apple Core ML

Apple Neural Engine (ANE) acceleration for SigLIP2's image branch on M-series Macs and iOS 17+ devices. Apple's [`apple/coreml-mobileclip`](https://huggingface.co/apple/coreml-mobileclip) ships the older MobileCLIP family in Core ML form but stops there; this repo fills the gap for the higher-accuracy SigLIP2.

**Matches PyTorch at fp16** (cosine 1.0000 vs the `open_clip` reference, verified end-to-end; see "Verifying the conversion yourself" below). Three precision tiers ship in this repo so you can pick the right size/quality trade-off.

All `.mlpackage` files are stored via Git LFS: clone with `git lfs install && git clone …` or download individual files via the HF Hub web UI.

## Quick start

```python
import coremltools as ct
from PIL import Image

model = ct.models.MLModel(
    "ViT-B-16-SigLIP2_image_8bit.mlpackage",
    compute_units=ct.ComputeUnit.CPU_AND_NE,  # ANE-accelerated
)

# Resize to 224×224 with a quality resampler. Core ML's image input bakes in
# the [-1, 1] normalization but does NOT do its own resize.
img = Image.open("photo.jpg").convert("RGB").resize((224, 224), Image.BICUBIC)

out = model.predict({"image": img})
embedding = out["embedding"][0]  # (768,) float32, L2-normalized
```

The model **outputs L2-normalized embeddings**, so cosine similarity is just a dot product. **Channel normalization** (the [-1, 1] mapping SigLIP2 expects) is baked into the Core ML graph; you only need to deliver a 224×224 RGB image.

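
Because the embeddings are unit-length, ranking a photo collection against a query reduces to one matrix-vector product. The snippet below is a minimal sketch of that idea, using random stand-in vectors rather than real model outputs; `l2_normalize` and `rank_by_similarity` are hypothetical helper names and are not part of this repo.

```python
import numpy as np


def l2_normalize(x: np.ndarray) -> np.ndarray:
    """Scale each row of `x` to unit L2 norm."""
    return x / np.linalg.norm(x, axis=-1, keepdims=True)


def rank_by_similarity(query: np.ndarray, index: np.ndarray) -> np.ndarray:
    """Return row indices of `index`, most similar to `query` first.

    Both inputs are assumed to be L2-normalized, so the dot product
    equals the cosine similarity.
    """
    scores = index @ query
    return np.argsort(-scores)


# Stand-in data: random 768-d vectors in place of real image embeddings.
rng = np.random.default_rng(0)
index = l2_normalize(rng.standard_normal((5, 768)).astype(np.float32))
query = index[3].copy()  # a query identical to stored row 3

scores = index @ query                    # cosine similarities, shape (5,)
order = rank_by_similarity(query, index)  # best match first
print(order[0], float(scores[order[0]]))
```

With real data, `index` would be the stacked `embedding` outputs for the collection and `query` a single image or text embedding from the same model.
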
## Verifying the conversion yourself

Don't trust the cosine claims; reproduce them in 30 seconds:

```python
import open_clip, torch, numpy as np, coremltools as ct
from PIL import Image, ImageDraw

pt_model, _, pt_pre = open_clip.create_model_and_transforms(
    "ViT-B-16-SigLIP2", pretrained="webli")
pt_model.eval()

img = Image.new("RGB", (224, 224), (40, 40, 40))
ImageDraw.Draw(img).ellipse([40, 40, 184, 184], fill=(0, 255, 0))

# PyTorch reference
with torch.no_grad():
    pt_emb = pt_model.visual(pt_pre(img).unsqueeze(0))
    pt_emb = (pt_emb / pt_emb.norm(dim=-1, keepdim=True))[0].numpy()

# Core ML
cm = ct.models.MLModel("ViT-B-16-SigLIP2_image_fp16.mlpackage",
                       compute_units=ct.ComputeUnit.CPU_AND_NE)
cm_emb = next(iter(cm.predict({"image": img}).values()))[0].astype(np.float32)
cm_emb /= np.linalg.norm(cm_emb)

print(f"cosine: {float(np.dot(pt_emb, cm_emb)):.6f}")  # → 1.000000 for fp16
```

## Available variants

| File | Size | Cosine vs PyTorch | Best for |
|---|---|---|---|
| `ViT-B-16-SigLIP2_image_fp16.mlpackage` | 185 MB | **1.0000** | Reference / benchmark reproduction |
| `ViT-B-16-SigLIP2_image_8bit.mlpackage` | 93 MB | **0.9942** | **Default**: near-perfect, 2× smaller |
| `ViT-B-16-SigLIP2_image_6bit.mlpackage` | 70 MB | 0.9629 | When download size matters more than ranking precision |

Cosine measured on a synthetic test image; rankings on real photo collections are essentially identical between fp16 and 8-bit. 6-bit shows minor reordering of close-scoring results.

4-bit was tested (0.79 cosine) and deliberately excluded as too lossy for retrieval ranking.

## Performance (Apple M-series, CPU+ANE)

Measured on a single image at a time; Apple's published Core ML packages (including this one) are compiled at batch=1 (see "Limitations" for why).

| Variant | Throughput | vs PyTorch+MPS baseline |
|---|---|---|
| fp16 Core ML | ~133 img/s | **2.6× faster** |
| 8-bit Core ML | ~138 img/s | **2.7× faster** |
| 6-bit Core ML | ~139 img/s | **2.7× faster** |
| (PyTorch+MPS reference) | ~51 img/s | 1.0× |

End to end, with disk loading and parallel PIL preprocessing, throughput caps at ~110 img/s for the 8-bit variant: PIL decode becomes the bottleneck before the ANE does, which is why the throughput differences between palettization levels vanish in practice.

**Expected ANE utilization is low (~5%)** when you profile this in Instruments. That's not a bug: the per-call dispatch overhead (Python ↔ Objective-C ↔ Apple's compute scheduler) dominates the actual ANE compute time per inference. Throughput improves anyway because the dispatch + ANE pipeline keeps moving. The fix would be a batched model, which Apple's stack doesn't currently support for ViT-B at runtime (see Limitations).

## Limitations

- **Image branch only.** The text encoder *can* be converted using the `convert_text_encoder.py` script in this repo (it works around coremltools' lack of a converter for PyTorch's fused `_native_multi_head_attention` by replacing `nn.MultiheadAttention` with manual matmul attention before tracing; the output matches PyTorch at cosine 0.999996). **However**, SigLIP2 uses Gemma2's 256K-token vocabulary; the token embedding alone is 393 MB at fp16, pushing the full text encoder to ~565 MB. That is too large to bundle here. Recommended: run the text encoder via [`open_clip_torch`](https://github.com/mlfoundations/open_clip) on demand; it runs once per query and takes ~50 ms on PyTorch+MPS or CPU.
- **Batch=1 only.** Two separate constraints are tangled together here:
  - Apple's `ct.ImageType` is hardcoded to require batch=1 ([source: `coremltools/converters/mil/backend/backend_helper.py`](https://github.com/apple/coremltools/blob/main/coremltools/converters/mil/backend/backend_helper.py), line ~63; the validator rejects any shape where `shape[0] != 1`).
  - Switching to `ct.TensorType` (raw float input) bypasses that validator but caused `ct.convert()` to stall indefinitely in Apple's native ANE compiler on the two ViT-B models I tried (MobileCLIP2-B at batch=16, SigLIP2-B at batch=8; both hung past 2-minute timeouts). I haven't isolated whether this is a fundamental ANE limit or specific to certain architectures; treat as "empirically blocked at batch>1 for these ViT-B image encoders, root cause unconfirmed."

  For high-throughput indexing within batch=1, dispatch many images via `model.predict([{"image": img1}, {"image": img2}, ...])`; coremltools loops them in C and amortizes the Python ↔ Objective-C overhead.
- **iOS 17+ / macOS 14+.** Compiled with `minimum_deployment_target=macOS14` to enable modern Core ML features. Older OS versions need re-conversion via `convert_to_coreml.py`.

## Pairing with a text encoder

The standard hybrid pattern: Core ML image encoder for the heavy per-photo work, PyTorch text encoder for one-shot per query.

```python
import open_clip, torch
import coremltools as ct
import numpy as np
from PIL import Image

# Image side: Core ML on ANE
img_model = ct.models.MLModel("ViT-B-16-SigLIP2_image_8bit.mlpackage",
                              compute_units=ct.ComputeUnit.CPU_AND_NE)
img = Image.open("cat.jpg").convert("RGB").resize((224, 224), Image.BICUBIC)
img_emb = next(iter(img_model.predict({"image": img}).values()))[0]
img_emb /= np.linalg.norm(img_emb)

# Text side: PyTorch on CPU/MPS, once per query
pt_model, _, _ = open_clip.create_model_and_transforms(
    "ViT-B-16-SigLIP2", pretrained="webli")
tokenizer = open_clip.get_tokenizer("ViT-B-16-SigLIP2")
pt_model.eval()
with torch.no_grad():
    tokens = tokenizer(["a photo of a cat", "a photo of a dog"])
    txt_emb = pt_model.encode_text(tokens)
    txt_emb = txt_emb / txt_emb.norm(dim=-1, keepdim=True)
    txt_emb = txt_emb.numpy()

similarities = txt_emb @ img_emb  # shape (2,)
# similarities[i] = cosine(prompt_i, image). Higher = better match.
```

## How this was made

The `convert_to_coreml.py` script in this repo reproduces the image variants; `convert_text_encoder.py` does the (large) text encoder.

```bash
pip install coremltools open_clip_torch torch torchvision pillow numpy transformers

# fp16 only
python convert_to_coreml.py ViT-B-16-SigLIP2

# fp16 + 8-bit (run once with --palettize 6 separately for the 6-bit variant)
python convert_to_coreml.py ViT-B-16-SigLIP2 --palettize 8
```

The image converter:

1. Loads PyTorch weights via `open_clip.create_model_and_transforms`
2. Wraps with an L2-normalize head so output is search-ready
3. Traces with `torch.jit.trace` at the model's expected input shape (224×224)
4. Reads `mean`/`std` from the preprocess transform → derives Core ML `scale`/`bias`. **This step is load-bearing for SigLIP2 specifically** because its preprocess uses `Normalize(mean=0.5, std=0.5)` (mapping pixels to [-1, 1]); hardcoding the more common [0, 1] mapping silently degrades cosine vs PyTorch by ~0.024 (model-specific: for a model with no Normalize transform like MobileCLIP2-B, getting it wrong is a no-op).
5. Converts via `ct.convert` with `compute_units=CPU_AND_NE`
6. Optionally palettizes via `coremltools.optimize.coreml.palettize_weights`
7. Verifies cosine vs PyTorch + benchmarks ANE throughput

Conversion timings: ~15s for fp16, +20s for 6-bit palettization, +50s for 8-bit (palettization scales with model depth and bit precision).

## Attribution

- **Weights**: SigLIP2 by Google ([paper](https://arxiv.org/abs/2502.14786)). The `webli` pretrained tag in `open_clip` corresponds to Google's release trained on the WebLI dataset. Model loaded via the [`mlfoundations/open_clip`](https://github.com/mlfoundations/open_clip) community library; `open_clip` is `mlfoundations`' work, not Google's or Apple's.
- **Conversion tooling**: `coremltools` (Apple), `open_clip_torch` (mlfoundations).
- **Pattern reference**: [`apple/coreml-mobileclip`](https://huggingface.co/apple/coreml-mobileclip) established the convention of Core ML CLIP-family image encoders shipped alongside PyTorch text encoders.

## License

Apache 2.0, matching the upstream SigLIP2 weights' license. See `LICENSE`.