metadata
license: apache-2.0
library_name: coreml
base_model: google/siglip2-base-patch16-224
tags:
  - siglip2
  - siglip
  - core-ml
  - coreml
  - apple-silicon
  - ane
  - neural-engine
  - image-encoder
  - image-embedding
  - semantic-search
  - on-device
language:
  - en
pipeline_tag: zero-shot-image-classification

ViT-B-16-SigLIP2: Image Encoder, Apple Core ML

Apple Neural Engine (ANE) acceleration for SigLIP2's image branch on M-series Macs and iOS 17+ devices. Apple's apple/coreml-mobileclip ships the older MobileCLIP family in Core ML form but stops there; this repo fills the gap for the higher-accuracy SigLIP2.

Matches PyTorch at fp16 to four decimal places (cosine 1.0000 vs the open_clip reference, verified end-to-end; see "Verifying the conversion yourself" below). Three precision tiers ship in this repo so you can pick the right size/quality trade-off.

All .mlpackage files are stored via Git LFS: clone with git lfs install && git clone … or download individual files via the HF Hub web UI.

Quick start

import coremltools as ct
from PIL import Image

model = ct.models.MLModel(
    "ViT-B-16-SigLIP2_image_8bit.mlpackage",
    compute_units=ct.ComputeUnit.CPU_AND_NE,  # ANE-accelerated
)

# Resize to 224×224 with a quality resampler. Core ML's image input bakes in
# the [-1, 1] normalization but does NOT do its own resize.
img = Image.open("photo.jpg").convert("RGB").resize((224, 224), Image.BICUBIC)
out = model.predict({"image": img})
embedding = out["embedding"][0]   # (768,) float32, L2-normalized

The model outputs L2-normalized embeddings, so cosine similarity is just a dot product. Channel normalization (the [-1, 1] mapping SigLIP2 expects) is baked into the Core ML graph; you only need to deliver a 224×224 RGB image.
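To see why the dot product suffices, here is a minimal, dependency-free sketch. The vectors are made-up 3-dimensional stand-ins for the model's 768-dimensional embeddings, and `l2_normalize`, `dot`, and `cosine` are illustrative helpers, not part of this repo.

```python
import math

def l2_normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def cosine(a, b):
    """Full cosine similarity, valid for vectors of any length."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))

# Made-up raw vectors (norms 5 and 3), standing in for un-normalized embeddings.
a_raw = [3.0, 4.0, 0.0]
b_raw = [1.0, 2.0, 2.0]
a, b = l2_normalize(a_raw), l2_normalize(b_raw)

full = cosine(a_raw, b_raw)   # general formula on the raw vectors
shortcut = dot(a, b)          # plain dot product on the unit vectors

assert math.isclose(full, shortcut, rel_tol=1e-9)
print(f"{shortcut:.4f}")  # 0.7333 (= 11/15)
```

Because the Core ML model already returns unit-length vectors, only the `dot` step is needed at query time.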

Verifying the conversion yourself

Don't trust the cosine claims; reproduce them in 30 seconds:

import open_clip, torch, numpy as np, coremltools as ct
from PIL import Image, ImageDraw

pt_model, _, pt_pre = open_clip.create_model_and_transforms(
    "ViT-B-16-SigLIP2", pretrained="webli")
pt_model.eval()
img = Image.new("RGB", (224, 224), (40, 40, 40))
ImageDraw.Draw(img).ellipse([40, 40, 184, 184], fill=(0, 255, 0))

# PyTorch reference
with torch.no_grad():
    pt_emb = pt_model.visual(pt_pre(img).unsqueeze(0))
    pt_emb = (pt_emb / pt_emb.norm(dim=-1, keepdim=True))[0].numpy()

# Core ML
cm = ct.models.MLModel("ViT-B-16-SigLIP2_image_fp16.mlpackage",
                       compute_units=ct.ComputeUnit.CPU_AND_NE)
cm_emb = next(iter(cm.predict({"image": img}).values()))[0].astype(np.float32)
cm_emb /= np.linalg.norm(cm_emb)

print(f"cosine: {float(np.dot(pt_emb, cm_emb)):.6f}")  # → 1.000000 for fp16

Available variants

| File | Size | Cosine vs PyTorch | Best for |
| --- | --- | --- | --- |
| ViT-B-16-SigLIP2_image_fp16.mlpackage | 185 MB | 1.0000 | Reference / benchmark reproduction |
| ViT-B-16-SigLIP2_image_8bit.mlpackage | 93 MB | 0.9942 | Default: near-perfect, 2× smaller |
| ViT-B-16-SigLIP2_image_6bit.mlpackage | 70 MB | 0.9629 | When download size matters more than ranking precision |

Cosine measured on a synthetic test image; rankings on real photo collections are essentially identical between fp16 and 8-bit. 6-bit shows minor reordering of close-scoring results. 4-bit was tested (0.79 cosine) and deliberately excluded: too lossy for retrieval ranking.
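The published file sizes follow roughly from the bit width. A back-of-the-envelope sketch, assuming weights dominate the package size and ignoring palette lookup tables and metadata (`estimate_size_mb` is a hypothetical helper, not something in this repo):

```python
FP16_SIZE_MB = 185  # fp16 package size as published for this repo

def estimate_size_mb(bits, fp16_size_mb=FP16_SIZE_MB):
    """Scale the fp16 size by bits/16; ignores lookup tables and metadata."""
    return fp16_size_mb * bits / 16

# Compare the estimate with the published size of each palettized variant.
for bits, published_mb in [(8, 93), (6, 70)]:
    est = estimate_size_mb(bits)
    print(f"{bits}-bit: estimated {est:.1f} MB, published {published_mb} MB")
```

The estimates (92.5 MB and 69.4 MB) land within about 1 MB of the published sizes, which is consistent with the small overhead of the per-tensor palettes.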

Performance (Apple M-series, CPU+ANE)

Measured on a single image at a time; like Apple's published Core ML packages, this one is compiled at batch=1 (see "Limitations" for why).

| Variant | Throughput | vs PyTorch+MPS baseline |
| --- | --- | --- |
| fp16 Core ML | ~133 img/s | 2.6× faster |
| 8-bit Core ML | ~138 img/s | 2.7× faster |
| 6-bit Core ML | ~139 img/s | 2.7× faster |
| PyTorch+MPS (reference) | ~51 img/s | 1.0× |

End-to-end, with disk loading and parallel PIL preprocessing, throughput caps at ~110 img/s for the 8-bit variant: PIL decode becomes the bottleneck before the ANE does, which is why the throughput differences between palettized variants vanish in practice.
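For capacity planning, these figures translate directly into indexing time. A rough sketch using the throughput numbers quoted in this section; the 10,000-photo library size is an arbitrary example, and `indexing_seconds` is an illustrative helper:

```python
def indexing_seconds(num_images, images_per_second):
    """Wall-clock estimate for embedding a library at a steady throughput."""
    return num_images / images_per_second

LIBRARY_SIZE = 10_000  # arbitrary example size, not a measured workload

# Throughput figures (img/s) as quoted in this section.
rates = {
    "Core ML 8-bit, model only": 138,
    "Core ML 8-bit, end-to-end": 110,
    "PyTorch+MPS, model only": 51,
}

for label, rate in rates.items():
    print(f"{label}: {indexing_seconds(LIBRARY_SIZE, rate):.0f} s")
```

Under these assumptions the 8-bit Core ML path indexes the example library in roughly a minute and a half end-to-end, versus a little over three minutes for the PyTorch+MPS model alone.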

Expect low ANE utilization (~5%) when you profile this in Instruments. That's not a bug: the per-call dispatch overhead (Python ↔ Objective-C ↔ Apple's compute scheduler) dominates the actual ANE compute time per inference. Throughput improves anyway because the dispatch + ANE pipeline keeps moving. The fix would be a batched model, which I could not get to convert for these ViT-B encoders (see Limitations).

Limitations

  • Image branch only. The text encoder can be converted using the convert_text_encoder.py script in this repo (it works around coremltools' lack of a converter for PyTorch's fused _native_multi_head_attention by replacing nn.MultiheadAttention with manual matmul attention before tracing; output matches PyTorch at cosine 0.999996). However, SigLIP2 uses Gemma2's 256K-token vocabulary; the token embedding alone is 393 MB at fp16, pushing the full text encoder to ~565 MB. That is too large to bundle here. Recommended: run the text encoder via open_clip_torch on demand; it runs once per query and takes ~50 ms on PyTorch+MPS or CPU.

  • Batch=1 only. Two separate constraints combine here:

    • Apple's ct.ImageType is hardcoded to require batch=1 (source: coremltools/converters/mil/backend/backend_helper.py, line ~63; the validator rejects any shape where shape[0] != 1).
    • Switching to ct.TensorType (raw float input) bypasses that validator but caused ct.convert() to hang in Apple's native ANE compiler on the two ViT-B models I tried (MobileCLIP2-B at batch=16 and SigLIP2-B at batch=8; both exceeded a 2-minute timeout). I haven't isolated whether this is a fundamental ANE limit or specific to certain architectures; treat it as "empirically blocked at batch>1 for these ViT-B image encoders, root cause unconfirmed."

    For high-throughput indexing within batch=1, dispatch many images via model.predict([{"image": img1}, {"image": img2}, ...]); coremltools loops over them in C and amortizes the Python ↔ Objective-C overhead.

  • iOS 17+ / macOS 14+. Compiled with minimum_deployment_target=macOS14 to enable modern Core ML features. Older OS versions need re-conversion via convert_to_coreml.py.
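The 393 MB figure for the text encoder's token embedding (first limitation above) can be checked with simple arithmetic. This sketch assumes "256K" means 256,000 vocabulary entries and that the text embedding width is 768; both values are assumptions for illustration, not read from the checkpoint.

```python
VOCAB_SIZE = 256_000   # assumption: "256K" read as 256,000 entries
EMBED_DIM = 768        # assumption: text embedding width of the base model
BYTES_PER_FP16 = 2

# One fp16 value per (token, dimension) pair.
embedding_bytes = VOCAB_SIZE * EMBED_DIM * BYTES_PER_FP16
embedding_mb = embedding_bytes / 1e6
print(f"token embedding: {embedding_mb:.0f} MB")  # 393 MB

FULL_TEXT_ENCODER_MB = 565  # approximate total quoted in the limitation above
print(f"everything else: {FULL_TEXT_ENCODER_MB - embedding_mb:.0f} MB")  # 172 MB
```

Under these assumptions the embedding table accounts for roughly 70% of the text encoder, which is why palettizing the transformer layers alone would not make it small enough to bundle.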

Pairing with a text encoder

The standard hybrid pattern: Core ML image encoder for the heavy per-photo work, PyTorch text encoder for one-shot per query.

import open_clip, torch
import coremltools as ct
import numpy as np
from PIL import Image

# Image side: Core ML on ANE
img_model = ct.models.MLModel("ViT-B-16-SigLIP2_image_8bit.mlpackage",
                              compute_units=ct.ComputeUnit.CPU_AND_NE)
img = Image.open("cat.jpg").convert("RGB").resize((224, 224), Image.BICUBIC)
img_emb = next(iter(img_model.predict({"image": img}).values()))[0]
img_emb /= np.linalg.norm(img_emb)

# Text side: PyTorch on CPU/MPS, once per query
pt_model, _, _ = open_clip.create_model_and_transforms(
    "ViT-B-16-SigLIP2", pretrained="webli")
tokenizer = open_clip.get_tokenizer("ViT-B-16-SigLIP2")
pt_model.eval()

with torch.no_grad():
    tokens = tokenizer(["a photo of a cat", "a photo of a dog"])
    txt_emb = pt_model.encode_text(tokens)
    txt_emb = txt_emb / txt_emb.norm(dim=-1, keepdim=True)
txt_emb = txt_emb.numpy()

similarities = txt_emb @ img_emb        # shape (2,)
# similarities[i] = cosine(prompt_i, image). Higher = better match.
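In a search setting you typically score one text query against many stored image embeddings and sort the results. A dependency-free sketch with made-up 3-dimensional unit vectors standing in for real 768-dimensional embeddings; the file names and the `rank_images` helper are hypothetical, and with NumPy the scoring step would be a single matrix-vector product.

```python
def rank_images(query_emb, image_embs, top_k=3):
    """Return (name, score) pairs sorted by descending dot product.

    Assumes every embedding is already L2-normalized, so the dot
    product equals cosine similarity.
    """
    scores = {
        name: sum(q * x for q, x in zip(query_emb, emb))
        for name, emb in image_embs.items()
    }
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)[:top_k]

# Toy index: unit-length vectors keyed by (hypothetical) file name.
image_embs = {
    "cat.jpg": [0.6, 0.8, 0.0],
    "dog.jpg": [0.0, 0.6, 0.8],
    "car.jpg": [1.0, 0.0, 0.0],
}
query_emb = [0.0, 1.0, 0.0]  # stand-in for a normalized text embedding

for name, score in rank_images(query_emb, image_embs, top_k=2):
    print(f"{name}: {score:.2f}")
# cat.jpg: 0.80
# dog.jpg: 0.60
```

Storing the image embeddings once and re-running only the text encoder per query is what makes the hybrid pattern cheap at search time.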

How this was made

The convert_to_coreml.py script in this repo reproduces the image variants; convert_text_encoder.py does the (large) text encoder.

pip install coremltools open_clip_torch torch torchvision pillow numpy transformers

# fp16 only
python convert_to_coreml.py ViT-B-16-SigLIP2

# fp16 + 8-bit (run again with --palettize 6 for the 6-bit variant)
python convert_to_coreml.py ViT-B-16-SigLIP2 --palettize 8

The image converter:

  1. Loads PyTorch weights via open_clip.create_model_and_transforms
  2. Wraps with an L2-normalize head so output is search-ready
  3. Traces with torch.jit.trace at the model's expected input shape (224×224)
  4. Reads mean/std from the preprocess transform → derives Core ML scale/bias. This step is load-bearing for SigLIP2 specifically because its preprocess uses Normalize(mean=0.5, std=0.5) (mapping pixels to [-1, 1]); hardcoding the more common [0, 1] mapping silently degrades cosine vs PyTorch by ~0.024. The effect is model-specific: for a model with no Normalize transform, like MobileCLIP2-B, the hardcoded [0, 1] mapping happens to be correct.
  5. Converts via ct.convert with compute_units=CPU_AND_NE
  6. Optionally palettizes via coremltools.optimize.coreml.palettize_weights
  7. Verifies cosine vs PyTorch + benchmarks ANE throughput
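Step 4 can be sketched as follows. PyTorch's Normalize computes (x/255 − mean)/std on an 8-bit pixel value x, while a Core ML image input applies scale · x + bias, so scale = 1/(255 · std) and bias = −mean/std. `coreml_scale_bias` and `apply` are illustrative helpers, not the code in convert_to_coreml.py; the resulting values are what one would pass as the `scale` and `bias` arguments of `ct.ImageType`.

```python
def coreml_scale_bias(mean, std):
    """Fold ToTensor (x/255) and Normalize((x - mean)/std) into scale*x + bias."""
    scale = 1.0 / (255.0 * std)
    bias = -mean / std
    return scale, bias

def apply(pixel, scale, bias):
    """What the Core ML image input computes for one 8-bit pixel value."""
    return pixel * scale + bias

# SigLIP2 preprocessing: Normalize(mean=0.5, std=0.5) maps pixels to [-1, 1].
scale, bias = coreml_scale_bias(mean=0.5, std=0.5)
print(round(apply(0, scale, bias), 6), round(apply(255, scale, bias), 6))

# Hardcoded [0, 1] mapping (scale = 1/255, bias = 0): wrong range for SigLIP2.
wrong_scale, wrong_bias = 1.0 / 255.0, 0.0
print(round(apply(0, wrong_scale, wrong_bias), 6),
      round(apply(255, wrong_scale, wrong_bias), 6))
```

With the derived values, black and white pixels map to −1 and 1; with the hardcoded mapping they land on 0 and 1, so the network sees inputs shifted and compressed relative to what it was trained on.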

Conversion timings: ~15s for fp16, +20s for 6-bit palettization, +50s for 8-bit (palettization scales with model depth and bit precision).

Attribution

  • Weights: SigLIP2 by Google (paper). The webli pretrained tag in open_clip corresponds to Google's release trained on the WebLI dataset. Model loaded via the mlfoundations/open_clip community library; open_clip is mlfoundations' work, not Google's or Apple's.
  • Conversion tooling: coremltools (Apple), open_clip_torch (mlfoundations).
  • Pattern reference: apple/coreml-mobileclip established the convention of Core ML CLIP-family image encoders shipped alongside PyTorch text encoders.

License

Apache 2.0, matching the upstream SigLIP2 weights' license. See LICENSE.