ViT-B-16-SigLIP2 β€” Image Encoder, Apple Core ML

Apple Neural Engine (ANE) acceleration for SigLIP2's image branch on M-series Macs and iOS 17+ devices. Apple's apple/coreml-mobileclip ships the older MobileCLIP family in Core ML form but stops there; this repo fills the gap for the higher-accuracy SigLIP2.

Numerically faithful to PyTorch at fp16 (cosine 1.0000 vs the open_clip reference, verified end-to-end; see "Verifying the conversion yourself" below). Three precision tiers ship in this repo so you can pick the right size/quality trade-off.

All .mlpackage files are stored via Git LFS β€” clone with git lfs install && git clone … or download individual files via the HF Hub web UI.
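
Individual variants can also be fetched programmatically. One caveat worth knowing: an .mlpackage is a directory bundle, not a single file. Below is a sketch using huggingface_hub's snapshot_download with a glob pattern (the choice of the 8-bit variant here is arbitrary):

from huggingface_hub import snapshot_download

# .mlpackage bundles are directories, so use snapshot_download with a
# glob rather than hf_hub_download (which fetches single files only).
local_dir = snapshot_download(
    "batmac/ViT-B-16-SigLIP2-Image-CoreML",
    allow_patterns=["ViT-B-16-SigLIP2_image_8bit.mlpackage/*"],
)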

Quick start

import coremltools as ct
from PIL import Image

model = ct.models.MLModel(
    "ViT-B-16-SigLIP2_image_8bit.mlpackage",
    compute_units=ct.ComputeUnit.CPU_AND_NE,  # ANE-accelerated
)

# Resize to 224Γ—224 with a quality resampler. Core ML's image input bakes in
# the [-1, 1] normalization but does NOT do its own resize.
img = Image.open("photo.jpg").convert("RGB").resize((224, 224), Image.BICUBIC)
out = model.predict({"image": img})
embedding = out["embedding"][0]   # (768,) float32, L2-normalized

The model outputs L2-normalized embeddings so cosine similarity is just a dot product. Channel normalization (the [-1, 1] mapping SigLIP2 expects) is baked into the Core ML graph; you only need to deliver a 224Γ—224 RGB image.
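
Since every embedding comes out unit-length, ranking a set of images against any query embedding reduces to a single matrix-vector product. A minimal sketch reusing the model loaded in the quick start (the filenames are placeholders):

import numpy as np
from PIL import Image

def embed(path):
    img = Image.open(path).convert("RGB").resize((224, 224), Image.BICUBIC)
    return model.predict({"image": img})["embedding"][0]

index = np.stack([embed(p) for p in ["beach.jpg", "office.jpg"]])  # (N, 768)
query = embed("sunset.jpg")                                        # (768,)
scores = index @ query        # cosine similarities, since everything is unit-length
print(int(scores.argmax()))   # position of the best-matching indexed image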

Verifying the conversion yourself

Don't trust the cosine claims β€” reproduce them in 30 seconds:

import open_clip, torch, numpy as np, coremltools as ct
from PIL import Image, ImageDraw

pt_model, _, pt_pre = open_clip.create_model_and_transforms(
    "ViT-B-16-SigLIP2", pretrained="webli")
pt_model.eval()
img = Image.new("RGB", (224, 224), (40, 40, 40))
ImageDraw.Draw(img).ellipse([40, 40, 184, 184], fill=(0, 255, 0))

# PyTorch reference
with torch.no_grad():
    pt_emb = pt_model.visual(pt_pre(img).unsqueeze(0))
    pt_emb = (pt_emb / pt_emb.norm(dim=-1, keepdim=True))[0].numpy()

# Core ML
cm = ct.models.MLModel("ViT-B-16-SigLIP2_image_fp16.mlpackage",
                       compute_units=ct.ComputeUnit.CPU_AND_NE)
cm_emb = next(iter(cm.predict({"image": img}).values()))[0].astype(np.float32)
cm_emb /= np.linalg.norm(cm_emb)

print(f"cosine: {float(np.dot(pt_emb, cm_emb)):.6f}")  # β†’ 1.000000 for fp16

Available variants

File                                    Size    Cosine vs PyTorch   Best for
ViT-B-16-SigLIP2_image_fp16.mlpackage   185 MB  1.0000              Reference / benchmark reproduction
ViT-B-16-SigLIP2_image_8bit.mlpackage    93 MB  0.9942              Default: near-perfect, 2× smaller
ViT-B-16-SigLIP2_image_6bit.mlpackage    70 MB  0.9629              When download size matters more than ranking precision

Cosine measured on a synthetic test image; rankings on real photo collections are essentially identical between fp16 and 8-bit. 6-bit shows minor reordering of close-scoring results. 4-bit was tested (0.79 cosine) and deliberately excluded β€” too lossy for retrieval ranking.

Performance (Apple M-series, CPU+ANE)

Measured one image at a time; like Apple's published Core ML packages, the ones in this repo are compiled at batch=1 (see "Limitations" for why).

Variant                   Throughput    vs PyTorch+MPS baseline
fp16 Core ML              ~133 img/s    2.6× faster
8-bit Core ML             ~138 img/s    2.7× faster
6-bit Core ML             ~139 img/s    2.7× faster
PyTorch+MPS (reference)   ~51 img/s     1.0×

End-to-end, with disk loading and parallel PIL preprocessing, throughput tops out around ~110 img/s for the 8-bit variant: PIL decode becomes the bottleneck before the ANE does, which is why the palettization throughput differences vanish in practice.
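
To see where that ceiling comes from, here is a sketch of an indexing loop, using the list-of-dicts predict form described under "Limitations" below (paths and worker count are placeholders):

from concurrent.futures import ThreadPoolExecutor

import coremltools as ct
from PIL import Image

model = ct.models.MLModel("ViT-B-16-SigLIP2_image_8bit.mlpackage",
                          compute_units=ct.ComputeUnit.CPU_AND_NE)

def preprocess(path):
    # PIL decode + bicubic resize: this is the stage that saturates first
    img = Image.open(path).convert("RGB").resize((224, 224), Image.BICUBIC)
    return {"image": img}

paths = ["photo_%04d.jpg" % i for i in range(1000)]  # placeholder paths
with ThreadPoolExecutor(max_workers=8) as pool:      # parallel PIL decode
    inputs = list(pool.map(preprocess, paths))

# One predict call over a list of feature dicts; coremltools loops in C,
# so the Python <-> Objective-C dispatch cost is amortized across images.
outputs = model.predict(inputs)
embeddings = [out["embedding"][0] for out in outputs]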

ANE utilization looks low (~5%) when you profile this in Instruments. That's not a bug: the per-call dispatch overhead (Python ↔ Objective-C ↔ Apple's compute scheduler) dominates the actual ANE compute time per inference. Throughput improves anyway because the dispatch + ANE pipeline keeps moving. The real fix would be a batched model, which the conversion stack currently blocks for these ViT-B encoders (see Limitations).

Limitations

  • Image branch only. The text encoder can be converted with the convert_text_encoder.py script in this repo: it works around coremltools' missing converter for PyTorch's fused _native_multi_head_attention op by replacing nn.MultiheadAttention with manual matmul attention before tracing, and the result matches PyTorch at cosine 0.999996 (a sketch of the swap appears after this list). However, SigLIP2 uses Gemma2's 256K-token vocabulary; the token embedding alone is 393 MB at fp16, pushing the full text encoder to ~565 MB, too large to bundle here. Recommended: run the text encoder via open_clip_torch on demand; it's a one-shot per query, ~50 ms on PyTorch+MPS or CPU.

  • Batch=1 only. Two separate constraints combine here:

    • Apple's ct.ImageType is hardcoded to require batch=1 (source: coremltools/converters/mil/backend/backend_helper.py, line ~63 β€” the validator rejects any shape where shape[0] != 1).
    • Switching to ct.TensorType (raw float input) bypasses that validator but caused ct.convert() to stall indefinitely in Apple's native ANE compiler on the two ViT-B models I tried (MobileCLIP2-B at batch=16, SigLIP2-B at batch=8 β€” both hung past 2-minute timeouts). I haven't isolated whether this is a fundamental ANE limit or specific to certain architectures; treat as "empirically blocked at batch>1 for these ViT-B image encoders, root cause unconfirmed."

    For high-throughput indexing within batch=1, dispatch many images via model.predict([{"image": img1}, {"image": img2}, ...]); coremltools loops them in C and amortizes the Python ↔ Objective-C overhead (see the preprocessing sketch in the Performance section above).

  • iOS 17+ / macOS 14+. Compiled with minimum_deployment_target=ct.target.macOS14 to enable modern Core ML features. Older OS versions need re-conversion via convert_to_coreml.py.
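
For reference, the attention swap mentioned under "Image branch only" above follows a standard module-replacement pattern. A minimal sketch of the idea, assuming batch-first self-attention as in open_clip's towers; it reuses the original projection weights, but is an illustration, not the exact code in convert_text_encoder.py:

import torch
import torch.nn as nn
import torch.nn.functional as F

class ManualAttention(nn.Module):
    """Drop-in for nn.MultiheadAttention (self-attention, batch_first) that
    reuses its weights but avoids the fused _native_multi_head_attention op."""
    def __init__(self, mha: nn.MultiheadAttention):
        super().__init__()
        self.num_heads = mha.num_heads
        self.head_dim = mha.head_dim
        self.in_proj_weight = mha.in_proj_weight
        self.in_proj_bias = mha.in_proj_bias
        self.out_proj = mha.out_proj

    def forward(self, q, k, v, need_weights=False, attn_mask=None):
        B, N, C = q.shape
        qkv = F.linear(q, self.in_proj_weight, self.in_proj_bias)
        q, k, v = (t.view(B, N, self.num_heads, self.head_dim).transpose(1, 2)
                   for t in qkv.chunk(3, dim=-1))
        attn = (q @ k.transpose(-2, -1)) * self.head_dim ** -0.5
        if attn_mask is not None:     # additive float mask, as open_clip uses
            attn = attn + attn_mask
        attn = attn.softmax(dim=-1)
        out = (attn @ v).transpose(1, 2).reshape(B, N, C)
        return self.out_proj(out), None   # match nn.MultiheadAttention's tuple

def swap_attention(module: nn.Module):
    # Recursively replace every nn.MultiheadAttention before torch.jit.trace
    for name, child in module.named_children():
        if isinstance(child, nn.MultiheadAttention):
            setattr(module, name, ManualAttention(child))
        else:
            swap_attention(child)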

Pairing with a text encoder

The standard hybrid pattern: Core ML image encoder for the heavy per-photo work, PyTorch text encoder for one-shot per query.

import open_clip, torch
import coremltools as ct
import numpy as np
from PIL import Image

# Image side: Core ML on ANE
img_model = ct.models.MLModel("ViT-B-16-SigLIP2_image_8bit.mlpackage",
                              compute_units=ct.ComputeUnit.CPU_AND_NE)
img = Image.open("cat.jpg").convert("RGB").resize((224, 224), Image.BICUBIC)
img_emb = next(iter(img_model.predict({"image": img}).values()))[0]
img_emb /= np.linalg.norm(img_emb)

# Text side: PyTorch on CPU/MPS β€” one-time per query
pt_model, _, _ = open_clip.create_model_and_transforms(
    "ViT-B-16-SigLIP2", pretrained="webli")
tokenizer = open_clip.get_tokenizer("ViT-B-16-SigLIP2")
pt_model.eval()

with torch.no_grad():
    tokens = tokenizer(["a photo of a cat", "a photo of a dog"])
    txt_emb = pt_model.encode_text(tokens)
    txt_emb = txt_emb / txt_emb.norm(dim=-1, keepdim=True)
txt_emb = txt_emb.numpy()

similarities = txt_emb @ img_emb        # shape (2,)
# similarities[i] = cosine(prompt_i, image). Higher = better match.

How this was made

The convert_to_coreml.py script in this repo reproduces the image variants; convert_text_encoder.py does the (large) text encoder.

pip install coremltools open_clip_torch torch torchvision pillow numpy transformers

# fp16 only
python convert_to_coreml.py ViT-B-16-SigLIP2

# fp16 + 8-bit  (run once with --palettize 6 separately for the 6-bit variant)
python convert_to_coreml.py ViT-B-16-SigLIP2 --palettize 8

The image converter:

  1. Loads PyTorch weights via open_clip.create_model_and_transforms
  2. Wraps with an L2-normalize head so output is search-ready
  3. Traces with torch.jit.trace at the model's expected input shape (224Γ—224)
  4. Reads mean/std from the preprocess transform and derives the Core ML scale/bias from them (see the sketch after this list). This step is load-bearing for SigLIP2 specifically: its preprocess uses Normalize(mean=0.5, std=0.5), mapping pixels to [-1, 1], and hardcoding the more common [0, 1] mapping silently degrades cosine vs PyTorch by ~0.024. (The damage is model-specific; for a model with no Normalize transform, like MobileCLIP2-B, getting it wrong is a no-op.)
  5. Converts via ct.convert with compute_units=CPU_AND_NE
  6. Optionally palettizes via coremltools.optimize.coreml.palettize_weights
  7. Verifies cosine vs PyTorch + benchmarks ANE throughput
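
A condensed sketch of steps 1-6, assuming the visual tower traces cleanly at 224×224 (the real script adds the verification and benchmarking of step 7 on top):

import coremltools as ct
import open_clip
import torch

class ImageTower(torch.nn.Module):
    # Step 2: wrap with an L2-normalize head so the output is search-ready
    def __init__(self, clip_model):
        super().__init__()
        self.visual = clip_model.visual
    def forward(self, x):
        emb = self.visual(x)
        return emb / emb.norm(dim=-1, keepdim=True)

# Step 1: load the PyTorch weights
model, _, _ = open_clip.create_model_and_transforms(
    "ViT-B-16-SigLIP2", pretrained="webli")

# Step 3: trace at the expected 224x224 input shape
traced = torch.jit.trace(ImageTower(model).eval(), torch.rand(1, 3, 224, 224))

# Step 4: Core ML applies  out = scale*pixel + bias  with pixel in [0, 255];
# torchvision's Normalize applies  (pixel/255 - mean) / std.  Equating:
mean, std = 0.5, 0.5             # SigLIP2's Normalize(mean=0.5, std=0.5)
scale = 1.0 / (255.0 * std)      # = 1/127.5
bias = [-mean / std] * 3         # = -1.0 per channel -> pixels land in [-1, 1]

# Step 5: convert
mlmodel = ct.convert(
    traced,
    inputs=[ct.ImageType(name="image", shape=(1, 3, 224, 224),
                         scale=scale, bias=bias)],
    outputs=[ct.TensorType(name="embedding")],
    compute_units=ct.ComputeUnit.CPU_AND_NE,
    minimum_deployment_target=ct.target.macOS14,
)
mlmodel.save("ViT-B-16-SigLIP2_image_fp16.mlpackage")

# Step 6 (optional): palettize weights to 8-bit k-means lookup tables
from coremltools.optimize.coreml import (
    OpPalettizerConfig, OptimizationConfig, palettize_weights)
cfg = OptimizationConfig(global_config=OpPalettizerConfig(mode="kmeans", nbits=8))
palettize_weights(mlmodel, cfg).save("ViT-B-16-SigLIP2_image_8bit.mlpackage")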

Conversion timings: ~15s for fp16, +20s for 6-bit palettization, +50s for 8-bit (palettization scales with model depth and bit precision).

Attribution

  • Weights: SigLIP2 by Google (paper). The webli pretrained tag in open_clip corresponds to Google's release trained on the WebLI dataset. Model loaded via the mlfoundations/open_clip community library β€” open_clip is mlfoundations' work, not Google's or Apple's.
  • Conversion tooling: coremltools (Apple), open_clip_torch (mlfoundations).
  • Pattern reference: apple/coreml-mobileclip established the convention of Core ML CLIP-family image encoders shipped alongside PyTorch text encoders.

License

Apache 2.0, matching the upstream SigLIP2 weights' license. See LICENSE.
