batmac/ViT-B-16-SigLIP2-Image-CoreML

@@ -22,10 +22,17 @@ pipeline_tag: zero-shot-image-classification
 # ViT-B-16-SigLIP2 — Image Encoder, Apple Core ML
 Apple Neural Engine (ANE) acceleration for SigLIP2's image branch on M-series
-Macs and modern iPhones/iPads.
-**Bit-perfect to PyTorch at fp16** (cosine 1.0000 vs `open_clip` reference).
-Three precision tiers ship in this repo so you can pick the right size/quality
-trade-off for your use case.
 ## Quick start
@@ -33,21 +40,49 @@ trade-off for your use case.
 import coremltools as ct
 from PIL import Image
-# Load any of the three variants — see "Available variants" below
 model = ct.models.MLModel(
     "ViT-B-16-SigLIP2_image_8bit.mlpackage",
     compute_units=ct.ComputeUnit.CPU_AND_NE,  # ANE-accelerated
 )
-img = Image.open("photo.jpg").convert("RGB").resize((224, 224))
 out = model.predict({"image": img})
 embedding = out["embedding"][0]   # (768,) float32, L2-normalized
 ```
 The model **outputs L2-normalized embeddings** so cosine similarity is just a
-dot product. **Input preprocessing** (resize 224×224, [-1, 1] normalization)
-is baked into the Core ML graph — pass any RGB PIL image at 224×224 and you're
-done.
 ## Available variants
@@ -59,13 +94,13 @@ done.
 Cosine measured on a synthetic test image; rankings on real photo collections
 are essentially identical between fp16 and 8-bit. 6-bit shows minor reordering
-of close-scoring results.
 ## Performance (Apple M-series, CPU+ANE)
-Benchmarked on a single-image-at-a-time basis (Apple's published Core ML
-packages are batch=1 by design — the ANE compiler doesn't support batch>1
-for ViT-B-shaped models, regardless of architecture).
 | Variant | Throughput | vs PyTorch+MPS baseline |
 |---|---|---|
@@ -74,23 +109,30 @@ for ViT-B-shaped models, regardless of architecture).
 | 6-bit Core ML | ~139 img/s | **2.7× faster** |
 | (PyTorch+MPS reference) | ~51 img/s | 1.0× |
-Real-world end-to-end with disk loading + parallel PIL preprocessing:
-~110 img/s for the 8-bit variant. PIL decode becomes the bottleneck before
-ANE does, so palettization throughput differences vanish in practice.
 ## Limitations
-- **Image branch only — text encoder ships separately or use PyTorch.**
-  The text encoder *does* convert (with a small workaround: replace
-  `nn.MultiheadAttention` with manual matmul attention before tracing, since
-  coremltools doesn't yet support PyTorch's fused `_native_multi_head_attention`),
-  achieving 0.999996 cosine vs PyTorch. **However**, SigLIP2 uses Gemma2's
-  256K-token vocabulary — the token embedding table alone is 393 MB at fp16,
-  pushing the full text encoder to ~565 MB. That's prohibitive for shipping
-  alongside the image encoder for most use cases. Recommended workflow:
-  run the text encoder via [`open_clip_torch`](https://github.com/mlfoundations/open_clip)
-  on demand (~50ms one-time per query). The reproduction script in this repo
-  can build the text encoder if you specifically need it.
 - **Batch=1 only.** Two separate constraints are conflated here:
   - Apple's `ct.ImageType` is hardcoded to require batch=1
@@ -102,19 +144,20 @@ ANE does, so palettization throughput differences vanish in practice.
     SigLIP2-B at batch=8 — both hung past 2-minute timeouts). I haven't
     isolated whether this is a fundamental ANE limit or specific to
     certain architectures; treat as "empirically blocked at batch>1 for
-    ViT-B image encoders, root cause unconfirmed."
   For high-throughput indexing within batch=1, dispatch many images via
   `model.predict([{"image": img1}, {"image": img2}, ...])` — coremltools
   loops them in C and amortizes the Python ↔ Objective-C overhead.
 - **iOS 17+ / macOS 14+.** Compiled with `minimum_deployment_target=macOS14`
-  to enable modern Core ML features. Older OS versions need re-conversion.
-## Using the text encoder
-Until coremltools fixes the SigLIP2 text-conversion bug, pair this Core ML
-image encoder with PyTorch text encoding:
 ```python
 import open_clip, torch
@@ -125,13 +168,13 @@ from PIL import Image
 # Image side: Core ML on ANE
 img_model = ct.models.MLModel("ViT-B-16-SigLIP2_image_8bit.mlpackage",
                               compute_units=ct.ComputeUnit.CPU_AND_NE)
-img_emb = next(iter(img_model.predict({
-    "image": Image.open("cat.jpg").convert("RGB").resize((224, 224))
-}).values()))[0]
 img_emb /= np.linalg.norm(img_emb)
-# Text side: PyTorch on CPU/MPS (~50ms one-time per query)
-pt_model, _, _ = open_clip.create_model_and_transforms("ViT-B-16-SigLIP2", pretrained="webli")
 tokenizer = open_clip.get_tokenizer("ViT-B-16-SigLIP2")
 pt_model.eval()
@@ -141,71 +184,55 @@ with torch.no_grad():
     txt_emb = txt_emb / txt_emb.norm(dim=-1, keepdim=True)
 txt_emb = txt_emb.numpy()
-similarities = txt_emb @ img_emb
-# similarities[0] = cat score; similarities[1] = dog score
 ```
 ## How this was made
-The `convert_to_coreml.py` in this repo reproduces all three variants from
-the upstream `ViT-B-16-SigLIP2 (webli)` PyTorch weights:
 ```bash
 pip install coremltools open_clip_torch torch torchvision pillow numpy transformers
 python convert_to_coreml.py ViT-B-16-SigLIP2 --palettize 8
-# fp16 + 8-bit variants in current dir; rerun with --palettize 6 for that one
 ```
-The script:
 1. Loads PyTorch weights via `open_clip.create_model_and_transforms`
 2. Wraps with an L2-normalize head so output is search-ready
 3. Traces with `torch.jit.trace` at the model's expected input shape (224×224)
 4. Reads `mean`/`std` from the preprocess transform → derives Core ML
-   `scale`/`bias` (this is the step Apple's docs don't emphasize — getting it
-   wrong silently degrades cosine similarity by ~0.024)
 5. Converts via `ct.convert` with `compute_units=CPU_AND_NE`
 6. Optionally palettizes via `coremltools.optimize.coreml.palettize_weights`
 7. Verifies cosine vs PyTorch + benchmarks ANE throughput
-Conversion takes ~15s for fp16, +20s for palettization.
 ## Attribution
-- **Weights**: SigLIP2 by Google
-  ([apple's open_clip integration](https://github.com/mlfoundations/open_clip),
-  [paper](https://arxiv.org/abs/2502.14786)). `dfndr2b` pretrained tag is
-  Google's release of SigLIP2 trained on the DFN-2B dataset.
-- **Conversion**: `coremltools` (Apple), `open_clip_torch` (mlfoundations)
-- **Inspiration for the conversion pattern**:
   [`apple/coreml-mobileclip`](https://huggingface.co/apple/coreml-mobileclip)
-  established the convention of Core ML SigLIP-family image encoders shipped
   alongside PyTorch text encoders.
 ## License
 Apache 2.0, matching the upstream SigLIP2 weights' license. See `LICENSE`.
-## Verifying the conversion yourself
-```python
-import open_clip, torch, numpy as np, coremltools as ct
-from PIL import Image, ImageDraw
-pt_model, _, pt_pre = open_clip.create_model_and_transforms("ViT-B-16-SigLIP2", pretrained="webli")
-pt_model.eval()
-img = Image.new("RGB", (224, 224), (40, 40, 40))
-ImageDraw.Draw(img).ellipse([40, 40, 184, 184], fill=(0, 255, 0))
-# PyTorch reference
-with torch.no_grad():
-    pt_emb = pt_model.visual(pt_pre(img).unsqueeze(0))
-    pt_emb = (pt_emb / pt_emb.norm(dim=-1, keepdim=True))[0].numpy()
-# Core ML
-cm = ct.models.MLModel("ViT-B-16-SigLIP2_image_fp16.mlpackage",
-                       compute_units=ct.ComputeUnit.CPU_AND_NE)
-cm_emb = next(iter(cm.predict({"image": img}).values()))[0].astype(np.float32)
-cm_emb /= np.linalg.norm(cm_emb)
-print(f"cosine: {float(np.dot(pt_emb, cm_emb)):.6f}")  # → 1.000000 for fp16
-```

 # ViT-B-16-SigLIP2 — Image Encoder, Apple Core ML
 Apple Neural Engine (ANE) acceleration for SigLIP2's image branch on M-series
+Macs and iOS 17+ devices. Apple's [`apple/coreml-mobileclip`](https://huggingface.co/apple/coreml-mobileclip)
+ships the older MobileCLIP family in Core ML form but stops there; this repo
+fills the gap for the higher-accuracy SigLIP2.
+**Matches PyTorch at fp16** (cosine 1.0000 vs the `open_clip` reference,
+verified end-to-end; see "Verifying the conversion yourself" below). Three
+precision tiers ship in this repo so you can pick the right size/quality
+trade-off.
+All `.mlpackage` files are stored via Git LFS — clone with `git lfs install &&
+git clone …` or download individual files via the HF Hub web UI.
 ## Quick start
 import coremltools as ct
 from PIL import Image
 model = ct.models.MLModel(
     "ViT-B-16-SigLIP2_image_8bit.mlpackage",
     compute_units=ct.ComputeUnit.CPU_AND_NE,  # ANE-accelerated
 )
+# Resize to 224×224 with a quality resampler. Core ML's image input bakes in
+# the [-1, 1] normalization but does NOT do its own resize.
+img = Image.open("photo.jpg").convert("RGB").resize((224, 224), Image.BICUBIC)
 out = model.predict({"image": img})
 embedding = out["embedding"][0]   # (768,) float32, L2-normalized
 ```
 The model **outputs L2-normalized embeddings** so cosine similarity is just a
+dot product. **Channel normalization** (the [-1, 1] mapping SigLIP2 expects)
+is baked into the Core ML graph; you only need to deliver a 224×224 RGB image.
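As a quick illustration of why unit-length embeddings turn cosine similarity into a plain dot product, here is a minimal sketch. It uses random stand-in vectors rather than real model outputs; the 768-dimensional size mirrors the embedding shape in the Quick start, and the helper name `l2_normalize` is hypothetical.

```python
import numpy as np


def l2_normalize(v: np.ndarray) -> np.ndarray:
    """Scale a vector to unit L2 norm."""
    return v / np.linalg.norm(v)


# Random stand-ins for two embeddings (NOT real model outputs).
rng = np.random.default_rng(0)
a = l2_normalize(rng.standard_normal(768).astype(np.float32))
b = l2_normalize(rng.standard_normal(768).astype(np.float32))

# Full cosine formula vs. plain dot product on unit-length vectors.
cosine = float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))
dot = float(np.dot(a, b))

print(f"cosine={cosine:.6f}  dot={dot:.6f}")
```

The two printed values should agree up to float32 rounding, so a search index over these embeddings can skip the norm computation entirely.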
+## Verifying the conversion yourself
+Don't trust the cosine claims — reproduce them in 30 seconds:
+```python
+import open_clip, torch, numpy as np, coremltools as ct
+from PIL import Image, ImageDraw
+pt_model, _, pt_pre = open_clip.create_model_and_transforms(
+    "ViT-B-16-SigLIP2", pretrained="webli")
+pt_model.eval()
+img = Image.new("RGB", (224, 224), (40, 40, 40))
+ImageDraw.Draw(img).ellipse([40, 40, 184, 184], fill=(0, 255, 0))
+# PyTorch reference
+with torch.no_grad():
+    pt_emb = pt_model.visual(pt_pre(img).unsqueeze(0))
+    pt_emb = (pt_emb / pt_emb.norm(dim=-1, keepdim=True))[0].numpy()
+# Core ML
+cm = ct.models.MLModel("ViT-B-16-SigLIP2_image_fp16.mlpackage",
+                       compute_units=ct.ComputeUnit.CPU_AND_NE)
+cm_emb = next(iter(cm.predict({"image": img}).values()))[0].astype(np.float32)
+cm_emb /= np.linalg.norm(cm_emb)
+print(f"cosine: {float(np.dot(pt_emb, cm_emb)):.6f}")  # → 1.000000 for fp16
+```
 ## Available variants
 Cosine measured on a synthetic test image; rankings on real photo collections
 are essentially identical between fp16 and 8-bit. 6-bit shows minor reordering
+of close-scoring results. 4-bit was tested (0.79 cosine) and deliberately
+excluded — too lossy for retrieval ranking.
 ## Performance (Apple M-series, CPU+ANE)
+Measured one image at a time: this package, like Apple's published Core ML
+packages, is compiled at batch=1 (see "Limitations" for why).
 | Variant | Throughput | vs PyTorch+MPS baseline |
 |---|---|---|
 | 6-bit Core ML | ~139 img/s | **2.7× faster** |
 | (PyTorch+MPS reference) | ~51 img/s | 1.0× |
+End-to-end, with disk loading and parallel PIL preprocessing, throughput
+caps at ~110 img/s for the 8-bit variant. PIL decode becomes the bottleneck
+before the ANE does, which is why the throughput differences between
+palettization levels vanish in practice.
+**Expect low ANE utilization (~5%)** when you profile this in Instruments.
+That's not a bug: per-call dispatch overhead (Python ↔ Objective-C ↔ Apple's
+compute scheduler) dominates the actual ANE compute time per inference.
+Throughput still improves over the baseline because dispatch and ANE
+execution keep the pipeline moving. The fix would be a batched model, which
+empirically does not run for these ViT-B encoders (see Limitations).
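One way to measure single-call throughput figures like the ones above is a simple timing loop. The sketch below is an assumption about how such a measurement could be done, not necessarily the script used for the table; `measure_throughput` and `fake_predict` are hypothetical names, and the stand-in workload replaces the real model so the example runs anywhere.

```python
import time
from typing import Any, Callable, Sequence


def measure_throughput(predict: Callable[[Any], Any],
                       inputs: Sequence[Any],
                       warmup: int = 5) -> float:
    """Return items per second when calling `predict` once per input."""
    # Warm-up calls absorb one-time costs (model load, caches).
    for x in inputs[:warmup]:
        predict(x)
    start = time.perf_counter()
    for x in inputs:
        predict(x)
    elapsed = time.perf_counter() - start
    return len(inputs) / elapsed


# Stand-in workload (hypothetical) that takes at least 1 ms per call.
def fake_predict(x: Any) -> Any:
    time.sleep(0.001)
    return x


rate = measure_throughput(fake_predict, list(range(50)))
print(f"{rate:.0f} items/s")
```

With a loaded model, the same helper could be called as `measure_throughput(lambda img: model.predict({"image": img}), images)`, using pre-resized PIL images so that decode time is excluded.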
 ## Limitations
+- **Image branch only.** The text encoder *can* be converted using the
+  `convert_text_encoder.py` script in this repo. It works around coremltools'
+  lack of a converter for PyTorch's fused `_native_multi_head_attention` by
+  replacing `nn.MultiheadAttention` with manual matmul attention before
+  tracing, and the output matches PyTorch at cosine 0.999996. **However**,
+  SigLIP2 uses Gemma2's 256K-token vocabulary; the token embedding alone is
+  393 MB at fp16, pushing the full text encoder to ~565 MB. That is too large
+  to bundle here. Recommended: run the text encoder via
+  [`open_clip_torch`](https://github.com/mlfoundations/open_clip) on demand;
+  it runs once per query, ~50ms on PyTorch+MPS or CPU.
 - **Batch=1 only.** Two separate constraints are conflated here:
   - Apple's `ct.ImageType` is hardcoded to require batch=1
     SigLIP2-B at batch=8 — both hung past 2-minute timeouts). I haven't
     isolated whether this is a fundamental ANE limit or specific to
     certain architectures; treat as "empirically blocked at batch>1 for
+    these ViT-B image encoders, root cause unconfirmed."
   For high-throughput indexing within batch=1, dispatch many images via
   `model.predict([{"image": img1}, {"image": img2}, ...])` — coremltools
   loops them in C and amortizes the Python ↔ Objective-C overhead.
 - **iOS 17+ / macOS 14+.** Compiled with `minimum_deployment_target=macOS14`
+  to enable modern Core ML features. Older OS versions need re-conversion
+  via `convert_to_coreml.py`.
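The list-dispatch pattern from the batch=1 bullet can be wrapped in a small helper for indexing large photo collections. This is a minimal sketch under stated assumptions: `chunked`, `embed_all`, and `chunk_size` are hypothetical names, the `"image"` / `"embedding"` keys follow the Quick start, and a stand-in function replaces `model.predict` so the example runs without Core ML.

```python
from itertools import islice
from typing import Any, Callable, Dict, Iterable, Iterator, List


def chunked(items: Iterable[Any], size: int) -> Iterator[List[Any]]:
    """Yield successive lists of at most `size` items."""
    it = iter(items)
    while True:
        chunk = list(islice(it, size))
        if not chunk:
            return
        yield chunk


def embed_all(predict: Callable[[List[Dict[str, Any]]], List[Dict[str, Any]]],
              images: Iterable[Any],
              chunk_size: int = 64,
              input_name: str = "image",
              output_name: str = "embedding") -> List[Any]:
    """Send images to `predict` as lists of feature dicts, one list per chunk."""
    embeddings: List[Any] = []
    for chunk in chunked(images, chunk_size):
        results = predict([{input_name: img} for img in chunk])
        # Each result holds a (1, D) array; keep the single row.
        embeddings.extend(r[output_name][0] for r in results)
    return embeddings


# Stand-in for `model.predict` (hypothetical) so the sketch runs anywhere.
def fake_predict(batch: List[Dict[str, Any]]) -> List[Dict[str, Any]]:
    return [{"embedding": [[float(item["image"])]]} for item in batch]


vectors = embed_all(fake_predict, range(5), chunk_size=2)
print(vectors)
```

With a real model, `embed_all(model.predict, images)` would take the place of the stand-in call. Chunking keeps memory bounded, since every image in a chunk has to be decoded before the call.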
+## Pairing with a text encoder
+The standard hybrid pattern: Core ML image encoder for the heavy per-photo
+work, PyTorch text encoder for one-shot per query.
 ```python
 import open_clip, torch
 # Image side: Core ML on ANE
 img_model = ct.models.MLModel("ViT-B-16-SigLIP2_image_8bit.mlpackage",
                               compute_units=ct.ComputeUnit.CPU_AND_NE)
+img = Image.open("cat.jpg").convert("RGB").resize((224, 224), Image.BICUBIC)
+img_emb = next(iter(img_model.predict({"image": img}).values()))[0]
 img_emb /= np.linalg.norm(img_emb)
+# Text side: PyTorch on CPU/MPS — one-time per query
+pt_model, _, _ = open_clip.create_model_and_transforms(
+    "ViT-B-16-SigLIP2", pretrained="webli")
 tokenizer = open_clip.get_tokenizer("ViT-B-16-SigLIP2")
 pt_model.eval()
     txt_emb = txt_emb / txt_emb.norm(dim=-1, keepdim=True)
 txt_emb = txt_emb.numpy()
+similarities = txt_emb @ img_emb        # shape (2,)
+# similarities[i] = cosine(prompt_i, image). Higher = better match.
 ```
 ## How this was made
+The `convert_to_coreml.py` script in this repo reproduces the image variants;
+`convert_text_encoder.py` converts the (large) text encoder.
 ```bash
 pip install coremltools open_clip_torch torch torchvision pillow numpy transformers
+# fp16 only
+python convert_to_coreml.py ViT-B-16-SigLIP2
+# fp16 + 8-bit  (run once with --palettize 6 separately for the 6-bit variant)
 python convert_to_coreml.py ViT-B-16-SigLIP2 --palettize 8
 ```
+The image converter:
 1. Loads PyTorch weights via `open_clip.create_model_and_transforms`
 2. Wraps with an L2-normalize head so output is search-ready
 3. Traces with `torch.jit.trace` at the model's expected input shape (224×224)
 4. Reads `mean`/`std` from the preprocess transform → derives Core ML
+   `scale`/`bias`. **This step is load-bearing for SigLIP2 specifically**
+   because its preprocess uses `Normalize(mean=0.5, std=0.5)` (mapping pixels
+   to [-1, 1]); hardcoding the more common [0, 1] mapping silently degrades
+   cosine vs PyTorch by ~0.024. The penalty is model-specific: for a model
+   with no Normalize transform, like MobileCLIP2-B, the [0, 1] mapping is
+   already correct, so hardcoding it is harmless.
 5. Converts via `ct.convert` with `compute_units=CPU_AND_NE`
 6. Optionally palettizes via `coremltools.optimize.coreml.palettize_weights`
 7. Verifies cosine vs PyTorch + benchmarks ANE throughput
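The `mean`/`std` to `scale`/`bias` derivation in step 4 can be sketched as follows. This assumes the Core ML image input applies `y = scale * pixel + bias` to pixels in [0, 255] with a single scalar `scale`, so all channels must share one `std`; the function name `imagetype_scale_bias` is hypothetical and is not taken from `convert_to_coreml.py`.

```python
from typing import List, Sequence, Tuple


def imagetype_scale_bias(mean: Sequence[float],
                         std: Sequence[float]) -> Tuple[float, List[float]]:
    """Fold Normalize(mean, std) on [0, 1] pixels into scale/bias on [0, 255].

    (pixel / 255 - mean) / std  ==  pixel * 1 / (255 * std)  -  mean / std
    """
    if len(set(std)) != 1:
        raise ValueError("a scalar scale needs the same std on every channel")
    scale = 1.0 / (255.0 * std[0])
    bias = [-m / std[0] for m in mean]
    return scale, bias


# SigLIP2-style preprocessing: Normalize(mean=0.5, std=0.5) per channel.
scale, bias = imagetype_scale_bias([0.5, 0.5, 0.5], [0.5, 0.5, 0.5])

# Endpoint check: pixel 0 should map to -1 and pixel 255 to +1.
lo = scale * 0 + bias[0]
hi = scale * 255 + bias[0]

# The "common" [0, 1] mapping, for comparison: scale = 1/255, bias = 0.
wrong_hi = (1.0 / 255.0) * 255 + 0.0

print(scale, bias, lo, hi, wrong_hi)
```

The resulting values are what one would pass as the `scale` and `bias` arguments of `ct.ImageType`. The comparison line shows that the hardcoded [0, 1] mapping feeds the network a different input range than the one it was trained on.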
+Conversion timings: ~15s for fp16, plus 20s for 6-bit palettization or 50s
+for 8-bit (palettization time scales with model depth and bit precision).
 ## Attribution
+- **Weights**: SigLIP2 by Google ([paper](https://arxiv.org/abs/2502.14786)).
+  The `webli` pretrained tag in `open_clip` corresponds to Google's release
+  trained on the WebLI dataset. Model loaded via the
+  [`mlfoundations/open_clip`](https://github.com/mlfoundations/open_clip)
+  community library — `open_clip` is `mlfoundations`' work, not Google's or Apple's.
+- **Conversion tooling**: `coremltools` (Apple), `open_clip_torch` (mlfoundations).
+- **Pattern reference**:
   [`apple/coreml-mobileclip`](https://huggingface.co/apple/coreml-mobileclip)
+  established the convention of Core ML CLIP-family image encoders shipped
   alongside PyTorch text encoders.
 ## License
 Apache 2.0, matching the upstream SigLIP2 weights' license. See `LICENSE`.