	---
	license: apache-2.0
	library_name: coreml
	base_model: google/siglip2-base-patch16-224
	tags:
	- siglip2
	- siglip
	- core-ml
	- coreml
	- apple-silicon
	- ane
	- neural-engine
	- image-encoder
	- image-embedding
	- semantic-search
	- on-device
	language:
	- en
	pipeline_tag: zero-shot-image-classification
	---

	# ViT-B-16-SigLIP2 — Image Encoder, Apple Core ML

	Apple Neural Engine (ANE) acceleration for SigLIP2's image branch on M-series
	Macs and iOS 17+ devices. Apple's [`apple/coreml-mobileclip`](https://huggingface.co/apple/coreml-mobileclip)
	ships the older MobileCLIP family in Core ML form but stops there; this repo
	fills the gap for the higher-accuracy SigLIP2.

	Matches PyTorch at fp16 to four decimal places (cosine 1.0000 vs the
	`open_clip` reference, verified end-to-end; see "Verifying the conversion
	yourself" below). Three precision tiers ship in this repo so you can pick
	the right size/quality trade-off.

	All `.mlpackage` files are stored via Git LFS — clone with `git lfs install &&
	git clone …` or download individual files via the HF Hub web UI.

	## Quick start

	```python
	import coremltools as ct
	from PIL import Image

	model = ct.models.MLModel(
	    "ViT-B-16-SigLIP2_image_8bit.mlpackage",
	    compute_units=ct.ComputeUnit.CPU_AND_NE,  # ANE-accelerated
	)

	# Resize to 224×224 with a quality resampler. Core ML's image input bakes in
	# the [-1, 1] normalization but does NOT do its own resize.
	img = Image.open("photo.jpg").convert("RGB").resize((224, 224), Image.BICUBIC)
	out = model.predict({"image": img})
	embedding = out["embedding"][0] # (768,) float32, L2-normalized
	```

	The model outputs L2-normalized embeddings so cosine similarity is just a
	dot product. Channel normalization (the [-1, 1] mapping SigLIP2 expects)
	is baked into the Core ML graph; you only need to deliver a 224×224 RGB image.
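
To make the dot-product claim concrete, here is a small self-contained check in plain Python. The vectors are toy stand-ins rather than real model outputs, and the helper names are invented for illustration.

```python
import math

def l2_normalize(v):
    """Scale a vector to unit length (the model already does this to its output)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def dot(a, b):
    """Plain dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def cosine(a, b):
    """Textbook cosine similarity with explicit norms."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))

# Toy 3-d vectors standing in for the 768-d embeddings.
a = l2_normalize([3.0, 4.0, 0.0])
b = l2_normalize([4.0, 3.0, 0.0])

# For unit-length inputs the two quantities agree up to float rounding.
assert math.isclose(dot(a, b), cosine(a, b), rel_tol=1e-9)
print(round(dot(a, b), 4))
```

In practice this means a search index only needs to store the embeddings as they come out of the model; no per-query normalization of the image side is required.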

	## Verifying the conversion yourself

	Don't trust the cosine claims — reproduce them in 30 seconds:

	```python
	import open_clip, torch, numpy as np, coremltools as ct
	from PIL import Image, ImageDraw

	pt_model, _, pt_pre = open_clip.create_model_and_transforms(
	    "ViT-B-16-SigLIP2", pretrained="webli")
	pt_model.eval()
	img = Image.new("RGB", (224, 224), (40, 40, 40))
	ImageDraw.Draw(img).ellipse([40, 40, 184, 184], fill=(0, 255, 0))

	# PyTorch reference
	with torch.no_grad():
	    pt_emb = pt_model.visual(pt_pre(img).unsqueeze(0))
	    pt_emb = (pt_emb / pt_emb.norm(dim=-1, keepdim=True))[0].numpy()

	# Core ML
	cm = ct.models.MLModel("ViT-B-16-SigLIP2_image_fp16.mlpackage",
	                       compute_units=ct.ComputeUnit.CPU_AND_NE)
	cm_emb = next(iter(cm.predict({"image": img}).values()))[0].astype(np.float32)
	cm_emb /= np.linalg.norm(cm_emb)

	print(f"cosine: {float(np.dot(pt_emb, cm_emb)):.6f}") # → 1.000000 for fp16
	```

	## Available variants

	| File | Size | Cosine vs PyTorch | Best for |
	|---|---|---|---|
	| `ViT-B-16-SigLIP2_image_fp16.mlpackage` | 185 MB | 1.0000 | Reference / benchmark reproduction |
	| `ViT-B-16-SigLIP2_image_8bit.mlpackage` | 93 MB | 0.9942 | Default: near-perfect, 2× smaller |
	| `ViT-B-16-SigLIP2_image_6bit.mlpackage` | 70 MB | 0.9629 | When download size matters more than ranking precision |

	Cosine measured on a synthetic test image; rankings on real photo collections
	are essentially identical between fp16 and 8-bit. 6-bit shows minor reordering
	of close-scoring results. 4-bit was tested (0.79 cosine) and deliberately
	excluded — too lossy for retrieval ranking.
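
The file sizes in the table are roughly what you would expect if weights dominate the package and palettization replaces each 16-bit weight with an n-bit index. A rough sanity check under that assumption (it ignores lookup tables and metadata, and `estimate_size_mb` is a name made up here):

```python
FP16_SIZE_MB = 185  # fp16 package size from the variants table

def estimate_size_mb(bits: int, fp16_size_mb: float = FP16_SIZE_MB) -> float:
    """Estimate the package size if every fp16 weight became a `bits`-bit index."""
    return fp16_size_mb * bits / 16

for bits in (8, 6, 4):
    print(f"{bits}-bit: ~{estimate_size_mb(bits):.1f} MB")
```

This gives about 92.5 MB and 69.4 MB for 8 and 6 bits, close to the 93 MB and 70 MB listed; the small remainder is plausibly lookup tables and metadata.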

	## Performance (Apple M-series, CPU+ANE)

	Measured one image at a time. Like Apple's published Core ML packages, the
	models in this repo are compiled at batch=1; see "Limitations" for why.

	| Variant | Throughput | vs PyTorch+MPS baseline |
	|---|---|---|
	| fp16 Core ML | ~133 img/s | 2.6× faster |
	| 8-bit Core ML | ~138 img/s | 2.7× faster |
	| 6-bit Core ML | ~139 img/s | 2.7× faster |
	| (PyTorch+MPS reference) | ~51 img/s | 1.0× |

	End-to-end, with disk loading and parallel PIL preprocessing, throughput
	caps at ~110 img/s for the 8-bit variant. PIL decode becomes the
	bottleneck before the ANE does, which is why the throughput differences
	between palettization levels vanish in practice.
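
The benchmark in `convert_to_coreml.py` is the reference for the numbers above. As an illustration only, a batch=1 throughput measurement could look like the sketch below; `measure_throughput` and `fake_predict` are hypothetical names, and the stand-in workload exists just so the snippet runs without Core ML.

```python
import time

def measure_throughput(predict_fn, inputs, warmup=3):
    """Return items/second when calling predict_fn once per input (batch=1)."""
    # Warm-up calls are excluded from timing so one-off setup cost doesn't skew it.
    for x in inputs[:warmup]:
        predict_fn(x)
    start = time.perf_counter()
    for x in inputs:
        predict_fn(x)
    elapsed = time.perf_counter() - start
    return len(inputs) / elapsed

# Stand-in workload; for a real measurement swap in something like
# `lambda img: model.predict({"image": img})` with preloaded PIL images.
def fake_predict(x):
    return sum(i * i for i in range(1000))

rate = measure_throughput(fake_predict, list(range(200)))
print(f"{rate:.0f} items/s")
```

Preloading the inputs before timing keeps disk reads and PIL decode out of the measurement, which is the difference between the table figures and the end-to-end figure.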

	Expect low ANE utilization (~5%) when you profile this in Instruments.
	That's not a bug: the per-call dispatch overhead (Python ↔ Objective-C ↔
	Apple's compute scheduler) dominates the actual ANE compute time per inference.
	Throughput still beats PyTorch+MPS because the dispatch + ANE pipeline keeps
	moving. The fix would be a batched model, which I couldn't produce with
	Apple's stack for these ViT-B encoders (see Limitations).

	## Limitations

	- Image branch only. The text encoder can be converted using the
	`convert_text_encoder.py` script in this repo. It works around coremltools'
	lack of a converter for PyTorch's fused `_native_multi_head_attention` by
	replacing `nn.MultiheadAttention` with manual matmul attention before
	tracing, and its output matches PyTorch at cosine 0.999996. However,
	SigLIP2 uses Gemma2's 256K-token vocabulary; the token embedding alone is
	393 MB at fp16, pushing the full text encoder to ~565 MB. Too large to
	bundle here. Recommended: run the text encoder via
	[`open_clip_torch`](https://github.com/mlfoundations/open_clip) on demand.
	It runs once per query, ~50ms on PyTorch+MPS or CPU.

	- Batch=1 only. Two separate constraints combine here:
	  - Apple's `ct.ImageType` is hardcoded to require batch=1
	    ([source: `coremltools/converters/mil/backend/backend_helper.py`](https://github.com/apple/coremltools/blob/main/coremltools/converters/mil/backend/backend_helper.py),
	    line ~63; the validator rejects any shape where `shape[0] != 1`).
	  - Switching to `ct.TensorType` (raw float input) bypasses that validator
	    but caused `ct.convert()` to hang in Apple's native ANE
	    compiler on the two ViT-B models I tried (MobileCLIP2-B at batch=16,
	    SigLIP2-B at batch=8; both hung past 2-minute timeouts). I haven't
	    isolated whether this is a fundamental ANE limit or specific to
	    certain architectures; treat as "empirically blocked at batch>1 for
	    these ViT-B image encoders, root cause unconfirmed."

	  For high-throughput indexing within batch=1, dispatch many images via
	  `model.predict([{"image": img1}, {"image": img2}, ...])`; coremltools
	  loops over them in C and amortizes the Python ↔ Objective-C overhead.

	- iOS 17+ / macOS 14+. Compiled with `minimum_deployment_target=macOS14`
	to enable modern Core ML features. Older OS versions need re-conversion
	via `convert_to_coreml.py`.
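
For the list-based `predict` call mentioned under "Batch=1 only", a large photo collection is easier to handle in chunks than as one giant list. The sketch below assumes `predict` accepts a list of input dicts and returns a list of output dicts in the same order; `embed_in_chunks`, `fake_predict`, and the chunk size of 64 are illustrative choices, not part of this repo.

```python
from itertools import islice

def embed_in_chunks(predict, images, chunk_size=64):
    """Yield one output dict per image, calling `predict` on lists of inputs."""
    it = iter(images)
    while True:
        chunk = list(islice(it, chunk_size))
        if not chunk:
            break
        # One call per chunk: a list of {"image": ...} dicts in, a list of dicts out.
        yield from predict([{"image": img} for img in chunk])

# Stand-in for `model.predict` so the sketch runs without Core ML.
def fake_predict(batch):
    return [{"embedding": [float(item["image"])]} for item in batch]

outputs = list(embed_in_chunks(fake_predict, range(5), chunk_size=2))
print(len(outputs))
```

With a real model you would pass `model.predict` as `predict` and an iterable of 224×224 PIL images as `images`; chunking bounds how many decoded images sit in memory at once.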

	## Pairing with a text encoder

	The standard hybrid pattern: Core ML image encoder for the heavy per-photo
	work, PyTorch text encoder for one-shot per query.

	```python
	import open_clip, torch
	import coremltools as ct
	import numpy as np
	from PIL import Image

	# Image side: Core ML on ANE
	img_model = ct.models.MLModel("ViT-B-16-SigLIP2_image_8bit.mlpackage",
	                              compute_units=ct.ComputeUnit.CPU_AND_NE)
	img = Image.open("cat.jpg").convert("RGB").resize((224, 224), Image.BICUBIC)
	img_emb = next(iter(img_model.predict({"image": img}).values()))[0]
	img_emb /= np.linalg.norm(img_emb)

	# Text side: PyTorch on CPU/MPS — one-time per query
	pt_model, _, _ = open_clip.create_model_and_transforms(
	    "ViT-B-16-SigLIP2", pretrained="webli")
	tokenizer = open_clip.get_tokenizer("ViT-B-16-SigLIP2")
	pt_model.eval()

	with torch.no_grad():
	    tokens = tokenizer(["a photo of a cat", "a photo of a dog"])
	    txt_emb = pt_model.encode_text(tokens)
	    txt_emb = txt_emb / txt_emb.norm(dim=-1, keepdim=True)
	txt_emb = txt_emb.numpy()

	similarities = txt_emb @ img_emb # shape (2,)
	# similarities[i] = cosine(prompt_i, image). Higher = better match.
	```
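
The same pattern extends from one image to a whole library: stack the stored image embeddings into a matrix and rank them against a single text embedding. A minimal sketch, using random unit vectors in place of real embeddings; `top_k` is a hypothetical helper, and it assumes both sides are already L2-normalized.

```python
import numpy as np

def top_k(query_emb, gallery_embs, k=3):
    """Return (indices, scores) of the k gallery rows most similar to the query.

    Assumes `query_emb` and every row of `gallery_embs` are L2-normalized,
    so the dot product equals cosine similarity.
    """
    scores = gallery_embs @ query_emb      # shape (N,)
    order = np.argsort(-scores)[:k]        # highest score first
    return order, scores[order]

# Random unit vectors standing in for real image/text embeddings.
rng = np.random.default_rng(0)
gallery = rng.normal(size=(100, 768)).astype(np.float32)
gallery /= np.linalg.norm(gallery, axis=1, keepdims=True)
query = gallery[42]  # pretend the text embedding matches image 42 exactly

idx, scores = top_k(query, gallery)
print(idx[0], float(scores[0]))
```

For a few hundred thousand photos a brute-force matrix product like this is usually fast enough; an approximate nearest-neighbor index only becomes worth it well beyond that.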

	## How this was made

	The `convert_to_coreml.py` script in this repo reproduces the image variants;
	`convert_text_encoder.py` does the (large) text encoder.
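
The text encoder is large mostly because of its token embedding table. The 393 MB figure quoted under Limitations can be checked with back-of-the-envelope arithmetic, assuming the vocabulary is exactly 256,000 tokens and the text embedding width equals the 768-d output width (both are assumptions read off this card, not inspected from the checkpoint):

```python
VOCAB_SIZE = 256_000   # "256K-token" vocabulary, taken as exactly 256,000 here
EMBED_DIM = 768        # assumed equal to the 768-d output embedding width
BYTES_PER_FP16 = 2

embedding_bytes = VOCAB_SIZE * EMBED_DIM * BYTES_PER_FP16
print(f"{embedding_bytes / 1e6:.0f} MB")  # 393 MB
```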

	```bash
	pip install coremltools open_clip_torch torch torchvision pillow numpy transformers

	# fp16 only
	python convert_to_coreml.py ViT-B-16-SigLIP2

	# fp16 + 8-bit (run again with --palettize 6 for the 6-bit variant)
	python convert_to_coreml.py ViT-B-16-SigLIP2 --palettize 8
	```

	The image converter:
	1. Loads PyTorch weights via `open_clip.create_model_and_transforms`
	2. Wraps with an L2-normalize head so output is search-ready
	3. Traces with `torch.jit.trace` at the model's expected input shape (224×224)
	4. Reads `mean`/`std` from the preprocess transform and derives Core ML
	`scale`/`bias` from them. This step matters for SigLIP2 specifically
	because its preprocess uses `Normalize(mean=0.5, std=0.5)` (mapping pixels
	to [-1, 1]); hardcoding the more common [0, 1] mapping silently degrades
	cosine vs PyTorch by ~0.024. The penalty is model-specific: for a model with
	no Normalize transform, like MobileCLIP2-B, the [0, 1] mapping is already correct.
	5. Converts via `ct.convert` with `compute_units=CPU_AND_NE`
	6. Optionally palettizes via `coremltools.optimize.coreml.palettize_weights`
	7. Verifies cosine vs PyTorch + benchmarks ANE throughput
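
Step 4 is the easiest one to get wrong, so here is the arithmetic spelled out. This is a sketch of one way to fold a `Normalize(mean, std)` transform into an image-input `scale`/`bias` pair, not a copy of what `convert_to_coreml.py` does. It assumes the image input applies `scale * pixel + bias` to 0-255 pixel values and that mean and std are the same for all channels; `imagetype_scale_bias` is a made-up name.

```python
def imagetype_scale_bias(mean: float, std: float):
    """Fold Normalize(mean, std) on [0, 1] pixels into y = scale * x + bias on [0, 255] pixels."""
    # (x / 255 - mean) / std  ==  x * (1 / (255 * std))  +  (-mean / std)
    scale = 1.0 / (255.0 * std)
    bias = -mean / std
    return scale, bias

# SigLIP2 preprocessing: Normalize(mean=0.5, std=0.5)
scale, bias = imagetype_scale_bias(mean=0.5, std=0.5)
print(scale, bias)

# Endpoints of the pixel range: black maps to about -1, white to about 1.
print(scale * 0 + bias, scale * 255 + bias)

# The "more common" [0, 1] mapping corresponds to mean=0, std=1.
print(imagetype_scale_bias(mean=0.0, std=1.0))
```

For SigLIP2 this works out to `scale = 1/127.5` and `bias = -1`, whereas the [0, 1] mapping would use `scale = 1/255` and `bias = 0`, which is the mismatch behind the ~0.024 cosine drop described in step 4.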

	Conversion timings: ~15s for fp16, +20s for 6-bit palettization, +50s for
	8-bit (palettization scales with model depth and bit precision).

	## Attribution

	- Weights: SigLIP2 by Google ([paper](https://arxiv.org/abs/2502.14786)).
	The `webli` pretrained tag in `open_clip` corresponds to Google's release
	trained on the WebLI dataset. Model loaded via the
	[`mlfoundations/open_clip`](https://github.com/mlfoundations/open_clip)
	community library — `open_clip` is `mlfoundations`' work, not Google's or Apple's.
	- Conversion tooling: `coremltools` (Apple), `open_clip_torch` (mlfoundations).
	- Pattern reference:
	[`apple/coreml-mobileclip`](https://huggingface.co/apple/coreml-mobileclip)
	established the convention of Core ML CLIP-family image encoders shipped
	alongside PyTorch text encoders.

	## License

	Apache 2.0, matching the upstream SigLIP2 weights' license. See `LICENSE`.