batmac committed on
Commit 6333bc0 · verified
1 Parent(s): 424bd46

Upload folder using huggingface_hub

Files changed (1)
  1. README.md +105 -78
README.md CHANGED
@@ -22,10 +22,17 @@ pipeline_tag: zero-shot-image-classification
22
  # ViT-B-16-SigLIP2: Image Encoder, Apple Core ML
23
 
24
  Apple Neural Engine (ANE) acceleration for SigLIP2's image branch on M-series
25
- Macs and modern iPhones/iPads.
26
- **Bit-perfect to PyTorch at fp16** (cosine 1.0000 vs `open_clip` reference).
27
- Three precision tiers ship in this repo so you can pick the right size/quality
28
- trade-off for your use case.
29
 
30
  ## Quick start
31
 
@@ -33,21 +40,49 @@ trade-off for your use case.
33
  import coremltools as ct
34
  from PIL import Image
35
 
36
- # Load any of the three variants; see "Available variants" below
37
  model = ct.models.MLModel(
38
  "ViT-B-16-SigLIP2_image_8bit.mlpackage",
39
  compute_units=ct.ComputeUnit.CPU_AND_NE, # ANE-accelerated
40
  )
41
 
42
- img = Image.open("photo.jpg").convert("RGB").resize((224, 224))
 
 
43
  out = model.predict({"image": img})
44
  embedding = out["embedding"][0] # (768,) float32, L2-normalized
45
  ```
46
 
47
  The model **outputs L2-normalized embeddings** so cosine similarity is just a
48
- dot product. **Input preprocessing** (resize 224×224, [-1, 1] normalization)
49
- is baked into the Core ML graph; pass any RGB PIL image at 224×224 and you're
50
- done.
51
 
52
  ## Available variants
53
 
@@ -59,13 +94,13 @@ done.
59
 
60
  Cosine measured on a synthetic test image; rankings on real photo collections
61
  are essentially identical between fp16 and 8-bit. 6-bit shows minor reordering
62
- of close-scoring results.
 
63
 
64
  ## Performance (Apple M-series, CPU+ANE)
65
 
66
- Benchmarked on a single-image-at-a-time basis (Apple's published Core ML
67
- packages are batch=1 by design; the ANE compiler doesn't support batch>1
68
- for ViT-B-shaped models, regardless of architecture).
69
 
70
  | Variant | Throughput | vs PyTorch+MPS baseline |
71
  |---|---|---|
@@ -74,23 +109,30 @@ for ViT-B-shaped models, regardless of architecture).
74
  | 6-bit Core ML | ~139 img/s | **2.7× faster** |
75
  | (PyTorch+MPS reference) | ~51 img/s | 1.0× |
76
 
77
- Real-world end-to-end with disk loading + parallel PIL preprocessing:
78
- ~110 img/s for the 8-bit variant. PIL decode becomes the bottleneck before
79
- ANE does, so palettization throughput differences vanish in practice.
80
 
81
  ## Limitations
82
 
83
- - **Image branch only: text encoder ships separately or use PyTorch.**
84
- The text encoder *does* convert (with a small workaround: replace
85
- `nn.MultiheadAttention` with manual matmul attention before tracing, since
86
- coremltools doesn't yet support PyTorch's fused `_native_multi_head_attention`),
87
- achieving 0.999996 cosine vs PyTorch. **However**, SigLIP2 uses Gemma2's
88
- 256K-token vocabulary; the token embedding table alone is 393 MB at fp16,
89
- pushing the full text encoder to ~565 MB. That's prohibitive for shipping
90
- alongside the image encoder for most use cases. Recommended workflow:
91
- run the text encoder via [`open_clip_torch`](https://github.com/mlfoundations/open_clip)
92
- on demand (~50ms one-time per query). The reproduction script in this repo
93
- can build the text encoder if you specifically need it.
94
 
95
  - **Batch=1 only.** Two separate constraints are conflated here:
96
  - Apple's `ct.ImageType` is hardcoded to require batch=1
@@ -102,19 +144,20 @@ ANE does, so palettization throughput differences vanish in practice.
102
  SigLIP2-B at batch=8; both hung past 2-minute timeouts). I haven't
103
  isolated whether this is a fundamental ANE limit or specific to
104
  certain architectures; treat as "empirically blocked at batch>1 for
105
- ViT-B image encoders, root cause unconfirmed."
106
 
107
  For high-throughput indexing within batch=1, dispatch many images via
108
  `model.predict([{"image": img1}, {"image": img2}, ...])`; coremltools
109
  loops them in C and amortizes the Python ↔ Objective-C overhead.
110
 
111
  - **iOS 17+ / macOS 14+.** Compiled with `minimum_deployment_target=macOS14`
112
- to enable modern Core ML features. Older OS versions need re-conversion.
 
113
 
114
- ## Using the text encoder
115
 
116
- Until coremltools fixes the SigLIP2 text-conversion bug, pair this Core ML
117
- image encoder with PyTorch text encoding:
118
 
119
  ```python
120
  import open_clip, torch
@@ -125,13 +168,13 @@ from PIL import Image
125
  # Image side: Core ML on ANE
126
  img_model = ct.models.MLModel("ViT-B-16-SigLIP2_image_8bit.mlpackage",
127
  compute_units=ct.ComputeUnit.CPU_AND_NE)
128
- img_emb = next(iter(img_model.predict({
129
- "image": Image.open("cat.jpg").convert("RGB").resize((224, 224))
130
- }).values()))[0]
131
  img_emb /= np.linalg.norm(img_emb)
132
 
133
- # Text side: PyTorch on CPU/MPS (~50ms one-time per query)
134
- pt_model, _, _ = open_clip.create_model_and_transforms("ViT-B-16-SigLIP2", pretrained="webli")
 
135
  tokenizer = open_clip.get_tokenizer("ViT-B-16-SigLIP2")
136
  pt_model.eval()
137
 
@@ -141,71 +184,55 @@ with torch.no_grad():
141
  txt_emb = txt_emb / txt_emb.norm(dim=-1, keepdim=True)
142
  txt_emb = txt_emb.numpy()
143
 
144
- similarities = txt_emb @ img_emb
145
- # similarities[0] = cat score; similarities[1] = dog score
146
  ```
147
 
148
  ## How this was made
149
 
150
- The `convert_to_coreml.py` in this repo reproduces all three variants from
151
- the upstream `ViT-B-16-SigLIP2 (webli)` PyTorch weights:
152
 
153
  ```bash
154
  pip install coremltools open_clip_torch torch torchvision pillow numpy transformers
 
 
 
 
 
155
  python convert_to_coreml.py ViT-B-16-SigLIP2 --palettize 8
156
- # fp16 + 8-bit variants in current dir; rerun with --palettize 6 for that one
157
  ```
158
 
159
- The script:
160
  1. Loads PyTorch weights via `open_clip.create_model_and_transforms`
161
  2. Wraps with an L2-normalize head so output is search-ready
162
  3. Traces with `torch.jit.trace` at the model's expected input shape (224×224)
163
  4. Reads `mean`/`std` from the preprocess transform → derives Core ML
164
- `scale`/`bias` (this is the step Apple's docs don't emphasize; getting it
165
- wrong silently degrades cosine similarity by ~0.024)
 
 
 
166
  5. Converts via `ct.convert` with `compute_units=CPU_AND_NE`
167
  6. Optionally palettizes via `coremltools.optimize.coreml.palettize_weights`
168
  7. Verifies cosine vs PyTorch + benchmarks ANE throughput
169
 
170
- Conversion takes ~15s for fp16, +20s for palettization.
 
171
 
172
  ## Attribution
173
 
174
- - **Weights**: SigLIP2 by Google
175
- ([apple's open_clip integration](https://github.com/mlfoundations/open_clip),
176
- [paper](https://arxiv.org/abs/2502.14786)). `dfndr2b` pretrained tag is
177
- Google's release of SigLIP2 trained on the DFN-2B dataset.
178
- - **Conversion**: `coremltools` (Apple), `open_clip_torch` (mlfoundations)
179
- - **Inspiration for the conversion pattern**:
 
180
  [`apple/coreml-mobileclip`](https://huggingface.co/apple/coreml-mobileclip)
181
- established the convention of Core ML SigLIP-family image encoders shipped
182
  alongside PyTorch text encoders.
183
 
184
  ## License
185
 
186
  Apache 2.0, matching the upstream SigLIP2 weights' license. See `LICENSE`.
187
-
188
- ## Verifying the conversion yourself
189
-
190
- ```python
191
- import open_clip, torch, numpy as np, coremltools as ct
192
- from PIL import Image, ImageDraw
193
-
194
- pt_model, _, pt_pre = open_clip.create_model_and_transforms("ViT-B-16-SigLIP2", pretrained="webli")
195
- pt_model.eval()
196
- img = Image.new("RGB", (224, 224), (40, 40, 40))
197
- ImageDraw.Draw(img).ellipse([40, 40, 184, 184], fill=(0, 255, 0))
198
-
199
- # PyTorch reference
200
- with torch.no_grad():
201
- pt_emb = pt_model.visual(pt_pre(img).unsqueeze(0))
202
- pt_emb = (pt_emb / pt_emb.norm(dim=-1, keepdim=True))[0].numpy()
203
-
204
- # Core ML
205
- cm = ct.models.MLModel("ViT-B-16-SigLIP2_image_fp16.mlpackage",
206
- compute_units=ct.ComputeUnit.CPU_AND_NE)
207
- cm_emb = next(iter(cm.predict({"image": img}).values()))[0].astype(np.float32)
208
- cm_emb /= np.linalg.norm(cm_emb)
209
-
210
- print(f"cosine: {float(np.dot(pt_emb, cm_emb)):.6f}") # → 1.000000 for fp16
211
- ```
 
22
  # ViT-B-16-SigLIP2: Image Encoder, Apple Core ML
23
 
24
  Apple Neural Engine (ANE) acceleration for SigLIP2's image branch on M-series
25
+ Macs and iOS 17+ devices. Apple's [`apple/coreml-mobileclip`](https://huggingface.co/apple/coreml-mobileclip)
26
+ ships the older MobileCLIP family in Core ML form but stops there; this repo
27
+ fills the gap for the higher-accuracy SigLIP2.
28
+
29
+ **Bit-perfect to PyTorch at fp16** (cosine 1.0000 vs `open_clip` reference,
30
+ verified end-to-end; see "Verifying the conversion yourself" below). Three
31
+ precision tiers ship in this repo so you can pick the right size/quality
32
+ trade-off.
33
+
34
+ All `.mlpackage` files are stored via Git LFS; clone with `git lfs install &&
35
+ git clone …` or download individual files via the HF Hub web UI.
36
 
37
  ## Quick start
38
 
 
40
  import coremltools as ct
41
  from PIL import Image
42
 
 
43
  model = ct.models.MLModel(
44
  "ViT-B-16-SigLIP2_image_8bit.mlpackage",
45
  compute_units=ct.ComputeUnit.CPU_AND_NE, # ANE-accelerated
46
  )
47
 
48
+ # Resize to 224×224 with a quality resampler. Core ML's image input bakes in
49
+ # the [-1, 1] normalization but does NOT do its own resize.
50
+ img = Image.open("photo.jpg").convert("RGB").resize((224, 224), Image.BICUBIC)
51
  out = model.predict({"image": img})
52
  embedding = out["embedding"][0] # (768,) float32, L2-normalized
53
  ```
54
 
55
  The model **outputs L2-normalized embeddings** so cosine similarity is just a
56
+ dot product. **Channel normalization** (the [-1, 1] mapping SigLIP2 expects)
57
+ is baked into the Core ML graph; you only need to deliver a 224×224 RGB image.
58
+
59
+ ## Verifying the conversion yourself
60
+
61
+ Don't trust the cosine claims; reproduce them in 30 seconds:
62
+
63
+ ```python
64
+ import open_clip, torch, numpy as np, coremltools as ct
65
+ from PIL import Image, ImageDraw
66
+
67
+ pt_model, _, pt_pre = open_clip.create_model_and_transforms(
68
+ "ViT-B-16-SigLIP2", pretrained="webli")
69
+ pt_model.eval()
70
+ img = Image.new("RGB", (224, 224), (40, 40, 40))
71
+ ImageDraw.Draw(img).ellipse([40, 40, 184, 184], fill=(0, 255, 0))
72
+
73
+ # PyTorch reference
74
+ with torch.no_grad():
75
+ pt_emb = pt_model.visual(pt_pre(img).unsqueeze(0))
76
+ pt_emb = (pt_emb / pt_emb.norm(dim=-1, keepdim=True))[0].numpy()
77
+
78
+ # Core ML
79
+ cm = ct.models.MLModel("ViT-B-16-SigLIP2_image_fp16.mlpackage",
80
+ compute_units=ct.ComputeUnit.CPU_AND_NE)
81
+ cm_emb = next(iter(cm.predict({"image": img}).values()))[0].astype(np.float32)
82
+ cm_emb /= np.linalg.norm(cm_emb)
83
+
84
+ print(f"cosine: {float(np.dot(pt_emb, cm_emb)):.6f}") # → 1.000000 for fp16
85
+ ```
86
 
87
  ## Available variants
88
 
 
94
 
95
  Cosine measured on a synthetic test image; rankings on real photo collections
96
  are essentially identical between fp16 and 8-bit. 6-bit shows minor reordering
97
+ of close-scoring results. 4-bit was tested (0.79 cosine) and deliberately
98
+ excluded as too lossy for retrieval ranking.
99
 
100
  ## Performance (Apple M-series, CPU+ANE)
101
 
102
+ Measured on a single image at a time; Apple's published Core ML packages
103
+ (including this one) are compiled at batch=1; see "Limitations" for why.
 
104
 
105
  | Variant | Throughput | vs PyTorch+MPS baseline |
106
  |---|---|---|
 
109
  | 6-bit Core ML | ~139 img/s | **2.7× faster** |
110
  | (PyTorch+MPS reference) | ~51 img/s | 1.0× |
111
 
112
+ End-to-end with disk loading + parallel PIL preprocessing, the throughput
113
+ caps at ~110 img/s for the 8-bit variant; PIL decode becomes the
114
+ bottleneck before ANE does, which is why palettization throughput differences
115
+ vanish in practice.
116
+
117
+ **Expected ANE utilization is low (~5%)** when you profile this in Instruments.
118
+ That's not a bug: the per-call dispatch overhead (Python ↔ Objective-C ↔
119
+ Apple's compute scheduler) dominates the actual ANE compute time per inference.
120
+ Throughput improves anyway because the dispatch + ANE pipeline keeps moving.
121
+ The fix would be a batched model, which Apple's stack doesn't currently support
122
+ for ViT-B at runtime (see Limitations).
123
 
124
  ## Limitations
125
 
126
+ - **Image branch only.** The text encoder *can* be converted using the
127
+ `convert_text_encoder.py` script in this repo (it works around coremltools'
128
+ lack of converter for PyTorch's fused `_native_multi_head_attention` by
129
+ replacing `nn.MultiheadAttention` with manual matmul attention before
130
+ tracing; output matches PyTorch at cosine 0.999996). **However**,
131
+ SigLIP2 uses Gemma2's 256K-token vocabulary; the token embedding alone is
132
+ 393 MB at fp16, pushing the full text encoder to ~565 MB. Too large to
133
+ bundle here. Recommended: run the text encoder via
134
+ [`open_clip_torch`](https://github.com/mlfoundations/open_clip) on demand;
135
+ it's a one-shot per query, ~50ms on PyTorch+MPS or CPU.
 
136
 
137
  - **Batch=1 only.** Two separate constraints are conflated here:
138
  - Apple's `ct.ImageType` is hardcoded to require batch=1
 
144
  SigLIP2-B at batch=8; both hung past 2-minute timeouts). I haven't
145
  isolated whether this is a fundamental ANE limit or specific to
146
  certain architectures; treat as "empirically blocked at batch>1 for
147
+ these ViT-B image encoders, root cause unconfirmed."
148
 
149
  For high-throughput indexing within batch=1, dispatch many images via
150
  `model.predict([{"image": img1}, {"image": img2}, ...])`; coremltools
151
  loops them in C and amortizes the Python ↔ Objective-C overhead.
152
 
153
  - **iOS 17+ / macOS 14+.** Compiled with `minimum_deployment_target=macOS14`
154
+ to enable modern Core ML features. Older OS versions need re-conversion
155
+ via `convert_to_coreml.py`.
156
 
157
+ ## Pairing with a text encoder
158
 
159
+ The standard hybrid pattern: Core ML image encoder for the heavy per-photo
160
+ work, PyTorch text encoder for one-shot per query.
161
 
162
  ```python
163
  import open_clip, torch
 
168
  # Image side: Core ML on ANE
169
  img_model = ct.models.MLModel("ViT-B-16-SigLIP2_image_8bit.mlpackage",
170
  compute_units=ct.ComputeUnit.CPU_AND_NE)
171
+ img = Image.open("cat.jpg").convert("RGB").resize((224, 224), Image.BICUBIC)
172
+ img_emb = next(iter(img_model.predict({"image": img}).values()))[0]
 
173
  img_emb /= np.linalg.norm(img_emb)
174
 
175
+ # Text side: PyTorch on CPU/MPS, one-time per query
176
+ pt_model, _, _ = open_clip.create_model_and_transforms(
177
+ "ViT-B-16-SigLIP2", pretrained="webli")
178
  tokenizer = open_clip.get_tokenizer("ViT-B-16-SigLIP2")
179
  pt_model.eval()
180
 
 
184
  txt_emb = txt_emb / txt_emb.norm(dim=-1, keepdim=True)
185
  txt_emb = txt_emb.numpy()
186
 
187
+ similarities = txt_emb @ img_emb # shape (2,)
188
+ # similarities[i] = cosine(prompt_i, image). Higher = better match.
189
  ```
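
The same dot-product scoring extends from one image to a whole photo index. The following is a minimal sketch, not part of the repo: it uses random 768-dimensional vectors as stand-ins for real embeddings (an assumption for illustration), and the helper name `l2_normalize` is hypothetical. It shows that, once both sides are L2-normalized, ranking by cosine similarity reduces to a single matrix-vector product.

```python
import numpy as np

def l2_normalize(x):
    """Scale each vector (last axis) to unit length."""
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

# Stand-in data: 5 "image" embeddings and 1 "text" embedding, 768-dim each.
# In practice these would come from the Core ML image encoder and the
# PyTorch text encoder respectively.
rng = np.random.default_rng(0)
image_embs = l2_normalize(rng.standard_normal((5, 768)).astype(np.float32))
text_emb = l2_normalize(rng.standard_normal(768).astype(np.float32))

# For unit vectors, the dot product is the cosine similarity.
scores = image_embs @ text_emb      # shape (5,)
ranking = np.argsort(-scores)       # image indices, best match first
```

Because the image embeddings are computed once at indexing time, only the text side has to run per query; the scoring itself is a cheap matrix product.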
190
 
191
  ## How this was made
192
 
193
+ The `convert_to_coreml.py` script in this repo reproduces the image variants;
194
+ `convert_text_encoder.py` does the (large) text encoder.
195
 
196
  ```bash
197
  pip install coremltools open_clip_torch torch torchvision pillow numpy transformers
198
+
199
+ # fp16 only
200
+ python convert_to_coreml.py ViT-B-16-SigLIP2
201
+
202
+ # fp16 + 8-bit (run once with --palettize 6 separately for the 6-bit variant)
203
  python convert_to_coreml.py ViT-B-16-SigLIP2 --palettize 8
 
204
  ```
205
 
206
+ The image converter:
207
  1. Loads PyTorch weights via `open_clip.create_model_and_transforms`
208
  2. Wraps with an L2-normalize head so output is search-ready
209
  3. Traces with `torch.jit.trace` at the model's expected input shape (224×224)
210
  4. Reads `mean`/`std` from the preprocess transform → derives Core ML
211
+ `scale`/`bias`. **This step is load-bearing for SigLIP2 specifically**
212
+ because its preprocess uses `Normalize(mean=0.5, std=0.5)` (mapping pixels
213
+ to [-1, 1]); hardcoding the more common [0, 1] mapping silently degrades
214
+ cosine vs PyTorch by ~0.024 (model-specific: for a model with no Normalize
215
+ transform like MobileCLIP2-B, getting it wrong is a no-op).
216
  5. Converts via `ct.convert` with `compute_units=CPU_AND_NE`
217
  6. Optionally palettizes via `coremltools.optimize.coreml.palettize_weights`
218
  7. Verifies cosine vs PyTorch + benchmarks ANE throughput
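
Step 4 can be made concrete with a little arithmetic. The sketch below is an illustration under stated assumptions, not the code from `convert_to_coreml.py`: it assumes the Core ML image input applies `y = scale * pixel + bias` to 0..255 pixel values with a single scalar `scale` and a per-channel `bias`, and that the PyTorch preprocess is `ToTensor` (divide by 255) followed by `Normalize(mean, std)`. The function name `coreml_scale_bias` is hypothetical.

```python
def coreml_scale_bias(mean, std):
    """Convert Normalize(mean, std) on [0, 1] pixels into an affine
    transform y = scale * pixel + bias on [0, 255] pixels.

    (pixel / 255 - mean) / std  ==  pixel * 1/(255 * std)  -  mean / std
    """
    if len(set(std)) != 1:
        # A single scalar scale can only represent identical per-channel std.
        raise ValueError("per-channel std values differ; scalar scale is not exact")
    scale = 1.0 / (255.0 * std[0])
    bias = [-m / std[0] for m in mean]
    return scale, bias

# SigLIP2's preprocess per the text above: Normalize(mean=0.5, std=0.5)
scale, bias = coreml_scale_bias(mean=(0.5, 0.5, 0.5), std=(0.5, 0.5, 0.5))

# Sanity check: pixel 0 maps to -1 and pixel 255 maps to +1.
lowest = scale * 0 + bias[0]
highest = scale * 255 + bias[0]
```

With these values `scale` is 2/255 and every `bias` entry is -1, which is the [-1, 1] mapping; using `scale = 1/255` with zero bias instead would feed the model [0, 1] inputs.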
219
 
220
+ Conversion timings: ~15s for fp16, +20s for 6-bit palettization, +50s for
221
+ 8-bit (palettization scales with model depth and bit precision).
222
 
223
  ## Attribution
224
 
225
+ - **Weights**: SigLIP2 by Google ([paper](https://arxiv.org/abs/2502.14786)).
226
+ The `webli` pretrained tag in `open_clip` corresponds to Google's release
227
+ trained on the WebLI dataset. Model loaded via the
228
+ [`mlfoundations/open_clip`](https://github.com/mlfoundations/open_clip)
229
+ community library; `open_clip` is `mlfoundations`' work, not Google's or Apple's.
230
+ - **Conversion tooling**: `coremltools` (Apple), `open_clip_torch` (mlfoundations).
231
+ - **Pattern reference**:
232
  [`apple/coreml-mobileclip`](https://huggingface.co/apple/coreml-mobileclip)
233
+ established the convention of Core ML CLIP-family image encoders shipped
234
  alongside PyTorch text encoders.
235
 
236
  ## License
237
 
238
  Apache 2.0, matching the upstream SigLIP2 weights' license. See `LICENSE`.