Decluttr AI · CLIP ViT-B/32 · Core ML
On-device text↔image search for the Decluttr AI iOS app.
This is a Core ML port of OpenAI's CLIP ViT-B/32 (image and text encoders), exported in FP16. It is designed to be fetched on demand by an iOS client and executed entirely on the device's Neural Engine.
The app ships without these weights so the App Store binary stays small (around 86 MB). When a user enables Visual Search inside the app, this repository is fetched, integrity-checked, compiled to .mlmodelc, and run locally. Nothing leaves the phone.
Files
| Path | Bytes | Purpose |
|---|---|---|
| CLIPImage.mlpackage/Data/com.apple.CoreML/weights/weight.bin | 175,712,384 | FP16 image-encoder weights |
| CLIPImage.mlpackage/Data/com.apple.CoreML/model.mlmodel | 136,412 | Image-encoder graph |
| CLIPImage.mlpackage/Manifest.json | 617 | Core ML package manifest |
| CLIPText.mlpackage/Data/com.apple.CoreML/weights/weight.bin | 126,878,848 | FP16 text-encoder weights |
| CLIPText.mlpackage/Data/com.apple.CoreML/model.mlmodel | 171,503 | Text-encoder graph |
| CLIPText.mlpackage/Manifest.json | 617 | Core ML package manifest |
| clip_tokenizer/vocab.json | 961,143 | BPE token to id map |
| clip_tokenizer/merges.txt | 524,619 | BPE merge rules |
Total: 304,386,143 bytes (about 304 MB).
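A fetched file can be checked against the byte counts in the table before it is compiled. The following is a minimal, language-neutral sketch in Python, not the app's Swift implementation. `sha256_of` and `verify_file` are hypothetical names, and this card does not publish SHA-256 digests, so the digest argument stands in for a value obtained from a source you trust.

```python
import hashlib
import os


def sha256_of(path, chunk_size=1 << 20):
    """Stream the file through SHA-256 so large weight files never sit fully in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_file(path, expected_bytes, expected_sha256=None):
    """Return True when the size matches and, if a digest is given, the hash matches too."""
    if os.path.getsize(path) != expected_bytes:
        return False
    if expected_sha256 is None:
        return True
    return sha256_of(path) == expected_sha256.lower()
```

For example, `verify_file("CLIPImage.mlpackage/Data/com.apple.CoreML/weights/weight.bin", 175_712_384)` checks the image-encoder weights by size alone. Checking the size first is cheap and rejects truncated downloads before any hashing is done.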
Encoder I/O contract
Both encoders are L2-normalised at the output, so cosine similarity reduces to a dot product.
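To illustrate why the dot product is sufficient, here is a small Python sketch that ranks images against a text query. The 3-dimensional vectors are toy stand-ins for the real 512-dimensional embeddings, and all function names are hypothetical.

```python
import math


def l2_normalise(vector):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def cosine(a, b):
    """Textbook cosine similarity, for comparison with the plain dot product."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))


def rank_images(text_embedding, image_embeddings, top_k=3):
    """Return (index, score) pairs for the top_k images, best match first."""
    scores = [(i, dot(text_embedding, e)) for i, e in enumerate(image_embeddings)]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:top_k]


# Toy 3-dimensional stand-ins for the real 512-dimensional embeddings.
query = l2_normalise([1.0, 2.0, 0.5])
images = [l2_normalise(v) for v in ([1.0, 2.0, 0.4], [-1.0, 0.0, 3.0], [0.5, 1.0, 2.0])]
ranking = rank_images(query, images)
```

Because every vector already has unit length, `dot` and `cosine` agree up to floating-point error, so a search index only needs the cheaper dot product.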
Image encoder · CLIPImage.mlpackage
| Field | Value |
|---|---|
| Input | image |
| Type | RGB CGImage, 224×224, 0–1 normalised pixels |
| Output | embedding |
| Shape | MLMultiArray<Float16> (1, 512) |
The model bakes in CLIP's (mean, std) normalisation internally, so the Swift caller passes raw 0–1 pixels.
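For reference, the baked-in step is the usual per-channel `(x - mean) / std` transform. The Python sketch below shows the arithmetic with the statistics published alongside OpenAI CLIP; that the conversion script uses exactly these constants is an assumption, and `normalise_pixel` is a hypothetical name. Callers of the Core ML model should not apply this themselves.

```python
# CLIP's published per-channel statistics (R, G, B) for inputs scaled to 0-1.
CLIP_MEAN = (0.48145466, 0.4578275, 0.40821073)
CLIP_STD = (0.26862954, 0.26130258, 0.27577711)


def normalise_pixel(rgb):
    """Map one 0-1 RGB pixel to the value range the vision tower was trained on."""
    return tuple((c - m) / s for c, m, s in zip(rgb, CLIP_MEAN, CLIP_STD))


# A mid-grey pixel lands slightly above zero in every channel.
mid_grey = normalise_pixel((0.5, 0.5, 0.5))
```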
Text encoder · CLIPText.mlpackage
| Field | Value |
|---|---|
| Input | input_ids |
| Type | MLMultiArray<Int32> (1, 77), BPE-tokenized, padded with 0 |
| Output | embedding |
| Shape | MLMultiArray<Float16> (1, 512) |
Tokenization is not baked into the Core ML graph. The consuming app runs a Swift BPE that loads vocab.json and merges.txt.
| Special token | ID |
|---|---|
| <\|startoftext\|> | 49406 |
| <\|endoftext\|> | 49407 |
| pad | 0 |
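The layout of `input_ids` can be sketched in Python as follows. The special-token IDs and the length of 77 come from the tables above; the truncation policy (dropping tokens that do not fit) is an assumption, as are the function name and the sample token IDs.

```python
START_OF_TEXT = 49406
END_OF_TEXT = 49407
PAD = 0
CONTEXT_LENGTH = 77


def build_input_ids(token_ids):
    """Wrap BPE ids in start/end markers, truncate to fit, and right-pad with 0."""
    body = list(token_ids)[: CONTEXT_LENGTH - 2]  # leave room for both markers
    ids = [START_OF_TEXT] + body + [END_OF_TEXT]
    return ids + [PAD] * (CONTEXT_LENGTH - len(ids))


# Hypothetical ids standing in for the output of a real BPE tokenizer.
input_ids = build_input_ids([320, 1125, 2368])
```

The resulting list always has 77 entries, which map one-to-one onto the `(1, 77)` Int32 array the text encoder expects.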
Usage (Swift, iOS 17+)
import CoreML
let imageURL = bundleOrAppSupport.appendingPathComponent("CLIPImage.mlmodelc")
let textURL = bundleOrAppSupport.appendingPathComponent("CLIPText.mlmodelc")
let imageModel = try MLModel(contentsOf: imageURL)
let textModel = try MLModel(contentsOf: textURL)
Image: pass an RGB CVPixelBuffer (224×224, 0–1) as feature image. The output is MLMultiArray<Float16> of length 512, which you can cast to [Float].
Text: run your BPE, fill MLMultiArray<Int32>(shape: [1, 77]), and pass as feature input_ids. Pad with 0. The first token must be 49406 and the last non-pad token 49407.
For a full reference implementation (download manager, SHA-256 verification, on-device MLModel.compileModel step, Swift BPE tokenizer), see the Decluttr AI source files:
- Core/ML/CLIPModelStore.swift
- Core/ML/CLIPSearchService.swift
- Core/ML/CLIPTokenizer.swift
How this was built
pip install torch open-clip-torch coremltools "numpy<2.0"
python scripts/convert_clip_to_coreml.py
Pipeline:
- Load open_clip.create_model_and_transforms("ViT-B-32", pretrained="openai") for the original OpenAI CLIP weights.
- Wrap the vision tower with a thin module that bakes in CLIP's (mean, std) image normalisation so the Swift caller can pass raw 0–1 pixels.
- Trace each encoder with torch.jit.trace(..., strict=False, check_trace=False).
- Convert the image encoder via coremltools.convert (direct PyTorch path, FP16, target iOS17).
- Convert the text encoder via the same path with input_ids: (1, 77) int32.
- Dump open_clip.tokenizer.SimpleTokenizer.encoder and bpe_ranks to vocab.json and merges.txt.
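The final dump step could look roughly like the Python sketch below. It assumes `encoder` is a token-to-id dict and `bpe_ranks` maps `(left, right)` tuples to their merge rank, which is how open_clip's `SimpleTokenizer` stores them. The on-disk layout (one space-separated pair per line, ordered by rank, no header line) and the name `dump_tokenizer` are assumptions; the actual script may differ.

```python
import json
import os


def dump_tokenizer(encoder, bpe_ranks, out_dir):
    """Write vocab.json (token -> id) and merges.txt (one merge pair per line, by rank)."""
    os.makedirs(out_dir, exist_ok=True)
    with open(os.path.join(out_dir, "vocab.json"), "w", encoding="utf-8") as handle:
        json.dump(encoder, handle, ensure_ascii=False)
    # Lower rank means the merge is applied earlier, so write pairs in rank order.
    ordered = sorted(bpe_ranks.items(), key=lambda item: item[1])
    with open(os.path.join(out_dir, "merges.txt"), "w", encoding="utf-8") as handle:
        for (left, right), _rank in ordered:
            handle.write(f"{left} {right}\n")
```

Writing merges in rank order lets a consumer recover each pair's priority from its line number alone.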
Pinning notes
- coremltools 9.0 has a numpy 2.x interop bug in its aten::Int handler that breaks the text-encoder conversion. The script pins numpy<2.
- The full conversion script lives in the consuming app's repo at scripts/convert_clip_to_coreml.py.
Evaluation
Not formally benchmarked. Validation was qualitative: a held-out 200-photo personal library produced sensible top-K rankings for prompts like sunset over water, red car, food on a plate, and person smiling.
Quality should match the source openai/clip-vit-base-patch32 for the same prompts. The FP16 weights are expected to introduce only negligible drift, although this has not been measured.
Intended use
Good fit:
- On-device cross-modal photo search in iOS apps.
- Lightweight semantic similarity over user-curated photo libraries.
Not appropriate for:
- Safety-critical classification.
- Content moderation as a sole signal.
- Tasks where the original openai/clip-vit-base-patch32 is unsuitable.
For limitations and biases of the underlying model, see the upstream OpenAI model card. They apply equally here.
License
MIT, matching the upstream OpenAI CLIP weights.
The conversion script and the Decluttr AI consumer code are licensed separately by the app's author.
Citation
@inproceedings{radford2021learning,
title={Learning Transferable Visual Models From Natural Language Supervision},
author={Radford, Alec and Kim, Jong Wook and Hallacy, Chris and Ramesh, Aditya and Goh, Gabriel and Agarwal, Sandhini and Sastry, Girish and Askell, Amanda and Mishkin, Pamela and Clark, Jack and Krueger, Gretchen and Sutskever, Ilya},
booktitle={Proceedings of the 38th International Conference on Machine Learning},
year={2021}
}