Decluttr AI · CLIP ViT-B/32 · Core ML

On-device text↔image search for the Decluttr AI iOS app.

This is a Core ML port of OpenAI's CLIP ViT-B/32 (image and text encoders), exported in FP16. It is designed to be fetched on demand by an iOS client and run entirely on-device, targeting the Neural Engine (Core ML makes the final choice of compute unit when the model is loaded).

The app ships without these weights so the App Store binary stays small (around 86 MB). When a user enables Visual Search inside the app, this repository is fetched, integrity-checked, compiled to .mlmodelc, and run locally. Nothing leaves the phone.
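The verify-then-compile part of that flow might look roughly like the sketch below. It assumes the app ships a known SHA-256 digest for each downloaded file (the `expectedHex` parameter is hypothetical), and the function names are illustrative rather than taken from the Decluttr AI source.

```swift
import CoreML
import CryptoKit
import Foundation

/// Lowercase hex SHA-256 digest of a blob of bytes.
func sha256Hex(_ data: Data) -> String {
    SHA256.hash(data: data).map { String(format: "%02x", $0) }.joined()
}

/// True when the file's digest equals `expectedHex` (hypothetical: a digest
/// the app is assumed to ship alongside its download logic).
func fileMatches(_ fileURL: URL, expectedHex: String) throws -> Bool {
    // Memory-map the file so a ~176 MB weight blob is not copied into RAM.
    let data = try Data(contentsOf: fileURL, options: .mappedIfSafe)
    return sha256Hex(data) == expectedHex.lowercased()
}

/// Compiles a downloaded .mlpackage to .mlmodelc and loads the result.
func compileAndLoad(packageURL: URL) async throws -> MLModel {
    let compiledURL = try await MLModel.compileModel(at: packageURL)
    // compileModel(at:) writes to a temporary location; a real app would move
    // `compiledURL` somewhere permanent before relying on it across launches.
    return try MLModel(contentsOf: compiledURL)
}
```

Checking each file before compiling means a truncated or tampered download fails early, before any compilation work is done.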

Files

| Path | Bytes | Purpose |
| --- | ---: | --- |
| `CLIPImage.mlpackage/Data/com.apple.CoreML/weights/weight.bin` | 175,712,384 | FP16 image-encoder weights |
| `CLIPImage.mlpackage/Data/com.apple.CoreML/model.mlmodel` | 136,412 | Image-encoder graph |
| `CLIPImage.mlpackage/Manifest.json` | 617 | Core ML package manifest |
| `CLIPText.mlpackage/Data/com.apple.CoreML/weights/weight.bin` | 126,878,848 | FP16 text-encoder weights |
| `CLIPText.mlpackage/Data/com.apple.CoreML/model.mlmodel` | 171,503 | Text-encoder graph |
| `CLIPText.mlpackage/Manifest.json` | 617 | Core ML package manifest |
| `clip_tokenizer/vocab.json` | 961,143 | BPE token to id map |
| `clip_tokenizer/merges.txt` | 524,619 | BPE merge rules |

Total: about 304 MB (304,386,143 bytes, the sum of the sizes above).

Encoder I/O contract

Both encoders L2-normalise their output embeddings, so the cosine similarity between an image embedding and a text embedding reduces to a dot product.
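Because the embeddings are unit length, ranking photos against a text query needs only dot products. The following is a minimal sketch, assuming the embeddings have already been converted to `[Float]`; the names `dot` and `topK` are illustrative.

```swift
/// Dot product of two equal-length vectors. For unit-length embeddings this
/// equals their cosine similarity.
func dot(_ a: [Float], _ b: [Float]) -> Float {
    precondition(a.count == b.count, "embedding sizes differ")
    return zip(a, b).reduce(Float(0)) { $0 + $1.0 * $1.1 }
}

/// Indices of the `k` image embeddings most similar to `query`, best first.
func topK(query: [Float], images: [[Float]], k: Int) -> [Int] {
    let scored = images.enumerated().map {
        (index: $0.offset, score: dot(query, $0.element))
    }
    return scored.sorted { $0.score > $1.score }.prefix(k).map { $0.index }
}
```

For a library of a few thousand photos a linear scan like this is usually fast enough; larger libraries may want a vectorised or indexed search instead.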

Image encoder · `CLIPImage.mlpackage`


| Field | Value |
| --- | --- |
| Input | `image` |
| Type | RGB `CGImage`, 224×224, 0–1 normalised pixels |
| Output | `embedding` |
| Shape | `MLMultiArray<Float16>` `(1, 512)` |

The model bakes in CLIP's (mean, std) normalisation internally, so the Swift caller passes raw 0–1 pixels.
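For reference only, the transform folded into the graph is the usual per-channel `(x - mean) / std`. The sketch below uses the mean and std constants published with OpenAI's CLIP preprocessing; the function name is illustrative, and callers of `CLIPImage.mlpackage` should not apply this step themselves.

```swift
// Per-channel constants from OpenAI's CLIP preprocessing, in R, G, B order.
let clipMean: [Float] = [0.48145466, 0.4578275, 0.40821073]
let clipStd: [Float] = [0.26862954, 0.26130258, 0.27577711]

/// What the graph computes internally for one 0–1 RGB pixel.
/// Shown for understanding only: the Core ML model already does this.
func clipNormalise(_ rgb: [Float]) -> [Float] {
    precondition(rgb.count == 3, "expected one R, G, B triple")
    return (0..<3).map { (rgb[$0] - clipMean[$0]) / clipStd[$0] }
}
```

Applying the transform a second time on the Swift side would shift every pixel out of the range the encoder was trained on, which is why the caller passes raw 0–1 values.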

Text encoder · `CLIPText.mlpackage`


| Field | Value |
| --- | --- |
| Input | `input_ids` |
| Type | `MLMultiArray<Int32>` `(1, 77)`, BPE-tokenized, padded with `0` |
| Output | `embedding` |
| Shape | `MLMultiArray<Float16>` `(1, 512)` |

Tokenization is not baked into the Core ML graph. The consuming app runs a Swift BPE tokenizer that loads `clip_tokenizer/vocab.json` and `clip_tokenizer/merges.txt`.

| Special token | ID |
| --- | --- |
| `<\|startoftext\|>` | 49406 |
| `<\|endoftext\|>` | 49407 |
| pad | 0 |

Usage (Swift, iOS 17+)

import CoreML

// bundleOrAppSupport is the directory URL that holds the compiled .mlmodelc bundles.
let imageURL = bundleOrAppSupport.appendingPathComponent("CLIPImage.mlmodelc")
let textURL = bundleOrAppSupport.appendingPathComponent("CLIPText.mlmodelc")
let imageModel = try MLModel(contentsOf: imageURL)
let textModel = try MLModel(contentsOf: textURL)

Image: pass an RGB `CVPixelBuffer` (224×224, 0–1) as the `image` feature. The output is an `MLMultiArray<Float16>` of length 512, which you can convert to `[Float]`.

Text: run your BPE, fill an `MLMultiArray<Int32>(shape: [1, 77])` with the token ids, and pass it as the `input_ids` feature. Pad with 0. The first token must be 49406 (`<|startoftext|>`) and the last non-pad token 49407 (`<|endoftext|>`).
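The layout of the 77 ids can be sketched as a small pure function. It assumes the BPE step returns ids without the special tokens. Truncating over-long prompts so that the end token still fits is an assumption, since the card does not specify a truncation policy. The name `makeInputIDs` is illustrative.

```swift
let contextLength = 77
let startOfText: Int32 = 49406
let endOfText: Int32 = 49407
let padID: Int32 = 0

/// Wraps BPE token ids (without special tokens) into one fixed-length row:
/// start token, prompt tokens, end token, then zero padding.
func makeInputIDs(_ bpeTokens: [Int32]) -> [Int32] {
    // Assumed policy: keep at most 75 prompt tokens so both special tokens fit.
    let body = Array(bpeTokens.prefix(contextLength - 2))
    var ids: [Int32] = [startOfText] + body + [endOfText]
    ids += Array(repeating: padID, count: contextLength - ids.count)
    return ids
}
```

Each value of the returned array can then be copied into an `MLMultiArray` created with shape `[1, 77]` and data type `.int32`.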

For a full reference implementation (download manager, SHA-256 verification, on-device MLModel.compileModel step, Swift BPE tokenizer), see the Decluttr AI source files:

Core/ML/CLIPModelStore.swift
Core/ML/CLIPSearchService.swift
Core/ML/CLIPTokenizer.swift

How this was built

pip install torch open-clip-torch coremltools "numpy<2.0"
python scripts/convert_clip_to_coreml.py

Pipeline:

Load open_clip.create_model_and_transforms("ViT-B-32", pretrained="openai") for the original OpenAI CLIP weights.
Wrap the vision tower with a thin module that bakes in CLIP's (mean, std) image normalisation so the Swift caller can pass raw 0–1 pixels.
Trace each encoder with torch.jit.trace(..., strict=False, check_trace=False).
Convert the image encoder via coremltools.convert (direct PyTorch path, FP16, target iOS17).
Convert the text encoder via the same path with input_ids: (1, 77) int32.
Dump open_clip.tokenizer.SimpleTokenizer.encoder and bpe_ranks to vocab.json and merges.txt.

Pinning notes

coremltools 9.0 has a numpy 2.x interop bug in its aten::Int handler that breaks the text-encoder conversion. The script pins numpy<2.
The full conversion script lives in the consuming app's repo at scripts/convert_clip_to_coreml.py.

Evaluation

Not formally benchmarked. Validation was qualitative: a held-out 200-photo personal library produced sensible top-K rankings for prompts such as "sunset over water", "red car", "food on a plate", and "person smiling".

Quality is expected to match the source openai/clip-vit-base-patch32 for the same prompts, and the drift introduced by the FP16 weights is expected to be negligible, although neither has been measured.

Intended use

Good fit:

On-device cross-modal photo search in iOS apps.
Lightweight semantic similarity over user-curated photo libraries.

Not appropriate for:

Safety-critical classification.
Content moderation as a sole signal.
Tasks where the original openai/clip-vit-base-patch32 is unsuitable.

For limitations and biases of the underlying model, see the upstream OpenAI model card. They apply equally here.

License

MIT, matching the upstream OpenAI CLIP weights.

The conversion script and the Decluttr AI consumer code are licensed separately by the app's author.

Citation

@inproceedings{radford2021learning,
  title={Learning Transferable Visual Models From Natural Language Supervision},
  author={Radford, Alec and Kim, Jong Wook and Hallacy, Chris and Ramesh, Aditya and Goh, Gabriel and Agarwal, Sandhini and Sastry, Girish and Askell, Amanda and Mishkin, Pamela and Clark, Jack and Krueger, Gretchen and Sutskever, Ilya},
  booktitle={Proceedings of the 38th International Conference on Machine Learning},
  year={2021}
}
