Decluttr AI · CLIP ViT-B/32 · Core ML

On-device text↔image search for the Decluttr AI iOS app.

This is a Core ML port of OpenAI's CLIP ViT-B/32 (image and text encoders), exported in FP16. It is designed to be fetched on demand by an iOS client and run entirely on-device, targeting the Apple Neural Engine.

The app ships without these weights so the App Store binary stays small (around 86 MB). When a user enables Visual Search inside the app, this repository is fetched, integrity-checked, compiled to .mlmodelc, and run locally. Nothing leaves the phone.

Files

Path                                                                Bytes  Purpose
CLIPImage.mlpackage/Data/com.apple.CoreML/weights/weight.bin  175,712,384  FP16 image-encoder weights
CLIPImage.mlpackage/Data/com.apple.CoreML/model.mlmodel           136,412  Image-encoder graph
CLIPImage.mlpackage/Manifest.json                                     617  Core ML package manifest
CLIPText.mlpackage/Data/com.apple.CoreML/weights/weight.bin   126,878,848  FP16 text-encoder weights
CLIPText.mlpackage/Data/com.apple.CoreML/model.mlmodel            171,503  Text-encoder graph
CLIPText.mlpackage/Manifest.json                                      617  Core ML package manifest
clip_tokenizer/vocab.json                                         961,143  BPE token to id map
clip_tokenizer/merges.txt                                         524,619  BPE merge rules

Total: about 304 MB (304,386,143 bytes across the files listed above).
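A client that wants to show a download size or check free disk space before fetching can sum the byte counts from the table. The following is a small Python sketch of that arithmetic; the dictionary name is illustrative, and the numbers are copied from the table above.

```python
# Byte counts copied from the Files table.
FILE_BYTES = {
    "CLIPImage.mlpackage/Data/com.apple.CoreML/weights/weight.bin": 175_712_384,
    "CLIPImage.mlpackage/Data/com.apple.CoreML/model.mlmodel": 136_412,
    "CLIPImage.mlpackage/Manifest.json": 617,
    "CLIPText.mlpackage/Data/com.apple.CoreML/weights/weight.bin": 126_878_848,
    "CLIPText.mlpackage/Data/com.apple.CoreML/model.mlmodel": 171_503,
    "CLIPText.mlpackage/Manifest.json": 617,
    "clip_tokenizer/vocab.json": 961_143,
    "clip_tokenizer/merges.txt": 524_619,
}

total = sum(FILE_BYTES.values())
# Report both decimal megabytes and binary mebibytes, since tools differ.
print(f"{total:,} bytes = {total / 1e6:.1f} MB = {total / 2**20:.1f} MiB")
```

The two weight.bin files account for almost all of the download; the graphs, manifests, and tokenizer files together add under 2 MB.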

Encoder I/O contract

Both encoders are L2-normalised at the output, so cosine similarity reduces to a dot product.
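To see why the normalisation matters, here is a minimal Python sketch using toy 4-dimensional vectors in place of the real 512-dimensional embeddings. The helper names are illustrative only. Once both vectors have unit length, the norms in the cosine denominator are 1, so the plain dot product gives the same value.

```python
import math

def l2_normalise(v):
    """Scale a vector to unit Euclidean length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def dot(a, b):
    """Plain dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def cosine(a, b):
    """Full cosine similarity, with explicit norms in the denominator."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))

# Toy stand-ins for an image embedding and a text embedding.
image_vec = l2_normalise([0.2, -1.3, 0.7, 0.4])
text_vec = l2_normalise([0.1, -0.9, 1.1, 0.0])

# For unit-length vectors both measures agree up to rounding.
assert math.isclose(dot(image_vec, text_vec), cosine(image_vec, text_vec))
```

In practice this means a search loop only needs one multiply-accumulate pass per stored image embedding, with no per-query norm computation.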

Image encoder · CLIPImage.mlpackage

Input:   image
Type:    RGB CGImage, 224×224, 0–1 normalised pixels
Output:  embedding
Shape:   MLMultiArray<Float16> (1, 512)

The model bakes in CLIP's (mean, std) normalisation internally, so the Swift caller passes raw 0–1 pixels.
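For reference, the per-channel arithmetic that the graph performs on each pixel can be sketched in Python as below. The constants are the RGB mean and standard deviation published with OpenAI CLIP; the function name is illustrative, and the real model applies this to the whole 224×224 tensor rather than one pixel at a time.

```python
# Per-channel RGB statistics published with OpenAI CLIP.
CLIP_MEAN = (0.48145466, 0.4578275, 0.40821073)
CLIP_STD = (0.26862954, 0.26130258, 0.27577711)

def normalise_pixel(rgb):
    """Map one RGB pixel in the 0-1 range to CLIP's normalised space."""
    return tuple((c - m) / s for c, m, s in zip(rgb, CLIP_MEAN, CLIP_STD))

# A pixel equal to the channel means maps to zero in every channel.
print(normalise_pixel(CLIP_MEAN))
```

Because this step lives inside the model, applying it again in the caller would normalise twice and degrade the embeddings.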

Text encoder · CLIPText.mlpackage

Input:   input_ids
Type:    MLMultiArray<Int32> (1, 77), BPE-tokenized, padded with 0
Output:  embedding
Shape:   MLMultiArray<Float16> (1, 512)

Tokenization is not part of the Core ML graph. The consuming app runs a Swift BPE tokenizer that loads vocab.json and merges.txt.

Special token ID
<|startoftext|> 49406
<|endoftext|> 49407
pad 0
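The layout of one input_ids row can be sketched in Python as follows. This is an illustration of the contract, not the app's Swift tokenizer: the function name is hypothetical, the example ids are made up, and truncating over-long prompts so that the end token still fits is an assumption that the source does not spell out.

```python
SOT, EOT, PAD = 49406, 49407, 0   # special token ids from the table above
CONTEXT_LENGTH = 77               # fixed sequence length of the text encoder

def build_input_ids(bpe_ids):
    """Wrap BPE ids in start/end tokens and pad with 0 to length 77.

    Assumption: prompts longer than 75 BPE tokens are truncated so that
    the end token always fits inside the 77-token window.
    """
    body = list(bpe_ids)[: CONTEXT_LENGTH - 2]
    ids = [SOT] + body + [EOT]
    return ids + [PAD] * (CONTEXT_LENGTH - len(ids))

# Hypothetical BPE ids for a three-token prompt.
row = build_input_ids([320, 1125, 2368])
print(row[:6], len(row))
```

The resulting list maps directly onto the (1, 77) Int32 array that the text encoder expects.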

Usage (Swift, iOS 17+)

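import CoreML

// `bundleOrAppSupport` is the directory URL that holds the compiled
// .mlmodelc bundles (for example, Application Support after download).
let imageURL = bundleOrAppSupport.appendingPathComponent("CLIPImage.mlmodelc")
let textURL  = bundleOrAppSupport.appendingPathComponent("CLIPText.mlmodelc")

// MLModel(contentsOf:) throws if a bundle is missing or cannot be loaded.
let imageModel = try MLModel(contentsOf: imageURL)
let textModel  = try MLModel(contentsOf: textURL)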

Image: pass an RGB CVPixelBuffer (224×224, 0–1) as the feature named image. The output is an MLMultiArray<Float16> with 512 elements, which you can convert to [Float].

Text: run your BPE tokenizer, fill an MLMultiArray<Int32>(shape: [1, 77]) with the token ids, and pass it as the feature named input_ids. The first token must be 49406 (<|startoftext|>) and the last non-pad token 49407 (<|endoftext|>); pad the remaining positions with 0.

For a full reference implementation (download manager, SHA-256 verification, on-device MLModel.compileModel step, Swift BPE tokenizer), see the Decluttr AI source files:

  • Core/ML/CLIPModelStore.swift
  • Core/ML/CLIPSearchService.swift
  • Core/ML/CLIPTokenizer.swift
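The SHA-256 verification step can also be reproduced outside the app, for example when mirroring the files. The following is a minimal Python sketch, not the app's Swift implementation. The function names and chunk size are assumptions, and the expected digest has to come from a source you trust.

```python
import hashlib

def sha256_of(path, chunk_size=1 << 20):
    """Return the hex SHA-256 of a file, read in chunks to bound memory use."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

def verify(path, expected_hex):
    """Compare a file's digest with an expected hex string, ignoring case."""
    return sha256_of(path) == expected_hex.lower()
```

Reading in chunks matters here because the two weight.bin files are well over 100 MB each.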

How this was built

pip install torch open-clip-torch coremltools "numpy<2.0"
python scripts/convert_clip_to_coreml.py

Pipeline:

  1. Load open_clip.create_model_and_transforms("ViT-B-32", pretrained="openai") for the original OpenAI CLIP weights.
  2. Wrap the vision tower with a thin module that bakes in CLIP's (mean, std) image normalisation so the Swift caller can pass raw 0–1 pixels.
  3. Trace each encoder with torch.jit.trace(..., strict=False, check_trace=False).
  4. Convert the image encoder via coremltools.convert (direct PyTorch path, FP16, target iOS17).
  5. Convert the text encoder via the same path with input_ids: (1, 77) int32.
  6. Dump open_clip.tokenizer.SimpleTokenizer.encoder and bpe_ranks to vocab.json and merges.txt.
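Step 6 can be sketched with the standard library alone. The sketch below is not the author's script. It assumes that encoder is a dict mapping token strings to ids and that bpe_ranks maps (first, second) merge pairs to their rank, which is how open_clip's SimpleTokenizer stores them. The function name and the exact file layout (one space-separated pair per line, ordered by rank) are assumptions.

```python
import json
from pathlib import Path

def dump_tokenizer(encoder, bpe_ranks, out_dir):
    """Write vocab.json and merges.txt from tokenizer tables.

    encoder:   dict of token string -> integer id
    bpe_ranks: dict of (first, second) merge pair -> rank (lower merges first)
    """
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)

    # Token-to-id map, kept as UTF-8 so non-ASCII tokens stay readable.
    vocab_text = json.dumps(encoder, ensure_ascii=False)
    (out / "vocab.json").write_text(vocab_text, encoding="utf-8")

    # One merge rule per line, in rank order.
    ordered = sorted(bpe_ranks, key=bpe_ranks.get)
    lines = "\n".join(f"{first} {second}" for first, second in ordered)
    (out / "merges.txt").write_text(lines + "\n", encoding="utf-8")
```

Writing merges in rank order lets a consumer rebuild the rank table from line numbers alone.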

Pinning notes

  • coremltools 9.0 has a numpy 2.x interop bug in its aten::Int handler that breaks the text-encoder conversion. The script pins numpy<2.
  • The full conversion script lives in the consuming app's repo at scripts/convert_clip_to_coreml.py.

Evaluation

Not formally benchmarked. Validation was qualitative: a held-out 200-photo personal library produced sensible top-K rankings for prompts like "sunset over water", "red car", "food on a plate", and "person smiling".
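A top-K ranking of this kind amounts to scoring every stored image embedding against the query embedding and keeping the best K. The Python sketch below illustrates the idea with toy 2-dimensional vectors; the function name and the dictionary layout are hypothetical, and the app's own search code may be organised differently.

```python
import heapq

def top_k(query_vec, image_vecs, k=5):
    """Return the k (score, name) pairs with the highest dot product.

    query_vec:  L2-normalised text embedding
    image_vecs: dict of image name -> L2-normalised image embedding
    """
    scored = (
        (sum(q * x for q, x in zip(query_vec, vec)), name)
        for name, vec in image_vecs.items()
    )
    return heapq.nlargest(k, scored)

# Toy library with unit-length 2-D vectors standing in for 512-D embeddings.
library = {"a.jpg": [1.0, 0.0], "b.jpg": [0.0, 1.0], "c.jpg": [0.6, 0.8]}
print(top_k([1.0, 0.0], library, k=2))
```

heapq.nlargest avoids sorting the whole library when K is small compared with the number of photos.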

Quality is expected to match the source openai/clip-vit-base-patch32 for the same prompts. Drift from the FP16 weights is expected to be negligible, but it has not been measured.

Intended use

Good fit:

  • On-device cross-modal photo search in iOS apps.
  • Lightweight semantic similarity over user-curated photo libraries.

Not appropriate for:

  • Safety-critical classification.
  • Content moderation as a sole signal.
  • Tasks where the original openai/clip-vit-base-patch32 is unsuitable.

For limitations and biases of the underlying model, see the upstream OpenAI model card. They apply equally here.

License

MIT, matching the upstream OpenAI CLIP weights.

The conversion script and the Decluttr AI consumer code are licensed separately by the app's author.

Citation

@inproceedings{radford2021learning,
  title={Learning Transferable Visual Models From Natural Language Supervision},
  author={Radford, Alec and Kim, Jong Wook and Hallacy, Chris and Ramesh, Aditya and Goh, Gabriel and Agarwal, Sandhini and Sastry, Girish and Askell, Amanda and Mishkin, Pamela and Clark, Jack and Krueger, Gretchen and Sutskever, Ilya},
  booktitle={Proceedings of the 38th International Conference on Machine Learning},
  year={2021}
}