OpenCLIP ViT-L/14 (DFN-2B) ExecuTorch (CoreML)

ExecuTorch .pte exports of the OpenCLIP ViT-L/14 (DFN-2B) visual and text encoders for on-device inference on Apple devices (iOS 18+ / macOS 15+).

Source code & export scripts: github.com/mallman/CoreMLCLIP

Files

| File | Encoder | Precision | Backend | Compute Units |
|---|---|---|---|---|
| clip_vit_l14_visual_fp16_all.pte | Visual | fp16 | CoreML + XNNPACK fallback | CPU + GPU + ANE |
| clip_vit_l14_visual_fp32_cpu.pte | Visual | fp32 | XNNPACK | CPU only |
| clip_vit_l14_text_fp16_all.pte | Text | fp16 | CoreML + XNNPACK fallback | CPU + GPU + ANE |
| clip_vit_l14_text_fp32_cpu.pte | Text | fp32 | XNNPACK | CPU only |

Supporting files: vocab.json (tokenizer vocabulary), merges.txt (tokenizer BPE merges), config.json (model metadata).

The fp16 CoreML variants are recommended for deployment, since they can run on the Apple Neural Engine.

Model Details

  • Source model: open_clip ViT-L-14 / dfn2b
  • Visual encoder: ViT-L/14 (~302M params)
  • Text encoder: Transformer (12 layers, 768-dim)
  • Visual input: [1, 3, 224, 224] float tensor (RGB, normalized)
  • Text input: [1, 77] int64 tensor (tokenized)
  • Output: 768-dim embedding vector (not L2-normalized)
  • ExecuTorch version: 1.1.0
  • Minimum deployment target: iOS 18 / macOS 15

Usage

Both encoders take preprocessed inputs and return 768-dim embeddings. For zero-shot classification, L2-normalize both embeddings and compute their dot product.
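The normalize-then-dot-product step can be sketched as follows. This is a minimal NumPy example; the random vectors stand in for actual encoder outputs, and `clip_similarity` is a hypothetical helper name, not part of the repo.

```python
import numpy as np

def l2_normalize(v: np.ndarray) -> np.ndarray:
    """Scale a vector to unit L2 norm."""
    return v / np.linalg.norm(v)

def clip_similarity(image_emb: np.ndarray, text_embs: np.ndarray) -> np.ndarray:
    """Cosine similarities between one image embedding and N text embeddings."""
    img = l2_normalize(image_emb)
    txt = text_embs / np.linalg.norm(text_embs, axis=-1, keepdims=True)
    return txt @ img  # shape [N], one score per caption

# Toy example with random 768-dim embeddings standing in for encoder outputs.
rng = np.random.default_rng(0)
image_emb = rng.standard_normal(768).astype(np.float32)
text_embs = rng.standard_normal((3, 768)).astype(np.float32)
scores = clip_similarity(image_emb, text_embs)
best = int(np.argmax(scores))  # index of the best-matching caption
```

Because both sides are unit-normalized, the dot product is exactly the cosine similarity, so scores are comparable across captions.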

Image Preprocessing

| Parameter | Value |
|---|---|
| Input size | 224 × 224 |
| Resize | Bicubic, shortest edge to 224 |
| Crop | Center crop |
| Color space | RGB, [0, 1] range |
| Normalization mean | [0.48145466, 0.4578275, 0.40821073] |
| Normalization std | [0.26862954, 0.26130258, 0.27577711] |
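The crop, scaling, and normalization steps above can be sketched in NumPy. This is a simplified sketch that assumes the bicubic resize (shortest edge to 224) has already been done, e.g. with Pillow; `preprocess` is an illustrative helper, not code from the repo.

```python
import numpy as np

CLIP_MEAN = np.array([0.48145466, 0.4578275, 0.40821073], dtype=np.float32)
CLIP_STD = np.array([0.26862954, 0.26130258, 0.27577711], dtype=np.float32)

def preprocess(image_hwc_uint8: np.ndarray) -> np.ndarray:
    """Center-crop, scale to [0, 1], normalize, and lay out as [1, 3, 224, 224].

    Assumes the RGB image was already bicubic-resized so its shortest
    edge is 224.
    """
    h, w, _ = image_hwc_uint8.shape
    top, left = (h - 224) // 2, (w - 224) // 2
    crop = image_hwc_uint8[top:top + 224, left:left + 224, :]
    x = crop.astype(np.float32) / 255.0  # RGB in [0, 1]
    x = (x - CLIP_MEAN) / CLIP_STD       # per-channel normalization
    x = x.transpose(2, 0, 1)[None, ...]  # HWC -> NCHW with batch dim
    return x

# Example: a dummy 224x320 "image" whose shortest edge is already 224.
dummy = np.zeros((224, 320, 3), dtype=np.uint8)
tensor = preprocess(dummy)  # shape (1, 3, 224, 224), float32
```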

Tokenizer

Uses the standard CLIP BPE tokenizer (vocab size 49408, context length 77). The vocab.json and merges.txt files are included for on-device tokenization.
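The standard CLIP tokenizer wraps the BPE token ids with start/end markers and zero-pads to the context length. A minimal sketch of that packing step (the BPE encoding itself is omitted; the input ids below are hypothetical, and `pack_tokens` is an illustrative name):

```python
import numpy as np

CONTEXT_LENGTH = 77
SOT_TOKEN = 49406  # <|startoftext|> in the standard CLIP vocab
EOT_TOKEN = 49407  # <|endoftext|>

def pack_tokens(bpe_ids: list[int]) -> np.ndarray:
    """Wrap BPE token ids with start/end tokens and zero-pad to [1, 77]."""
    ids = [SOT_TOKEN] + bpe_ids[:CONTEXT_LENGTH - 2] + [EOT_TOKEN]
    out = np.zeros((1, CONTEXT_LENGTH), dtype=np.int64)
    out[0, :len(ids)] = ids
    return out

tokens = pack_tokens([320, 1125, 539, 320, 2368])  # hypothetical BPE ids
```

The resulting [1, 77] int64 tensor matches the text encoder's expected input shape.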

Verification

Both variants were verified against the original PyTorch model using deterministic random inputs:

| Variant | Encoder | Cosine Similarity | Max Abs Diff |
|---|---|---|---|
| fp16 CoreML | Visual | 1.000000 | 0.000091 |
| fp16 CoreML | Text | 1.000000 | 0.000052 |
| fp32 XNNPACK | Visual | 1.000000 | 0.000000 |
| fp32 XNNPACK | Text | 1.000000 | 0.000000 |

Cross-modal similarity rankings match for both variants.
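The two metrics in the table can be computed as below. This is a generic sketch of cosine similarity and max absolute difference between a reference and a candidate embedding, not the repo's actual verification script; the noise scale is illustrative.

```python
import numpy as np

def verify(reference: np.ndarray, candidate: np.ndarray) -> tuple[float, float]:
    """Cosine similarity and max absolute difference between two embeddings."""
    cos = float(np.dot(reference, candidate)
                / (np.linalg.norm(reference) * np.linalg.norm(candidate)))
    max_abs = float(np.max(np.abs(reference - candidate)))
    return cos, max_abs

rng = np.random.default_rng(0)
ref = rng.standard_normal(768).astype(np.float32)
# Perturb slightly to mimic fp16 rounding error.
cand = ref + rng.normal(scale=1e-5, size=768).astype(np.float32)
cos, max_abs = verify(ref, cand)
```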

How to Reproduce

git clone https://github.com/mallman/CoreMLCLIP.git
cd CoreMLCLIP
pip install -r requirements.txt
python export_openclip.py

See the GitHub repo for full instructions.
