# OpenCLIP ViT-L/14 (DFN-2B) ExecuTorch (CoreML)

ExecuTorch `.pte` exports of the OpenCLIP ViT-L/14 (DFN-2B) visual and text encoders for on-device inference on Apple devices (iOS 18+ / macOS 15+).

Source code & export scripts: [github.com/mallman/CoreMLCLIP](https://github.com/mallman/CoreMLCLIP)
## Files

| File | Encoder | Precision | Backend | Compute Units |
|---|---|---|---|---|
| clip_vit_l14_visual_fp16_all.pte | Visual | fp16 | CoreML + XNNPACK fallback | CPU + GPU + ANE |
| clip_vit_l14_visual_fp32_cpu.pte | Visual | fp32 | XNNPACK | CPU only |
| clip_vit_l14_text_fp16_all.pte | Text | fp16 | CoreML + XNNPACK fallback | CPU + GPU + ANE |
| clip_vit_l14_text_fp32_cpu.pte | Text | fp32 | XNNPACK | CPU only |
| vocab.json | Tokenizer vocabulary | | | |
| merges.txt | Tokenizer BPE merges | | | |
| config.json | Model metadata | | | |
The fp16 CoreML variants are recommended for deployment, since they can run on the Apple Neural Engine.
## Model Details

- Source model: `open_clip` ViT-L-14, `dfn2b` pretrained weights
- Visual encoder: ViT-L/14 (~302M params)
- Text encoder: Transformer (12 layers, 768-dim)
- Visual input: `[1, 3, 224, 224]` float tensor (RGB, normalized)
- Text input: `[1, 77]` int64 tensor (tokenized)
- Output: 768-dim embedding vector (not L2-normalized)
- ExecuTorch version: 1.1.0
- Minimum deployment target: iOS 18 / macOS 15
## Usage
Both encoders take preprocessed inputs and return 768-dim embeddings. For zero-shot classification, L2-normalize both embeddings and compute their dot product.
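The similarity step can be sketched in NumPy. This is a minimal illustration of the normalize-then-dot-product scoring described above, with random stand-in arrays in place of actual encoder outputs; the function name `zero_shot_scores` is our own, not part of the repo.

```python
import numpy as np

def zero_shot_scores(image_emb: np.ndarray, text_embs: np.ndarray) -> np.ndarray:
    """Cosine similarities between one image embedding and N text embeddings.

    image_emb: shape (768,); text_embs: shape (N, 768). Both are raw encoder
    outputs, which these .pte models do NOT L2-normalize, so normalize here.
    """
    img = image_emb / np.linalg.norm(image_emb)
    txt = text_embs / np.linalg.norm(text_embs, axis=1, keepdims=True)
    return txt @ img  # shape (N,); higher = more similar

# Toy example with random stand-in embeddings (not real model outputs):
rng = np.random.default_rng(0)
scores = zero_shot_scores(rng.normal(size=768), rng.normal(size=(3, 768)))
best = int(np.argmax(scores))  # index of the best-matching text prompt
```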
## Image Preprocessing
| Parameter | Value |
|---|---|
| Input size | 224 x 224 |
| Resize | Bicubic, shortest edge to 224 |
| Crop | Center crop |
| Color space | RGB, [0, 1] range |
| Normalization mean | [0.48145466, 0.4578275, 0.40821073] |
| Normalization std | [0.26862954, 0.26130258, 0.27577711] |
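The table above can be implemented with Pillow and NumPy. This is a sketch, not the repo's own preprocessing code, but it follows the listed parameters (bicubic shortest-edge resize, center crop, [0, 1] scaling, per-channel normalization):

```python
import numpy as np
from PIL import Image

# Normalization constants from the table above
CLIP_MEAN = np.array([0.48145466, 0.4578275, 0.40821073], dtype=np.float32)
CLIP_STD = np.array([0.26862954, 0.26130258, 0.27577711], dtype=np.float32)

def preprocess(img: Image.Image, size: int = 224) -> np.ndarray:
    """Bicubic shortest-edge resize, center crop, scale to [0, 1], normalize.
    Returns a (1, 3, 224, 224) float32 array in NCHW layout."""
    w, h = img.size
    scale = size / min(w, h)
    img = img.convert("RGB").resize(
        (round(w * scale), round(h * scale)), Image.BICUBIC
    )
    w, h = img.size
    left, top = (w - size) // 2, (h - size) // 2
    img = img.crop((left, top, left + size, top + size))
    x = np.asarray(img, dtype=np.float32) / 255.0  # HWC, [0, 1]
    x = (x - CLIP_MEAN) / CLIP_STD                 # per-channel normalize
    return x.transpose(2, 0, 1)[None]              # HWC -> NCHW, add batch dim

# Example with a solid-gray test image:
x = preprocess(Image.new("RGB", (640, 480), (128, 128, 128)))
```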
## Tokenizer

Uses the standard CLIP BPE tokenizer (vocab size 49408, context length 77). The `vocab.json` and `merges.txt` files are included for on-device tokenization.
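The shape of the tokenized input can be illustrated without running a full BPE encoder. The sketch below assumes BPE ids have already been produced from `vocab.json`/`merges.txt`, and shows the standard CLIP framing: `<|startoftext|>` (id 49406), the ids, `<|endoftext|>` (id 49407), zero-padded to the `[1, 77]` int64 input. The example ids are hypothetical, not real encodings.

```python
import numpy as np

SOT, EOT, CONTEXT_LEN = 49406, 49407, 77  # standard CLIP special tokens

def pad_tokens(bpe_ids: list[int]) -> np.ndarray:
    """Wrap pre-encoded BPE ids with SOT/EOT and zero-pad to the model's
    [1, 77] int64 input shape. Truncates ids that would overflow the context."""
    ids = [SOT] + bpe_ids[: CONTEXT_LEN - 2] + [EOT]
    out = np.zeros((1, CONTEXT_LEN), dtype=np.int64)
    out[0, : len(ids)] = ids
    return out

# Hypothetical BPE ids standing in for an encoded prompt:
tokens = pad_tokens([320, 1125, 539, 320, 2368])
```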
## Verification

Each exported variant was verified against the original PyTorch model using deterministic random inputs:
| Variant | Encoder | Cosine Similarity | Max Abs Diff |
|---|---|---|---|
| fp16 CoreML | Visual | 1.000000 | 0.000091 |
| fp16 CoreML | Text | 1.000000 | 0.000052 |
| fp32 XNNPACK | Visual | 1.000000 | 0.000000 |
| fp32 XNNPACK | Text | 1.000000 | 0.000000 |
Cross-modal similarity rankings match for both variants.
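The two metrics in the table can be computed as follows. This is a generic sketch of the comparison, not the repo's verification script; the random embeddings below merely stand in for reference and exported outputs.

```python
import numpy as np

def compare(ref: np.ndarray, test: np.ndarray) -> tuple[float, float]:
    """Cosine similarity and max absolute difference between two embeddings,
    the two metrics reported in the verification table."""
    cos = float(ref @ test / (np.linalg.norm(ref) * np.linalg.norm(test)))
    max_abs_diff = float(np.max(np.abs(ref - test)))
    return cos, max_abs_diff

# Stand-in data: a reference embedding and a slightly perturbed copy,
# mimicking fp16 rounding error in an exported model's output.
rng = np.random.default_rng(42)
ref = rng.normal(size=768).astype(np.float32)
cos, max_diff = compare(ref, ref + 1e-4 * rng.normal(size=768).astype(np.float32))
```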
## How to Reproduce

```shell
git clone https://github.com/mallman/CoreMLCLIP.git
cd CoreMLCLIP
pip install -r requirements.txt
python export_openclip.py
```

See the [GitHub repo](https://github.com/mallman/CoreMLCLIP) for full instructions.