# OpenCLIP ViT-L/14 (DFN-2B) ExecuTorch (CoreML)

ExecuTorch `.pte` exports of the OpenCLIP ViT-L/14 (DFN-2B) visual and text encoders for on-device inference on Apple devices (iOS 18+ / macOS 15+).

Source code & export scripts: [github.com/mallman/CoreMLCLIP](https://github.com/mallman/CoreMLCLIP)
## Files

| File | Encoder | Precision | Backend | Compute Units |
|---|---|---|---|---|
| clip_vit_l14_visual_fp16_all.pte | Visual | fp16 | CoreML + XNNPACK fallback | CPU + GPU + ANE |
| clip_vit_l14_visual_fp32_cpu.pte | Visual | fp32 | XNNPACK | CPU only |
| clip_vit_l14_text_fp16_all.pte | Text | fp16 | CoreML + XNNPACK fallback | CPU + GPU + ANE |
| clip_vit_l14_text_fp32_cpu.pte | Text | fp32 | XNNPACK | CPU only |
| vocab.json | Tokenizer vocabulary | | | |
| merges.txt | Tokenizer BPE merges | | | |
| config.json | Model metadata | | | |
The fp16 CoreML variants are recommended for deployment, since they can run on the Apple Neural Engine.
## Model Details

- Source model: `open_clip` ViT-L-14, `dfn2b` pretrained weights
- Visual encoder: ViT-L/14 (~302M params)
- Text encoder: Transformer (12 layers, 768-dim)
- Visual input: `[1, 3, 224, 224]` float tensor (RGB, normalized)
- Text input: `[1, 77]` int64 tensor (tokenized)
- Output: 768-dim embedding vector (not L2-normalized)
- ExecuTorch version: 1.1.0
- Minimum deployment target: iOS 18 / macOS 15
## Usage
Both encoders take preprocessed inputs and return 768-dim embeddings. For zero-shot classification, L2-normalize both embeddings and compute their dot product.
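The similarity step can be sketched in NumPy. This is a minimal illustration of the normalize-then-dot-product scoring described above, with random stand-in arrays in place of actual encoder outputs; the function name `zero_shot_scores` is our own, not part of the repo.

```python
import numpy as np

def zero_shot_scores(image_emb: np.ndarray, text_embs: np.ndarray) -> np.ndarray:
    """Cosine similarities between one image embedding and N text embeddings.

    image_emb: shape (768,); text_embs: shape (N, 768). Both are raw encoder
    outputs, which these .pte models do NOT L2-normalize, so normalize here.
    """
    img = image_emb / np.linalg.norm(image_emb)
    txt = text_embs / np.linalg.norm(text_embs, axis=1, keepdims=True)
    return txt @ img  # shape (N,); higher = more similar

# Toy example with random stand-in embeddings (not real model outputs):
rng = np.random.default_rng(0)
scores = zero_shot_scores(rng.normal(size=768), rng.normal(size=(3, 768)))
best = int(np.argmax(scores))  # index of the best-matching text prompt
```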
## Image Preprocessing
| Parameter | Value |
|---|---|
| Input size | 224 x 224 |
| Resize | Bicubic, shortest edge to 224 |
| Crop | Center crop |
| Color space | RGB, [0, 1] range |
| Normalization mean | [0.48145466, 0.4578275, 0.40821073] |
| Normalization std | [0.26862954, 0.26130258, 0.27577711] |
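The table above can be implemented with Pillow and NumPy. This is a sketch, not the repo's own preprocessing code, but it follows the listed parameters (bicubic shortest-edge resize, center crop, [0, 1] scaling, per-channel normalization):

```python
import numpy as np
from PIL import Image

# Normalization constants from the table above
CLIP_MEAN = np.array([0.48145466, 0.4578275, 0.40821073], dtype=np.float32)
CLIP_STD = np.array([0.26862954, 0.26130258, 0.27577711], dtype=np.float32)

def preprocess(img: Image.Image, size: int = 224) -> np.ndarray:
    """Bicubic shortest-edge resize, center crop, scale to [0, 1], normalize.
    Returns a (1, 3, 224, 224) float32 array in NCHW layout."""
    w, h = img.size
    scale = size / min(w, h)
    img = img.convert("RGB").resize(
        (round(w * scale), round(h * scale)), Image.BICUBIC
    )
    w, h = img.size
    left, top = (w - size) // 2, (h - size) // 2
    img = img.crop((left, top, left + size, top + size))
    x = np.asarray(img, dtype=np.float32) / 255.0  # HWC, [0, 1]
    x = (x - CLIP_MEAN) / CLIP_STD                 # per-channel normalize
    return x.transpose(2, 0, 1)[None]              # HWC -> NCHW, add batch dim

# Example with a solid-gray test image:
x = preprocess(Image.new("RGB", (640, 480), (128, 128, 128)))
```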
## Tokenizer

Uses the standard CLIP BPE tokenizer (vocab size 49408, context length 77). The `vocab.json` and `merges.txt` files are included for on-device tokenization.
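The shape of the tokenized input can be illustrated without running a full BPE encoder. The sketch below assumes BPE ids have already been produced from `vocab.json`/`merges.txt`, and shows the standard CLIP framing: `<|startoftext|>` (id 49406), the ids, `<|endoftext|>` (id 49407), zero-padded to the `[1, 77]` int64 input. The example ids are hypothetical, not real encodings.

```python
import numpy as np

SOT, EOT, CONTEXT_LEN = 49406, 49407, 77  # standard CLIP special tokens

def pad_tokens(bpe_ids: list[int]) -> np.ndarray:
    """Wrap pre-encoded BPE ids with SOT/EOT and zero-pad to the model's
    [1, 77] int64 input shape. Truncates ids that would overflow the context."""
    ids = [SOT] + bpe_ids[: CONTEXT_LEN - 2] + [EOT]
    out = np.zeros((1, CONTEXT_LEN), dtype=np.int64)
    out[0, : len(ids)] = ids
    return out

# Hypothetical BPE ids standing in for an encoded prompt:
tokens = pad_tokens([320, 1125, 539, 320, 2368])
```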
## Verification

Each exported variant was verified against the original PyTorch model using deterministic random inputs:
| Variant | Encoder | Cosine Similarity | Max Abs Diff |
|---|---|---|---|
| fp16 CoreML | Visual | 1.000000 | 0.000091 |
| fp16 CoreML | Text | 1.000000 | 0.000052 |
| fp32 XNNPACK | Visual | 1.000000 | 0.000000 |
| fp32 XNNPACK | Text | 1.000000 | 0.000000 |
Cross-modal similarity rankings match for both variants.
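The two metrics in the table can be computed as follows. This is a generic sketch of the comparison, not the repo's verification script; the random embeddings below merely stand in for reference and exported outputs.

```python
import numpy as np

def compare(ref: np.ndarray, test: np.ndarray) -> tuple[float, float]:
    """Cosine similarity and max absolute difference between two embeddings,
    the two metrics reported in the verification table."""
    cos = float(ref @ test / (np.linalg.norm(ref) * np.linalg.norm(test)))
    max_abs_diff = float(np.max(np.abs(ref - test)))
    return cos, max_abs_diff

# Stand-in data: a reference embedding and a slightly perturbed copy,
# mimicking fp16 rounding error in an exported model's output.
rng = np.random.default_rng(42)
ref = rng.normal(size=768).astype(np.float32)
cos, max_diff = compare(ref, ref + 1e-4 * rng.normal(size=768).astype(np.float32))
```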
## How to Reproduce

```shell
git clone https://github.com/mallman/CoreMLCLIP.git
cd CoreMLCLIP
pip install -r requirements.txt
python export_openclip.py
```

See the [GitHub repo](https://github.com/mallman/CoreMLCLIP) for full instructions.