# MobileCLIP2 ONNX

ONNX exports of Apple's MobileCLIP2 models for use with Transformers.js.

## Available Models

| Model | Vision Size | Embed Dim | Image Size | Use Case |
|-------|-------------|-----------|------------|----------|
| S0 | 43 MB | 512 | 256x256 | Ultra-lightweight, fastest inference |
| S2 | 136 MB | 512 | 256x256 | Good balance of size and quality |
| B | 330 MB | 512 | 224x224 | Higher quality, ViT-based |
| L-14 | 1.1 GB | 768 | 224x224 | Highest quality, largest |

All models include both vision and text encoders.

## Usage with Transformers.js

```js
import { CLIPVisionModelWithProjection, AutoProcessor, RawImage } from '@huggingface/transformers';

// Choose your model size: 's0', 's2', 'b', or 'l14'
const modelSize = 's2';

// Load model and processor
const model = await CLIPVisionModelWithProjection.from_pretrained('plhery/mobileclip2-onnx', {
  device: 'webgpu', // or 'wasm'
  dtype: 'fp32',
  model_file_name: `onnx/${modelSize}/vision_model`,
});
const processor = await AutoProcessor.from_pretrained('plhery/mobileclip2-onnx', {
  config_file_name: `onnx/${modelSize}/preprocessor_config.json`,
});

// Process an image
const image = await RawImage.read('path/to/image.jpg');
const inputs = await processor([image]);

// Get embeddings and L2-normalize them (the raw model outputs are unnormalized)
const outputs = await model({ pixel_values: inputs.pixel_values });
const embeddings = outputs.image_embeds.normalize(2, -1);
```
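Once embeddings are L2-normalized as above, cosine similarity between an image and a text embedding reduces to a dot product. A minimal sketch in plain JavaScript (the `cosineSimilarity` helper is illustrative, not part of Transformers.js; it takes plain number arrays, so convert tensors first, e.g. with `Array.from(embeddings.data)`):

```js
// Cosine similarity between two embedding vectors.
// For already L2-normalized vectors this equals the plain dot product.
function cosineSimilarity(a, b) {
  let dot = 0, normA = 0, normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}
```

The result lies in [-1, 1]; for example, `cosineSimilarity([1, 0], [1, 0])` returns `1`.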

## File Structure

```
onnx/
  s0/
    vision_model.onnx
    text_model.onnx
    config.json
    preprocessor_config.json
  s2/
    ...
  b/
    ...
  l14/
    ...
```
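The `model_file_name` option in the usage example maps into this layout: for `dtype: 'fp32'`, Transformers.js appends a plain `.onnx` extension to the given name (quantized dtypes instead get a suffix such as `_fp16`). A sketch of the fp32 case, with `buildModelPath` being a hypothetical helper, not a library function:

```js
// Hypothetical helper: resolve an fp32 ONNX file inside this repo's layout.
// component is 'vision_model' or 'text_model'; modelSize is 's0', 's2', 'b', or 'l14'.
function buildModelPath(modelSize, component) {
  return `onnx/${modelSize}/${component}.onnx`;
}
```

For example, `buildModelPath('s2', 'vision_model')` returns `'onnx/s2/vision_model.onnx'`.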

## Technical Notes

- Outputs are unnormalized embeddings; L2-normalize them before computing cosine similarities
- Text input: token IDs shaped `[batch, 77]` (CLIP BPE vocabulary, size 49408)
- Preprocessing: `image_mean = (0, 0, 0)`, `image_std = (1, 1, 1)` for all variants
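The fixed `[batch, 77]` context length means token sequences must be padded or truncated before reaching the text encoder. In practice the tokenizer handles this; the sketch below only illustrates the shape requirement (the pad value of `0` and simple truncation are illustrative assumptions):

```js
// Pad or truncate a token-ID array to CLIP's fixed context length of 77.
const CONTEXT_LENGTH = 77;

function padToContextLength(tokenIds, padValue = 0) {
  if (tokenIds.length >= CONTEXT_LENGTH) {
    return tokenIds.slice(0, CONTEXT_LENGTH); // truncate long sequences
  }
  // pad short sequences up to the fixed length
  return tokenIds.concat(new Array(CONTEXT_LENGTH - tokenIds.length).fill(padValue));
}
```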

## Local Conversion

```sh
./setup_open_clip.sh
python3 -m venv venv && source venv/bin/activate
pip install -r requirements.txt

# Convert a specific model (default: S2)
python convert_mobileclip2_b_to_onnx.py --model-name MobileCLIP2-S0 --out-dir onnx/s0 --skip-fp16
python convert_mobileclip2_b_to_onnx.py --model-name MobileCLIP2-B --out-dir onnx/b --skip-fp16
```

## License

Apple Sample Code License (apple-amlr), following the original MobileCLIP license.
