# MobileCLIP2 ONNX

ONNX exports of Apple's MobileCLIP2 models for use with Transformers.js.

## Available Models

| Model | Vision Size | Embed Dim | Image Size | Use Case |
|-------|-------------|-----------|------------|----------|
| S0 | 43 MB | 512 | 256x256 | Ultra-lightweight, fastest inference |
| S2 | 136 MB | 512 | 256x256 | Good balance of size and quality |
| B | 330 MB | 512 | 224x224 | Higher quality, ViT-based |
| L-14 | 1.1 GB | 768 | 224x224 | Highest quality, largest |

All models include both vision and text encoders.

## Usage with Transformers.js

```js
import { CLIPVisionModelWithProjection, AutoProcessor, RawImage } from '@huggingface/transformers';

// Choose your model size: 's0', 's2', 'b', or 'l14'
const modelSize = 's2';

// Load model and processor
const model = await CLIPVisionModelWithProjection.from_pretrained('plhery/mobileclip2-onnx', {
  device: 'webgpu', // or 'wasm'
  dtype: 'fp32',
  model_file_name: `onnx/${modelSize}/vision_model`,
});
const processor = await AutoProcessor.from_pretrained('plhery/mobileclip2-onnx', {
  config_file_name: `onnx/${modelSize}/preprocessor_config.json`,
});

// Process an image
const image = await RawImage.read('path/to/image.jpg');
const inputs = await processor([image]);

// Get embeddings and L2-normalize them (the raw model outputs are unnormalized)
const outputs = await model({ pixel_values: inputs.pixel_values });
const embeddings = outputs.image_embeds.normalize(2, -1);
```
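Once embeddings are L2-normalized as above, cosine similarity between an image and a text embedding reduces to a dot product. A minimal sketch in plain JavaScript (the `cosineSimilarity` helper is illustrative, not part of Transformers.js; it takes plain number arrays, so convert tensors first, e.g. with `Array.from(embeddings.data)`):

```js
// Cosine similarity between two embedding vectors.
// For already L2-normalized vectors this equals the plain dot product.
function cosineSimilarity(a, b) {
  let dot = 0, normA = 0, normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}
```

The result lies in [-1, 1]; for example, `cosineSimilarity([1, 0], [1, 0])` returns `1`.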

## File Structure

```
onnx/
  s0/
    vision_model.onnx
    text_model.onnx
    config.json
    preprocessor_config.json
  s2/
    ...
  b/
    ...
  l14/
    ...
```
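The `model_file_name` option in the usage example maps into this layout: for `dtype: 'fp32'`, Transformers.js appends a plain `.onnx` extension to the given name (quantized dtypes instead get a suffix such as `_fp16`). A sketch of the fp32 case, with `buildModelPath` being a hypothetical helper, not a library function:

```js
// Hypothetical helper: resolve an fp32 ONNX file inside this repo's layout.
// component is 'vision_model' or 'text_model'; modelSize is 's0', 's2', 'b', or 'l14'.
function buildModelPath(modelSize, component) {
  return `onnx/${modelSize}/${component}.onnx`;
}
```

For example, `buildModelPath('s2', 'vision_model')` returns `'onnx/s2/vision_model.onnx'`.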

## Technical Notes

- Outputs are unnormalized embeddings; L2-normalize them before computing cosine similarities
- Text input: token IDs shaped `[batch, 77]` (CLIP BPE vocabulary, size 49408)
- Preprocessing: `image_mean = (0, 0, 0)`, `image_std = (1, 1, 1)` for all variants
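The fixed `[batch, 77]` context length means token sequences must be padded or truncated before reaching the text encoder. In practice the tokenizer handles this; the sketch below only illustrates the shape requirement (the pad value of `0` and simple truncation are illustrative assumptions):

```js
// Pad or truncate a token-ID array to CLIP's fixed context length of 77.
const CONTEXT_LENGTH = 77;

function padToContextLength(tokenIds, padValue = 0) {
  if (tokenIds.length >= CONTEXT_LENGTH) {
    return tokenIds.slice(0, CONTEXT_LENGTH); // truncate long sequences
  }
  // pad short sequences up to the fixed length
  return tokenIds.concat(new Array(CONTEXT_LENGTH - tokenIds.length).fill(padValue));
}
```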

## Local Conversion

```sh
./setup_open_clip.sh
python3 -m venv venv && source venv/bin/activate
pip install -r requirements.txt

# Convert a specific model (default: S2)
python convert_mobileclip2_b_to_onnx.py --model-name MobileCLIP2-S0 --out-dir onnx/s0 --skip-fp16
python convert_mobileclip2_b_to_onnx.py --model-name MobileCLIP2-B --out-dir onnx/b --skip-fp16
```

## License

Apple Sample Code License (apple-amlr), following the original MobileCLIP license.
