CoreML MedSigLIP-448

CoreML conversion of google/medsiglip-448 for on-device inference on iOS/macOS. MedSigLIP is a SigLIP vision-language model fine-tuned on medical images for zero-shot classification.

Models

The original dual-encoder model is split into two separate .mlpackage files:

MedSigLIP_VisionEncoder.mlpackage (~815 MB)
  Input:  pixel_values (1, 3, 448, 448) float32, normalized to [-1, 1]
  Output: image_embeds (1, 1152) float16, L2-normalized

MedSigLIP_TextEncoder.mlpackage (~858 MB)
  Inputs: input_ids (1, 64) int32, padded with token ID 1; attention_mask (1, 64) int32
  Output: text_embeds (1, 1152) float16, L2-normalized

Both encoders output L2-normalized embeddings, so dot product = cosine similarity at inference time.

How to use

Python (validation)

import coremltools as ct
import numpy as np

vision_model = ct.models.MLModel("MedSigLIP_VisionEncoder.mlpackage")
text_model = ct.models.MLModel("MedSigLIP_TextEncoder.mlpackage")

# Vision encoder
pixel_values = np.random.randn(1, 3, 448, 448).astype(np.float32)
image_embeds = vision_model.predict({"pixel_values": pixel_values})["image_embeds"]

# Text encoder β€” input_ids and attention_mask must be float32 for CoreML predict()
input_ids = np.array([[523, 87, 1] + [1] * 61], dtype=np.float32)
attention_mask = np.array([[1, 1, 1] + [0] * 61], dtype=np.float32)
text_embeds = text_model.predict({
    "input_ids": input_ids,
    "attention_mask": attention_mask,
})["text_embeds"]

# Cosine similarity (embeddings are already L2-normalized)
similarity = np.dot(image_embeds.flatten(), text_embeds.flatten())
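For zero-shot classification you typically score one image against several candidate text prompts, scale the similarities by the learned logit scale (see "Key details for integration" below), and take a softmax. A minimal sketch with random stand-in embeddings (the real ones would come from the two encoders above):

```python
import numpy as np

def softmax(x: np.ndarray) -> np.ndarray:
    e = np.exp(x - x.max())
    return e / e.sum()

# Stand-ins for encoder outputs: one image embedding vs. three text embeddings.
rng = np.random.default_rng(0)
image_embed = rng.standard_normal(1152).astype(np.float32)
image_embed /= np.linalg.norm(image_embed)

text_embeds = rng.standard_normal((3, 1152)).astype(np.float32)
text_embeds /= np.linalg.norm(text_embeds, axis=1, keepdims=True)

# Embeddings are L2-normalized, so a dot product is a cosine similarity.
similarities = text_embeds @ image_embed

# Apply the learned logit scale (exp(2.3), roughly 10.0) before softmax.
probs = softmax(similarities * 10.0)
```

Without the 10.0 scale, cosine similarities are confined to [-1, 1] and the softmax output is nearly uniform.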

Swift (iOS/macOS)

import CoreML

let config = MLModelConfiguration()
config.computeUnits = .cpuAndNeuralEngine

let visionModel = try await MedSigLIP_VisionEncoder.load(configuration: config)
let textModel = try await MedSigLIP_TextEncoder.load(configuration: config)

// Run predictions off the main thread
let imageEmbeds = try await Task.detached {
    try visionModel.prediction(pixel_values: pixelValuesArray)
}.value

let textEmbeds = try await Task.detached {
    try textModel.prediction(input_ids: inputIdsArray, attention_mask: maskArray)
}.value

Conversion details

  • Format: ML Program (.mlpackage), requires iOS 17+ / macOS 14+
  • Precision: Float16 (weights and activations)
  • L2 normalization: Baked into both models via wrapper modules
  • Exported with: torch.export.export(strict=False) + run_decompositions(), converted with coremltools.convert()
  • Attention implementation: eager (SDPA not supported by coremltools)
  • MHA fast path: Disabled (torch.backends.mha.set_fastpath_enabled(False))

Validation

Numerical parity against the original PyTorch model has not yet been verified; validation is in progress.

Key details for integration

  • Tokenizer: SentencePiece Unigram. Use swift-transformers on iOS with tokenizer_class set to "T5Tokenizer" in tokenizer_config.json (functionally identical to SiglipTokenizer)
  • Attention mask: Must be passed to the text encoder. Build from pre-padding token count, not token values β€” pad token and EOS token share ID 1 (</s>)
  • Image preprocessing: Resize to 448x448, normalize pixels to [-1, 1]: pixel / 255 * 2 - 1
  • Logit scale: Apply score * 10.0 before softmax (learned logit_scale = exp(2.3))
  • Max sequence length: 64 tokens
  • First-launch compilation: ~30-40s (CoreML compiles for Neural Engine, cached after)

Known differences from original (WIP)

  • Precision: float16 vs. float32
  • Image resampling (iOS): bilinear vs. bicubic
  • Tokenizer class name override (T5Tokenizer in place of SiglipTokenizer)