CoreML MedSigLIP-448

CoreML conversion of google/medsiglip-448 for on-device inference on iOS/macOS. MedSigLIP is a SigLIP vision-language model fine-tuned on medical images for zero-shot classification.

Models

The original dual-encoder model is split into two separate .mlpackage files:

| Model | Inputs | Output | Size |
| --- | --- | --- | --- |
| MedSigLIP_VisionEncoder.mlpackage | pixel_values: (1, 3, 448, 448) float32, normalized to [-1, 1] | image_embeds: (1, 1152) float16, L2-normalized | ~815 MB |
| MedSigLIP_TextEncoder.mlpackage | input_ids: (1, 64) int32, padded with token ID 1; attention_mask: (1, 64) int32 | text_embeds: (1, 1152) float16, L2-normalized | ~858 MB |

Both encoders output L2-normalized embeddings, so dot product = cosine similarity at inference time.
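
As a quick illustration of why the normalization matters, the sketch below checks that the dot product of two unit-norm vectors matches the full cosine similarity of the raw vectors. The random vectors and the `l2_normalize` helper are illustrative assumptions, not part of the converted models; only the embedding width of 1152 comes from the table above.

```python
import numpy as np

def l2_normalize(v: np.ndarray) -> np.ndarray:
    """Scale a vector to unit L2 norm (hypothetical helper for this demo)."""
    return v / np.linalg.norm(v)

# Random stand-ins for an image embedding and a text embedding
rng = np.random.default_rng(0)
a = rng.standard_normal(1152)
b = rng.standard_normal(1152)

# Full cosine similarity computed on the raw vectors
cosine = float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Plain dot product computed on the unit-norm vectors
dot_of_unit = float(np.dot(l2_normalize(a), l2_normalize(b)))

# The two agree up to floating-point rounding
assert np.isclose(cosine, dot_of_unit)
```

Because the normalization is already applied inside both encoders, application code only needs the dot product.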

How to use

Python (validation)

import coremltools as ct
import numpy as np

vision_model = ct.models.MLModel("MedSigLIP_VisionEncoder.mlpackage")
text_model = ct.models.MLModel("MedSigLIP_TextEncoder.mlpackage")

# Vision encoder (random values stand in for a preprocessed image)
pixel_values = np.random.randn(1, 3, 448, 448).astype(np.float32)
image_embeds = vision_model.predict({"pixel_values": pixel_values})["image_embeds"]

# Text encoder: input_ids and attention_mask must be float32 for CoreML predict()
input_ids = np.array([[523, 87, 1] + [1] * 61], dtype=np.float32)
attention_mask = np.array([[1, 1, 1] + [0] * 61], dtype=np.float32)
text_embeds = text_model.predict({
    "input_ids": input_ids,
    "attention_mask": attention_mask,
})["text_embeds"]

# Cosine similarity (embeddings are already L2-normalized)
similarity = np.dot(image_embeds.flatten(), text_embeds.flatten())
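
For zero-shot classification, one similarity per candidate prompt is usually turned into probabilities. The following is a minimal sketch of that step, assuming the scale of 10.0 and the softmax described in the integration notes of this README. The function name `zero_shot_probs` and the toy 3-dimensional embeddings are hypothetical; real embeddings from the encoders have 1152 dimensions.

```python
import numpy as np

LOGIT_SCALE = 10.0  # per the integration notes: exp(2.3) is roughly 10

def zero_shot_probs(image_embed, text_embeds, scale=LOGIT_SCALE):
    """Return softmax probabilities over prompts for one image.

    image_embed: any array that flattens to (D,), already L2-normalized
    text_embeds: any array that reshapes to (N, D), already L2-normalized
    """
    image_embed = np.asarray(image_embed, dtype=np.float32).reshape(-1)
    text_embeds = np.asarray(text_embeds, dtype=np.float32)
    text_embeds = text_embeds.reshape(-1, image_embed.shape[0])
    # Dot product equals cosine similarity for unit-norm embeddings
    logits = scale * (text_embeds @ image_embed)
    # Subtract the max before exponentiating for numerical stability
    logits -= logits.max()
    exp = np.exp(logits)
    return exp / exp.sum()

# Toy example: the image matches the first prompt exactly
image = np.array([1.0, 0.0, 0.0])
prompts = np.array([[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]])
probs = zero_shot_probs(image, prompts)
```

With real models, `text_embeds` would be the stacked outputs of one text-encoder call per prompt.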

Swift (iOS/macOS)

import CoreML

let config = MLModelConfiguration()
config.computeUnits = .cpuAndNeuralEngine

let visionModel = try await MedSigLIP_VisionEncoder.load(configuration: config)
let textModel = try await MedSigLIP_TextEncoder.load(configuration: config)

// Run predictions off the main thread.
// pixelValuesArray, inputIdsArray, and maskArray are MLMultiArray inputs prepared by the caller.
let imageEmbeds = try await Task.detached {
    try visionModel.prediction(pixel_values: pixelValuesArray)
}.value

let textEmbeds = try await Task.detached {
    try textModel.prediction(input_ids: inputIdsArray, attention_mask: maskArray)
}.value

Conversion details

  • Format: ML Program (.mlpackage), requires iOS 17+ / macOS 14+
  • Precision: Float16 (weights and activations)
  • L2 normalization: Baked into both models via wrapper modules
  • Exported with: torch.export.export(strict=False) + run_decompositions(), converted with coremltools.convert()
  • Attention implementation: eager (SDPA not supported by coremltools)
  • MHA fast path: Disabled (torch.backends.mha.set_fastpath_enabled(False))

Validation

Not yet validated. Validation in progress.

Key details for integration

  • Tokenizer: SentencePiece Unigram. Use swift-transformers on iOS with tokenizer_class set to "T5Tokenizer" in tokenizer_config.json (functionally identical to SiglipTokenizer)
  • Attention mask: Must be passed to the text encoder. Build it from the token count before padding, not from token values, because the pad token and EOS token share ID 1 (</s>)
  • Image preprocessing: Resize to 448x448, normalize pixels to [-1, 1]: pixel / 255 * 2 - 1
  • Logit scale: Multiply each similarity score by 10.0 before softmax (learned logit_scale = exp(2.3) ≈ 10)
  • Max sequence length: 64 tokens
  • First-launch compilation: ~30-40s (CoreML compiles for Neural Engine, cached after)
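
The image and text preparation rules above can be sketched in Python as follows. This is a minimal sketch under stated assumptions: the helper names `normalize_image` and `pad_tokens` are hypothetical, the image is assumed to be already resized to 448x448 RGB, and the token IDs are assumed to come from the tokenizer with the EOS token already appended. The float32 default follows the Python validation snippet in this README, while the model table lists int32, so the dtype is left as a parameter.

```python
import numpy as np

MAX_LEN = 64  # max sequence length from the notes above
PAD_ID = 1    # pad and EOS share this ID, so the mask cannot be derived from values

def normalize_image(rgb_uint8: np.ndarray) -> np.ndarray:
    """(448, 448, 3) uint8 RGB -> (1, 3, 448, 448) float32 in [-1, 1]."""
    if rgb_uint8.shape != (448, 448, 3):
        raise ValueError("expected an image already resized to 448x448 RGB")
    scaled = rgb_uint8.astype(np.float32) / 255.0 * 2.0 - 1.0
    # Move channels first and add the batch dimension
    chw = np.transpose(scaled, (2, 0, 1))[np.newaxis, ...]
    return np.ascontiguousarray(chw)

def pad_tokens(token_ids: list[int], dtype=np.float32) -> tuple[np.ndarray, np.ndarray]:
    """Pad token IDs to MAX_LEN and build the mask from the unpadded length."""
    n = len(token_ids)
    if n > MAX_LEN:
        raise ValueError(f"sequence has {n} tokens, limit is {MAX_LEN}")
    input_ids = list(token_ids) + [PAD_ID] * (MAX_LEN - n)
    # 1 for real tokens (including the trailing EOS), 0 for padding
    attention_mask = [1] * n + [0] * (MAX_LEN - n)
    return np.array([input_ids], dtype=dtype), np.array([attention_mask], dtype=dtype)

# Same token IDs as the Python validation example: two tokens plus EOS
input_ids, attention_mask = pad_tokens([523, 87, 1])
pixel_values = normalize_image(np.zeros((448, 448, 3), dtype=np.uint8))
```

Counting tokens before padding keeps the trailing EOS inside the mask even though it has the same ID as the padding that follows it.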

Known differences from original (WIP)

  • Float16 vs float32
  • Bilinear vs BICUBIC image resampling (iOS)
  • Tokenizer class name override
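
One way to put a number on differences like these is to compare an embedding from the original model with the corresponding CoreML output using cosine similarity. The sketch below is only an illustration: the `cosine` helper is hypothetical, and the "CoreML" vector is simulated by rounding a random float32 vector to float16. It therefore measures output rounding only, not the error accumulated through float16 activations, and it says nothing about the actual models.

```python
import numpy as np

def cosine(a, b) -> float:
    """Cosine similarity computed in float64 (hypothetical helper)."""
    a = np.asarray(a, dtype=np.float64).ravel()
    b = np.asarray(b, dtype=np.float64).ravel()
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Random unit vector standing in for a float32 reference embedding
rng = np.random.default_rng(0)
reference = rng.standard_normal(1152).astype(np.float32)
reference /= np.linalg.norm(reference)

# Simulated float16 output: the same vector rounded to half precision
as_fp16 = reference.astype(np.float16)

# 0.0 means identical direction; larger values mean more drift
drift = 1.0 - cosine(reference, as_fp16)
```

In a real check, `reference` would come from the original PyTorch model and `as_fp16` from the `.mlpackage` prediction on the same input.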
