CoreML MedSigLIP-448

CoreML conversion of google/medsiglip-448 for on-device inference on iOS/macOS. MedSigLIP is a SigLIP vision-language model fine-tuned on medical images for zero-shot classification.

Models

The original dual-encoder model is split into two separate .mlpackage files:

Model	Inputs	Output	Size
`MedSigLIP_VisionEncoder.mlpackage`	`pixel_values` — (1, 3, 448, 448) float32, normalized to [-1, 1]	`image_embeds` — (1, 1152) float16, L2-normalized	~815 MB
`MedSigLIP_TextEncoder.mlpackage`	`input_ids` — (1, 64) int32, padded with token ID 1; `attention_mask` — (1, 64) int32	`text_embeds` — (1, 1152) float16, L2-normalized	~858 MB

Both encoders output L2-normalized embeddings, so the dot product of an image embedding and a text embedding equals their cosine similarity; no further normalization is needed at inference time.

How to use

Python (validation)

import coremltools as ct
import numpy as np

vision_model = ct.models.MLModel("MedSigLIP_VisionEncoder.mlpackage")
text_model = ct.models.MLModel("MedSigLIP_TextEncoder.mlpackage")

# Vision encoder (random placeholder input; real images must be normalized to [-1, 1])
pixel_values = np.random.randn(1, 3, 448, 448).astype(np.float32)
image_embeds = vision_model.predict({"pixel_values": pixel_values})["image_embeds"]

# Text encoder — input_ids and attention_mask must be float32 for CoreML predict()
input_ids = np.array([[523, 87, 1] + [1] * 61], dtype=np.float32)
attention_mask = np.array([[1, 1, 1] + [0] * 61], dtype=np.float32)
text_embeds = text_model.predict({
    "input_ids": input_ids,
    "attention_mask": attention_mask,
})["text_embeds"]

# Cosine similarity (embeddings are already L2-normalized)
similarity = np.dot(image_embeds.flatten(), text_embeds.flatten())
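
For zero-shot classification, the same dot product can be taken against several candidate prompts at once, scaled by the logit scale of 10 listed under "Key details for integration", and passed through softmax. The following is a minimal sketch using NumPy only; the function name `zero_shot_probs` and the stand-in embeddings are hypothetical, and real embeddings would come from the two `predict()` calls above.

```python
import numpy as np

def zero_shot_probs(image_embeds, text_embeds, logit_scale=10.0):
    """Softmax probabilities over candidate prompts for one image.

    image_embeds: shape (1, D) or (D,), L2-normalized.
    text_embeds:  shape (N, D), one L2-normalized row per prompt.
    """
    image = np.asarray(image_embeds, dtype=np.float32).reshape(-1)
    texts = np.asarray(text_embeds, dtype=np.float32)
    texts = texts.reshape(-1, image.shape[0])
    # Dot product equals cosine similarity for unit-length vectors.
    logits = logit_scale * (texts @ image)
    # Subtract the max before exponentiating for numerical stability.
    exp = np.exp(logits - logits.max())
    return exp / exp.sum()

# Stand-in unit vectors (hypothetical); replace with encoder outputs.
rng = np.random.default_rng(0)
image = rng.standard_normal(1152).astype(np.float32)
image /= np.linalg.norm(image)
texts = rng.standard_normal((3, 1152)).astype(np.float32)
texts /= np.linalg.norm(texts, axis=1, keepdims=True)
texts[1] = image  # make prompt 1 a perfect match for the demo

probs = zero_shot_probs(image[None, :], texts)
```

Stacking one text embedding per class prompt into `text_embeds` yields one probability per class; the text encoder is run once per prompt because its batch size is fixed at 1.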

Swift (iOS/macOS)

import CoreML

let config = MLModelConfiguration()
config.computeUnits = .cpuAndNeuralEngine

let visionModel = try await MedSigLIP_VisionEncoder.load(configuration: config)
let textModel = try await MedSigLIP_TextEncoder.load(configuration: config)

// Run predictions off the main thread.
// pixelValuesArray, inputIdsArray, and maskArray are MLMultiArray inputs prepared by the caller.
let imageEmbeds = try await Task.detached {
    try visionModel.prediction(pixel_values: pixelValuesArray).image_embeds
}.value

let textEmbeds = try await Task.detached {
    try textModel.prediction(input_ids: inputIdsArray, attention_mask: maskArray).text_embeds
}.value

Conversion details

Format: ML Program (.mlpackage), requires iOS 17+ / macOS 14+
Precision: Float16 (weights and activations)
L2 normalization: Baked into both models via wrapper modules
Exported with: torch.export.export(strict=False) + run_decompositions(), converted with coremltools.convert()
Attention implementation: eager (SDPA not supported by coremltools)
MHA fast path: Disabled (torch.backends.mha.set_fastpath_enabled(False))

Validation

Not yet validated. Validation in progress.
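
One way such a check could be set up is to compare CoreML embeddings against reference embeddings from the original PyTorch model on identical inputs. The sketch below assumes both sets are already available as NumPy arrays; `compare_embeddings` is a hypothetical helper, the data is a stand-in, and no thresholds or results are implied.

```python
import numpy as np

def compare_embeddings(reference, candidate):
    """Return (cosine similarity, max absolute difference) for two embeddings."""
    ref = np.asarray(reference, dtype=np.float32).reshape(-1)
    cand = np.asarray(candidate, dtype=np.float32).reshape(-1)
    cosine = float(ref @ cand / (np.linalg.norm(ref) * np.linalg.norm(cand)))
    max_abs_diff = float(np.max(np.abs(ref - cand)))
    return cosine, max_abs_diff

# Stand-in data (hypothetical): a float32 unit vector and its float16 round trip.
rng = np.random.default_rng(0)
ref = rng.standard_normal(1152).astype(np.float32)
ref /= np.linalg.norm(ref)
cosine, max_abs_diff = compare_embeddings(ref, ref.astype(np.float16))
```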

Key details for integration

Tokenizer: SentencePiece Unigram. Use swift-transformers on iOS with tokenizer_class set to "T5Tokenizer" in tokenizer_config.json (functionally identical to SiglipTokenizer)
Attention mask: Must be passed to the text encoder. Build from pre-padding token count, not token values — pad token and EOS token share ID 1 (</s>)
Image preprocessing: Resize to 448x448, normalize pixels to [-1, 1]: pixel / 255 * 2 - 1
Logit scale: Multiply similarity scores by 10.0 before softmax (learned logit_scale = exp(2.3) ≈ 9.97)
Max sequence length: 64 tokens
First-launch compilation: ~30-40 s (CoreML compiles the models for the Neural Engine on first load; the compiled result is cached afterwards)
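
The image preprocessing and attention-mask rules above can be sketched in NumPy as follows. This is an illustration under stated assumptions rather than the converter's own code: the helper names `preprocess_image` and `pad_tokens` are hypothetical, the image is assumed to be already resized to 448x448 RGB, token IDs are assumed to come from the tokenizer with EOS already appended, and over-long inputs are simply truncated. The default dtype follows the Python validation snippet (float32); the model table lists int32, so pass `dtype=np.int32` if your runtime expects integers.

```python
import numpy as np

MAX_LEN = 64  # max sequence length from the list above
PAD_ID = 1    # pad and EOS share ID 1

def preprocess_image(rgb_uint8):
    """Map a (448, 448, 3) uint8 RGB array to (1, 3, 448, 448) float32 in [-1, 1]."""
    if rgb_uint8.shape != (448, 448, 3):
        raise ValueError("expected an image already resized to 448x448 RGB")
    scaled = rgb_uint8.astype(np.float32) / 255.0 * 2.0 - 1.0
    # HWC -> CHW, then add the batch dimension.
    return np.transpose(scaled, (2, 0, 1))[None, ...]

def pad_tokens(token_ids, dtype=np.float32):
    """Pad token IDs to MAX_LEN and build the mask from the pre-padding count."""
    ids = list(token_ids)[:MAX_LEN]  # simple truncation (assumption)
    n = len(ids)
    input_ids = np.full((1, MAX_LEN), PAD_ID, dtype=dtype)
    input_ids[0, :n] = ids
    # The mask depends on the count n, not on token values, so a real EOS
    # (ID 1) stays unmasked while padding (also ID 1) is masked out.
    attention_mask = np.zeros((1, MAX_LEN), dtype=dtype)
    attention_mask[0, :n] = 1
    return input_ids, attention_mask

pixel_values = preprocess_image(np.zeros((448, 448, 3), dtype=np.uint8))
input_ids, attention_mask = pad_tokens([523, 87, 1])
```

Deriving the mask from `n` is what keeps the trailing EOS at position 2 attended to even though it has the same ID as the padding that follows it.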

Known differences from original (WIP)

Difference
Float16 vs float32
Bilinear vs BICUBIC image resampling (iOS)
Tokenizer class name override
