# CoreML MedSigLIP-448
CoreML conversion of google/medsiglip-448 for on-device inference on iOS/macOS. MedSigLIP is a SigLIP vision-language model fine-tuned on medical images for zero-shot classification.
## Models
The original dual-encoder model is split into two separate .mlpackage files:
| Model | Inputs | Output | Size |
|---|---|---|---|
| MedSigLIP_VisionEncoder.mlpackage | `pixel_values`: (1, 3, 448, 448) float32, normalized to [-1, 1] | `image_embeds`: (1, 1152) float16, L2-normalized | ~815 MB |
| MedSigLIP_TextEncoder.mlpackage | `input_ids`: (1, 64) int32, padded with token ID 1; `attention_mask`: (1, 64) int32 | `text_embeds`: (1, 1152) float16, L2-normalized | ~858 MB |
Both encoders output L2-normalized embeddings, so dot product = cosine similarity at inference time.
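As a quick sanity check of that identity: once both vectors are unit-normalized, the plain dot product and the full cosine formula agree exactly. A minimal numpy sketch:

```python
import numpy as np

# Two random 1152-dim vectors, L2-normalized like the encoder outputs
rng = np.random.default_rng(0)
a = rng.standard_normal(1152)
b = rng.standard_normal(1152)
a /= np.linalg.norm(a)
b /= np.linalg.norm(b)

dot = float(a @ b)                                               # plain dot product
cosine = float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))  # full cosine formula
# For unit vectors the two values are identical
```

This is why no extra normalization step is needed on-device: scoring an image against N text prompts is a single matrix-vector product.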
## How to use

### Python (validation)
```python
import coremltools as ct
import numpy as np

vision_model = ct.models.MLModel("MedSigLIP_VisionEncoder.mlpackage")
text_model = ct.models.MLModel("MedSigLIP_TextEncoder.mlpackage")

# Vision encoder
pixel_values = np.random.randn(1, 3, 448, 448).astype(np.float32)
image_embeds = vision_model.predict({"pixel_values": pixel_values})["image_embeds"]

# Text encoder: input_ids and attention_mask must be float32 for CoreML predict()
input_ids = np.array([[523, 87, 1] + [1] * 61], dtype=np.float32)
attention_mask = np.array([[1, 1, 1] + [0] * 61], dtype=np.float32)
text_embeds = text_model.predict({
    "input_ids": input_ids,
    "attention_mask": attention_mask,
})["text_embeds"]

# Cosine similarity (embeddings are already L2-normalized)
similarity = np.dot(image_embeds.flatten(), text_embeds.flatten())
```
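The random `pixel_values` above only exercise the model interface. For a real image, the tensor must come from an already-resized 448x448 RGB image mapped to [-1, 1]. A numpy sketch of that step (the helper name `preprocess` is illustrative, not part of the package; resizing itself is left to an image library):

```python
import numpy as np

def preprocess(image_uint8: np.ndarray) -> np.ndarray:
    """Map an already-resized 448x448x3 uint8 RGB image to the
    (1, 3, 448, 448) float32 tensor the vision encoder expects."""
    assert image_uint8.shape == (448, 448, 3) and image_uint8.dtype == np.uint8
    # [0, 255] -> [-1, 1], i.e. pixel / 255 * 2 - 1
    x = image_uint8.astype(np.float32) / 255.0 * 2.0 - 1.0
    # HWC -> CHW, then add the batch dimension
    return x.transpose(2, 0, 1)[None, ...]
```

For example, `preprocess` maps a black image to an all-(-1) tensor and a white image to an all-1 tensor, matching the normalization listed under "Key details for integration".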
### Swift (iOS/macOS)
```swift
import CoreML

let config = MLModelConfiguration()
config.computeUnits = .cpuAndNeuralEngine

let visionModel = try await MedSigLIP_VisionEncoder.load(configuration: config)
let textModel = try await MedSigLIP_TextEncoder.load(configuration: config)

// Run predictions off the main thread
let imageEmbeds = try await Task.detached {
    try visionModel.prediction(pixel_values: pixelValuesArray)
}.value
let textEmbeds = try await Task.detached {
    try textModel.prediction(input_ids: inputIdsArray, attention_mask: maskArray)
}.value
```
## Conversion details
- Format: ML Program (`.mlpackage`), requires iOS 17+ / macOS 14+
- Precision: Float16 (weights and activations)
- L2 normalization: Baked into both models via wrapper modules
- Exported with: `torch.export.export(strict=False)` + `run_decompositions()`, converted with `coremltools.convert()`
- Attention implementation: `eager` (SDPA not supported by coremltools)
- MHA fast path: Disabled (`torch.backends.mha.set_fastpath_enabled(False)`)
## Validation

Numerical parity with the original PyTorch model has not yet been validated; validation is in progress.
## Key details for integration
- Tokenizer: SentencePiece Unigram. On iOS, use swift-transformers with `tokenizer_class` set to `"T5Tokenizer"` in `tokenizer_config.json` (functionally identical to `SiglipTokenizer`)
- Attention mask: Must be passed to the text encoder. Build it from the pre-padding token count, not from token values, because the pad token and EOS token share ID 1 (`</s>`)
- Image preprocessing: Resize to 448x448, normalize pixels to [-1, 1]: `pixel / 255 * 2 - 1`
- Logit scale: Apply `score * 10.0` before softmax (learned `logit_scale = exp(2.3) ≈ 10`)
- Max sequence length: 64 tokens
- First-launch compilation: ~30-40 s (CoreML compiles for the Neural Engine; the result is cached afterwards)
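The two easiest details to get wrong above are the attention mask (pad and EOS share ID 1, so matching on token values silently masks the EOS token) and the logit scale. A numpy sketch of both, under the assumption that tokenization happened elsewhere; `pad_and_mask` and `zero_shot_probs` are hypothetical helper names:

```python
import numpy as np

SEQ_LEN = 64
PAD_ID = 1  # shared with EOS (</s>), so count tokens instead of matching IDs

def pad_and_mask(token_ids: list) -> tuple:
    """Pad token IDs to 64 and build the mask from the token COUNT."""
    n = len(token_ids)
    input_ids = np.full((1, SEQ_LEN), PAD_ID, dtype=np.float32)
    input_ids[0, :n] = token_ids
    attention_mask = np.zeros((1, SEQ_LEN), dtype=np.float32)
    attention_mask[0, :n] = 1.0
    return input_ids, attention_mask

def zero_shot_probs(image_embed: np.ndarray, text_embeds: np.ndarray) -> np.ndarray:
    """Cosine scores scaled by the learned logit scale (~10), then softmax.
    image_embed: (1152,), text_embeds: (N, 1152), both L2-normalized."""
    logits = (text_embeds @ image_embed) * 10.0
    e = np.exp(logits - logits.max())  # stable softmax
    return e / e.sum()
```

For the example sequence `[523, 87, 1]` from the Python section, the mask has exactly three ones, even though positions 3..63 also hold ID 1.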
## Known differences from original (WIP)
| Difference |
|---|
| Float16 vs float32 |
| Bilinear vs BICUBIC image resampling (iOS) |
| Tokenizer class name override |