---
license: mit
tags:
- coreai
- clip
- apple-silicon
- on-device
---

# CLIP ViT-B/32: Core AI export (official recipe)

An fp16 static export of [openai/clip-vit-base-patch32](https://huggingface.co/openai/clip-vit-base-patch32)
via apple/coreai-models' official recipe (`models/clip/export.py`), with one change: text
inputs are padded to the full 77-token context (`padding="max_length"`) so free-text
queries work, instead of the recipe's 7-token example trace.

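To make the padding change concrete, the sketch below pads (or truncates) a list of token ids to the 77-token context and builds the matching attention mask. It only illustrates what `padding="max_length"` produces and is not the recipe's code: `padToContext` and its parameters are hypothetical names, and the real pad id has to be read from `tokenizer.json`.

```swift
/// Pads or truncates token ids to a fixed context length and returns
/// the ids plus an attention mask (1 = real token, 0 = padding).
/// `padToContext` is a hypothetical helper, not part of CoreAIKit.
func padToContext(_ ids: [Int32], padID: Int32, length: Int = 77)
    -> (inputIDs: [Int32], attentionMask: [Int32]) {
    // Simplification: over-long inputs are cut to the first `length` ids.
    let kept = Array(ids.prefix(length))
    let padCount = length - kept.count
    let padding = [Int32](repeating: padID, count: padCount)
    let ones = [Int32](repeating: 1, count: kept.count)
    let zeros = [Int32](repeating: 0, count: padCount)
    return (kept + padding, ones + zeros)
}

// Toy ids and pad id, for illustration only.
let padded = padToContext([101, 7, 8, 102], padID: 0)
print(padded.inputIDs.count, Array(padded.attentionMask.prefix(6)))
```

Both arrays always come out at the fixed context length, which is what a static graph with a `[*, 77]` text input expects.
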

Runs out of the box with [CoreAIKit](https://github.com/john-rocky/coreai-kit)'s
`ImageTextEncoder`:

```swift
let encoder = try await ImageTextEncoder() // downloads this repo
let imageVec = try await encoder.encode(image: cgImage)
let textVec = try await encoder.encode(text: "red bike at the beach")
let score = ImageTextEncoder.cosineSimilarity(imageVec, textVec)
```


## Bundle layout

```
model/
├── clip-vit-base-patch32_float16_static.aimodel
└── tokenizer.json
```

## Graph contract

| | name | shape | dtype |
|---|---|---|---|
| input | `pixel_values` | [1, 3, 224, 224] | fp16 |
| input | `input_ids` | [3, 77] | int32 |
| input | `attention_mask` | [3, 77] | int32 |
| output | `image_embeds` | [1, 512] | fp16, L2-normalized |
| output | `text_embeds` | [3, 512] | fp16, L2-normalized |
| output | `logits_per_image` / `logits_per_text` | [1, 3] / [3, 1] | fp16 |

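Because both embedding outputs are L2-normalized, the cosine similarity between the image and a text row reduces to a plain dot product. The sketch below ranks text rows against one image embedding. It works on plain `[Float]` arrays and assumes the fp16 outputs have already been converted; `dot` and `rankTexts` are illustrative names, not CoreAIKit API.

```swift
/// Dot product of two equal-length vectors.
func dot(_ a: [Float], _ b: [Float]) -> Float {
    precondition(a.count == b.count, "vectors must have the same length")
    var sum: Float = 0
    for i in 0..<a.count {
        sum += a[i] * b[i]
    }
    return sum
}

/// Ranks text embeddings against one image embedding, best match first.
/// For unit-length inputs the dot product equals the cosine similarity.
func rankTexts(image: [Float], texts: [[Float]]) -> [(index: Int, score: Float)] {
    var scored: [(index: Int, score: Float)] = []
    for (i, text) in texts.enumerated() {
        scored.append((index: i, score: dot(image, text)))
    }
    return scored.sorted { $0.score > $1.score }
}

// Toy 2-D unit vectors standing in for the 512-D embeddings.
let ranking = rankTexts(image: [1, 0], texts: [[0, 1], [0.6, 0.8], [1, 0]])
print(ranking.map { $0.index })
```

With the static shapes in the table, `texts` would hold the three rows of `text_embeds`.
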

Preprocessing: 224×224 resize + CLIP mean/std normalization (handled by
`ImageTextEncoder`).

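For callers that fill `pixel_values` directly instead of going through `ImageTextEncoder`, the normalization step might look like the sketch below. It assumes the image has already been resized to 224×224 and decoded to interleaved 8-bit RGB. The mean/std constants are the ones published with the upstream CLIP preprocessor config, `normalizeCLIP` is a hypothetical helper, and the final conversion to fp16 is left out.

```swift
// Per-channel constants from the upstream CLIP preprocessor config (R, G, B).
let clipMean: [Float] = [0.48145466, 0.4578275, 0.40821073]
let clipStd: [Float] = [0.26862954, 0.26130258, 0.27577711]

/// Converts interleaved 8-bit RGB pixels (HWC) into normalized planar
/// floats (CHW), matching the [1, 3, H, W] layout of `pixel_values`.
/// Hypothetical helper; resizing and fp16 conversion are not handled here.
func normalizeCLIP(rgb: [UInt8], width: Int, height: Int) -> [Float] {
    precondition(rgb.count == width * height * 3, "expected interleaved RGB")
    let plane = width * height
    var out = [Float](repeating: 0, count: 3 * plane)
    for i in 0..<plane {
        for c in 0..<3 {
            let value = Float(rgb[i * 3 + c]) / 255  // scale to 0...1
            out[c * plane + i] = (value - clipMean[c]) / clipStd[c]
        }
    }
    return out
}

// A single toy pixel, for illustration only.
let normalized = normalizeCLIP(rgb: [255, 0, 128], width: 1, height: 1)
print(normalized)
```
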

## Performance

M4 Max: ~3.7 ms per image on the Neural Engine (fp16). Requires macOS 27 beta /
iOS 27 beta on a physical device; the CoreAI framework is not in the iOS Simulator SDK.


## License

Model weights: MIT (OpenAI CLIP); see the upstream repo. Export recipe:
BSD-3-Clause (apple/coreai-models).