---
license: mit
tags:
- coreai
- clip
- apple-silicon
- on-device
---
# CLIP ViT-B/32: Core AI export (official recipe)
fp16 static export of [openai/clip-vit-base-patch32](https://huggingface.co/openai/clip-vit-base-patch32)
via apple/coreai-models' official recipe (`models/clip/export.py`), with one change: text
inputs are padded to the full 77-token context (`padding="max_length"`) so free-text
queries work, instead of the recipe's 7-token example trace.
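
To make the padding change concrete, here is a minimal sketch of what padding to the 77-token context looks like on the input side. `padToContext` is a hypothetical helper, not part of CoreAIKit or the export recipe, and the pad id 49407 (`<|endoftext|>` in the upstream tokenizer) is an assumption to verify against `tokenizer.json`. The ids 1 through 5 are placeholders, not real vocabulary entries.

```swift
// Hypothetical helper: pads (or truncates) token ids to a fixed context length
// and builds the matching attention mask. Assumes padID 49407; verify against tokenizer.json.
func padToContext(
    _ tokenIDs: [Int32],
    contextLength: Int = 77,
    padID: Int32 = 49407
) -> (inputIDs: [Int32], attentionMask: [Int32]) {
    precondition(contextLength > 0, "context length must be positive")
    // Truncate over-long inputs, keeping the final token (normally EOS) in the last slot.
    var ids = Array(tokenIDs.prefix(contextLength))
    if tokenIDs.count > contextLength, let last = tokenIDs.last {
        ids[contextLength - 1] = last
    }
    let realCount = ids.count
    let padCount = contextLength - realCount
    // 1 marks real tokens, 0 marks padding.
    let mask = [Int32](repeating: 1, count: realCount)
        + [Int32](repeating: 0, count: padCount)
    ids += [Int32](repeating: padID, count: padCount)
    return (ids, mask)
}

// A 7-token sequence (BOS, five placeholder ids, EOS) padded to 77.
let padded = padToContext([49406, 1, 2, 3, 4, 5, 49407])
print(padded.inputIDs.count, padded.attentionMask.reduce(0, +)) // prints "77 7"
```

A graph traced with only the 7-token example would accept exactly that length, which is why the static export fixes the text inputs at 77 columns.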
Runs out of the box with [CoreAIKit](https://github.com/john-rocky/coreai-kit)'s
`ImageTextEncoder`:
```swift
let encoder = try await ImageTextEncoder() // downloads this repo
let imageVec = try await encoder.encode(image: cgImage)
let textVec = try await encoder.encode(text: "red bike at the beach")
let score = ImageTextEncoder.cosineSimilarity(imageVec, textVec)
```
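
Because both embeddings are L2-normalized (see the graph contract), cosine similarity reduces to a plain dot product. The sketch below ranks several captions against one image on that basis. It assumes the embeddings have already been converted to `[Float]`; `dot` and `rankCaptions` are hypothetical helpers, not CoreAIKit API.

```swift
// Dot product of two equal-length vectors. For L2-normalized embeddings this
// equals their cosine similarity.
func dot(_ a: [Float], _ b: [Float]) -> Float {
    precondition(a.count == b.count, "vectors must have the same length")
    return zip(a, b).reduce(0) { $0 + $1.0 * $1.1 }
}

// Hypothetical helper: returns caption indices sorted from best to worst match.
func rankCaptions(image: [Float], captions: [[Float]]) -> [Int] {
    let scores = captions.map { dot(image, $0) }
    return scores.indices.sorted { scores[$0] > scores[$1] }
}

// Toy 2-dimensional unit vectors stand in for the 512-dimensional embeddings.
let image: [Float] = [1, 0]
let captions: [[Float]] = [[0, 1], [0.6, 0.8], [1, 0]]
print(rankCaptions(image: image, captions: captions)) // prints "[2, 1, 0]"
```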
## Bundle layout
```
model/
├── clip-vit-base-patch32_float16_static.aimodel
└── tokenizer.json
```
## Graph contract
| | name | shape | dtype |
|---|---|---|---|
| input | `pixel_values` | [1, 3, 224, 224] | fp16 |
| input | `input_ids` | [3, 77] | int32 |
| input | `attention_mask` | [3, 77] | int32 |
| output | `image_embeds` | [1, 512] | fp16, L2-normalized |
| output | `text_embeds` | [3, 512] | fp16, L2-normalized |
| output | `logits_per_image` / `logits_per_text` | [1, 3] / [3, 1] | fp16 |
Preprocessing: 224×224 resize + CLIP mean/std normalization (handled by
`ImageTextEncoder`).
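
For anyone feeding `pixel_values` without `ImageTextEncoder`, the normalization step can be sketched as follows. The mean and std values are the standard CLIP statistics from the upstream preprocessor config. `normalizeCHW` is a hypothetical helper that assumes the image is already resized and supplied as interleaved 8-bit RGB, and it leaves out the final fp32 to fp16 conversion the graph expects.

```swift
// CLIP per-channel statistics in RGB order.
let clipMean: [Float] = [0.48145466, 0.4578275, 0.40821073]
let clipStd: [Float] = [0.26862954, 0.26130258, 0.27577711]

// Hypothetical helper: converts interleaved 8-bit RGB pixels (already resized)
// into planar CHW floats, normalized as (x / 255 - mean) / std.
func normalizeCHW(rgb: [UInt8], width: Int = 224, height: Int = 224) -> [Float] {
    precondition(rgb.count == width * height * 3, "expected interleaved RGB")
    let plane = width * height
    var out = [Float](repeating: 0, count: 3 * plane)
    for pixel in 0..<plane {
        for channel in 0..<3 {
            let value = Float(rgb[pixel * 3 + channel]) / 255
            out[channel * plane + pixel] = (value - clipMean[channel]) / clipStd[channel]
        }
    }
    return out
}

// A single white pixel: each channel becomes (1 - mean) / std.
let white = normalizeCHW(rgb: [255, 255, 255], width: 1, height: 1)
print(white)
```

The planar channel-first layout matches the `[1, 3, 224, 224]` shape in the table above.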
## Performance
M4 Max: ~3.7 ms per image on the Neural Engine (fp16). Requires macOS 27 beta /
iOS 27 beta (on device only: the CoreAI framework is not in the iOS Simulator SDK).
## License
Model weights: MIT (OpenAI CLIP); see the upstream repo. Export recipe:
BSD-3-Clause (apple/coreai-models).