CLIP ViT-B/32 image encoder: LiteRT (TFLite) GPU
OpenAI / OpenCLIP CLIP ViT-B/32 image encoder converted with litert-torch for
LiteRT CompiledModel GPU (ML Drift) on Android. Verified end-to-end on a Pixel 8a: the full
graph (691/691 ops) runs on the OpenCL GPU delegate at ~40 ms/inference.
Files
- clip_image_encoder.tflite: image encoder, NCHW [1, 3, 224, 224] → [1, 512] (L2-normalized).
- text_embeddings.bin: pre-computed text embeddings for 96 labels ([96, 512], prompt "a photo of a {label}"). Little-endian: int32 num_labels, int32 dim, float32[num_labels*dim].
- labels.txt: the 96 labels, one per line.
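The binary layout of text_embeddings.bin can be read with a few lines of standard-library Python. This is a minimal sketch, not the author's loader: the function names `parse_text_embeddings` and `load_labels` are hypothetical, and it assumes the float32 values are stored row-major (one 512-value row per label, in the same order as labels.txt).

```python
import struct


def parse_text_embeddings(data: bytes):
    """Parse: little-endian int32 num_labels, int32 dim, float32[num_labels*dim]."""
    num_labels, dim = struct.unpack_from("<ii", data, 0)
    count = num_labels * dim
    needed = 8 + 4 * count  # 8-byte header plus 4 bytes per float32
    if len(data) < needed:
        raise ValueError(f"file too short: need {needed} bytes, got {len(data)}")
    flat = struct.unpack_from(f"<{count}f", data, 8)
    # Assumption: row-major storage, so row i holds the embedding of label i.
    return [list(flat[i * dim:(i + 1) * dim]) for i in range(num_labels)]


def load_labels(text: str):
    """Split the contents of labels.txt into one label per non-empty line."""
    return [line.strip() for line in text.splitlines() if line.strip()]
```

To use it, read text_embeddings.bin in binary mode and pass the bytes to `parse_text_embeddings`. For this repository the result should have 96 rows of 512 values, matching the number of entries returned by `load_labels`.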
Preprocessing / use
RGB → center-crop to square → resize 224×224 → CLIP normalization
(mean = [0.4815, 0.4578, 0.4082] × 255, std = [0.2686, 0.2613, 0.2758] × 255), planar NCHW.
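The preprocessing chain can be sketched in plain Python as follows. The function names are hypothetical, and the resize uses nearest-neighbour sampling purely to keep the example dependency-free: the resampling filter is not stated here, and a real pipeline would more likely use an image library with bilinear or bicubic resampling. The input is assumed to be a list of rows of (r, g, b) tuples with values in 0..255.

```python
# Normalization constants from the text, scaled to the 0..255 pixel range.
MEAN = (0.4815 * 255, 0.4578 * 255, 0.4082 * 255)
STD = (0.2686 * 255, 0.2613 * 255, 0.2758 * 255)
SIZE = 224


def center_crop_box(width, height):
    """Return (left, top, side) of the largest centered square."""
    side = min(width, height)
    return (width - side) // 2, (height - side) // 2, side


def preprocess(pixels, size=SIZE):
    """Return a flat planar NCHW buffer: all R values, then all G, then all B."""
    height, width = len(pixels), len(pixels[0])
    left, top, side = center_crop_box(width, height)
    planes = [[], [], []]
    for y in range(size):
        # Nearest-neighbour sampling at the target pixel centre (assumption).
        src_y = top + (2 * y + 1) * side // (2 * size)
        row = pixels[src_y]
        for x in range(size):
            src_x = left + (2 * x + 1) * side // (2 * size)
            rgb = row[src_x]
            for c in range(3):
                planes[c].append((rgb[c] - MEAN[c]) / STD[c])
    return planes[0] + planes[1] + planes[2]
```

The returned list has 3 × 224 × 224 values and corresponds to the [1, 3, 224, 224] input tensor once it is packed as float32.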
The output is L2-normalized; score labels by cosine similarity against text_embeddings.bin,
then softmax with logit scale 100.
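As a rough illustration of the scoring step, the sketch below computes cosine similarities, scales them by 100, and applies a softmax. The function names are hypothetical. It re-normalizes both vectors defensively, since whether the stored text embeddings are already L2-normalized is not stated; if they are, the extra normalization leaves them unchanged.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm > 0 else list(v)


def classify(image_embedding, text_embeddings, labels, logit_scale=100.0):
    """Return (label, probability) pairs sorted from most to least likely."""
    img = l2_normalize(image_embedding)
    logits = []
    for row in text_embeddings:
        txt = l2_normalize(row)  # assumption: harmless if already normalized
        cosine = sum(a * b for a, b in zip(img, txt))
        logits.append(logit_scale * cosine)
    # Subtract the maximum logit before exponentiating for numerical stability.
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    return sorted(zip(labels, probs), key=lambda pair: pair[1], reverse=True)
```

With a logit scale of 100 the softmax is sharp, so small differences in cosine similarity translate into large differences in probability.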
GPU-compatibility note
A stock conversion does not run on the ML Drift GPU delegate: torch.nn.MultiheadAttention
lowers to 5D RESHAPE tensors (GPU max is 4D), so the model fails to compile. This export uses a
4D manual-attention rewrite (nn.MultiheadAttention → explicit 4D matmul + softmax, weights
copied verbatim → numerically exact) plus the standard GELU ≈ x·sigmoid(1.702x) approximation.
With that, the whole encoder is GPU-clean (691/691 ops on the delegate).
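To give a feel for the GELU approximation mentioned above, the following sketch compares the erf-based GELU with x·sigmoid(1.702x) over a small grid. It only illustrates the formula and is not part of the export script; the function names and the sampling range of [-6, 6] are arbitrary choices made for the example.

```python
import math


def gelu_exact(x):
    """GELU defined through the Gaussian CDF: x * Phi(x)."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))


def gelu_sigmoid(x):
    """Sigmoid-based approximation: x * sigmoid(1.702 * x)."""
    return x / (1.0 + math.exp(-1.702 * x))


def max_abs_gap(lo=-6.0, hi=6.0, steps=1200):
    """Largest absolute difference between the two forms on a uniform grid."""
    gap = 0.0
    for i in range(steps + 1):
        x = lo + (hi - lo) * i / steps
        gap = max(gap, abs(gelu_exact(x) - gelu_sigmoid(x)))
    return gap
```

Calling `max_abs_gap()` gives a quick bound on how far the two activations drift apart on that range. The sigmoid form uses only a multiply and a sigmoid, which tend to be widely supported elementwise operations.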
License
MIT (OpenAI CLIP / OpenCLIP).