CLIP ViT-B/32 image encoder — LiteRT (TFLite) GPU

OpenAI / OpenCLIP CLIP ViT-B/32 image encoder converted with litert-torch for LiteRT CompiledModel GPU (ML Drift) on Android. Verified end-to-end on a Pixel 8a: the full graph (691/691 ops) runs on the OpenCL GPU delegate at ~40 ms/inference.

Files

clip_image_encoder.tflite — image encoder, NCHW [1, 3, 224, 224] → [1, 512] (L2-normalized).
text_embeddings.bin — pre-computed text embeddings for 96 labels ([96, 512], prompt "a photo of a {label}"). Little-endian: int32 num_labels, int32 dim, float32[num_labels*dim].
labels.txt — the 96 labels, one per line.
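
The binary layout above can be read with the Python standard library alone. The following is a minimal sketch, not a loader shipped with this model: the function names `load_text_embeddings` and `load_labels` are hypothetical, and the sketch assumes the file contains nothing after the float32 payload.

```python
import struct
import sys
from array import array


def load_text_embeddings(path):
    """Read the layout described above: int32 num_labels, int32 dim,
    then num_labels * dim float32 values, all little-endian."""
    with open(path, "rb") as f:
        header = f.read(8)
        if len(header) != 8:
            raise ValueError("file is too short to hold the 8-byte header")
        num_labels, dim = struct.unpack("<ii", header)
        if num_labels <= 0 or dim <= 0:
            raise ValueError(f"implausible header: {num_labels} x {dim}")
        payload = f.read()

    expected = num_labels * dim * 4  # 4 bytes per float32
    if len(payload) != expected:
        raise ValueError(f"expected {expected} payload bytes, got {len(payload)}")

    flat = array("f")
    flat.frombytes(payload)
    if sys.byteorder == "big":
        flat.byteswap()  # the file is little-endian; fix up on big-endian hosts

    # One row of `dim` floats per label.
    return [flat[i * dim:(i + 1) * dim] for i in range(num_labels)]


def load_labels(path):
    """Read labels.txt: one label per line, blank lines skipped."""
    with open(path, encoding="utf-8") as f:
        return [line.strip() for line in f if line.strip()]
```

For the files listed here, `load_text_embeddings("text_embeddings.bin")` should return 96 rows of 512 floats each, in the same order as the lines of labels.txt.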

Preprocessing / use

RGB → center-crop to square → resize to 224×224 → CLIP normalization applied to 0–255 pixel values (mean = [0.4815, 0.4578, 0.4082] × 255, std = [0.2686, 0.2613, 0.2758] × 255) → planar NCHW. The output is L2-normalized. To classify, compute the cosine similarity between the output and each row of text_embeddings.bin, multiply by the logit scale of 100, and apply softmax over the labels.
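
The preprocessing and scoring steps can be sketched in NumPy as follows. This is an illustration under stated assumptions rather than the code used to verify the model: the function names `preprocess` and `classify` are hypothetical, the input is assumed to be an already-decoded uint8 RGB array of shape [H, W, 3], and the resize uses nearest-neighbour sampling only to avoid an image-library dependency (the interpolation used for this export is not stated here; the reference CLIP preprocessing uses bicubic).

```python
import numpy as np

# Normalization constants from the text, scaled to the 0..255 pixel range.
MEAN = np.array([0.4815, 0.4578, 0.4082], dtype=np.float32) * 255.0
STD = np.array([0.2686, 0.2613, 0.2758], dtype=np.float32) * 255.0


def preprocess(rgb, size=224):
    """uint8 RGB [H, W, 3] -> float32 planar NCHW [1, 3, size, size]."""
    h, w, _ = rgb.shape
    side = min(h, w)
    top, left = (h - side) // 2, (w - side) // 2
    square = rgb[top:top + side, left:left + side]  # center crop

    # Nearest-neighbour resize (assumption, see the note above).
    idx = (np.arange(size) * side) // size
    resized = square[idx][:, idx].astype(np.float32)

    normalized = (resized - MEAN) / STD  # broadcasts over the channel axis
    return np.ascontiguousarray(normalized.transpose(2, 0, 1)[np.newaxis])


def classify(image_embedding, text_embeddings, logit_scale=100.0):
    """image_embedding [D], text_embeddings [N, D] -> probabilities [N]."""
    # Re-normalizing is a no-op for unit-length inputs. Whether the stored
    # text embeddings are unit length is not stated, so normalize defensively.
    img = image_embedding / np.linalg.norm(image_embedding)
    txt = text_embeddings / np.linalg.norm(text_embeddings, axis=1, keepdims=True)

    logits = logit_scale * (txt @ img)  # scaled cosine similarities
    logits = logits - logits.max()      # stabilize the softmax
    probs = np.exp(logits)
    return probs / probs.sum()
```

The predicted label is the one at index `int(np.argmax(probs))`, where `probs` is the array returned by `classify` and the index refers to the row order of the text embeddings.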

GPU-compatibility note

A stock conversion does not run on the ML Drift GPU delegate: torch.nn.MultiheadAttention lowers to RESHAPE ops that produce 5D tensors, and the GPU delegate supports at most 4D, so the model fails to compile. This export makes two changes. First, nn.MultiheadAttention is replaced by an explicit 4D matmul + softmax implementation; the weights are copied verbatim, so the attention rewrite itself is numerically exact. Second, GELU is replaced by the standard x·sigmoid(1.702x) approximation. With both changes, the whole encoder is GPU-clean (691/691 ops on the delegate).
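
To make the idea concrete, here is a NumPy sketch of self-attention that never creates a tensor of rank above 4, together with the sigmoid-based GELU approximation. It is not the export script for this model (that rewrite lives in PyTorch); the function names are hypothetical, and the only thing taken from PyTorch is the packed weight layout of nn.MultiheadAttention, where `in_proj_weight` has shape [3E, E] with the query, key and value projections stacked in that order.

```python
import numpy as np


def attention_4d(x, in_proj_weight, in_proj_bias,
                 out_proj_weight, out_proj_bias, num_heads):
    """Self-attention over x [B, T, E] using only tensors of rank <= 4."""
    B, T, E = x.shape
    D = E // num_heads  # per-head width

    # Packed projection: rows of in_proj_weight are [q; k; v].
    qkv = x @ in_proj_weight.T + in_proj_bias  # [B, T, 3E]
    q, k, v = np.split(qkv, 3, axis=-1)        # each [B, T, E]

    def split_heads(t):
        # [B, T, E] -> [B, H, T, D]; no 5D intermediate is created.
        return t.reshape(B, T, num_heads, D).transpose(0, 2, 1, 3)

    q, k, v = split_heads(q), split_heads(k), split_heads(v)

    scores = (q @ k.transpose(0, 1, 3, 2)) / np.sqrt(D)   # [B, H, T, T]
    scores = scores - scores.max(axis=-1, keepdims=True)  # stable softmax
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)

    context = weights @ v                                 # [B, H, T, D]
    context = context.transpose(0, 2, 1, 3).reshape(B, T, E)
    return context @ out_proj_weight.T + out_proj_bias


def quick_gelu(x):
    """GELU approximation x * sigmoid(1.702 * x)."""
    return x / (1.0 + np.exp(-1.702 * x))
```

Because the sketch reuses the original projection weights unchanged and only reorders the reshapes, its output matches a per-head reference computation up to floating-point rounding. The GELU approximation, by contrast, does differ slightly from the erf-based definition.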

License

MIT (OpenAI CLIP / OpenCLIP).
