CLIP ViT-B/32 image encoder - LiteRT (TFLite) GPU

OpenAI / OpenCLIP CLIP ViT-B/32 image encoder converted with litert-torch for LiteRT CompiledModel GPU (ML Drift) on Android. Verified end-to-end on a Pixel 8a: the full graph (691/691 ops) runs on the OpenCL GPU delegate at ~40 ms/inference.

Files

  • clip_image_encoder.tflite: image encoder, NCHW [1, 3, 224, 224] → [1, 512] (L2-normalized).
  • text_embeddings.bin: pre-computed text embeddings for 96 labels ([96, 512], prompt "a photo of a {label}"). Little-endian: int32 num_labels, int32 dim, float32[num_labels*dim].
  • labels.txt: the 96 labels, one per line.
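
The binary layout described above can be read with the Python standard library alone. The following is a minimal sketch, not the author's loader: the function names load_text_embeddings and load_labels are hypothetical, and the code assumes the embeddings are stored row-major (one label's dim floats after another), which the file description implies but does not state outright.

```python
import struct


def load_text_embeddings(path):
    """Read text_embeddings.bin: int32 num_labels, int32 dim, then float32 data.

    Assumes little-endian values and row-major layout (one row per label).
    Returns a list of num_labels rows, each a list of dim floats.
    """
    with open(path, "rb") as f:
        header = f.read(8)
        if len(header) != 8:
            raise ValueError("file too short for the 8-byte header")
        num_labels, dim = struct.unpack("<ii", header)
        if num_labels <= 0 or dim <= 0:
            raise ValueError(f"implausible header: {num_labels} x {dim}")
        payload = f.read(num_labels * dim * 4)

    if len(payload) != num_labels * dim * 4:
        raise ValueError("payload size does not match the header")

    flat = struct.unpack(f"<{num_labels * dim}f", payload)
    return [list(flat[i * dim:(i + 1) * dim]) for i in range(num_labels)]


def load_labels(path):
    """Read labels.txt: one label per line, blank lines skipped."""
    with open(path, encoding="utf-8") as f:
        return [line.strip() for line in f if line.strip()]
```

For the files listed here, the loader should yield 96 rows of 512 floats, with row i corresponding to line i of labels.txt.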

Preprocessing / use

RGB → center-crop to square → resize to 224×224 → CLIP normalization (mean = [0.4815, 0.4578, 0.4082] × 255, std = [0.2686, 0.2613, 0.2758] × 255), planar NCHW. The output is L2-normalized; score labels by cosine similarity against text_embeddings.bin, then softmax with logit scale 100.
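
The pipeline above can be sketched in plain Python as follows. This is an illustration under stated assumptions rather than the app's actual code: the helper names (center_crop_box, normalize_to_planar, score_labels) are hypothetical, the resize step is left to whatever image library is in use, and pixels are assumed to be 0-255 integers. The cosine is computed with explicit norms, so the sketch does not rely on the stored text embeddings being pre-normalized.

```python
import math

CLIP_MEAN = (0.4815, 0.4578, 0.4082)
CLIP_STD = (0.2686, 0.2613, 0.2758)
LOGIT_SCALE = 100.0


def center_crop_box(width, height):
    """Return (left, top, right, bottom) of the largest centered square."""
    side = min(width, height)
    left = (width - side) // 2
    top = (height - side) // 2
    return left, top, left + side, top + side


def normalize_to_planar(pixels):
    """Normalize an H x W grid of (r, g, b) tuples in 0..255.

    Returns a flat list in channel-major (CHW) order: all R, then G, then B.
    The input is assumed to be already cropped and resized to 224x224.
    """
    out = []
    for c in range(3):
        mean = CLIP_MEAN[c] * 255.0
        std = CLIP_STD[c] * 255.0
        for row in pixels:
            for px in row:
                out.append((px[c] - mean) / std)
    return out


def cosine(a, b):
    """Cosine similarity of two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def score_labels(image_emb, text_embs):
    """Softmax over LOGIT_SCALE * cosine(image, text) for each label."""
    logits = [LOGIT_SCALE * cosine(image_emb, t) for t in text_embs]
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]
```

With a logit scale of 100, small differences in cosine similarity turn into large differences in probability, so the top label usually dominates the softmax output.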

GPU-compatibility note

A stock conversion does not run on the ML Drift GPU delegate: torch.nn.MultiheadAttention lowers to RESHAPE ops on 5D tensors (the GPU maximum is 4D), so the model fails to compile. This export uses a 4D manual-attention rewrite (nn.MultiheadAttention → explicit 4D matmul + softmax, with weights copied verbatim, so it is numerically exact) plus the standard GELU → x·sigmoid(1.702x) approximation. With that, the whole encoder is GPU-clean (691/691 ops on the delegate).
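
To give a feel for the activation swap mentioned above, the sketch below compares the erf-based GELU with x·sigmoid(1.702x) over a range of inputs. It is an illustrative check in plain Python, not part of the export script; the function names and the sampled range [-6, 6] are assumptions chosen for the example.

```python
import math


def gelu_exact(x):
    """GELU defined via the Gaussian CDF: x * Phi(x)."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))


def gelu_sigmoid(x):
    """Sigmoid-based approximation: x * sigmoid(1.702 * x)."""
    return x / (1.0 + math.exp(-1.702 * x))


def max_abs_error(lo=-6.0, hi=6.0, steps=1201):
    """Largest absolute gap between the two forms on an evenly spaced grid."""
    worst = 0.0
    for i in range(steps):
        x = lo + (hi - lo) * i / (steps - 1)
        worst = max(worst, abs(gelu_exact(x) - gelu_sigmoid(x)))
    return worst


if __name__ == "__main__":
    print(f"max |gelu_exact - gelu_sigmoid| on [-6, 6]: {max_abs_error():.4f}")
```

The gap stays small in absolute terms across this range, which is why the sigmoid form is a common stand-in when an erf op is unavailable or slow on the target delegate.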

License

MIT (OpenAI CLIP / OpenCLIP).
