Image Feature Extraction
LiteRT
LiteRT
PerceptionEncoder
on-device
android
gpu
clip
image-encoder
vit
rope
Instructions to use mlboydaisuke/PE-Core-base-patch16-224-LiteRT with libraries, inference providers, notebooks, and local apps. Follow these links to get started.
- Libraries
- LiteRT
How to use mlboydaisuke/PE-Core-base-patch16-224-LiteRT with LiteRT:
# No code snippets available yet for this library. # To use this model, check the repository files and the library's documentation. # Want to help? PRs adding snippets are welcome at: # https://github.com/huggingface/huggingface.js
- PerceptionEncoder
How to use mlboydaisuke/PE-Core-base-patch16-224-LiteRT with PerceptionEncoder:
# Use PE-Core models as CLIP models import core.vision_encoder.pe as pe model = pe.CLIP.from_config("mlboydaisuke/PE-Core-base-patch16-224-LiteRT", pretrained=True)# Use any PE model as a vision encoder import core.vision_encoder.pe as pe model = pe.VisionTransformer.from_config("mlboydaisuke/PE-Core-base-patch16-224-LiteRT", pretrained=True) - Notebooks
- Google Colab
- Kaggle
| license: apache-2.0 | |
| library_name: litert | |
| pipeline_tag: image-feature-extraction | |
| base_model: timm/vit_pe_core_base_patch16_224.fb | |
| tags: | |
| - litert | |
| - tflite | |
| - on-device | |
| - android | |
| - gpu | |
| - clip | |
| - perception-encoder | |
| - image-encoder | |
| - vit | |
| - rope | |
| # Perception Encoder (PE-Core-B16-224) — LiteRT (TFLite) GPU | |
| On-device [LiteRT](https://ai.google.dev/edge/litert) (`.tflite`) conversion of | |
| **Perception Encoder Core** (PE-Core, Meta 2025), the SOTA CLIP-style image tower, | |
| converted from [`timm/vit_pe_core_base_patch16_224.fb`](https://huggingface.co/timm/vit_pe_core_base_patch16_224.fb) | |
| (ViT-B/16, 94M params; original [facebook/PE-Core-B16-224](https://huggingface.co/facebook/PE-Core-B16-224)). | |
| A single forward pass turns one RGB image into a **1024-d L2-normalized image | |
| embedding** for zero-shot classification, retrieval, and similarity — running | |
| **fully on the LiteRT `CompiledModel` GPU accelerator** (ML Drift): **all 1028 | |
| ops are GPU-native (`Replacing 1028 out of 1028 node(s) ... LITERT_CL`), no CPU | |
| fallback, no Flex ops.** | |
| ## Files | |
| | File | Size | Description | | |
| |------|------|-------------| | |
| | `pe_core_base_224_fp16.tflite` | 187 MB | FP16 single-graph model, GPU full-residency | | |
| | `convert_pecore.py` | — | Reproducible conversion script (timm → tflite) | | |
| ## I/O | |
| - **Input**: `[1, 3, 224, 224]` float32, **NCHW**, RGB normalized to **`[-1, 1]`** | |
| i.e. `(pixel/255 - 0.5) / 0.5` (timm mean/std = `(0.5, 0.5, 0.5)`). Normalization | |
| is applied by the caller (not baked into the graph). | |
| - **Output**: `[1, 1024]` float32, **L2-normalized** image embedding. | |
| ## Usage (Android, LiteRT CompiledModel) | |
| ```kotlin | |
| val model = CompiledModel.create( | |
| context.assets, "pe_core_base_224_fp16.tflite", | |
| CompiledModel.Options(Accelerator.GPU), null | |
| ) | |
| val inputs = model.createInputBuffers() | |
| val outputs = model.createOutputBuffers() | |
| inputs[0].writeFloat(nchwFloatArray) // [1,3,224,224], RGB scaled to [-1,1] | |
| model.run(inputs, outputs) | |
| val embedding = outputs[0].readFloat() // [1024], already L2-normalized | |
| ``` | |
| For zero-shot classification, precompute text-label embeddings with the PE-Core | |
| text tower offline and take the dot product on device. | |
| ## Performance | |
| - **~66 ms / image steady-state** on a Pixel 8a (Mali-G615) GPU (best 12.5 ms), | |
| full GPU residency, FP16. | |
| ## Conversion notes | |
| Converted with [litert-torch / ai-edge-torch](https://github.com/google-ai-edge/ai-edge-torch). | |
| Making a RoPE ViT image tower **fully GPU-resident *and* numerically correct** on | |
| the ML Drift GPU delegate required four verbatim (weights-exact, output | |
| corr ≈ 1.0) model-side rewrites — the first three for residency, the last for | |
| on-device numerical correctness: | |
| 1. **Fused-qkv → 4D manual attention** — the fused `qkv` reshape emits a 5D | |
| head-split the GPU delegate rejects; decompose into separate q/k/v projections. | |
| Self-attention uses `scaled_dot_product_attention`, whose lowering keeps the | |
| batch-matmul 3D with a materialized transpose (both required for residency). | |
| 2. **Interleaved 2D-RoPE → rotate-half** — PE-Core's interleaved rotary uses a | |
| strided `x[..., ::2]` that lowers to `GATHER_ND` (GPU-banned). Bake an | |
| even→odd channel permutation into the q/k weights (preserves q·k exactly) and | |
| apply the rotate-half form with constant cos/sin → clean | |
| `MUL`/`ADD`/`SLICE`/`CONCAT`. | |
| 3. **Attention-pool single-query attention → broadcast-multiply + reduce-sum** — | |
| the pooling query is a constant latent, so a batch-matmul there is | |
| `const @ non-const` (rejected at compile, and the reordered `const-RHS` form is | |
| mis-computed on device); expressing it as `(q·k).sum` + softmax + `(attn·v).sum` | |
| is exact and GPU-correct. | |
| 4. **Overflow-safe LayerNorm** — the delegate computes the LayerNorm variance | |
| reduction in **fp16 even for an fp32 graph**; deep-ViT "massive activations" | |
| (|x|~50+) make `sum((x-mean)²)` exceed fp16 max (65504), so the normalization | |
| is wrong and the error compounds with depth (output correlation collapses to | |
| ~0.28 over 12 blocks while *still reporting full GPU residency*). Scaling by | |
| 1/32 before squaring (undone after) keeps the running sum in range — | |
| mathematically identical to `nn.LayerNorm`. | |
| Verified **on a Pixel 8a GPU**: zero banned ops, zero >4D tensors, full residency, | |
| and TFLite(GPU)-vs-PyTorch output correlation = 1.0 (the on-device GPU result — | |
| not just the host CPU result — matches the reference). | |
| ## Training data & PII | |
| PE-Core was pretrained by Meta on a large-scale **web-crawled image–text dataset** | |
| (billions of image–caption pairs, CLIP-style contrastive objective). No new | |
| training was performed for this conversion — it is a weights-exact format change | |
| of the public `timm`/`facebook` checkpoint. Because the source data is | |
| web-scraped, it may incidentally contain people, faces, text, and other PII; | |
| no PII was deliberately collected, and this conversion adds none. Users deploying | |
| the encoder should apply their own content/PII filtering as appropriate. See the | |
| original [PE model card](https://huggingface.co/facebook/PE-Core-B16-224) and | |
| [paper](https://arxiv.org/abs/2504.13181) for full dataset details. | |
| ## License & attribution | |
| - **Apache-2.0** (original [PE-Core](https://huggingface.co/facebook/PE-Core-B16-224) / | |
| [timm checkpoint](https://huggingface.co/timm/vit_pe_core_base_patch16_224.fb)). | |
| - This is a format conversion; all credit to the original authors (Meta / FAIR). | |