# AI Kit Gallery - Optimized ONNX Vision Models
This repository contains optimized ONNX models designed for the AI Kit Gallery Android app. These models enable high-performance, offline AI-powered image search and categorization directly on mobile devices.
## Available Models

### CLIP Models (OpenAI/clip-vit-base-patch32)
**Text Encoder:** `clip_text_quantized.onnx` (62MB)
- Input: Text tokens (max length 77)
- Output: 512D text embedding
- Optimization: INT8 Dynamic Quantization
- Use Case: Generating embeddings for text queries.
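To make the I/O contract concrete, here is a minimal Kotlin sketch of running this encoder with ONNX Runtime. It assumes the export's only input is named `input_ids` (matching the deployment snippet later in this README; some CLIP text exports also require an `attention_mask`) and that the query has already been tokenized with CLIP's BPE tokenizer and padded to length 77. `encodeText` is an illustrative helper name, not part of this repository.

```kotlin
import ai.onnxruntime.OnnxTensor
import ai.onnxruntime.OrtEnvironment
import ai.onnxruntime.OrtSession
import java.nio.LongBuffer

// Illustrative helper: run the quantized CLIP text encoder on a pre-tokenized query.
// Assumes the model input is named "input_ids" and sequences are padded to length 77.
fun encodeText(env: OrtEnvironment, session: OrtSession, tokenIds: LongArray): FloatArray {
    require(tokenIds.size == 77) { "CLIP expects token sequences padded/truncated to 77" }
    val embedding = FloatArray(512)
    OnnxTensor.createTensor(env, LongBuffer.wrap(tokenIds), longArrayOf(1, 77)).use { input ->
        session.run(mapOf("input_ids" to input)).use { results ->
            // Output shape is [1, 512]; copy it into a flat 512-D embedding.
            (results[0] as OnnxTensor).floatBuffer.get(embedding)
        }
    }
    return embedding
}
```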
**Vision Encoder:** `clip_vision_quantized.onnx` (337MB)
- Input: 224x224 RGB images
- Output: 512D image embedding
- Optimization: Full precision (FP32) to maintain accuracy
- Use Case: Encoding images for similarity search.
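The image path needs preprocessing on-device before inference. Below is a simplified Kotlin sketch that scales a `Bitmap` to 224x224 and converts it to a normalized CHW float tensor. The mean/std constants are the standard CLIP normalization values, and the input name `pixel_values` is an assumption about this export; verify both against your graph. Note the reference CLIP pipeline resizes the short side and center-crops, which this sketch replaces with a direct scale.

```kotlin
import ai.onnxruntime.OnnxTensor
import ai.onnxruntime.OrtEnvironment
import android.graphics.Bitmap
import java.nio.FloatBuffer

// Simplified CLIP image preprocessing: scale to 224x224, normalize, pack as NCHW.
fun preprocess(env: OrtEnvironment, bitmap: Bitmap): OnnxTensor {
    val mean = floatArrayOf(0.48145466f, 0.4578275f, 0.40821073f)
    val std = floatArrayOf(0.26862954f, 0.26130258f, 0.27577711f)
    val resized = Bitmap.createScaledBitmap(bitmap, 224, 224, true)
    val pixels = IntArray(224 * 224)
    resized.getPixels(pixels, 0, 224, 0, 0, 224, 224)
    val chw = FloatBuffer.allocate(3 * 224 * 224)
    for (c in 0 until 3) {
        for (i in pixels.indices) {
            // ARGB ints: R at bits 16-23, G at 8-15, B at 0-7.
            val value = (pixels[i] shr (16 - 8 * c)) and 0xFF
            chw.put(((value / 255f) - mean[c]) / std[c])
        }
    }
    chw.rewind()
    return OnnxTensor.createTensor(env, chw, longArrayOf(1, 3, 224, 224))
}
```

The resulting tensor is then fed to the vision session, e.g. `session.run(mapOf("pixel_values" to tensor))`.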
### ViT Model (Google/vit-base-patch16-224)

**Base Model:** `vit_base_quantized.onnx` (84MB)
- Input: 224x224 RGB images
- Output: 768D image embedding (CLS token)
- Optimization: INT8 Dynamic Quantization
- Use Case: Alternative high-quality vision encoder.
## Quick Start

### 1. Try the Interactive Demo
You can view or download the demo notebook from Hugging Face: View AI Models Demo
To run it in Colab: Download the .ipynb file and upload it to Google Colab.
### 2. Download Models
```bash
# Install Hugging Face Hub
pip install huggingface_hub

# Download CLIP Models
huggingface-cli download JanadaSroor/vision-models models/clip_text_quantized.onnx --local-dir .
huggingface-cli download JanadaSroor/vision-models models/clip_vision_quantized.onnx --local-dir .

# Download ViT Model
huggingface-cli download JanadaSroor/vision-models models/vit_base_quantized.onnx --local-dir .
```
## Model Specifications

| Model | Original Size | Compressed Size | Quantization | Input Shape | Output Shape |
|---|---|---|---|---|---|
| CLIP Text | ~120MB | 62MB (⬇️ 48%) | INT8 | [batch, 77] | [batch, 512] |
| CLIP Vision | ~340MB | 337MB | FP32 | [batch, 3, 224, 224] | [batch, 512] |
| ViT Base | ~340MB | 84MB (⬇️ 75%) | INT8 | [batch, 3, 224, 224] | [batch, 768] |
## Performance Benchmarks

Inference times measured in Google Colab on a standard T4 instance running in CPU mode:
- CLIP Text (INT8): ~12ms
- CLIP Vision (FP32): ~65ms
- ViT Base (INT8): ~55ms
Note: Mobile performance on modern Android devices (Snapdragon 8 Gen 1 and newer) is expected to be 20-30% faster due to NPU/GPU acceleration.
## Deployment in Android
These models are optimized for ONNX Runtime Mobile.
- Copy the `.onnx` files to your project's `src/main/assets/` directory.
- Use the ONNX Runtime Kotlin/Java API to load and run inference:
```kotlin
// Load the model from app assets and create a session
// (env.createSession is the ONNX Runtime Java/Kotlin entry point).
val env = OrtEnvironment.getEnvironment()
val modelBytes = context.assets.open("clip_text_quantized.onnx").readBytes()
val session = env.createSession(modelBytes, OrtSession.SessionOptions())
val inputs = mapOf("input_ids" to textTensor)
val results = session.run(inputs)
```
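To chase the NPU/GPU gains mentioned in the benchmarks above, the `onnxruntime-android` package exposes an NNAPI execution provider through `SessionOptions`. A minimal sketch using the flag-free `addNnapi()` overload:

```kotlin
// Delegate supported operators to NNAPI (NPU/GPU/DSP); operators NNAPI
// cannot handle fall back to the default CPU execution provider.
val options = OrtSession.SessionOptions().apply { addNnapi() }
val session = env.createSession(modelBytes, options)
```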
## Optimization Details
We used Hugging Face Optimum and ONNX Runtime Quantization tools to achieve these results:
- Dynamic Quantization: Applied to CLIP Text and ViT Base to reduce memory footprint.
- Operator Fusion: Combined multiple layers into single kernels for faster execution.
- Precision Tuning: Kept CLIP Vision in FP32 as INT8 quantization led to significant accuracy loss (>5%).
## Use Cases
- Semantic Search: "Show me photos of mountains at sunset." (similarity ranking sketched below)
- Image Clustering: Automatically group similar photos.
- Fast Tagging: Detect objects and scenes without cloud APIs.
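Semantic search reduces to nearest-neighbor lookup over the embeddings these models produce: encode the query with the text encoder, then rank gallery images by cosine similarity against their stored vision embeddings. A minimal sketch, assuming unnormalized embeddings and illustrative names:

```kotlin
import kotlin.math.sqrt

// Cosine similarity between two embeddings (e.g. a 512-D CLIP text vector
// and a 512-D CLIP image vector).
fun cosineSimilarity(a: FloatArray, b: FloatArray): Float {
    require(a.size == b.size) { "Embeddings must have the same dimensionality" }
    var dot = 0f; var normA = 0f; var normB = 0f
    for (i in a.indices) {
        dot += a[i] * b[i]
        normA += a[i] * a[i]
        normB += b[i] * b[i]
    }
    return dot / (sqrt(normA) * sqrt(normB))
}

// Rank gallery images (path -> stored embedding) by similarity to a query embedding.
fun search(query: FloatArray, gallery: Map<String, FloatArray>): List<Pair<String, Float>> =
    gallery.map { (path, emb) -> path to cosineSimilarity(query, emb) }
        .sortedByDescending { it.second }
```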
## License
This project is licensed under the MIT License. Models are subject to their respective original licenses (OpenAI for CLIP, Google for ViT).
Maintained by JanadaSroor | Developed for AI Kit Gallery