AI Kit Gallery - Optimized ONNX Vision Models

View on Hugging Face

This repository contains optimized ONNX models designed for the AI Kit Gallery Android app. These models enable high-performance, offline AI-powered image search and categorization directly on mobile devices.

πŸ“ Available Models

CLIP Models (OpenAI/clip-vit-base-patch32)

  • Text Encoder: clip_text_quantized.onnx (62MB)

    • Input: Text token IDs (max length 77)
    • Output: 512D text embedding
    • Optimization: INT8 Dynamic Quantization
    • Use Case: Generating embeddings for text queries.
  • Vision Encoder: clip_vision_quantized.onnx (337MB)

    • Input: 224x224 RGB images
    • Output: 512D image embedding
    • Optimization: Full precision (FP32) to maintain accuracy
    • Use Case: Encoding images for similarity search (see the similarity sketch below).
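
Both CLIP encoders project into the same 512-D space, so a text query can be matched against image embeddings with plain cosine similarity. Below is a minimal Kotlin sketch; it assumes you already have the two embeddings as FloatArray values (obtaining them is covered in the deployment section):

// Cosine similarity between a text embedding and an image embedding.
// Both CLIP encoders emit 512-D vectors in a shared space.
fun cosineSimilarity(a: FloatArray, b: FloatArray): Float {
    var dot = 0f; var normA = 0f; var normB = 0f
    for (i in a.indices) {
        dot += a[i] * b[i]
        normA += a[i] * a[i]
        normB += b[i] * b[i]
    }
    return dot / (kotlin.math.sqrt(normA) * kotlin.math.sqrt(normB))
}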

ViT Model (Google/vit-base-patch16-224)

  • Base Model: vit_base_quantized.onnx (84MB)
    • Input: 224x224 RGB images (see the preprocessing sketch below)
    • Output: 768D image embedding (CLS token)
    • Optimization: INT8 Dynamic Quantization
    • Use Case: Alternative high-quality vision encoder.
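
Both vision models expect the same input layout: a 224x224 RGB image flattened into a [1, 3, 224, 224] CHW float tensor. A minimal Android sketch is below; the mean=0.5/std=0.5 normalization is the standard preprocessing for this ViT checkpoint, while CLIP uses different constants, so verify against the preprocessor config the models were exported with:

import android.graphics.Bitmap

// Convert a Bitmap into a [1, 3, 224, 224] CHW float array.
// Normalization here (scale to [0,1], then (x - 0.5) / 0.5 per channel)
// follows the usual ViT preprocessing; CLIP uses its own mean/std values.
fun bitmapToChw(bitmap: Bitmap): FloatArray {
    val resized = Bitmap.createScaledBitmap(bitmap, 224, 224, true)
    val pixels = IntArray(224 * 224)
    resized.getPixels(pixels, 0, 224, 0, 0, 224, 224)
    val out = FloatArray(3 * 224 * 224)
    for (i in pixels.indices) {
        val p = pixels[i]
        out[i] = ((p shr 16 and 0xFF) / 255f - 0.5f) / 0.5f              // R plane
        out[224 * 224 + i] = ((p shr 8 and 0xFF) / 255f - 0.5f) / 0.5f   // G plane
        out[2 * 224 * 224 + i] = ((p and 0xFF) / 255f - 0.5f) / 0.5f     // B plane
    }
    return out
}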

πŸš€ Quick Start

1. Try the Interactive Demo

You can view or download the demo notebook from Hugging Face: View AI Models Demo

To run it in Colab: Download the .ipynb file and upload it to Google Colab.

2. Download Models

# Install Hugging Face Hub
pip install huggingface_hub

# Download CLIP Models
huggingface-cli download JanadaSroor/vision-models models/clip_text_quantized.onnx --local-dir .
huggingface-cli download JanadaSroor/vision-models models/clip_vision_quantized.onnx --local-dir .

# Download ViT Model
huggingface-cli download JanadaSroor/vision-models models/vit_base_quantized.onnx --local-dir .

πŸ“Š Model Specifications

| Model | Original Size | Compressed Size | Quantization | Input Shape | Output Shape |
| --- | --- | --- | --- | --- | --- |
| CLIP Text | ~120MB | 62MB (⬇️ 48%) | ✅ INT8 | [batch, 77] | [batch, 512] |
| CLIP Vision | ~340MB | 337MB | ❌ FP32 | [batch, 3, 224, 224] | [batch, 512] |
| ViT Base | ~340MB | 84MB (⬇️ 75%) | ✅ INT8 | [batch, 3, 224, 224] | [batch, 768] |

πŸƒ Performance Benchmarks

Inference times were measured on a standard Colab T4 instance running in CPU mode:

  • CLIP Text (INT8): ~12ms
  • CLIP Vision (FP32): ~65ms
  • ViT Base (INT8): ~55ms

Note: Mobile performance on modern Android devices (Snapdragon 8 Gen 1 and newer) is expected to be 20-30% faster due to NPU/GPU acceleration.
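
To reproduce these numbers on-device, a warm-up-then-average loop is enough. A minimal sketch, where the runOnce lambda stands in for a full session.run call as shown in the next section:

import kotlin.system.measureNanoTime

// Measure mean inference latency: discard warm-up runs, then average.
fun benchmarkMs(runOnce: () -> Unit, warmup: Int = 5, iterations: Int = 20): Double {
    repeat(warmup) { runOnce() }   // first runs are slower (JIT, allocations)
    val totalNs = (1..iterations).sumOf { measureNanoTime { runOnce() } }
    return totalNs.toDouble() / iterations / 1e6
}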

πŸ”§ Deployment in Android

These models are optimized for ONNX Runtime Mobile.

  1. Copy the .onnx files to your project's src/main/assets/ directory.
  2. Use the ONNX Runtime Kotlin/Java API to load and run inference:
// Sessions are created from the OrtEnvironment, not via OrtSession directly.
val env = OrtEnvironment.getEnvironment()
val session = env.createSession(modelBytes, OrtSession.SessionOptions())
val inputs = mapOf("input_ids" to textTensor)
val results = session.run(inputs)
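
A fuller, self-contained sketch of the vision path follows. The input name "pixel_values" is an assumption (typical for Optimum exports); confirm the actual names with session.inputNames. In a real app, create the environment and session once and reuse them across images:

import ai.onnxruntime.OnnxTensor
import ai.onnxruntime.OrtEnvironment
import ai.onnxruntime.OrtSession
import android.content.Context
import java.nio.FloatBuffer

// Load the vision encoder from assets and embed one preprocessed image.
// Uses bitmapToChw() from the model section above to build the CHW input.
fun encodeImage(context: Context, chw: FloatArray): FloatArray {
    val env = OrtEnvironment.getEnvironment()
    val modelBytes = context.assets.open("clip_vision_quantized.onnx").use { it.readBytes() }
    val session = env.createSession(modelBytes, OrtSession.SessionOptions())
    val input = OnnxTensor.createTensor(env, FloatBuffer.wrap(chw), longArrayOf(1, 3, 224, 224))
    session.run(mapOf("pixel_values" to input)).use { results ->
        val output = results.get(0).value as Array<FloatArray>  // shape [1, 512]
        return output[0]
    }
}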

πŸ“ˆ Optimization Details

We used Hugging Face Optimum and ONNX Runtime Quantization tools to achieve these results:

  • Dynamic Quantization: Applied to CLIP Text and ViT Base to reduce memory footprint.
  • Operator Fusion: Combined multiple layers into single kernels for faster execution.
  • Precision Tuning: Kept CLIP Vision in FP32 as INT8 quantization led to significant accuracy loss (>5%).

πŸ” Use Cases

  • Semantic Search: "Show me photos of mountains at sunset." (see the ranking sketch after this list)
  • Image Clustering: Automatically group similar photos.
  • Fast Tagging: Detect objects and scenes without cloud APIs.
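
As a concrete instance of the semantic-search case, the sketch below ranks a gallery against an encoded text query. It assumes image embeddings were precomputed with the vision encoder and reuses cosineSimilarity() from the CLIP section:

// Rank gallery photos against a text query embedding; returns the top-K matches.
fun search(
    queryEmbedding: FloatArray,
    gallery: Map<String, FloatArray>,  // photo path -> precomputed image embedding
    topK: Int = 10
): List<Pair<String, Float>> =
    gallery.map { (path, emb) -> path to cosineSimilarity(queryEmbedding, emb) }
        .sortedByDescending { it.second }
        .take(topK)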

πŸ“„ License

This project is licensed under the MIT License. Models are subject to their respective original licenses (OpenAI for CLIP, Google for ViT).


Maintained by JanadaSroor | Developed for AI Kit Gallery
