---
language:
- en
tags:
- onnx
- vision
- clip
- vit
- image-similarity
- mobile
- quantization
license: mit
pipeline_tag: feature-extraction
---

# AI Kit Gallery - Optimized ONNX Vision Models

[![View on Hugging Face](https://img.shields.io/badge/%F0%9F%A4%97%20View%20Demo-Hugging%20Face-orange)](https://huggingface.co/JanadaSroor/vision-models/blob/main/AI_Models_Demo.ipynb)
[![Hugging Face](https://img.shields.io/badge/%F0%9F%A4%97%20Hugging%20Face-Models-blue)](https://huggingface.co/JanadaSroor/vision-models)

This repository contains optimized ONNX models designed for the [AI Kit Gallery](https://github.com/JanadaSroor/AIkit) Android app. These models enable high-performance, offline AI-powered image search and categorization directly on mobile devices.

## 📁 Available Models

### CLIP Models (OpenAI/clip-vit-base-patch32)

- **Text Encoder**: `clip_text_quantized.onnx` (62MB)
  - **Input**: Text tokens (max length 77)
  - **Output**: 512D text embedding
  - **Optimization**: INT8 Dynamic Quantization
  - **Use Case**: Generating embeddings for text queries.
- **Vision Encoder**: `clip_vision_quantized.onnx` (337MB)
  - **Input**: 224x224 RGB images
  - **Output**: 512D image embedding
  - **Optimization**: Full precision (FP32) to maintain accuracy
  - **Use Case**: Encoding images for similarity search.

### ViT Model (Google/vit-base-patch16-224)

- **Base Model**: `vit_base_quantized.onnx` (84MB)
  - **Input**: 224x224 RGB images
  - **Output**: 768D image embedding (CLS token)
  - **Optimization**: INT8 Dynamic Quantization
  - **Use Case**: Alternative high-quality vision encoder.

## 🚀 Quick Start

### 1. Try the Interactive Demo

You can view or download the demo notebook from Hugging Face:

[**View AI Models Demo**](https://huggingface.co/JanadaSroor/vision-models/blob/main/AI_Models_Demo.ipynb)

*To run it in Colab: download the `.ipynb` file and upload it to [Google Colab](https://colab.research.google.com/).*

### 2. Download Models

```bash
# Install Hugging Face Hub
pip install huggingface_hub

# Download CLIP Models
huggingface-cli download JanadaSroor/vision-models models/clip_text_quantized.onnx --local-dir .
huggingface-cli download JanadaSroor/vision-models models/clip_vision_quantized.onnx --local-dir .

# Download ViT Model
huggingface-cli download JanadaSroor/vision-models models/vit_base_quantized.onnx --local-dir .
```
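### 3. Encode a Text Query (Python)

Before wiring the models into the app, it can help to verify them on desktop. The snippet below is a minimal sketch of running the quantized CLIP text encoder with `onnxruntime`, using the `openai/clip-vit-base-patch32` tokenizer from `transformers`. The file path matches the download commands above; the input names (e.g. `input_ids`) and int64 token IDs are assumptions about the export, so check `session.get_inputs()` if your copy differs.

```python
import numpy as np
import onnxruntime as ort
from transformers import CLIPTokenizer  # pip install onnxruntime transformers

# Tokenize the query the way CLIP expects: padded/truncated to 77 tokens
tokenizer = CLIPTokenizer.from_pretrained("openai/clip-vit-base-patch32")
tokens = tokenizer(
    ["photos of mountains at sunset"],
    padding="max_length", max_length=77, truncation=True, return_tensors="np",
)

# Load the INT8 text encoder and feed only the inputs the graph actually declares
session = ort.InferenceSession("models/clip_text_quantized.onnx")
feed = {
    inp.name: tokens[inp.name].astype(np.int64)
    for inp in session.get_inputs()
    if inp.name in tokens
}
text_embedding = session.run(None, feed)[0]
print(text_embedding.shape)  # expected per the model card: (1, 512)
```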
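### 4. Encode an Image (Python)

In the same spirit, this sketch encodes an image with the FP32 vision encoder. The CLIP image preprocessing from `transformers`, the hypothetical `photo.jpg` path, and the single image input are assumptions about the export; cosine similarity between image and text embeddings is the usual ranking score for this kind of search.

```python
import numpy as np
import onnxruntime as ort
from PIL import Image
from transformers import CLIPImageProcessor  # pip install pillow transformers

# Resize, center-crop and normalize to the 224x224 RGB layout the encoder expects
processor = CLIPImageProcessor.from_pretrained("openai/clip-vit-base-patch32")
image = Image.open("photo.jpg").convert("RGB")  # hypothetical local image
pixel_values = processor(images=image, return_tensors="np")["pixel_values"].astype(np.float32)

# Run the FP32 vision encoder on its single image input, whatever its exported name
session = ort.InferenceSession("models/clip_vision_quantized.onnx")
input_name = session.get_inputs()[0].name
image_embedding = session.run(None, {input_name: pixel_values})[0]
print(image_embedding.shape)  # expected per the model card: (1, 512)

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two embeddings, the usual CLIP search score."""
    a, b = a.flatten(), b.flatten()
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# e.g. cosine(image_embedding, text_embedding) ranks images for the query from step 3
```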
## 📊 Model Specifications

| Model | Original Size | Compressed Size | Quantization | Input Shape | Output Shape |
|-------|---------------|-----------------|--------------|-------------|--------------|
| **CLIP Text** | ~120MB | 62MB (⬇️ 48%) | ✅ INT8 | `[batch, 77]` | `[batch, 512]` |
| **CLIP Vision** | ~340MB | 337MB | ❌ FP32 | `[batch, 3, 224, 224]` | `[batch, 512]` |
| **ViT Base** | ~340MB | 84MB (⬇️ 75%) | ✅ INT8 | `[batch, 3, 224, 224]` | `[batch, 768]` |

## 🏃 Performance Benchmarks

Inference times measured in Colab on a T4 instance running in CPU-only mode:

- **CLIP Text (INT8)**: ~12ms
- **CLIP Vision (FP32)**: ~65ms
- **ViT Base (INT8)**: ~55ms

*Note: Mobile inference on modern Android devices (Snapdragon 8 Gen 1 and newer) is expected to be 20-30% faster thanks to NPU/GPU acceleration.*

## 🔧 Deployment in Android

These models are optimized for [ONNX Runtime Mobile](https://onnxruntime.ai/docs/install/mobile.html).

1. Copy the `.onnx` files to your project's `src/main/assets/` directory.
2. Use the ONNX Runtime Kotlin/Java API to load and run inference:

```kotlin
// Create a session from the model bytes read from assets
val env = OrtEnvironment.getEnvironment()
val session = env.createSession(modelBytes, OrtSession.SessionOptions())
val inputs = mapOf("input_ids" to textTensor)  // textTensor: a prepared OnnxTensor
val results = session.run(inputs)
```

## 📈 Optimization Details

We used Hugging Face Optimum and the ONNX Runtime quantization tools to achieve these results:

- **Dynamic Quantization**: Applied to CLIP Text and ViT Base to reduce memory footprint.
- **Operator Fusion**: Combined multiple layers into single kernels for faster execution.
- **Precision Tuning**: Kept CLIP Vision in FP32, as INT8 quantization led to significant accuracy loss (>5%).

## 🔍 Use Cases

- **Semantic Search**: "Show me photos of mountains at sunset."
- **Image Clustering**: Automatically group similar photos.
- **Fast Tagging**: Detect objects and scenes without cloud APIs.

## 📄 License

This project is licensed under the MIT License. The models remain subject to their respective original licenses (OpenAI for CLIP, Google for ViT).

---

**Maintained by [JanadaSroor](https://github.com/JanadaSroor)** | Developed for [AI Kit Gallery](https://github.com/JanadaSroor/AIkit)