JanadaSroor committed e66bdce (verified) · Parent: 4c9cc27

Upload README.md with huggingface_hub

Files changed (1): README.md (+112, -0)
---
language:
- en
tags:
- onnx
- vision
- clip
- vit
- image-similarity
- mobile
- quantization
license: mit
pipeline_tag: feature-extraction
---

# AI Kit Gallery - Optimized ONNX Vision Models

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/JanadaSroor/vision_models/blob/main/AI_Models_Demo.ipynb)
[![Hugging Face](https://img.shields.io/badge/%F0%9F%A4%97%20Hugging%20Face-Models-blue)](https://huggingface.co/JanadaSroor)

This repository contains optimized ONNX models designed for the [AI Kit Gallery](https://github.com/JanadaSroor/AIkit) Android app. These models enable high-performance, offline, AI-powered image search and categorization directly on mobile devices.

## 📁 Available Models

### CLIP Models (OpenAI/clip-vit-base-patch32)
- **Text Encoder**: `clip_text_quantized.onnx` (62MB)
  - **Input**: Text tokens (max length 77)
  - **Output**: 512D text embedding
  - **Optimization**: INT8 Dynamic Quantization
  - **Use Case**: Generating embeddings for text queries.

- **Vision Encoder**: `clip_vision_quantized.onnx` (337MB)
  - **Input**: 224x224 RGB images
  - **Output**: 512D image embedding
  - **Optimization**: Full precision (FP32) to maintain accuracy
  - **Use Case**: Encoding images for similarity search.

### ViT Model (Google/vit-base-patch16-224)
- **Base Model**: `vit_base_quantized.onnx` (84MB)
  - **Input**: 224x224 RGB images
  - **Output**: 768D image embedding (CLS token)
  - **Optimization**: INT8 Dynamic Quantization
  - **Use Case**: Alternative high-quality vision encoder.

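As a concrete illustration of how these encoders are used, here is a minimal Python sketch that turns a text query into a 512D embedding with `onnxruntime`. The input name `input_ids`, the optional `attention_mask`, and the assumption that the first output is the pooled embedding are not guaranteed by the export, so check `session.get_inputs()` / `get_outputs()` against your copy of the model:

```python
# Minimal sketch: encode a text query with the quantized CLIP text encoder.
# Assumed here (not guaranteed by the export): an "input_ids" input, an
# optional "attention_mask", and the pooled 512-D embedding as first output.
import numpy as np
import onnxruntime as ort
from transformers import CLIPTokenizerFast

tokenizer = CLIPTokenizerFast.from_pretrained("openai/clip-vit-base-patch32")
session = ort.InferenceSession("models/clip_text_quantized.onnx")

tokens = tokenizer(["mountains at sunset"], padding="max_length",
                   max_length=77, return_tensors="np")
inputs = {"input_ids": tokens["input_ids"].astype(np.int64)}
if any(i.name == "attention_mask" for i in session.get_inputs()):
    inputs["attention_mask"] = tokens["attention_mask"].astype(np.int64)

text_embedding = session.run(None, inputs)[0]   # shape: [1, 512]
text_embedding = text_embedding / np.linalg.norm(text_embedding, axis=-1, keepdims=True)
```

The image side works the same way: feed a `[1, 3, 224, 224]` float32 tensor to the vision encoder and compare the resulting embeddings by cosine similarity.
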
## 🚀 Quick Start

### 1. Try the Interactive Demo
Test the models immediately using our Google Colab notebook:
[**Run AI Models Demo in Colab**](https://colab.research.google.com/github/JanadaSroor/vision_models/blob/main/AI_Models_Demo.ipynb)

### 2. Download Models
```bash
# Install Hugging Face Hub
pip install huggingface_hub

# Download CLIP Models
huggingface-cli download JanadaSroor/vision-models clip_text_quantized.onnx --local-dir ./models
huggingface-cli download JanadaSroor/vision-models clip_vision_quantized.onnx --local-dir ./models

# Download ViT Model
huggingface-cli download JanadaSroor/vision-models vit_base_quantized.onnx --local-dir ./models
```

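If you prefer the Python API (for example inside the Colab demo), the same files can be fetched with `huggingface_hub`; this is simply the Python equivalent of the CLI commands above:

```python
# Equivalent of the CLI commands above, using the huggingface_hub Python API.
from huggingface_hub import hf_hub_download

for filename in ("clip_text_quantized.onnx",
                 "clip_vision_quantized.onnx",
                 "vit_base_quantized.onnx"):
    path = hf_hub_download(repo_id="JanadaSroor/vision-models",
                           filename=filename, local_dir="./models")
    print(f"Downloaded {filename} -> {path}")
```
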
## 📊 Model Specifications

| Model | Original Size | Compressed Size | Quantization | Input Shape | Output Shape |
|-------|---------------|-----------------|--------------|-------------|--------------|
| **CLIP Text** | ~120MB | 62MB (⬇️ 48%) | ✅ INT8 | `[batch, 77]` | `[batch, 512]` |
| **CLIP Vision** | ~340MB | 337MB | ❌ FP32 | `[batch, 3, 224, 224]` | `[batch, 512]` |
| **ViT Base** | ~340MB | 84MB (⬇️ 75%) | ✅ INT8 | `[batch, 3, 224, 224]` | `[batch, 768]` |

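Both vision models expect the `[batch, 3, 224, 224]` layout shown in the table. A minimal preprocessing sketch, assuming CLIP's standard resize and normalization constants (the ViT checkpoint normally uses different mean/std values, so adjust per model):

```python
# Sketch: turn an image file into the [1, 3, 224, 224] float32 tensor the
# vision models expect. Mean/std are the standard CLIP values (an assumption
# here); google/vit-base-patch16-224 normally uses mean = std = 0.5.
import numpy as np
from PIL import Image

CLIP_MEAN = np.array([0.48145466, 0.4578275, 0.40821073], dtype=np.float32)
CLIP_STD = np.array([0.26862954, 0.26130258, 0.27577711], dtype=np.float32)

def preprocess(path: str) -> np.ndarray:
    img = Image.open(path).convert("RGB").resize((224, 224))
    x = np.asarray(img, dtype=np.float32) / 255.0   # HWC, values in [0, 1]
    x = (x - CLIP_MEAN) / CLIP_STD                  # per-channel normalization
    return x.transpose(2, 0, 1)[np.newaxis, :]      # CHW + batch dim
```
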
## 🏃 Performance Benchmarks

Inference times measured on a standard Colab T4 instance running in CPU mode:

- **CLIP Text (INT8)**: ~12ms
- **CLIP Vision (FP32)**: ~65ms
- **ViT Base (INT8)**: ~55ms

*Note: On modern Android devices (Snapdragon 8 Gen 1 or newer), performance is expected to be 20-30% faster thanks to NPU/GPU acceleration.*

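These numbers can be sanity-checked with a rough timing loop like the sketch below (the file name and the `[1, 3, 224, 224]` dummy input are assumptions; swap in the model you want to measure):

```python
# Rough latency check: average wall-clock time over repeated runs after a warm-up.
import time
import numpy as np
import onnxruntime as ort

session = ort.InferenceSession("models/vit_base_quantized.onnx")
name = session.get_inputs()[0].name
dummy = {name: np.random.rand(1, 3, 224, 224).astype(np.float32)}

session.run(None, dummy)                                # warm-up
runs = 20
start = time.perf_counter()
for _ in range(runs):
    session.run(None, dummy)
print(f"avg latency: {(time.perf_counter() - start) / runs * 1000:.1f} ms")
```
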
## 🔧 Deployment in Android

These models are optimized for [ONNX Runtime Mobile](https://onnxruntime.ai/docs/install/mobile.html).

1. Copy the `.onnx` files to your project's `src/main/assets/` directory.
2. Use the ONNX Runtime Kotlin/Java API to load and run inference:
```kotlin
// modelBytes: ByteArray read from assets/; sessions are created via the OrtEnvironment
val env = OrtEnvironment.getEnvironment()
val session = env.createSession(modelBytes, OrtSession.SessionOptions())
val inputs = mapOf("input_ids" to textTensor)  // textTensor: an ai.onnxruntime.OnnxTensor
val results = session.run(inputs)
```

## 📈 Optimization Details

We used Hugging Face Optimum and the ONNX Runtime quantization tools to achieve these results:
- **Dynamic Quantization**: Applied to CLIP Text and ViT Base to reduce the memory footprint.
- **Operator Fusion**: Combined multiple layers into single kernels for faster execution.
- **Precision Tuning**: Kept CLIP Vision in FP32 because INT8 quantization led to a significant accuracy loss (>5%).

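For reference, dynamic INT8 quantization of this kind can be reproduced with ONNX Runtime's quantizer; a minimal sketch of what the call can look like, with placeholder file names:

```python
# Sketch: produce an INT8 dynamically quantized model from an FP32 export.
# Weights are quantized ahead of time; activations are quantized at runtime.
from onnxruntime.quantization import QuantType, quantize_dynamic

quantize_dynamic(
    model_input="clip_text_fp32.onnx",        # FP32 export (placeholder filename)
    model_output="clip_text_quantized.onnx",  # INT8 result
    weight_type=QuantType.QInt8,
)
```
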
## 🔍 Use Cases

- **Semantic Search**: "Show me photos of mountains at sunset."
- **Image Clustering**: Automatically group similar photos.
- **Fast Tagging**: Detect objects and scenes without cloud APIs.

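All of these reduce to comparing embeddings. For semantic search in particular, the CLIP text and image embeddings share a 512D space, so ranking is a cosine-similarity computation like the small sketch below (the embedding arrays are assumed to come from the encoders above):

```python
# Sketch: rank gallery images by cosine similarity to a text query embedding.
# text_emb: shape [512]; image_embs: shape [N, 512], precomputed by the vision encoder.
import numpy as np

def rank_images(text_emb: np.ndarray, image_embs: np.ndarray) -> np.ndarray:
    text_emb = text_emb / np.linalg.norm(text_emb)
    image_embs = image_embs / np.linalg.norm(image_embs, axis=1, keepdims=True)
    scores = image_embs @ text_emb        # one cosine similarity per image
    return np.argsort(-scores)            # indices of the best matches first
```
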
## 📄 License

This project is licensed under the MIT License. Models are subject to their respective original licenses (OpenAI for CLIP, Google for ViT).

---
**Maintained by [JanadaSroor](https://github.com/JanadaSroor)** | Developed for [AI Kit Gallery](https://github.com/JanadaSroor/AIkit)