---
|
|
language: |
|
|
- en |
|
|
tags: |
|
|
- onnx |
|
|
- vision |
|
|
- clip |
|
|
- vit |
|
|
- image-similarity |
|
|
- mobile |
|
|
- quantization |
|
|
license: mit |
|
|
pipeline_tag: feature-extraction |
|
|
--- |
|
|
|
|
|
# AI Kit Gallery - Optimized ONNX Vision Models |
|
|
|
|
|
[**AI Models Demo Notebook**](https://huggingface.co/JanadaSroor/vision-models/blob/main/AI_Models_Demo.ipynb)
|
|
[**Model Repository**](https://huggingface.co/JanadaSroor/vision-models)
|
|
|
|
|
This repository contains optimized ONNX models designed for the [AI Kit Gallery](https://github.com/JanadaSroor/AIkit) Android app. These models enable high-performance, offline AI-powered image search and categorization directly on mobile devices. |
|
|
|
|
|
## Available Models
|
|
|
|
|
### CLIP Models (OpenAI/clip-vit-base-patch32) |
|
|
- **Text Encoder**: `clip_text_quantized.onnx` (62MB)
  - **Input**: Text tokens (max length 77)
  - **Output**: 512D text embedding
  - **Optimization**: INT8 Dynamic Quantization
  - **Use Case**: Generating embeddings for text queries.

- **Vision Encoder**: `clip_vision_quantized.onnx` (337MB)
  - **Input**: 224x224 RGB images
  - **Output**: 512D image embedding
  - **Optimization**: Full precision (FP32) to maintain accuracy
  - **Use Case**: Encoding images for similarity search (see the sketch after this list).
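A minimal sketch of pairing the two encoders from Python with `onnxruntime`. It assumes preprocessing via the `transformers` processor for `openai/clip-vit-base-patch32`, that the graphs expose `input_ids` and `pixel_values` as input names, and that the first output is the pooled 512D embedding; verify with `session.get_inputs()` / `session.get_outputs()` if your export differs. `photo.jpg` is a placeholder path.

```python
# Sketch: encode a text query and an image, then score their similarity.
# Input/output names are assumptions -- check session.get_inputs() if they differ.
import numpy as np
import onnxruntime as ort
from PIL import Image
from transformers import CLIPProcessor

processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")
text_sess = ort.InferenceSession("models/clip_text_quantized.onnx")
vision_sess = ort.InferenceSession("models/clip_vision_quantized.onnx")

# Text -> 512D embedding (tokens padded to the 77-token context length)
tokens = processor(text=["mountains at sunset"], padding="max_length",
                   max_length=77, return_tensors="np")
text_emb = text_sess.run(None, {"input_ids": tokens["input_ids"].astype(np.int64)})[0]

# Image -> 512D embedding (the processor resizes/normalizes to 224x224)
pixels = processor(images=Image.open("photo.jpg").convert("RGB"),
                   return_tensors="np")["pixel_values"]
image_emb = vision_sess.run(None, {"pixel_values": pixels})[0]

# Cosine similarity of L2-normalized embeddings
text_emb /= np.linalg.norm(text_emb, axis=-1, keepdims=True)
image_emb /= np.linalg.norm(image_emb, axis=-1, keepdims=True)
print((text_emb @ image_emb.T).item())
```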
|
|
|
|
|
### ViT Model (Google/vit-base-patch16-224) |
|
|
- **Base Model**: `vit_base_quantized.onnx` (84MB)
  - **Input**: 224x224 RGB images
  - **Output**: 768D image embedding (CLS token)
  - **Optimization**: INT8 Dynamic Quantization
  - **Use Case**: Alternative high-quality vision encoder (usage sketch below).
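A corresponding sketch for the ViT encoder, under the same assumptions (`pixel_values` input name, placeholder image path):

```python
# Sketch: 768D image embedding from the quantized ViT encoder.
import onnxruntime as ort
from PIL import Image
from transformers import ViTImageProcessor

processor = ViTImageProcessor.from_pretrained("google/vit-base-patch16-224")
session = ort.InferenceSession("models/vit_base_quantized.onnx")

pixels = processor(images=Image.open("photo.jpg").convert("RGB"),
                   return_tensors="np")["pixel_values"]
embedding = session.run(None, {"pixel_values": pixels})[0]
# Expected (1, 768) per the spec above; if the graph returns the full
# token sequence (1, 197, 768) instead, take the CLS token at index 0.
print(embedding.shape)
```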
|
|
|
|
|
## Quick Start
|
|
|
|
|
### 1. Try the Interactive Demo |
|
|
You can view or download the demo notebook from Hugging Face: |
|
|
[**View AI Models Demo**](https://huggingface.co/JanadaSroor/vision-models/blob/main/AI_Models_Demo.ipynb) |
|
|
|
|
|
*To run it in Colab: Download the `.ipynb` file and upload it to [Google Colab](https://colab.research.google.com/).* |
|
|
|
|
|
### 2. Download Models |
|
|
```bash
# Install Hugging Face Hub
pip install huggingface_hub

# Download CLIP models
huggingface-cli download JanadaSroor/vision-models models/clip_text_quantized.onnx --local-dir .
huggingface-cli download JanadaSroor/vision-models models/clip_vision_quantized.onnx --local-dir .

# Download ViT model
huggingface-cli download JanadaSroor/vision-models models/vit_base_quantized.onnx --local-dir .
```
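For scripted downloads, the same files can be fetched with the `huggingface_hub` Python API (paths mirror the CLI commands above):

```python
from huggingface_hub import hf_hub_download

# Each call downloads into the local HF cache and returns the resolved path
for filename in ["models/clip_text_quantized.onnx",
                 "models/clip_vision_quantized.onnx",
                 "models/vit_base_quantized.onnx"]:
    path = hf_hub_download(repo_id="JanadaSroor/vision-models", filename=filename)
    print(path)
```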
|
|
|
|
|
## Model Specifications
|
|
|
|
|
| Model | Original Size | Compressed Size | Quantization | Input Shape | Output Shape |
|-------|---------------|-----------------|--------------|-------------|--------------|
| **CLIP Text** | ~120MB | 62MB (↓ 48%) | INT8 | `[batch, 77]` | `[batch, 512]` |
| **CLIP Vision** | ~340MB | 337MB | None (FP32) | `[batch, 3, 224, 224]` | `[batch, 512]` |
| **ViT Base** | ~340MB | 84MB (↓ 75%) | INT8 | `[batch, 3, 224, 224]` | `[batch, 768]` |
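
The shapes in the table can be verified against a downloaded model via ONNX Runtime's session metadata:

```python
import onnxruntime as ort

session = ort.InferenceSession("models/clip_text_quantized.onnx")
for inp in session.get_inputs():
    print("input: ", inp.name, inp.shape, inp.type)
for out in session.get_outputs():
    print("output:", out.name, out.shape, out.type)
```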
|
|
|
|
|
## Performance Benchmarks
|
|
|
|
|
Inference times measured in Colab on a T4 instance, with execution on CPU:
|
|
|
|
|
- **CLIP Text (INT8)**: ~12ms |
|
|
- **CLIP Vision (FP32)**: ~65ms |
|
|
- **ViT Base (INT8)**: ~55ms |
|
|
|
|
|
*Note: Mobile performance on modern Android devices (Snapdragon 8 Gen 1 and newer) is expected to be 20-30% faster due to NPU/GPU acceleration.*
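
A simple sketch for reproducing the CPU numbers (absolute times vary with the host; the `input_ids` feed is the same assumption as in the CLIP example above):

```python
import time
import numpy as np
import onnxruntime as ort

session = ort.InferenceSession("models/clip_text_quantized.onnx",
                               providers=["CPUExecutionProvider"])
feed = {"input_ids": np.ones((1, 77), dtype=np.int64)}

session.run(None, feed)  # warm-up run, excluded from timing
runs = 100
start = time.perf_counter()
for _ in range(runs):
    session.run(None, feed)
print(f"{(time.perf_counter() - start) / runs * 1000:.1f} ms per inference")
```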
|
|
|
|
|
## Deployment on Android
|
|
|
|
|
These models are optimized for [ONNX Runtime Mobile](https://onnxruntime.ai/docs/install/mobile.html). |
|
|
|
|
|
1. Copy the `.onnx` files to your project's `src/main/assets/` directory. |
|
|
2. Use the ONNX Runtime Kotlin/Java API to load and run inference: |
|
|
```kotlin
// Create the runtime environment and a session from the model bytes (e.g. loaded from assets)
val env = OrtEnvironment.getEnvironment()
val session = env.createSession(modelBytes, OrtSession.SessionOptions())

// Run inference; "input_ids" is the text encoder's input name
val results = session.run(mapOf("input_ids" to textTensor))
```
|
|
|
|
|
## Optimization Details
|
|
|
|
|
We used Hugging Face Optimum and the ONNX Runtime quantization tooling to achieve these results (a sketch of the quantization call follows this list):
|
|
- **Dynamic Quantization**: Applied to CLIP Text and ViT Base to reduce memory footprint. |
|
|
- **Operator Fusion**: Combined multiple layers into single kernels for faster execution. |
|
|
- **Precision Tuning**: Kept CLIP Vision in FP32 as INT8 quantization led to significant accuracy loss (>5%). |
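
The INT8 step follows the standard ONNX Runtime dynamic-quantization recipe; a sketch (the FP32 input filename is hypothetical, and the exact per-model settings used here may differ):

```python
from onnxruntime.quantization import QuantType, quantize_dynamic

# Weights are stored as INT8; activations are quantized on the fly at runtime
quantize_dynamic(
    model_input="clip_text_fp32.onnx",   # hypothetical FP32 export
    model_output="clip_text_quantized.onnx",
    weight_type=QuantType.QInt8,
)
```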
|
|
|
|
|
## Use Cases
|
|
|
|
|
- **Semantic Search**: "Show me photos of mountains at sunset." |
|
|
- **Image Clustering**: Automatically group similar photos. |
|
|
- **Fast Tagging**: Detect objects and scenes without cloud APIs. |
|
|
|
|
|
## License
|
|
|
|
|
This project is licensed under the MIT License. Models are subject to their respective original licenses (OpenAI for CLIP, Google for ViT). |
|
|
|
|
|
--- |
|
|
**Maintained by [JanadaSroor](https://github.com/JanadaSroor)** | Developed for [AI Kit Gallery](https://github.com/JanadaSroor/AIkit) |
|
|
|