---
language:
- en
tags:
- onnx
- vision
- clip
- vit
- image-similarity
- mobile
- quantization
license: mit
pipeline_tag: feature-extraction
---

# AI Kit Gallery - Optimized ONNX Vision Models

[Open in Colab](https://colab.research.google.com/github/JanadaSroor/vision_models/blob/main/AI_Models_Demo.ipynb)
[Hugging Face profile](https://huggingface.co/JanadaSroor)

This repository contains optimized ONNX models designed for the [AI Kit Gallery](https://github.com/JanadaSroor/AIkit) Android app. These models enable high-performance, offline AI-powered image search and categorization directly on mobile devices.

## Available Models

### CLIP Models (OpenAI/clip-vit-base-patch32)
- **Text Encoder**: `clip_text_quantized.onnx` (62MB)
  - **Input**: Text tokens (max length 77)
  - **Output**: 512D text embedding
  - **Optimization**: INT8 dynamic quantization
  - **Use Case**: Generating embeddings for text queries.

- **Vision Encoder**: `clip_vision_quantized.onnx` (337MB)
  - **Input**: 224x224 RGB images
  - **Output**: 512D image embedding
  - **Optimization**: Full precision (FP32) to maintain accuracy
  - **Use Case**: Encoding images for similarity search.

### ViT Model (Google/vit-base-patch16-224)
- **Base Model**: `vit_base_quantized.onnx` (84MB)
  - **Input**: 224x224 RGB images
  - **Output**: 768D image embedding (CLS token)
  - **Optimization**: INT8 dynamic quantization
  - **Use Case**: Alternative high-quality vision encoder.

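As a quick sanity check of the shapes listed above, here is a minimal Python sketch. It assumes `onnxruntime` and `numpy` are installed and that the models were downloaded into `./models` as shown in the Quick Start below; the input name is read from the model rather than hard-coded:

```python
import numpy as np
import onnxruntime as ort

# Load the quantized ViT encoder (fetched in the Quick Start below).
session = ort.InferenceSession("models/vit_base_quantized.onnx")

# Dummy 224x224 RGB batch; real images must be resized and normalized first.
image = np.random.rand(1, 3, 224, 224).astype(np.float32)

# Read the input name from the model instead of assuming it.
input_name = session.get_inputs()[0].name
outputs = session.run(None, {input_name: image})
print(outputs[0].shape)  # expected: (1, 768) per the spec table below
```
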
## Quick Start

### 1. Try the Interactive Demo
Test the models immediately using our Google Colab notebook:
[**Run AI Models Demo in Colab**](https://colab.research.google.com/github/JanadaSroor/vision_models/blob/main/AI_Models_Demo.ipynb)

### 2. Download Models
```bash
# Install the Hugging Face Hub CLI
pip install huggingface_hub

# Download the CLIP models
huggingface-cli download JanadaSroor/vision-models clip_text_quantized.onnx --local-dir ./models
huggingface-cli download JanadaSroor/vision-models clip_vision_quantized.onnx --local-dir ./models

# Download the ViT model
huggingface-cli download JanadaSroor/vision-models vit_base_quantized.onnx --local-dir ./models
```
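
If you prefer Python over the CLI, the same files can be fetched with `huggingface_hub` directly; this sketch is equivalent to the commands above:

```python
from huggingface_hub import hf_hub_download

# Fetch each model file into ./models, mirroring the CLI commands above.
for filename in [
    "clip_text_quantized.onnx",
    "clip_vision_quantized.onnx",
    "vit_base_quantized.onnx",
]:
    hf_hub_download(
        repo_id="JanadaSroor/vision-models",
        filename=filename,
        local_dir="./models",
    )
```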

## Model Specifications

| Model | Original Size | Compressed Size | Quantization | Input Shape | Output Shape |
|-------|---------------|-----------------|--------------|-------------|--------------|
| **CLIP Text** | ~120MB | 62MB (↓48%) | INT8 | `[batch, 77]` | `[batch, 512]` |
| **CLIP Vision** | ~340MB | 337MB | FP32 | `[batch, 3, 224, 224]` | `[batch, 512]` |
| **ViT Base** | ~340MB | 84MB (↓75%) | INT8 | `[batch, 3, 224, 224]` | `[batch, 768]` |

## Performance Benchmarks

Inference times measured on a Colab T4 instance running in CPU-only mode:

- **CLIP Text (INT8)**: ~12ms
- **CLIP Vision (FP32)**: ~65ms
- **ViT Base (INT8)**: ~55ms

*Note: Mobile performance on modern Android devices (Snapdragon 8 Gen 1 and newer) is expected to be 20-30% faster due to NPU/GPU acceleration.*
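
To reproduce numbers of this kind, a simple wall-clock loop is enough. The sketch below assumes the CLIP text encoder takes int64 token IDs (typical for CLIP ONNX exports, but an assumption here) and averages over repeated runs after a warm-up:

```python
import time

import numpy as np
import onnxruntime as ort

session = ort.InferenceSession("models/clip_text_quantized.onnx")
input_name = session.get_inputs()[0].name

# Dummy token IDs shaped [batch, 77]; int64 is an assumption about the export.
tokens = np.ones((1, 77), dtype=np.int64)

session.run(None, {input_name: tokens})  # warm-up (first run pays setup costs)
runs = 20
start = time.perf_counter()
for _ in range(runs):
    session.run(None, {input_name: tokens})
print(f"avg latency: {(time.perf_counter() - start) / runs * 1000:.1f} ms")
```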

## Deployment on Android

These models are optimized for [ONNX Runtime Mobile](https://onnxruntime.ai/docs/install/mobile.html).

1. Copy the `.onnx` files to your project's `src/main/assets/` directory.
2. Use the ONNX Runtime Kotlin/Java API to load and run inference:
```kotlin
import ai.onnxruntime.*

// The environment is a process-wide singleton; sessions are created from it.
val env = OrtEnvironment.getEnvironment()
val session = env.createSession(modelBytes, OrtSession.SessionOptions())
val inputs = mapOf("input_ids" to textTensor)
val results = session.run(inputs)
```
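
`OrtEnvironment.getEnvironment()` returns a process-wide singleton, so a single environment can be shared by the sessions for all three models. Create each `OrtSession` once and reuse it across calls; session creation is far more expensive than an individual inference.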

## Optimization Details

We used [Hugging Face Optimum](https://huggingface.co/docs/optimum) and the ONNX Runtime quantization tools to achieve these results:

- **Dynamic Quantization**: Applied to CLIP Text and ViT Base to reduce the memory footprint (see the sketch after this list).
- **Operator Fusion**: Combined multiple layers into single kernels for faster execution.
- **Precision Tuning**: Kept CLIP Vision in FP32, as INT8 quantization led to significant accuracy loss (>5%).
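
For reference, the dynamic-quantization step looks roughly like this with the ONNX Runtime Python tooling (file names here are illustrative, not the exact scripts used):

```python
from onnxruntime.quantization import QuantType, quantize_dynamic

# Quantize weights to INT8 at export time; activations are quantized
# dynamically at runtime. File names are illustrative.
quantize_dynamic(
    model_input="clip_text_fp32.onnx",
    model_output="clip_text_quantized.onnx",
    weight_type=QuantType.QInt8,
)
```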

## Use Cases

- **Semantic Search**: "Show me photos of mountains at sunset." (see the matching sketch after this list)
- **Image Clustering**: Automatically group similar photos.
- **Fast Tagging**: Detect objects and scenes without cloud APIs.

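For the semantic-search case, matching boils down to cosine similarity between the 512D text and image embeddings. A minimal sketch (variable names and random inputs are illustrative stand-ins for real encoder outputs):

```python
import numpy as np

def cosine_scores(text_embedding: np.ndarray, image_embeddings: np.ndarray) -> np.ndarray:
    """Score a (512,) text embedding against an (N, 512) matrix of image embeddings."""
    text = text_embedding / np.linalg.norm(text_embedding)
    images = image_embeddings / np.linalg.norm(image_embeddings, axis=1, keepdims=True)
    return images @ text  # shape (N,); higher means more similar

# Random stand-ins for real CLIP text/vision encoder outputs:
scores = cosine_scores(np.random.rand(512), np.random.rand(100, 512))
print(scores.argmax())  # index of the best-matching image
```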

## License

This project is licensed under the MIT License. Models are subject to their respective original licenses (OpenAI for CLIP, Google for ViT).

---
**Maintained by [JanadaSroor](https://github.com/JanadaSroor)** | Developed for [AI Kit Gallery](https://github.com/JanadaSroor/AIkit)