---
language:
- en
tags:
- onnx
- vision
- clip
- vit
- image-similarity
- mobile
- quantization
license: mit
pipeline_tag: feature-extraction
---

# AI Kit Gallery - Optimized ONNX Vision Models

[![View on Hugging Face](https://img.shields.io/badge/%F0%9F%A4%97%20View%20Demo-Hugging%20Face-orange)](https://huggingface.co/JanadaSroor/vision-models/blob/main/AI_Models_Demo.ipynb)
[![Hugging Face](https://img.shields.io/badge/%F0%9F%A4%97%20Hugging%20Face-Models-blue)](https://huggingface.co/JanadaSroor/vision-models)

This repository contains optimized ONNX models designed for the [AI Kit Gallery](https://github.com/JanadaSroor/AIkit) Android app. These models enable high-performance, offline AI-powered image search and categorization directly on mobile devices.

## πŸ“ Available Models

### CLIP Models (OpenAI/clip-vit-base-patch32)
- **Text Encoder**: `clip_text_quantized.onnx` (62MB)
  - **Input**: Text tokens (Max length 77)
  - **Output**: 512D text embedding
  - **Optimization**: INT8 Dynamic Quantization
  - **Use Case**: Generating embeddings for text queries (see the text-encoding sketch below).

- **Vision Encoder**: `clip_vision_quantized.onnx` (337MB)
  - **Input**: 224x224 RGB images
  - **Output**: 512D image embedding
  - **Optimization**: Full precision (FP32) to maintain accuracy
  - **Use Case**: Encoding images for similarity search (see the image-encoding sketch below).
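
A minimal Python sketch for generating a text embedding (assuming the export declares `input_ids`, and possibly `attention_mask`, as inputs, and that the first output is the pooled embedding; verify both with `session.get_inputs()` / `session.get_outputs()`):

```python
import onnxruntime as ort
from transformers import CLIPTokenizer

tokenizer = CLIPTokenizer.from_pretrained("openai/clip-vit-base-patch32")
session = ort.InferenceSession("models/clip_text_quantized.onnx")

# CLIP uses a fixed context length of 77 tokens
tokens = tokenizer(["a photo of a mountain at sunset"], padding="max_length",
                   max_length=77, truncation=True, return_tensors="np")

# Feed only the inputs this particular export actually declares
feed = {i.name: tokens[i.name] for i in session.get_inputs() if i.name in tokens}
text_embedding = session.run(None, feed)[0]  # assumed shape: [batch, 512]
```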

### ViT Model (Google/vit-base-patch16-224)
- **Base Model**: `vit_base_quantized.onnx` (84MB)
  - **Input**: 224x224 RGB images
  - **Output**: 768D image embedding (CLS token)
  - **Optimization**: INT8 Dynamic Quantization
  - **Use Case**: Alternative high-quality vision encoder.
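
Both image encoders take a `[1, 3, 224, 224]` float tensor. A minimal preprocessing-and-inference sketch, assuming CLIP's published normalization constants (`photo.jpg` is a placeholder; the ViT model expects a per-channel mean/std of 0.5 instead):

```python
import numpy as np
import onnxruntime as ort
from PIL import Image

CLIP_MEAN = np.array([0.48145466, 0.4578275, 0.40821073], dtype=np.float32)
CLIP_STD = np.array([0.26862954, 0.26130258, 0.27577711], dtype=np.float32)

def preprocess(path: str) -> np.ndarray:
    """Resize to 224x224 RGB, scale to [0, 1], normalize, add a batch dim."""
    img = Image.open(path).convert("RGB").resize((224, 224))
    arr = np.asarray(img, dtype=np.float32) / 255.0
    arr = (arr - CLIP_MEAN) / CLIP_STD
    return arr.transpose(2, 0, 1)[np.newaxis]  # [1, 3, 224, 224]

session = ort.InferenceSession("models/clip_vision_quantized.onnx")
feed = {session.get_inputs()[0].name: preprocess("photo.jpg")}
image_embedding = session.run(None, feed)[0]  # assumed shape: [1, 512]
```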

## 🚀 Quick Start

### 1. Try the Interactive Demo
You can view or download the demo notebook from Hugging Face:
[**View AI Models Demo**](https://huggingface.co/JanadaSroor/vision-models/blob/main/AI_Models_Demo.ipynb)

*To run it in Colab: Download the `.ipynb` file and upload it to [Google Colab](https://colab.research.google.com/).*

### 2. Download Models
```bash
# Install Hugging Face Hub
pip install huggingface_hub

# Download CLIP Models
huggingface-cli download JanadaSroor/vision-models models/clip_text_quantized.onnx --local-dir .
huggingface-cli download JanadaSroor/vision-models models/clip_vision_quantized.onnx --local-dir .

# Download ViT Model
huggingface-cli download JanadaSroor/vision-models models/vit_base_quantized.onnx --local-dir .
```
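
Or, equivalently, from Python:

```python
from huggingface_hub import hf_hub_download

# Downloads into the local Hugging Face cache and returns the resolved path
model_path = hf_hub_download(repo_id="JanadaSroor/vision-models",
                             filename="models/clip_text_quantized.onnx")
```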

## 📊 Model Specifications

| Model | Original Size | Compressed Size | Quantization | Input Shape | Output Shape |
|-------|---------------|-----------------|-------------|-------------|--------------|
| **CLIP Text** | ~120MB | 62MB (⬇️ 48%) | ✅ INT8 | `[batch, 77]` | `[batch, 512]` |
| **CLIP Vision** | ~340MB | 337MB | ❌ FP32 | `[batch, 3, 224, 224]` | `[batch, 512]` |
| **ViT Base** | ~340MB | 84MB (⬇️ 75%) | ✅ INT8 | `[batch, 3, 224, 224]` | `[batch, 768]` |

## πŸƒ Performance Benchmarks

Inference times measured in Colab on a T4 instance, running on CPU:

- **CLIP Text (INT8)**: ~12ms
- **CLIP Vision (FP32)**: ~65ms
- **ViT Base (INT8)**: ~55ms

*Note: Mobile performance on modern Android devices (Snapdragon 8 Gen 1 and newer) is expected to be 20-30% faster due to NPU/GPU acceleration.*
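
A minimal sketch for reproducing the CPU text-encoder number (the token input names are assumptions, so only inputs the export actually declares are fed):

```python
import time
import numpy as np
import onnxruntime as ort

session = ort.InferenceSession("models/clip_text_quantized.onnx",
                               providers=["CPUExecutionProvider"])

# Dummy token batch; keep only the inputs this export declares
candidates = {"input_ids": np.ones((1, 77), dtype=np.int64),
              "attention_mask": np.ones((1, 77), dtype=np.int64)}
declared = {i.name for i in session.get_inputs()}
feed = {k: v for k, v in candidates.items() if k in declared}

session.run(None, feed)  # warm-up
start = time.perf_counter()
for _ in range(50):
    session.run(None, feed)
print(f"avg latency: {(time.perf_counter() - start) / 50 * 1000:.1f} ms")
```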

## 🔧 Deployment in Android

These models are optimized for [ONNX Runtime Mobile](https://onnxruntime.ai/docs/install/mobile.html).

1. Copy the `.onnx` files to your project's `src/main/assets/` directory.
2. Use the ONNX Runtime Kotlin/Java API to load and run inference:
```kotlin
// Create a session from the model bytes via the default environment
val env = OrtEnvironment.getEnvironment()
val session = env.createSession(modelBytes, OrtSession.SessionOptions())
// "input_ids" must match the input name declared in the ONNX graph
val inputs = mapOf("input_ids" to textTensor)
val results = session.run(inputs)
```

## 📈 Optimization Details

We used Hugging Face Optimum and the ONNX Runtime quantization tools to achieve these results:
- **Dynamic Quantization**: Applied to CLIP Text and ViT Base to reduce memory footprint.
- **Operator Fusion**: Combined multiple layers into single kernels for faster execution.
- **Precision Tuning**: Kept CLIP Vision in FP32 as INT8 quantization led to significant accuracy loss (>5%).
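
A rough sketch of the dynamic-quantization step with ONNX Runtime (the FP32 input filename is illustrative):

```python
from onnxruntime.quantization import quantize_dynamic, QuantType

# Weights are stored as INT8; activations are quantized on the fly at runtime
quantize_dynamic(
    model_input="clip_text_fp32.onnx",  # illustrative path to the FP32 export
    model_output="clip_text_quantized.onnx",
    weight_type=QuantType.QInt8,
)
```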

## πŸ” Use Cases

- **Semantic Search**: "Show me photos of mountains at sunset."
- **Image Clustering**: Automatically group similar photos.
- **Fast Tagging**: Detect objects and scenes without cloud APIs.
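
All of these reduce to comparing embeddings; a minimal similarity sketch (array names are illustrative):

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> np.ndarray:
    """Row-wise cosine similarity between two batches of embeddings."""
    a = a / np.linalg.norm(a, axis=-1, keepdims=True)
    b = b / np.linalg.norm(b, axis=-1, keepdims=True)
    return a @ b.T

# query_embedding: [1, 512] from the text encoder
# gallery_embeddings: [N, 512] from the vision encoder
# scores = cosine_similarity(query_embedding, gallery_embeddings)[0]
# ranked = np.argsort(-scores)  # most similar photos first
```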

## 📄 License

This project is licensed under the MIT License. Models are subject to their respective original licenses (OpenAI for CLIP, Google for ViT).

---
**Maintained by [JanadaSroor](https://github.com/JanadaSroor)** | Developed for [AI Kit Gallery](https://github.com/JanadaSroor/AIkit)