JanadaSroor committed e66bdce (verified) · Parent: 4c9cc27

Upload README.md with huggingface_hub

Files changed (1): README.md (+112, -0)
---
language:
- en
tags:
- onnx
- vision
- clip
- vit
- image-similarity
- mobile
- quantization
license: mit
pipeline_tag: feature-extraction
---

# AI Kit Gallery - Optimized ONNX Vision Models

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/JanadaSroor/vision_models/blob/main/AI_Models_Demo.ipynb)
[![Hugging Face](https://img.shields.io/badge/%F0%9F%A4%97%20Hugging%20Face-Models-blue)](https://huggingface.co/JanadaSroor)

This repository contains optimized ONNX models designed for the [AI Kit Gallery](https://github.com/JanadaSroor/AIkit) Android app. These models enable high-performance, offline, AI-powered image search and categorization directly on mobile devices.

## 📁 Available Models

### CLIP Models (OpenAI/clip-vit-base-patch32)
- **Text Encoder**: `clip_text_quantized.onnx` (62MB)
  - **Input**: Text tokens (max length 77)
  - **Output**: 512D text embedding
  - **Optimization**: INT8 Dynamic Quantization
  - **Use Case**: Generating embeddings for text queries.

- **Vision Encoder**: `clip_vision_quantized.onnx` (337MB)
  - **Input**: 224x224 RGB images
  - **Output**: 512D image embedding
  - **Optimization**: Full precision (FP32) to maintain accuracy
  - **Use Case**: Encoding images for similarity search.

### ViT Model (Google/vit-base-patch16-224)
- **Base Model**: `vit_base_quantized.onnx` (84MB)
  - **Input**: 224x224 RGB images
  - **Output**: 768D image embedding (CLS token)
  - **Optimization**: INT8 Dynamic Quantization
  - **Use Case**: Alternative high-quality vision encoder.

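As a concrete illustration of how these encoders are used, here is a minimal Python sketch that turns a text query into a 512D embedding with `onnxruntime`. The input name `input_ids`, the optional `attention_mask`, and the assumption that the first output is the pooled embedding are not guaranteed by the export, so check `session.get_inputs()` / `get_outputs()` against your copy of the model:

```python
# Minimal sketch: encode a text query with the quantized CLIP text encoder.
# Assumed here (not guaranteed by the export): an "input_ids" input, an
# optional "attention_mask", and the pooled 512-D embedding as first output.
import numpy as np
import onnxruntime as ort
from transformers import CLIPTokenizerFast

tokenizer = CLIPTokenizerFast.from_pretrained("openai/clip-vit-base-patch32")
session = ort.InferenceSession("models/clip_text_quantized.onnx")

tokens = tokenizer(["mountains at sunset"], padding="max_length",
                   max_length=77, return_tensors="np")
inputs = {"input_ids": tokens["input_ids"].astype(np.int64)}
if any(i.name == "attention_mask" for i in session.get_inputs()):
    inputs["attention_mask"] = tokens["attention_mask"].astype(np.int64)

text_embedding = session.run(None, inputs)[0]   # shape: [1, 512]
text_embedding = text_embedding / np.linalg.norm(text_embedding, axis=-1, keepdims=True)
```

The image side works the same way: feed a `[1, 3, 224, 224]` float32 tensor to the vision encoder and compare the resulting embeddings by cosine similarity.
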
## 🚀 Quick Start

### 1. Try the Interactive Demo
Test the models immediately using our Google Colab notebook:
[**Run AI Models Demo in Colab**](https://colab.research.google.com/github/JanadaSroor/vision_models/blob/main/AI_Models_Demo.ipynb)

### 2. Download Models
```bash
# Install Hugging Face Hub
pip install huggingface_hub

# Download CLIP Models
huggingface-cli download JanadaSroor/vision-models clip_text_quantized.onnx --local-dir ./models
huggingface-cli download JanadaSroor/vision-models clip_vision_quantized.onnx --local-dir ./models

# Download ViT Model
huggingface-cli download JanadaSroor/vision-models vit_base_quantized.onnx --local-dir ./models
```

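If you prefer the Python API (for example inside the Colab demo), the same files can be fetched with `huggingface_hub`; this is simply the Python equivalent of the CLI commands above:

```python
# Equivalent of the CLI commands above, using the huggingface_hub Python API.
from huggingface_hub import hf_hub_download

for filename in ("clip_text_quantized.onnx",
                 "clip_vision_quantized.onnx",
                 "vit_base_quantized.onnx"):
    path = hf_hub_download(repo_id="JanadaSroor/vision-models",
                           filename=filename, local_dir="./models")
    print(f"Downloaded {filename} -> {path}")
```
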
## 📊 Model Specifications

| Model | Original Size | Compressed Size | Quantization | Input Shape | Output Shape |
|-------|---------------|-----------------|--------------|-------------|--------------|
| **CLIP Text** | ~120MB | 62MB (⬇️ 48%) | ✅ INT8 | `[batch, 77]` | `[batch, 512]` |
| **CLIP Vision** | ~340MB | 337MB | ❌ FP32 | `[batch, 3, 224, 224]` | `[batch, 512]` |
| **ViT Base** | ~340MB | 84MB (⬇️ 75%) | ✅ INT8 | `[batch, 3, 224, 224]` | `[batch, 768]` |

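Both vision models expect the `[batch, 3, 224, 224]` layout shown in the table. A minimal preprocessing sketch, assuming CLIP's standard resize and normalization constants (the ViT checkpoint normally uses different mean/std values, so adjust per model):

```python
# Sketch: turn an image file into the [1, 3, 224, 224] float32 tensor the
# vision models expect. Mean/std are the standard CLIP values (an assumption
# here); google/vit-base-patch16-224 normally uses mean = std = 0.5.
import numpy as np
from PIL import Image

CLIP_MEAN = np.array([0.48145466, 0.4578275, 0.40821073], dtype=np.float32)
CLIP_STD = np.array([0.26862954, 0.26130258, 0.27577711], dtype=np.float32)

def preprocess(path: str) -> np.ndarray:
    img = Image.open(path).convert("RGB").resize((224, 224))
    x = np.asarray(img, dtype=np.float32) / 255.0   # HWC, values in [0, 1]
    x = (x - CLIP_MEAN) / CLIP_STD                  # per-channel normalization
    return x.transpose(2, 0, 1)[np.newaxis, :]      # CHW + batch dim
```
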
## 🏃 Performance Benchmarks

Inference times measured on a standard Colab T4 instance running in CPU mode:

- **CLIP Text (INT8)**: ~12ms
- **CLIP Vision (FP32)**: ~65ms
- **ViT Base (INT8)**: ~55ms

*Note: On modern Android devices (Snapdragon 8 Gen 1 or newer), performance is expected to be 20-30% faster thanks to NPU/GPU acceleration.*

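These numbers can be sanity-checked with a rough timing loop like the sketch below (the file name and the `[1, 3, 224, 224]` dummy input are assumptions; swap in the model you want to measure):

```python
# Rough latency check: average wall-clock time over repeated runs after a warm-up.
import time
import numpy as np
import onnxruntime as ort

session = ort.InferenceSession("models/vit_base_quantized.onnx")
name = session.get_inputs()[0].name
dummy = {name: np.random.rand(1, 3, 224, 224).astype(np.float32)}

session.run(None, dummy)                                # warm-up
runs = 20
start = time.perf_counter()
for _ in range(runs):
    session.run(None, dummy)
print(f"avg latency: {(time.perf_counter() - start) / runs * 1000:.1f} ms")
```
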
## 🔧 Deployment in Android

These models are optimized for [ONNX Runtime Mobile](https://onnxruntime.ai/docs/install/mobile.html).

1. Copy the `.onnx` files to your project's `src/main/assets/` directory.
2. Use the ONNX Runtime Kotlin/Java API to load and run inference:
```kotlin
// modelBytes: ByteArray read from assets/; sessions are created via the OrtEnvironment
val env = OrtEnvironment.getEnvironment()
val session = env.createSession(modelBytes, OrtSession.SessionOptions())
val inputs = mapOf("input_ids" to textTensor)  // textTensor: an ai.onnxruntime.OnnxTensor
val results = session.run(inputs)
```

## 📈 Optimization Details

We used Hugging Face Optimum and the ONNX Runtime quantization tools to achieve these results:
- **Dynamic Quantization**: Applied to CLIP Text and ViT Base to reduce the memory footprint.
- **Operator Fusion**: Combined multiple layers into single kernels for faster execution.
- **Precision Tuning**: Kept CLIP Vision in FP32 because INT8 quantization led to a significant accuracy loss (>5%).

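For reference, dynamic INT8 quantization of this kind can be reproduced with ONNX Runtime's quantizer; a minimal sketch of what the call can look like, with placeholder file names:

```python
# Sketch: produce an INT8 dynamically quantized model from an FP32 export.
# Weights are quantized ahead of time; activations are quantized at runtime.
from onnxruntime.quantization import QuantType, quantize_dynamic

quantize_dynamic(
    model_input="clip_text_fp32.onnx",        # FP32 export (placeholder filename)
    model_output="clip_text_quantized.onnx",  # INT8 result
    weight_type=QuantType.QInt8,
)
```
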
## 🔍 Use Cases

- **Semantic Search**: "Show me photos of mountains at sunset."
- **Image Clustering**: Automatically group similar photos.
- **Fast Tagging**: Detect objects and scenes without cloud APIs.

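All of these reduce to comparing embeddings. For semantic search in particular, the CLIP text and image embeddings share a 512D space, so ranking is a cosine-similarity computation like the small sketch below (the embedding arrays are assumed to come from the encoders above):

```python
# Sketch: rank gallery images by cosine similarity to a text query embedding.
# text_emb: shape [512]; image_embs: shape [N, 512], precomputed by the vision encoder.
import numpy as np

def rank_images(text_emb: np.ndarray, image_embs: np.ndarray) -> np.ndarray:
    text_emb = text_emb / np.linalg.norm(text_emb)
    image_embs = image_embs / np.linalg.norm(image_embs, axis=1, keepdims=True)
    scores = image_embs @ text_emb        # one cosine similarity per image
    return np.argsort(-scores)            # indices of the best matches first
```
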
## 📄 License

This project is licensed under the MIT License. Models are subject to their respective original licenses (OpenAI for CLIP, Google for ViT).

---
**Maintained by [JanadaSroor](https://github.com/JanadaSroor)** | Developed for [AI Kit Gallery](https://github.com/JanadaSroor/AIkit)