---
|
|
language: |
|
|
- en |
|
|
tags: |
|
|
- onnx |
|
|
- vision |
|
|
- clip |
|
|
- vit |
|
|
- image-similarity |
|
|
- mobile |
|
|
- quantization |
|
|
license: mit |
|
|
pipeline_tag: feature-extraction |
|
|
--- |
|
|
|
|
|
# AI Kit Gallery - Optimized ONNX Vision Models |
|
|
|
|
|
[**AI Models Demo Notebook**](https://huggingface.co/JanadaSroor/vision-models/blob/main/AI_Models_Demo.ipynb)
|
|
[**Model Repository**](https://huggingface.co/JanadaSroor/vision-models)
|
|
|
|
|
This repository contains optimized ONNX models designed for the [AI Kit Gallery](https://github.com/JanadaSroor/AIkit) Android app. These models enable high-performance, offline AI-powered image search and categorization directly on mobile devices. |
|
|
|
|
|
## Available Models
|
|
|
|
|
### CLIP Models (OpenAI/clip-vit-base-patch32) |
|
|
- **Text Encoder**: `clip_text_quantized.onnx` (62MB)
  - **Input**: Text tokens (max length 77)
  - **Output**: 512D text embedding
  - **Optimization**: INT8 Dynamic Quantization
  - **Use Case**: Generating embeddings for text queries.

- **Vision Encoder**: `clip_vision_quantized.onnx` (337MB)
  - **Input**: 224x224 RGB images
  - **Output**: 512D image embedding
  - **Optimization**: Full precision (FP32) to maintain accuracy
  - **Use Case**: Encoding images for similarity search (see the sketch after this list).
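A minimal sketch of pairing the two encoders from Python with `onnxruntime`. It assumes preprocessing via the `transformers` processor for `openai/clip-vit-base-patch32`, that the graphs expose `input_ids` and `pixel_values` as input names, and that the first output is the pooled 512D embedding; verify with `session.get_inputs()` / `session.get_outputs()` if your export differs. `photo.jpg` is a placeholder path.

```python
# Sketch: encode a text query and an image, then score their similarity.
# Input/output names are assumptions -- check session.get_inputs() if they differ.
import numpy as np
import onnxruntime as ort
from PIL import Image
from transformers import CLIPProcessor

processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")
text_sess = ort.InferenceSession("models/clip_text_quantized.onnx")
vision_sess = ort.InferenceSession("models/clip_vision_quantized.onnx")

# Text -> 512D embedding (tokens padded to the 77-token context length)
tokens = processor(text=["mountains at sunset"], padding="max_length",
                   max_length=77, return_tensors="np")
text_emb = text_sess.run(None, {"input_ids": tokens["input_ids"].astype(np.int64)})[0]

# Image -> 512D embedding (the processor resizes/normalizes to 224x224)
pixels = processor(images=Image.open("photo.jpg").convert("RGB"),
                   return_tensors="np")["pixel_values"]
image_emb = vision_sess.run(None, {"pixel_values": pixels})[0]

# Cosine similarity of L2-normalized embeddings
text_emb /= np.linalg.norm(text_emb, axis=-1, keepdims=True)
image_emb /= np.linalg.norm(image_emb, axis=-1, keepdims=True)
print((text_emb @ image_emb.T).item())
```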
|
|
|
|
|
### ViT Model (Google/vit-base-patch16-224) |
|
|
- **Base Model**: `vit_base_quantized.onnx` (84MB)
  - **Input**: 224x224 RGB images
  - **Output**: 768D image embedding (CLS token)
  - **Optimization**: INT8 Dynamic Quantization
  - **Use Case**: Alternative high-quality vision encoder (usage sketch below).
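A corresponding sketch for the ViT encoder, under the same assumptions (`pixel_values` input name, placeholder image path):

```python
# Sketch: 768D image embedding from the quantized ViT encoder.
import onnxruntime as ort
from PIL import Image
from transformers import ViTImageProcessor

processor = ViTImageProcessor.from_pretrained("google/vit-base-patch16-224")
session = ort.InferenceSession("models/vit_base_quantized.onnx")

pixels = processor(images=Image.open("photo.jpg").convert("RGB"),
                   return_tensors="np")["pixel_values"]
embedding = session.run(None, {"pixel_values": pixels})[0]
# Expected (1, 768) per the spec above; if the graph returns the full
# token sequence (1, 197, 768) instead, take the CLS token at index 0.
print(embedding.shape)
```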
|
|
|
|
|
## Quick Start
|
|
|
|
|
### 1. Try the Interactive Demo |
|
|
You can view or download the demo notebook from Hugging Face: |
|
|
[**View AI Models Demo**](https://huggingface.co/JanadaSroor/vision-models/blob/main/AI_Models_Demo.ipynb) |
|
|
|
|
|
*To run it in Colab: Download the `.ipynb` file and upload it to [Google Colab](https://colab.research.google.com/).* |
|
|
|
|
|
### 2. Download Models |
|
|
```bash
# Install Hugging Face Hub
pip install huggingface_hub

# Download CLIP models
huggingface-cli download JanadaSroor/vision-models models/clip_text_quantized.onnx --local-dir .
huggingface-cli download JanadaSroor/vision-models models/clip_vision_quantized.onnx --local-dir .

# Download ViT model
huggingface-cli download JanadaSroor/vision-models models/vit_base_quantized.onnx --local-dir .
```
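For scripted downloads, the same files can be fetched with the `huggingface_hub` Python API (paths mirror the CLI commands above):

```python
from huggingface_hub import hf_hub_download

# Each call downloads into the local HF cache and returns the resolved path
for filename in ["models/clip_text_quantized.onnx",
                 "models/clip_vision_quantized.onnx",
                 "models/vit_base_quantized.onnx"]:
    path = hf_hub_download(repo_id="JanadaSroor/vision-models", filename=filename)
    print(path)
```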
|
|
|
|
|
## Model Specifications
|
|
|
|
|
| Model | Original Size | Compressed Size | Quantization | Input Shape | Output Shape |
|-------|---------------|-----------------|--------------|-------------|--------------|
| **CLIP Text** | ~120MB | 62MB (↓ 48%) | INT8 | `[batch, 77]` | `[batch, 512]` |
| **CLIP Vision** | ~340MB | 337MB | None (FP32) | `[batch, 3, 224, 224]` | `[batch, 512]` |
| **ViT Base** | ~340MB | 84MB (↓ 75%) | INT8 | `[batch, 3, 224, 224]` | `[batch, 768]` |
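
The shapes in the table can be verified against a downloaded model via ONNX Runtime's session metadata:

```python
import onnxruntime as ort

session = ort.InferenceSession("models/clip_text_quantized.onnx")
for inp in session.get_inputs():
    print("input: ", inp.name, inp.shape, inp.type)
for out in session.get_outputs():
    print("output:", out.name, out.shape, out.type)
```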
|
|
|
|
|
## Performance Benchmarks
|
|
|
|
|
Inference times measured in Colab on a T4 instance, with execution on CPU:
|
|
|
|
|
- **CLIP Text (INT8)**: ~12ms |
|
|
- **CLIP Vision (FP32)**: ~65ms |
|
|
- **ViT Base (INT8)**: ~55ms |
|
|
|
|
|
*Note: Mobile performance on modern Android devices (Snapdragon 8 Gen 1 and newer) is expected to be 20-30% faster due to NPU/GPU acceleration.*
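
A simple sketch for reproducing the CPU numbers (absolute times vary with the host; the `input_ids` feed is the same assumption as in the CLIP example above):

```python
import time
import numpy as np
import onnxruntime as ort

session = ort.InferenceSession("models/clip_text_quantized.onnx",
                               providers=["CPUExecutionProvider"])
feed = {"input_ids": np.ones((1, 77), dtype=np.int64)}

session.run(None, feed)  # warm-up run, excluded from timing
runs = 100
start = time.perf_counter()
for _ in range(runs):
    session.run(None, feed)
print(f"{(time.perf_counter() - start) / runs * 1000:.1f} ms per inference")
```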
|
|
|
|
|
## Deployment on Android
|
|
|
|
|
These models are optimized for [ONNX Runtime Mobile](https://onnxruntime.ai/docs/install/mobile.html). |
|
|
|
|
|
1. Copy the `.onnx` files to your project's `src/main/assets/` directory. |
|
|
2. Use the ONNX Runtime Kotlin/Java API to load and run inference: |
|
|
```kotlin
// Create the runtime environment and a session from the model bytes (e.g. loaded from assets)
val env = OrtEnvironment.getEnvironment()
val session = env.createSession(modelBytes, OrtSession.SessionOptions())

// Run inference; "input_ids" is the text encoder's input name
val results = session.run(mapOf("input_ids" to textTensor))
```
|
|
|
|
|
## Optimization Details
|
|
|
|
|
We used Hugging Face Optimum and the ONNX Runtime quantization tooling to achieve these results (a sketch of the quantization call follows this list):
|
|
- **Dynamic Quantization**: Applied to CLIP Text and ViT Base to reduce memory footprint. |
|
|
- **Operator Fusion**: Combined multiple layers into single kernels for faster execution. |
|
|
- **Precision Tuning**: Kept CLIP Vision in FP32 as INT8 quantization led to significant accuracy loss (>5%). |
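
The INT8 step follows the standard ONNX Runtime dynamic-quantization recipe; a sketch (the FP32 input filename is hypothetical, and the exact per-model settings used here may differ):

```python
from onnxruntime.quantization import QuantType, quantize_dynamic

# Weights are stored as INT8; activations are quantized on the fly at runtime
quantize_dynamic(
    model_input="clip_text_fp32.onnx",   # hypothetical FP32 export
    model_output="clip_text_quantized.onnx",
    weight_type=QuantType.QInt8,
)
```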
|
|
|
|
|
## Use Cases
|
|
|
|
|
- **Semantic Search**: "Show me photos of mountains at sunset." |
|
|
- **Image Clustering**: Automatically group similar photos. |
|
|
- **Fast Tagging**: Detect objects and scenes without cloud APIs. |
|
|
|
|
|
## License
|
|
|
|
|
This project is licensed under the MIT License. Models are subject to their respective original licenses (OpenAI for CLIP, Google for ViT). |
|
|
|
|
|
--- |
|
|
**Maintained by [JanadaSroor](https://github.com/JanadaSroor)** | Developed for [AI Kit Gallery](https://github.com/JanadaSroor/AIkit) |
|
|
|