AI Kit Gallery - Optimized ONNX Vision Models

View on Hugging Face

This repository contains optimized ONNX models designed for the AI Kit Gallery Android app. These models enable high-performance, offline AI-powered image search and categorization directly on mobile devices.

πŸ“ Available Models

CLIP Models (OpenAI/clip-vit-base-patch32)

  • Text Encoder: clip_text_quantized.onnx (62MB)

    • Input: Text token IDs (max length 77)
    • Output: 512D text embedding
    • Optimization: INT8 Dynamic Quantization
    • Use Case: Generating embeddings for text queries.
  • Vision Encoder: clip_vision_quantized.onnx (337MB)

    • Input: 224x224 RGB images
    • Output: 512D image embedding
    • Optimization: Full precision (FP32) to maintain accuracy
    • Use Case: Encoding images for similarity search (see the similarity sketch below).
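
Both CLIP encoders project into the same 512-D space, so a text query can be matched against image embeddings with plain cosine similarity. Below is a minimal Kotlin sketch; it assumes you already have the two embeddings as FloatArray values (obtaining them is covered in the deployment section):

// Cosine similarity between a text embedding and an image embedding.
// Both CLIP encoders emit 512-D vectors in a shared space.
fun cosineSimilarity(a: FloatArray, b: FloatArray): Float {
    var dot = 0f; var normA = 0f; var normB = 0f
    for (i in a.indices) {
        dot += a[i] * b[i]
        normA += a[i] * a[i]
        normB += b[i] * b[i]
    }
    return dot / (kotlin.math.sqrt(normA) * kotlin.math.sqrt(normB))
}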

ViT Model (Google/vit-base-patch16-224)

  • Base Model: vit_base_quantized.onnx (84MB)
    • Input: 224x224 RGB images (see the preprocessing sketch below)
    • Output: 768D image embedding (CLS token)
    • Optimization: INT8 Dynamic Quantization
    • Use Case: Alternative high-quality vision encoder.
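
Both vision models expect the same input layout: a 224x224 RGB image flattened into a [1, 3, 224, 224] CHW float tensor. A minimal Android sketch is below; the mean=0.5/std=0.5 normalization is the standard preprocessing for this ViT checkpoint, while CLIP uses different constants, so verify against the preprocessor config the models were exported with:

import android.graphics.Bitmap

// Convert a Bitmap into a [1, 3, 224, 224] CHW float array.
// Normalization here (scale to [0,1], then (x - 0.5) / 0.5 per channel)
// follows the usual ViT preprocessing; CLIP uses its own mean/std values.
fun bitmapToChw(bitmap: Bitmap): FloatArray {
    val resized = Bitmap.createScaledBitmap(bitmap, 224, 224, true)
    val pixels = IntArray(224 * 224)
    resized.getPixels(pixels, 0, 224, 0, 0, 224, 224)
    val out = FloatArray(3 * 224 * 224)
    for (i in pixels.indices) {
        val p = pixels[i]
        out[i] = ((p shr 16 and 0xFF) / 255f - 0.5f) / 0.5f              // R plane
        out[224 * 224 + i] = ((p shr 8 and 0xFF) / 255f - 0.5f) / 0.5f   // G plane
        out[2 * 224 * 224 + i] = ((p and 0xFF) / 255f - 0.5f) / 0.5f     // B plane
    }
    return out
}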

πŸš€ Quick Start

1. Try the Interactive Demo

You can view or download the demo notebook from Hugging Face: View AI Models Demo

To run it in Colab: Download the .ipynb file and upload it to Google Colab.

2. Download Models

# Install Hugging Face Hub
pip install huggingface_hub

# Download CLIP Models
huggingface-cli download JanadaSroor/vision-models models/clip_text_quantized.onnx --local-dir .
huggingface-cli download JanadaSroor/vision-models models/clip_vision_quantized.onnx --local-dir .

# Download ViT Model
huggingface-cli download JanadaSroor/vision-models models/vit_base_quantized.onnx --local-dir .

πŸ“Š Model Specifications

| Model | Original Size | Compressed Size | Quantization | Input Shape | Output Shape |
| --- | --- | --- | --- | --- | --- |
| CLIP Text | ~120MB | 62MB (⬇️ 48%) | ✅ INT8 | [batch, 77] | [batch, 512] |
| CLIP Vision | ~340MB | 337MB | ❌ FP32 | [batch, 3, 224, 224] | [batch, 512] |
| ViT Base | ~340MB | 84MB (⬇️ 75%) | ✅ INT8 | [batch, 3, 224, 224] | [batch, 768] |

πŸƒ Performance Benchmarks

Inference times were measured on a standard Colab T4 instance running in CPU mode:

  • CLIP Text (INT8): ~12ms
  • CLIP Vision (FP32): ~65ms
  • ViT Base (INT8): ~55ms

Note: Mobile performance on modern Android devices (Snapdragon 8 Gen 1 and newer) is expected to be 20-30% faster due to NPU/GPU acceleration.
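
To reproduce these numbers on-device, a warm-up-then-average loop is enough. A minimal sketch, where the runOnce lambda stands in for a full session.run call as shown in the next section:

import kotlin.system.measureNanoTime

// Measure mean inference latency: discard warm-up runs, then average.
fun benchmarkMs(runOnce: () -> Unit, warmup: Int = 5, iterations: Int = 20): Double {
    repeat(warmup) { runOnce() }   // first runs are slower (JIT, allocations)
    val totalNs = (1..iterations).sumOf { measureNanoTime { runOnce() } }
    return totalNs.toDouble() / iterations / 1e6
}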

πŸ”§ Deployment in Android

These models are optimized for ONNX Runtime Mobile.

  1. Copy the .onnx files to your project's src/main/assets/ directory.
  2. Use the ONNX Runtime Kotlin/Java API to load and run inference:
// Sessions are created from the OrtEnvironment, not via OrtSession directly.
val env = OrtEnvironment.getEnvironment()
val session = env.createSession(modelBytes, OrtSession.SessionOptions())
val inputs = mapOf("input_ids" to textTensor)
val results = session.run(inputs)
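
A fuller, self-contained sketch of the vision path follows. The input name "pixel_values" is an assumption (typical for Optimum exports); confirm the actual names with session.inputNames. In a real app, create the environment and session once and reuse them across images:

import ai.onnxruntime.OnnxTensor
import ai.onnxruntime.OrtEnvironment
import ai.onnxruntime.OrtSession
import android.content.Context
import java.nio.FloatBuffer

// Load the vision encoder from assets and embed one preprocessed image.
// Uses bitmapToChw() from the model section above to build the CHW input.
fun encodeImage(context: Context, chw: FloatArray): FloatArray {
    val env = OrtEnvironment.getEnvironment()
    val modelBytes = context.assets.open("clip_vision_quantized.onnx").use { it.readBytes() }
    val session = env.createSession(modelBytes, OrtSession.SessionOptions())
    val input = OnnxTensor.createTensor(env, FloatBuffer.wrap(chw), longArrayOf(1, 3, 224, 224))
    session.run(mapOf("pixel_values" to input)).use { results ->
        val output = results.get(0).value as Array<FloatArray>  // shape [1, 512]
        return output[0]
    }
}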

πŸ“ˆ Optimization Details

We used Hugging Face Optimum and ONNX Runtime Quantization tools to achieve these results:

  • Dynamic Quantization: Applied to CLIP Text and ViT Base to reduce memory footprint.
  • Operator Fusion: Combined multiple layers into single kernels for faster execution.
  • Precision Tuning: Kept CLIP Vision in FP32 as INT8 quantization led to significant accuracy loss (>5%).

πŸ” Use Cases

  • Semantic Search: "Show me photos of mountains at sunset." (see the ranking sketch after this list)
  • Image Clustering: Automatically group similar photos.
  • Fast Tagging: Detect objects and scenes without cloud APIs.
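
As a concrete instance of the semantic-search case, the sketch below ranks a gallery against an encoded text query. It assumes image embeddings were precomputed with the vision encoder and reuses cosineSimilarity() from the CLIP section:

// Rank gallery photos against a text query embedding; returns the top-K matches.
fun search(
    queryEmbedding: FloatArray,
    gallery: Map<String, FloatArray>,  // photo path -> precomputed image embedding
    topK: Int = 10
): List<Pair<String, Float>> =
    gallery.map { (path, emb) -> path to cosineSimilarity(queryEmbedding, emb) }
        .sortedByDescending { it.second }
        .take(topK)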

πŸ“„ License

This project is licensed under the MIT License. Models are subject to their respective original licenses (OpenAI for CLIP, Google for ViT).


Maintained by JanadaSroor | Developed for AI Kit Gallery
