{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# AI Kit Gallery - Vision Models Demo\n", "\n", "This notebook demonstrates how to use the optimized ONNX models from the [JanadaSroor/vision-models](https://huggingface.co/JanadaSroor/vision-models) repository. These models are designed for high-performance inference on mobile devices.\n", "\n", "## Models Included:\n", "- **CLIP (OpenAI)**: Text-to-Image & Image-to-Image similarity.\n", "- **ViT (Google)**: High-quality image feature extraction.\n", "\n", "All models are quantized (INT8) or optimized for mobile use." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# 1. Install Dependencies\n", "!pip install onnxruntime transformers pillow numpy huggingface_hub requests" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# 2. Import Libraries\n", "import os\n", "import time\n", "import numpy as np\n", "import requests\n", "from io import BytesIO\n", "from PIL import Image\n", "import onnxruntime as ort\n", "from transformers import CLIPProcessor, ViTFeatureExtractor\n", "from huggingface_hub import hf_hub_download" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 3. Download Models from Hugging Face\n", "\n", "We download the models directly from the `JanadaSroor/vision-models` repository." 
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Configuration\n", "REPO_ID = \"JanadaSroor/vision-models\"\n", "MODELS_DIR = \"models\"\n", "\n", "def download_onnx_model(filename):\n", "    print(f\"Downloading {filename}...\")\n", "    # Files are stored in the 'models/' subdirectory in the repo\n", "    return hf_hub_download(repo_id=REPO_ID, filename=f\"models/{filename}\")\n", "\n", "# Download CLIP Models\n", "clip_text_path = download_onnx_model(\"clip_text_quantized.onnx\")\n", "clip_vision_path = download_onnx_model(\"clip_vision_quantized.onnx\")\n", "\n", "# Download ViT Model\n", "vit_path = download_onnx_model(\"vit_base_quantized.onnx\")\n", "\n", "print(\"\\nāœ… All models downloaded successfully!\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 4. Initialize Inference Sessions\n", "\n", "We create ONNX Runtime sessions for hardware-accelerated inference." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Initialize ONNX Sessions\n", "text_sess = ort.InferenceSession(clip_text_path)\n", "vision_sess = ort.InferenceSession(clip_vision_path)\n", "vit_sess = ort.InferenceSession(vit_path)\n", "\n", "# Initialize Processors (for tokenizing text and preprocessing images)\n", "clip_processor = CLIPProcessor.from_pretrained(\"openai/clip-vit-base-patch32\")\n", "vit_extractor = ViTFeatureExtractor.from_pretrained(\"google/vit-base-patch16-224\")\n", "\n", "print(\"āœ… Inference sessions ready.\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 5. CLIP Demo: Search Images with Text\n", "\n", "We will compare a query text against a test image to see the similarity scores." 
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import numpy as np\n", "import requests\n", "from PIL import Image\n", "from io import BytesIO\n", "\n", "# Load a test image\n", "url = \"https://images.unsplash.com/photo-1543466835-00a7907e9de1?ixlib=rb-4.0.3&auto=format&fit=crop&w=500&q=80\"\n", "response = requests.get(url)\n", "image = Image.open(BytesIO(response.content)).convert(\"RGB\")\n", "display(image.resize((300, 300)))\n", "\n", "# Define queries\n", "queries = [\"a cute dog\", \"a dog looking\", \"a cat\", \"a car\", \"food\"]\n", "\n", "# ---------- 1. Encode Image ----------\n", "image_inputs = clip_processor(images=image, return_tensors=\"np\")\n", "image_embed = vision_sess.run(None, dict(image_inputs))[0][0]\n", "\n", "# L2-normalize the image embedding\n", "image_embed = image_embed / np.linalg.norm(image_embed)\n", "scores = []\n", "\n", "# ---------- 2. Encode each query and score it against the image ----------\n", "for query in queries:\n", "    text_inputs = clip_processor(text=[query], return_tensors=\"np\", padding=True)\n", "    text_embed = text_sess.run(None, dict(text_inputs))[0][0]\n", "    text_embed = text_embed / np.linalg.norm(text_embed)\n", "\n", "    score = 100.0 * np.dot(text_embed, image_embed)\n", "    scores.append(score)\n", "\n", "scores = np.array(scores)\n", "\n", "# Softmax over the queries, matching CLIP's contrastive scoring\n", "# (subtract the max logit first for numerical stability)\n", "exp_scores = np.exp(scores - scores.max())\n", "probs = exp_scores / exp_scores.sum()\n", "\n", "print(f\"\\n{'Query':<20} | {'Logit':<10} | {'Prob'}\")\n", "print(\"-\" * 50)\n", "\n", "for q, s, p in zip(queries, scores, probs):\n", "    print(f\"{q:<20} | {s:<10.2f} | {100*p:.3f}%\")\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 6. ViT Demo: Feature Extraction\n", "\n", "Generate a 768-dimensional embedding vector for the image using the ViT model." 
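, "\n",
"\n",
"The embedding produced below can then be compared across images, e.g. for retrieval or deduplication, typically via cosine similarity. A minimal sketch, assuming two embedding vectors `a` and `b` (the helper name `cosine_similarity` is illustrative, not part of the repo):\n",
"\n",
"```python\n",
"import numpy as np\n",
"\n",
"def cosine_similarity(a, b):\n",
"    # Cosine similarity between two 1-D embedding vectors:\n",
"    # dot product of the vectors divided by the product of their norms\n",
"    a = np.asarray(a, dtype=np.float32)\n",
"    b = np.asarray(b, dtype=np.float32)\n",
"    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))\n",
"```\n"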
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "inputs = vit_extractor(images=image, return_tensors=\"np\")\n", "outputs = vit_sess.run(None, dict(inputs))\n", "\n", "# For ViT, the first output [0] is the last_hidden_state.\n", "# We typically use the first token (CLS token) as the image representation.\n", "cls_embedding = outputs[0][0][0]\n", "\n", "print(f\"ViT Embedding Shape: {cls_embedding.shape}\")\n", "print(f\"First 10 values: {cls_embedding[:10]}\")" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.8.10" } }, "nbformat": 4, "nbformat_minor": 4 }