{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# AI Kit Gallery - Vision Models Demo\n",
"\n",
"This notebook demonstrates how to use the optimized ONNX models from the [JanadaSroor/vision-models](https://huggingface.co/JanadaSroor/vision-models) repository. These models are designed for high-performance inference on mobile devices.\n",
"\n",
"## Models Included:\n",
"- **CLIP (OpenAI)**: Text-to-Image & Image-to-Image similarity.\n",
"- **ViT (Google)**: High-quality image feature extraction.\n",
"\n",
"All models are quantized (INT8) or optimized for mobile use."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# 1. Install Dependencies\n",
"!pip install onnxruntime transformers pillow numpy huggingface_hub requests"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# 2. Import Libraries\n",
"import os\n",
"import time\n",
"import numpy as np\n",
"import requests\n",
"from io import BytesIO\n",
"from PIL import Image\n",
"import onnxruntime as ort\n",
"from transformers import CLIPProcessor, ViTFeatureExtractor\n",
"from huggingface_hub import hf_hub_download"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 3. Download Models from Hugging Face\n",
"\n",
"We download the models directly from the `JanadaSroor/vision-models` repository."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Configuration\n",
"REPO_ID = \"JanadaSroor/vision-models\"\n",
"MODELS_DIR = \"models\"\n",
"\n",
"def download_onnx_model(filename):\n",
" print(f\"Downloading {filename}...\")\n",
" # Files are stored in the 'models/' subdirectory in the repo\n",
" return hf_hub_download(repo_id=REPO_ID, filename=f\"models/{filename}\")\n",
"\n",
"# Download CLIP Models\n",
"clip_text_path = download_onnx_model(\"clip_text_quantized.onnx\")\n",
"clip_vision_path = download_onnx_model(\"clip_vision_quantized.onnx\")\n",
"\n",
"# Download ViT Model\n",
"vit_path = download_onnx_model(\"vit_base_quantized.onnx\")\n",
"\n",
"print(\"\\n✅ All models downloaded successfully!\")"
]
},
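{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Optional: Check Downloaded File Sizes\n",
"\n",
"As a quick sanity check, we can report the on-disk size of each downloaded ONNX file. The exact sizes depend on the repository contents, but the INT8-quantized models should be noticeably smaller than their FP32 counterparts."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Optional sanity check: report the on-disk size of each downloaded model.\n",
"# Uses only the paths returned by hf_hub_download above; no extra downloads.\n",
"for name, path in [(\"CLIP text\", clip_text_path), (\"CLIP vision\", clip_vision_path), (\"ViT\", vit_path)]:\n",
"    size_mb = os.path.getsize(path) / (1024 * 1024)\n",
"    print(f\"{name:<12}: {size_mb:.1f} MB ({path})\")"
]
},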
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 4. Initialize Inference Sessions\n",
"\n",
"We create ONNX Runtime sessions for hardware-accelerated inference."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Initialize ONNX Sessions\n",
"text_sess = ort.InferenceSession(clip_text_path)\n",
"vision_sess = ort.InferenceSession(clip_vision_path)\n",
"vit_sess = ort.InferenceSession(vit_path)\n",
"\n",
"# Initialize Processors (for tokenizing text and preprocessing images)\n",
"clip_processor = CLIPProcessor.from_pretrained(\"openai/clip-vit-base-patch32\")\n",
"vit_extractor = ViTFeatureExtractor.from_pretrained(\"google/vit-base-patch16-224\")\n",
"\n",
"print(\"✅ Inference sessions ready.\")"
]
},
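{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Optional: Execution Providers\n",
"\n",
"The sessions above use ONNX Runtime's default provider selection, which is CPU with the plain `onnxruntime` package installed in step 1. If a hardware-specific build such as `onnxruntime-gpu` is installed, an explicit provider list can be passed when creating a session. This is a minimal sketch; the providers actually available depend on your ONNX Runtime build."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# List the execution providers available in this ONNX Runtime build.\n",
"print(\"Available providers:\", ort.get_available_providers())\n",
"\n",
"# Example (sketch): prefer CUDA if available, fall back to CPU.\n",
"# Uncomment to rebuild the vision session with an explicit provider list.\n",
"# vision_sess = ort.InferenceSession(\n",
"#     clip_vision_path,\n",
"#     providers=[\"CUDAExecutionProvider\", \"CPUExecutionProvider\"],\n",
"# )"
]
},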
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 5. CLIP Demo: Search Images with Text\n",
"\n",
"We will compare a query text against a test image to see the similarity score."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import numpy as np\n",
"import requests\n",
"from PIL import Image\n",
"from io import BytesIO\n",
"\n",
"# Load a test image\n",
"url = \"https://images.unsplash.com/photo-1543466835-00a7907e9de1?ixlib=rb-4.0.3&auto=format&fit=crop&w=500&q=80\"\n",
"response = requests.get(url)\n",
"image = Image.open(BytesIO(response.content)).convert(\"RGB\")\n",
"display(image.resize((300, 300)))\n",
"\n",
"# Define queries\n",
"queries = [\"a cute dog\", \"a dog looking\", \"a cat\", \"a car\", \"food\"]\n",
"\n",
"# ---------- 1. Encode Image ----------\n",
"image_inputs = clip_processor(images=image, return_tensors=\"np\")\n",
"image_embed = vision_sess.run(None, dict(image_inputs))[0][0]\n",
"\n",
"# L2 normalize image embedding\n",
"image_embed = image_embed / np.linalg.norm(image_embed)\n",
"scores = []\n",
"\n",
"for query in queries:\n",
" text_inputs = clip_processor(text=[query], return_tensors=\"np\", padding=True)\n",
" text_embed = text_sess.run(None, dict(text_inputs))[0][0]\n",
" text_embed = text_embed / np.linalg.norm(text_embed)\n",
"\n",
" score = 100.0 * np.dot(text_embed, image_embed)\n",
" scores.append(score)\n",
"\n",
"scores = np.array(scores)\n",
"\n",
"# Softmax over queries (THIS is what CLIP expects)\n",
"probs = np.exp(scores) / np.exp(scores).sum()\n",
"\n",
"print(f\"\\n{'Query':<20} | {'Logit':<10} | {'Prob'}\")\n",
"print(\"-\" * 50)\n",
"\n",
"for q, s, p in zip(queries, scores, probs):\n",
" print(f\"{q:<20} | {s:8.2f} | {100*p:.3f}%\")\n",
"\n"
]
},
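{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Image-to-Image Similarity\n",
"\n",
"The intro also lists image-to-image similarity for CLIP. As a self-contained sketch (so the cell needs no second download), we compare the test image against a horizontally flipped copy of itself and against a plain gray image; the flipped copy should score much higher."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from PIL import ImageOps\n",
"\n",
"def clip_image_embed(img):\n",
"    \"\"\"Encode a PIL image with the CLIP vision ONNX model and L2-normalize.\"\"\"\n",
"    feats = clip_processor(images=img, return_tensors=\"np\")\n",
"    emb = vision_sess.run(None, dict(feats))[0][0]\n",
"    return emb / np.linalg.norm(emb)\n",
"\n",
"# Candidate images: a flipped copy of the test image and a plain gray image.\n",
"candidates = {\n",
"    \"flipped copy\": ImageOps.mirror(image),\n",
"    \"plain gray\": Image.new(\"RGB\", image.size, (128, 128, 128)),\n",
"}\n",
"\n",
"ref_embed = clip_image_embed(image)\n",
"for name, img in candidates.items():\n",
"    sim = float(np.dot(ref_embed, clip_image_embed(img)))\n",
"    print(f\"Cosine similarity vs. {name:<14}: {sim:.3f}\")"
]
},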
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 6. ViT Demo: Feature Extraction\n",
"\n",
"Generate a 768-dimensional embedding vector for the image using the ViT model."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"inputs = vit_extractor(images=image, return_tensors=\"np\")\n",
"outputs = vit_sess.run(None, dict(inputs))\n",
"\n",
"# For ViT, the first output [0] is the last_hidden_state.\n",
"# We typically use the first token (CLS token) as the image representation.\n",
"cls_embedding = outputs[0][0][0]\n",
"\n",
"print(f\"ViT Embedding Shape: {cls_embedding.shape}\")\n",
"print(f\"First 10 values: {cls_embedding[:10]}\")"
]
}
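,
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 7. Quick Latency Check\n",
"\n",
"The intro notes that these models target high-performance inference. As a rough sanity check (not a rigorous benchmark; the numbers depend entirely on your machine and ONNX Runtime build), we time a few forward passes of the CLIP vision model after a warm-up run."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Rough single-image latency for the CLIP vision model (CPU by default).\n",
"feeds = dict(clip_processor(images=image, return_tensors=\"np\"))\n",
"\n",
"vision_sess.run(None, feeds)  # warm-up run (first call can include one-time setup)\n",
"\n",
"runs = 10\n",
"start = time.perf_counter()\n",
"for _ in range(runs):\n",
"    vision_sess.run(None, feeds)\n",
"elapsed = time.perf_counter() - start\n",
"\n",
"print(f\"CLIP vision: {1000 * elapsed / runs:.1f} ms per image (average over {runs} runs)\")"
]
}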
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.8.10"
}
},
"nbformat": 4,
"nbformat_minor": 4
}