{
    "cells": [
        {
            "cell_type": "markdown",
            "metadata": {},
            "source": [
                "# AI Kit Gallery - Vision Models Demo\n",
                "\n",
                "This notebook demonstrates how to use the optimized ONNX models from the [JanadaSroor/vision-models](https://huggingface.co/JanadaSroor/vision-models) repository. These models are designed for high-performance inference on mobile devices.\n",
                "\n",
                "## Models Included:\n",
                "- **CLIP (OpenAI)**: Text-to-Image & Image-to-Image similarity.\n",
                "- **ViT (Google)**: High-quality image feature extraction.\n",
                "\n",
                "All models are quantized (INT8) or optimized for mobile use."
            ]
        },
        {
            "cell_type": "code",
            "execution_count": null,
            "metadata": {},
            "outputs": [],
            "source": [
                "# 1. Install Dependencies\n",
                "!pip install onnxruntime transformers pillow numpy huggingface_hub requests"
            ]
        },
        {
            "cell_type": "code",
            "execution_count": null,
            "metadata": {},
            "outputs": [],
            "source": [
                "# 2. Import Libraries\n",
                "import os\n",
                "import time\n",
                "import numpy as np\n",
                "import requests\n",
                "from io import BytesIO\n",
                "from PIL import Image\n",
                "import onnxruntime as ort\n",
                "from transformers import CLIPProcessor, ViTFeatureExtractor\n",
                "from huggingface_hub import hf_hub_download"
            ]
        },
        {
            "cell_type": "markdown",
            "metadata": {},
            "source": [
                "## 3. Download Models from Hugging Face\n",
                "\n",
                "We download the models directly from the `JanadaSroor/vision-models` repository."
            ]
        },
        {
            "cell_type": "code",
            "execution_count": null,
            "metadata": {},
            "outputs": [],
            "source": [
                "# Configuration\n",
                "REPO_ID = \"JanadaSroor/vision-models\"\n",
                "MODELS_DIR = \"models\"\n",
                "\n",
                "def download_onnx_model(filename):\n",
                "    print(f\"Downloading {filename}...\")\n",
                "    # Files are stored in the 'models/' subdirectory in the repo\n",
                "    return hf_hub_download(repo_id=REPO_ID, filename=f\"models/{filename}\")\n",
                "\n",
                "# Download CLIP Models\n",
                "clip_text_path = download_onnx_model(\"clip_text_quantized.onnx\")\n",
                "clip_vision_path = download_onnx_model(\"clip_vision_quantized.onnx\")\n",
                "\n",
                "# Download ViT Model\n",
                "vit_path = download_onnx_model(\"vit_base_quantized.onnx\")\n",
                "\n",
                "print(\"\\n✅ All models downloaded successfully!\")"
            ]
        },
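        {
            "cell_type": "markdown",
            "metadata": {},
            "source": [
                "If a download fails, the file layout in the repository may have changed. As a quick sanity check, the standard `huggingface_hub.list_repo_files` helper lists every file in the repo so you can confirm the `models/` paths used above."
            ]
        },
        {
            "cell_type": "code",
            "execution_count": null,
            "metadata": {},
            "outputs": [],
            "source": [
                "# Optional: list the repository contents to verify the 'models/' layout.\n",
                "from huggingface_hub import list_repo_files\n",
                "\n",
                "for f in list_repo_files(REPO_ID):\n",
                "    print(f)"
            ]
        },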
        {
            "cell_type": "markdown",
            "metadata": {},
            "source": [
                "## 4. Initialize Inference Sessions\n",
                "\n",
                "We create ONNX Runtime sessions for hardware-accelerated inference."
            ]
        },
        {
            "cell_type": "code",
            "execution_count": null,
            "metadata": {},
            "outputs": [],
            "source": [
                "# Initialize ONNX Sessions\n",
                "text_sess = ort.InferenceSession(clip_text_path)\n",
                "vision_sess = ort.InferenceSession(clip_vision_path)\n",
                "vit_sess = ort.InferenceSession(vit_path)\n",
                "\n",
                "# Initialize Processors (for tokenizing text and preprocessing images)\n",
                "clip_processor = CLIPProcessor.from_pretrained(\"openai/clip-vit-base-patch32\")\n",
                "vit_extractor = ViTFeatureExtractor.from_pretrained(\"google/vit-base-patch16-224\")\n",
                "\n",
                "print(\"✅ Inference sessions ready.\")"
            ]
        },
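        {
            "cell_type": "markdown",
            "metadata": {},
            "source": [
                "The sessions above use ONNX Runtime's default execution provider (CPU in a standard `pip` install). As a sketch of how you could opt into hardware acceleration, the next cell picks the first available provider from an illustrative priority list; which providers actually appear depends entirely on your `onnxruntime` build, so treat the list as an example rather than part of this repository's API."
            ]
        },
        {
            "cell_type": "code",
            "execution_count": null,
            "metadata": {},
            "outputs": [],
            "source": [
                "# Inspect which execution providers this onnxruntime build offers.\n",
                "available = ort.get_available_providers()\n",
                "print(f\"Available providers: {available}\")\n",
                "\n",
                "# Prefer an accelerated provider when present, falling back to CPU.\n",
                "# Most pip builds ship with CPUExecutionProvider only.\n",
                "priority = [\"CUDAExecutionProvider\", \"CoreMLExecutionProvider\", \"CPUExecutionProvider\"]\n",
                "providers = [p for p in priority if p in available]\n",
                "\n",
                "# Rebuild the ViT session with an explicit provider list (safe: CPU is always present).\n",
                "vit_sess = ort.InferenceSession(vit_path, providers=providers)\n",
                "print(f\"ViT session is using: {vit_sess.get_providers()}\")"
            ]
        },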
        {
            "cell_type": "markdown",
            "metadata": {},
            "source": [
                "## 5. CLIP Demo: Search Images with Text\n",
                "\n",
                "We will compare a query text against a test image to see the similarity score."
            ]
        },
        {
            "cell_type": "code",
            "execution_count": null,
            "metadata": {},
            "outputs": [],
            "source": [
                "import numpy as np\n",
                "import requests\n",
                "from PIL import Image\n",
                "from io import BytesIO\n",
                "\n",
                "# Load a test image\n",
                "url = \"https://images.unsplash.com/photo-1543466835-00a7907e9de1?ixlib=rb-4.0.3&auto=format&fit=crop&w=500&q=80\"\n",
                "response = requests.get(url)\n",
                "image = Image.open(BytesIO(response.content)).convert(\"RGB\")\n",
                "display(image.resize((300, 300)))\n",
                "\n",
                "# Define queries\n",
                "queries = [\"a cute dog\", \"a dog looking\", \"a cat\", \"a car\", \"food\"]\n",
                "\n",
                "# ---------- 1. Encode Image ----------\n",
                "image_inputs = clip_processor(images=image, return_tensors=\"np\")\n",
                "image_embed = vision_sess.run(None, dict(image_inputs))[0][0]\n",
                "\n",
                "# L2 normalize image embedding\n",
                "image_embed = image_embed / np.linalg.norm(image_embed)\n",
                "scores = []\n",
                "\n",
                "for query in queries:\n",
                "    text_inputs = clip_processor(text=[query], return_tensors=\"np\", padding=True)\n",
                "    text_embed = text_sess.run(None, dict(text_inputs))[0][0]\n",
                "    text_embed = text_embed / np.linalg.norm(text_embed)\n",
                "\n",
                "    score = 100.0 * np.dot(text_embed, image_embed)\n",
                "    scores.append(score)\n",
                "\n",
                "scores = np.array(scores)\n",
                "\n",
                "# Softmax over queries (THIS is what CLIP expects)\n",
                "probs = np.exp(scores) / np.exp(scores).sum()\n",
                "\n",
                "print(f\"\\n{'Query':<20} | {'Logit':<10} | {'Prob'}\")\n",
                "print(\"-\" * 50)\n",
                "\n",
                "for q, s, p in zip(queries, scores, probs):\n",
                "    print(f\"{q:<20} | {s:8.2f} | {100*p:.3f}%\")\n",
                "\n"
            ]
        },
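        {
            "cell_type": "markdown",
            "metadata": {},
            "source": [
                "The repository also advertises Image-to-Image similarity. A minimal sketch of that use case: encode a second image with the same CLIP vision session and take the dot product of the two L2-normalized embeddings, which is their cosine similarity. The second URL is just an arbitrary example image; substitute any image you like."
            ]
        },
        {
            "cell_type": "code",
            "execution_count": null,
            "metadata": {},
            "outputs": [],
            "source": [
                "# Load a second test image (any URL works; this one is illustrative).\n",
                "url2 = \"https://images.unsplash.com/photo-1514888286974-6c03e2ca1dba?auto=format&fit=crop&w=500&q=80\"\n",
                "image2 = Image.open(BytesIO(requests.get(url2).content)).convert(\"RGB\")\n",
                "display(image2.resize((300, 300)))\n",
                "\n",
                "# Encode with the same CLIP vision session and L2-normalize.\n",
                "inputs2 = clip_processor(images=image2, return_tensors=\"np\")\n",
                "embed2 = vision_sess.run(None, dict(inputs2))[0][0]\n",
                "embed2 = embed2 / np.linalg.norm(embed2)\n",
                "\n",
                "# Cosine similarity of two unit vectors is their dot product.\n",
                "similarity = float(np.dot(image_embed, embed2))\n",
                "print(f\"Image-to-image cosine similarity: {similarity:.4f}\")"
            ]
        },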
        {
            "cell_type": "markdown",
            "metadata": {},
            "source": [
                "## 6. ViT Demo: Feature Extraction\n",
                "\n",
                "Generate a 768-dimensional embedding vector for the image using the ViT model."
            ]
        },
        {
            "cell_type": "code",
            "execution_count": null,
            "metadata": {},
            "outputs": [],
            "source": [
                "inputs = vit_extractor(images=image, return_tensors=\"np\")\n",
                "outputs = vit_sess.run(None, dict(inputs))\n",
                "\n",
                "# For ViT, the first output [0] is the last_hidden_state.\n",
                "# We typically use the first token (CLS token) as the image representation.\n",
                "cls_embedding = outputs[0][0][0]\n",
                "\n",
                "print(f\"ViT Embedding Shape: {cls_embedding.shape}\")\n",
                "print(f\"First 10 values: {cls_embedding[:10]}\")"
            ]
        },
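        {
            "cell_type": "markdown",
            "metadata": {},
            "source": [
                "## 7. Quick Latency Check\n",
                "\n",
                "The intro claims these quantized models are tuned for fast inference. As a rough sanity check (not a rigorous benchmark), the cell below times repeated ViT forward passes with `time.perf_counter`. Desktop CPU numbers will not match the mobile targets, so read the result as indicative only."
            ]
        },
        {
            "cell_type": "code",
            "execution_count": null,
            "metadata": {},
            "outputs": [],
            "source": [
                "# Rough latency estimate: average over several runs after a warm-up pass.\n",
                "feed = dict(inputs)  # reuse the preprocessed ViT inputs from above\n",
                "vit_sess.run(None, feed)  # warm-up run absorbs one-time setup costs\n",
                "\n",
                "n_runs = 10\n",
                "start = time.perf_counter()\n",
                "for _ in range(n_runs):\n",
                "    vit_sess.run(None, feed)\n",
                "elapsed = time.perf_counter() - start\n",
                "\n",
                "print(f\"ViT average latency over {n_runs} runs: {1000 * elapsed / n_runs:.1f} ms\")"
            ]
        }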
    ],
    "metadata": {
        "kernelspec": {
            "display_name": "Python 3",
            "language": "python",
            "name": "python3"
        },
        "language_info": {
            "codemirror_mode": {
                "name": "ipython",
                "version": 3
            },
            "file_extension": ".py",
            "mimetype": "text/x-python",
            "name": "python",
            "nbconvert_exporter": "python",
            "pygments_lexer": "ipython3",
            "version": "3.8.10"
        }
    },
    "nbformat": 4,
    "nbformat_minor": 4
}