JanadaSroor committed
Commit 0b27d75 · verified · 1 Parent(s): 903d71d

Upload AI_Models_Demo.ipynb with huggingface_hub

Files changed (1)
  1. AI_Models_Demo.ipynb +63 -54
AI_Models_Demo.ipynb CHANGED
@@ -4,13 +4,15 @@
   "cell_type": "markdown",
   "metadata": {},
   "source": [
- "# AI Kit Vision Models Demo\n",
+ "# AI Kit Gallery - Vision Models Demo\n",
   "\n",
- "This notebook demonstrates how to use the optimized ONNX models from the [JanadaSroor/vision_models](https://huggingface.co/JanadaSroor) repository. These models are designed for high-performance inference on mobile devices and edge applications.\n",
+ "This notebook demonstrates how to use the optimized ONNX models from the [JanadaSroor/vision-models](https://huggingface.co/JanadaSroor/vision-models) repository. These models are designed for high-performance inference on mobile devices.\n",
   "\n",
   "## Models Included:\n",
- "- **CLIP (OpenAI)**: For text-to-image and image-to-image similarity search.\n",
- "- **ViT (Google)**: For high-quality image feature extraction."
+ "- **CLIP (OpenAI)**: Text-to-Image & Image-to-Image similarity.\n",
+ "- **ViT (Google)**: High-quality image feature extraction.\n",
+ "\n",
+ "All models are quantized (INT8) or optimized for mobile use."
   ]
  },
  {
@@ -20,7 +22,7 @@
   "outputs": [],
   "source": [
   "# 1. Install Dependencies\n",
- "!pip install onnxruntime transformers pillow numpy huggingface_hub"
+ "!pip install onnxruntime transformers pillow numpy huggingface_hub requests"
   ]
  },
  {
@@ -33,10 +35,9 @@
   "import os\n",
   "import time\n",
   "import numpy as np\n",
- "import torch\n",
- "from PIL import Image\n",
   "import requests\n",
   "from io import BytesIO\n",
+ "from PIL import Image\n",
   "import onnxruntime as ort\n",
   "from transformers import CLIPProcessor, ViTFeatureExtractor\n",
   "from huggingface_hub import hf_hub_download"
@@ -48,7 +49,7 @@
   "source": [
   "## 3. Download Models from Hugging Face\n",
   "\n",
- "We'll download the quantized versions of the models for efficient CPU inference."
+ "We download the models directly from the `JanadaSroor/vision-models` repository."
   ]
  },
  {
@@ -57,27 +58,32 @@
   "metadata": {},
   "outputs": [],
   "source": [
- "def download_model(repo_id, filename):\n",
- " print(f\"Downloading {filename} from {repo_id}...\")\n",
- " return hf_hub_download(repo_id=repo_id, filename=filename)\n",
- "\n",
- "# CLIP Models\n",
- "REPO_CLIP = \"JanadaSroor/clip-vit-base-patch32-onnx\"\n",
- "clip_text_path = download_model(REPO_CLIP, \"clip_text_quantized.onnx\")\n",
- "clip_vision_path = download_model(REPO_CLIP, \"clip_vision_quantized.onnx\")\n",
- "\n",
- "# ViT Model\n",
- "REPO_VIT = \"JanadaSroor/vit-base-patch16-224-onnx\"\n",
- "vit_path = download_model(REPO_VIT, \"vit_base_quantized.onnx\")"
+ "# Configuration\n",
+ "REPO_ID = \"JanadaSroor/vision-models\"\n",
+ "MODELS_DIR = \"models\"\n",
+ "\n",
+ "def download_onnx_model(filename):\n",
+ " print(f\"Downloading {filename}...\")\n",
+ " # Files are stored in the 'models/' subdirectory in the repo\n",
+ " return hf_hub_download(repo_id=REPO_ID, filename=f\"models/{filename}\")\n",
+ "\n",
+ "# Download CLIP Models\n",
+ "clip_text_path = download_onnx_model(\"clip_text_quantized.onnx\")\n",
+ "clip_vision_path = download_onnx_model(\"clip_vision_quantized.onnx\")\n",
+ "\n",
+ "# Download ViT Model\n",
+ "vit_path = download_onnx_model(\"vit_base_quantized.onnx\")\n",
+ "\n",
+ "print(\"\\n✅ All models downloaded successfully!\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
- "## 4. Initialization\n",
+ "## 4. Initialize Inference Sessions\n",
   "\n",
- "Load the ONNX sessions and the processors."
+ "We create ONNX Runtime sessions for hardware-accelerated inference."
   ]
  },
  {
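If the repository keeps all of its ONNX files under `models/`, as the new download cell assumes, a single `snapshot_download` call is an alternative to one `hf_hub_download` per file. This is a minimal sketch, not part of the committed notebook; the `allow_patterns` glob assumes the `models/*.onnx` layout described in the cell above.

```python
from huggingface_hub import snapshot_download

# Sketch only: fetch every ONNX file under models/ in one call.
# Assumes the repo id and models/ layout used by the notebook above.
local_dir = snapshot_download(
    repo_id="JanadaSroor/vision-models",
    allow_patterns=["models/*.onnx"],
)
print("ONNX models downloaded to:", local_dir)
```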
@@ -86,23 +92,25 @@
   "metadata": {},
   "outputs": [],
   "source": [
- "print(\"Initializing ONNX sessions...\")\n",
+ "# Initialize ONNX Sessions\n",
   "text_sess = ort.InferenceSession(clip_text_path)\n",
   "vision_sess = ort.InferenceSession(clip_vision_path)\n",
   "vit_sess = ort.InferenceSession(vit_path)\n",
   "\n",
- "print(\"Loading processors...\")\n",
+ "# Initialize Processors (for tokenizing text and preprocessing images)\n",
   "clip_processor = CLIPProcessor.from_pretrained(\"openai/clip-vit-base-patch32\")\n",
- "vit_extractor = ViTFeatureExtractor.from_pretrained(\"google/vit-base-patch16-224\")"
+ "vit_extractor = ViTFeatureExtractor.from_pretrained(\"google/vit-base-patch16-224\")\n",
+ "\n",
+ "print(\"✅ Inference sessions ready.\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
- "## 5. CLIP Demo: Text-to-Image Similarity\n",
+ "## 5. CLIP Demo: Search Images with Text\n",
   "\n",
- "We'll take a test image and several text descriptions to see which description matches best."
+ "We will compare a query text against a test image to see the similarity score."
   ]
  },
  {
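The new markdown cell mentions hardware-accelerated inference, but `ort.InferenceSession(path)` with the plain `onnxruntime` package runs on the default CPU provider. Below is a minimal sketch of requesting a specific execution provider; the provider names are standard ONNX Runtime identifiers, and which ones are actually available depends on the installed build.

```python
import onnxruntime as ort

# Sketch only (not in the committed notebook): prefer an accelerated provider
# when the installed onnxruntime build offers one, falling back to CPU.
available = ort.get_available_providers()
preferred = [p for p in ("CUDAExecutionProvider", "CoreMLExecutionProvider") if p in available]
providers = preferred + ["CPUExecutionProvider"]
print("Requesting providers:", providers)

# In the notebook, the session would then be created with, e.g.:
# text_sess = ort.InferenceSession(clip_text_path, providers=providers)
```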
@@ -111,40 +119,41 @@
   "metadata": {},
   "outputs": [],
   "source": [
- "def get_image(url):\n",
- " response = requests.get(url)\n",
- " return Image.open(BytesIO(response.content)).convert(\"RGB\")\n",
+ "# Load a test image\n",
+ "url = \"https://images.unsplash.com/photo-1543466835-00a7907e9de1?ixlib=rb-4.0.3&auto=format&fit=crop&w=500&q=80\"\n",
+ "response = requests.get(url)\n",
+ "image = Image.open(BytesIO(response.content)).convert(\"RGB\")\n",
+ "display(image.resize((300, 300)))\n",
   "\n",
- "# Sample image: A cat\n",
- "img_url = \"https://images.unsplash.com/photo-1514888286974-6c03e2ca1dba?ixlib=rb-1.2.1&auto=format&fit=crop&w=500&q=80\"\n",
- "image = get_image(img_url)\n",
- "image.thumbnail((300, 300))\n",
- "display(image)\n",
+ "# Define queries\n",
+ "queries = [\"a cute dog\", \"a running dog\", \"a cat\", \"a car\", \"food\"]\n",
   "\n",
- "queries = [\"a photo of a cat\", \"a photo of a dog\", \"a photo of a car\", \"a sunset\"]\n",
+ "# 1. Encode Image (CLIP Vision)\n",
+ "inputs = clip_processor(images=image, return_tensors=\"np\")\n",
+ "image_embeds = vision_sess.run(None, dict(inputs))[0][0]\n",
   "\n",
- "# Encode Image\n",
- "image_inputs = clip_processor(images=image, return_tensors=\"np\")\n",
- "image_embeds = vision_sess.run(None, dict(image_inputs))[0][0]\n",
+ "# 2. Encode Text & Compare\n",
+ "print(f\"\\n{'Query':<20} | {'Score':<10}\")\n",
+ "print(\"-\" * 35)\n",
   "\n",
- "# Encode Texts and Calculate Similarity\n",
- "print(\"\\nSimilarity Scores:\")\n",
   "for query in queries:\n",
+ " # Tokenize and encode text\n",
   " text_inputs = clip_processor(text=[query], return_tensors=\"np\", padding=True)\n",
   " text_embeds = text_sess.run(None, dict(text_inputs))[0][0]\n",
   " \n",
- " # Cosine Similarity\n",
+ " # Calculate Cosine Similarity\n",
   " similarity = np.dot(text_embeds, image_embeds) / (np.linalg.norm(text_embeds) * np.linalg.norm(image_embeds))\n",
- " print(f\"- Query: '{query}' -> Score: {similarity:.4f}\")"
+ " \n",
+ " print(f\"{query:<20} | {similarity:.4f}\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
- "## 6. ViT Demo: Image Embedding\n",
+ "## 6. ViT Demo: Feature Extraction\n",
   "\n",
- "Extract high-dimensional features (768D) from an image using the ViT model."
+ "Generate a 768-dimensional embedding vector for the image using the ViT model."
   ]
  },
  {
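The loop in the new cell prints raw cosine similarities, which tend to sit in a narrow band for CLIP. For a probability-style ranking, the original CLIP pipeline scales the similarities by a learned logit scale (roughly 100 for the released `openai/clip-vit-base-patch32` weights) and applies a softmax. Below is a self-contained sketch with made-up scores; whether the exported ONNX graphs already include CLIP's projection and normalization layers depends on how they were exported, so treat the absolute values as illustrative only.

```python
import numpy as np

def softmax(scores):
    # Numerically stable softmax over a 1-D array of scores.
    z = scores - np.max(scores)
    e = np.exp(z)
    return e / e.sum()

# Hypothetical cosine similarities for the five queries above (made-up values).
cosines = np.array([0.31, 0.29, 0.18, 0.11, 0.09])

# Scaling by CLIP's logit scale (~100) before the softmax sharpens the ranking.
probs = softmax(100.0 * cosines)
for c, p in zip(cosines, probs):
    print(f"cosine={c:.2f} -> prob={p:.4f}")
```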
@@ -153,15 +162,15 @@
   "metadata": {},
   "outputs": [],
   "source": [
- "# Encode image with ViT\n",
- "vit_inputs = vit_extractor(images=image, return_tensors=\"np\")\n",
- "vit_outputs = vit_sess.run(None, dict(vit_inputs))\n",
- "\n",
- "# The output for vit-base is usually [batch, sequence_length, hidden_size]\n",
- "# For image similarity, we typically use the CLS token (index 0)\n",
- "vit_embeds = vit_outputs[0][0][0]\n",
- "print(f\"ViT Embedding Shape: {vit_embeds.shape}\")\n",
- "print(f\"First 5 values: {vit_embeds[:5]}\")"
+ "inputs = vit_extractor(images=image, return_tensors=\"np\")\n",
+ "outputs = vit_sess.run(None, dict(inputs))\n",
+ "\n",
+ "# For ViT, the first output [0] is the last_hidden_state.\n",
+ "# We typically use the first token (CLS token) as the image representation.\n",
+ "cls_embedding = outputs[0][0][0]\n",
+ "\n",
+ "print(f\"ViT Embedding Shape: {cls_embedding.shape}\")\n",
+ "print(f\"First 10 values: {cls_embedding[:10]}\")"
   ]
  },
 ],
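The final cell takes the CLS token as the image representation. For image-to-image search, two such embeddings are usually compared with cosine similarity, and mean-pooling the patch tokens is a common alternative to the CLS token. Below is a self-contained sketch using random stand-ins shaped like the ViT output (`[1, 197, 768]` for `vit-base-patch16-224`); in the notebook the arrays would come from `vit_sess.run(...)` for two different images.

```python
import numpy as np

def cosine_similarity(a, b):
    # Cosine similarity between two 1-D embedding vectors.
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Random stand-ins for two ViT last_hidden_state outputs: [batch, tokens, hidden].
rng = np.random.default_rng(0)
hidden_a = rng.normal(size=(1, 197, 768))
hidden_b = rng.normal(size=(1, 197, 768))

cls_a, cls_b = hidden_a[0, 0], hidden_b[0, 0]    # CLS token, as in the notebook
mean_a = hidden_a[0].mean(axis=0)                # mean-pooled alternative
mean_b = hidden_b[0].mean(axis=0)

print(f"CLS  similarity: {cosine_similarity(cls_a, cls_b):.4f}")
print(f"Mean similarity: {cosine_similarity(mean_a, mean_b):.4f}")
```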
 