jn12 committed on
Commit 3063e4e · verified · 1 Parent(s): 3b15e79

Update README.md

  1. README.md +230 -1
README.md CHANGED
@@ -11,4 +11,233 @@ pipeline_tag: zero-shot-image-classification
  tags:
  - clip
  - mobileclip2
- ---
+ - mobileclip
+ - image-text-retrieval
+ - onnx
+ - qualcomm
+ - qai-hub
+ - lpcv
+ language:
+ - en
+ ---
+
+ # 2026LPCV-Track1-MobileCLIP2-B-Best
+
+ `jn12/2026LPCV-Track1-MobileCLIP2-B-Best` is the ONNX export of the current best `MobileCLIP2-B` checkpoint from the LPCV 2026 Track 1 image-to-text retrieval project.
+
+ The full project code is available here:
+
+ `https://github.com/jn12-29/LPCV-Track1-EfficientAI`
+
+ That repository contains the complete model training pipeline, together with dataset preparation, ONNX export, local evaluation, and deployment-oriented evaluation code.
+
+ This repository provides separate image and text encoders in ONNX format, so they can be evaluated locally with ONNX Runtime or compiled further for Qualcomm device workflows.
+
+ ## Model overview
+
+ - Base architecture: `MobileCLIP2-B`
+ - Task: image-to-text retrieval
+ - Export format: ONNX
+ - Runtime target: local ONNX evaluation and Qualcomm deployment flow
+
+
+ ## Repository contents
+
+ This repository currently provides the exported encoder files:
+
+ - `image_encoder.onnx`
+ - `image_encoder.onnx.data`
+ - `text_encoder.onnx`
+ - `text_encoder.onnx.data`
+
+ These files can be consumed directly by the local evaluation pipeline in the project repository.
+
+ ## Download
+
+ ```bash
+ hf download jn12/2026LPCV-Track1-MobileCLIP2-B-Best \
+   --local-dir ./pretrained/2026LPCV-Track1-MobileCLIP2-B-Best
+ ```
+
+ Expected local layout:
+
+ ```text
+ pretrained/2026LPCV-Track1-MobileCLIP2-B-Best/
+ ├── image_encoder.onnx
+ ├── image_encoder.onnx.data
+ ├── text_encoder.onnx
+ └── text_encoder.onnx.data
+ ```
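
A quick way to confirm the download matches the layout above is to check for the four files before creating any ONNX Runtime session. The sketch below is illustrative only: `EXPECTED_FILES` and `missing_model_files` are hypothetical names, not part of the project code. It assumes the `.onnx.data` files hold externally stored weights that must sit next to their `.onnx` graph files, which is how ONNX external data is normally resolved.

```python
from pathlib import Path

# File names taken from the expected local layout above.
EXPECTED_FILES = (
    "image_encoder.onnx",
    "image_encoder.onnx.data",
    "text_encoder.onnx",
    "text_encoder.onnx.data",
)


def missing_model_files(model_dir):
    """Return the expected file names that are absent from model_dir."""
    root = Path(model_dir)
    return [name for name in EXPECTED_FILES if not (root / name).is_file()]


if __name__ == "__main__":
    # Assumed download location, matching the --local-dir used above.
    missing = missing_model_files("./pretrained/2026LPCV-Track1-MobileCLIP2-B-Best")
    print("missing files:", missing or "none")
```

An empty result means all four files are present in the same directory; it does not validate their contents.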
+
+ ## Quick usage
+
+ ### Evaluate locally with ONNX Runtime
+
+ Install dependencies:
+
+ ```bash
+ pip install onnxruntime pillow numpy torch torchvision transformers
+ hf download openai/clip-vit-base-patch32
+ ```
+
+ Run evaluation with plain ONNX Runtime:
+
+ ```python
+ from pathlib import Path
+
+ import numpy as np
+ import onnxruntime as ort
+ from PIL import Image
+ from torchvision import transforms
+ from transformers import CLIPTokenizer
+
+
+ MODEL_DIR = Path("./pretrained/2026LPCV-Track1-MobileCLIP2-B-Best")
+ IMAGE_PATHS = [
+     "examples/image1.jpg",
+     "examples/image2.jpg",
+ ]
+ TEXTS = [
+     "a red bus on the street",
+     "a group of people near a building",
+     "a dog running on grass",
+ ]
+
+
+ def preprocess_image(image_path: str) -> np.ndarray:
+     # Resize to 224x224 and scale to [0, 1]; no mean/std normalization.
+     transform = transforms.Compose(
+         [
+             transforms.Resize((224, 224)),
+             transforms.ToTensor(),
+         ]
+     )
+     image = Image.open(image_path).convert("RGB")
+     image_tensor = transform(image).unsqueeze(0)  # add the batch dimension
+     return image_tensor.numpy().astype(np.float32)
+
+
+ def l2_normalize(x: np.ndarray) -> np.ndarray:
+     # Unit-length rows make the dot product equal to cosine similarity.
+     return x / np.linalg.norm(x, axis=-1, keepdims=True)
+
+
+ def recall_at_k(image_features: np.ndarray, text_features: np.ndarray, positives, k: int) -> float:
+     # Fraction of images with at least one ground-truth text in the top-k.
+     similarities = image_features @ text_features.T
+     topk = np.argsort(-similarities, axis=1)[:, :k]
+     hits = 0
+     for i, gt in enumerate(positives):
+         if any(j in gt for j in topk[i]):
+             hits += 1
+     return hits / len(positives)
+
+
+ image_session = ort.InferenceSession(
+     str(MODEL_DIR / "image_encoder.onnx"),
+     providers=["CPUExecutionProvider"],
+ )
+ text_session = ort.InferenceSession(
+     str(MODEL_DIR / "text_encoder.onnx"),
+     providers=["CPUExecutionProvider"],
+ )
+
+ # local_files_only=True requires the tokenizer to be in the local cache.
+ tokenizer = CLIPTokenizer.from_pretrained(
+     "openai/clip-vit-base-patch32",
+     local_files_only=True,
+ )
+ tokenizer.add_special_tokens({"cls_token": tokenizer.eos_token})
+
+ # Encode images one at a time, then normalize the stacked embeddings.
+ image_embeddings = []
+ for image_path in IMAGE_PATHS:
+     image_input = preprocess_image(image_path)
+     image_output = image_session.run(None, {"image": image_input})[0]
+     image_embeddings.append(image_output[0])
+ image_embeddings = l2_normalize(np.stack(image_embeddings, axis=0))
+
+ # Encode texts as int32 token ids padded or truncated to 77 tokens.
+ text_embeddings = []
+ for text in TEXTS:
+     token_ids = tokenizer(
+         [text],
+         padding="max_length",
+         truncation=True,
+         max_length=77,
+         return_tensors="pt",
+     )["input_ids"].numpy().astype(np.int32)
+     text_output = text_session.run(None, {"text": token_ids})[0]
+     text_embeddings.append(text_output[0])
+ text_embeddings = l2_normalize(np.stack(text_embeddings, axis=0))
+
+ # Example ground-truth mapping:
+ # image 0 matches text 0, image 1 matches text 1.
+ positive_text_indices = [{0}, {1}]
+
+ r_at_1 = recall_at_k(image_embeddings, text_embeddings, positive_text_indices, k=1)
+ r_at_2 = recall_at_k(image_embeddings, text_embeddings, positive_text_indices, k=2)
+
+ print(f"Recall@1: {r_at_1:.4f}")
+ print(f"Recall@2: {r_at_2:.4f}")
+ ```
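
To see how Recall@K behaves without loading the encoders, the same metric can be run on hand-made vectors. This is a dependency-free sketch: the 2-D embeddings and the ground-truth sets are invented for illustration and are not model outputs, and `rank_texts` is a hypothetical helper that does not appear in the script above.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length so dot products are cosine similarities."""
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec]


def rank_texts(image_vec, text_vecs):
    """Return text indices sorted by descending similarity to image_vec."""
    scores = [sum(a * b for a, b in zip(image_vec, t)) for t in text_vecs]
    return sorted(range(len(text_vecs)), key=lambda j: -scores[j])


def recall_at_k(image_vecs, text_vecs, positives, k):
    """Fraction of images with at least one ground-truth text in the top-k."""
    hits = 0
    for image_vec, gt in zip(image_vecs, positives):
        if any(j in gt for j in rank_texts(image_vec, text_vecs)[:k]):
            hits += 1
    return hits / len(positives)


# Made-up 2-D embeddings: two images, three candidate texts.
images = [l2_normalize(v) for v in [[1.0, 0.1], [0.6, 0.8]]]
texts = [l2_normalize(v) for v in [[1.0, 0.0], [1.0, 1.0], [0.0, 1.0]]]
# Image 0 matches text 0; image 1 matches text 2.
positives = [{0}, {2}]

print(recall_at_k(images, texts, positives, k=1))  # 0.5
print(recall_at_k(images, texts, positives, k=2))  # 1.0
```

Image 1 ranks text 1 first and its true match, text 2, second, so it counts as a miss at K=1 and a hit at K=2.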
+
+
+
+ ## Preprocessing and tokenization
+
+ This repository follows the preprocessing used by the project codebase:
+
+ - images are resized to `224x224`
+ - pixel values are scaled to `[0, 1]` by dividing by `255`
+ - ImageNet mean/std normalization is not applied
+ - text tokenization uses `CLIPTokenizer` from `openai/clip-vit-base-patch32`
+ - token sequences use `max_length=77`
+
+ Before running local evaluation, make sure the tokenizer is available in the local Hugging Face cache:
+
+ ```bash
+ hf download openai/clip-vit-base-patch32
+ ```
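
The image-side rules listed in this section amount to a layout change plus a division by 255, with no mean/std step afterwards. A minimal pure-Python sketch on a tiny image follows. It assumes the input is already resized and stored as height x width x RGB integers. `to_chw_unit_range` is a hypothetical name used only here; the quick-usage script gets the same effect from `transforms.ToTensor()`.

```python
def to_chw_unit_range(hwc_pixels):
    """Convert H x W x 3 integers in 0..255 to 3 x H x W floats in [0, 1].

    No mean/std normalization is applied afterwards, matching the list above.
    """
    height = len(hwc_pixels)
    width = len(hwc_pixels[0])
    return [
        [[hwc_pixels[y][x][c] / 255.0 for x in range(width)] for y in range(height)]
        for c in range(3)
    ]


# A made-up 1 x 2 image: one red-ish pixel and one green-ish pixel.
tiny_image = [[[255, 0, 51], [0, 255, 102]]]
chw = to_chw_unit_range(tiny_image)
print(len(chw), len(chw[0]), len(chw[0][0]))  # channels, height, width
```

Adding a leading batch dimension to this channel-first array gives the `1 x 3 x H x W` shape that the quick-usage script passes to the image encoder.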
+
+ ## Training context
+
+ The exported ONNX files come from the LPCV 2026 Track 1 training workflow built around:
+
+ - `MobileCLIP2-B` as the base model
+ - contrastive JSONL training data with positives and hard negatives
+ - local PyTorch fine-tuning
+ - ONNX export for deployment-oriented evaluation
+
+ The corresponding image-source dataset is available at:
+
+ `https://huggingface.co/datasets/jn12/VG100K4CL`
+
+ ## Intended use
+
+ Use this model if you want to:
+
+ - reproduce local ONNX evaluation from this repository
+ - benchmark the exported retrieval model
+ - integrate the encoders into a deployment pipeline
+
+ This repository is not intended to be a generic sentence-embedding model release or a universal CLIP drop-in replacement.
+
+
+
+ ## Citation
+
+ If you use this model, please cite the Hugging Face repository and the project code:
+
+ Authors:
+
+ `Hui Xie, Jinyang Du, Jiacheng Wang, Xiaoze Ge, Fengjun Zhong, Yejun Zeng, Ruihao Gong#, Xiaoning Liu, Shenghao Jin, Jinyang Guo#, Xianglong Liu`
+
+ ```bibtex
+ @misc{mobileclip2b_lpcv2026,
+   title = {2026LPCV-Track1-MobileCLIP2-B-Best},
+   author = {Hui Xie and Jinyang Du and Jiacheng Wang and Xiaoze Ge and Fengjun Zhong and Yejun Zeng and Ruihao Gong and Xiaoning Liu and Shenghao Jin and Jinyang Guo and Xianglong Liu},
+   year = {2026},
+   howpublished = {\url{https://huggingface.co/jn12/2026LPCV-Track1-MobileCLIP2-B-Best}}
+ }
+ ```
+
+ Project repository:
+
+ `https://github.com/jn12-29/LPCV-Track1-EfficientAI`