Zero-Shot Image Classification
OpenCLIP
ONNX
English
clip
mobileclip2
mobileclip
image-text-retrieval
qualcomm
qai-hub
lpcv
Update README.md
README.md
@@ -11,4 +11,233 @@ pipeline_tag: zero-shot-image-classification
tags:
- clip
- mobileclip2
- mobileclip
- image-text-retrieval
- onnx
- qualcomm
- qai-hub
- lpcv
language:
- en
---

# 2026LPCV-Track1-MobileCLIP2-B-Best

`jn12/2026LPCV-Track1-MobileCLIP2-B-Best` is the ONNX export of the current best `MobileCLIP2-B` checkpoint from this LPCV 2026 Track 1 image-to-text retrieval project.

The full project code is available here:

`https://github.com/jn12-29/LPCV-Track1-EfficientAI`

That repository contains the complete model training pipeline, along with dataset preparation, ONNX export, local evaluation, and deployment-oriented evaluation code.

This model repository provides separate image and text encoders in ONNX format so they can be evaluated locally with ONNX Runtime or compiled further for Qualcomm device workflows.

## Model overview

- Base architecture: `MobileCLIP2-B`
- Task: image-to-text retrieval
- Export format: ONNX
- Runtime target: local ONNX evaluation and Qualcomm deployment flow

## Repository contents

This repository currently provides the exported encoder files:

- `image_encoder.onnx`
- `image_encoder.onnx.data`
- `text_encoder.onnx`
- `text_encoder.onnx.data`

Each `.onnx.data` file stores the external weights for the matching `.onnx` graph, so keep each pair in the same directory. These files can be consumed directly by the local evaluation pipeline in the project repository.

## Download

```bash
hf download jn12/2026LPCV-Track1-MobileCLIP2-B-Best \
  --local-dir ./pretrained/2026LPCV-Track1-MobileCLIP2-B-Best
```

Expected local layout:

```text
pretrained/2026LPCV-Track1-MobileCLIP2-B-Best/
├── image_encoder.onnx
├── image_encoder.onnx.data
├── text_encoder.onnx
└── text_encoder.onnx.data
```

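As a quick sanity check after downloading, the expected layout above can be verified before any model is loaded. The following is a minimal sketch; the helper name `missing_files` is hypothetical and not part of the project code.

```python
from pathlib import Path

# File names taken from the expected local layout.
EXPECTED_FILES = (
    "image_encoder.onnx",
    "image_encoder.onnx.data",
    "text_encoder.onnx",
    "text_encoder.onnx.data",
)


def missing_files(model_dir: str) -> list[str]:
    """Return the expected file names that are absent from model_dir."""
    root = Path(model_dir)
    return [name for name in EXPECTED_FILES if not (root / name).is_file()]


if __name__ == "__main__":
    missing = missing_files("./pretrained/2026LPCV-Track1-MobileCLIP2-B-Best")
    if missing:
        print("Missing files:", ", ".join(missing))
    else:
        print("All encoder files are present.")
```

An empty list means both encoders and their external weight files are in place.
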
## Quick usage

### Evaluate locally with ONNX Runtime

Install dependencies:

```bash
pip install onnxruntime pillow numpy torch torchvision transformers
hf download openai/clip-vit-base-patch32
```

Run evaluation with plain ONNX Runtime:

```python
from pathlib import Path

import numpy as np
import onnxruntime as ort
from PIL import Image
from torchvision import transforms
from transformers import CLIPTokenizer


MODEL_DIR = Path("./pretrained/2026LPCV-Track1-MobileCLIP2-B-Best")
IMAGE_PATHS = [
    "examples/image1.jpg",
    "examples/image2.jpg",
]
TEXTS = [
    "a red bus on the street",
    "a group of people near a building",
    "a dog running on grass",
]


def preprocess_image(image_path: str) -> np.ndarray:
    # Resize to 224x224 and scale pixels to [0, 1]; no mean/std normalization.
    transform = transforms.Compose(
        [
            transforms.Resize((224, 224)),
            transforms.ToTensor(),
        ]
    )
    image = Image.open(image_path).convert("RGB")
    image_tensor = transform(image).unsqueeze(0)  # shape: (1, 3, 224, 224)
    return image_tensor.numpy().astype(np.float32)


def l2_normalize(x: np.ndarray) -> np.ndarray:
    # Scale each row to unit length so dot products are cosine similarities.
    return x / np.linalg.norm(x, axis=-1, keepdims=True)


def recall_at_k(image_features: np.ndarray, text_features: np.ndarray, positives, k: int) -> float:
    # Fraction of images with at least one ground-truth text in their top-k.
    similarities = image_features @ text_features.T
    topk = np.argsort(-similarities, axis=1)[:, :k]
    hits = 0
    for i, gt in enumerate(positives):
        if any(j in gt for j in topk[i]):
            hits += 1
    return hits / len(positives)


image_session = ort.InferenceSession(
    str(MODEL_DIR / "image_encoder.onnx"),
    providers=["CPUExecutionProvider"],
)
text_session = ort.InferenceSession(
    str(MODEL_DIR / "text_encoder.onnx"),
    providers=["CPUExecutionProvider"],
)

# Load the tokenizer from the local Hugging Face cache (no network access).
tokenizer = CLIPTokenizer.from_pretrained(
    "openai/clip-vit-base-patch32",
    local_files_only=True,
)
tokenizer.add_special_tokens({"cls_token": tokenizer.eos_token})

# Encode one image at a time; the graph input is named "image".
image_embeddings = []
for image_path in IMAGE_PATHS:
    image_input = preprocess_image(image_path)
    image_output = image_session.run(None, {"image": image_input})[0]
    image_embeddings.append(image_output[0])
image_embeddings = l2_normalize(np.stack(image_embeddings, axis=0))

# Encode one caption at a time; the graph input is named "text" and takes int32 ids.
text_embeddings = []
for text in TEXTS:
    token_ids = tokenizer(
        [text],
        padding="max_length",
        truncation=True,
        max_length=77,
        return_tensors="np",
    )["input_ids"].astype(np.int32)
    text_output = text_session.run(None, {"text": token_ids})[0]
    text_embeddings.append(text_output[0])
text_embeddings = l2_normalize(np.stack(text_embeddings, axis=0))

# Example ground-truth mapping:
# image 0 matches text 0, image 1 matches text 1.
positive_text_indices = [{0}, {1}]

r_at_1 = recall_at_k(image_embeddings, text_embeddings, positive_text_indices, k=1)
r_at_2 = recall_at_k(image_embeddings, text_embeddings, positive_text_indices, k=2)

print(f"Recall@1: {r_at_1:.4f}")
print(f"Recall@2: {r_at_2:.4f}")
```

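Beyond aggregate recall, it can help to see which captions each image actually retrieves. The following is a minimal sketch, assuming L2-normalized embedding arrays like `image_embeddings` and `text_embeddings` from the evaluation script. `rank_texts` is a hypothetical helper, and the toy 2-D vectors are made up for illustration rather than produced by the encoders.

```python
import numpy as np


def rank_texts(image_features: np.ndarray, text_features: np.ndarray, k: int) -> np.ndarray:
    """Return, for each image, the indices of its k most similar texts."""
    # For L2-normalized rows the dot product equals cosine similarity.
    similarities = image_features @ text_features.T
    # Sort each row by descending similarity and keep the first k columns.
    return np.argsort(-similarities, axis=1)[:, :k]


# Toy unit vectors standing in for real encoder outputs (illustrative only).
toy_images = np.array([[1.0, 0.0], [0.0, 1.0]])
toy_texts = np.array([[0.6, 0.8], [1.0, 0.0], [0.0, 1.0]])

top2 = rank_texts(toy_images, toy_texts, k=2)
print(top2.tolist())  # [[1, 0], [2, 0]]
```

With real embeddings, indexing `TEXTS` with each row of the result gives the retrieved captions in rank order.
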
## Preprocessing and tokenization

This repository follows the preprocessing used by the project codebase:

- images are resized to `224x224`
- pixel values are scaled to `[0, 1]` by dividing by `255`
- ImageNet mean/std normalization is not applied
- text tokenization uses `CLIPTokenizer` from `openai/clip-vit-base-patch32`
- token sequences use `max_length=77`

Before running local evaluation, make sure the tokenizer is available in the local Hugging Face cache:

```bash
hf download openai/clip-vit-base-patch32
```

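The image steps listed above can also be sketched with only Pillow and NumPy, which may be convenient where `torchvision` is not installed. This is an illustrative sketch, not the project's code: `preprocess_pil` is a hypothetical helper, and the choice of bilinear resampling is an assumption that should be checked against the project's own pipeline before comparing scores.

```python
import numpy as np
from PIL import Image


def preprocess_pil(image: Image.Image, size: int = 224) -> np.ndarray:
    """Resize to size x size, scale to [0, 1], and return a (1, 3, H, W) float32 array."""
    # Bilinear resampling is an assumption; no mean/std normalization is applied.
    resized = image.convert("RGB").resize((size, size), Image.BILINEAR)
    pixels = np.asarray(resized, dtype=np.float32) / 255.0  # (H, W, 3) in [0, 1]
    chw = np.transpose(pixels, (2, 0, 1))  # channels first
    return np.ascontiguousarray(chw[np.newaxis, ...])  # add the batch dimension


# Synthetic solid-color image so the example needs no file on disk.
demo = Image.new("RGB", (640, 480), color=(255, 0, 128))
batch = preprocess_pil(demo)
print(batch.shape, batch.dtype)
```

The resulting array has the same layout and dtype as the input fed to the image encoder in the evaluation script.
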
## Training context

The exported ONNX files come from the LPCV 2026 Track 1 training workflow built around:

- `MobileCLIP2-B` as the base model
- contrastive JSONL training data with positives and hard negatives
- local PyTorch fine-tuning
- ONNX export for deployment-oriented evaluation

The corresponding image-source dataset is available at:

`https://huggingface.co/datasets/jn12/VG100K4CL`

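To illustrate how positives and hard negatives can enter a contrastive objective, here is a minimal NumPy sketch of an InfoNCE-style loss for a single image. This is a generic formulation chosen for illustration; this README does not state the project's actual loss, and the function name, the temperature value, and the toy vectors are all hypothetical.

```python
import numpy as np


def info_nce_loss(image_vec, positive_vec, negative_vecs, temperature=0.07):
    """InfoNCE-style loss for one image, one positive caption, and N hard negatives.

    All inputs are assumed to be L2-normalized embedding vectors.
    """
    # Stack the candidates so that the positive caption is row 0.
    candidates = np.vstack([positive_vec[np.newaxis, :], negative_vecs])
    logits = candidates @ image_vec / temperature
    logits = logits - logits.max()  # stabilize the softmax numerically
    log_probs = logits - np.log(np.exp(logits).sum())
    return float(-log_probs[0])  # negative log-likelihood of the positive caption


# Toy unit vectors (illustrative only): a harder negative lies closer to the image.
image = np.array([1.0, 0.0])
positive = np.array([1.0, 0.0])
easy_negatives = np.array([[0.0, 1.0]])
hard_negatives = np.array([[0.8, 0.6]])

loss_easy = info_nce_loss(image, positive, easy_negatives)
loss_hard = info_nce_loss(image, positive, hard_negatives)
print(loss_easy < loss_hard)  # True
```

The hard negative yields the larger loss, which is why such examples tend to provide a stronger training signal than random negatives.
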
## Intended use

Use this model if you want to:

- reproduce local ONNX evaluation from this repository
- benchmark the exported retrieval model
- integrate the encoders into a deployment pipeline

This repository is not intended to be a generic sentence-embedding model release or a universal CLIP drop-in replacement.

## Citation

If you use this model, please cite the Hugging Face repository and the project code:

Authors:

`Hui Xie, Jinyang Du, Jiacheng Wang, Xiaoze Ge, Fengjun Zhong, Yejun Zeng, Ruihao Gong#, Xiaoning Liu, Shenghao Jin, Jinyang Guo#, Xianglong Liu`

```bibtex
@misc{mobileclip2b_lpcv2026,
  title = {2026LPCV-Track1-MobileCLIP2-B-Best},
  author = {Hui Xie and Jinyang Du and Jiacheng Wang and Xiaoze Ge and Fengjun Zhong and Yejun Zeng and Ruihao Gong and Xiaoning Liu and Shenghao Jin and Jinyang Guo and Xianglong Liu},
  year = {2026},
  howpublished = {\url{https://huggingface.co/jn12/2026LPCV-Track1-MobileCLIP2-B-Best}}
}
```

Project repository:

`https://github.com/jn12-29/LPCV-Track1-EfficientAI`