2026LPCV-Track1-MobileCLIP2-B-Best

jn12/2026LPCV-Track1-MobileCLIP2-B-Best is the ONNX export of the current best MobileCLIP2-B checkpoint from the LPCV 2026 Track 1 image-to-text retrieval project.

The full project code is available here:

https://github.com/jn12-29/LPCV-Track1-EfficientAI

That repository contains the complete model training pipeline, together with dataset preparation, ONNX export, local evaluation, and deployment-oriented evaluation code.

This model repository provides separate image and text encoders in ONNX format, so they can be evaluated locally with ONNX Runtime or compiled further for Qualcomm device workflows.

Model overview

  • Base architecture: MobileCLIP2-B
  • Task: image-to-text retrieval
  • Export format: ONNX
  • Runtime target: local ONNX evaluation and Qualcomm deployment flow

Repository contents

This repository currently provides exported encoder files:

  • image_encoder.onnx
  • image_encoder.onnx.data
  • text_encoder.onnx
  • text_encoder.onnx.data

These files can be consumed directly by the local evaluation pipeline in the project repository.

Download

hf download jn12/2026LPCV-Track1-MobileCLIP2-B-Best \
  --local-dir ./pretrained/2026LPCV-Track1-MobileCLIP2-B-Best

Expected local layout:

pretrained/2026LPCV-Track1-MobileCLIP2-B-Best/
├── image_encoder.onnx
├── image_encoder.onnx.data
├── text_encoder.onnx
└── text_encoder.onnx.data
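
The layout above can be checked before any session is created. The following is a minimal sketch, not part of the project code: the helper name missing_model_files is hypothetical, and it assumes each .onnx.data file holds externally stored weights that must sit beside its .onnx graph.

```python
from pathlib import Path

# File names taken from the expected local layout.
EXPECTED_FILES = (
    "image_encoder.onnx",
    "image_encoder.onnx.data",
    "text_encoder.onnx",
    "text_encoder.onnx.data",
)


def missing_model_files(model_dir):
    """Return the expected file names that are absent from model_dir.

    Hypothetical helper for illustration; it only checks that the files
    exist, not that their contents are valid.
    """
    root = Path(model_dir)
    return [name for name in EXPECTED_FILES if not (root / name).is_file()]
```

Calling missing_model_files("./pretrained/2026LPCV-Track1-MobileCLIP2-B-Best") should return an empty list after a complete download. A non-empty list names the files to fetch again.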

Quick usage

Evaluate locally with ONNX Runtime

Install dependencies:

pip install onnxruntime pillow numpy torch torchvision transformers
hf download openai/clip-vit-base-patch32

Run evaluation with plain ONNX Runtime:

from pathlib import Path

import numpy as np
import onnxruntime as ort
import torch
from PIL import Image
from torchvision import transforms
from transformers import CLIPTokenizer


MODEL_DIR = Path("./pretrained/2026LPCV-Track1-MobileCLIP2-B-Best")
IMAGE_PATHS = [
    "examples/image1.jpg",
    "examples/image2.jpg",
]
TEXTS = [
    "a red bus on the street",
    "a group of people near a building",
    "a dog running on grass",
]


def preprocess_image(image_path: str) -> np.ndarray:
    transform = transforms.Compose(
        [
            transforms.Resize((224, 224)),
            transforms.ToTensor(),
        ]
    )
    image = Image.open(image_path).convert("RGB")
    image_tensor = transform(image).unsqueeze(0)
    return image_tensor.numpy().astype(np.float32)


def l2_normalize(x: np.ndarray) -> np.ndarray:
    return x / np.linalg.norm(x, axis=-1, keepdims=True)


def recall_at_k(image_features: np.ndarray, text_features: np.ndarray, positives, k: int) -> float:
    similarities = image_features @ text_features.T
    topk = np.argsort(-similarities, axis=1)[:, :k]
    hits = 0
    for i, gt in enumerate(positives):
        if any(j in gt for j in topk[i]):
            hits += 1
    return hits / len(positives)


image_session = ort.InferenceSession(
    str(MODEL_DIR / "image_encoder.onnx"),
    providers=["CPUExecutionProvider"],
)
text_session = ort.InferenceSession(
    str(MODEL_DIR / "text_encoder.onnx"),
    providers=["CPUExecutionProvider"],
)

tokenizer = CLIPTokenizer.from_pretrained(
    "openai/clip-vit-base-patch32",
    local_files_only=True,
)
tokenizer.add_special_tokens({"cls_token": tokenizer.eos_token})

image_embeddings = []
for image_path in IMAGE_PATHS:
    image_input = preprocess_image(image_path)
    image_output = image_session.run(None, {"image": image_input})[0]
    image_embeddings.append(image_output[0])
image_embeddings = l2_normalize(np.stack(image_embeddings, axis=0))

text_embeddings = []
for text in TEXTS:
    token_ids = tokenizer(
        [text],
        padding="max_length",
        truncation=True,
        max_length=77,
        return_tensors="pt",
    )["input_ids"].numpy().astype(np.int32)
    text_output = text_session.run(None, {"text": token_ids})[0]
    text_embeddings.append(text_output[0])
text_embeddings = l2_normalize(np.stack(text_embeddings, axis=0))

# Example ground-truth mapping:
# image 0 matches text 0, image 1 matches text 1.
positive_text_indices = [{0}, {1}]

r_at_1 = recall_at_k(image_embeddings, text_embeddings, positive_text_indices, k=1)
r_at_2 = recall_at_k(image_embeddings, text_embeddings, positive_text_indices, k=2)

print(f"Recall@1: {r_at_1:.4f}")
print(f"Recall@2: {r_at_2:.4f}")
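
The script above feeds the sessions under the input names image and text. If an export uses different names, session.run will reject the feed, so it can help to list what a session actually expects. This is a small sketch with a hypothetical helper name (describe_io). It relies only on the get_inputs() and get_outputs() methods of an ONNX Runtime InferenceSession.

```python
def describe_io(session):
    """Summarize a session's inputs and outputs as (name, type, shape) tuples.

    `session` is expected to be an onnxruntime.InferenceSession; any object
    exposing get_inputs() and get_outputs() with the same attributes works.
    """
    def summarize(nodes):
        return [(node.name, node.type, list(node.shape)) for node in nodes]

    return {
        "inputs": summarize(session.get_inputs()),
        "outputs": summarize(session.get_outputs()),
    }
```

For example, describe_io(image_session) and describe_io(text_session) show the names, element types, and shapes to match when building the feed dictionaries.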

Preprocessing and tokenization

This repository follows the preprocessing used by the project codebase:

  • images are resized to 224x224
  • pixel values are scaled to [0, 1] by dividing by 255
  • ImageNet mean/std normalization is not applied
  • text tokenization uses CLIPTokenizer from openai/clip-vit-base-patch32
  • token sequences use max_length=77

Before running local evaluation, make sure the tokenizer is available in the local Hugging Face cache:

hf download openai/clip-vit-base-patch32

Training context

The exported ONNX files come from the LPCV 2026 Track 1 training workflow built around:

  • MobileCLIP2-B as the base model
  • contrastive JSONL training data with positives and hard negatives
  • local PyTorch fine-tuning
  • ONNX export for deployment-oriented evaluation
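
To make the role of positives and hard negatives concrete, here is a generic contrastive objective for a single image. It is a sketch under stated assumptions and is not confirmed to be the loss used in the project. The function name, the temperature value of 0.07, and the idea that hard negatives are incorrect captions scored against the same image are all illustrative choices.

```python
import numpy as np


def info_nce_with_hard_negatives(image_emb, pos_text_emb, neg_text_embs, temperature=0.07):
    """Cross-entropy of the positive caption against hard negatives for one image.

    image_emb: (d,), pos_text_emb: (d,), neg_text_embs: (n, d).
    All embeddings are assumed to be L2-normalized already.
    """
    # Row 0 is the positive caption; the remaining rows are hard negatives.
    candidates = np.vstack([pos_text_emb[np.newaxis, :], neg_text_embs])
    logits = candidates @ image_emb / temperature
    logits = logits - logits.max()  # subtract the max for numerical stability
    log_probs = logits - np.log(np.exp(logits).sum())
    return float(-log_probs[0])
```

The loss is small when the image is much closer to its positive caption than to every hard negative, and it grows as a negative caption approaches or overtakes the positive.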

The corresponding image-source dataset is available at:

https://huggingface.co/datasets/jn12/VG100K4CL

Intended use

Use this model if you want to:

  • reproduce local ONNX evaluation from this repository
  • benchmark the exported retrieval model
  • integrate the encoders into a deployment pipeline
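
For the benchmarking use case, a simple wall-clock timer is often enough for a first local comparison. The helper below is a hypothetical sketch. The name benchmark, the warm-up count, and the choice to report the median and 90th percentile are assumptions, and the numbers it produces on a host CPU say nothing about on-device latency.

```python
import statistics
import time


def benchmark(fn, warmup=3, runs=20):
    """Call fn() repeatedly and return (median_ms, p90_ms) over `runs` timed calls.

    The first `warmup` calls are executed but not timed.
    """
    for _ in range(warmup):
        fn()
    timings_ms = []
    for _ in range(runs):
        start = time.perf_counter()
        fn()
        timings_ms.append((time.perf_counter() - start) * 1000.0)
    timings_ms.sort()
    # Index of the 90th percentile, clamped to the last element.
    p90 = timings_ms[min(len(timings_ms) - 1, int(0.9 * len(timings_ms)))]
    return statistics.median(timings_ms), p90
```

With the sessions from the quick-usage script, a call could look like benchmark(lambda: image_session.run(None, {"image": image_input})), where image_input is a preprocessed array.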

This repository is not intended to be a generic sentence-embedding model release or a universal CLIP drop-in replacement.

Citation

If you use this model, please cite the Hugging Face repository and the project code:

Authors:

Hui Xie, Jinyang Du, Jiacheng Wang, Xiaoze Ge, Fengjun Zhong, Yejun Zeng, Ruihao Gong#, Xiaoning Liu, Shenghao Jin, Jinyang Guo#, Xianglong Liu

@misc{mobileclip2b_lpcv2026,
  title        = {2026LPCV-Track1-MobileCLIP2-B-Best},
  author       = {Hui Xie and Jinyang Du and Jiacheng Wang and Xiaoze Ge and Fengjun Zhong and Yejun Zeng and Ruihao Gong and Xiaoning Liu and Shenghao Jin and Jinyang Guo and Xianglong Liu},
  year         = {2026},
  howpublished = {\url{https://huggingface.co/jn12/2026LPCV-Track1-MobileCLIP2-B-Best}}
}

Project repository:

https://github.com/jn12-29/LPCV-Track1-EfficientAI
