ViCLIP-OT: The First Foundation Vision-Language Model for Vietnamese Image-Text Retrieval with Optimal Transport

ViCLIP-OT is the first foundation vision-language model specifically designed for Vietnamese image-text retrieval.

ViCLIP-OT models combine CLIP-style contrastive learning with a novel Similarity-Graph Regularized Optimal Transport (SIGROT) loss to enhance cross-modal alignment and reduce the modality gap. ViCLIP-OT achieves state-of-the-art performance on Vietnamese image-text retrieval benchmarks with strong zero-shot generalization.

Model Variants

| Variant | Contrastive Loss | OT Loss | Params | Hugging Face |
|---|---|---|---|---|
| ViCLIP-OT (you are here) | CLIP | SIGROT | 221M | minhnguyent546/ViCLIP-OT |
| ViSigLIP-OT | SigLIP | SIGROT | 221M | minhnguyent546/ViSigLIP-OT |
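
The two variants differ in their contrastive term. As a rough illustration of the SigLIP-style option, the sketch below scores every image-caption pair independently with a sigmoid, treating matched pairs as positives and all other pairs as negatives. The function names, the `temperature` and `bias` values, and the toy similarity matrices are assumptions for illustration and are not taken from the released training code.

```python
import math

def log_sigmoid(x):
    # Numerically stable log(sigmoid(x)).
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))

def siglip_loss(sim, temperature=10.0, bias=-10.0):
    """Pairwise sigmoid loss over an N x N image-caption similarity matrix."""
    n = len(sim)
    total = 0.0
    for i in range(n):
        for j in range(n):
            label = 1.0 if i == j else -1.0  # matched pairs lie on the diagonal
            total += log_sigmoid(label * (temperature * sim[i][j] + bias))
    return -total / n

# Hypothetical similarities: perfectly aligned vs. mismatched batches.
aligned = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
shuffled = [[0.0, 1.0, 0.0], [0.0, 0.0, 1.0], [1.0, 0.0, 0.0]]
print(siglip_loss(aligned), siglip_loss(shuffled))
```

The loss is much lower for the aligned batch than for the shuffled one. Unlike a softmax-based loss, each pair contributes on its own, with no normalization across the batch.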

Model Overview

ViCLIP-OT is a dual-encoder vision-language model with 221M parameters. The text encoder is based on Vietnamese-SBERT, and the image encoder uses a ViT-B/16 backbone pre-trained with the DINOv3 framework. The model is trained with a hybrid objective that combines CLIP-style contrastive learning and the proposed Similarity-Graph Regularized Optimal Transport (SIGROT) loss.
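
For readers unfamiliar with the CLIP-style term of this objective, here is a minimal sketch of a symmetric softmax contrastive loss over a batch similarity matrix. The function names, the `temperature` value, and the toy matrix are illustrative assumptions; the SIGROT term and the way the two losses are weighted are described in the paper and are not shown.

```python
import math

def log_softmax(row):
    # Subtract the max before exponentiating for numerical stability.
    m = max(row)
    lse = m + math.log(sum(math.exp(x - m) for x in row))
    return [x - lse for x in row]

def clip_loss(sim, temperature=0.07):
    """Symmetric cross-entropy over an N x N image-caption similarity matrix."""
    n = len(sim)
    logits = [[s / temperature for s in row] for row in sim]
    logits_t = [list(col) for col in zip(*logits)]  # caption-to-image view
    # The correct match for row i is column i in both directions.
    image_to_text = -sum(log_softmax(logits[i])[i] for i in range(n)) / n
    text_to_image = -sum(log_softmax(logits_t[i])[i] for i in range(n)) / n
    return 0.5 * (image_to_text + text_to_image)

# Hypothetical cosine similarities (rows: images, columns: captions).
aligned = [[0.8, 0.1, 0.2],
           [0.2, 0.7, 0.1],
           [0.1, 0.3, 0.9]]
uninformative = [[0.5] * 3 for _ in range(3)]
print(clip_loss(aligned), clip_loss(uninformative))
```

When all similarities are equal, the loss equals log(N), the value for a uniform guess. It drops toward zero as matched pairs become more similar than mismatched ones.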

| Feature | Text Encoder | Image Encoder |
|---|---|---|
| Base Model | Vietnamese-SBERT | DINOv3-ViT-B/16 |
| Parameters | 135M | 86M |
| Input Specification | 256 tokens (max) | 224 x 224 pixels |
| Pooling Strategy | Mean pooling | Global average pooling |
| Output Dimension | 768 | 768 |
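
As a small illustration of the pooling step on the text side, the sketch below averages token embeddings while skipping padded positions. That the model's mean pooling is mask-aware in this way is an assumption based on common sentence-embedding practice; `mean_pool` and the toy 2-dimensional vectors are hypothetical stand-ins for the real 768-dimensional outputs.

```python
def mean_pool(token_embeddings, attention_mask):
    """Average token vectors, ignoring positions where the mask is 0."""
    dim = len(token_embeddings[0])
    kept = [vec for vec, m in zip(token_embeddings, attention_mask) if m == 1]
    return [sum(vec[d] for vec in kept) / len(kept) for d in range(dim)]

# Three token vectors; the last one is padding and must not affect the result.
tokens = [[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]]
mask = [1, 1, 0]
sentence_embedding = mean_pool(tokens, mask)
print(sentence_embedding)
```

Global average pooling on the image side follows the same idea, except that the average is taken over the patch tokens and no padding mask is needed.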

Intended Uses

ViCLIP-OT is a multimodal embedding model that encodes both Vietnamese text and images into a shared representation space. It can be used for:

  • Vietnamese image-text retrieval (text-to-image and image-to-text), as well as image-to-image retrieval
  • Cross-modal semantic search in Vietnamese
  • Feature extraction for downstream Vietnamese vision-language tasks (e.g., visual question answering, image classification)

Usage

via transformers

```bash
pip install \
    'transformers>=4.57.0,<5.0.0' \
    'torch>=2.8.0,<2.10.0' \
    'torchvision>=0.23.0,<0.25.0' \
    timm \
    pillow
```

```python
from transformers import AutoModel, AutoProcessor
import torch

# Initialize the model
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
print(f'Using device: {device}')
model = AutoModel.from_pretrained('minhnguyent546/ViCLIP-OT', trust_remote_code=True)
model.to(device)

# Example images and sentences
image_uris = [
    'http://images.cocodataset.org/train2014/COCO_train2014_000000138621.jpg',
    'http://images.cocodataset.org/train2014/COCO_train2014_000000190580.jpg',
]
sentences = [
    'Một con mèo màu trắng',
    'Một con mèo màu đen',
    'Một cô gái đang lướt sóng',
]

# Encode text
text_embeddings = model.encode_text(
    sentences=sentences,
    batch_size=32,
    show_progress_bar=True,
    convert_to_tensor=True,
    normalize=True,
    padding=True,
    truncation=True,
    max_length=512,
)

# Encode images
image_embeddings = model.encode_image(
    images=image_uris,
    batch_size=32,
    show_progress_bar=True,
    convert_to_tensor=True,
    normalize=True,
)

# Compute cosine similarity between image and text embeddings
similarities = image_embeddings @ text_embeddings.T
print(similarities)

# tensor([[0.2438, 0.1506, 0.7248],
#         [0.4299, 0.5287, 0.2329]])
```

via sentence-transformers

```bash
pip install \
    'transformers>=4.57.0,<5.0.0' \
    'sentence-transformers>=4.0.0' \
    'torch>=2.8.0,<2.10.0' \
    'torchvision>=0.23.0,<0.25.0' \
    timm \
    pillow
```

```python
from sentence_transformers import SentenceTransformer

# Initialize the model
model = SentenceTransformer('minhnguyent546/ViCLIP-OT', trust_remote_code=True)

# Example images and sentences
image_uris = [
    'http://images.cocodataset.org/train2014/COCO_train2014_000000138621.jpg',
    'http://images.cocodataset.org/train2014/COCO_train2014_000000190580.jpg',
]
sentences = [
    'Một con mèo màu trắng',
    'Một con mèo màu đen',
    'Một cô gái đang lướt sóng',
]

# Encode text
text_embeddings = model.encode(
    sentences,
    batch_size=32,
    show_progress_bar=True,
    convert_to_tensor=True,
    normalize_embeddings=True,
)

# Encode images
image_embeddings = model.encode(
    image_uris,
    batch_size=32,
    show_progress_bar=True,
    convert_to_tensor=True,
    normalize_embeddings=True,
)

# Compute cosine similarity between image and text embeddings
similarities = image_embeddings @ text_embeddings.T
print(similarities)

# tensor([[0.2438, 0.1506, 0.7248],
#         [0.4299, 0.5287, 0.2329]])
```
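
To turn the similarity matrix into retrieval results, take the best-scoring column for each image (image-to-text) or the best-scoring row for each sentence (text-to-image). The sketch below does this in plain Python on the values printed above, so it runs without downloading the model; the helper `argmax` and the variable names are illustrative.

```python
# Similarity values copied from the example output (rows: images, columns: sentences).
similarities = [
    [0.2438, 0.1506, 0.7248],
    [0.4299, 0.5287, 0.2329],
]
sentences = [
    'Một con mèo màu trắng',      # "A white cat"
    'Một con mèo màu đen',        # "A black cat"
    'Một cô gái đang lướt sóng',  # "A girl surfing"
]

def argmax(values):
    return max(range(len(values)), key=lambda i: values[i])

# Image-to-text: best sentence for each image.
best_caption_per_image = [argmax(row) for row in similarities]
# Text-to-image: best image for each sentence (scan the columns).
best_image_per_caption = [argmax(col) for col in zip(*similarities)]

for image_idx, caption_idx in enumerate(best_caption_per_image):
    print(f'image {image_idx} -> {sentences[caption_idx]}')
print(best_image_per_caption)
```

With these values, the first image is matched to the surfing sentence and the second image to the black-cat sentence.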

Training Details

ViCLIP-OT is trained on UIT-OpenViIC, a large-scale open-domain Vietnamese image captioning dataset. Its training split contains 9,088 images of diverse real-world scenes, paired with approximately 42,000 captions.

For more details, please refer to the GitHub repository.

Evaluation Results

Image-Text Retrieval on UIT-OpenViIC

The table below summarizes retrieval performance on the UIT-OpenViIC test set. ViCLIP-OT and ViSigLIP-OT both substantially outperform the pretrained multilingual vision-language models, which are evaluated in a zero-shot setting.

Table: Image-text retrieval performance on the test set of the UIT-OpenViIC dataset. UOT denotes Unbalanced Optimal Transport. * indicates zero-shot evaluation. Best results are in bold and second-best are underlined.

| Method/Model | # Params | Text → Image R@1 | Text → Image R@5 | Text → Image R@10 | Image → Text R@1 | Image → Text R@5 | Image → Text R@10 | Avg. |
|---|---|---|---|---|---|---|---|---|
| mSigLIP-base* | 370M | 14.34 | 28.94 | 36.21 | 20.49 | 32.23 | 37.43 | 28.27 |
| Jina CLIP v2* | 865M | 30.01 | 52.09 | 61.70 | 40.23 | 65.02 | 74.41 | 53.91 |
| Jina Embedding v4* | 4B | 23.97 | 42.22 | 50.29 | 41.48 | 66.77 | 75.61 | 50.06 |
| Qwen3-VL-Embedding-2B* | 2B | 32.13 | 54.00 | 62.93 | 39.83 | 66.52 | 77.01 | 55.40 |
| CLIP | 221M | 31.19 | 59.80 | 71.23 | 46.60 | 75.53 | 85.19 | 61.59 |
| SigLIP | 221M | 34.75 | 63.01 | 72.96 | 50.10 | 79.78 | 88.04 | 64.77 |
| CLIP + UOT | 221M | 29.27 | 57.62 | 69.07 | 43.59 | 75.03 | 84.03 | 59.77 |
| SigLIP + UOT | 221M | 37.84 | 65.30 | 74.98 | 53.95 | 80.95 | 88.81 | 66.97 |
| SIGROT | 221M | **40.75** | **70.72** | **80.90** | 37.99 | 61.11 | 71.68 | 60.53 |
| ViCLIP-OT (Ours) | 221M | 37.57 | 65.65 | 75.43 | <u>54.35</u> | <u>81.83</u> | <u>89.19</u> | <u>67.34</u> |
| ViSigLIP-OT (Ours) | 221M | <u>39.19</u> | <u>66.71</u> | <u>76.04</u> | **57.21** | **83.83** | **90.79** | **68.96** |
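
R@K (Recall@K) is the percentage of queries for which a correct item appears among the top K retrieved candidates. The sketch below shows one common way to compute it from a query-by-candidate similarity matrix. Counting a query as a hit when any of its correct candidates is retrieved is an assumption about the protocol, and `recall_at_k` and the toy scores are hypothetical rather than the evaluation code used for this table.

```python
def recall_at_k(similarities, ground_truth, k):
    """similarities[q][c] is the score of candidate c for query q.
    ground_truth[q] is the set of correct candidate indices for query q."""
    hits = 0
    for q, scores in enumerate(similarities):
        # Rank candidates from most to least similar.
        ranked = sorted(range(len(scores)), key=lambda c: scores[c], reverse=True)
        if ground_truth[q] & set(ranked[:k]):
            hits += 1
    return 100.0 * hits / len(similarities)

# Toy scores: 3 queries, 4 candidates, one correct candidate per query.
toy_scores = [
    [0.9, 0.1, 0.3, 0.2],  # correct candidate 0 is ranked first
    [0.2, 0.4, 0.8, 0.1],  # correct candidate 1 is ranked second
    [0.5, 0.6, 0.1, 0.7],  # correct candidate 2 is ranked last
]
toy_truth = [{0}, {1}, {2}]
for k in (1, 2, 4):
    print(f'R@{k} = {recall_at_k(toy_scores, toy_truth, k):.2f}')
```

Using sets for the ground truth lets the same function cover image-to-text retrieval, where one image usually has several correct captions.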

Zero-shot image–text retrieval results on KTVIC and Crossmodal-3600

The table below reports zero-shot retrieval results on KTVIC (with near-duplicate images removed against the UIT-OpenViIC training set) and Crossmodal-3600 (using Vietnamese captions).

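Table: Zero-shot image–text retrieval results on KTVIC and Crossmodal-3600. KTVIC images are deduplicated against the UIT-OpenViIC training set. Vietnamese captions are used for Crossmodal-3600.

| Dataset | Method | Text → Image R@1 | Text → Image R@5 | Text → Image R@10 | Image → Text R@1 | Image → Text R@5 | Image → Text R@10 | Avg. |
|---|---|---|---|---|---|---|---|---|
| KTVIC – train | CLIP | 21.12 | 46.99 | 59.22 | 31.65 | 59.46 | 72.49 | 48.49 |
| KTVIC – train | SigLIP | 23.16 | 48.78 | 60.57 | 35.48 | 62.22 | 73.64 | 50.64 |
| KTVIC – train | ViCLIP-OT | 26.24 | 52.46 | 64.14 | 38.47 | 64.37 | 75.48 | 53.52 |
| KTVIC – train | ViSigLIP-OT | 26.28 | 52.58 | 63.49 | 39.62 | 66.44 | 77.78 | 54.37 |
| KTVIC – test | CLIP | 50.32 | 82.80 | 89.94 | 63.06 | 92.36 | 97.45 | 79.32 |
| KTVIC – test | SigLIP | 52.61 | 83.31 | 89.94 | 71.97 | 94.27 | 96.18 | 81.38 |
| KTVIC – test | ViCLIP-OT | 56.69 | 85.61 | 91.97 | 70.06 | 93.63 | 98.09 | 82.68 |
| KTVIC – test | ViSigLIP-OT | 56.56 | 85.99 | 91.72 | 71.34 | 93.63 | 97.45 | 82.78 |
| Crossmodal-3600 | CLIP | 22.52 | 45.55 | 58.01 | 26.22 | 53.42 | 65.06 | 45.13 |
| Crossmodal-3600 | SigLIP | 26.67 | 50.31 | 61.78 | 31.17 | 57.78 | 69.83 | 49.59 |
| Crossmodal-3600 | ViCLIP-OT | 28.90 | 55.29 | 66.37 | 42.56 | 68.81 | 79.17 | 56.85 |
| Crossmodal-3600 | ViSigLIP-OT | 32.04 | 57.90 | 68.95 | 37.97 | 64.64 | 75.53 | 56.17 |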
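
The Avg. column is consistent with the arithmetic mean of the six recall values in each row. The short check below recomputes it for the Crossmodal-3600 rows; the numbers are copied from the table, and the variable names are illustrative.

```python
# (six recall values, reported Avg.) for the Crossmodal-3600 rows.
crossmodal_rows = {
    'CLIP':        ([22.52, 45.55, 58.01, 26.22, 53.42, 65.06], 45.13),
    'SigLIP':      ([26.67, 50.31, 61.78, 31.17, 57.78, 69.83], 49.59),
    'ViCLIP-OT':   ([28.90, 55.29, 66.37, 42.56, 68.81, 79.17], 56.85),
    'ViSigLIP-OT': ([32.04, 57.90, 68.95, 37.97, 64.64, 75.53], 56.17),
}

recomputed = {}
for method, (recalls, reported) in crossmodal_rows.items():
    recomputed[method] = sum(recalls) / len(recalls)
    print(f'{method}: mean = {recomputed[method]:.2f}, reported = {reported:.2f}')
```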

License

This project is licensed under the GNU General Public License v3.0 (GPL-3.0).

Citation

If you find ViCLIP-OT useful in your research, please cite the following paper:

```bibtex
@misc{tran2026viclipotfoundationvisionlanguagemodel,
  title={ViCLIP-OT: The First Foundation Vision-Language Model for Vietnamese Image-Text Retrieval with Optimal Transport},
  author={Quoc-Khang Tran and Minh-Thien Nguyen and Nguyen-Khang Pham},
  year={2026},
  eprint={2602.22678},
  archivePrefix={arXiv},
  primaryClass={cs.CV},
  url={https://arxiv.org/abs/2602.22678},
}
```