ViCLIP-OT: The First Foundation Vision-Language Model for Vietnamese Image-Text Retrieval with Optimal Transport

ViCLIP-OT is the first foundation vision-language model specifically designed for Vietnamese image-text retrieval.

ViCLIP-OT models combine CLIP-style contrastive learning with a novel Similarity-Graph Regularized Optimal Transport (SIGROT) loss to enhance cross-modal alignment and reduce the modality gap. ViCLIP-OT achieves state-of-the-art performance on Vietnamese image-text retrieval benchmarks with strong zero-shot generalization.

Model Variants

Variant	Contrastive Loss	OT Loss	Params	Hugging Face
ViCLIP-OT (you are here)	CLIP	SIGROT	221M	minhnguyent546/ViCLIP-OT
ViSigLIP-OT	SigLIP	SIGROT	221M	minhnguyent546/ViSigLIP-OT

Model Overview

ViCLIP-OT is a dual-encoder vision-language model with 221M parameters. The text encoder is based on Vietnamese-SBERT, and the image encoder uses a ViT-B/16 backbone pre-trained with the DINOv3 framework. The model is trained with a hybrid objective that combines CLIP-style contrastive learning and the proposed Similarity-Graph Regularized Optimal Transport (SIGROT) loss.
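To make the contrastive half of the hybrid objective concrete, the following is a minimal pure-Python sketch of a symmetric CLIP-style (InfoNCE) loss over a batch similarity matrix. It is an illustration under stated assumptions, not the training code from this repository: the function names and the temperature of 0.07 are chosen here for the example, and the SIGROT term is not shown because its formulation is not given in this card.

```python
import math


def log_softmax(row):
    """Numerically stable log-softmax of a list of floats."""
    m = max(row)
    log_z = m + math.log(sum(math.exp(x - m) for x in row))
    return [x - log_z for x in row]


def clip_contrastive_loss(sim, temperature=0.07):
    """Symmetric InfoNCE loss for an N x N image-text similarity matrix.

    sim[i][j] is the cosine similarity between image i and caption j.
    Matching pairs are assumed to lie on the diagonal.
    The temperature value is an assumption made for this sketch.
    """
    n = len(sim)
    logits = [[s / temperature for s in row] for row in sim]
    logits_t = [list(col) for col in zip(*logits)]  # caption-to-image view
    loss_i2t = -sum(log_softmax(logits[i])[i] for i in range(n)) / n
    loss_t2i = -sum(log_softmax(logits_t[j])[j] for j in range(n)) / n
    return 0.5 * (loss_i2t + loss_t2i)


# Toy batches: one where matching pairs score highest, one where they do not.
aligned = [[0.9, 0.1], [0.2, 0.8]]
shuffled = [[0.1, 0.9], [0.8, 0.2]]
loss_aligned = clip_contrastive_loss(aligned)
loss_shuffled = clip_contrastive_loss(shuffled)
print(f'aligned: {loss_aligned:.4f}, shuffled: {loss_shuffled:.4f}')
```

The loss is small when each image is most similar to its own caption and grows when the highest scores fall off the diagonal, which is the behavior the contrastive term rewards during training.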

Feature	Text Encoder	Image Encoder
Base Model	Vietnamese-SBERT	DINOv3-ViT-B/16
Parameters	135M	86M
Input Specification	256 tokens (max)	224 x 224 pixels
Pooling Strategy	Mean pooling	Global average pooling
Output Dimension	768	768
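
The pooling row of the table can be illustrated as follows. This is a simplified sketch, assuming that mean pooling on the text side averages token vectors while skipping padded positions, and that global average pooling on the image side is the same average taken over all patch tokens. The helper names `mean_pool` and `l2_normalize` are hypothetical and are not part of the model's API.

```python
import math


def mean_pool(token_embeddings, attention_mask):
    """Average token vectors, ignoring positions where the mask is 0."""
    dim = len(token_embeddings[0])
    total = [0.0] * dim
    count = 0
    for vec, keep in zip(token_embeddings, attention_mask):
        if keep:
            count += 1
            for d in range(dim):
                total[d] += vec[d]
    count = max(count, 1)  # avoid division by zero for an all-padding input
    return [t / count for t in total]


def l2_normalize(vec):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]


# Three token vectors of dimension 2; the last one is padding and is skipped.
tokens = [[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]]
mask = [1, 1, 0]
pooled = mean_pool(tokens, mask)

# Image side: every patch token is kept, so the mask is all ones.
patches = [[0.0, 2.0], [2.0, 0.0]]
image_vec = mean_pool(patches, [1] * len(patches))

print(pooled, l2_normalize(pooled), image_vec)
```

In the real model both pooled vectors have 768 dimensions, which is what allows text and image embeddings to be compared directly.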

Intended Uses

ViCLIP-OT is a multimodal embedding model that encodes both Vietnamese text and images into a shared representation space. It can be used for:

Vietnamese image-text retrieval (text-to-image and image-to-text), as well as image-to-image retrieval
Cross-modal semantic search in Vietnamese
Feature extraction for downstream Vietnamese vision-language tasks (e.g., visual question answering, image classification)

Usage

via transformers

pip install \
    'transformers>=4.57.0,<5.0.0' \
    'torch>=2.8.0,<2.10.0' \
    'torchvision>=0.23.0,<0.25.0' \
    timm \
    pillow

from transformers import AutoModel, AutoProcessor
import torch

# Initialize the model
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
print(f'Using device: {device}')
model = AutoModel.from_pretrained('minhnguyent546/ViCLIP-OT', trust_remote_code=True)
model.to(device)

# Example images and sentences
image_uris = [
    'http://images.cocodataset.org/train2014/COCO_train2014_000000138621.jpg',
    'http://images.cocodataset.org/train2014/COCO_train2014_000000190580.jpg',
]
sentences = [
    'Một con mèo màu trắng',
    'Một con mèo màu đen',
    'Một cô gái đang lướt sóng',
]

# Encode text
text_embeddings = model.encode_text(
    sentences=sentences,
    batch_size=32,
    show_progress_bar=True,
    convert_to_tensor=True,
    normalize=True,
    padding=True,
    truncation=True,
    max_length=256,  # the text encoder accepts at most 256 tokens (see Model Overview)
)

# Encode images
image_embeddings = model.encode_image(
    images=image_uris,
    batch_size=32,
    show_progress_bar=True,
    convert_to_tensor=True,
    normalize=True,
)

# Compute cosine similarity between image and text embeddings
similarities = image_embeddings @ text_embeddings.T
print(similarities)

# tensor([[0.2438, 0.1506, 0.7248],
#         [0.4299, 0.5287, 0.2329]])
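
Because the embeddings are normalized, each entry of the matrix above is a cosine similarity, with one row per image and one column per sentence. A small sketch of turning the printed scores into retrieval rankings is given below. It uses plain Python lists holding the values from the example output, and the helper `rank_by_score` is a hypothetical name introduced for illustration.

```python
# Scores copied from the example output above (rows: images, columns: sentences).
similarities = [
    [0.2438, 0.1506, 0.7248],
    [0.4299, 0.5287, 0.2329],
]
sentences = [
    'Một con mèo màu trắng',
    'Một con mèo màu đen',
    'Một cô gái đang lướt sóng',
]


def rank_by_score(scores):
    """Return candidate indices sorted from highest to lowest score."""
    return sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)


# Image-to-text: rank the sentences for each image (one row per image).
image_to_text = [rank_by_score(row) for row in similarities]

# Text-to-image: rank the images for each sentence (one column per sentence).
columns = [list(col) for col in zip(*similarities)]
text_to_image = [rank_by_score(col) for col in columns]

for i, ranking in enumerate(image_to_text):
    print(f'image {i}: best caption = {sentences[ranking[0]]!r}')
```

With these scores, the first image is matched to the surfing caption and the second image to the black-cat caption.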

via sentence-transformers

pip install \
    'transformers>=4.57.0,<5.0.0' \
    'sentence-transformers>=4.0.0' \
    'torch>=2.8.0,<2.10.0' \
    'torchvision>=0.23.0,<0.25.0' \
    timm \
    pillow

from sentence_transformers import SentenceTransformer

# Initialize the model
model = SentenceTransformer('minhnguyent546/ViCLIP-OT', trust_remote_code=True)

# Example images and sentences
image_uris = [
    'http://images.cocodataset.org/train2014/COCO_train2014_000000138621.jpg',
    'http://images.cocodataset.org/train2014/COCO_train2014_000000190580.jpg',
]
sentences = [
    'Một con mèo màu trắng',
    'Một con mèo màu đen',
    'Một cô gái đang lướt sóng',
]

# Encode text
text_embeddings = model.encode(
    sentences,
    batch_size=32,
    show_progress_bar=True,
    convert_to_tensor=True,
    normalize_embeddings=True,
)

# Encode images
image_embeddings = model.encode(
    image_uris,
    batch_size=32,
    show_progress_bar=True,
    convert_to_tensor=True,
    normalize_embeddings=True,
)

# Compute cosine similarity between image and text embeddings
similarities = image_embeddings @ text_embeddings.T
print(similarities)

# tensor([[0.2438, 0.1506, 0.7248],
#         [0.4299, 0.5287, 0.2329]])

Training Details

ViCLIP-OT is trained on UIT-OpenViIC, a large-scale open-domain Vietnamese image captioning dataset of diverse real-world scenes. Its training split contains 9,088 images with approximately 42,000 captions.

For more details, please refer to the GitHub repository.

Evaluation Results

Image-Text Retrieval on UIT-OpenViIC

The table below summarizes retrieval performance on the UIT-OpenViIC test set. Both ViCLIP-OT and ViSigLIP-OT substantially outperform the pretrained multilingual vision-language models, which are evaluated in a zero-shot setting.

**Table:** Image-text retrieval performance on the test set of the UIT-OpenViIC dataset. T→I denotes text-to-image retrieval and I→T denotes image-to-text retrieval. UOT denotes Unbalanced Optimal Transport. * indicates zero-shot evaluation.
Method/Model	# Params	T→I R@1	T→I R@5	T→I R@10	I→T R@1	I→T R@5	I→T R@10	Avg.
mSigLIP-base*	370M	14.34	28.94	36.21	20.49	32.23	37.43	28.27
Jina CLIP v2*	865M	30.01	52.09	61.70	40.23	65.02	74.41	53.91
Jina Embedding v4*	4B	23.97	42.22	50.29	41.48	66.77	75.61	50.06
Qwen3-VL-Embedding-2B*	2B	32.13	54.00	62.93	39.83	66.52	77.01	55.40
CLIP	221M	31.19	59.80	71.23	46.60	75.53	85.19	61.59
SigLIP	221M	34.75	63.01	72.96	50.10	79.78	88.04	64.77
CLIP + UOT	221M	29.27	57.62	69.07	43.59	75.03	84.03	59.77
SigLIP + UOT	221M	37.84	65.30	74.98	53.95	80.95	88.81	66.97
SIGROT	221M	40.75	70.72	80.90	37.99	61.11	71.68	60.53
ViCLIP-OT (Ours)	221M	37.57	65.65	75.43	54.35	81.83	89.19	67.34
ViSigLIP-OT (Ours)	221M	39.19	66.71	76.04	57.21	83.83	90.79	68.96
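
For readers unfamiliar with the R@K columns, the sketch below shows one common way to compute Recall@K from a query-by-candidate similarity matrix. It assumes the usual convention that a query counts as a hit when at least one of its ground-truth candidates appears among the top K results; the exact evaluation protocol used for the table is not described in this card, and the function name `recall_at_k` is hypothetical.

```python
def recall_at_k(sim, relevant, ks=(1, 5, 10)):
    """Compute Recall@K as a percentage of queries.

    sim[q][c]   : similarity of candidate c to query q
    relevant[q] : set of candidate indices that are correct for query q
    """
    hits = {k: 0 for k in ks}
    for q, row in enumerate(sim):
        ranked = sorted(range(len(row)), key=lambda c: row[c], reverse=True)
        for k in ks:
            # A hit if any ground-truth candidate is within the top k.
            if relevant[q] & set(ranked[:k]):
                hits[k] += 1
    return {k: 100.0 * hits[k] / len(sim) for k in ks}


# Toy example: 3 queries, 3 candidates, one correct candidate per query.
toy_sim = [
    [0.9, 0.1, 0.3],  # correct candidate 0 is ranked 1st
    [0.2, 0.4, 0.8],  # correct candidate 1 is ranked 2nd
    [0.5, 0.6, 0.1],  # correct candidate 2 is ranked 3rd
]
toy_relevant = [{0}, {1}, {2}]
toy_scores = recall_at_k(toy_sim, toy_relevant, ks=(1, 2, 3))
print(toy_scores)
```

For text-to-image retrieval the queries are captions and the candidates are images; for image-to-text retrieval the roles are swapped, and an image may have several correct captions in its `relevant` set.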

Zero-shot image–text retrieval results on KTVIC and Crossmodal-3600

The table below reports zero-shot retrieval results on KTVIC (with near-duplicate images removed against the UIT-OpenViIC training set) and Crossmodal-3600 (using Vietnamese captions).

**Table:** Zero-shot image–text retrieval results on KTVIC and Crossmodal-3600. T→I denotes text-to-image retrieval and I→T denotes image-to-text retrieval. KTVIC images are deduplicated against the UIT-OpenViIC training set. Vietnamese captions are used for Crossmodal-3600.
Method	T→I R@1	T→I R@5	T→I R@10	I→T R@1	I→T R@5	I→T R@10	Avg.
KTVIC – train
CLIP	21.12	46.99	59.22	31.65	59.46	72.49	48.49
SigLIP	23.16	48.78	60.57	35.48	62.22	73.64	50.64
ViCLIP-OT	26.24	52.46	64.14	38.47	64.37	75.48	53.52
ViSigLIP-OT	26.28	52.58	63.49	39.62	66.44	77.78	54.37
KTVIC – test
CLIP	50.32	82.80	89.94	63.06	92.36	97.45	79.32
SigLIP	52.61	83.31	89.94	71.97	94.27	96.18	81.38
ViCLIP-OT	56.69	85.61	91.97	70.06	93.63	98.09	82.68
ViSigLIP-OT	56.56	85.99	91.72	71.34	93.63	97.45	82.78
Crossmodal-3600
CLIP	22.52	45.55	58.01	26.22	53.42	65.06	45.13
SigLIP	26.67	50.31	61.78	31.17	57.78	69.83	49.59
ViCLIP-OT	28.90	55.29	66.37	42.56	68.81	79.17	56.85
ViSigLIP-OT	32.04	57.90	68.95	37.97	64.64	75.53	56.17
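
The card does not state how the Avg. column is computed, but the reported values are consistent with a plain mean of the six recall figures in each row. The short check below reproduces that arithmetic for two rows of the Crossmodal-3600 block; `average_recall` is a hypothetical helper name.

```python
def average_recall(text_to_image, image_to_text):
    """Mean of the six recall values (R@1, R@5, R@10 in both directions)."""
    values = list(text_to_image) + list(image_to_text)
    return round(sum(values) / len(values), 2)


# Values copied from the Crossmodal-3600 rows of the table above.
viclip_ot_avg = average_recall((28.90, 55.29, 66.37), (42.56, 68.81, 79.17))
visiglip_ot_avg = average_recall((32.04, 57.90, 68.95), (37.97, 64.64, 75.53))
print(viclip_ot_avg, visiglip_ot_avg)
```

For the ViCLIP-OT row this gives 56.85 and for the ViSigLIP-OT row 56.17, both matching the Avg. column.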

License

This project is licensed under the GNU General Public License v3.0 (GPL-3.0).

Citation

If you find ViCLIP-OT useful in your research, please cite the following paper:

@misc{tran2026viclipotfoundationvisionlanguagemodel,
  title={ViCLIP-OT: The First Foundation Vision-Language Model for Vietnamese Image-Text Retrieval with Optimal Transport}, 
  author={Quoc-Khang Tran and Minh-Thien Nguyen and Nguyen-Khang Pham},
  year={2026},
  eprint={2602.22678},
  archivePrefix={arXiv},
  primaryClass={cs.CV},
  url={https://arxiv.org/abs/2602.22678}, 
}