ViCLIP-OT — The First Foundation Vision-Language Model for Vietnamese Image–Text Retrieval with Optimal Transport
ViCLIP-OT is the first foundation vision-language model specifically designed for Vietnamese image-text retrieval.
ViCLIP-OT models combine CLIP-style contrastive learning with a novel Similarity-Graph Regularized Optimal Transport (SIGROT) loss to enhance cross-modal alignment and reduce the modality gap. ViCLIP-OT achieves state-of-the-art performance on Vietnamese image-text retrieval benchmarks with strong zero-shot generalization.
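The CLIP-style contrastive term mentioned above can be sketched as follows. This is a generic symmetric InfoNCE loss in plain Python, not the authors' training code, and the SIGROT term is not reproduced because this card does not specify it. The function names, the toy similarity matrices, and the temperature of 0.07 are illustrative assumptions.

```python
import math


def log_softmax(row):
    """Numerically stable log-softmax over a list of floats."""
    m = max(row)
    log_z = m + math.log(sum(math.exp(x - m) for x in row))
    return [x - log_z for x in row]


def clip_contrastive_loss(sim, temperature=0.07):
    """Symmetric InfoNCE loss for a square similarity matrix.

    sim[i][j] is the cosine similarity between image i and caption j;
    matching pairs are assumed to lie on the diagonal.
    """
    n = len(sim)
    logits = [[s / temperature for s in row] for row in sim]
    # Image-to-text direction: each row should select its own column.
    i2t = -sum(log_softmax(logits[i])[i] for i in range(n)) / n
    # Text-to-image direction: each column should select its own row.
    cols = [[logits[i][j] for i in range(n)] for j in range(n)]
    t2i = -sum(log_softmax(cols[j])[j] for j in range(n)) / n
    return 0.5 * (i2t + t2i)


# Toy batches (assumed values): the first has its matches on the diagonal,
# the second has the first two pairs swapped.
aligned = [[0.9, 0.1, 0.0], [0.2, 0.8, 0.1], [0.0, 0.1, 0.7]]
shuffled = [[0.1, 0.9, 0.0], [0.8, 0.2, 0.1], [0.0, 0.1, 0.7]]
loss_aligned = clip_contrastive_loss(aligned)
loss_shuffled = clip_contrastive_loss(shuffled)
print(loss_aligned, loss_shuffled)
```

The loss is lower when matching pairs score highest in both directions, which is the behavior a contrastive objective rewards.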
Model Variants
| Variant | Contrastive Loss | OT Loss | Params | Hugging Face |
|---|---|---|---|---|
| ViCLIP-OT (you are here) | CLIP | SIGROT | 221M | minhnguyent546/ViCLIP-OT |
| ViSigLIP-OT | SigLIP | SIGROT | 221M | minhnguyent546/ViSigLIP-OT |
Model Overview
ViCLIP-OT is a dual-encoder vision-language model with 221M parameters. The text encoder is based on Vietnamese-SBERT, and the image encoder uses a ViT-B/16 backbone pre-trained with the DINOv3 framework. The model is trained with a hybrid objective that combines CLIP-style contrastive learning and the proposed Similarity-Graph Regularized Optimal Transport (SIGROT) loss.
| Feature | Text Encoder | Image Encoder |
|---|---|---|
| Base Model | Vietnamese-SBERT | DINOv3-ViT-B/16 |
| Parameters | 135M | 86M |
| Input Specification | 256 tokens (max) | 224 x 224 pixels |
| Pooling Strategy | Mean pooling | Global average pooling |
| Output Dimension | 768 | 768 |
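As a rough illustration of the pooling strategies listed in the table, the sketch below averages token vectors while skipping padding positions. It is a plain-Python stand-in, not the model's implementation; `mean_pool` and the 4-dimensional toy vectors are hypothetical (the real outputs are 768-dimensional). Global average pooling over image patches is the same operation with every position kept.

```python
def mean_pool(token_embeddings, attention_mask):
    """Average token vectors, ignoring padding positions (mask == 0)."""
    dim = len(token_embeddings[0])
    total = [0.0] * dim
    count = 0
    for vec, keep in zip(token_embeddings, attention_mask):
        if keep:
            count += 1
            for d in range(dim):
                total[d] += vec[d]
    if count == 0:
        raise ValueError("attention_mask has no active tokens")
    return [t / count for t in total]


# Three tokens of dimension 4; the last one is padding and must not count.
tokens = [[1.0, 2.0, 3.0, 4.0], [3.0, 2.0, 1.0, 0.0], [9.0, 9.0, 9.0, 9.0]]
mask = [1, 1, 0]
pooled = mean_pool(tokens, mask)
print(pooled)  # [2.0, 2.0, 2.0, 2.0]
```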
Intended Uses
ViCLIP-OT is a multimodal embedding model that encodes both Vietnamese text and images into a shared representation space. It can be used for:
- Vietnamese image-text retrieval (text-to-image and image-to-text), as well as image-to-image retrieval
- Cross-modal semantic search in Vietnamese
- Feature extraction for downstream Vietnamese vision-language tasks (e.g., visual question answering, image classification)
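Because text and images share one embedding space, the retrieval uses listed above reduce to a nearest-neighbor search by cosine similarity. A minimal sketch, assuming the embeddings are already computed; `top_k` and the 3-dimensional vectors are hypothetical placeholders for the model's outputs:

```python
import math


def l2_normalize(vec):
    """Scale a non-zero vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def top_k(query, gallery, k=2):
    """Return (index, cosine similarity) for the k gallery items closest to query."""
    q = l2_normalize(query)
    scores = []
    for idx, item in enumerate(gallery):
        g = l2_normalize(item)
        scores.append((idx, sum(a * b for a, b in zip(q, g))))
    scores.sort(key=lambda pair: pair[1], reverse=True)
    return scores[:k]


# Hypothetical embeddings: the query could be a text vector and the gallery
# image vectors (text-to-image), or both could be image vectors.
gallery = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.7, 0.7, 0.0]]
query = [0.9, 0.1, 0.0]
hits = top_k(query, gallery, k=2)
print(hits)
```

For large galleries, an approximate nearest-neighbor index would normally replace the linear scan shown here.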
Usage
via transformers
pip install \
    'transformers>=4.57.0,<5.0.0' \
    'torch>=2.8.0,<2.10.0' \
    'torchvision>=0.23.0,<0.25.0' \
    timm \
    pillow
from transformers import AutoModel, AutoProcessor
import torch
# Initialize the model
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
print(f'Using device: {device}')
model = AutoModel.from_pretrained('minhnguyent546/ViCLIP-OT', trust_remote_code=True)
model.to(device)
# Example images and sentences
image_uris = [
    'http://images.cocodataset.org/train2014/COCO_train2014_000000138621.jpg',
    'http://images.cocodataset.org/train2014/COCO_train2014_000000190580.jpg',
]
sentences = [
    'Một con mèo màu trắng',  # "A white cat"
    'Một con mèo màu đen',  # "A black cat"
    'Một cô gái đang lướt sóng',  # "A girl surfing"
]
# Encode text
text_embeddings = model.encode_text(
    sentences=sentences,
    batch_size=32,
    show_progress_bar=True,
    convert_to_tensor=True,
    normalize=True,
    padding=True,
    truncation=True,
    max_length=256,  # matches the 256-token limit listed under Model Overview
)
# Encode images
image_embeddings = model.encode_image(
    images=image_uris,
    batch_size=32,
    show_progress_bar=True,
    convert_to_tensor=True,
    normalize=True,
)
# Compute cosine similarity between image and text embeddings
similarities = image_embeddings @ text_embeddings.T
print(similarities)
# tensor([[0.2438, 0.1506, 0.7248],
# [0.4299, 0.5287, 0.2329]])
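To read the matrix printed above, note that each row is an image and each column a sentence, so the best caption for an image is the column with the highest score. The sketch below copies the printed values into plain Python lists so that it runs without the model; the variable names are illustrative.

```python
sentences = [
    "Một con mèo màu trắng",      # "A white cat"
    "Một con mèo màu đen",        # "A black cat"
    "Một cô gái đang lướt sóng",  # "A girl surfing"
]
# Values copied from the example output (rows = images, columns = sentences).
similarities = [
    [0.2438, 0.1506, 0.7248],
    [0.4299, 0.5287, 0.2329],
]

best_matches = []
for image_idx, row in enumerate(similarities):
    # Index of the highest-scoring sentence for this image.
    best = max(range(len(row)), key=lambda j: row[j])
    best_matches.append(best)
    print(f"image {image_idx}: {sentences[best]!r} (score {row[best]:.4f})")
```

With these values, the first image is matched to the surfing caption and the second to the black-cat caption.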
via sentence-transformers
pip install \
    'transformers>=4.57.0,<5.0.0' \
    'sentence-transformers>=4.0.0' \
    'torch>=2.8.0,<2.10.0' \
    'torchvision>=0.23.0,<0.25.0' \
    timm \
    pillow
from sentence_transformers import SentenceTransformer
# Initialize the model
model = SentenceTransformer('minhnguyent546/ViCLIP-OT', trust_remote_code=True)
# Example images and sentences
image_uris = [
    'http://images.cocodataset.org/train2014/COCO_train2014_000000138621.jpg',
    'http://images.cocodataset.org/train2014/COCO_train2014_000000190580.jpg',
]
sentences = [
    'Một con mèo màu trắng',  # "A white cat"
    'Một con mèo màu đen',  # "A black cat"
    'Một cô gái đang lướt sóng',  # "A girl surfing"
]
# Encode text
text_embeddings = model.encode(
    sentences,
    batch_size=32,
    show_progress_bar=True,
    convert_to_tensor=True,
    normalize_embeddings=True,
)
# Encode images
image_embeddings = model.encode(
    image_uris,
    batch_size=32,
    show_progress_bar=True,
    convert_to_tensor=True,
    normalize_embeddings=True,
)
# Compute cosine similarity between image and text embeddings
similarities = image_embeddings @ text_embeddings.T
print(similarities)
# tensor([[0.2438, 0.1506, 0.7248],
# [0.4299, 0.5287, 0.2329]])
Training Details
ViCLIP-OT is trained on UIT-OpenViIC, a large-scale open-domain Vietnamese image captioning dataset. Its training split contains 9,088 images of diverse real-world scenes, paired with approximately 42,000 captions.
For more details, please refer to the GitHub repository.
Evaluation Results
Image-Text Retrieval on UIT-OpenViIC
The table below summarizes retrieval performance on the UIT-OpenViIC test set. Rows marked with * are pretrained multilingual vision-language models evaluated in a zero-shot setting; both ViCLIP-OT and ViSigLIP-OT substantially outperform them.
| Method/Model | # Params | Text → Image R@1 | Text → Image R@5 | Text → Image R@10 | Image → Text R@1 | Image → Text R@5 | Image → Text R@10 | Avg. |
|---|---|---|---|---|---|---|---|---|
| mSigLIP-base* | 370M | 14.34 | 28.94 | 36.21 | 20.49 | 32.23 | 37.43 | 28.27 |
| Jina CLIP v2* | 865M | 30.01 | 52.09 | 61.70 | 40.23 | 65.02 | 74.41 | 53.91 |
| Jina Embedding v4* | 4B | 23.97 | 42.22 | 50.29 | 41.48 | 66.77 | 75.61 | 50.06 |
| Qwen3-VL-Embedding-2B* | 2B | 32.13 | 54.00 | 62.93 | 39.83 | 66.52 | 77.01 | 55.40 |
| CLIP | 221M | 31.19 | 59.80 | 71.23 | 46.60 | 75.53 | 85.19 | 61.59 |
| SigLIP | 221M | 34.75 | 63.01 | 72.96 | 50.10 | 79.78 | 88.04 | 64.77 |
| CLIP + UOT | 221M | 29.27 | 57.62 | 69.07 | 43.59 | 75.03 | 84.03 | 59.77 |
| SigLIP + UOT | 221M | 37.84 | 65.30 | 74.98 | 53.95 | 80.95 | 88.81 | 66.97 |
| SIGROT | 221M | 40.75 | 70.72 | 80.90 | 37.99 | 61.11 | 71.68 | 60.53 |
| ViCLIP-OT (Ours) | 221M | 37.57 | 65.65 | 75.43 | 54.35 | 81.83 | 89.19 | 67.34 |
| ViSigLIP-OT (Ours) | 221M | 39.19 | 66.71 | 76.04 | 57.21 | 83.83 | 90.79 | 68.96 |
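For readers unfamiliar with the R@K columns, the sketch below shows one common way to compute Recall@K from a similarity matrix: a query counts as a hit if any of its correct items appears among the K highest-scoring candidates. This is an assumed convention rather than the authors' evaluation script, and `recall_at_k` and the toy scores are hypothetical.

```python
def recall_at_k(sim, relevant, k):
    """Percentage of queries with at least one relevant item in the top k.

    sim[q][c] scores candidate c for query q; relevant[q] is the set of
    candidate indices counted as correct for query q.
    """
    hits = 0
    for q, row in enumerate(sim):
        ranked = sorted(range(len(row)), key=lambda c: row[c], reverse=True)
        if relevant[q] & set(ranked[:k]):
            hits += 1
    return 100.0 * hits / len(sim)


# Toy image -> text setup: 2 images, 4 captions (two captions per image).
sim = [
    [0.9, 0.2, 0.3, 0.1],  # image 0: correct captions are 0 and 1
    [0.6, 0.1, 0.5, 0.4],  # image 1: correct captions are 2 and 3
]
relevant = [{0, 1}, {2, 3}]
r1 = recall_at_k(sim, relevant, 1)
r2 = recall_at_k(sim, relevant, 2)
print(r1, r2)  # 50.0 100.0
```

For the text-to-image direction, the same function applies with captions as queries and a single correct image per caption.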
Zero-Shot Image-Text Retrieval on KTVIC and Crossmodal-3600
The table below reports zero-shot retrieval results on KTVIC (with near-duplicate images removed against the UIT-OpenViIC training set) and Crossmodal-3600 (using Vietnamese captions).
| Dataset | Method | Text → Image R@1 | Text → Image R@5 | Text → Image R@10 | Image → Text R@1 | Image → Text R@5 | Image → Text R@10 | Avg. |
|---|---|---|---|---|---|---|---|---|
| KTVIC – train | CLIP | 21.12 | 46.99 | 59.22 | 31.65 | 59.46 | 72.49 | 48.49 |
| KTVIC – train | SigLIP | 23.16 | 48.78 | 60.57 | 35.48 | 62.22 | 73.64 | 50.64 |
| KTVIC – train | ViCLIP-OT | 26.24 | 52.46 | 64.14 | 38.47 | 64.37 | 75.48 | 53.52 |
| KTVIC – train | ViSigLIP-OT | 26.28 | 52.58 | 63.49 | 39.62 | 66.44 | 77.78 | 54.37 |
| KTVIC – test | CLIP | 50.32 | 82.80 | 89.94 | 63.06 | 92.36 | 97.45 | 79.32 |
| KTVIC – test | SigLIP | 52.61 | 83.31 | 89.94 | 71.97 | 94.27 | 96.18 | 81.38 |
| KTVIC – test | ViCLIP-OT | 56.69 | 85.61 | 91.97 | 70.06 | 93.63 | 98.09 | 82.68 |
| KTVIC – test | ViSigLIP-OT | 56.56 | 85.99 | 91.72 | 71.34 | 93.63 | 97.45 | 82.78 |
| Crossmodal-3600 | CLIP | 22.52 | 45.55 | 58.01 | 26.22 | 53.42 | 65.06 | 45.13 |
| Crossmodal-3600 | SigLIP | 26.67 | 50.31 | 61.78 | 31.17 | 57.78 | 69.83 | 49.59 |
| Crossmodal-3600 | ViCLIP-OT | 28.90 | 55.29 | 66.37 | 42.56 | 68.81 | 79.17 | 56.85 |
| Crossmodal-3600 | ViSigLIP-OT | 32.04 | 57.90 | 68.95 | 37.97 | 64.64 | 75.53 | 56.17 |
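The Avg. column appears to be the arithmetic mean of the six recall values in a row. The short check below, using the ViCLIP-OT row for Crossmodal-3600, is consistent with that reading; `average_recall` is a hypothetical helper name.

```python
def average_recall(t2i, i2t):
    """Mean of the six recall values (R@1/5/10 in both directions)."""
    values = list(t2i) + list(i2t)
    return round(sum(values) / len(values), 2)


# ViCLIP-OT on Crossmodal-3600, copied from the table above.
avg = average_recall((28.90, 55.29, 66.37), (42.56, 68.81, 79.17))
print(avg)  # 56.85
```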
License
This project is licensed under the GNU General Public License v3.0 (GPL-3.0).
Citation
If you find ViCLIP-OT useful in your research, please cite the following paper:
@misc{tran2026viclipotfoundationvisionlanguagemodel,
title={ViCLIP-OT: The First Foundation Vision-Language Model for Vietnamese Image-Text Retrieval with Optimal Transport},
author={Quoc-Khang Tran and Minh-Thien Nguyen and Nguyen-Khang Pham},
year={2026},
eprint={2602.22678},
archivePrefix={arXiv},
primaryClass={cs.CV},
url={https://arxiv.org/abs/2602.22678},
}
Model tree for minhnguyent546/ViCLIP-OT
Base model: keepitreal/vietnamese-sbert