# Multimodal Search Model

A PyTorch-based multimodal model for image-text retrieval, trained on the COCO Captions dataset.
## Model Architecture

- **Image Encoder:** ViT-Base/16 (Vision Transformer)
  - Pre-trained on ImageNet
  - Output: 768-dim features
- **Text Encoder:** BERT-Base-Uncased
  - Pre-trained on an English corpus
  - Output: 768-dim features
- **Projection:** Linear layers project both modalities into a 512-dim shared embedding space
- **Training Strategy:** Frozen backbones with trainable projection heads
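The two-tower layout above can be sketched in PyTorch. The `DualEncoder` and `ProjectionHead` names and the backbone arguments are illustrative stand-ins, not this repo's actual classes; only the dimensions (768 → 512) and the frozen-backbone strategy come from the card.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ProjectionHead(nn.Module):
    """Linear projection from an encoder's native width to the shared space."""
    def __init__(self, in_dim: int = 768, embed_dim: int = 512):
        super().__init__()
        self.proj = nn.Linear(in_dim, embed_dim)

    def forward(self, x):
        # L2-normalize so cosine similarity is a plain dot product.
        return F.normalize(self.proj(x), dim=-1)

class DualEncoder(nn.Module):
    """Sketch of the two-tower setup: frozen backbones, trainable heads.

    `image_backbone` / `text_backbone` stand in for ViT-Base/16 and
    BERT-Base; both emit 768-dim features in the model described above.
    """
    def __init__(self, image_backbone, text_backbone, feat_dim=768, embed_dim=512):
        super().__init__()
        self.image_backbone = image_backbone
        self.text_backbone = text_backbone
        # Freeze both backbones, as in the training strategy above.
        for p in self.image_backbone.parameters():
            p.requires_grad = False
        for p in self.text_backbone.parameters():
            p.requires_grad = False
        self.image_head = ProjectionHead(feat_dim, embed_dim)
        self.text_head = ProjectionHead(feat_dim, embed_dim)

    def encode_image(self, images):
        return self.image_head(self.image_backbone(images))

    def encode_text(self, tokens):
        return self.text_head(self.text_backbone(tokens))
```

With this layout, only the two projection heads receive gradients, which keeps per-step cost low while the pre-trained features are reused as-is.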
## Training Details
- Dataset: COCO Captions 2017 (118K training images, 5K validation images)
- Loss Function: InfoNCE (Contrastive Loss)
- Temperature: 0.07
- Optimizer: AdamW
- Batch Size: 32
- Image Size: 224x224
- Max Text Length: 77 tokens
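The InfoNCE objective with temperature 0.07 is typically written as a symmetric cross-entropy over the batch similarity matrix (the CLIP-style formulation); a sketch under the assumption of L2-normalized embeddings:

```python
import torch
import torch.nn.functional as F

def info_nce(image_emb, text_emb, temperature=0.07):
    """Symmetric InfoNCE over a batch of L2-normalized embedding pairs.

    Row i of each tensor is a matched image-caption pair; every other
    row in the batch serves as an in-batch negative.
    """
    logits = image_emb @ text_emb.t() / temperature   # (B, B) similarity matrix
    targets = torch.arange(logits.size(0))            # positives on the diagonal
    loss_i2t = F.cross_entropy(logits, targets)       # image -> text direction
    loss_t2i = F.cross_entropy(logits.t(), targets)   # text -> image direction
    return (loss_i2t + loss_t2i) / 2
```

With a batch size of 32, each pair is contrasted against 31 negatives per direction; the low temperature sharpens the softmax so the loss focuses on the hardest negatives.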
## Performance

Evaluated on COCO val2017 (5,000 images, 25,000 captions):
| Metric | Score |
|---|---|
| Recall@5 | 21.95% |
| Recall@10 | 31.20% |
| Recall@20 | 42.50% |
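Recall@K counts a query as a hit when its ground-truth match appears among the K nearest neighbors. A minimal sketch of the computation, assuming one caption per image for simplicity (COCO pairs each image with five captions, so the real evaluation maps each caption back to its source image before scoring):

```python
import torch

def recall_at_k(image_emb, text_emb, ks=(5, 10, 20)):
    """Text-to-image Recall@K for L2-normalized embeddings.

    Assumes text_emb[i] matches image_emb[i] (one caption per image).
    """
    sims = text_emb @ image_emb.t()                  # (num_texts, num_images)
    ranks = sims.argsort(dim=1, descending=True)     # best-matching images first
    targets = torch.arange(sims.size(0)).unsqueeze(1)
    hits = ranks == targets                          # True where ground truth sits
    return {k: hits[:, :k].any(dim=1).float().mean().item() for k in ks}
```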
## Usage

```python
import torch
from huggingface_hub import hf_hub_download

# Download the checkpoint from the Hub
model_path = hf_hub_download(
    repo_id="Potato-Scientist/multi-modal-search",
    filename="pytorch_model.bin"
)

# Load the checkpoint on CPU
checkpoint = torch.load(model_path, map_location="cpu")

# Load the weights into the model class from this repo
from src.models.multimodal_model import MultimodalModel

model = MultimodalModel(
    embedding_dim=512,
    freeze_backbones=False,
    pretrained=False
)
model.load_state_dict(checkpoint["model_state_dict"])
model.eval()
```
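Once the model is loaded, retrieval reduces to cosine similarity in the shared 512-dim space. A hypothetical search helper over precomputed gallery embeddings (the names `text_embedding` / `image_embeddings` and this helper are assumptions for illustration, not the repo's confirmed API; only the shared embedding space comes from the card):

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def search(text_embedding, image_embeddings, top_k=5):
    """Return indices of the top_k gallery images for one text query.

    text_embedding: (512,) query vector; image_embeddings: (N, 512) gallery.
    """
    text_embedding = F.normalize(text_embedding, dim=-1)
    image_embeddings = F.normalize(image_embeddings, dim=-1)
    scores = image_embeddings @ text_embedding   # cosine similarity per image
    return scores.topk(top_k).indices.tolist()
```

For a large gallery the image embeddings would be computed once offline and reused for every query, since only the query side changes at search time.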
## Demo

Try the live demo: Multimodal Search Demo
## Model Details

- **Developed by:** Potato-Scientist
- **Model type:** Vision-Language Model
- **Language:** English
- **License:** MIT
- **Framework:** PyTorch 2.1.0
## Citation

If you use this model, please cite:

```bibtex
@misc{multimodal-search-2026,
  author = {Potato-Scientist},
  title = {Multimodal Search Model},
  year = {2026},
  publisher = {HuggingFace},
  howpublished = {\url{https://huggingface.co/Potato-Scientist/multimodal-search-model}}
}
```