# Multimodal Search Model

A PyTorch-based multimodal model for image-text retrieval, trained on the COCO Captions dataset.
## Model Architecture

- **Image Encoder:** ViT-Base/16 (Vision Transformer)
  - Pre-trained on ImageNet
  - Output: 768-dim features
- **Text Encoder:** BERT-Base-Uncased
  - Pre-trained on an English corpus
  - Output: 768-dim features
- **Projection:** Linear layers project both modalities into a 512-dim shared embedding space
- **Training Strategy:** Frozen backbones with trainable projection heads
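The two-tower layout above can be sketched in PyTorch. The `DualEncoder` and `ProjectionHead` names and the backbone arguments are illustrative stand-ins, not this repo's actual classes; only the dimensions (768 → 512) and the frozen-backbone strategy come from the card.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ProjectionHead(nn.Module):
    """Linear projection from an encoder's native width to the shared space."""
    def __init__(self, in_dim: int = 768, embed_dim: int = 512):
        super().__init__()
        self.proj = nn.Linear(in_dim, embed_dim)

    def forward(self, x):
        # L2-normalize so cosine similarity is a plain dot product.
        return F.normalize(self.proj(x), dim=-1)

class DualEncoder(nn.Module):
    """Sketch of the two-tower setup: frozen backbones, trainable heads.

    `image_backbone` / `text_backbone` stand in for ViT-Base/16 and
    BERT-Base; both emit 768-dim features in the model described above.
    """
    def __init__(self, image_backbone, text_backbone, feat_dim=768, embed_dim=512):
        super().__init__()
        self.image_backbone = image_backbone
        self.text_backbone = text_backbone
        # Freeze both backbones, as in the training strategy above.
        for p in self.image_backbone.parameters():
            p.requires_grad = False
        for p in self.text_backbone.parameters():
            p.requires_grad = False
        self.image_head = ProjectionHead(feat_dim, embed_dim)
        self.text_head = ProjectionHead(feat_dim, embed_dim)

    def encode_image(self, images):
        return self.image_head(self.image_backbone(images))

    def encode_text(self, tokens):
        return self.text_head(self.text_backbone(tokens))
```

With this layout, only the two projection heads receive gradients, which keeps per-step cost low while the pre-trained features are reused as-is.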
## Training Details
- Dataset: COCO Captions 2017 (118K training images, 5K validation images)
- Loss Function: InfoNCE (Contrastive Loss)
- Temperature: 0.07
- Optimizer: AdamW
- Batch Size: 32
- Image Size: 224x224
- Max Text Length: 77 tokens
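The InfoNCE objective with temperature 0.07 is typically written as a symmetric cross-entropy over the batch similarity matrix (the CLIP-style formulation); a sketch under the assumption of L2-normalized embeddings:

```python
import torch
import torch.nn.functional as F

def info_nce(image_emb, text_emb, temperature=0.07):
    """Symmetric InfoNCE over a batch of L2-normalized embedding pairs.

    Row i of each tensor is a matched image-caption pair; every other
    row in the batch serves as an in-batch negative.
    """
    logits = image_emb @ text_emb.t() / temperature   # (B, B) similarity matrix
    targets = torch.arange(logits.size(0))            # positives on the diagonal
    loss_i2t = F.cross_entropy(logits, targets)       # image -> text direction
    loss_t2i = F.cross_entropy(logits.t(), targets)   # text -> image direction
    return (loss_i2t + loss_t2i) / 2
```

With a batch size of 32, each pair is contrasted against 31 negatives per direction; the low temperature sharpens the softmax so the loss focuses on the hardest negatives.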
## Performance

Evaluated on COCO val2017 (5,000 images, 25,000 captions):
| Metric | Score |
|---|---|
| Recall@5 | 21.95% |
| Recall@10 | 31.20% |
| Recall@20 | 42.50% |
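Recall@K counts a query as a hit when its ground-truth match appears among the K nearest neighbors. A minimal sketch of the computation, assuming one caption per image for simplicity (COCO pairs each image with five captions, so the real evaluation maps each caption back to its source image before scoring):

```python
import torch

def recall_at_k(image_emb, text_emb, ks=(5, 10, 20)):
    """Text-to-image Recall@K for L2-normalized embeddings.

    Assumes text_emb[i] matches image_emb[i] (one caption per image).
    """
    sims = text_emb @ image_emb.t()                  # (num_texts, num_images)
    ranks = sims.argsort(dim=1, descending=True)     # best-matching images first
    targets = torch.arange(sims.size(0)).unsqueeze(1)
    hits = ranks == targets                          # True where ground truth sits
    return {k: hits[:, :k].any(dim=1).float().mean().item() for k in ks}
```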
## Usage

```python
import torch
from huggingface_hub import hf_hub_download

# Download the checkpoint from the Hub
model_path = hf_hub_download(
    repo_id="Potato-Scientist/multi-modal-search",
    filename="pytorch_model.bin"
)

# Load the checkpoint on CPU
checkpoint = torch.load(model_path, map_location="cpu")

# Load the weights into the model class from this repo
from src.models.multimodal_model import MultimodalModel

model = MultimodalModel(
    embedding_dim=512,
    freeze_backbones=False,
    pretrained=False
)
model.load_state_dict(checkpoint["model_state_dict"])
model.eval()
```
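Once the model is loaded, retrieval reduces to cosine similarity in the shared 512-dim space. A hypothetical search helper over precomputed gallery embeddings (the names `text_embedding` / `image_embeddings` and this helper are assumptions for illustration, not the repo's confirmed API; only the shared embedding space comes from the card):

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def search(text_embedding, image_embeddings, top_k=5):
    """Return indices of the top_k gallery images for one text query.

    text_embedding: (512,) query vector; image_embeddings: (N, 512) gallery.
    """
    text_embedding = F.normalize(text_embedding, dim=-1)
    image_embeddings = F.normalize(image_embeddings, dim=-1)
    scores = image_embeddings @ text_embedding   # cosine similarity per image
    return scores.topk(top_k).indices.tolist()
```

For a large gallery the image embeddings would be computed once offline and reused for every query, since only the query side changes at search time.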
## Demo

Try the live demo: Multimodal Search Demo
## Model Details

- **Developed by:** Potato-Scientist
- **Model type:** Vision-Language Model
- **Language:** English
- **License:** MIT
- **Framework:** PyTorch 2.1.0
## Citation

If you use this model, please cite:

```bibtex
@misc{multimodal-search-2026,
  author = {Potato-Scientist},
  title = {Multimodal Search Model},
  year = {2026},
  publisher = {HuggingFace},
  howpublished = {\url{https://huggingface.co/Potato-Scientist/multimodal-search-model}}
}
```