Multimodal Search Model

A PyTorch-based multimodal model for image-text retrieval, trained on the COCO Captions dataset.

Model Architecture

  • Image Encoder: ViT-Base/16 (Vision Transformer)

    • Pre-trained on ImageNet
    • Output: 768-dim features
  • Text Encoder: BERT-Base-Uncased

    • Pre-trained on an English corpus
    • Output: 768-dim features
  • Projection: Linear layers project both modalities to 512-dim shared embedding space

  • Training Strategy: Frozen backbones with trainable projection heads
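The projection step above can be sketched in a few lines of PyTorch. The random tensors below stand in for frozen ViT-Base/16 and BERT-Base outputs; `ProjectionHead` is an illustrative name, not necessarily the repo's class:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ProjectionHead(nn.Module):
    """Linear projection from 768-dim encoder features to the shared space."""
    def __init__(self, in_dim=768, out_dim=512):
        super().__init__()
        self.proj = nn.Linear(in_dim, out_dim)

    def forward(self, x):
        # L2-normalize so cosine similarity reduces to a dot product
        return F.normalize(self.proj(x), dim=-1)

# Separate trainable heads for each modality, mapping 768 -> 512
image_proj = ProjectionHead(768, 512)
text_proj = ProjectionHead(768, 512)

# Stand-ins for frozen backbone outputs (batch of 4)
image_feats = torch.randn(4, 768)
text_feats = torch.randn(4, 768)

img_emb = image_proj(image_feats)  # (4, 512), unit-norm
txt_emb = text_proj(text_feats)    # (4, 512), unit-norm
```

Because the backbones stay frozen, only the two small linear heads receive gradients, which keeps training cheap.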

Training Details

  • Dataset: COCO Captions 2017 (118K training images, 5K validation images)
  • Loss Function: InfoNCE (Contrastive Loss)
  • Temperature: 0.07
  • Optimizer: AdamW
  • Batch Size: 32
  • Image Size: 224x224
  • Max Text Length: 77 tokens
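InfoNCE treats each matching image-caption pair in a batch as the positive and every other pairing as a negative; the temperature of 0.07 scales the logits before the softmax. A minimal symmetric version (a sketch, not the repo's exact implementation):

```python
import torch
import torch.nn.functional as F

def info_nce_loss(img_emb, txt_emb, temperature=0.07):
    """Symmetric InfoNCE over a batch of paired, L2-normalized embeddings."""
    logits = img_emb @ txt_emb.t() / temperature     # (B, B) similarity matrix
    targets = torch.arange(img_emb.size(0))          # diagonal = positive pairs
    loss_i2t = F.cross_entropy(logits, targets)      # image -> text direction
    loss_t2i = F.cross_entropy(logits.t(), targets)  # text -> image direction
    return (loss_i2t + loss_t2i) / 2

# Batch of 32, matching the training configuration above
img = F.normalize(torch.randn(32, 512), dim=-1)
txt = F.normalize(torch.randn(32, 512), dim=-1)
loss = info_nce_loss(img, txt)
```

Averaging the image-to-text and text-to-image directions makes the objective symmetric, so neither modality is favored during training.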

Performance

Evaluated on COCO val2017 (5,000 images, 25,000 captions):

Metric       Score
Recall@5     21.95%
Recall@10    31.20%
Recall@20    42.50%
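Recall@K is the fraction of queries whose ground-truth match appears among the top K retrieved items. The evaluation script is not shown here, but the computation can be sketched as follows (assuming query i's true match is gallery item i):

```python
import torch

def recall_at_k(sim, k):
    """sim[i, j] = similarity of query i to gallery item j;
    ground truth is the diagonal (query i matches item i)."""
    topk = sim.topk(k, dim=1).indices                 # (N, k) best gallery indices
    targets = torch.arange(sim.size(0)).unsqueeze(1)  # (N, 1) true index per query
    hits = (topk == targets).any(dim=1)               # did the true match appear?
    return hits.float().mean().item()

sim = torch.tensor([[0.9, 0.1, 0.0],
                    [0.2, 0.8, 0.1],
                    [0.3, 0.4, 0.7]])
recall_at_k(sim, 1)  # every row's best match is the diagonal -> 1.0
```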

Usage

import torch
from huggingface_hub import hf_hub_download

# Download model
model_path = hf_hub_download(
    repo_id="Potato-Scientist/multi-modal-search",
    filename="pytorch_model.bin"
)

# Load checkpoint
checkpoint = torch.load(model_path, map_location='cpu')

# Load into your model
from src.models.multimodal_model import MultimodalModel

model = MultimodalModel(
    embedding_dim=512,
    freeze_backbones=False,
    pretrained=False
)
model.load_state_dict(checkpoint['model_state_dict'])
model.eval()
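Once the checkpoint is loaded, retrieval reduces to ranking gallery items by cosine similarity in the shared embedding space. The random tensors below stand in for embeddings produced by the model's image and text encoders:

```python
import torch
import torch.nn.functional as F

# Stand-ins for model outputs: a gallery of 1,000 image embeddings
# and a single text-query embedding, all L2-normalized to 512 dims
image_emb = F.normalize(torch.randn(1000, 512), dim=-1)
query_emb = F.normalize(torch.randn(1, 512), dim=-1)

scores = (query_emb @ image_emb.t()).squeeze(0)  # cosine similarity per image
top5 = scores.topk(5).indices                    # indices of the 5 best matches
```

In practice the gallery embeddings would be precomputed once and cached, so each query costs a single matrix-vector product.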

Demo

Try the live demo: Multimodal Search Demo

Model Details

  • Developed by: Potato-Scientist
  • Model type: Vision-Language Model
  • Language: English
  • License: MIT
  • Framework: PyTorch 2.1.0

Citation

If you use this model, please cite:

@misc{multimodal-search-2026,
  author = {Potato-Scientist},
  title = {Multimodal Search Model},
  year = {2026},
  publisher = {HuggingFace},
  howpublished = {\url{https://huggingface.co/Potato-Scientist/multimodal-search-model}}
}