---
license: mit
tags:
- multimodal
- image-text-retrieval
- contrastive-learning
- vision-language
library_name: pytorch
---
# Multimodal Search Model
A PyTorch-based multimodal model for image-text retrieval, trained on the COCO Captions dataset.
## Model Architecture
- **Image Encoder**: ViT-Base/16 (Vision Transformer)
- Pre-trained on ImageNet
- Output: 768-dim features
- **Text Encoder**: BERT-Base-Uncased
- Pre-trained on English corpus
- Output: 768-dim features
- **Projection**: Linear layers project both modalities to 512-dim shared embedding space
- **Training Strategy**: Frozen backbones with trainable projection heads
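The projection setup above can be sketched as follows. This is a minimal illustration, not the repo's actual `MultimodalModel` code; the class and variable names here are hypothetical, and in the real model the 768-dim inputs come from the frozen ViT-Base/16 and BERT-Base backbones:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ProjectionHead(nn.Module):
    """Linear layer mapping 768-dim encoder features into the 512-dim shared space."""
    def __init__(self, in_dim=768, out_dim=512):
        super().__init__()
        self.proj = nn.Linear(in_dim, out_dim)

    def forward(self, x):
        # L2-normalize so cosine similarity reduces to a dot product
        return F.normalize(self.proj(x), dim=-1)

# One trainable head per modality; the backbones stay frozen during training
image_proj = ProjectionHead()
text_proj = ProjectionHead()

img_emb = image_proj(torch.randn(4, 768))  # stand-in for ViT-Base/16 features
txt_emb = text_proj(torch.randn(4, 768))   # stand-in for BERT-Base features
print(img_emb.shape)  # torch.Size([4, 512])
```

Normalizing the projected features means the shared space is effectively the unit hypersphere, which is the usual setup for contrastive retrieval.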
## Training Details
- **Dataset**: COCO Captions 2017 (118K training images, 5K validation images)
- **Loss Function**: InfoNCE (Contrastive Loss)
- **Temperature**: 0.07
- **Optimizer**: AdamW
- **Batch Size**: 32
- **Image Size**: 224x224
- **Max Text Length**: 77 tokens
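The InfoNCE objective with temperature 0.07 can be sketched like this. A minimal, symmetric formulation assuming L2-normalized embeddings; the function name and exact form are illustrative, not taken from the training code:

```python
import torch
import torch.nn.functional as F

def info_nce(img_emb, txt_emb, temperature=0.07):
    """Symmetric InfoNCE over a batch of L2-normalized image/text embeddings.

    Matching image-caption pairs sit on the diagonal of the similarity matrix;
    every other entry in the row/column acts as a negative.
    """
    logits = img_emb @ txt_emb.t() / temperature        # (B, B) scaled similarities
    targets = torch.arange(img_emb.size(0))             # diagonal = positive pairs
    loss_i2t = F.cross_entropy(logits, targets)         # image -> text direction
    loss_t2i = F.cross_entropy(logits.t(), targets)     # text -> image direction
    return (loss_i2t + loss_t2i) / 2
```

With a batch size of 32, each image is contrasted against its own caption and 31 in-batch negatives (and symmetrically for captions).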
## Performance
Evaluated on COCO val2017 (5,000 images, 25,000 captions):
| Metric | Score |
|--------|-------|
| Recall@5 | 21.95% |
| Recall@10 | 31.20% |
| Recall@20 | 42.50% |
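Recall@K here means the fraction of queries whose ground-truth match appears among the K nearest candidates in the shared embedding space. A self-contained sketch of that computation (toy similarity scores, not the actual evaluation script):

```python
import torch

def recall_at_k(similarity, k):
    """Fraction of queries whose true match (the diagonal entry) ranks in the top-k."""
    topk = similarity.topk(k, dim=1).indices                 # (N, k) ranked candidates
    targets = torch.arange(similarity.size(0)).unsqueeze(1)  # true index per query
    return (topk == targets).any(dim=1).float().mean().item()

# Toy similarity matrix that favors the diagonal (i.e. the correct pairs)
sim = torch.randn(100, 100) + 3 * torch.eye(100)
print(recall_at_k(sim, 5))
```

For the 5,000-image val2017 split with 5 captions per image, the real evaluation ranks all 25,000 captions per image query (or all 5,000 images per caption query) rather than a square toy matrix.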
## Usage
```python
import torch
from huggingface_hub import hf_hub_download

# Download the checkpoint from the Hub
model_path = hf_hub_download(
    repo_id="Potato-Scientist/multi-modal-search",
    filename="pytorch_model.bin",
)

# Load the checkpoint on CPU
checkpoint = torch.load(model_path, map_location="cpu")

# Load the weights into the model class from this repo's source code
from src.models.multimodal_model import MultimodalModel

model = MultimodalModel(
    embedding_dim=512,
    freeze_backbones=False,
    pretrained=False,
)
model.load_state_dict(checkpoint["model_state_dict"])
model.eval()
```
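Once the model is loaded, retrieval reduces to ranking caption embeddings against a query image embedding by cosine similarity. A sketch using placeholder 512-dim embeddings, since the exact encode calls depend on the `MultimodalModel` API in this repo's source:

```python
import torch
import torch.nn.functional as F

# Placeholder embeddings; in practice these come from the model's encoders
image_emb = F.normalize(torch.randn(1, 512), dim=-1)        # one query image
caption_embs = F.normalize(torch.randn(1000, 512), dim=-1)  # pre-encoded caption index

# With normalized embeddings, the dot product is the cosine similarity
scores = (image_emb @ caption_embs.t()).squeeze(0)  # (1000,) similarity per caption
top5 = scores.topk(5).indices                       # indices of the 5 best captions
print(top5.tolist())
```

For a large caption corpus, the pre-encoded `caption_embs` matrix would typically be built once offline and reused across queries.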
## Demo
Try the live demo: [Multimodal Search Demo](https://huggingface.co/spaces/Potato-Scientist/multimodal-search-hf)
## Model Details
- **Developed by**: Potato-Scientist
- **Model type**: Vision-Language Model
- **Language**: English
- **License**: MIT
- **Framework**: PyTorch 2.1.0
## Citation
If you use this model, please cite:
```bibtex
@misc{multimodal-search-2026,
  author       = {Potato-Scientist},
  title        = {Multimodal Search Model},
  year         = {2026},
  publisher    = {HuggingFace},
  howpublished = {\url{https://huggingface.co/Potato-Scientist/multimodal-search-model}}
}
```
```