---
license: mit
tags:
- multimodal
- image-text-retrieval
- contrastive-learning
- vision-language
library_name: pytorch
---

# Multimodal Search Model

A PyTorch-based multimodal model for image-text retrieval, trained on the COCO Captions dataset.

## Model Architecture

- **Image Encoder**: ViT-Base/16 (Vision Transformer)
  - Pre-trained on ImageNet
  - Output: 768-dim features
- **Text Encoder**: BERT-Base-Uncased
  - Pre-trained on an English corpus
  - Output: 768-dim features
- **Projection**: Linear layers project both modalities into a 512-dim shared embedding space
- **Training Strategy**: Frozen backbones with trainable projection heads

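The frozen-backbone design can be sketched as follows. This is a minimal illustration, not the repository's actual `MultimodalModel`: the backbone stand-ins below are placeholders for the real ViT-Base/16 and BERT-Base encoders, and all names are assumptions.

```python
import torch
import torch.nn as nn

class ProjectionHead(nn.Module):
    """Maps a 768-dim encoder output into the 512-dim shared space."""
    def __init__(self, in_dim=768, out_dim=512):
        super().__init__()
        self.proj = nn.Linear(in_dim, out_dim)

    def forward(self, x):
        return self.proj(x)

def freeze(module):
    """Freeze a pre-trained backbone so only the projection heads train."""
    for p in module.parameters():
        p.requires_grad = False

# Placeholder backbones (assumption): stand-ins for ViT-Base/16 and BERT-Base,
# both of which emit 768-dim features in the real model.
image_backbone = nn.Linear(3 * 224 * 224, 768)
text_backbone = nn.Linear(768, 768)

freeze(image_backbone)
freeze(text_backbone)

image_proj = ProjectionHead()
text_proj = ProjectionHead()

# Only the projection heads contribute trainable parameters.
trainable = [p for m in (image_proj, text_proj) for p in m.parameters()]
```

With this split, the optimizer is handed only `trainable`, so gradient updates never touch the pre-trained encoders.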
## Training Details

- **Dataset**: COCO Captions 2017 (118K training images, 5K validation images)
- **Loss Function**: InfoNCE (contrastive loss)
- **Temperature**: 0.07
- **Optimizer**: AdamW
- **Batch Size**: 32
- **Image Size**: 224x224
- **Max Text Length**: 77 tokens

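The InfoNCE objective with temperature 0.07 amounts to a symmetric cross-entropy over the batch's image-text similarity matrix. A minimal sketch (the function name and argument names are illustrative):

```python
import torch
import torch.nn.functional as F

def info_nce_loss(image_emb, text_emb, temperature=0.07):
    """Symmetric InfoNCE loss over a batch of matched image-caption pairs.

    image_emb, text_emb: (batch, dim) projected embeddings; row i of each
    tensor is assumed to be a matched image-caption pair.
    """
    # L2-normalize so the dot product is cosine similarity
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)

    # (batch, batch) similarity matrix, scaled by the temperature
    logits = image_emb @ text_emb.t() / temperature

    # Matched pairs sit on the diagonal
    targets = torch.arange(logits.size(0))

    # Average the image->text and text->image cross-entropies
    loss_i2t = F.cross_entropy(logits, targets)
    loss_t2i = F.cross_entropy(logits.t(), targets)
    return (loss_i2t + loss_t2i) / 2
```

Lowering the temperature sharpens the softmax over the similarity matrix, penalizing hard negatives more strongly.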
## Performance

Evaluated on COCO val2017 (5,000 images, 25,000 captions):

| Metric | Score |
|--------|-------|
| **Recall@5** | **21.95%** |
| Recall@10 | 31.20% |
| Recall@20 | 42.50% |

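Recall@K here can be computed by ranking all captions against each image and checking whether a matching caption lands in the top K. A minimal sketch over pre-computed, L2-normalized embeddings (names are illustrative, and the toy pairing assumes one caption per image):

```python
import torch

def recall_at_k(image_emb, text_emb, k):
    """Fraction of images whose matching caption ranks in the top-k.

    Assumes row i of text_emb is the caption for row i of image_emb,
    and both tensors are L2-normalized.
    """
    sims = image_emb @ text_emb.t()            # (N, N) cosine similarities
    topk = sims.topk(k, dim=1).indices         # top-k caption indices per image
    targets = torch.arange(sims.size(0)).unsqueeze(1)
    hits = (topk == targets).any(dim=1)        # did the match appear in top-k?
    return hits.float().mean().item()
```

On COCO val2017, where each image has five captions, the hit test generalizes to "any of the image's captions appears in the top K."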
## Usage

```python
import torch
from huggingface_hub import hf_hub_download

# Download the checkpoint from the Hub
model_path = hf_hub_download(
    repo_id="Potato-Scientist/multi-modal-search",
    filename="pytorch_model.bin"
)

# Load the checkpoint on CPU
checkpoint = torch.load(model_path, map_location="cpu")

# Load the weights into the model class from this repository
from src.models.multimodal_model import MultimodalModel

model = MultimodalModel(
    embedding_dim=512,
    freeze_backbones=False,
    pretrained=False
)
model.load_state_dict(checkpoint["model_state_dict"])
model.eval()
```

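Once the model is loaded, text-to-image search reduces to a cosine-similarity ranking in the shared 512-dim space. A minimal sketch, assuming you have already produced the query and index embeddings with the model's encoders (the function and its inputs are illustrative):

```python
import torch
import torch.nn.functional as F

def top_k_images(text_query_emb, image_index_emb, k=5):
    """Rank a pre-computed image index against one text query.

    text_query_emb: (dim,) embedding of the query caption.
    image_index_emb: (N, dim) embeddings of the indexed images.
    Both are assumed to come from the model's projection heads.
    """
    q = F.normalize(text_query_emb, dim=-1)
    index = F.normalize(image_index_emb, dim=-1)
    scores = index @ q                          # (N,) cosine similarities
    top = scores.topk(min(k, scores.numel()))   # best-matching images first
    return top.indices.tolist(), top.values.tolist()
```

Because the index embeddings depend only on the images, they can be computed once offline and reused for every query.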
## Demo

Try the live demo: [Multimodal Search Demo](https://huggingface.co/spaces/Potato-Scientist/multimodal-search-hf)

## Model Details

- **Developed by**: Potato-Scientist
- **Model type**: Vision-Language Model
- **Language**: English
- **License**: MIT
- **Framework**: PyTorch 2.1.0

## Citation

If you use this model, please cite:

```bibtex
@misc{multimodal-search-2026,
  author = {Potato-Scientist},
  title = {Multimodal Search Model},
  year = {2026},
  publisher = {HuggingFace},
  howpublished = {\url{https://huggingface.co/Potato-Scientist/multimodal-search-model}}
}
```