---
license: mit
tags:
- multimodal
- image-text-retrieval
- contrastive-learning
- vision-language
library_name: pytorch
---

# Multimodal Search Model

A PyTorch-based multimodal model for image-text retrieval, trained on the COCO Captions dataset.

## Model Architecture

- **Image Encoder**: ViT-Base/16 (Vision Transformer)
  - Pre-trained on ImageNet
  - Output: 768-dim features
- **Text Encoder**: BERT-Base-Uncased
  - Pre-trained on an English corpus
  - Output: 768-dim features
- **Projection**: Linear layers project both modalities into a 512-dim shared embedding space
- **Training Strategy**: Frozen backbones with trainable projection heads

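The frozen-backbone design can be sketched as follows. This is a minimal illustration, not the repository's actual `MultimodalModel`: the backbone stand-ins below are placeholders for the real ViT-Base/16 and BERT-Base encoders, and all names are assumptions.

```python
import torch
import torch.nn as nn

class ProjectionHead(nn.Module):
    """Maps a 768-dim encoder output into the 512-dim shared space."""
    def __init__(self, in_dim=768, out_dim=512):
        super().__init__()
        self.proj = nn.Linear(in_dim, out_dim)

    def forward(self, x):
        return self.proj(x)

def freeze(module):
    """Freeze a pre-trained backbone so only the projection heads train."""
    for p in module.parameters():
        p.requires_grad = False

# Placeholder backbones (assumption): stand-ins for ViT-Base/16 and BERT-Base,
# both of which emit 768-dim features in the real model.
image_backbone = nn.Linear(3 * 224 * 224, 768)
text_backbone = nn.Linear(768, 768)

freeze(image_backbone)
freeze(text_backbone)

image_proj = ProjectionHead()
text_proj = ProjectionHead()

# Only the projection heads contribute trainable parameters.
trainable = [p for m in (image_proj, text_proj) for p in m.parameters()]
```

With this split, the optimizer is handed only `trainable`, so gradient updates never touch the pre-trained encoders.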
## Training Details

- **Dataset**: COCO Captions 2017 (118K training images, 5K validation images)
- **Loss Function**: InfoNCE (contrastive loss)
- **Temperature**: 0.07
- **Optimizer**: AdamW
- **Batch Size**: 32
- **Image Size**: 224x224
- **Max Text Length**: 77 tokens

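The InfoNCE objective with temperature 0.07 amounts to a symmetric cross-entropy over the batch's image-text similarity matrix. A minimal sketch (the function name and argument names are illustrative):

```python
import torch
import torch.nn.functional as F

def info_nce_loss(image_emb, text_emb, temperature=0.07):
    """Symmetric InfoNCE loss over a batch of matched image-caption pairs.

    image_emb, text_emb: (batch, dim) projected embeddings; row i of each
    tensor is assumed to be a matched image-caption pair.
    """
    # L2-normalize so the dot product is cosine similarity
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)

    # (batch, batch) similarity matrix, scaled by the temperature
    logits = image_emb @ text_emb.t() / temperature

    # Matched pairs sit on the diagonal
    targets = torch.arange(logits.size(0))

    # Average the image->text and text->image cross-entropies
    loss_i2t = F.cross_entropy(logits, targets)
    loss_t2i = F.cross_entropy(logits.t(), targets)
    return (loss_i2t + loss_t2i) / 2
```

Lowering the temperature sharpens the softmax over the similarity matrix, penalizing hard negatives more strongly.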
## Performance

Evaluated on COCO val2017 (5,000 images, 25,000 captions):

| Metric | Score |
|--------|-------|
| **Recall@5** | **21.95%** |
| Recall@10 | 31.20% |
| Recall@20 | 42.50% |

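Recall@K here can be computed by ranking all captions against each image and checking whether a matching caption lands in the top K. A minimal sketch over pre-computed, L2-normalized embeddings (names are illustrative, and the toy pairing assumes one caption per image):

```python
import torch

def recall_at_k(image_emb, text_emb, k):
    """Fraction of images whose matching caption ranks in the top-k.

    Assumes row i of text_emb is the caption for row i of image_emb,
    and both tensors are L2-normalized.
    """
    sims = image_emb @ text_emb.t()            # (N, N) cosine similarities
    topk = sims.topk(k, dim=1).indices         # top-k caption indices per image
    targets = torch.arange(sims.size(0)).unsqueeze(1)
    hits = (topk == targets).any(dim=1)        # did the match appear in top-k?
    return hits.float().mean().item()
```

On COCO val2017, where each image has five captions, the hit test generalizes to "any of the image's captions appears in the top K."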
## Usage

```python
import torch
from huggingface_hub import hf_hub_download

# Download the checkpoint from the Hub
model_path = hf_hub_download(
    repo_id="Potato-Scientist/multi-modal-search",
    filename="pytorch_model.bin"
)

# Load the checkpoint on CPU
checkpoint = torch.load(model_path, map_location="cpu")

# Load the weights into the model class from this repository
from src.models.multimodal_model import MultimodalModel

model = MultimodalModel(
    embedding_dim=512,
    freeze_backbones=False,
    pretrained=False
)
model.load_state_dict(checkpoint["model_state_dict"])
model.eval()
```

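Once the model is loaded, text-to-image search reduces to a cosine-similarity ranking in the shared 512-dim space. A minimal sketch, assuming you have already produced the query and index embeddings with the model's encoders (the function and its inputs are illustrative):

```python
import torch
import torch.nn.functional as F

def top_k_images(text_query_emb, image_index_emb, k=5):
    """Rank a pre-computed image index against one text query.

    text_query_emb: (dim,) embedding of the query caption.
    image_index_emb: (N, dim) embeddings of the indexed images.
    Both are assumed to come from the model's projection heads.
    """
    q = F.normalize(text_query_emb, dim=-1)
    index = F.normalize(image_index_emb, dim=-1)
    scores = index @ q                          # (N,) cosine similarities
    top = scores.topk(min(k, scores.numel()))   # best-matching images first
    return top.indices.tolist(), top.values.tolist()
```

Because the index embeddings depend only on the images, they can be computed once offline and reused for every query.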
## Demo

Try the live demo: [Multimodal Search Demo](https://huggingface.co/spaces/Potato-Scientist/multimodal-search-hf)

## Model Details

- **Developed by**: Potato-Scientist
- **Model type**: Vision-Language Model
- **Language**: English
- **License**: MIT
- **Framework**: PyTorch 2.1.0

## Citation

If you use this model, please cite:

```bibtex
@misc{multimodal-search-2026,
  author = {Potato-Scientist},
  title = {Multimodal Search Model},
  year = {2026},
  publisher = {HuggingFace},
  howpublished = {\url{https://huggingface.co/Potato-Scientist/multimodal-search-model}}
}
```