upload trained checkpoint

Files changed (3)

.gitattributes +4 -0
README.md +100 -0
pytorch_model.bin +3 -0

.gitattributes ADDED

	@@ -0,0 +1,4 @@

+*.bin filter=lfs diff=lfs merge=lfs -text
+*.pth filter=lfs diff=lfs merge=lfs -text
+*.pt filter=lfs diff=lfs merge=lfs -text
+*.safetensors filter=lfs diff=lfs merge=lfs -text

README.md ADDED

	@@ -0,0 +1,100 @@

+---
+license: mit
+tags:
+- multimodal
+- image-text-retrieval
+- contrastive-learning
+- vision-language
+library_name: pytorch
+---
+# Multimodal Search Model
+A PyTorch-based multimodal model for image-text retrieval, trained on the COCO Captions dataset.
+## Model Architecture
+- **Image Encoder**: ViT-Base/16 (Vision Transformer)
+  - Pre-trained on ImageNet
+  - Output: 768-dim features
+- **Text Encoder**: BERT-Base-Uncased
+  - Pre-trained on an English text corpus
+  - Output: 768-dim features
+- **Projection**: Linear layers project both modalities to 512-dim shared embedding space
+- **Training Strategy**: Frozen backbones with trainable projection heads
+## Training Details
+- **Dataset**: COCO Captions 2017 (118K training images, 5K validation images)
+- **Loss Function**: InfoNCE (Contrastive Loss)
+- **Temperature**: 0.07
+- **Optimizer**: AdamW
+- **Batch Size**: 32
+- **Image Size**: 224x224
+- **Max Text Length**: 77 tokens
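
The README names InfoNCE with a temperature of 0.07 but does not show the loss itself. The following is a minimal, dependency-free sketch of how such a loss is commonly computed over a batch of paired embeddings. The symmetric form (averaging the image-to-text and text-to-image directions) is an assumption, since the README does not say which variant was used, and the function names `info_nce`, `l2_normalize`, and `log_softmax` are hypothetical rather than taken from the project's code.

```python
import math


def log_softmax(row):
    """Numerically stable log-softmax of a list of floats."""
    m = max(row)
    lse = m + math.log(sum(math.exp(x - m) for x in row))
    return [x - lse for x in row]


def l2_normalize(vec):
    """Scale a vector to unit length (left unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]


def info_nce(image_embs, text_embs, temperature=0.07):
    """Symmetric InfoNCE: row k of each input is assumed to be a matching pair."""
    imgs = [l2_normalize(v) for v in image_embs]
    txts = [l2_normalize(v) for v in text_embs]
    n = len(imgs)
    # Cosine similarity of every image with every text, scaled by temperature
    logits = [
        [sum(a * b for a, b in zip(img, txt)) / temperature for txt in txts]
        for img in imgs
    ]
    # Image-to-text: each row should put its mass on the diagonal entry
    i2t = -sum(log_softmax(logits[k])[k] for k in range(n)) / n
    # Text-to-image: same idea over the columns
    columns = [[logits[r][c] for r in range(n)] for c in range(n)]
    t2i = -sum(log_softmax(columns[k])[k] for k in range(n)) / n
    return (i2t + t2i) / 2
```

With perfectly aligned pairs the loss approaches zero, and with identical embeddings for every item it equals `log(batch_size)`. In PyTorch the two directions would typically be written as two `torch.nn.functional.cross_entropy` calls on the similarity matrix and its transpose.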
+## Performance
+Evaluated on COCO val2017 (5,000 images, 25,000 captions):
+| Metric | Score |
+|--------|-------|
+| **Recall@5** | **21.95%** |
+| Recall@10 | 31.20% |
+| Recall@20 | 42.50% |
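
The table reports Recall@K without showing how it is computed. As a rough illustration, the sketch below computes Recall@K from a caption-by-image similarity matrix. It assumes text-to-image retrieval with exactly one correct image per caption; the README does not state the retrieval direction, and `recall_at_k` is a hypothetical helper, not the project's evaluation code.

```python
def recall_at_k(similarity, gt_image, k):
    """Fraction of captions whose correct image is among the top-k results.

    similarity[q][i] is the score between caption q and image i;
    gt_image[q] is the index of the correct image for caption q.
    """
    hits = 0
    for scores, target in zip(similarity, gt_image):
        # Image indices ranked from most to least similar, truncated to k
        top_k = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]
        hits += target in top_k
    return hits / len(gt_image)


# Toy example: 3 captions scored against 3 images
toy_similarity = [
    [0.9, 0.1, 0.2],  # caption 0: correct image 0 is ranked first
    [0.3, 0.2, 0.8],  # caption 1: correct image 1 is ranked last
    [0.1, 0.7, 0.6],  # caption 2: correct image 2 is ranked second
]
toy_targets = [0, 1, 2]
```

On the toy data, `recall_at_k(toy_similarity, toy_targets, 1)` is 1/3, rising to 2/3 at K=2 and 1.0 at K=3.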
+## Usage
+```python
+import torch
+from huggingface_hub import hf_hub_download
+# MultimodalModel lives in this project's source tree, so `src/` must be importable
+from src.models.multimodal_model import MultimodalModel
+# Download the checkpoint from the Hub
+model_path = hf_hub_download(
+    repo_id="Potato-Scientist/multimodal-search-model",
+    filename="pytorch_model.bin",
+)
+# Load the checkpoint onto the CPU
+checkpoint = torch.load(model_path, map_location="cpu")
+# Build the model and restore the trained weights
+model = MultimodalModel(
+    embedding_dim=512,
+    freeze_backbones=False,
+    pretrained=False,
+)
+model.load_state_dict(checkpoint["model_state_dict"])
+model.eval()  # switch to inference mode (disables dropout)
+```
+## Demo
+Try the live demo: [Multimodal Search Demo](https://huggingface.co/spaces/Potato-Scientist/multimodal-search-hf)
+## Model Details
+- **Developed by**: Potato-Scientist
+- **Model type**: Vision-Language Model
+- **Language**: English
+- **License**: MIT
+- **Framework**: PyTorch 2.1.0
+## Citation
+If you use this model, please cite:
+```bibtex
+@misc{multimodal-search-2026,
+  author = {Potato-Scientist},
+  title = {Multimodal Search Model},
+  year = {2026},
+  publisher = {HuggingFace},
+  howpublished = {\url{https://huggingface.co/Potato-Scientist/multimodal-search-model}}
+}
+```

pytorch_model.bin ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:6818cc6762cfe286bde724d54e6f333d0275db827efeab7b441264f0d3c24a11
+size 790713733
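
The pointer above records the SHA-256 digest and byte size of the real checkpoint, so a downloaded `pytorch_model.bin` can be checked against it. A minimal sketch of that check, using only the standard library, might look like the following; `parse_lfs_pointer` and `matches_pointer` are hypothetical helper names, not part of Git LFS or `huggingface_hub`.

```python
import hashlib
import os

# Pointer contents copied from the diff above
POINTER = """version https://git-lfs.github.com/spec/v1
oid sha256:6818cc6762cfe286bde724d54e6f333d0275db827efeab7b441264f0d3c24a11
size 790713733"""


def parse_lfs_pointer(text):
    """Parse the 'key value' lines of a Git LFS pointer file."""
    fields = dict(line.split(" ", 1) for line in text.strip().splitlines())
    algo, digest = fields["oid"].split(":", 1)
    return {"algo": algo, "digest": digest, "size": int(fields["size"])}


def matches_pointer(path, pointer, chunk_size=1 << 20):
    """Return True if the file at `path` has the pointer's size and digest."""
    # Cheap size check first, so a truncated download fails without hashing
    if os.path.getsize(path) != pointer["size"]:
        return False
    hasher = hashlib.new(pointer["algo"])
    with open(path, "rb") as f:
        # Hash in chunks so a ~790 MB file is never fully loaded into memory
        for chunk in iter(lambda: f.read(chunk_size), b""):
            hasher.update(chunk)
    return hasher.hexdigest() == pointer["digest"]
```

After downloading the checkpoint, `matches_pointer(model_path, parse_lfs_pointer(POINTER))` should return `True` if the file is intact, where `model_path` is the local path returned by `hf_hub_download`.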