Parth Sarthi Srivastava committed on
Commit cdf0449 · 0 Parent(s)

upload trained checkpoint
Files changed (3)
  1. .gitattributes +4 -0
  2. README.md +100 -0
  3. pytorch_model.bin +3 -0
.gitattributes ADDED
@@ -0,0 +1,4 @@
+ *.bin filter=lfs diff=lfs merge=lfs -text
+ *.pth filter=lfs diff=lfs merge=lfs -text
+ *.pt filter=lfs diff=lfs merge=lfs -text
+ *.safetensors filter=lfs diff=lfs merge=lfs -text
README.md ADDED
@@ -0,0 +1,100 @@
+ ---
+ license: mit
+ tags:
+ - multimodal
+ - image-text-retrieval
+ - contrastive-learning
+ - vision-language
+ library_name: pytorch
+ ---
+
+ # Multimodal Search Model
+
+ A PyTorch-based multimodal model for image-text retrieval, trained on the COCO Captions dataset.
+
+ ## Model Architecture
+
+ - **Image Encoder**: ViT-Base/16 (Vision Transformer)
+   - Pre-trained on ImageNet
+   - Output: 768-dim features
+ - **Text Encoder**: BERT-Base-Uncased
+   - Pre-trained on an English corpus
+   - Output: 768-dim features
+ - **Projection**: Linear layers project both modalities into a 512-dim shared embedding space
+ - **Training Strategy**: Frozen backbones with trainable projection heads
+
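The layout above can be sketched as a dual encoder with frozen backbones and trainable projection heads. This is an illustrative re-implementation, not the repo's `MultimodalModel`; the `nn.Linear(768, 768)` stand-ins below take the place of the pretrained ViT/BERT pooled outputs.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DualEncoder(nn.Module):
    """Frozen backbones + trainable linear projections into a shared space."""

    def __init__(self, image_backbone, text_backbone,
                 backbone_dim=768, embedding_dim=512):
        super().__init__()
        self.image_backbone = image_backbone
        self.text_backbone = text_backbone
        # Freeze both backbones: only the projection heads receive gradients.
        for p in self.image_backbone.parameters():
            p.requires_grad = False
        for p in self.text_backbone.parameters():
            p.requires_grad = False
        # Project each modality's 768-dim features to the 512-dim shared space.
        self.image_proj = nn.Linear(backbone_dim, embedding_dim)
        self.text_proj = nn.Linear(backbone_dim, embedding_dim)

    def forward(self, image, text):
        img = self.image_proj(self.image_backbone(image))
        txt = self.text_proj(self.text_backbone(text))
        # L2-normalize so dot products are cosine similarities.
        return F.normalize(img, dim=-1), F.normalize(txt, dim=-1)

# Stand-ins for the pretrained encoders' 768-dim pooled outputs.
model = DualEncoder(nn.Linear(768, 768), nn.Linear(768, 768))
trainable = [n for n, p in model.named_parameters() if p.requires_grad]
```

With the backbones frozen, `trainable` contains only the two projection heads, which is what makes this cheap to fine-tune.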
+ ## Training Details
+
+ - **Dataset**: COCO Captions 2017 (118K training images, 5K validation images)
+ - **Loss Function**: InfoNCE (contrastive loss)
+ - **Temperature**: 0.07
+ - **Optimizer**: AdamW
+ - **Batch Size**: 32
+ - **Image Size**: 224x224
+ - **Max Text Length**: 77 tokens
+
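The InfoNCE objective listed above can be sketched as a symmetric cross-entropy over the in-batch image-text similarity matrix, with the i-th image and i-th caption as the positive pair and all other in-batch pairings as negatives. This is a minimal illustrative version (temperature 0.07 as in the table), not the repo's training code.

```python
import torch
import torch.nn.functional as F

def info_nce(image_emb, text_emb, temperature=0.07):
    """Symmetric InfoNCE over a batch of L2-normalized embeddings."""
    logits = image_emb @ text_emb.t() / temperature   # (B, B) similarity matrix
    targets = torch.arange(len(logits))               # diagonal = positive pairs
    loss_i2t = F.cross_entropy(logits, targets)       # image -> text direction
    loss_t2i = F.cross_entropy(logits.t(), targets)   # text -> image direction
    return (loss_i2t + loss_t2i) / 2

# Sanity check: perfectly aligned pairs should score a much lower loss
# than randomly paired embeddings.
emb = F.normalize(torch.randn(32, 512), dim=-1)
aligned = info_nce(emb, emb)
random_ = info_nce(emb, F.normalize(torch.randn(32, 512), dim=-1))
```

The low temperature sharpens the softmax, so the loss strongly penalizes any negative pair that scores close to the positive on the diagonal.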
+ ## Performance
+
+ Evaluated on COCO val2017 (5,000 images, 25,000 captions):
+
+ | Metric | Score |
+ |--------|-------|
+ | **Recall@5** | **21.95%** |
+ | Recall@10 | 31.20% |
+ | Recall@20 | 42.50% |
+
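Recall@K here is the fraction of queries whose ground-truth match appears among the top-K retrieved items by similarity. A minimal version of that metric (illustrative; not the evaluation script behind the table above):

```python
import torch

def recall_at_k(similarity, gt_index, k):
    """similarity: (num_queries, num_gallery) score matrix;
    gt_index[i] is the gallery index of query i's true match."""
    topk = similarity.topk(k, dim=1).indices           # (Q, k) retrieved ids
    hits = (topk == gt_index.unsqueeze(1)).any(dim=1)  # true match in top-k?
    return hits.float().mean().item()

# Toy check: with an identity similarity matrix, every query's best
# match is its own index, so Recall@1 should be perfect.
sim = torch.eye(5)
gt = torch.arange(5)
r1 = recall_at_k(sim, gt, k=1)
```

On the real benchmark, `similarity` would be the 25,000 x 5,000 caption-to-image cosine-similarity matrix produced by the two encoders.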
+ ## Usage
+
+ ```python
+ import torch
+ from huggingface_hub import hf_hub_download
+
+ # Download the checkpoint from the Hub
+ model_path = hf_hub_download(
+     repo_id="Potato-Scientist/multimodal-search-model",
+     filename="pytorch_model.bin"
+ )
+
+ # Load the checkpoint on CPU
+ checkpoint = torch.load(model_path, map_location='cpu')
+
+ # Load the weights into the model class from this project's source tree
+ from src.models.multimodal_model import MultimodalModel
+
+ model = MultimodalModel(
+     embedding_dim=512,
+     freeze_backbones=False,
+     pretrained=False
+ )
+ model.load_state_dict(checkpoint['model_state_dict'])
+ model.eval()
+ ```
+
+ ## Demo
+
+ Try the live demo: [Multimodal Search Demo](https://huggingface.co/spaces/Potato-Scientist/multimodal-search-hf)
+
+ ## Model Details
+
+ - **Developed by**: Potato-Scientist
+ - **Model type**: Vision-Language Model
+ - **Language**: English
+ - **License**: MIT
+ - **Framework**: PyTorch 2.1.0
+
+ ## Citation
+
+ If you use this model, please cite:
+
+ ```bibtex
+ @misc{multimodal-search-2026,
+   author       = {Potato-Scientist},
+   title        = {Multimodal Search Model},
+   year         = {2026},
+   publisher    = {HuggingFace},
+   howpublished = {\url{https://huggingface.co/Potato-Scientist/multimodal-search-model}}
+ }
+ ```
pytorch_model.bin ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:6818cc6762cfe286bde724d54e6f333d0275db827efeab7b441264f0d3c24a11
+ size 790713733