---
datasets:
- nlphuji/flickr30k
base_model:
- answerdotai/ModernBERT-base
- HuggingFaceTB/SmolVLM-Instruct
pipeline_tag: zero-shot-image-classification
---

# Model Card for ModernBERT-base-CLIP

Use natural language to search for images.<br>
# How to Get Started with the Model

To use a pretrained model to search through a directory of images, go to demo.py. For training, see train.py.<br>
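demo.py is the intended entry point; for intuition, retrieval with a CLIP-style model amounts to ranking precomputed image embeddings by cosine similarity against the embedding of a query caption. A minimal sketch (the function and tensor names here are illustrative, not this repo's API):

```python
import torch

def retrieve(text_emb: torch.Tensor, image_embs: torch.Tensor, k: int = 3):
    """Rank images by cosine similarity to a text query embedding.

    text_emb:   (d,)   embedding of the query caption
    image_embs: (n, d) embeddings of the indexed images
    Returns the indices of the top-k images.
    """
    text_emb = text_emb / text_emb.norm()
    image_embs = image_embs / image_embs.norm(dim=-1, keepdim=True)
    sims = image_embs @ text_emb          # (n,) cosine similarities
    return sims.topk(k).indices

# Toy check: image 0 is nearly parallel to the query, so it ranks first.
query = torch.tensor([1.0, 0.0, 0.0])
images = torch.tensor([[0.9, 0.1, 0.0],
                       [0.0, 1.0, 0.0],
                       [0.5, 0.5, 0.7]])
top = retrieve(query, images, k=1)
```

In practice the image embeddings for a directory are computed once and cached, so each query costs only one text-encoder forward pass plus a matrix-vector product.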
# Model Details

**Text encoder:** ModernBERT-base<br>
https://huggingface.co/answerdotai/ModernBERT-base<br>
**Vision encoder:** Idefics3 variant extracted from HF's SmolVLM<br>
https://huggingface.co/blog/smolvlm<br>
# Model Description

ModernBERT-base-CLIP is a multimodal model for Contrastive Language-Image Pretraining (CLIP), designed to align text and image representations in a shared embedding space.
It leverages a fine-tuned ModernBERT-base text encoder and a frozen vision encoder (extracted from SmolVLM) to generate embeddings, which are projected into a 512-dimensional space using linear layers. The model enables natural-language image retrieval and zero-shot classification by optimizing a contrastive loss, which maximizes the similarity between matching text-image pairs while minimizing it for non-matching pairs.
Training was conducted on the Flickr30k dataset, with one-shot evaluation performed on COCO images (... or your own!) using the demo.py script.
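The projection step described above can be sketched as a pair of linear heads mapping each encoder's output into the shared space. The input widths below (768 for ModernBERT-base, 1152 for the vision features) are assumptions for illustration; only the 512-dimensional target comes from this card:

```python
import torch
import torch.nn as nn

class CLIPProjectionHead(nn.Module):
    """Project text and vision encoder outputs into a shared 512-d space.
    Input dims are illustrative, not necessarily this repo's exact values."""
    def __init__(self, text_dim: int = 768, vision_dim: int = 1152, embed_dim: int = 512):
        super().__init__()
        self.text_proj = nn.Linear(text_dim, embed_dim)
        self.vision_proj = nn.Linear(vision_dim, embed_dim)

    def forward(self, text_feats: torch.Tensor, vision_feats: torch.Tensor):
        # Project each modality, then L2-normalize so dot products are cosines.
        t = nn.functional.normalize(self.text_proj(text_feats), dim=-1)
        v = nn.functional.normalize(self.vision_proj(vision_feats), dim=-1)
        return t, v
```

With the vision encoder frozen, only the text encoder and these two linear layers receive gradients, which keeps training cheap enough for a single consumer GPU.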
# Datasets

flickr30k: https://huggingface.co/datasets/nlphuji/flickr30k (training)<br>
COCO captioning: https://cocodataset.org/#captions-2015 (demo)<br>
# Training Procedure

The model is trained using the InfoNCE contrastive loss, which encourages positive (matching) text-image pairs to score higher than the non-matching pairs in a batch.
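InfoNCE for CLIP-style training is commonly implemented as symmetric cross-entropy over the batch's text-image similarity matrix, where matching pairs sit on the diagonal. A sketch under that assumption (the temperature value is illustrative):

```python
import torch
import torch.nn.functional as F

def info_nce(text_embs: torch.Tensor, image_embs: torch.Tensor,
             temperature: float = 0.07) -> torch.Tensor:
    """Symmetric InfoNCE over normalized embeddings of shape (n, d).

    Each row of the similarity matrix is treated as an n-way classification
    whose correct class is the diagonal (the matching pair).
    """
    logits = text_embs @ image_embs.T / temperature   # (n, n)
    targets = torch.arange(len(logits))
    loss_t = F.cross_entropy(logits, targets)         # text -> image
    loss_i = F.cross_entropy(logits.T, targets)       # image -> text
    return (loss_t + loss_i) / 2
```

Perfectly aligned pairs drive the loss toward zero, while shuffled (mismatched) pairs inflate it, which is exactly the pressure that pulls matching text and images together in the shared space.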
# Hardware

Nvidia 3080 Ti