---
datasets:
- nlphuji/flickr30k
base_model:
- answerdotai/ModernBERT-base
- HuggingFaceTB/SmolVLM-Instruct
pipeline_tag: zero-shot-image-classification
---

# Model Card for ModernBERT-base-CLIP

Use natural language to search for images.<br>
# How to Get Started with the Model

To use a pretrained model to search through a directory of images, go to demo.py. For training, see train.py.<br>
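demo.py is the intended entry point; for intuition, retrieval with a CLIP-style model amounts to ranking precomputed image embeddings by cosine similarity against the embedding of a query caption. A minimal sketch (the function and tensor names here are illustrative, not this repo's API):

```python
import torch

def retrieve(text_emb: torch.Tensor, image_embs: torch.Tensor, k: int = 3):
    """Rank images by cosine similarity to a text query embedding.

    text_emb:   (d,)   embedding of the query caption
    image_embs: (n, d) embeddings of the indexed images
    Returns the indices of the top-k images.
    """
    text_emb = text_emb / text_emb.norm()
    image_embs = image_embs / image_embs.norm(dim=-1, keepdim=True)
    sims = image_embs @ text_emb          # (n,) cosine similarities
    return sims.topk(k).indices

# Toy check: image 0 is nearly parallel to the query, so it ranks first.
query = torch.tensor([1.0, 0.0, 0.0])
images = torch.tensor([[0.9, 0.1, 0.0],
                       [0.0, 1.0, 0.0],
                       [0.5, 0.5, 0.7]])
top = retrieve(query, images, k=1)
```

In practice the image embeddings for a directory are computed once and cached, so each query costs only one text-encoder forward pass plus a matrix-vector product.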
# Model Details

**Text encoder:** ModernBERT-base<br>
https://huggingface.co/answerdotai/ModernBERT-base<br>
**Vision encoder:** Idefics3 variant extracted from HF's SmolVLM<br>
https://huggingface.co/blog/smolvlm<br>
# Model Description

ModernBERT-base-CLIP is a multimodal model for Contrastive Language-Image Pretraining (CLIP), designed to align text and image representations in a shared embedding space.
It leverages a fine-tuned ModernBERT-base text encoder and a frozen vision encoder (extracted from SmolVLM) to generate embeddings, which are projected into a 512-dimensional space using linear layers. The model enables natural-language image retrieval and zero-shot classification by optimizing a contrastive loss, which maximizes the similarity between matching text-image pairs while minimizing it for non-matching pairs.
Training was conducted on the Flickr30k dataset, with one-shot evaluation performed on COCO images (... or your own!) using the demo.py script.
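The projection step described above can be sketched as a pair of linear heads mapping each encoder's output into the shared space. The input widths below (768 for ModernBERT-base, 1152 for the vision features) are assumptions for illustration; only the 512-dimensional target comes from this card:

```python
import torch
import torch.nn as nn

class CLIPProjectionHead(nn.Module):
    """Project text and vision encoder outputs into a shared 512-d space.
    Input dims are illustrative, not necessarily this repo's exact values."""
    def __init__(self, text_dim: int = 768, vision_dim: int = 1152, embed_dim: int = 512):
        super().__init__()
        self.text_proj = nn.Linear(text_dim, embed_dim)
        self.vision_proj = nn.Linear(vision_dim, embed_dim)

    def forward(self, text_feats: torch.Tensor, vision_feats: torch.Tensor):
        # Project each modality, then L2-normalize so dot products are cosines.
        t = nn.functional.normalize(self.text_proj(text_feats), dim=-1)
        v = nn.functional.normalize(self.vision_proj(vision_feats), dim=-1)
        return t, v
```

With the vision encoder frozen, only the text encoder and these two linear layers receive gradients, which keeps training cheap enough for a single consumer GPU.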
# Datasets

flickr30k: https://huggingface.co/datasets/nlphuji/flickr30k (training)<br>
COCO captioning: https://cocodataset.org/#captions-2015 (demo)<br>
# Training Procedure

The model is trained using the InfoNCE contrastive loss, which encourages positive (matching) text-image pairs to score higher than the non-matching pairs in a batch.
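InfoNCE for CLIP-style training is commonly implemented as symmetric cross-entropy over the batch's text-image similarity matrix, where matching pairs sit on the diagonal. A sketch under that assumption (the temperature value is illustrative):

```python
import torch
import torch.nn.functional as F

def info_nce(text_embs: torch.Tensor, image_embs: torch.Tensor,
             temperature: float = 0.07) -> torch.Tensor:
    """Symmetric InfoNCE over normalized embeddings of shape (n, d).

    Each row of the similarity matrix is treated as an n-way classification
    whose correct class is the diagonal (the matching pair).
    """
    logits = text_embs @ image_embs.T / temperature   # (n, n)
    targets = torch.arange(len(logits))
    loss_t = F.cross_entropy(logits, targets)         # text -> image
    loss_i = F.cross_entropy(logits.T, targets)       # image -> text
    return (loss_t + loss_i) / 2
```

Perfectly aligned pairs drive the loss toward zero, while shuffled (mismatched) pairs inflate it, which is exactly the pressure that pulls matching text and images together in the shared space.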
# Hardware

Nvidia 3080 Ti