This repo contains the model checkpoint for [VLM2Vec: Training Vision-Language Models for Massive Multimodal Embedding Tasks](https://arxiv.org/abs/2410.05160). In this paper, we aim to build a unified multimodal embedding model for a wide range of tasks. Our approach converts an existing well-trained VLM (Phi-3.5-V) into an embedding model: we append an [EOS] token at the end of the sequence and use its final hidden state as the representation of the multimodal input.
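The [EOS] pooling step can be sketched as follows. This is a minimal illustration with NumPy standing in for the actual framework, assuming the VLM exposes its final-layer hidden states and an attention mask; the function name `eos_pool` is our own, not from the repo:

```python
import numpy as np

def eos_pool(hidden_states: np.ndarray, attention_mask: np.ndarray) -> np.ndarray:
    """Take the hidden state at the last non-padded position (the [EOS] token)
    of each sequence as the embedding of the full multimodal input.

    hidden_states:  (batch, seq_len, dim) final-layer states from the VLM
    attention_mask: (batch, seq_len), 1 for real tokens, 0 for padding
    """
    # Index of the last real token in each sequence.
    last_idx = attention_mask.sum(axis=1) - 1          # (batch,)
    batch_idx = np.arange(hidden_states.shape[0])
    return hidden_states[batch_idx, last_idx]          # (batch, dim)

# Toy check: a batch of 2 sequences with lengths 3 and 2, hidden dim 4.
h = np.arange(2 * 5 * 4, dtype=np.float32).reshape(2, 5, 4)
mask = np.array([[1, 1, 1, 0, 0], [1, 1, 0, 0, 0]])
emb = eos_pool(h, mask)
print(emb.shape)  # (2, 4)
```

Because the [EOS] token attends to everything before it, its hidden state can summarize both the image and text tokens in one vector.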
<img width="1432" alt="abs" src="https://raw.githubusercontent.com/TIGER-AI-Lab/VLM2Vec/refs/heads/main/figures//train_vlm.png">
## Release
Our model is trained on MMEB-train and evaluated on MMEB-eval with contrastive learning, using only in-batch negatives. Our best results were obtained with LoRA training at a batch size of 1024; we also provide a fully fine-tuned checkpoint trained with a batch size of 2048. Our results on the 36 evaluation datasets are:
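Contrastive training with in-batch negatives treats, for each query in a batch, its own target as the positive and every other target in the same batch as a negative (which is why larger batches supply more negatives). A minimal NumPy sketch of this loss, under our own naming and a temperature value chosen for illustration only:

```python
import numpy as np

def info_nce_loss(q: np.ndarray, t: np.ndarray, temperature: float = 0.05) -> float:
    """Contrastive loss with in-batch negatives: for each query embedding q[i],
    t[i] is the positive and every t[j], j != i, in the batch is a negative."""
    # L2-normalize, then build the cosine-similarity matrix scaled by temperature.
    q = q / np.linalg.norm(q, axis=1, keepdims=True)
    t = t / np.linalg.norm(t, axis=1, keepdims=True)
    logits = q @ t.T / temperature                       # (batch, batch)
    # Cross-entropy with the diagonal (matching pairs) as the correct class.
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-log_probs.diagonal().mean())

# When each target is a slightly perturbed copy of its query, the loss is small.
rng = np.random.default_rng(0)
q = rng.normal(size=(8, 16))
loss = info_nce_loss(q, q + 0.01 * rng.normal(size=(8, 16)))
print(loss)
```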
### Experimental Results
Our model outperforms the existing baselines by a large margin.
<img width="900" alt="abs" src="vlm2vec_v1_result.png">
## How to use VLM2Vec
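The repo's loading API is not shown in this excerpt. As a generic sketch of the downstream step, once query and candidate embeddings have been extracted (e.g. via [EOS] pooling as above), retrieval is a cosine-similarity ranking; the helper name `rank_candidates` is ours, not the repo's:

```python
import numpy as np

def rank_candidates(query: np.ndarray, candidates: np.ndarray) -> np.ndarray:
    """Return candidate indices sorted by cosine similarity to the query,
    highest first."""
    query = query / np.linalg.norm(query)
    candidates = candidates / np.linalg.norm(candidates, axis=1, keepdims=True)
    return np.argsort(-(candidates @ query))

# Toy example with 3 candidate embeddings in 2-D.
cands = np.array([[1.0, 0.0], [0.0, 1.0], [0.7, 0.7]])
order = rank_candidates(np.array([1.0, 0.1]), cands)
print(order[0])  # 0: the first candidate is most similar to the query
```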