Update README.md
## About `GME-VARCO-VISION-Embedding`
`GME-VARCO-VISION-Embedding` is a multimodal embedding model that computes semantic similarity between text, images, and videos in a shared high-dimensional embedding space. In particular, the model focuses on video retrieval, which demands handling greater complexity and contextual nuance than image retrieval. It achieves high retrieval accuracy and strong generalization across various scenarios, such as scene-based search, description-based search, and question-answering-based search.
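The retrieval flow this enables can be sketched with plain cosine similarity over precomputed embeddings. The toy vectors and video ids below are made up for illustration; real embeddings are high-dimensional model outputs:

```python
import math

def cosine_similarity(a, b):
    # Cosine similarity between two embedding vectors.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

def rank_videos(query_embedding, video_embeddings):
    # Return (video_id, score) pairs sorted by similarity, best first.
    scored = [(vid, cosine_similarity(query_embedding, emb))
              for vid, emb in video_embeddings.items()]
    return sorted(scored, key=lambda item: item[1], reverse=True)

# Toy 3-dimensional embeddings standing in for model outputs.
videos = {
    "cooking_scene": [0.9, 0.1, 0.0],
    "soccer_match": [0.0, 0.8, 0.6],
}
ranking = rank_videos([1.0, 0.0, 0.1], videos)
print(ranking[0][0])  # prints "cooking_scene"
```

The same ranking step works for scene-, description-, and question-based queries, since all of them reduce to nearest neighbors in the shared embedding space.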
### Model Architecture and Training Method
`GME-VARCO-VISION-Embedding` is based on [`Qwen/Qwen2-VL-7B-Instruct`](https://huggingface.co/Qwen/Qwen2-VL-7B-Instruct), and uses the parameters of [`Alibaba-NLP/gme-Qwen2-VL-7B-Instruct`](https://huggingface.co/Alibaba-NLP/gme-Qwen2-VL-7B-Instruct) to improve the model's general retrieval ability.
Our model achieves **state-of-the-art (SOTA) zero-shot performance** on the Mult…

<br>

## Demo Video

Check out our demo videos showcasing our multimodal embedding model in action:

- [English Demo Video](https://www.youtube.com/watch?v=kCvz82Y1BQg)
- [Korean Demo Video](https://youtube.com/shorts/jC2J7rbAfxs)

The demos show how our embedding model works together with an AI agent to search for relevant videos based on user queries and generate responses from the retrieved video content.

<br>
## Code Examples
`GME-VARCO-VISION-Embedding` adopts the inference pipeline of [`Qwen/Qwen2-VL-7B-Instruct`](https://huggingface.co/Qwen/Qwen2-VL-7B-Instruct).
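A minimal sketch of that pipeline for text queries is below, assuming the standard `transformers` Qwen2-VL classes and GME-style last-token pooling. The repo id, pooling choice, and padding side are assumptions, not the confirmed recipe; check the model card's examples for the exact usage:

```python
import math

def l2_normalize(vec):
    # L2-normalize so dot products between embeddings equal cosine similarity.
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]

def embed_texts(texts, model_id="NCSOFT/GME-VARCO-VISION-Embedding"):
    # Hypothetical sketch: repo id and pooling are assumptions.
    # Heavy dependencies are imported lazily.
    import torch
    from transformers import AutoProcessor, Qwen2VLForConditionalGeneration

    model = Qwen2VLForConditionalGeneration.from_pretrained(
        model_id, torch_dtype=torch.float16, device_map="auto"
    )
    processor = AutoProcessor.from_pretrained(model_id)
    # Left padding keeps the final position a real token for last-token pooling.
    processor.tokenizer.padding_side = "left"
    inputs = processor(text=texts, padding=True, return_tensors="pt").to(model.device)
    with torch.no_grad():
        out = model(**inputs, output_hidden_states=True)
    # GME-style pooling assumption: final hidden state of the last token.
    emb = out.hidden_states[-1][:, -1, :].float().cpu()
    return [l2_normalize(row.tolist()) for row in emb]
```

For image or video inputs, the chat-template preprocessing from the base `Qwen/Qwen2-VL-7B-Instruct` card (e.g. `qwen_vl_utils.process_vision_info`) would slot into the same `processor` call.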