---
library_name: peft
license: apache-2.0
base_model:
- Qwen/Qwen3-VL-4B-Instruct
pipeline_tag: visual-document-retrieval
---
# Eager Embed V1
**eager-embed-v1** is a multimodal dense embedding model built upon a Vision-Language Model (VLM). It is designed to efficiently index documents using both their visual and textual features.
Compared to multi-vector (ColBERT-like) architectures, eager-embed-v1 offers a strong balance between embedding dimensionality and retrieval accuracy while remaining efficient. Because it produces a single dense vector per input, documents can be scored with plain cosine similarity instead of a MaxSim late-interaction function, which simplifies the retrieval pipeline.
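To make the contrast concrete, here is a minimal sketch of the two scoring schemes (the dimensions are made up for illustration; this model uses the single-vector scheme):
```python
import torch

# Single-vector (this model): one embedding per query/document.
q = torch.randn(1024)                  # query embedding
d = torch.randn(1024)                  # document embedding
single_vector_score = torch.dot(q, d)  # plain dot product (cosine if normalized)

# Multi-vector (ColBERT-like): one embedding per token, scored with MaxSim.
Q = torch.randn(32, 128)   # 32 query tokens x 128 dims
D = torch.randn(512, 128)  # 512 document tokens x 128 dims
# For each query token, take the max similarity over document tokens, then sum.
max_sim_score = (Q @ D.T).max(dim=1).values.sum()
```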
## Model Details
### Model Description
- **Developed by:** Juan Pablo Balarini
- **Funded by:** [Eagerworks](https://eagerworks.com/)
- **Model type:** Embedding model
- **License:** Apache 2.0
- **Finetuned from model:** Qwen3-VL-4B-Instruct
### Model Sources
- **Repository:** [eager-embed](https://github.com/eagerworks/eager-embed)
## How to Get Started with the Model
Load the model and define a helper function to encode messages:
```python
import torch
from transformers import AutoProcessor, Qwen3VLForConditionalGeneration
from transformers.utils.import_utils import is_flash_attn_2_available
from qwen_vl_utils import process_vision_info

MODEL_NAME = "eagerworks/eager-embed-v1"

# Pick the best available device: CUDA, then Apple MPS, then CPU.
DEVICE = torch.device("cpu")
if torch.cuda.is_available():
    DEVICE = torch.device("cuda:0")
elif torch.backends.mps.is_available():
    DEVICE = torch.device("mps")

DTYPE = torch.bfloat16

processor = AutoProcessor.from_pretrained(MODEL_NAME)
model = Qwen3VLForConditionalGeneration.from_pretrained(
    MODEL_NAME,
    attn_implementation=(
        "flash_attention_2" if is_flash_attn_2_available() else None
    ),
    dtype=DTYPE,
).to(DEVICE).eval()

def encode_message(message):
    """Encode a chat-style message (text and/or images) into a single L2-normalized embedding."""
    with torch.no_grad():
        # Render the chat template and append the EOS token used for pooling.
        texts = processor.apply_chat_template(message, tokenize=False, add_generation_prompt=True) + "<|endoftext|>"
        image_inputs, video_inputs = process_vision_info(message)
        inputs = processor(
            text=texts,
            images=image_inputs,
            videos=video_inputs,
            return_tensors="pt",
            padding="longest",
        ).to(DEVICE)
        model_outputs = model(**inputs, return_dict=True, output_hidden_states=True)
        # Pool by taking the last token's hidden state from the final layer,
        # then L2-normalize so dot products equal cosine similarities.
        last_hidden_state = model_outputs.hidden_states[-1]
        embeddings = last_hidden_state[:, -1]
        embeddings = torch.nn.functional.normalize(embeddings, p=2, dim=-1)
        return embeddings
```
### 🌍 Multilingual Text Retrieval
```python
example_query = "Query: What is the capital city of Uruguay?"
# Spanish passages: the first answers the query; the second is about the Uruguay River.
example_text_1 = "Montevideo es la capital y la ciudad más poblada de la República Oriental del Uruguay, así como la capital del departamento homónimo"
example_text_2 = "El río Uruguay es un río internacional que forma parte de la cuenca del Plata. Nace en Brasil, recorre unos 1.800 km y desemboca en el Río de la Plata"
query = [{'role': 'user', 'content': [{'type': 'text', 'text': example_query}]}]
text_1 = [{'role': 'user', 'content': [{'type': 'text', 'text': example_text_1}]}]
text_2 = [{'role': 'user', 'content': [{'type': 'text', 'text': example_text_2}]}]
sim1 = torch.cosine_similarity(encode_message(query), encode_message(text_1))
sim2 = torch.cosine_similarity(encode_message(query), encode_message(text_2))
print("Similarities:", sim1.item(), sim2.item())
```
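To rank several candidate passages against one query, you can stack the document embeddings and score them in a single matrix product (a sketch reusing `query`, `text_1`, and `text_2` from the example above):
```python
# Rank several candidate passages against one query.
docs = [text_1, text_2]
doc_embs = torch.cat([encode_message(d) for d in docs], dim=0)  # (num_docs, dim)
query_emb = encode_message(query)                               # (1, dim)

# Embeddings are L2-normalized, so the matrix product equals cosine similarity.
scores = (query_emb @ doc_embs.T).squeeze(0)
ranking = torch.argsort(scores, descending=True)
print("Ranking (best first):", ranking.tolist())
```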
### 📈 Image Document Retrieval (Image, Chart, PDF)
```python
MAX_IMAGE_SIZE = 784
example_query = 'Query: Where can we find the animal llama?'
example_image_1 = "https://huggingface.co/Tevatron/dse-phi3-docmatix-v2/resolve/main/animal-llama.png"
example_image_2 = "https://huggingface.co/Tevatron/dse-phi3-docmatix-v2/resolve/main/meta-llama.png"
query = [{'role': 'user', 'content': [{'type': 'text', 'text': example_query}]}]
image_1 = [{'role': 'user', 'content': [{'type': 'image', 'image': example_image_1, 'resized_height': MAX_IMAGE_SIZE, 'resized_width': MAX_IMAGE_SIZE}]}]
image_2 = [{'role': 'user', 'content': [{'type': 'image', 'image': example_image_2, 'resized_height': MAX_IMAGE_SIZE, 'resized_width': MAX_IMAGE_SIZE}]}]
sim1 = torch.cosine_similarity(encode_message(query), encode_message(image_1))
sim2 = torch.cosine_similarity(encode_message(query), encode_message(image_2))
print("Similarities:", sim1.item(), sim2.item())
```
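PDF documents can be embedded by rendering each page to an image and encoding the pages like any other image input. A minimal sketch using the `pdf2image` library (an assumption, not part of this repository; it requires the Poppler system package, and `document.pdf` is a placeholder path):
```python
from pdf2image import convert_from_path

# Render each PDF page to a PIL image, then embed it like any other image.
pages = convert_from_path("document.pdf", dpi=150)
page_messages = [
    [{'role': 'user', 'content': [{'type': 'image', 'image': page,
                                   'resized_height': MAX_IMAGE_SIZE,
                                   'resized_width': MAX_IMAGE_SIZE}]}]
    for page in pages
]
page_embs = torch.cat([encode_message(m) for m in page_messages], dim=0)
```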
## Training Details
### Training Hardware
Training was done on a machine with the following specifications:
- 8× NVIDIA RTX 5090 GPUs (32 GB each), for a total of 256 GB of VRAM
- AMD EPYC 9534 64-core CPU (128 threads)
- 256 GB of RAM
- 2 TB SSD
### Training Procedure
Training was done using the [Tevatron framework](https://github.com/texttron/tevatron), with [DeepSpeed](https://github.com/deepspeedai/DeepSpeed) for distributed multi-GPU training.
#### Training Hyperparameters
The full set of training hyperparameters is available in [train.sh](https://github.com/eagerworks/eager-embed/blob/main/train.sh).
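Tevatron trains dense retrievers with an InfoNCE-style contrastive loss over in-batch negatives. As a reference, here is a minimal sketch of that objective (the temperature value is illustrative, not the actual training configuration):
```python
import torch
import torch.nn.functional as F

def info_nce_loss(query_embs, doc_embs, temperature=0.02):
    """Contrastive loss with in-batch negatives.

    query_embs: (B, dim), doc_embs: (B, dim); document i is the positive for
    query i, and all other documents in the batch act as negatives.
    """
    query_embs = F.normalize(query_embs, dim=-1)
    doc_embs = F.normalize(doc_embs, dim=-1)
    logits = query_embs @ doc_embs.T / temperature  # (B, B) similarity matrix
    targets = torch.arange(logits.size(0), device=logits.device)
    return F.cross_entropy(logits, targets)
```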
## Evaluation
The model was evaluated on the ViDoRe v1, v2, and v3 benchmarks. Results are available on the [MTEB leaderboard](https://mteb-leaderboard.hf.space/?benchmark_name=ViDoRe%28v2%29).
## Citation
```bibtex
@misc{EagerEmbed,
title={Eager Embed V1: Multimodal Dense Embeddings for Retrieval},
author={Juan Pablo Balarini},
year={2025},
publisher={Eagerworks},
url={https://github.com/eagerworks/eager-embed}
}
```