---
library_name: peft
license: apache-2.0
base_model:
- Qwen/Qwen3-VL-4B-Instruct
pipeline_tag: visual-document-retrieval
---
# Eager Embed V1
**eager-embed-v1** is a multimodal dense embedding model built upon a Vision-Language Model (VLM). It is designed to efficiently index documents using both their visual and textual features.
Compared to multi-vector (ColBERT-like) architectures, eager-embed-v1 offers a strong balance between embedding dimensionality and retrieval accuracy while remaining efficient. Because it produces a single dense vector per input, documents can be scored with plain cosine similarity instead of a MaxSim late-interaction function, which simplifies the retrieval pipeline.
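To make the contrast concrete, here is a minimal sketch of the two scoring schemes (the dimensions are made up for illustration; this model uses the single-vector scheme):
```python
import torch

# Single-vector (this model): one embedding per query/document.
q = torch.randn(1024)                  # query embedding
d = torch.randn(1024)                  # document embedding
single_vector_score = torch.dot(q, d)  # plain dot product (cosine if normalized)

# Multi-vector (ColBERT-like): one embedding per token, scored with MaxSim.
Q = torch.randn(32, 128)   # 32 query tokens x 128 dims
D = torch.randn(512, 128)  # 512 document tokens x 128 dims
# For each query token, take the max similarity over document tokens, then sum.
max_sim_score = (Q @ D.T).max(dim=1).values.sum()
```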
## Model Details
### Model Description
- **Developed by:** Juan Pablo Balarini
- **Funded by:** [Eagerworks](https://eagerworks.com/)
- **Model type:** Embedding model
- **License:** Apache 2.0
- **Finetuned from model:** Qwen3-VL-4B-Instruct
### Model Sources
- **Repository:** [eager-embed](https://github.com/eagerworks/eager-embed)
## How to Get Started with the Model
Load the model and define a helper function to encode messages:
```python
import torch
from transformers import AutoProcessor, Qwen3VLForConditionalGeneration
from transformers.utils.import_utils import is_flash_attn_2_available
from qwen_vl_utils import process_vision_info

MODEL_NAME = "eagerworks/eager-embed-v1"

# Pick the best available device: CUDA, then Apple MPS, then CPU.
DEVICE = torch.device("cpu")
if torch.cuda.is_available():
    DEVICE = torch.device("cuda:0")
elif torch.backends.mps.is_available():
    DEVICE = torch.device("mps")

DTYPE = torch.bfloat16

processor = AutoProcessor.from_pretrained(MODEL_NAME)
model = Qwen3VLForConditionalGeneration.from_pretrained(
    MODEL_NAME,
    attn_implementation=(
        "flash_attention_2" if is_flash_attn_2_available() else None
    ),
    dtype=DTYPE,
).to(DEVICE).eval()

def encode_message(message):
    """Encode a chat-style message (text and/or images) into a single L2-normalized embedding."""
    with torch.no_grad():
        # Render the chat template and append the EOS token used for pooling.
        texts = processor.apply_chat_template(message, tokenize=False, add_generation_prompt=True) + "<|endoftext|>"
        image_inputs, video_inputs = process_vision_info(message)
        inputs = processor(
            text=texts,
            images=image_inputs,
            videos=video_inputs,
            return_tensors="pt",
            padding="longest",
        ).to(DEVICE)
        model_outputs = model(**inputs, return_dict=True, output_hidden_states=True)
        # Pool by taking the last token's hidden state from the final layer,
        # then L2-normalize so dot products equal cosine similarities.
        last_hidden_state = model_outputs.hidden_states[-1]
        embeddings = last_hidden_state[:, -1]
        embeddings = torch.nn.functional.normalize(embeddings, p=2, dim=-1)
        return embeddings
```
### 🌍 Multilingual Text Retrieval
```python
example_query = "Query: What is the capital city of Uruguay?"
# Spanish passages: the first answers the query; the second is about the Uruguay River.
example_text_1 = "Montevideo es la capital y la ciudad más poblada de la República Oriental del Uruguay, así como la capital del departamento homónimo"
example_text_2 = "El río Uruguay es un río internacional que forma parte de la cuenca del Plata. Nace en Brasil, recorre unos 1.800 km y desemboca en el Río de la Plata"
query = [{'role': 'user', 'content': [{'type': 'text', 'text': example_query}]}]
text_1 = [{'role': 'user', 'content': [{'type': 'text', 'text': example_text_1}]}]
text_2 = [{'role': 'user', 'content': [{'type': 'text', 'text': example_text_2}]}]
sim1 = torch.cosine_similarity(encode_message(query), encode_message(text_1))
sim2 = torch.cosine_similarity(encode_message(query), encode_message(text_2))
print("Similarities:", sim1.item(), sim2.item())
```
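To rank several candidate passages against one query, you can stack the document embeddings and score them in a single matrix product (a sketch reusing `query`, `text_1`, and `text_2` from the example above):
```python
# Rank several candidate passages against one query.
docs = [text_1, text_2]
doc_embs = torch.cat([encode_message(d) for d in docs], dim=0)  # (num_docs, dim)
query_emb = encode_message(query)                               # (1, dim)

# Embeddings are L2-normalized, so the matrix product equals cosine similarity.
scores = (query_emb @ doc_embs.T).squeeze(0)
ranking = torch.argsort(scores, descending=True)
print("Ranking (best first):", ranking.tolist())
```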
### 📈 Image Document Retrieval (Image, Chart, PDF)
```python
MAX_IMAGE_SIZE = 784
example_query = 'Query: Where can we find the animal llama?'
example_image_1 = "https://huggingface.co/Tevatron/dse-phi3-docmatix-v2/resolve/main/animal-llama.png"
example_image_2 = "https://huggingface.co/Tevatron/dse-phi3-docmatix-v2/resolve/main/meta-llama.png"
query = [{'role': 'user', 'content': [{'type': 'text', 'text': example_query}]}]
image_1 = [{'role': 'user', 'content': [{'type': 'image', 'image': example_image_1, 'resized_height': MAX_IMAGE_SIZE, 'resized_width': MAX_IMAGE_SIZE}]}]
image_2 = [{'role': 'user', 'content': [{'type': 'image', 'image': example_image_2, 'resized_height': MAX_IMAGE_SIZE, 'resized_width': MAX_IMAGE_SIZE}]}]
sim1 = torch.cosine_similarity(encode_message(query), encode_message(image_1))
sim2 = torch.cosine_similarity(encode_message(query), encode_message(image_2))
print("Similarities:", sim1.item(), sim2.item())
```
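PDF documents can be embedded by rendering each page to an image and encoding the pages like any other image input. A minimal sketch using the `pdf2image` library (an assumption, not part of this repository; it requires the Poppler system package, and `document.pdf` is a placeholder path):
```python
from pdf2image import convert_from_path

# Render each PDF page to a PIL image, then embed it like any other image.
pages = convert_from_path("document.pdf", dpi=150)
page_messages = [
    [{'role': 'user', 'content': [{'type': 'image', 'image': page,
                                   'resized_height': MAX_IMAGE_SIZE,
                                   'resized_width': MAX_IMAGE_SIZE}]}]
    for page in pages
]
page_embs = torch.cat([encode_message(m) for m in page_messages], dim=0)
```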
## Training Details
### Training Hardware
Training was done on a machine with the following specifications:
- 8× NVIDIA RTX 5090 GPUs (32 GB each), for a total of 256 GB of VRAM
- AMD EPYC 9534 64-core CPU (128 threads)
- 256 GB of RAM
- 2 TB SSD
### Training Procedure
Training was done using the [Tevatron framework](https://github.com/texttron/tevatron), with [DeepSpeed](https://github.com/deepspeedai/DeepSpeed) for distributed multi-GPU training.
#### Training Hyperparameters
The full set of training hyperparameters is available in [train.sh](https://github.com/eagerworks/eager-embed/blob/main/train.sh).
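Tevatron trains dense retrievers with an InfoNCE-style contrastive loss over in-batch negatives. As a reference, here is a minimal sketch of that objective (the temperature value is illustrative, not the actual training configuration):
```python
import torch
import torch.nn.functional as F

def info_nce_loss(query_embs, doc_embs, temperature=0.02):
    """Contrastive loss with in-batch negatives.

    query_embs: (B, dim), doc_embs: (B, dim); document i is the positive for
    query i, and all other documents in the batch act as negatives.
    """
    query_embs = F.normalize(query_embs, dim=-1)
    doc_embs = F.normalize(doc_embs, dim=-1)
    logits = query_embs @ doc_embs.T / temperature  # (B, B) similarity matrix
    targets = torch.arange(logits.size(0), device=logits.device)
    return F.cross_entropy(logits, targets)
```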
## Evaluation
The model was evaluated on the ViDoRe v1, v2, and v3 benchmarks. Results are available on the [MTEB leaderboard](https://mteb-leaderboard.hf.space/?benchmark_name=ViDoRe%28v2%29).
## Citation
```bibtex
@misc{EagerEmbed,
title={Eager Embed V1: Multimodal Dense Embeddings for Retrieval},
author={Juan Pablo Balarini},
year={2025},
publisher={Eagerworks},
url={https://github.com/eagerworks/eager-embed}
}
```