---
library_name: peft
license: apache-2.0
base_model:
- Qwen/Qwen3-VL-4B-Instruct
pipeline_tag: visual-document-retrieval
---

# Eager Embed V1

**eager-embed-v1** is a multimodal dense embedding model built on a Vision-Language Model (VLM). It is designed to efficiently index documents using both their visual and textual features.

Compared to multi-vector (ColBERT-like) architectures, eager-embed-v1 offers a strong balance between embedding dimensionality and retrieval accuracy while maintaining efficiency. Unlike those approaches, it does not require a max-sim scoring function, which further simplifies retrieval.

## Model Details

### Model Description

- **Developed by:** Juan Pablo Balarini
- **Funded by:** [Eagerworks](https://eagerworks.com/)
- **Model type:** Embedding model
- **License:** Apache 2.0
- **Finetuned from model:** Qwen3-VL-4B-Instruct

### Model Sources

- **Repository:** [eager-embed](https://github.com/eagerworks/eager-embed)

## How to Get Started with the Model

Load the model and define a helper function to encode messages:

```python
import torch
from transformers import AutoProcessor, Qwen3VLForConditionalGeneration
from transformers.utils.import_utils import is_flash_attn_2_available
from qwen_vl_utils import process_vision_info

MODEL_NAME = "eagerworks/eager-embed-v1"

# Pick the best available device
DEVICE = torch.device("cpu")
if torch.cuda.is_available():
    DEVICE = torch.device("cuda:0")
elif torch.backends.mps.is_available():
    DEVICE = torch.device("mps")

DTYPE = torch.bfloat16

processor = AutoProcessor.from_pretrained(MODEL_NAME)
model = Qwen3VLForConditionalGeneration.from_pretrained(
    MODEL_NAME,
    attn_implementation=(
        "flash_attention_2" if is_flash_attn_2_available() else None
    ),
    dtype=DTYPE,
).to(DEVICE).eval()


# Encode a chat-style message (text and/or images) into a single normalized embedding
def encode_message(message):
    with torch.no_grad():
        texts = processor.apply_chat_template(
            message, tokenize=False, add_generation_prompt=True
        ) + "<|endoftext|>"
        image_inputs, video_inputs = process_vision_info(message)
        inputs = processor(
            text=texts,
            images=image_inputs,
            videos=video_inputs,
            return_tensors="pt",
            padding="longest",
        ).to(DEVICE)
        model_outputs = model(**inputs, return_dict=True, output_hidden_states=True)
        last_hidden_state = model_outputs.hidden_states[-1]
        # Use the last token's hidden state as the dense embedding, then L2-normalize
        embeddings = last_hidden_state[:, -1]
        embeddings = torch.nn.functional.normalize(embeddings, p=2, dim=-1)
        return embeddings
```

🌍 Multilingual Text Retrieval

```python
example_query = "Query: What is the capital city of Uruguay?"
example_text_1 = "Montevideo es la capital y la ciudad más poblada de la República Oriental del Uruguay, así como la capital del departamento homónimo"
example_text_2 = "El río Uruguay es un río internacional que forma parte de la cuenca del Plata. Nace en Brasil, recorre unos 1.800 km y desemboca en el Río de la Plata"

query = [{'role': 'user', 'content': [{'type': 'text', 'text': example_query}]}]
text_1 = [{'role': 'user', 'content': [{'type': 'text', 'text': example_text_1}]}]
text_2 = [{'role': 'user', 'content': [{'type': 'text', 'text': example_text_2}]}]

sim1 = torch.cosine_similarity(encode_message(query), encode_message(text_1))
sim2 = torch.cosine_similarity(encode_message(query), encode_message(text_2))
print("Similarities:", sim1.item(), sim2.item())
```

📈 Image Document Retrieval (Image, Chart, PDF)

```python
MAX_IMAGE_SIZE = 784

example_query = 'Query: Where can we find the animal llama?'
example_image_1 = "https://huggingface.co/Tevatron/dse-phi3-docmatix-v2/resolve/main/animal-llama.png"
example_image_2 = "https://huggingface.co/Tevatron/dse-phi3-docmatix-v2/resolve/main/meta-llama.png"

query = [{'role': 'user', 'content': [{'type': 'text', 'text': example_query}]}]
image_1 = [{'role': 'user', 'content': [{'type': 'image', 'image': example_image_1, 'resized_height': MAX_IMAGE_SIZE, 'resized_width': MAX_IMAGE_SIZE}]}]
image_2 = [{'role': 'user', 'content': [{'type': 'image', 'image': example_image_2, 'resized_height': MAX_IMAGE_SIZE, 'resized_width': MAX_IMAGE_SIZE}]}]

sim1 = torch.cosine_similarity(encode_message(query), encode_message(image_1))
sim2 = torch.cosine_similarity(encode_message(query), encode_message(image_2))
print("Similarities:", sim1.item(), sim2.item())
```
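Because `encode_message` returns a single L2-normalized vector per input, a whole corpus can be indexed as one matrix and queries scored with a plain dot product (equivalent to cosine similarity), with no max-sim step. The snippet below is a minimal illustrative sketch that reuses the variables from the examples above; the `corpus`, `doc_matrix`, and `ranking` names are ours and not part of the released API.

```python
# Illustrative sketch (not part of the official API): index a small mixed text/image
# corpus and rank it for one query, reusing encode_message, the example texts,
# example_image_1, and MAX_IMAGE_SIZE defined in the snippets above.
corpus = [
    [{'role': 'user', 'content': [{'type': 'text', 'text': example_text_1}]}],
    [{'role': 'user', 'content': [{'type': 'text', 'text': example_text_2}]}],
    [{'role': 'user', 'content': [{'type': 'image', 'image': example_image_1,
                                   'resized_height': MAX_IMAGE_SIZE, 'resized_width': MAX_IMAGE_SIZE}]}],
]

# Encode each document once and stack into a (num_docs, dim) matrix.
doc_matrix = torch.cat([encode_message(doc) for doc in corpus], dim=0)

query = [{'role': 'user', 'content': [{'type': 'text', 'text': "Query: What is the capital city of Uruguay?"}]}]
query_emb = encode_message(query)  # shape (1, dim), already L2-normalized

# Dot product of normalized vectors equals cosine similarity; no max-sim aggregation needed.
scores = query_emb @ doc_matrix.T                     # shape (1, num_docs)
ranking = scores.squeeze(0).argsort(descending=True)  # best match first
print("Ranked document indices:", ranking.tolist())
print("Scores:", scores.squeeze(0).tolist())
```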
example_image_1 = "https://huggingface.co/Tevatron/dse-phi3-docmatix-v2/resolve/main/animal-llama.png" example_image_2 = "https://huggingface.co/Tevatron/dse-phi3-docmatix-v2/resolve/main/meta-llama.png" query = [{'role': 'user', 'content': [{'type': 'text', 'text': example_query}]}] image_1 = [{'role': 'user', 'content': [{'type': 'image', 'image': example_image_1, 'resized_height': MAX_IMAGE_SIZE, 'resized_width': MAX_IMAGE_SIZE}]}] image_2 = [{'role': 'user', 'content': [{'type': 'image', 'image': example_image_2, 'resized_height': MAX_IMAGE_SIZE, 'resized_width': MAX_IMAGE_SIZE}]}] sim1 = torch.cosine_similarity(encode_message(query), encode_message(image_1)) sim2 = torch.cosine_similarity(encode_message(query), encode_message(image_2)) print("Similarities:", sim1.item(), sim2.item()) ``` ## Training Details ### Training Data Training was done on a computer with the following specs: - 8x RTX 5090 for a total of 256 GB of VRAM - AMD EPYC 9534 64-Core CPU (128 threads) - 256 RAM - 2TB SSD ### Training Procedure Training was done using the [Tevatron framework](https://github.com/texttron/tevatron) and [Deepspeed](https://github.com/deepspeedai/DeepSpeed) for parallel training. #### Training Hyperparameters More information on training parameters can be found [here](https://github.com/eagerworks/eager-embed/blob/main/train.sh) ## Evaluation Model was evaluated on the Vidore 1, 2 and 3 benchmarks. More info can be found [here](https://mteb-leaderboard.hf.space/?benchmark_name=ViDoRe%28v2%29). ## Citation ```bibtex @article{EagerEmbed, title={Eager Embed V1: Multimodal Dense Embeddings for Retrieval}, author={Juan Pablo Balarini}, year={2025}, publisher={Eagerworks}, url={https://github.com/eagerworks/eager-embed} } ```