---
library_name: peft
license: apache-2.0
base_model:
- Qwen/Qwen3-VL-4B-Instruct
pipeline_tag: visual-document-retrieval
---

# Eager Embed V1

**eager-embed-v1** is a multimodal dense embedding model built on a Vision-Language Model (VLM). It is designed to efficiently index documents using both their visual and textual features.

Compared to multi-vector (ColBERT-like) architectures, eager-embed-v1 offers a strong balance between embedding dimensionality and retrieval accuracy while maintaining efficiency. Unlike those approaches, it does not require a max-sim scoring function, which further simplifies retrieval.

## Model Details

### Model Description

- **Developed by:** Juan Pablo Balarini
- **Funded by:** [Eagerworks](https://eagerworks.com/)
- **Model type:** Embedding model
- **License:** Apache 2.0
- **Finetuned from model:** Qwen3-VL-4B-Instruct

### Model Sources

- **Repository:** [eager-embed](https://github.com/eagerworks/eager-embed)

## How to Get Started with the Model

Load the model and define a helper function to encode messages:

```python
import torch
from transformers import AutoProcessor, Qwen3VLForConditionalGeneration
from transformers.utils.import_utils import is_flash_attn_2_available
from qwen_vl_utils import process_vision_info

MODEL_NAME = "eagerworks/eager-embed-v1"

# Pick the best available device
DEVICE = torch.device("cpu")
if torch.cuda.is_available():
    DEVICE = torch.device("cuda:0")
elif torch.backends.mps.is_available():
    DEVICE = torch.device("mps")

DTYPE = torch.bfloat16

processor = AutoProcessor.from_pretrained(MODEL_NAME)
model = Qwen3VLForConditionalGeneration.from_pretrained(
    MODEL_NAME,
    attn_implementation=(
        "flash_attention_2" if is_flash_attn_2_available() else None
    ),
    dtype=DTYPE,
).to(DEVICE).eval()


# Encode a chat-style message (text and/or images) into a single normalized embedding
def encode_message(message):
    with torch.no_grad():
        texts = processor.apply_chat_template(
            message, tokenize=False, add_generation_prompt=True
        ) + "<|endoftext|>"
        image_inputs, video_inputs = process_vision_info(message)
        inputs = processor(
            text=texts,
            images=image_inputs,
            videos=video_inputs,
            return_tensors="pt",
            padding="longest",
        ).to(DEVICE)
        model_outputs = model(**inputs, return_dict=True, output_hidden_states=True)
        last_hidden_state = model_outputs.hidden_states[-1]
        # Use the last token's hidden state as the dense embedding, then L2-normalize
        embeddings = last_hidden_state[:, -1]
        embeddings = torch.nn.functional.normalize(embeddings, p=2, dim=-1)
        return embeddings
```

🌍 Multilingual Text Retrieval

```python
example_query = "Query: What is the capital city of Uruguay?"
example_text_1 = "Montevideo es la capital y la ciudad más poblada de la República Oriental del Uruguay, así como la capital del departamento homónimo"
example_text_2 = "El río Uruguay es un río internacional que forma parte de la cuenca del Plata. Nace en Brasil, recorre unos 1.800 km y desemboca en el Río de la Plata"

query = [{'role': 'user', 'content': [{'type': 'text', 'text': example_query}]}]
text_1 = [{'role': 'user', 'content': [{'type': 'text', 'text': example_text_1}]}]
text_2 = [{'role': 'user', 'content': [{'type': 'text', 'text': example_text_2}]}]

sim1 = torch.cosine_similarity(encode_message(query), encode_message(text_1))
sim2 = torch.cosine_similarity(encode_message(query), encode_message(text_2))
print("Similarities:", sim1.item(), sim2.item())
```

📈 Image Document Retrieval (Image, Chart, PDF)

```python
MAX_IMAGE_SIZE = 784

example_query = 'Query: Where can we find the animal llama?'
example_image_1 = "https://huggingface.co/Tevatron/dse-phi3-docmatix-v2/resolve/main/animal-llama.png"
example_image_2 = "https://huggingface.co/Tevatron/dse-phi3-docmatix-v2/resolve/main/meta-llama.png"

query = [{'role': 'user', 'content': [{'type': 'text', 'text': example_query}]}]
image_1 = [{'role': 'user', 'content': [{'type': 'image', 'image': example_image_1, 'resized_height': MAX_IMAGE_SIZE, 'resized_width': MAX_IMAGE_SIZE}]}]
image_2 = [{'role': 'user', 'content': [{'type': 'image', 'image': example_image_2, 'resized_height': MAX_IMAGE_SIZE, 'resized_width': MAX_IMAGE_SIZE}]}]

sim1 = torch.cosine_similarity(encode_message(query), encode_message(image_1))
sim2 = torch.cosine_similarity(encode_message(query), encode_message(image_2))
print("Similarities:", sim1.item(), sim2.item())
```
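Because `encode_message` returns a single L2-normalized vector per input, a whole corpus can be indexed as one matrix and queries scored with a plain dot product (equivalent to cosine similarity), with no max-sim step. The snippet below is a minimal illustrative sketch that reuses the variables from the examples above; the `corpus`, `doc_matrix`, and `ranking` names are ours and not part of the released API.

```python
# Illustrative sketch (not part of the official API): index a small mixed text/image
# corpus and rank it for one query, reusing encode_message, the example texts,
# example_image_1, and MAX_IMAGE_SIZE defined in the snippets above.
corpus = [
    [{'role': 'user', 'content': [{'type': 'text', 'text': example_text_1}]}],
    [{'role': 'user', 'content': [{'type': 'text', 'text': example_text_2}]}],
    [{'role': 'user', 'content': [{'type': 'image', 'image': example_image_1,
                                   'resized_height': MAX_IMAGE_SIZE, 'resized_width': MAX_IMAGE_SIZE}]}],
]

# Encode each document once and stack into a (num_docs, dim) matrix.
doc_matrix = torch.cat([encode_message(doc) for doc in corpus], dim=0)

query = [{'role': 'user', 'content': [{'type': 'text', 'text': "Query: What is the capital city of Uruguay?"}]}]
query_emb = encode_message(query)  # shape (1, dim), already L2-normalized

# Dot product of normalized vectors equals cosine similarity; no max-sim aggregation needed.
scores = query_emb @ doc_matrix.T                     # shape (1, num_docs)
ranking = scores.squeeze(0).argsort(descending=True)  # best match first
print("Ranked document indices:", ranking.tolist())
print("Scores:", scores.squeeze(0).tolist())
```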
example_image_1 = "https://huggingface.co/Tevatron/dse-phi3-docmatix-v2/resolve/main/animal-llama.png" example_image_2 = "https://huggingface.co/Tevatron/dse-phi3-docmatix-v2/resolve/main/meta-llama.png" query = [{'role': 'user', 'content': [{'type': 'text', 'text': example_query}]}] image_1 = [{'role': 'user', 'content': [{'type': 'image', 'image': example_image_1, 'resized_height': MAX_IMAGE_SIZE, 'resized_width': MAX_IMAGE_SIZE}]}] image_2 = [{'role': 'user', 'content': [{'type': 'image', 'image': example_image_2, 'resized_height': MAX_IMAGE_SIZE, 'resized_width': MAX_IMAGE_SIZE}]}] sim1 = torch.cosine_similarity(encode_message(query), encode_message(image_1)) sim2 = torch.cosine_similarity(encode_message(query), encode_message(image_2)) print("Similarities:", sim1.item(), sim2.item()) ``` ## Training Details ### Training Data Training was done on a computer with the following specs: - 8x RTX 5090 for a total of 256 GB of VRAM - AMD EPYC 9534 64-Core CPU (128 threads) - 256 RAM - 2TB SSD ### Training Procedure Training was done using the [Tevatron framework](https://github.com/texttron/tevatron) and [Deepspeed](https://github.com/deepspeedai/DeepSpeed) for parallel training. #### Training Hyperparameters More information on training parameters can be found [here](https://github.com/eagerworks/eager-embed/blob/main/train.sh) ## Evaluation Model was evaluated on the Vidore 1, 2 and 3 benchmarks. More info can be found [here](https://mteb-leaderboard.hf.space/?benchmark_name=ViDoRe%28v2%29). ## Citation ```bibtex @article{EagerEmbed, title={Eager Embed V1: Multimodal Dense Embeddings for Retrieval}, author={Juan Pablo Balarini}, year={2025}, publisher={Eagerworks}, url={https://github.com/eagerworks/eager-embed} } ```