---
license: cc-by-nc-4.0
pipeline_tag: image-text-to-text
library_name: transformers
base_model:
- Qwen/Qwen2.5-7B-Instruct
- google/siglip-so400m-patch14-384
base_model_relation: merge
language:
- multilingual
tags:
- eagle
- VLM
---

# Eagle 2.5

[\[📂 GitHub\]](https://github.com/NVlabs/EAGLE) [\[📜 Eagle 2.5 Tech Report\]](https://arxiv.org/abs/your-paper-id)
[\[🗨️ Chat Demo (NVIDIA Internal Network)\]](http://10.49.128.96:8899)
<!-- [\[🤗 HF Demo\]](TODO) -->

## Introduction

Eagle 2.5 is a family of frontier vision-language models (VLMs) for long-context multimodal learning. While most existing VLMs focus on short-context tasks, Eagle 2.5 addresses the challenges of long video comprehension and high-resolution image understanding, providing a generalist framework for both. The Eagle 2.5 training framework introduces two key techniques, Automatic Degrade Sampling (ADS) and Image Area Preservation (IAP), to preserve contextual integrity and visual details, and its training pipeline is optimized for efficient long-context data training.

A major contribution of Eagle 2.5 is Eagle-Video-110K, a novel dataset with both story-level and clip-level annotations, curated specifically for long video understanding. Eagle 2.5 delivers substantial improvements on long-context multimodal benchmarks, offering a robust solution to the limitations of existing VLMs. Notably, Eagle 2.5-8B achieves 72.4% on Video-MME with 512 input frames, matching top-tier commercial models such as GPT-4o and large-scale open-source models such as Qwen2.5-VL-72B and InternVL2.5-78B, despite having significantly fewer parameters.

### Key Innovations

- **Information-First Sampling**:
  - *Image Area Preservation (IAP)*: optimizes image tiling to retain most of the original image area and aspect ratio, preserving fine-grained details.
  - *Automatic Degrade Sampling (ADS)*: dynamically balances visual and textual input, retaining the complete text while maximizing visual content within context-length constraints.
- **Progressive Mixed Post-Training**:
  - Gradually increases the context length during training, enhancing the model's ability to process varying input sizes and improving information density over static sampling.
- **Diversity-Driven Data Recipe**:
  - Combines open-source data (human-annotated and synthetic) with the self-curated Eagle-Video-110K dataset, collected via a diversity-driven strategy and annotated with both story-level and clip-level QA pairs.

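To give an intuition for the IAP objective, here is a small sketch. The function below is a hypothetical stand-in, not the released implementation: it scores candidate tile grids by how well they preserve the input image's aspect ratio and area, assuming a fixed tile size, a fixed tile budget, and an arbitrary weighting between the two error terms.

```python
# Hypothetical IAP-style grid selection (illustration only, not the official code).
def pick_tile_grid(width, height, tile=448, max_tiles=12):
    """Pick a (cols, rows) tiling that best preserves aspect ratio and area."""
    img_aspect = width / height
    best, best_score = (1, 1), float("inf")
    for cols in range(1, max_tiles + 1):
        for rows in range(1, max_tiles // cols + 1):
            # distortion introduced by stretching the image to the grid's aspect ratio
            aspect_err = abs(cols / rows - img_aspect)
            # relative mismatch between the tiled area and the original image area
            area_err = abs(cols * rows * tile * tile - width * height) / (width * height)
            score = aspect_err + 0.1 * area_err  # weighting is an arbitrary choice here
            if score < best_score:
                best, best_score = (cols, rows), score
    return best

print(pick_tile_grid(1920, 1080))  # a wide image gets a wide grid: (4, 2)
```

A plain resize-to-square would discard this aspect information; scoring grids jointly on aspect and area keeps thin or panoramic inputs legible.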
## Model Details

- **Model Type**: Long-context vision-language model
- **Architecture**:
  - Vision encoder: SigLIP (with tiling strategies)
  - Language model: Qwen2.5 series
  - Multimodal connector: MLP projection
- **Supported Inputs**:
  - Long video sequences (up to 512 frames)
  - High-resolution images
  - Multi-page documents
  - Long text
- **Training Strategy**:
  - Progressive mixed post-training, expanding the context length from 32K to 128K
  - Information-first sampling for optimal retention of visual and textual information
- **Training Data**:
  - Open-source video and document datasets
  - Eagle-Video-110K (110K long videos with dual-level annotations)

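To make the information-first sampling idea concrete, here is a hypothetical sketch of ADS-style frame selection. It assumes text tokens are always kept in full and each frame costs a fixed number of tokens; both numbers below are illustrative, not the model's actual token accounting.

```python
# Hypothetical ADS-style sampling (illustration only, not the official code).
def ads_sample(num_frames, text_tokens, context_len=131072, tokens_per_frame=256):
    """Keep all text tokens; fit as many uniformly spaced frames as the budget allows."""
    budget = context_len - text_tokens              # text is retained in full
    max_frames = max(budget // tokens_per_frame, 1)
    if num_frames <= max_frames:
        return list(range(num_frames))              # everything fits, no degradation
    # degrade gracefully: uniformly subsample frame indices to fit the budget
    step = num_frames / max_frames
    return [int(i * step) for i in range(max_frames)]
```

The key property is that visual input degrades only when the budget forces it to, while the text is never truncated.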
## Released Models

| Model | Date | Download Link | Notes |
|---|---|---|---|
| Eagle2.5-8B | 2025.04.16 | [HF link](https://huggingface.co/nvidia/Eagle-2.5-8B) | Long video (512 frames), high-res support |

## Video Benchmarks

| Benchmark | GPT-4o | Gemini-1.5 Pro | InternVL2.5-8B | Qwen2.5-VL-8B | **Eagle2.5-8B** |
|---|---|---|---|---|---|
| MVBench<sub>test</sub> | - | - | 72.0 | 69.6 | **74.8** |
| Perception_test<sub>val</sub> | - | - | - | 70.5 | **82.0** |
| EgoSchema<sub>fullset</sub> | - | 72.2 | - | 65.0 | **72.2** |
| MMB-Video | 1.63 | 1.30 | 1.68 | 1.79 | **1.94** |
| MLVU<sub>val</sub> | - | - | 68.9 | 70.2 | **77.6** |
| LVBench<sub>val</sub> | 66.7 | 64.0 | 60.0 | 56.0 | **66.4** |
| Video-MME<sub>w/o subtitle</sub> | 71.9 | 75.0 | 64.2 | 65.1 | **72.4** |
| Video-MME<sub>w subtitle</sub> | 77.2 | 81.3 | 66.9 | 71.6 | **75.7** |
| CG-Bench<sub>Clue</sub> | 58.6 | 50.9 | - | 44.5 | **55.8** |
| CG-Bench<sub>Long</sub> | 44.9 | 37.8 | - | 35.5 | **46.6** |
| CG-Bench<sub>mIoU</sub> | 5.73 | 3.85 | - | 2.48 | **13.4** |
| HourVideo<sub>Dev</sub> | - | 37.2 | - | - | **44.5** |
| HourVideo<sub>Test</sub> | - | 37.4 | - | - | **41.8** |
| Charade-STA<sub>mIoU</sub> | 35.7 | - | - | 43.6 | **65.9** |
| HD-EPIC | - | 37.6 | - | - | **42.9** |
| HRVideoBench | - | - | - | - | **68.5** |
| EgoPlan<sub>val</sub> | - | - | - | - | **45.3** |

## Embodied Benchmarks

| Benchmark | GPT-4o | Gemini-1.5 Pro | InternVL2.5-8B | Qwen2.5-VL-8B | **Eagle2.5-8B** |
|---|---|---|---|---|---|
| OpenEQA | - | - | - | - | **63.5** |
| ERQA | 47.0 | 41.8 | - | - | **38.3** |
| EgoPlan<sub>val</sub> | - | - | - | - | **45.3** |

## Image Benchmarks

| Benchmark | GPT-4o | Gemini-1.5 Pro | InternVL2.5-8B | Qwen2.5-VL-8B | **Eagle2.5-8B** |
|---|---|---|---|---|---|
| DocVQA<sub>test</sub> | 92.8 | 93.1 | 93.0 | 95.7 | **94.1** |
| ChartQA<sub>test</sub> | 85.7 | 87.2 | 84.8 | 87.3 | **87.5** |
| InfoVQA<sub>test</sub> | 79.2 | 81.0 | 77.6 | 82.6 | **80.4** |
| TextVQA<sub>val</sub> | 77.4 | 78.8 | 79.1 | 84.9 | **83.7** |
| OCRBench<sub>test</sub> | 736 | 754 | 822 | 864 | **869** |
| MMstar<sub>test</sub> | 64.7 | 59.1 | 62.8 | 63.9 | **66.2** |
| RWQA<sub>test</sub> | 75.4 | 67.5 | 70.1 | 68.5 | **76.7** |
| AI2D<sub>test</sub> | 84.6 | 79.1 | 84.5 | 83.9 | **84.5** |
| MMMU<sub>val</sub> | 69.1 | 62.2 | 56.0 | 58.6 | **55.8** |
| MMBench_V11<sub>test</sub> | 83.1 | 74.6 | 83.2 | 82.6 | **81.7** |
| MMVet<sub>GPT-4-Turbo</sub> | 69.1 | 64.0 | 62.8 | 67.1 | **62.9** |
| HallBench<sub>avg</sub> | 55.0 | 45.6 | 50.1 | 52.9 | **54.7** |
| MathVista<sub>testmini</sub> | 63.8 | 63.9 | 64.4 | 68.2 | **67.8** |
| Avg Score | 74.9 | 71.7 | 73.1 | 75.6 | **75.6** |

*All numbers are directly extracted from Table 2 and Table 3 of the Eagle 2.5 Tech Report.*

*Eagle 2.5-8B matches or surpasses the performance of much larger models on long-context video and image benchmarks.*

## Quick Start

Install the pinned `transformers` version first:

```bash
pip install transformers==4.51.0
```

### Single Image

```python
import torch
from transformers import AutoProcessor, AutoModel

model = AutoModel.from_pretrained(
    "nvidia/Eagle-2.5-8B", trust_remote_code=True, torch_dtype=torch.bfloat16
)
processor = AutoProcessor.from_pretrained("nvidia/Eagle-2.5-8B", trust_remote_code=True, use_fast=True)
processor.tokenizer.padding_side = "left"

messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "image",
                "image": "https://www.ilankelman.org/stopsigns/australia.jpg",
            },
            {"type": "text", "text": "Describe this image."},
        ],
    }
]

text_list = [processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)]
image_inputs, video_inputs = processor.process_vision_info(messages)
inputs = processor(text=text_list, images=image_inputs, videos=video_inputs, return_tensors="pt", padding=True)
inputs = inputs.to("cuda")
model = model.to("cuda")

generated_ids = model.generate(**inputs, max_new_tokens=1024)
output_text = processor.batch_decode(generated_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False)
print(output_text)
```

### Streaming Generation

```python
import threading

import torch
from transformers import AutoModel, AutoProcessor, AutoTokenizer, TextIteratorStreamer

model = AutoModel.from_pretrained(
    "nvidia/Eagle-2.5-8B",
    trust_remote_code=True,
    attn_implementation="flash_attention_2",
    torch_dtype=torch.bfloat16,
)
tokenizer = AutoTokenizer.from_pretrained("nvidia/Eagle-2.5-8B", trust_remote_code=True, use_fast=True)
processor = AutoProcessor.from_pretrained("nvidia/Eagle-2.5-8B", trust_remote_code=True, use_fast=True)
processor.tokenizer.padding_side = "left"

messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "image",
                "image": "https://www.ilankelman.org/stopsigns/australia.jpg",
            },
            {"type": "text", "text": "Describe this image."},
        ],
    }
]

text_list = [processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)]
image_inputs, video_inputs = processor.process_vision_info(messages)
inputs = processor(text=text_list, images=image_inputs, videos=video_inputs, return_tensors="pt", padding=True)
inputs = inputs.to("cuda")
model = model.to("cuda")

# run generation in a background thread and stream tokens as they arrive
streamer = TextIteratorStreamer(tokenizer, skip_prompt=True, skip_special_tokens=True)
generation_kwargs = dict(
    **inputs,
    streamer=streamer,
    max_new_tokens=1024,
    do_sample=True,
    top_p=0.95,
    temperature=0.8,
)
thread = threading.Thread(target=model.generate, kwargs=generation_kwargs)
thread.start()

for new_text in streamer:
    print(new_text, end="", flush=True)
thread.join()
```

### Multiple Images

```python
import torch
from transformers import AutoProcessor, AutoModel

model = AutoModel.from_pretrained(
    "nvidia/Eagle-2.5-8B", trust_remote_code=True, torch_dtype=torch.bfloat16
)
processor = AutoProcessor.from_pretrained("nvidia/Eagle-2.5-8B", trust_remote_code=True, use_fast=True)
processor.tokenizer.padding_side = "left"

messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "image",
                "image": "https://www.ilankelman.org/stopsigns/australia.jpg",
            },
            {
                "type": "image",
                "image": "https://www.nvidia.com/content/dam/en-zz/Solutions/about-nvidia/logo-and-brand/01-nvidia-logo-vert-500x200-2c50-d@2x.png",
            },
            {"type": "text", "text": "Describe these two images."},
        ],
    }
]

text_list = [processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)]
image_inputs, video_inputs = processor.process_vision_info(messages)
inputs = processor(text=text_list, images=image_inputs, videos=video_inputs, return_tensors="pt", padding=True)
inputs = inputs.to("cuda")
model = model.to("cuda")

generated_ids = model.generate(**inputs, max_new_tokens=1024)
output_text = processor.batch_decode(generated_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False)
print(output_text)
```

### Single Video

```python
import torch
from transformers import AutoProcessor, AutoModel

model = AutoModel.from_pretrained(
    "nvidia/Eagle-2.5-8B", trust_remote_code=True, torch_dtype=torch.bfloat16
)
processor = AutoProcessor.from_pretrained("nvidia/Eagle-2.5-8B", trust_remote_code=True, use_fast=True)
processor.tokenizer.padding_side = "left"

messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "video",
                "video": "../Eagle2-8B/space_woaudio.mp4",
            },
            {"type": "text", "text": "Describe this video."},
        ],
    }
]

text_list = [processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)]
image_inputs, video_inputs, video_kwargs = processor.process_vision_info(messages, return_video_kwargs=True)
inputs = processor(
    text=text_list,
    images=image_inputs,
    videos=video_inputs,
    return_tensors="pt",
    padding=True,
    videos_kwargs=video_kwargs,
)
inputs = inputs.to("cuda")
model = model.to("cuda")

generated_ids = model.generate(**inputs, max_new_tokens=1024)
output_text = processor.batch_decode(generated_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False)
print(output_text)
```

### Multiple Videos

```python
import torch
from transformers import AutoProcessor, AutoModel

model = AutoModel.from_pretrained(
    "nvidia/Eagle-2.5-8B", trust_remote_code=True, torch_dtype=torch.bfloat16
)
processor = AutoProcessor.from_pretrained("nvidia/Eagle-2.5-8B", trust_remote_code=True, use_fast=True)
processor.tokenizer.padding_side = "left"

messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "video",
                "video": "../Eagle2-8B/space_woaudio.mp4",
                "nframes": 10,
            },
            {
                "type": "video",
                "video": "../Eagle2-8B/video_ocr.mp4",
                "nframes": 10,
            },
            {"type": "text", "text": "Describe these two videos respectively."},
        ],
    }
]

text_list = [processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)]
image_inputs, video_inputs, video_kwargs = processor.process_vision_info(messages, return_video_kwargs=True)
inputs = processor(
    text=text_list,
    images=image_inputs,
    videos=video_inputs,
    return_tensors="pt",
    padding=True,
    videos_kwargs=video_kwargs,
)
inputs = inputs.to("cuda")
model = model.to("cuda")

generated_ids = model.generate(**inputs, max_new_tokens=1024)
output_text = processor.batch_decode(generated_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False)
print(output_text)
```

### Batch Inference

```python
import torch
from transformers import AutoProcessor, AutoModel

model = AutoModel.from_pretrained(
    "nvidia/Eagle-2.5-8B", trust_remote_code=True, torch_dtype=torch.bfloat16
)
processor = AutoProcessor.from_pretrained("nvidia/Eagle-2.5-8B", trust_remote_code=True, use_fast=True)
processor.tokenizer.padding_side = "left"

messages1 = [
    {
        "role": "user",
        "content": [
            {
                "type": "image",
                "image": "https://www.ilankelman.org/stopsigns/australia.jpg",
            },
            {"type": "text", "text": "Describe this image."},
        ],
    }
]

messages2 = [
    {
        "role": "user",
        "content": [
            {
                "type": "image",
                "image": "https://www.nvidia.com/content/dam/en-zz/Solutions/about-nvidia/logo-and-brand/01-nvidia-logo-vert-500x200-2c50-d@2x.png",
            },
            {"type": "text", "text": "Describe this image."},
        ],
    }
]

text_list = [
    processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
    for messages in [messages1, messages2]
]
image_inputs, video_inputs = processor.process_vision_info([messages1, messages2])
inputs = processor(text=text_list, images=image_inputs, videos=video_inputs, return_tensors="pt", padding=True)
inputs = inputs.to("cuda")
model = model.to("cuda")

generated_ids = model.generate(**inputs, max_new_tokens=1024)
output_text = processor.batch_decode(generated_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False)
print(output_text)
```

## Citation

If you use Eagle 2.5 in your research, please cite:

```bibtex
@article{chen2025eagle25,
  title={Eagle 2.5: Boosting Long-Context Post-Training for Frontier Vision-Language Models},
  author={Chen, Guo and Li, Zhiqi and Wang, Shihao and Jiang, Jindong and Liu, Yicheng and Lu, Lidong and Huang, De-An and Byeon, Wonmin and Le, Matthieu and Ehrlich, Max and Lu, Tong and Wang, Limin and Catanzaro, Bryan and Kautz, Jan and Tao, Andrew and Yu, Zhiding and Liu, Guilin},
  journal={arXiv preprint arXiv:2025.xxxx.xxxxx},
  year={2025}
}
```

## Acknowledgements

We thank our contributors and collaborators for their valuable discussions and support, including the NVIDIA infrastructure and research teams.

## License/Terms of Use

- The code is released under the Apache 2.0 license, as found in the [LICENSE](https://huggingface.co/NVEagle/Eagle-X5-13B-Chat/blob/main/LICENSE) file.
- The pretrained model weights are released under the [Creative Commons Attribution: Non-Commercial 4.0 International](https://spdx.org/licenses/CC-BY-NC-4.0) license.
- The service is a research preview intended for non-commercial use only and is subject to the following licenses and terms:
  - Model License of Qwen2.5-7B-Instruct: [Apache-2.0](https://huggingface.co/Qwen/Qwen2.5-7B-Instruct/blob/main/LICENSE)
  - Model License of google/siglip-so400m-patch14-384: [Apache-2.0](https://huggingface.co/google/siglip-so400m-patch14-384)

## Ethical Considerations

NVIDIA believes Trustworthy AI is a shared responsibility, and we have established policies and practices to enable development for a wide array of AI applications. When downloading or using this model in accordance with our terms of service, developers should work with their internal model team to ensure it meets the requirements for the relevant industry and use case and addresses unforeseen product misuse.

Please report security vulnerabilities or NVIDIA AI concerns [here](https://www.nvidia.com/en-us/support/submit-security-vulnerability/).