| | --- |
| | license: apache-2.0 |
| | datasets: |
| | - AIDC-AI/Ovis-dataset |
| | library_name: transformers |
| | tags: |
| | - MLLM |
| | - ovis |
| | - qwen3 |
| | pipeline_tag: image-text-to-text |
| | language: |
| | - en |
| | - vi |
| | - zh |
| | --- |
| | |
| | # Ovis2.5-2B-Pretrained (Qwen3-1.7B + SigLIP2) - Final Version For Pretraining |
| |
|
| | <div align="center"> |
| | <img src="https://cdn-uploads.huggingface.co/production/uploads/637aebed7ce76c3b834cea37/3IK823BZ8w-mz_QfeYkDn.png" width="30%"/> |
| | </div> |
| |
|
| | <p align="center"> |
| | <a href="https://arxiv.org/abs/2508.11737"><img src="https://img.shields.io/badge/📖_Original_Report-Ovis2.5-b31b1b.svg" alt="technical report"></a> |
| | <a href="https://github.com/AIDC-AI/Ovis"><img src="https://img.shields.io/badge/GitHub-AIDC--AI/Ovis-blue?style=flat&logo=github" alt="code"></a> |
| | <a href="https://huggingface.co/collections/AIDC-AI/ovis25-689ec1474633b2aab8809335"><img src="https://img.shields.io/badge/🤗_Official_Models-AIDC--AI/Ovis2.5-yellow" alt="models"></a> |
| | </p> |
| |
|
| | --- |
| |
|
| |
|
| | **Ovis2.5-2B-Pretrained** merges two components: |
| |
|
| | - **Vision Encoder:** `siglip2-so400m-patch16-512` (from Ovis2.5) |
| | - **Language Model (LLM):** `Qwen3-1.7B` (lightweight, efficient, supports Vietnamese) |
| |
|
| | > **Note:** This is a base (pretrained) checkpoint that contains only the merged weights; it is not instruction-tuned. Further fine-tuning (for example, SFT) is required for good conversational performance. |
| |
|
| | ## Architecture Details |
| |
|
| | | Ovis MLLM | Vision Encoder | Language Model (LLM) | Status |
| | |----------------------------------------|----------------------------|----------------------|----------------------------------| |
| | | VOvis2.5-2B-Pretrained (Final Version) | siglip2-so400m-patch16-512 | Qwen3-1.7B | Base pretrained model (needs SFT) | |
| | | Ovis2.5-2B (Official) | siglip2-so400m-patch16-512 | Qwen3-1.7B | Instruction-tuned | |
| | | Ovis2.5-9B (Official) | siglip2-so400m-patch16-512 | Qwen3-8B | Instruction-tuned | |
| |
|
| | **Supported languages:** Vietnamese 🇻🇳, English, Chinese |
| |
| | --- |
| |
|
| | ## 🚀 **Quick Start** |
| |
|
| | ### **Installation** |
| | ```bash |
| | pip install torch==2.8.0 transformers==4.51.3 numpy==1.26.4 pillow requests |
| | pip install flash-attn==2.7.4.post1 --no-build-isolation |
| | ``` |
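
Before loading the model, it can help to confirm that the environment matches the pinned versions. The following is a minimal sketch that uses only the standard library; the `PINNED` table and the `check_pins` helper are illustrative names that are not part of this repository, and the comparison assumes that a local build suffix such as `+cu128` can be ignored.

```python
from importlib.metadata import PackageNotFoundError, version

# Pins copied from the install commands in this README.
PINNED = {
    "torch": "2.8.0",
    "transformers": "4.51.3",
    "numpy": "1.26.4",
    "flash-attn": "2.7.4.post1",
}


def check_pins(pins):
    """Return {package: (expected, installed_or_None)} for every mismatch."""
    mismatches = {}
    for name, expected in pins.items():
        try:
            installed = version(name)
        except PackageNotFoundError:
            installed = None  # package is not installed at all
        # Assumption: drop a local build tag (e.g. "+cu128") before comparing.
        if installed is None or installed.split("+")[0] != expected:
            mismatches[name] = (expected, installed)
    return mismatches
```

Calling `check_pins(PINNED)` returns an empty dict when every pinned package is installed at the expected version; otherwise each entry shows the expected version and what was found.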
| |
|
| | ### **Inference** |
| |
|
| | ```python |
| | import requests |
| | import torch |
| | from PIL import Image |
| | from transformers import AutoModelForCausalLM |
| | |
| | # Load the merged checkpoint; trust_remote_code pulls in the custom Ovis model code. |
| | model = AutoModelForCausalLM.from_pretrained( |
| |     "AIDC-AI/VOvis2.5-2B-pt", |
| |     torch_dtype=torch.bfloat16, |
| |     trust_remote_code=True, |
| | ).cuda() |
| | |
| | # Download the example image. |
| | image_url = "https://cdn-uploads.huggingface.co/production/uploads/658a8a837959448ef5500ce5/TIlymOb86R6_Mez3bpmcB.png" |
| | image = Image.open(requests.get(image_url, stream=True, timeout=30).raw) |
| | |
| | # A single user turn with one image and one text prompt. |
| | messages = [{ |
| |     "role": "user", |
| |     "content": [ |
| |         {"type": "image", "image": image}, |
| |         {"type": "text", "text": "Describe the image in detail."}, |
| |     ], |
| | }] |
| | |
| | # Tokenize the prompt and preprocess the image with the model's own helper. |
| | input_ids, pixel_values, grid_thws = model.preprocess_inputs( |
| |     messages=messages, |
| |     add_generation_prompt=True, |
| |     enable_thinking=True, |
| | ) |
| | |
| | # Move the inputs to the GPU; the image tensors are None for text-only prompts. |
| | input_ids = input_ids.cuda() |
| | pixel_values = pixel_values.cuda() if pixel_values is not None else None |
| | grid_thws = grid_thws.cuda() if grid_thws is not None else None |
| | |
| | # Generate with thinking mode on and a reasoning budget of 1024 tokens. |
| | outputs = model.generate( |
| |     inputs=input_ids, |
| |     pixel_values=pixel_values, |
| |     grid_thws=grid_thws, |
| |     enable_thinking=True, |
| |     enable_thinking_budget=True, |
| |     max_new_tokens=3072, |
| |     thinking_budget=1024, |
| | ) |
| | |
| | # Decode the first (and only) generated sequence. |
| | response = model.text_tokenizer.decode(outputs[0], skip_special_tokens=True) |
| | print(response) |
| | ``` |
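
With `enable_thinking=True`, the decoded text may contain a reasoning segment before the final answer. The sketch below separates the two. It assumes the output uses Qwen3-style `<think>` and `</think>` markers and that they survive decoding with `skip_special_tokens=True`; neither is confirmed for this checkpoint, and a base model may not emit them at all. `split_thinking` is a hypothetical helper, not part of the model's API.

```python
def split_thinking(text, open_tag="<think>", close_tag="</think>"):
    """Split decoded output into (reasoning, answer).

    Assumption: reasoning, if present, ends at the first close_tag.
    When no close_tag is found, the whole text is treated as the answer.
    """
    head, sep, tail = text.partition(close_tag)
    if not sep:
        return "", text.strip()
    # Drop the opening marker (if any) from the reasoning part.
    reasoning = head.replace(open_tag, "", 1).strip()
    return reasoning, tail.strip()
```

For example, `reasoning, answer = split_thinking(response)` leaves `reasoning` empty and `answer` equal to the stripped response when no marker is present.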