ViFortune-AI/VOVis2.5-2B-pt

	---
	license: apache-2.0
	datasets:
	- AIDC-AI/Ovis-dataset
	library_name: transformers
	tags:
	- MLLM
	- ovis
	- qwen3
	pipeline_tag: image-text-to-text
	language:
	- en
	- vi
	- zh
	---

	# Ovis2.5-2B-Pretrained (Qwen3-1.7B + SigLIP2) - Final Version For Pretraining

	<div align="center">
	<img src="https://cdn-uploads.huggingface.co/production/uploads/637aebed7ce76c3b834cea37/3IK823BZ8w-mz_QfeYkDn.png" width="30%"/>
	</div>

	<p align="center">
	<a href="https://arxiv.org/abs/2508.11737"><img src="https://img.shields.io/badge/📖_Original_Report-Ovis2.5-b31b1b.svg" alt="technical report"></a>
	<a href="https://github.com/AIDC-AI/Ovis"><img src="https://img.shields.io/badge/GitHub-AIDC--AI/Ovis-blue?style=flat&logo=github" alt="code"></a>
	<a href="https://huggingface.co/collections/AIDC-AI/ovis25-689ec1474633b2aab8809335"><img src="https://img.shields.io/badge/🤗_Official_Models-AIDC--AI/Ovis2.5-yellow" alt="models"></a>
	</p>

	---

	Ovis2.5-2B-Pretrained is a merged version combining:

	- Vision Encoder: `siglip2-so400m-patch16-512` (from Ovis2.5)
	- Language Model (LLM): `Qwen3-1.7B` (lightweight, efficient, supports Vietnamese)

	> Note: This is a base (pretrained) checkpoint that contains only the merged weights; it is not instruction-tuned. Conversational use requires further fine-tuning (SFT).

	## Architecture Details

	| Ovis MLLM | Vision Encoder | Language Model (LLM) | Status |
	|---|---|---|---|
	| VOvis2.5-2B-Pretrained (Final Version) | siglip2-so400m-patch16-512 | Qwen3-1.7B | Base pretrained model (needs SFT) |
	| Ovis2.5-2B (Official) | siglip2-so400m-patch16-512 | Qwen3-1.7B | Instruction-tuned |
	| Ovis2.5-9B (Official) | siglip2-so400m-patch16-512 | Qwen3-8B | Instruction-tuned |

	Supported languages: Vietnamese 🇻🇳, English, Chinese

	---

	## 🚀 Quick Start

	### Installation
	```bash
	pip install torch==2.8.0 transformers==4.51.3 numpy==1.26.4
	pip install flash-attn==2.7.4.post1 --no-build-isolation
	# The example below also imports PIL and requests
	pip install pillow requests
	```
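Because the install commands pin exact versions, it can help to confirm what is actually installed before loading the model. The following is a minimal sketch using the standard-library `importlib.metadata`; the names `PINNED`, `installed_version`, and `find_mismatches` are hypothetical helpers introduced here, not part of this repository.

```python
from importlib.metadata import PackageNotFoundError, version

# Pins copied from the install commands in this README.
PINNED = {
    "torch": "2.8.0",
    "transformers": "4.51.3",
    "numpy": "1.26.4",
    "flash-attn": "2.7.4.post1",
}


def installed_version(package):
    """Return the installed version string, or None if the package is missing."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


def find_mismatches(pins, lookup=installed_version):
    """Return {package: (expected, found)} for every pin that is not met."""
    mismatches = {}
    for package, expected in pins.items():
        found = lookup(package)
        # Ignore a local build suffix such as "+cu128" when comparing.
        if found is None or found.split("+")[0] != expected:
            mismatches[package] = (expected, found)
    return mismatches


if __name__ == "__main__":
    for package, (expected, found) in find_mismatches(PINNED).items():
        print(f"{package}: expected {expected}, found {found or 'not installed'}")
```

An empty result means every pinned package is present at the expected version. The `lookup` argument exists only so the comparison can be exercised without touching the real environment.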

	### Usage

	```python
	import requests
	import torch
	from PIL import Image
	from transformers import AutoModelForCausalLM

	# Load the merged checkpoint from this repository.
	# trust_remote_code=True is needed because the Ovis model class ships with the repo.
	model = AutoModelForCausalLM.from_pretrained(
	    "ViFortune-AI/VOVis2.5-2B-pt",
	    torch_dtype=torch.bfloat16,
	    trust_remote_code=True,
	).cuda()

	# Download the sample image and build a single-turn multimodal message.
	image_url = "https://cdn-uploads.huggingface.co/production/uploads/658a8a837959448ef5500ce5/TIlymOb86R6_Mez3bpmcB.png"
	image = Image.open(requests.get(image_url, stream=True).raw)
	messages = [{
	    "role": "user",
	    "content": [
	        {"type": "image", "image": image},
	        {"type": "text", "text": "Describe the image in detail."},
	    ],
	}]

	# Tokenize the text and preprocess the image into model inputs.
	input_ids, pixel_values, grid_thws = model.preprocess_inputs(
	    messages=messages,
	    add_generation_prompt=True,
	    enable_thinking=True,
	)
	input_ids = input_ids.cuda()
	pixel_values = pixel_values.cuda() if pixel_values is not None else None
	grid_thws = grid_thws.cuda() if grid_thws is not None else None

	# Generate with thinking enabled and a budget for the thinking phase.
	outputs = model.generate(
	    inputs=input_ids,
	    pixel_values=pixel_values,
	    grid_thws=grid_thws,
	    enable_thinking=True,
	    enable_thinking_budget=True,
	    max_new_tokens=3072,
	    thinking_budget=1024,
	)

	response = model.text_tokenizer.decode(outputs[0], skip_special_tokens=True)
	print(response)
	```
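The `generate` call above passes `thinking_budget=1024` together with `max_new_tokens=3072`, which leaves room for the final answer after the thinking phase. The sketch below guards that relationship before generation. It assumes the thinking budget counts toward `max_new_tokens` and must stay below it; this is inferred from the values used here rather than confirmed by this repository, and `check_thinking_budget` is a hypothetical helper.

```python
def check_thinking_budget(max_new_tokens, thinking_budget):
    """Return the tokens left for the final answer once the thinking budget is spent.

    Assumption: thinking tokens count toward max_new_tokens, so the budget
    must be strictly smaller than the total generation limit.
    """
    if thinking_budget <= 0:
        raise ValueError("thinking_budget must be positive")
    if thinking_budget >= max_new_tokens:
        raise ValueError("thinking_budget must be smaller than max_new_tokens")
    return max_new_tokens - thinking_budget


# Values taken from the generate() call in this README.
answer_tokens = check_thinking_budget(max_new_tokens=3072, thinking_budget=1024)
print(answer_tokens)  # 2048
```

Running the check before `model.generate` surfaces a misconfigured budget as an explicit error instead of a truncated or empty answer.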