Quantization support

by princemjp - opened Aug 21, 2025

Discussion

princemjp

Aug 21, 2025

Will there be any AWQ or bitsandbytes quantization support for this model?

ThetaCursed

Aug 22, 2025

Most likely, in a month and a half they will add GPTQ-Int8 and GPTQ-Int4 versions.

Intics

Sep 27, 2025

When can I expect the quantized model?

wsbagnsv1

Oct 5, 2025 (edited Oct 5, 2025)

> When can I expect the quantized model

I have successfully applied Huawei’s new SINQ quantization technique to the 9B and 2B models, but only to the LLM component. To make inference feasible on my hardware, I built a custom “franken-inference” pipeline. The quantized LLM runs on my RTX 4070 Ti, while the visual encoder is offloaded to a secondary RTX 2070. Without this split, the full model would not fit in VRAM even at 4-bit quantization.

Right now, the implementation isn’t heavily optimized and could use some cleanup, but it’s fully functional. During inference, the 9B quantized LLM uses under 9GB of VRAM on the 4070 Ti. The model weights are around 7GB, and the KV cache adds another 2GB. Meanwhile, the visual encoder runs smoothly on the 2070, consuming under 5GB of VRAM.

The current quantization level is approximately 4-bit, but I have headroom to go higher, potentially to 5-bit or even 6-bit, given my 12GB VRAM budget. I’m also exploring adding FlashAttention support to further reduce memory pressure and improve inference speed.
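As a rough check on that headroom, here is a minimal back-of-the-envelope sketch (not the poster's code). It takes the figures quoted above (about 7 GB of weights at roughly 4-bit, about 2 GB of KV cache, a 12 GB card) and assumes, pessimistically, that all of the weight memory scales linearly with bit-width. The function names and that linear-scaling assumption are illustrative only.

```python
# Back-of-the-envelope VRAM estimate using the figures quoted in the post.
# Assumption (not from the post): all ~7 GB of weights scale linearly with bit-width.

VRAM_BUDGET_GB = 12.0      # 12 GB card mentioned in the post
WEIGHTS_GB_AT_4BIT = 7.0   # "around 7GB" of weights at ~4-bit
KV_CACHE_GB = 2.0          # "another 2GB" for the KV cache


def estimate_vram_gb(bits: float) -> float:
    """Estimate total LLM VRAM at a given bit-width by linear scaling from 4-bit."""
    weights_gb = WEIGHTS_GB_AT_4BIT * bits / 4.0
    return weights_gb + KV_CACHE_GB


def fits(bits: float, budget_gb: float = VRAM_BUDGET_GB) -> bool:
    """Return True if the estimate stays within the VRAM budget."""
    return estimate_vram_gb(bits) <= budget_gb


if __name__ == "__main__":
    for bits in (4, 5, 6):
        verdict = "fits" if fits(bits) else "does not fit"
        print(f"{bits}-bit: ~{estimate_vram_gb(bits):.2f} GB -> {verdict}")
```

Under this assumption 5-bit lands at about 10.75 GB, while 6-bit comes out slightly over 12 GB. Whether 6-bit is reachable in practice would depend on how much of the 7 GB is actually quantized weights (unquantized layers would not grow) and on how large the KV cache is allowed to get.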

The visual encoder can even be offloaded to the CPU if needed. It’s still surprisingly fast, and this would free up more GPU memory for longer sequences. But it can also be run on the same GPU as the LLM part.
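The placement options described here (second GPU, CPU, or the same GPU) can be sketched as a small planning helper. This is a hypothetical illustration, not the poster's pipeline: `plan_placement` is an invented name, the component sizes are the upper bounds quoted above (under 9 GB for the LLM, under 5 GB for the visual encoder), and the 8 GB capacity for the second card is an assumption.

```python
# Hypothetical placement planner: decide which device each model component goes on.
# Sizes are the upper bounds quoted in the post; device capacities are assumptions.

def plan_placement(components, devices):
    """Greedily assign each component to the first device with enough free memory.

    components: list of (name, size_gb), in priority order.
    devices: list of (name, capacity_gb), in preference order.
    A component that fits on no device falls back to "cpu".
    """
    free = {name: capacity for name, capacity in devices}
    plan = {}
    for component, size_gb in components:
        target = "cpu"  # fallback when no GPU has room
        for device, _ in devices:
            if free[device] >= size_gb:
                free[device] -= size_gb
                target = device
                break
        plan[component] = target
    return plan


components = [("llm", 9.0), ("visual_encoder", 5.0)]

# Two cards: 12 GB primary plus an assumed 8 GB secondary.
two_gpus = plan_placement(components, [("cuda:0", 12.0), ("cuda:1", 8.0)])
# Single 12 GB card: the encoder no longer fits next to the LLM.
one_gpu = plan_placement(components, [("cuda:0", 12.0)])

print(two_gpus)
print(one_gpu)
```

With these conservative numbers the encoder lands on the second GPU when one is available and on the CPU otherwise. Because the quoted figures are upper bounds, the real footprint may be small enough for both parts to share one card, which is the third option mentioned above.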

If you’re interested in the code, setup details, or want to collaborate on optimizing it further, I’d love to hear from you!

There is one big issue in my code, and it's system RAM usage: it spikes a LOT, so less than 64 GB might not be enough at the moment. I'm working on a fix though (;

EDIT: Found a massive bug that makes inference speed crazy slow /:

wsbagnsv1

Oct 6, 2025

Okay, I fixed the main bug and now everything works. It still uses a lot of system RAM, but inference works on my 12 GB VRAM card (;

wsbagnsv1

Oct 6, 2025

> When can I expect the quantized model

If you can't wait, you can use this one. It's a bit messy, but it should be possible to get it running (;
https://huggingface.co/wsbagnsv1/Ovis2.5-9B-sinq-4bit-experimental
