Quantization support
Will there be any AWQ or bitsandbytes quantization support for this model?
Most likely they will add GPTQ-Int8 and Int4 in about a month and a half.
When can I expect the quantized model?
I have successfully applied Huawei’s new SINQ quantization technique to the 9B and 2B models, but only to the LLM component. To make inference feasible on my hardware, I built a custom “franken-inference” pipeline. The quantized LLM runs on my RTX 4070 Ti, while the visual encoder is offloaded to a secondary RTX 2070. Without this split, the full model would not fit in VRAM even at 4-bit quantization.
Right now, the implementation isn’t heavily optimized and could use some cleanup, but it’s fully functional. During inference, the 9B quantized LLM uses under 9GB of VRAM on the 4070 Ti. The model weights are around 7GB, and the KV cache adds another 2GB. Meanwhile, the visual encoder runs smoothly on the 2070, consuming under 5GB of VRAM.
The current quantization level is approximately 4-bit, but I have headroom to go higher, potentially to 5-bit or even 6-bit, given my 12GB VRAM budget. I’m also exploring adding FlashAttention support to further reduce memory pressure and improve inference speed.
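As a rough sanity check on that headroom, here is a minimal sketch that scales the figures quoted above (about 7GB of weights at roughly 4-bit, about 2GB of KV cache, a 12GB card). The linear scaling of weight memory with bit width is an assumption, and `estimate_llm_vram_gb` is a hypothetical helper, not part of the actual pipeline.

```python
# Figures taken from the post above; everything else is an assumption.
GB_WEIGHTS_AT_4BIT = 7.0   # ~7GB of LLM weights at ~4-bit
GB_KV_CACHE = 2.0          # KV cache adds ~2GB
GB_VRAM_BUDGET = 12.0      # 12GB VRAM budget


def estimate_llm_vram_gb(bits: float) -> float:
    """Estimate LLM VRAM use, assuming weight memory scales linearly with bit width."""
    weights_gb = GB_WEIGHTS_AT_4BIT * bits / 4.0
    return weights_gb + GB_KV_CACHE


for bits in (4, 5, 6):
    total = estimate_llm_vram_gb(bits)
    headroom = GB_VRAM_BUDGET - total
    print(f"{bits}-bit: ~{total:.2f} GB total, headroom {headroom:+.2f} GB")
```

Under this naive scaling, 5-bit leaves a little room, while 6-bit would only fit with a smaller KV cache (shorter sequences) or by moving something else off the GPU. Real numbers depend on which tensors are actually quantized.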
The visual encoder can even be offloaded to the CPU if needed. It’s still surprisingly fast, and this would free up more GPU memory for longer sequences. But it can also be run on the same GPU as the LLM part.
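The placement options described here (second GPU, CPU, or same GPU) can be sketched in plain PyTorch. This is only an illustration under stated assumptions: `pick_devices` and `SplitPipeline` are hypothetical names, and the two `nn.Linear` layers are tiny stand-ins for the visual encoder and the quantized LLM, not the real Ovis2.5 modules.

```python
import torch
from torch import nn


def pick_devices() -> tuple[torch.device, torch.device]:
    """Return (llm_device, encoder_device) based on the hardware that is available."""
    n_gpus = torch.cuda.device_count() if torch.cuda.is_available() else 0
    if n_gpus >= 2:
        return torch.device("cuda:0"), torch.device("cuda:1")  # encoder on second GPU
    if n_gpus == 1:
        return torch.device("cuda:0"), torch.device("cpu")     # encoder offloaded to CPU
    return torch.device("cpu"), torch.device("cpu")            # no GPU: everything on CPU


class SplitPipeline(nn.Module):
    """Run the encoder and the LLM on (possibly) different devices."""

    def __init__(self, encoder, llm, llm_device, encoder_device):
        super().__init__()
        self.encoder = encoder.to(encoder_device)
        self.llm = llm.to(llm_device)
        self.llm_device = llm_device
        self.encoder_device = encoder_device

    @torch.no_grad()
    def forward(self, pixels):
        # Encode on the encoder's device, then hand the features to the LLM's device.
        features = self.encoder(pixels.to(self.encoder_device))
        return self.llm(features.to(self.llm_device))


llm_device, encoder_device = pick_devices()
# Stand-in modules (assumption): 32-dim "pixels" -> 16-dim features -> 8-dim output.
pipeline = SplitPipeline(nn.Linear(32, 16), nn.Linear(16, 8), llm_device, encoder_device)
out = pipeline(torch.randn(4, 32))
print(out.shape, out.device)
```

Only the encoder output crosses the device boundary, so the transfer cost stays small compared with keeping both sets of weights on one card.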
If you’re interested in the code, setup details, or want to collaborate on optimizing it further, I’d love to hear from you!
There is one big issue in my code, and it's system RAM usage: it spikes a LOT, so less than 64GB might not be enough at the moment. I'm working on a fix, though (;
EDIT: Found a massive bug that makes inference speed crazy slow /:
Okay, I fixed the main bug, and now everything works. It still uses a lot of RAM, but inference works on my 12GB VRAM card (;
If you can't wait, you can use this one. It's a bit messy, but it should be possible to get it running (;
https://huggingface.co/wsbagnsv1/Ovis2.5-9B-sinq-4bit-experimental