Instructions to use NexaAI/OmniVLM-968M with libraries, inference providers, notebooks, and local apps. Follow these links to get started.

Libraries

How to use NexaAI/OmniVLM-968M with llama-cpp-python:

# !pip install llama-cpp-python

from llama_cpp import Llama

llm = Llama.from_pretrained(
	repo_id="NexaAI/OmniVLM-968M",
	filename="Nano-Vlm-Processor-494M-F16.gguf",
)

llm.create_chat_completion(
	messages = "No input example has been defined for this model task."
)

Notebooks
Google Colab
Kaggle
Local Apps

llama.cpp

How to use NexaAI/OmniVLM-968M with llama.cpp:

Install from brew

brew install llama.cpp
# Start a local OpenAI-compatible server with a web UI:
llama-server -hf NexaAI/OmniVLM-968M:F16
# Run inference directly in the terminal:
llama-cli -hf NexaAI/OmniVLM-968M:F16

Install from WinGet (Windows)

winget install llama.cpp
# Start a local OpenAI-compatible server with a web UI:
llama-server -hf NexaAI/OmniVLM-968M:F16
# Run inference directly in the terminal:
llama-cli -hf NexaAI/OmniVLM-968M:F16

Use pre-built binary

# Download pre-built binary from:
# https://github.com/ggerganov/llama.cpp/releases
# Start a local OpenAI-compatible server with a web UI:
./llama-server -hf NexaAI/OmniVLM-968M:F16
# Run inference directly in the terminal:
./llama-cli -hf NexaAI/OmniVLM-968M:F16

Build from source code

git clone https://github.com/ggerganov/llama.cpp.git
cd llama.cpp
cmake -B build
cmake --build build -j --target llama-server llama-cli
# Start a local OpenAI-compatible server with a web UI:
./build/bin/llama-server -hf NexaAI/OmniVLM-968M:F16
# Run inference directly in the terminal:
./build/bin/llama-cli -hf NexaAI/OmniVLM-968M:F16

Use Docker

docker model run hf.co/NexaAI/OmniVLM-968M:F16

LM Studio
Jan
Ollama
How to use NexaAI/OmniVLM-968M with Ollama:
```
ollama run hf.co/NexaAI/OmniVLM-968M:F16
```

Unsloth Studio new

How to use NexaAI/OmniVLM-968M with Unsloth Studio:

Install Unsloth Studio (macOS, Linux, WSL)

curl -fsSL https://unsloth.ai/install.sh | sh
# Run unsloth studio
unsloth studio -H 0.0.0.0 -p 8888
# Then open http://localhost:8888 in your browser
# Search for NexaAI/OmniVLM-968M to start chatting

Install Unsloth Studio (Windows)

irm https://unsloth.ai/install.ps1 | iex
# Run unsloth studio
unsloth studio -H 0.0.0.0 -p 8888
# Then open http://localhost:8888 in your browser
# Search for NexaAI/OmniVLM-968M to start chatting

Using HuggingFace Spaces for Unsloth

# No setup required
# Open https://huggingface.co/spaces/unsloth/studio in your browser
# Search for NexaAI/OmniVLM-968M to start chatting

Pi new

How to use NexaAI/OmniVLM-968M with Pi:

Start the llama.cpp server

# Install llama.cpp:
brew install llama.cpp
# Start a local OpenAI-compatible server:
llama-server -hf NexaAI/OmniVLM-968M:F16

Configure the model in Pi

# Install Pi:
npm install -g @mariozechner/pi-coding-agent
# Add to ~/.pi/agent/models.json:
{
  "providers": {
    "llama-cpp": {
      "baseUrl": "http://localhost:8080/v1",
      "api": "openai-completions",
      "apiKey": "none",
      "models": [
        {
          "id": "NexaAI/OmniVLM-968M:F16"
        }
      ]
    }
  }
}

Run Pi

# Start Pi in your project directory:
pi

Hermes Agent new

How to use NexaAI/OmniVLM-968M with Hermes Agent:

Start the llama.cpp server

# Install llama.cpp:
brew install llama.cpp
# Start a local OpenAI-compatible server:
llama-server -hf NexaAI/OmniVLM-968M:F16

Configure Hermes

# Install Hermes:
curl -fsSL https://hermes-agent.nousresearch.com/install.sh | bash
hermes setup
# Point Hermes at the local server:
hermes config set model.provider custom
hermes config set model.base_url http://127.0.0.1:8080/v1
hermes config set model.default NexaAI/OmniVLM-968M:F16

Run Hermes

hermes

Docker Model Runner
How to use NexaAI/OmniVLM-968M with Docker Model Runner:
```
docker model run hf.co/NexaAI/OmniVLM-968M:F16
```

Lemonade

How to use NexaAI/OmniVLM-968M with Lemonade:

Pull the model

# Download Lemonade from https://lemonade-server.ai/
lemonade pull NexaAI/OmniVLM-968M:F16

Run and chat with the model

lemonade run user.OmniVLM-968M-F16

List all available models

lemonade list

OmniVLM

🔥 Latest Update

[Dec 16, 2024] Our work "OmniVLM: A Token-Compressed, Sub-Billion-Parameter Vision-Language Model for Efficient On-Device Inference" is now live on Arxiv! 🚀
[Nov 27, 2024] Model Improvements: OmniVLM v3 model's GGUF file has been updated in this Hugging Face Repo! ✨ 👉 Test these exciting changes in our Hugging Face Space
[Nov 22, 2024] Model Improvements: OmniVLM v2 model's GGUF file has been updated in this Hugging Face Repo! ✨ Key Improvements Include:
- Enhanced Art Descriptions
- Better Complex Image Understanding
- Improved Anime Recognition
- More Accurate Color and Detail Detection
- Expanded World Knowledge

We are continuously improving OmniVLM-968M based on your valuable feedback! More exciting updates coming soon - Stay tuned! ⭐

Introduction

OmniVLM is a compact, sub-billion (968M) multimodal model for processing both visual and text inputs, optimized for edge devices. Improved on LLaVA's architecture, it features:

9x Token Reduction: Reduces image tokens from 729 to 81, cutting latency and computational cost aggressively. Note that the computation of vision encoder and the projection part keep the same, but the computation of language model backbone is reduced due to 9X shorter image token span.
Trustworthy Result: Reduces hallucinations using DPO training from trustworthy data.

Quick Links:

Interactive Demo in our Hugging Face Space. (Updated 2024 Nov 21)
Quickstart for local setup
Learn more in our Blogs

Feedback: Send questions or comments about the model in our Discord

Intended Use Cases

OmniVLM is intended for Visual Question Answering (answering questions about images) and Image Captioning (describing scenes in photos), making it ideal for on-device applications.

Example Demo: Generating captions for a 1046×1568 image on M4 Pro Macbook takes < 2s processing time and requires only 988 MB RAM and 948 MB Storage.

Benchmarks

Below we demonstrate a figure to show how OmniVLM performs against nanollava. In all the tasks, OmniVLM outperforms the previous world's smallest vision-language model.

We have conducted a series of experiments on benchmark datasets, including MM-VET, ChartQA, MMMU, ScienceQA, POPE to evaluate the performance of OmniVLM.

Benchmark	Nexa AI OmniVLM v2	Nexa AI OmniVLM v1	nanoLLAVA
ScienceQA (Eval)	71.0	62.2	59.0
ScienceQA (Test)	71.0	64.5	59.0
POPE	93.3	89.4	84.1
MM-VET	30.9	27.5	23.9
ChartQA (Test)	61.9	59.2	NA
MMMU (Test)	42.1	41.8	28.6
MMMU (Eval)	40.0	39.9	30.4

How to Use On Device

In the following, we demonstrate how to run OmniVLM locally on your device.

Step 1: Install Nexa-SDK (local on-device inference framework)

Install Nexa-SDK

Nexa-SDK is a open-sourced, local on-device inference framework, supporting text generation, image generation, vision-language models (VLM), audio-language models, speech-to-text (ASR), and text-to-speech (TTS) capabilities. Installable via Python Package or Executable Installer.

Step 2: Then run the following code in your terminal

nexa run omniVLM

Model Architecture

OmniVLM's architecture consists of three key components:

Base Language Model: Qwen2.5-0.5B-Instruct functions as the base model to process text inputs
Vision Encoder: SigLIP-400M operates at 384 resolution with 14×14 patch size to generate image embeddings
Projection Layer: Multi-Layer Perceptron (MLP) aligns the vision encoder's embeddings with the language model's token space. Compared to vanilla Llava architecture, we designed a projector that reduce 9X image tokens.

The vision encoder first transforms input images into embeddings, which are then processed by the projection layer to match the token space of Qwen2.5-0.5B-Instruct, enabling end-to-end visual-language understanding.

Training

We developed OmniVLM through a three-stage training pipeline:

Pretraining: The initial stage focuses on establishing basic visual-linguistic alignments using image-caption pairs, during which only the projection layer parameters are unfrozen to learn these fundamental relationships.

Supervised Fine-tuning (SFT): We enhance the model's contextual understanding using image-based question-answering datasets. This stage involves training on structured chat histories that incorporate images for the model to generate more contextually appropriate responses.

Direct Preference Optimization (DPO): The final stage implements DPO by first generating responses to images using the base model. A teacher model then produces minimally edited corrections while maintaining high semantic similarity with the original responses, focusing specifically on accuracy-critical elements. These original and corrected outputs form chosen-rejected pairs. The fine-tuning targeted at essential model output improvements without altering the model's core response characteristics

What's next for OmniVLM?

OmniVLM is in early development and we are working to address current limitations:

Expand DPO Training: Increase the scope of DPO (Direct Preference Optimization) training in an iterative process to continually improve model performance and response quality.
Improve document and text understanding

In the long term, we aim to develop OmniVLM as a fully optimized, production-ready solution for edge AI multimodal applications.

Blogs | Discord | X(Twitter)

Downloads last month: 717

GGUF

Model size

0.5B params

Architecture

qwen2

Hardware compatibility

8-bit

16-bit

Inference Providers NEW

This model isn't deployed by any Inference Provider. 🙋 Ask for provider support

Model tree for NexaAI/OmniVLM-968M

Finetunes

1 model

Spaces using NexaAI/OmniVLM-968M 6

Collection including NexaAI/OmniVLM-968M

Nexa Models

Collection

Tiny, multimodal on-device models developed by Nexa AI. • 6 items • Updated Nov 25, 2025 • 8

Paper for NexaAI/OmniVLM-968M

OmniVLM: A Token-Compressed, Sub-Billion-Parameter Vision-Language Model for Efficient On-Device Inference

Paper • 2412.11475 • Published Dec 16, 2024 • 1

NexaAI
/

OmniVLM-968M