Model Card for Villanova-2B-VL-2603
Villanova-2B-VL-2603 is a fully open, multilingual Vision-Language Model developed by Villanova.AI. Part of the Villanova project, it extends our text-only Villanova-2B-2603 to visual understanding while preserving native support for five European languages. All model weights, training data sources, and training details are publicly released.
Built on a LLaVA-style architecture pairing a SigLIP vision encoder with the Villanova-2B-Base-2603 language backbone, this ~2.8B-parameter model delivers strong multimodal understanding, visual question answering, and multilingual image captioning under a fully open Apache 2.0 license.
Model Family
Villanova-2B-Base-2603 — Base model (4.4T tokens)
↳ Villanova-2B-2603 — SFT / Instruct
↳ Villanova-2B-2603-GGUF — Quantized
↳ Villanova-2B-VL-2603 — Vision-Language Instruct — 📍 This model
↳ Villanova-2B-VL-2603-GGUF — Quantized
Villanova-2B-Base-2512-Preview — Base model (2.2T tokens) (previous version, not recommended)
↳ Villanova-2B-2512-Preview — SFT / Instruct (previous version, not recommended)
Highlights
- European-focused, fully open VLM released under Apache 2.0
- Native multilingual support for 5 European languages: English, French, German, Italian, and Spanish, including multilingual image captioning (XM3600) and visual instruction following
- Broad visual understanding across general VQA (RealWorldQA, CVQA, MME) and multilingual benchmarks (Multi-MMBench, Multi-AI2D)
- Preserves text-only capabilities of the Villanova-2B-2603 language backbone through text-only data mixing in Stage 2
- Only ~2.8B parameters, efficient enough for single-GPU inference
Model Summary
| Attribute | Value |
|---|---|
| Architecture | LLaVA (LlavaForConditionalGeneration) |
| Vision Encoder | SigLIP-SO400M/14 (frozen in Stage 2) |
| Language Model | Villanova-2B-Base-2603 |
| Total Parameters | ~2.79B |
| Stage 1 | Projector-only alignment on multilingual image-caption pairs |
| Stage 2 | LLM unfrozen, vision tower frozen, visual instruction tuning on a fullmix recipe (~1.08M samples) |
| Languages | English, French, German, Italian, Spanish |
| Max Sequence Length | 32,768 tokens |
| Precision | bfloat16 |
| License | Apache 2.0 |
Training Recipe (Stage 1: Projector Alignment)
Stage 1 aligns the vision encoder output to the language model embedding space by training only the multimodal projector, with both the vision tower and the LLM fully frozen. This is a lightweight warmup that teaches the projector how to map SigLIP visual features into the Villanova-2B token space before any instruction tuning.
Data: Multi-Pixmo-Cap, multilingual image–caption pairs in EN/DE/ES/FR/IT (brief-captions split).
| Hyperparameter | Value |
|---|---|
| Trainable parameters | Multimodal projector only |
| Vision tower | frozen |
| LLM | frozen |
| Learning rate | 1e-3 |
| Batch size (per GPU) | 2 |
| Gradient accumulation | 16 |
| GPUs | 8× H100 80GB |
| Effective batch size | 256 |
| Epochs | 4 |
| Max seq length | 32,768 |
| Precision | bf16-mixed |
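The Stage 1 freezing scheme above can be sketched in PyTorch. The toy modules below are illustrative stand-ins, not the released training code:

```python
import torch
import torch.nn as nn

class ToyVLM(nn.Module):
    """Stand-in for the LLaVA-style stack: vision tower -> projector -> LLM."""
    def __init__(self):
        super().__init__()
        self.vision_tower = nn.Linear(64, 64)    # stands in for SigLIP
        self.projector = nn.Linear(64, 32)       # multimodal projector
        self.language_model = nn.Linear(32, 32)  # stands in for the LLM

model = ToyVLM()

# Stage 1: freeze everything, then re-enable gradients for the projector only.
for p in model.parameters():
    p.requires_grad = False
for p in model.projector.parameters():
    p.requires_grad = True

trainable = [n for n, p in model.named_parameters() if p.requires_grad]
print(trainable)  # only the projector's weight and bias remain trainable
```

Only the projector's parameters are then handed to the optimizer, which is what makes this warmup stage cheap relative to Stage 2.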
Training Data
Both stages use only permissively licensed data (no GPT/Claude-generated content). The curated multilingual derivatives (the Multi-* datasets, translated and post-processed in EN/DE/ES/FR/IT) are released by Villanova.AI on the HuggingFace Hub.
Stage 1: Projector Alignment (~600K samples)
| Dataset | Role | Modality | Samples |
|---|---|---|---|
| Multi-Pixmo-Cap | Brief image captioning | Image + text (5 langs) | ~600K |
Stage 2: Visual Instruction Tuning (~1.08M samples)
| Dataset | Role | Modality | Samples |
|---|---|---|---|
| FineVision (AOKVQA) | General VQA | Image + text | 16K |
| FineVision (DocVQA) | Document understanding | Image + text | 37K |
| FineVision (TextVQA) | Scene-text VQA | Image + text | 33K |
| FineVision (VizWiz) | Accessibility VQA | Image + text | 6K |
| FineVision (VQAv2) | General VQA | Image + text | 422K |
| AI2D | Diagram QA | Image + text | 7K |
| TextCaps | Image captioning with text | Image + text | 22K |
| XM3600 | Multilingual image captioning | Image + text (5 langs) | 41K |
| Multi-Pixmo-Ask | Multilingual visual instruction | Image + text (5 langs) | 112K |
| Multi-Persona-IF | Multilingual instruction following with persona | Image + text (5 langs) | 75K |
| Multi-Dolly-15k | Text-only general instruction | Text only (5 langs) | 14K |
| Multi-FLAN-CoT | Text-only chain-of-thought reasoning | Text only (5 langs) | 38K |
| Multi-FLAN-NIV2 | Text-only NLP task instruction | Text only (5 langs) | 38K |
| Multi-FLAN-P3 | Text-only NLP task instruction (P3) | Text only (5 langs) | 6K |
| Multi-SciRIFF | Text-only scientific reasoning | Text only (5 langs) | 67K |
| Multi-SmolTalk-Rewrite | Text-only rewriting tasks | Text only (5 langs) | 51K |
| Multi-SmolTalk-Summarize | Text-only summarization | Text only (5 langs) | 91K |
| Villanova-Hard-Coded | Identity / persona priors | Text only | 167 |
The text-only mixing in Stage 2 prevents catastrophic forgetting of the language model's pre-existing capabilities.
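The composition of that mixture is easy to check from the sample counts in the table (values below in thousands, as reported; the 167 hard-coded identity samples are negligible and omitted):

```python
# Stage 2 sample counts from the table above, in thousands.
image_text = {
    "AOKVQA": 16, "DocVQA": 37, "TextVQA": 33, "VizWiz": 6, "VQAv2": 422,
    "AI2D": 7, "TextCaps": 22, "XM3600": 41,
    "Multi-Pixmo-Ask": 112, "Multi-Persona-IF": 75,
}
text_only = {
    "Multi-Dolly-15k": 14, "Multi-FLAN-CoT": 38, "Multi-FLAN-NIV2": 38,
    "Multi-FLAN-P3": 6, "Multi-SciRIFF": 67,
    "Multi-SmolTalk-Rewrite": 51, "Multi-SmolTalk-Summarize": 91,
}

total = sum(image_text.values()) + sum(text_only.values())
share = sum(text_only.values()) / total
print(f"~{total}K samples, {share:.0%} text-only")  # ~1076K samples, 28% text-only
```

Roughly a quarter of the Stage 2 mixture is text-only; the exact interleaving scheme is not specified here, only the composition.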
Training Recipe (Stage 2: Visual Instruction Tuning)
| Hyperparameter | Value |
|---|---|
| Backbone | Villanova-2B-Base-2603 |
| Optimizer | AdamW, weight decay 0.01 |
| Learning rate | 2e-5 |
| Scheduler | Cosine with warmup |
| Warmup steps | 200 |
| Epochs | 4 |
| Batch size (per GPU) | 1 |
| Gradient accumulation | 16 |
| GPUs | 8× H100 80GB |
| Effective batch size | 128 |
| Precision | bf16-mixed |
| Max seq length | 32,768 |
| Vision tower | frozen |
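A minimal sketch of the Stage 2 optimizer and schedule from the table (linear warmup, then cosine decay). The total step count is illustrative and this is not the released training loop:

```python
import math
import torch

# Toy parameter so the optimizer has something to hold.
params = [torch.nn.Parameter(torch.zeros(4))]
optimizer = torch.optim.AdamW(params, lr=2e-5, weight_decay=0.01)

warmup_steps = 200
total_steps = 10_000  # illustrative; depends on dataset size and epoch count

def lr_lambda(step: int) -> float:
    # Linear warmup to the peak LR, then cosine decay toward zero.
    if step < warmup_steps:
        return step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return 0.5 * (1.0 + math.cos(math.pi * progress))

scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)
```

Note that the effective batch size in the table follows from 1 (per GPU) × 16 (accumulation) × 8 (GPUs) = 128.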
How to Use
```python
import torch
from PIL import Image
from transformers import AutoProcessor, LlavaForConditionalGeneration

model_name = "VillanovaAI/Villanova-2B-VL-2603"
device = "cuda"

processor = AutoProcessor.from_pretrained(model_name)
model = LlavaForConditionalGeneration.from_pretrained(
    model_name,
    dtype=torch.bfloat16,
).to(device)
model.eval()

image = Image.open("example.jpg").convert("RGB")

# The `<image>` placeholder inside the content string marks where the
# image tokens will be inserted by the processor.
messages = [
    {"role": "user", "content": "<image>\nDescribe this image in detail."},
]
prompt = processor.apply_chat_template(messages, add_generation_prompt=True)

inputs = processor(images=image, text=prompt, return_tensors="pt").to(device, torch.bfloat16)

with torch.no_grad():
    output_ids = model.generate(**inputs, max_new_tokens=256, do_sample=False)

response = processor.decode(output_ids[0][inputs.input_ids.shape[1]:], skip_special_tokens=True)
print(response)
```
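As a rough sizing check for the single-GPU claim, the bf16 weights alone occupy about 5.2 GiB; activations and the KV cache come on top of this:

```python
# bfloat16 stores 2 bytes per parameter.
params = 2.79e9
weight_gib = params * 2 / 1024**3
print(f"~{weight_gib:.1f} GiB of weights in bf16")  # ~5.2 GiB
```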
Evaluation
Villanova-2B-VL-2603 was evaluated using VLMEvalKit on a suite of standard and multilingual VLM benchmarks covering multiple-choice reasoning, general visual question answering, hallucination robustness, and cross-lingual visual understanding. All evaluations use exact_matching judging (no LLM-as-judge) for full reproducibility.
We compare against Salamandra-VL-7B, a strong fully open European VLM covering the same set of languages.
Despite using less than a third of the parameters (~2.8B vs ~8.9B), Villanova-2B-VL-2603 matches Salamandra-VL-7B overall, with particular strengths on general VQA and multilingual benchmarks. Multilingual benchmarks are reported as the average across EN/DE/ES/FR/IT.
| Category | Benchmark | Salamandra-VL-7B | Villanova-2B-VL-2603 |
|---|---|---|---|
| MCQ / Reasoning | MMBench | 51.9 | 52.6 |
| MCQ / Reasoning | MMStar | 38.9 | 32.7 |
| MCQ / Reasoning | AI2D | 58.5 | 51.1 |
| MCQ / Reasoning | ScienceQA | 65.0 | 62.4 |
| General VQA | RealWorldQA | 45.5 | 46.8 |
| General VQA | CVQA | 32.9 | 37.6 |
| General VQA | MME | 1369 | 1565 |
| Hallucination | POPE | 86.8 | 81.1 |
| OCR / Document | OCRBench | 558 | 377 |
| Multilingual (avg) | Multi-MMBench | 59.7 | 61.0 |
| Multilingual (avg) | Multi-AI2D | 58.4 | 66.4 |
| Multilingual (avg) | Multi-MMStar | 50.2 | 47.6 |
| Overall | Average (0-100 benchmarks) | 54.8 | 53.9 |
The Overall row is the unweighted average across the 10 benchmarks on the 0-100 scale. MME and OCRBench are excluded because they use different scoring scales (0-2800 and 0-1000 respectively).
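The Overall row can be reproduced directly from the ten 0-100 rows above:

```python
# Scores copied from the ten 0-100 benchmark rows above, in order:
# MMBench, MMStar, AI2D, ScienceQA, RealWorldQA, CVQA, POPE,
# Multi-MMBench, Multi-AI2D, Multi-MMStar.
salamandra = [51.9, 38.9, 58.5, 65.0, 45.5, 32.9, 86.8, 59.7, 58.4, 50.2]
villanova  = [52.6, 32.7, 51.1, 62.4, 46.8, 37.6, 81.1, 61.0, 66.4, 47.6]

print(round(sum(salamandra) / 10, 1))  # 54.8
print(round(sum(villanova) / 10, 1))   # 53.9
```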
Multilingual Evaluation (Per-Language Detail)
The multilingual benchmarks (Multi-MMBench, Multi-AI2D, Multi-MMStar) are extensions of the standard benchmarks with parallel test sets in 5 European languages. Below is the per-language breakdown.
| Benchmark | Model | DE | EN | ES | FR | IT | Avg |
|---|---|---|---|---|---|---|---|
| Multi-MMBench | Salamandra-VL-7B | 59.3 | 64.8 | 62.7 | 57.0 | 54.7 | 59.7 |
| Multi-MMBench | Villanova-2B-VL-2603 | 60.8 | 62.8 | 58.9 | 60.6 | 61.6 | 61.0 |
| Multi-AI2D | Salamandra-VL-7B | 57.0 | 67.1 | 62.2 | 53.6 | 52.2 | 58.4 |
| Multi-AI2D | Villanova-2B-VL-2603 | 66.6 | 68.1 | 65.1 | 66.9 | 65.4 | 66.4 |
| Multi-MMStar | Salamandra-VL-7B | 46.6 | 56.5 | 52.3 | 48.2 | 47.3 | 50.2 |
| Multi-MMStar | Villanova-2B-VL-2603 | 45.9 | 50.1 | 47.0 | 49.2 | 45.7 | 47.6 |
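The Avg column can be cross-checked from the per-language scores; discrepancies up to 0.1 come from the one-decimal rounding of the reported per-language values:

```python
# Villanova-2B-VL-2603 per-language scores from the table above
# (DE, EN, ES, FR, IT), paired with the reported averages.
reported = {
    "Multi-MMBench": ([60.8, 62.8, 58.9, 60.6, 61.6], 61.0),
    "Multi-AI2D":    ([66.6, 68.1, 65.1, 66.9, 65.4], 66.4),
    "Multi-MMStar":  ([45.9, 50.1, 47.0, 49.2, 45.7], 47.6),
}
for name, (langs, avg) in reported.items():
    computed = sum(langs) / 5
    assert abs(computed - avg) <= 0.1, name
    print(f"{name}: computed {computed:.1f} (reported {avg})")
```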
Key takeaways:
- Competitive overall average (53.9 vs 54.8) against a model with ~3.2x more parameters
- Wins on general VQA: outperforms Salamandra-VL-7B on RealWorldQA, CVQA, and MME
- Solid multilingual capability across EN/DE/ES/FR/IT, with a particularly strong Multi-AI2D improvement (+8.0 avg, wins on all 5 languages) over Salamandra-VL-7B
- Balanced per-language performance: on Multi-AI2D and Multi-MMBench, Villanova performs uniformly across DE/EN/ES/FR/IT (no language collapse)
Intended Use
- Multilingual image captioning and description
- Visual question answering (single-image)
- Document and chart understanding (OCR-light tasks)
- Multimodal instruction following in EN/DE/ES/FR/IT
- Research on fully-open European VLMs
Limitations
- Single-image inference only (no multi-image or video support)
- OCR quality on dense, small-text documents is limited compared to specialized OCR-heavy VLMs
- As with all VLMs, outputs can contain hallucinations; users should verify factual claims
License
This model is released under the Apache 2.0 License. The training data used for Stage 2 was selected to allow permissive commercial use (no GPT/Claude-generated content).