Model Card for Villanova-2B-VL-2603


Villanova-2B-VL-2603 is a fully open, multilingual Vision-Language Model developed by Villanova.AI. Part of the Villanova project, it extends our text-only Villanova-2B-2603 to visual understanding while preserving native support for five European languages. All model weights, training data sources, and training details are publicly released.

Built on a LLaVA-style architecture pairing a SigLIP vision encoder with the Villanova-2B-Base-2603 language backbone, this ~2.8B-parameter model delivers strong multimodal understanding, visual question answering, and multilingual image captioning under a fully open Apache 2.0 license.


Model Family

Villanova-2B-Base-2603 — Base model (4.4T tokens)
 ↳ Villanova-2B-2603 — SFT / Instruct
  ↳ Villanova-2B-2603-GGUF — Quantized
 ↳ Villanova-2B-VL-2603 — Vision-Language Instruct — 📍 This model
  ↳ Villanova-2B-VL-2603-GGUF — Quantized

Villanova-2B-Base-2512-Preview — Base model (2.2T tokens) (previous version, not recommended)
 ↳ Villanova-2B-2512-Preview — SFT / Instruct (previous version, not recommended)


Highlights

  • European-focused, fully open VLM released under Apache 2.0
  • Native multilingual support for 5 European languages (English, French, German, Italian, and Spanish), including multilingual image captioning (XM3600) and visual instruction following
  • Broad visual understanding across general VQA (RealWorldQA, CVQA, MME) and multilingual benchmarks (Multi-MMBench, Multi-AI2D)
  • Preserves text-only capabilities of the Villanova-2B-2603 language backbone through text-only data mixing in Stage 2
  • Only ~2.8B parameters, efficient enough for single-GPU inference
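The single-GPU claim can be sanity-checked with a back-of-the-envelope memory estimate. This is only a rough sketch: it counts weights alone, while activations and the KV cache add further overhead on top.

```python
# Rough bf16 inference memory estimate (weights only; activations and
# KV cache add overhead on top of this figure).
params = 2.79e9          # total parameters, from the model summary
bytes_per_param = 2      # bfloat16 stores each parameter in 2 bytes
weight_gb = params * bytes_per_param / 1e9
print(f"~{weight_gb:.1f} GB of weights in bf16")
```

At roughly 5.6 GB of weights, the model fits comfortably on a single consumer GPU with 8 GB or more of VRAM.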

Model Summary

| Attribute | Value |
|---|---|
| Architecture | LLaVA (`LlavaForConditionalGeneration`) |
| Vision Encoder | SigLIP-SO400M/14 (frozen in Stage 2) |
| Language Model | Villanova-2B-Base-2603 |
| Total Parameters | ~2.79B |
| Stage 1 | Projector-only alignment on multilingual image-caption pairs |
| Stage 2 | LLM unfrozen, vision tower frozen; visual instruction tuning on a fullmix recipe (~1.08M samples) |
| Languages | English, French, German, Italian, Spanish |
| Max Sequence Length | 32,768 tokens |
| Precision | bfloat16 |
| License | Apache 2.0 |

Training Recipe (Stage 1: Projector Alignment)

Stage 1 aligns the vision encoder output to the language model embedding space by training only the multimodal projector, with both the vision tower and the LLM fully frozen. This is a lightweight warmup that teaches the projector how to map SigLIP visual features into the Villanova-2B token space before any instruction tuning.

Data: Multi-Pixmo-Cap, multilingual image-caption pairs in EN/DE/ES/FR/IT (brief captions split).
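The projector-only setup can be sketched as follows. The three modules below are toy stand-ins for SigLIP, the multimodal projector, and the Villanova-2B backbone; the attribute names and sizes are illustrative, not the model's actual layout.

```python
import torch.nn as nn

# Toy stand-ins for the three components of a LLaVA-style model
# (illustrative shapes; the real parts are SigLIP, an MLP projector,
# and the Villanova-2B language backbone).
vision_tower = nn.Linear(32, 16)
projector = nn.Sequential(nn.Linear(16, 8), nn.GELU(), nn.Linear(8, 8))
language_model = nn.Linear(8, 8)

# Stage 1: freeze everything except the multimodal projector, so only
# the vision-to-token mapping is learned during alignment.
for module in (vision_tower, language_model):
    for p in module.parameters():
        p.requires_grad = False

trainable = sum(p.numel() for p in projector.parameters() if p.requires_grad)
print(f"{trainable} trainable projector parameters")
```

Only the projector's parameters receive gradient updates; the optimizer is then constructed over `projector.parameters()` alone.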

| Hyperparameter | Value |
|---|---|
| Trainable parameters | Multimodal projector only |
| Vision tower | Frozen |
| LLM | Frozen |
| Learning rate | 1e-3 |
| Batch size (per GPU) | 2 |
| Gradient accumulation | 16 |
| GPUs | 8× H100 80GB |
| Effective batch size | 256 |
| Epochs | 4 |
| Max seq length | 32,768 |
| Precision | bf16-mixed |
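The effective batch sizes reported for both stages follow directly from per-GPU batch size, gradient accumulation, and GPU count:

```python
# Effective batch = per-GPU batch size * gradient-accumulation steps * GPUs.
num_gpus = 8

effective_stage1 = 2 * 16 * num_gpus   # Stage 1: per-GPU batch 2, accumulation 16
effective_stage2 = 1 * 16 * num_gpus   # Stage 2: per-GPU batch 1, accumulation 16

print(effective_stage1, effective_stage2)  # → 256 128
```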

Training Data

Both stages use only permissively licensed data (no GPT- or Claude-generated content). The curated multilingual derivatives (the Multi-* datasets, translated and post-processed into EN/DE/ES/FR/IT) are released by Villanova.AI on the Hugging Face Hub.

Stage 1: Projector Alignment (~600K samples)

| Dataset | Role | Modality | Samples |
|---|---|---|---|
| Multi-Pixmo-Cap | Brief image captioning | Image + text (5 langs) | ~600K |

Stage 2: Visual Instruction Tuning (~1.08M samples)

| Dataset | Role | Modality | Samples |
|---|---|---|---|
| FineVision (AOKVQA) | General VQA | Image + text | 16K |
| FineVision (DocVQA) | Document understanding | Image + text | 37K |
| FineVision (TextVQA) | Scene-text VQA | Image + text | 33K |
| FineVision (VizWiz) | Accessibility VQA | Image + text | 6K |
| FineVision (VQAv2) | General VQA | Image + text | 422K |
| AI2D | Diagram QA | Image + text | 7K |
| TextCaps | Image captioning with text | Image + text | 22K |
| XM3600 | Multilingual image captioning | Image + text (5 langs) | 41K |
| Multi-Pixmo-Ask | Multilingual visual instruction | Image + text (5 langs) | 112K |
| Multi-Persona-IF | Multilingual instruction following with persona | Image + text (5 langs) | 75K |
| Multi-Dolly-15k | Text-only general instruction | Text only (5 langs) | 14K |
| Multi-FLAN-CoT | Text-only chain-of-thought reasoning | Text only (5 langs) | 38K |
| Multi-FLAN-NIV2 | Text-only NLP task instruction | Text only (5 langs) | 38K |
| Multi-FLAN-P3 | Text-only NLP task instruction (P3) | Text only (5 langs) | 6K |
| Multi-SciRIFF | Text-only scientific reasoning | Text only (5 langs) | 67K |
| Multi-SmolTalk-Rewrite | Text-only rewriting tasks | Text only (5 langs) | 51K |
| Multi-SmolTalk-Summarize | Text-only summarization | Text only (5 langs) | 91K |
| Villanova-Hard-Coded | Identity / persona priors | Text only | 167 |

The text-only mixing in Stage 2 prevents catastrophic forgetting of the language model's pre-existing capabilities.

Training Recipe (Stage 2: Visual Instruction Tuning)

| Hyperparameter | Value |
|---|---|
| Backbone | Villanova-2B-Base-2603 |
| Optimizer | AdamW, weight decay 0.01 |
| Learning rate | 2e-5 |
| Scheduler | Cosine with warmup |
| Warmup steps | 200 |
| Epochs | 4 |
| Batch size (per GPU) | 1 |
| Gradient accumulation | 16 |
| GPUs | 8× H100 80GB |
| Effective batch size | 128 |
| Precision | bf16-mixed |
| Max seq length | 32,768 |
| Vision tower | Frozen |
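The cosine-with-warmup schedule can be written out in plain Python. This is the generic formulation (linear ramp over the warmup steps, then cosine decay to zero), not the exact trainer code, and `total_steps=10_000` is an assumed value since the card does not state the step count.

```python
import math

def lr_at(step, base_lr=2e-5, warmup_steps=200, total_steps=10_000):
    """Generic cosine schedule with linear warmup (total_steps is assumed)."""
    if step < warmup_steps:
        return base_lr * step / warmup_steps                     # linear warmup
    progress = (step - warmup_steps) / (total_steps - warmup_steps)
    return base_lr * 0.5 * (1.0 + math.cos(math.pi * progress))  # cosine decay

print(lr_at(100))     # halfway through warmup → 1e-05
print(lr_at(200))     # warmup complete → peak 2e-05
print(lr_at(10_000))  # end of training → decayed to ~0
```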

How to Use

```python
import torch
from PIL import Image
from transformers import AutoProcessor, LlavaForConditionalGeneration

model_name = "VillanovaAI/Villanova-2B-VL-2603"
device = "cuda"

processor = AutoProcessor.from_pretrained(model_name)
model = LlavaForConditionalGeneration.from_pretrained(
    model_name,
    dtype=torch.bfloat16,
).to(device)
model.eval()

image = Image.open("example.jpg").convert("RGB")

# The `<image>` placeholder inside the content string marks where the
# image tokens will be inserted by the processor.
messages = [
    {"role": "user", "content": "<image>\nDescribe this image in detail."},
]

prompt = processor.apply_chat_template(messages, add_generation_prompt=True)
inputs = processor(images=image, text=prompt, return_tensors="pt").to(device, torch.bfloat16)

with torch.no_grad():
    output_ids = model.generate(**inputs, max_new_tokens=256, do_sample=False)

# Decode only the newly generated tokens, skipping the prompt portion.
response = processor.decode(output_ids[0][inputs.input_ids.shape[1]:], skip_special_tokens=True)
print(response)
```

Evaluation

Villanova-2B-VL-2603 was evaluated using VLMEvalKit on a suite of standard and multilingual VLM benchmarks covering multiple-choice reasoning, general visual question answering, hallucination robustness, and cross-lingual visual understanding. All evaluations use exact_matching judging (no LLM-as-judge) for full reproducibility.

We compare against Salamandra-VL-7B, a strong fully open European VLM that likewise targets European languages.

Despite using less than a third of the parameters (~2.8B vs ~8.9B), Villanova-2B-VL-2603 matches Salamandra-VL-7B overall, with particular strengths on general VQA and multilingual benchmarks. Multilingual benchmarks are reported as the average across EN/DE/ES/FR/IT.

| Category | Benchmark | Salamandra-VL-7B | Villanova-2B-VL-2603 |
|---|---|---|---|
| MCQ / Reasoning | MMBench | 51.9 | 52.6 |
| MCQ / Reasoning | MMStar | 38.9 | 32.7 |
| MCQ / Reasoning | AI2D | 58.5 | 51.1 |
| MCQ / Reasoning | ScienceQA | 65.0 | 62.4 |
| General VQA | RealWorldQA | 45.5 | 46.8 |
| General VQA | CVQA | 32.9 | 37.6 |
| General VQA | MME | 1369 | 1565 |
| Hallucination | POPE | 86.8 | 81.1 |
| OCR / Document | OCRBench | 558 | 377 |
| Multilingual (avg) | Multi-MMBench | 59.7 | 61.0 |
| Multilingual (avg) | Multi-AI2D | 58.4 | 66.4 |
| Multilingual (avg) | Multi-MMStar | 50.2 | 47.6 |
| Overall | Average (0-100 benchmarks) | 54.8 | 53.9 |

The Overall row is the unweighted average across the 10 benchmarks reported on the 0-100 scale. MME and OCRBench are excluded because they use different scoring scales (0-2800 and 0-1000, respectively).
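The overall averages can be reproduced directly from the table; a quick check over the ten 0-100-scale benchmarks:

```python
# Scores for the ten 0-100-scale benchmarks from the table above,
# in table order (MME and OCRBench excluded due to different scales).
salamandra = [51.9, 38.9, 58.5, 65.0, 45.5, 32.9, 86.8, 59.7, 58.4, 50.2]
villanova  = [52.6, 32.7, 51.1, 62.4, 46.8, 37.6, 81.1, 61.0, 66.4, 47.6]

def avg(scores):
    return round(sum(scores) / len(scores), 1)

print(avg(salamandra), avg(villanova))  # → 54.8 53.9
```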

Multilingual Evaluation (Per-Language Detail)

The multilingual benchmarks (Multi-MMBench, Multi-AI2D, Multi-MMStar) are extensions of the standard benchmarks with parallel test sets in 5 European languages. Below is the per-language breakdown.

| Benchmark | Model | DE | EN | ES | FR | IT | Avg |
|---|---|---|---|---|---|---|---|
| Multi-MMBench | Salamandra-VL-7B | 59.3 | 64.8 | 62.7 | 57.0 | 54.7 | 59.7 |
| Multi-MMBench | Villanova-2B-VL-2603 | 60.8 | 62.8 | 58.9 | 60.6 | 61.6 | 61.0 |
| Multi-AI2D | Salamandra-VL-7B | 57.0 | 67.1 | 62.2 | 53.6 | 52.2 | 58.4 |
| Multi-AI2D | Villanova-2B-VL-2603 | 66.6 | 68.1 | 65.1 | 66.9 | 65.4 | 66.4 |
| Multi-MMStar | Salamandra-VL-7B | 46.6 | 56.5 | 52.3 | 48.2 | 47.3 | 50.2 |
| Multi-MMStar | Villanova-2B-VL-2603 | 45.9 | 50.1 | 47.0 | 49.2 | 45.7 | 47.6 |

Key takeaways:

  • Competitive overall average (53.9 vs 54.8) against a model with ~3.2x more parameters
  • Wins on general VQA: Villanova outperforms Salamandra-VL-7B on RealWorldQA, CVQA, and MME
  • Solid multilingual capability across EN/DE/ES/FR/IT, with a particularly strong Multi-AI2D improvement (+8.0 avg, wins on all 5 languages) over Salamandra-VL-7B
  • Balanced per-language performance: on Multi-AI2D and Multi-MMBench, Villanova performs uniformly across DE/EN/ES/FR/IT (no language collapse)

Intended Use

  • Multilingual image captioning and description
  • Visual question answering (single-image)
  • Document and chart understanding (OCR-light tasks)
  • Multimodal instruction following in EN/DE/ES/FR/IT
  • Research on fully-open European VLMs

Limitations

  • Single-image inference only (no multi-image or video support)
  • OCR quality on dense, small-text documents is limited compared to specialized OCR-heavy VLMs
  • As with all VLMs, outputs can contain hallucinations; users should verify factual claims

License

This model is released under the Apache 2.0 License. The training data used for Stage 2 was selected to allow permissive commercial use (no GPT/Claude-generated content).
