Model Card for Villanova-2B-VL-2603
Villanova-2B-VL-2603 is a fully open, multilingual Vision-Language Model developed by Villanova.AI. Part of the Villanova project, it extends our text-only Villanova-2B-2603 to visual understanding while preserving native support for five European languages. All model weights, training data sources, and training details are publicly released.
Built on a LLaVA-style architecture pairing a SigLIP vision encoder with the Villanova-2B-Base-2603 language backbone, this ~2.8B-parameter model delivers strong multimodal understanding, visual question answering, and multilingual image captioning under a fully open Apache 2.0 license.
Model Family
Villanova-2B-Base-2603 — Base model (4.4T tokens)
↳ Villanova-2B-2603 — SFT / Instruct
↳ Villanova-2B-2603-GGUF — Quantized
↳ Villanova-2B-VL-2603 — Vision-Language Instruct — 📍 This model
↳ Villanova-2B-VL-2603-GGUF — Quantized
Villanova-2B-Base-2512-Preview — Base model (2.2T tokens) (previous version, not recommended)
↳ Villanova-2B-2512-Preview — SFT / Instruct (previous version, not recommended)
Highlights
- European-focused, fully open VLM released under Apache 2.0
- Native multilingual support for 5 European languages: English, French, German, Italian, and Spanish, including multilingual image captioning (XM3600) and visual instruction following
- Broad visual understanding across general VQA (RealWorldQA, CVQA, MME) and multilingual benchmarks (Multi-MMBench, Multi-AI2D)
- Preserves text-only capabilities of the Villanova-2B-2603 language backbone through text-only data mixing in Stage 2
- Only ~2.8B parameters, efficient enough for single-GPU inference
Model Summary
| Attribute | Value |
|---|---|
| Architecture | LLaVA (LlavaForConditionalGeneration) |
| Vision Encoder | SigLIP-SO400M/14 (frozen in Stage 2) |
| Language Model | Villanova-2B-Base-2603 |
| Total Parameters | ~2.79B |
| Stage 1 | Projector-only alignment on multilingual image-caption pairs |
| Stage 2 | LLM unfrozen, vision tower frozen, visual instruction tuning on a fullmix recipe (~1.08M samples) |
| Languages | English, French, German, Italian, Spanish |
| Max Sequence Length | 32,768 tokens |
| Precision | bfloat16 |
| License | Apache 2.0 |
Training Recipe (Stage 1: Projector Alignment)
Stage 1 aligns the vision encoder output to the language model embedding space by training only the multimodal projector, with both the vision tower and the LLM fully frozen. This is a lightweight warmup that teaches the projector how to map SigLIP visual features into the Villanova-2B token space before any instruction tuning.
Data: Multi-Pixmo-Cap, multilingual image–caption pairs in EN/DE/ES/FR/IT (brief-captions split).
| Hyperparameter | Value |
|---|---|
| Trainable parameters | Multimodal projector only |
| Vision tower | frozen |
| LLM | frozen |
| Learning rate | 1e-3 |
| Batch size (per GPU) | 2 |
| Gradient accumulation | 16 |
| GPUs | 8× H100 80GB |
| Effective batch size | 256 |
| Epochs | 4 |
| Max seq length | 32,768 |
| Precision | bf16-mixed |
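The Stage 1 freezing scheme above can be sketched in PyTorch. The toy modules below are illustrative stand-ins, not the released training code:

```python
import torch
import torch.nn as nn

class ToyVLM(nn.Module):
    """Stand-in for the LLaVA-style stack: vision tower -> projector -> LLM."""
    def __init__(self):
        super().__init__()
        self.vision_tower = nn.Linear(64, 64)    # stands in for SigLIP
        self.projector = nn.Linear(64, 32)       # multimodal projector
        self.language_model = nn.Linear(32, 32)  # stands in for the LLM

model = ToyVLM()

# Stage 1: freeze everything, then re-enable gradients for the projector only.
for p in model.parameters():
    p.requires_grad = False
for p in model.projector.parameters():
    p.requires_grad = True

trainable = [n for n, p in model.named_parameters() if p.requires_grad]
print(trainable)  # only the projector's weight and bias remain trainable
```

Only the projector's parameters are then handed to the optimizer, which is what makes this warmup stage cheap relative to Stage 2.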
Training Data
Both stages use only permissively licensed data (no GPT/Claude-generated content). The curated multilingual derivatives (the Multi-* datasets, translated and post-processed in EN/DE/ES/FR/IT) are released by Villanova.AI on the HuggingFace Hub.
Stage 1: Projector Alignment (~600K samples)
| Dataset | Role | Modality | Samples |
|---|---|---|---|
| Multi-Pixmo-Cap | Brief image captioning | Image + text (5 langs) | ~600K |
Stage 2: Visual Instruction Tuning (~1.08M samples)
| Dataset | Role | Modality | Samples |
|---|---|---|---|
| FineVision (AOKVQA) | General VQA | Image + text | 16K |
| FineVision (DocVQA) | Document understanding | Image + text | 37K |
| FineVision (TextVQA) | Scene-text VQA | Image + text | 33K |
| FineVision (VizWiz) | Accessibility VQA | Image + text | 6K |
| FineVision (VQAv2) | General VQA | Image + text | 422K |
| AI2D | Diagram QA | Image + text | 7K |
| TextCaps | Image captioning with text | Image + text | 22K |
| XM3600 | Multilingual image captioning | Image + text (5 langs) | 41K |
| Multi-Pixmo-Ask | Multilingual visual instruction | Image + text (5 langs) | 112K |
| Multi-Persona-IF | Multilingual instruction following with persona | Image + text (5 langs) | 75K |
| Multi-Dolly-15k | Text-only general instruction | Text only (5 langs) | 14K |
| Multi-FLAN-CoT | Text-only chain-of-thought reasoning | Text only (5 langs) | 38K |
| Multi-FLAN-NIV2 | Text-only NLP task instruction | Text only (5 langs) | 38K |
| Multi-FLAN-P3 | Text-only NLP task instruction (P3) | Text only (5 langs) | 6K |
| Multi-SciRIFF | Text-only scientific reasoning | Text only (5 langs) | 67K |
| Multi-SmolTalk-Rewrite | Text-only rewriting tasks | Text only (5 langs) | 51K |
| Multi-SmolTalk-Summarize | Text-only summarization | Text only (5 langs) | 91K |
| Villanova-Hard-Coded | Identity / persona priors | Text only | 167 |
The text-only mixing in Stage 2 prevents catastrophic forgetting of the language model's pre-existing capabilities.
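The composition of that mixture is easy to check from the sample counts in the table (values below in thousands, as reported; the 167 hard-coded identity samples are negligible and omitted):

```python
# Stage 2 sample counts from the table above, in thousands.
image_text = {
    "AOKVQA": 16, "DocVQA": 37, "TextVQA": 33, "VizWiz": 6, "VQAv2": 422,
    "AI2D": 7, "TextCaps": 22, "XM3600": 41,
    "Multi-Pixmo-Ask": 112, "Multi-Persona-IF": 75,
}
text_only = {
    "Multi-Dolly-15k": 14, "Multi-FLAN-CoT": 38, "Multi-FLAN-NIV2": 38,
    "Multi-FLAN-P3": 6, "Multi-SciRIFF": 67,
    "Multi-SmolTalk-Rewrite": 51, "Multi-SmolTalk-Summarize": 91,
}

total = sum(image_text.values()) + sum(text_only.values())
share = sum(text_only.values()) / total
print(f"~{total}K samples, {share:.0%} text-only")  # ~1076K samples, 28% text-only
```

Roughly a quarter of the Stage 2 mixture is text-only; the exact interleaving scheme is not specified here, only the composition.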
Training Recipe (Stage 2: Visual Instruction Tuning)
| Hyperparameter | Value |
|---|---|
| Backbone | Villanova-2B-Base-2603 |
| Optimizer | AdamW, weight decay 0.01 |
| Learning rate | 2e-5 |
| Scheduler | Cosine with warmup |
| Warmup steps | 200 |
| Epochs | 4 |
| Batch size (per GPU) | 1 |
| Gradient accumulation | 16 |
| GPUs | 8× H100 80GB |
| Effective batch size | 128 |
| Precision | bf16-mixed |
| Max seq length | 32,768 |
| Vision tower | frozen |
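A minimal sketch of the Stage 2 optimizer and schedule from the table (linear warmup, then cosine decay). The total step count is illustrative and this is not the released training loop:

```python
import math
import torch

# Toy parameter so the optimizer has something to hold.
params = [torch.nn.Parameter(torch.zeros(4))]
optimizer = torch.optim.AdamW(params, lr=2e-5, weight_decay=0.01)

warmup_steps = 200
total_steps = 10_000  # illustrative; depends on dataset size and epoch count

def lr_lambda(step: int) -> float:
    # Linear warmup to the peak LR, then cosine decay toward zero.
    if step < warmup_steps:
        return step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return 0.5 * (1.0 + math.cos(math.pi * progress))

scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)
```

Note that the effective batch size in the table follows from 1 (per GPU) × 16 (accumulation) × 8 (GPUs) = 128.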
How to Use
```python
import torch
from PIL import Image
from transformers import AutoProcessor, LlavaForConditionalGeneration

model_name = "VillanovaAI/Villanova-2B-VL-2603"
device = "cuda"

processor = AutoProcessor.from_pretrained(model_name)
model = LlavaForConditionalGeneration.from_pretrained(
    model_name,
    dtype=torch.bfloat16,
).to(device)
model.eval()

image = Image.open("example.jpg").convert("RGB")

# The `<image>` placeholder inside the content string marks where the
# image tokens will be inserted by the processor.
messages = [
    {"role": "user", "content": "<image>\nDescribe this image in detail."},
]
prompt = processor.apply_chat_template(messages, add_generation_prompt=True)

inputs = processor(images=image, text=prompt, return_tensors="pt").to(device, torch.bfloat16)

with torch.no_grad():
    output_ids = model.generate(**inputs, max_new_tokens=256, do_sample=False)

response = processor.decode(output_ids[0][inputs.input_ids.shape[1]:], skip_special_tokens=True)
print(response)
```
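As a rough sizing check for the single-GPU claim, the bf16 weights alone occupy about 5.2 GiB; activations and the KV cache come on top of this:

```python
# bfloat16 stores 2 bytes per parameter.
params = 2.79e9
weight_gib = params * 2 / 1024**3
print(f"~{weight_gib:.1f} GiB of weights in bf16")  # ~5.2 GiB
```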
Evaluation
Villanova-2B-VL-2603 was evaluated using VLMEvalKit on a suite of standard and multilingual VLM benchmarks covering multiple-choice reasoning, general visual question answering, hallucination robustness, and cross-lingual visual understanding. All evaluations use exact_matching judging (no LLM-as-judge) for full reproducibility.
We compare against Salamandra-VL-7B, a strong fully open European VLM covering the same set of languages.
Despite using less than a third of the parameters (~2.8B vs ~8.9B), Villanova-2B-VL-2603 matches Salamandra-VL-7B overall, with particular strengths on general VQA and multilingual benchmarks. Multilingual benchmarks are reported as the average across EN/DE/ES/FR/IT.
| Category | Benchmark | Salamandra-VL-7B | Villanova-2B-VL-2603 |
|---|---|---|---|
| MCQ / Reasoning | MMBench | 51.9 | 52.6 |
| MCQ / Reasoning | MMStar | 38.9 | 32.7 |
| MCQ / Reasoning | AI2D | 58.5 | 51.1 |
| MCQ / Reasoning | ScienceQA | 65.0 | 62.4 |
| General VQA | RealWorldQA | 45.5 | 46.8 |
| General VQA | CVQA | 32.9 | 37.6 |
| General VQA | MME | 1369 | 1565 |
| Hallucination | POPE | 86.8 | 81.1 |
| OCR / Document | OCRBench | 558 | 377 |
| Multilingual (avg) | Multi-MMBench | 59.7 | 61.0 |
| Multilingual (avg) | Multi-AI2D | 58.4 | 66.4 |
| Multilingual (avg) | Multi-MMStar | 50.2 | 47.6 |
| Overall | Average (0-100 benchmarks) | 54.8 | 53.9 |
The Overall row is the unweighted average across the 10 benchmarks on the 0-100 scale. MME and OCRBench are excluded because they use different scoring scales (0-2800 and 0-1000 respectively).
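The Overall row can be reproduced directly from the ten 0-100 rows above:

```python
# Scores copied from the ten 0-100 benchmark rows above, in order:
# MMBench, MMStar, AI2D, ScienceQA, RealWorldQA, CVQA, POPE,
# Multi-MMBench, Multi-AI2D, Multi-MMStar.
salamandra = [51.9, 38.9, 58.5, 65.0, 45.5, 32.9, 86.8, 59.7, 58.4, 50.2]
villanova  = [52.6, 32.7, 51.1, 62.4, 46.8, 37.6, 81.1, 61.0, 66.4, 47.6]

print(round(sum(salamandra) / 10, 1))  # 54.8
print(round(sum(villanova) / 10, 1))   # 53.9
```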
Multilingual Evaluation (Per-Language Detail)
The multilingual benchmarks (Multi-MMBench, Multi-AI2D, Multi-MMStar) are extensions of the standard benchmarks with parallel test sets in 5 European languages. Below is the per-language breakdown.
| Benchmark | Model | DE | EN | ES | FR | IT | Avg |
|---|---|---|---|---|---|---|---|
| Multi-MMBench | Salamandra-VL-7B | 59.3 | 64.8 | 62.7 | 57.0 | 54.7 | 59.7 |
| Multi-MMBench | Villanova-2B-VL-2603 | 60.8 | 62.8 | 58.9 | 60.6 | 61.6 | 61.0 |
| Multi-AI2D | Salamandra-VL-7B | 57.0 | 67.1 | 62.2 | 53.6 | 52.2 | 58.4 |
| Multi-AI2D | Villanova-2B-VL-2603 | 66.6 | 68.1 | 65.1 | 66.9 | 65.4 | 66.4 |
| Multi-MMStar | Salamandra-VL-7B | 46.6 | 56.5 | 52.3 | 48.2 | 47.3 | 50.2 |
| Multi-MMStar | Villanova-2B-VL-2603 | 45.9 | 50.1 | 47.0 | 49.2 | 45.7 | 47.6 |
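The Avg column can be cross-checked from the per-language scores; discrepancies up to 0.1 come from the one-decimal rounding of the reported per-language values:

```python
# Villanova-2B-VL-2603 per-language scores from the table above
# (DE, EN, ES, FR, IT), paired with the reported averages.
reported = {
    "Multi-MMBench": ([60.8, 62.8, 58.9, 60.6, 61.6], 61.0),
    "Multi-AI2D":    ([66.6, 68.1, 65.1, 66.9, 65.4], 66.4),
    "Multi-MMStar":  ([45.9, 50.1, 47.0, 49.2, 45.7], 47.6),
}
for name, (langs, avg) in reported.items():
    computed = sum(langs) / 5
    assert abs(computed - avg) <= 0.1, name
    print(f"{name}: computed {computed:.1f} (reported {avg})")
```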
Key takeaways:
- Competitive overall average (53.9 vs 54.8) against a model with ~3.2x more parameters
- Wins on general VQA: outperforms Salamandra-VL-7B on RealWorldQA, CVQA, and MME
- Solid multilingual capability across EN/DE/ES/FR/IT, with a particularly strong Multi-AI2D improvement (+8.0 avg, wins on all 5 languages) over Salamandra-VL-7B
- Balanced per-language performance: on Multi-AI2D and Multi-MMBench, Villanova performs uniformly across DE/EN/ES/FR/IT (no language collapse)
Intended Use
- Multilingual image captioning and description
- Visual question answering (single-image)
- Document and chart understanding (OCR-light tasks)
- Multimodal instruction following in EN/DE/ES/FR/IT
- Research on fully-open European VLMs
Limitations
- Single-image inference only (no multi-image or video support)
- OCR quality on dense, small-text documents is limited compared to specialized OCR-heavy VLMs
- As with all VLMs, outputs can contain hallucinations; users should verify factual claims
License
This model is released under the Apache 2.0 License. The training data used for Stage 2 was selected to allow permissive commercial use (no GPT/Claude-generated content).