Model Card for Villanova-2B-Base-2603
Villanova is a family of fully open, multilingual Large Language Models (LLMs) targeting the five major European languages. All model weights, training data sources, and training details are publicly released.
DISCLAIMER: This is a base model, not instruction-tuned. It is intended as a foundation for downstream fine-tuning and alignment.
Model Family
Villanova-2B-Base-2603 — Base model (4.4T) — 📍 This model
↳ Villanova-2B-2603 — SFT / Instruct
↳ Villanova-2B-2603-GGUF — Quantized
↳ Villanova-2B-VL-2603 — Vision-Language Instruct
↳ Villanova-2B-VL-2603-GGUF — Quantized
Villanova-2B-Base-2512-Preview — Base model (2.2T) (previous version, not recommended)
↳ Villanova-2B-2512-Preview — SFT / Instruct (previous version, not recommended)
Model Summary
Villanova-2B-Base-2603 is a decoder-only transformer with 2 billion parameters, pre-trained from scratch on 4.4 trillion tokens from a curated multilingual corpus. It supports sequences of up to 32,768 tokens. It is large enough to capture rich linguistic and factual knowledge, yet compact enough for fine-tuning and deployment in resource-constrained environments.
Primary languages: English, Italian, Spanish, French, German. The model has partial support for additional languages and code, but performance outside the five primary languages is not guaranteed.
The Villanova project is committed to full openness and data transparency. Training data sources, mixture details, architectural choices, and hyperparameters are all publicly documented. Data was selected with ethical sourcing as a guiding principle, prioritising high-quality, permissively licensed corpora.
Pre-training
Training followed a two-stage recipe:
Stage 1 (0 → 4.0T tokens) — Broad multilingual data mixture covering the five core languages, plus code, mathematics, and scientific text.
Stage 2 (4.0T → 4.4T tokens) — Cosine annealing over ~400B tokens of higher-quality, curated data.
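The shape of this two-stage recipe can be illustrated with a small learning-rate schedule. This is a minimal sketch, not the project's training code: it assumes the rate is held at the stated peak of 3×10⁻⁴ throughout Stage 1 and then follows a half-cosine decay across Stage 2. The omitted warmup, the constant Stage 1 rate, and the final rate of zero are assumptions made for illustration, not documented settings.

```python
import math

PEAK_LR = 3e-4        # peak learning rate from the training settings
STAGE1_END = 4.0e12   # tokens at which Stage 2 (annealing) begins
STAGE2_END = 4.4e12   # total training tokens
FINAL_LR = 0.0        # assumption: the card does not state the final value


def learning_rate(tokens_seen: float) -> float:
    """Return the sketched learning rate after `tokens_seen` training tokens."""
    if tokens_seen <= STAGE1_END:
        # Assumption: constant rate during Stage 1 (warmup omitted).
        return PEAK_LR
    # Fraction of the annealing stage completed, clamped to at most 1.
    progress = min((tokens_seen - STAGE1_END) / (STAGE2_END - STAGE1_END), 1.0)
    # Half-cosine factor that falls from 1 to 0 as progress goes from 0 to 1.
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))
    return FINAL_LR + (PEAK_LR - FINAL_LR) * cosine


for tokens in (2.0e12, 4.0e12, 4.2e12, 4.4e12):
    print(f"{tokens / 1e12:.1f}T tokens -> lr = {learning_rate(tokens):.2e}")
```

Under these assumptions the rate is halved at the midpoint of Stage 2 (4.2T tokens) and reaches the assumed final value at 4.4T.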
Villanova-2B-Base-2512-Preview is an intermediate checkpoint of this same training run, released at the 2.2T token mark with an early decay stage applied from 2.0T tokens onward.
Key training settings: AdamW optimizer (β₁=0.9, β₂=0.95, weight decay=0.1), peak learning rate 3×10⁻⁴, BF16/FP8 mixed precision, Flash Attention, sequences of 4,096 tokens. Training ran on 64 NVIDIA H100 GPUs (~30 days, ~36k tokens/GPU/second).
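As a rough sanity check on these figures, the per-GPU throughput can be converted into an aggregate rate and an idealized wall-clock time for 4.4T tokens. The sketch assumes perfectly sustained throughput with no restarts, evaluation pauses, or idle time, so it gives a lower bound on training time and is not a reproduction of the ~30 days reported. The function name is hypothetical.

```python
NUM_GPUS = 64
TOKENS_PER_GPU_PER_SECOND = 36_000  # "~36k" in the card, so approximate
TOTAL_TOKENS = 4.4e12
SECONDS_PER_DAY = 86_400


def ideal_training_days(total_tokens, num_gpus, tokens_per_gpu_per_second):
    """Days needed if every GPU sustained the given throughput without interruption."""
    cluster_tokens_per_second = num_gpus * tokens_per_gpu_per_second
    return total_tokens / cluster_tokens_per_second / SECONDS_PER_DAY


cluster_rate = NUM_GPUS * TOKENS_PER_GPU_PER_SECOND
days = ideal_training_days(TOTAL_TOKENS, NUM_GPUS, TOKENS_PER_GPU_PER_SECOND)
print(f"Cluster throughput: {cluster_rate:,} tokens/s")
print(f"Idealized time for 4.4T tokens: {days:.1f} days")
```

The idealized estimate comes out somewhat below the reported duration, which is what one would expect once real-world overheads are included.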
How to Use
This is a base model: it continues text rather than following instructions. For chat or task use, see Villanova-2B-2603.
from transformers import AutoModelForCausalLM, AutoTokenizer
model_name = "VillanovaAI/Villanova-2B-Base-2603"
device = "cuda" # or "cpu"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name).to(device)
prompt = "Gravity is a fundamental force of nature that"
model_inputs = tokenizer([prompt], return_tensors="pt").to(model.device)
generated_ids = model.generate(
    **model_inputs,
    max_new_tokens=128,
    do_sample=True,
    temperature=0.7,
    top_p=0.9,
)
output_ids = generated_ids[0][len(model_inputs.input_ids[0]):]
print(tokenizer.decode(output_ids, skip_special_tokens=True))
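Because a base model only continues text, tasks are usually framed as a pattern for it to complete. The sketch below builds a few-shot prompt with plain string formatting. The helper name `build_few_shot_prompt`, the `Question:`/`Answer:` labels, and the example pairs are hypothetical choices for illustration, not a format prescribed for this model.

```python
def build_few_shot_prompt(examples, query, q_label="Question", a_label="Answer"):
    """Format (question, answer) pairs followed by an unanswered query."""
    blocks = [f"{q_label}: {q}\n{a_label}: {a}" for q, a in examples]
    # End on the answer label so that the model's continuation is the answer.
    blocks.append(f"{q_label}: {query}\n{a_label}:")
    return "\n\n".join(blocks)


examples = [
    ("What is the capital of Italy?", "Rome"),
    ("What is the capital of Spain?", "Madrid"),
]
prompt = build_few_shot_prompt(examples, "What is the capital of France?")
print(prompt)
```

The resulting string can be used as `prompt` in the generation snippet above, typically with a small `max_new_tokens` so that the model stops shortly after the answer.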
Evaluation
Global evaluation:
| Model | Avg | arc_easy | hellaswag | hellaswag_de | hellaswag_es | hellaswag_fr | hellaswag_it | openbookqa | piqa | sciq | winogrande | xcopa_it | xnli_de | xnli_en | xnli_es | xnli_fr | xquad_de | xquad_en | xquad_es |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| EuroLLM-1.7B | 48.72 | 69.07 | 45.04 | 37.97 | 40.98 | 40.05 | 39.46 | 29.80 | 72.20 | 90.60 | 61.25 | 66.00 | 47.99 | 50.24 | 45.58 | 49.00 | 27.50 | 34.60 | 29.65 |
| Llama-3.2-1B | 46.13 | 66.29 | 48.16 | 34.11 | 37.41 | 35.48 | 34.91 | 27.80 | 75.14 | 93.50 | 60.69 | 59.40 | 46.02 | 54.82 | 41.37 | 46.95 | 16.37 | 37.18 | 14.84 |
| Minerva-3B-base-v1.0 | 40.73 | 62.33 | 46.28 | 27.20 | 29.69 | 29.02 | 40.01 | 24.60 | 74.27 | 88.00 | 56.75 | 69.60 | 34.54 | 52.13 | 36.31 | 37.35 | 4.31 | 14.21 | 6.52 |
| OLMo-2-0425-1B | 47.70 | 72.73 | 50.79 | 29.79 | 31.34 | 32.60 | 29.19 | 30.00 | 75.95 | 95.30 | 64.72 | 52.60 | 40.00 | 51.77 | 37.63 | 42.89 | 20.34 | 68.25 | 32.74 |
| Qwen3-1.7B-Base | 53.29 | 73.61 | 49.29 | 37.54 | 40.73 | 39.27 | 38.45 | 30.20 | 75.90 | 95.80 | 64.01 | 64.20 | 46.47 | 54.50 | 44.06 | 45.78 | 39.59 | 69.60 | 50.21 |
| salamandra-2b | 50.58 | 71.04 | 47.19 | 38.01 | 42.07 | 40.60 | 38.56 | 26.80 | 72.69 | 91.90 | 61.72 | 65.40 | 47.79 | 51.97 | 49.08 | 48.67 | 41.73 | 41.55 | 33.72 |
| Villanova-2B-Base-2512-Preview | 54.26 | 75.13 | 48.57 | 42.06 | 45.72 | 44.62 | 43.32 | 26.60 | 75.08 | 94.40 | 61.96 | 68.40 | 49.36 | 52.21 | 49.04 | 52.33 | 41.28 | 66.66 | 40.03 |
| Villanova-2B-Base-2603 | 54.91 | 73.74 | 49.53 | 42.91 | 46.81 | 45.49 | 44.21 | 25.20 | 74.32 | 94.10 | 59.04 | 68.80 | 49.48 | 54.30 | 49.00 | 50.72 | 44.94 | 72.52 | 43.37 |
English only:
| Model | Avg | arc_easy | hellaswag | openbookqa | piqa | sciq | winogrande | xnli_en | xquad_en |
|---|---|---|---|---|---|---|---|---|---|
| EuroLLM-1.7B | 56.60 | 69.07 | 45.04 | 29.80 | 72.20 | 90.60 | 61.25 | 50.24 | 34.60 |
| Llama-3.2-1B | 57.95 | 66.29 | 48.16 | 27.80 | 75.14 | 93.50 | 60.69 | 54.82 | 37.18 |
| Minerva-3B-base-v1.0 | 52.32 | 62.33 | 46.28 | 24.60 | 74.27 | 88.00 | 56.75 | 52.13 | 14.21 |
| OLMo-2-0425-1B | 63.69 | 72.73 | 50.79 | 30.00 | 75.95 | 95.30 | 64.72 | 51.77 | 68.25 |
| Qwen3-1.7B-Base | 64.11 | 73.61 | 49.29 | 30.20 | 75.90 | 95.80 | 64.01 | 54.50 | 69.60 |
| salamandra-2b | 58.11 | 71.04 | 47.19 | 26.80 | 72.69 | 91.90 | 61.72 | 51.97 | 41.55 |
| Villanova-2B-Base-2512-Preview | 62.58 | 75.13 | 48.57 | 26.60 | 75.08 | 94.40 | 61.96 | 52.21 | 66.66 |
| Villanova-2B-Base-2603 | 62.84 | 73.74 | 49.53 | 25.20 | 74.32 | 94.10 | 59.04 | 54.30 | 72.52 |
Multilingual benchmarks:
| Model | Avg | hellaswag_de | hellaswag_es | hellaswag_fr | hellaswag_it | xcopa_it | xnli_de | xnli_es | xnli_fr | xquad_de | xquad_es |
|---|---|---|---|---|---|---|---|---|---|---|---|
| EuroLLM-1.7B | 42.42 | 37.97 | 40.98 | 40.05 | 39.46 | 66.00 | 47.99 | 45.58 | 49.00 | 27.50 | 29.65 |
| Llama-3.2-1B | 36.69 | 34.11 | 37.41 | 35.48 | 34.91 | 59.40 | 46.02 | 41.37 | 46.95 | 16.37 | 14.84 |
| Minerva-3B-base-v1.0 | 31.45 | 27.20 | 29.69 | 29.02 | 40.01 | 69.60 | 34.54 | 36.31 | 37.35 | 4.31 | 6.52 |
| OLMo-2-0425-1B | 34.91 | 29.79 | 31.34 | 32.60 | 29.19 | 52.60 | 40.00 | 37.63 | 42.89 | 20.34 | 32.74 |
| Qwen3-1.7B-Base | 44.63 | 37.54 | 40.73 | 39.27 | 38.45 | 64.20 | 46.47 | 44.06 | 45.78 | 39.59 | 50.21 |
| salamandra-2b | 44.56 | 38.01 | 42.07 | 40.60 | 38.56 | 65.40 | 47.79 | 49.08 | 48.67 | 41.73 | 33.72 |
| Villanova-2B-Base-2512-Preview | 47.61 | 42.06 | 45.72 | 44.62 | 43.32 | 68.40 | 49.36 | 49.04 | 52.33 | 41.28 | 40.03 |
| Villanova-2B-Base-2603 | 48.57 | 42.91 | 46.81 | 45.49 | 44.21 | 68.80 | 49.48 | 49.00 | 50.72 | 44.94 | 43.37 |
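The Avg columns in the three tables above appear consistent with an unweighted mean over the tasks in each table. This is an inference from the numbers, not something the card states. The sketch below recomputes the three averages for Villanova-2B-Base-2603 from its per-task scores; because those scores are already rounded to two decimals, the recomputed values can differ from the reported ones in the last digit.

```python
from statistics import mean

# Per-task scores for Villanova-2B-Base-2603, copied from the tables above.
english = {
    "arc_easy": 73.74, "hellaswag": 49.53, "openbookqa": 25.20, "piqa": 74.32,
    "sciq": 94.10, "winogrande": 59.04, "xnli_en": 54.30, "xquad_en": 72.52,
}
multilingual = {
    "hellaswag_de": 42.91, "hellaswag_es": 46.81, "hellaswag_fr": 45.49,
    "hellaswag_it": 44.21, "xcopa_it": 68.80, "xnli_de": 49.48,
    "xnli_es": 49.00, "xnli_fr": 50.72, "xquad_de": 44.94, "xquad_es": 43.37,
}

english_avg = mean(english.values())
multilingual_avg = mean(multilingual.values())
# Assumption: the global average weights all 18 tasks equally.
global_avg = mean(list(english.values()) + list(multilingual.values()))

print(f"English: {english_avg:.3f}  (reported 62.84)")
print(f"Multilingual: {multilingual_avg:.3f}  (reported 48.57)")
print(f"Global: {global_avg:.3f}  (reported 54.91)")
```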
Long context (RULER):
Note: All tests were run with the context length forced to 32k, which exceeds the native context length of some of the models listed.
| Model | Native Context | Avg (32k) |
|---|---|---|
| Qwen3-1.7B-Base | 32k | 0.73 |
| Villanova-2B-Base-2603 | 32k | 0.49 |
| gemma-3-1b-pt | 32k | 0.28 |
| salamandra-2b | 8k | 0.12 |
| EuroLLM-1.7B | 4k | 0.08 |
| OLMo-2-0425-1B | 4k | 0.00 |
| Villanova-2B-Base-2512-Preview | 4k | 0.00 |
| Minerva-3B-base-v1.0 | 16k | 0.00 |
Training Data
The model's training pipeline comprises two main stages, an initial pre-training stage focused on broad linguistic and factual coverage and an annealing (decay) stage designed to consolidate knowledge and improve reasoning capabilities, followed by a shorter long-context extension stage.
Stage 1: Pre-training
The first stage was trained on approximately 3.6 trillion tokens (occupying ~15 TB of disk space). The distribution prioritizes five core languages while maintaining a global language coverage baseline. The mixture consists of approximately 37.5% English, large allocations for target Latin-script languages (German, Spanish, French, Italian), 5% code, 2% secondary Latin-script languages, and 6% for broader global languages.
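The stated shares leave the combined allocation for the four non-English target languages implicit. The following sketch works out that remainder and the corresponding token counts. It assumes the listed categories are exhaustive, treats the approximate percentages as exact, and applies them to the ~3.6T Stage 1 tokens; the dictionary keys are hypothetical labels.

```python
STAGE1_TOKENS = 3.6e12  # approximate Stage 1 token count from the text

# Shares stated explicitly in the text, in percent.
stated_shares = {
    "english": 37.5,
    "code": 5.0,
    "secondary_latin_script": 2.0,
    "global_languages": 6.0,
}

# Assumption: whatever remains is shared by German, Spanish, French, and Italian.
target_languages_share = 100.0 - sum(stated_shares.values())
target_languages_tokens = STAGE1_TOKENS * target_languages_share / 100.0

print(f"German + Spanish + French + Italian: {target_languages_share:.1f}%")
print(f"Approximate tokens: {target_languages_tokens / 1e12:.2f}T")
for name, share in stated_shares.items():
    print(f"{name}: {STAGE1_TOKENS * share / 100.0 / 1e12:.3f}T tokens")
```

How the remainder is split among the four languages is not stated, so the sketch does not attempt a per-language breakdown.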
The primary datasets utilized in this stage include:
- Web Corpora: FineWeb-2, FineWeb-Edu, and FineWeb2-HQ provide a massive multilingual foundation.
- Encyclopedic & Academic: FineWiki, alongside academic papers from Arxiv and PubMed (Common Pile).
- Structured Text: FinePDFs supplies high-quality text extracted from structured documents.
- Quantitative & Technical: FineMath and Stack-Edu establish foundational mathematical reasoning and coding proficiency.
Stage 2: Annealing (Decay)
During the final decay stage, which spans ~400 billion tokens, part of the general web data was replaced with a highly curated set of academic, structured, and instructional corpora, with the aim of improving reasoning as the learning rate decays.
High-quality sources introduced in the annealing stage include:
- Common-Pile StackExchange: Q&A threads focusing on technical and scientific domains.
- GitHub Issues & Kaggle Notebooks: A curated concatenation of ~11 billion tokens of repository discussions and ~1.7 billion tokens of analytical notebooks to improve technical problem-solving.
- FLAN Dolma-Mix Subset: Instruction-formatted text extracted from the Dolma 1.7 dataset, carefully curated to avoid evaluation suite contamination.
- Advanced Mathematics: InfiWebMath and FineMath corpora.
Stage 3: Long Context Extension
A final training stage extended the model's effective context window using an additional 50 billion tokens. The data distribution for this stage resembles the annealing mixture but uses a shifted sampling strategy that strictly prioritizes long-form documents. The aim is for the model to process and retrieve information across extended sequences while preserving the reasoning ability and knowledge density established during the decay stage.
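One possible reading of "strictly prioritizes long-form documents" is a selection that fills a token budget with the longest documents first. The sketch below illustrates that reading only. The function name `select_long_documents`, the greedy longest-first rule, and the toy corpus are assumptions; the actual sampling strategy is not described in enough detail here to reproduce.

```python
def select_long_documents(doc_lengths, token_budget):
    """Pick documents longest-first, skipping any that would exceed the budget.

    doc_lengths maps a document id to its length in tokens.
    Returns the selected ids (in selection order) and the tokens used.
    """
    selected, used = [], 0
    by_length = sorted(doc_lengths.items(), key=lambda item: item[1], reverse=True)
    for doc_id, length in by_length:
        if used + length > token_budget:
            continue  # this document would overshoot the budget
        selected.append(doc_id)
        used += length
    return selected, used


# Toy corpus with hypothetical document lengths in tokens.
toy_corpus = {"a": 30_000, "b": 1_200, "c": 18_000, "d": 500, "e": 9_000}
chosen, total = select_long_documents(toy_corpus, token_budget=50_000)
print(chosen, total)
```

A production pipeline would more likely reweight sampling probabilities by length than apply a hard greedy cut, but the greedy version keeps the idea easy to follow.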
License
This model is released under the Apache 2.0 License.