Model Card for Villanova-2B-Base-2603


Villanova is a family of fully open, multilingual Large Language Models (LLMs) targeting the five major European languages. All model weights, training data sources, and training details are publicly released.

DISCLAIMER: This is a base model, not instruction-tuned. It is intended as a foundation for downstream fine-tuning and alignment.


Model Family

Villanova-2B-Base-2603 — Base model (4.4T) — 📍 This model
 ↳ Villanova-2B-2603 — SFT / Instruct
  ↳ Villanova-2B-2603-GGUF — Quantized
 ↳ Villanova-2B-VL-2603 — Vision-Language Instruct
  ↳ Villanova-2B-VL-2603-GGUF — Quantized

Villanova-2B-Base-2512-Preview — Base model (2.2T) (previous version, not recommended)
 ↳ Villanova-2B-2512-Preview — SFT / Instruct (previous version, not recommended)


Model Summary

Villanova-2B-Base-2603 is a decoder-only transformer with 2 billion parameters, pre-trained from scratch on 4.4 trillion tokens from a curated multilingual corpus. It supports sequences of up to 32,768 tokens. It is large enough to capture rich linguistic and factual knowledge, yet compact enough for fine-tuning and deployment in resource-constrained environments.

Primary languages: English, Italian, Spanish, French, German. Additional languages and code are partially supported, but performance outside the five primary languages is not guaranteed.

The Villanova project is committed to full openness and data transparency. Training data sources, mixture details, architectural choices, and hyperparameters are all publicly documented. Data was selected with ethical sourcing as a guiding principle, prioritising high-quality, permissively licensed corpora.


Pre-training

Training followed a two-stage recipe:

Stage 1 (0 → 4.0T tokens) — Broad multilingual data mixture covering the five core languages, plus code, mathematics, and scientific text.

Stage 2 (4.0T → 4.4T tokens) — Cosine annealing over ~400B tokens of higher-quality, curated data.

Villanova-2B-Base-2512-Preview is an intermediate checkpoint of this same training run, released at the 2.2T token mark with an early decay stage applied from 2.0T tokens onward.
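
The shape of the two-stage recipe can be sketched as a learning-rate function of tokens seen. This is only an illustration, not the project's training code: it assumes the 3×10⁻⁴ peak rate reported for this run is held constant throughout Stage 1, that warmup can be ignored, and that the cosine decay ends at zero. None of those three details is stated in this card, and the names below are hypothetical.

```python
import math

PEAK_LR = 3e-4        # peak learning rate reported for this run
STAGE1_END = 4.0e12   # tokens at which Stage 2 (annealing) begins
STAGE2_END = 4.4e12   # tokens at the end of Stage 2
MIN_LR = 0.0          # assumption: the card does not state the final learning rate


def learning_rate(tokens_seen: float) -> float:
    """Return the assumed learning rate after `tokens_seen` training tokens."""
    if tokens_seen <= STAGE1_END:
        # Assumption: Stage 1 holds the peak rate (warmup is not modelled).
        return PEAK_LR
    # Fraction of the ~400B-token annealing stage completed, clamped to 1.
    progress = min((tokens_seen - STAGE1_END) / (STAGE2_END - STAGE1_END), 1.0)
    # Standard cosine decay from PEAK_LR down to MIN_LR.
    return MIN_LR + 0.5 * (PEAK_LR - MIN_LR) * (1.0 + math.cos(math.pi * progress))


for tokens in (2.0e12, 4.0e12, 4.2e12, 4.4e12):
    print(f"{tokens / 1e12:.1f}T tokens -> lr = {learning_rate(tokens):.2e}")
```

Under these assumptions the rate stays at its peak until 4.0T tokens, is halved midway through the annealing stage, and reaches the minimum at 4.4T tokens.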

Key training settings: AdamW optimizer (β₁=0.9, β₂=0.95, weight decay=0.1), peak learning rate 3×10⁻⁴, BF16/FP8 mixed precision, Flash Attention, sequences of 4,096 tokens. Training ran on 64 NVIDIA H100 GPUs (~30 days, ~36k tokens/GPU/second).
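
The hardware and throughput figures can be cross-checked with simple arithmetic. The sketch below assumes the ~36k tokens/GPU/second figure is a sustained rate across all 64 GPUs; the card does not say whether it is a peak or an average.

```python
GPUS = 64
TOKENS_PER_GPU_PER_SEC = 36_000   # ~36k tokens/GPU/second, as reported
TOTAL_TOKENS = 4.4e12             # total pre-training tokens
SECONDS_PER_DAY = 86_400

# Aggregate throughput of the whole cluster.
cluster_tokens_per_sec = GPUS * TOKENS_PER_GPU_PER_SEC

# Days needed if every GPU ran at the reported rate without interruption.
ideal_days = TOTAL_TOKENS / cluster_tokens_per_sec / SECONDS_PER_DAY

print(f"{cluster_tokens_per_sec:,} tokens/s")   # 2,304,000 tokens/s
print(f"{ideal_days:.1f} days")                 # 22.1 days
```

At that rate, 4.4T tokens correspond to roughly 22 days of uninterrupted compute. The reported ~30 days would then include time not spent at the quoted throughput, which is an inference from the numbers rather than something the card states.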


How to Use

This is a base model: it continues text rather than following instructions. For chat or task use, see Villanova-2B-2603.

from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "VillanovaAI/Villanova-2B-Base-2603"
device = "cuda"  # or "cpu"

# Load the tokenizer and the base (non-instruct) model weights.
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name).to(device)

# A base model continues text, so the prompt is an unfinished sentence.
prompt = "Gravity is a fundamental force of nature that"
model_inputs = tokenizer([prompt], return_tensors="pt").to(model.device)

# Sample a continuation of up to 128 new tokens.
generated_ids = model.generate(
    **model_inputs,
    max_new_tokens=128,
    do_sample=True,
    temperature=0.7,
    top_p=0.9,
)
# Keep only the newly generated tokens, dropping the echoed prompt.
output_ids = generated_ids[0][len(model_inputs.input_ids[0]):]
print(tokenizer.decode(output_ids, skip_special_tokens=True))
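
Since the model supports sequences of up to 32,768 tokens, the prompt and the generated tokens together need to fit within that limit. The helper below is a minimal sketch of one way to enforce this before calling `generate`; `fit_prompt` is a hypothetical name, and dropping the oldest tokens is just one possible truncation policy.

```python
MAX_CONTEXT = 32_768  # maximum sequence length stated in the Model Summary


def fit_prompt(token_ids: list[int], max_new_tokens: int,
               max_context: int = MAX_CONTEXT) -> list[int]:
    """Drop the oldest prompt tokens so prompt + generation fits the context."""
    if max_new_tokens >= max_context:
        raise ValueError("max_new_tokens leaves no room for the prompt")
    # Number of prompt tokens that can be kept.
    budget = max_context - max_new_tokens
    # Keep the most recent `budget` tokens; shorter prompts are returned whole.
    return token_ids[-budget:]


# Toy example with a tiny context window to make the effect visible.
print(fit_prompt(list(range(10)), max_new_tokens=4, max_context=8))  # [6, 7, 8, 9]
```

With the snippet above, the input would be the plain list of token ids, for example `tokenizer(prompt)["input_ids"]`, truncated before it is converted to tensors.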

Evaluation

[Figure: model size vs. performance]

Global evaluation:

Model Avg arc_easy hellaswag hellaswag_de hellaswag_es hellaswag_fr hellaswag_it openbookqa piqa sciq winogrande xcopa_it xnli_de xnli_en xnli_es xnli_fr xquad_de xquad_en xquad_es
EuroLLM-1.7B 48.72 69.07 45.04 37.97 40.98 40.05 39.46 29.80 72.20 90.60 61.25 66.00 47.99 50.24 45.58 49.00 27.50 34.60 29.65
Llama-3.2-1B 46.13 66.29 48.16 34.11 37.41 35.48 34.91 27.80 75.14 93.50 60.69 59.40 46.02 54.82 41.37 46.95 16.37 37.18 14.84
Minerva-3B-base-v1.0 40.73 62.33 46.28 27.20 29.69 29.02 40.01 24.60 74.27 88.00 56.75 69.60 34.54 52.13 36.31 37.35 4.31 14.21 6.52
OLMo-2-0425-1B 47.70 72.73 50.79 29.79 31.34 32.60 29.19 30.00 75.95 95.30 64.72 52.60 40.00 51.77 37.63 42.89 20.34 68.25 32.74
Qwen3-1.7B-Base 53.29 73.61 49.29 37.54 40.73 39.27 38.45 30.20 75.90 95.80 64.01 64.20 46.47 54.50 44.06 45.78 39.59 69.60 50.21
salamandra-2b 50.58 71.04 47.19 38.01 42.07 40.60 38.56 26.80 72.69 91.90 61.72 65.40 47.79 51.97 49.08 48.67 41.73 41.55 33.72
Villanova-2B-Base-2512-Preview 54.26 75.13 48.57 42.06 45.72 44.62 43.32 26.60 75.08 94.40 61.96 68.40 49.36 52.21 49.04 52.33 41.28 66.66 40.03
Villanova-2B-Base-2603 54.91 73.74 49.53 42.91 46.81 45.49 44.21 25.20 74.32 94.10 59.04 68.80 49.48 54.30 49.00 50.72 44.94 72.52 43.37

English only:

Model Avg arc_easy hellaswag openbookqa piqa sciq winogrande xnli_en xquad_en
EuroLLM-1.7B 56.60 69.07 45.04 29.80 72.20 90.60 61.25 50.24 34.60
Llama-3.2-1B 57.95 66.29 48.16 27.80 75.14 93.50 60.69 54.82 37.18
Minerva-3B-base-v1.0 52.32 62.33 46.28 24.60 74.27 88.00 56.75 52.13 14.21
OLMo-2-0425-1B 63.69 72.73 50.79 30.00 75.95 95.30 64.72 51.77 68.25
Qwen3-1.7B-Base 64.11 73.61 49.29 30.20 75.90 95.80 64.01 54.50 69.60
salamandra-2b 58.11 71.04 47.19 26.80 72.69 91.90 61.72 51.97 41.55
Villanova-2B-Base-2512-Preview 62.58 75.13 48.57 26.60 75.08 94.40 61.96 52.21 66.66
Villanova-2B-Base-2603 62.84 73.74 49.53 25.20 74.32 94.10 59.04 54.30 72.52
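
The Avg column in this table appears to be the unweighted mean of the per-task scores. The short check below recomputes it for two rows copied from the table; it is a sanity check on the published numbers, not the evaluation code itself.

```python
from statistics import mean

# Per-task scores copied from the "English only" table, in column order:
# arc_easy, hellaswag, openbookqa, piqa, sciq, winogrande, xnli_en, xquad_en
english_scores = {
    "Villanova-2B-Base-2603": [73.74, 49.53, 25.20, 74.32, 94.10, 59.04, 54.30, 72.52],
    "EuroLLM-1.7B": [69.07, 45.04, 29.80, 72.20, 90.60, 61.25, 50.24, 34.60],
}

# Unweighted mean over tasks, rounded to two decimals like the table.
averages = {name: round(mean(scores), 2) for name, scores in english_scores.items()}

for name, avg in averages.items():
    print(f"{name}: {avg:.2f}")
```

Both rows reproduce the Avg values listed above. Recomputing averages from the rounded per-task scores in the other tables may differ in the last digit.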

Multilingual benchmarks:

Model Avg hellaswag_de hellaswag_es hellaswag_fr hellaswag_it xcopa_it xnli_de xnli_es xnli_fr xquad_de xquad_es
EuroLLM-1.7B 42.42 37.97 40.98 40.05 39.46 66.00 47.99 45.58 49.00 27.50 29.65
Llama-3.2-1B 36.69 34.11 37.41 35.48 34.91 59.40 46.02 41.37 46.95 16.37 14.84
Minerva-3B-base-v1.0 31.45 27.20 29.69 29.02 40.01 69.60 34.54 36.31 37.35 4.31 6.52
OLMo-2-0425-1B 34.91 29.79 31.34 32.60 29.19 52.60 40.00 37.63 42.89 20.34 32.74
Qwen3-1.7B-Base 44.63 37.54 40.73 39.27 38.45 64.20 46.47 44.06 45.78 39.59 50.21
salamandra-2b 44.56 38.01 42.07 40.60 38.56 65.40 47.79 49.08 48.67 41.73 33.72
Villanova-2B-Base-2512-Preview 47.61 42.06 45.72 44.62 43.32 68.40 49.36 49.04 52.33 41.28 40.03
Villanova-2B-Base-2603 48.57 42.91 46.81 45.49 44.21 68.80 49.48 49.00 50.72 44.94 43.37

Long context (RULER):

Note: All tests were run with the context length forced to 32k, which exceeds the native context length of some of the models listed.

Model | Native context | Avg (32k)
Qwen3-1.7B-Base | 32k | 0.73
Villanova-2B-Base-2603 | 32k | 0.49
gemma-3-1b-pt | 32k | 0.28
salamandra-2b | 8k | 0.12
EuroLLM-1.7B | 4k | 0.08
OLMo-2-0425-1B | 4k | 0.00
Villanova-2B-Base-2512-Preview | 4k | 0.00
Minerva-3B-base-v1.0 | 16k | 0.00

Training Data

The model's training pipeline is divided into two main stages: an initial pre-training stage focused on broad linguistic and factual coverage, and an annealing (decay) stage designed to consolidate knowledge and improve reasoning capabilities. A shorter long-context extension stage follows.

Stage 1: Pre-training

The first stage was trained on approximately 3.6 trillion tokens (occupying ~15 TB of disk space). The distribution prioritizes five core languages while maintaining a global language coverage baseline. The mixture consists of approximately 37.5% English, large allocations for target Latin-script languages (German, Spanish, French, Italian), 5% code, 2% secondary Latin-script languages, and 6% for broader global languages.
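
The stated shares can be turned into approximate token counts. The sketch below assumes the percentages apply to the full ~3.6 trillion tokens and that everything not listed explicitly goes to the four target languages combined; the card gives no per-language split, so none is computed.

```python
STAGE1_TOKENS = 3.6e12  # approximate Stage 1 token count stated above

# Shares stated in the card, in percent.
stated_shares = {
    "English": 37.5,
    "code": 5.0,
    "secondary Latin-script languages": 2.0,
    "broader global languages": 6.0,
}

# Assumption: the remainder is German, Spanish, French and Italian combined.
target_share = 100.0 - sum(stated_shares.values())

shares = {**stated_shares, "German + Spanish + French + Italian": target_share}
for name, pct in shares.items():
    tokens_trillions = STAGE1_TOKENS * pct / 100 / 1e12
    print(f"{name}: {pct:.1f}% (about {tokens_trillions:.2f}T tokens)")
```

Under these assumptions the four target languages together account for about 49.5% of the Stage 1 mixture.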

The primary datasets utilized in this stage include:

  • Web Corpora: FineWeb-2, FineWeb-Edu, and FineWeb2-HQ provide a massive multilingual foundation.
  • Encyclopedic & Academic: FineWiki, alongside academic papers from Arxiv and PubMed (Common Pile).
  • Structured Text: FinePDFs supplies high-quality text extracted from structured documents.
  • Quantitative & Technical: FineMath and Stack-Edu establish foundational mathematical reasoning and coding proficiency.

Stage 2: Annealing (Decay)

During the final decay stage (~400 billion tokens), general web data was partially replaced with a curated set of academic, structured, and instructional corpora, with the aim of improving reasoning as the model converges.

High-quality sources introduced in the annealing stage include:

  • Common-Pile StackExchange: Q&A threads focusing on technical and scientific domains.
  • GitHub Issues & Kaggle Notebooks: A curated concatenation of ~11 billion tokens of repository discussions and ~1.7 billion tokens of analytical notebooks to improve technical problem-solving.
  • FLAN Dolma-Mix Subset: Instruction-formatted text extracted from the Dolma 1.7 dataset, carefully curated to avoid evaluation suite contamination.
  • Advanced Mathematics: InfiWebMath and FineMath corpora.

Stage 3: Long Context Extension

A final training stage extended the model's effective context window using an additional 50 billion tokens. The data distribution resembles the annealing mixture, but sampling is shifted to strongly prioritize long-form documents. The aim is for the model to process and retrieve information across long sequences while preserving the reasoning ability and knowledge acquired during the decay stage.


License

This model is released under the Apache 2.0 License.
