Instructions to use VillanovaAI/Villanova-2B-Base-2603 with libraries, inference providers, notebooks, and local apps.

Libraries

How to use VillanovaAI/Villanova-2B-Base-2603 with Transformers:

# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("text-generation", model="VillanovaAI/Villanova-2B-Base-2603")

# Load model directly
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("VillanovaAI/Villanova-2B-Base-2603")
model = AutoModelForCausalLM.from_pretrained("VillanovaAI/Villanova-2B-Base-2603")

Local Apps

vLLM

How to use VillanovaAI/Villanova-2B-Base-2603 with vLLM:

Install from pip and serve model

# Install vLLM from pip:
pip install vllm
# Start the vLLM server:
vllm serve "VillanovaAI/Villanova-2B-Base-2603"
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "VillanovaAI/Villanova-2B-Base-2603",
		"prompt": "Once upon a time,",
		"max_tokens": 512,
		"temperature": 0.5
	}'
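
The same request can also be sent from Python. The sketch below uses only the standard library and assumes the vLLM server started above is listening on localhost:8000. The helper names `build_payload` and `complete` are illustrative and are not part of vLLM.

```python
import json
import urllib.request

MODEL = "VillanovaAI/Villanova-2B-Base-2603"
URL = "http://localhost:8000/v1/completions"  # assumes the port used in the curl example


def build_payload(prompt, max_tokens=512, temperature=0.5):
    """Build the same JSON body as the curl example."""
    return {
        "model": MODEL,
        "prompt": prompt,
        "max_tokens": max_tokens,
        "temperature": temperature,
    }


def complete(prompt, url=URL, **kwargs):
    """POST a completion request and return the generated text."""
    body = json.dumps(build_payload(prompt, **kwargs)).encode("utf-8")
    request = urllib.request.Request(
        url, data=body, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(request) as response:
        result = json.load(response)
    # OpenAI-compatible completion responses carry the text in choices[0].text.
    return result["choices"][0]["text"]
```

With the server running, `complete("Once upon a time,")` should return the model's continuation of the prompt.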

Use Docker

docker model run hf.co/VillanovaAI/Villanova-2B-Base-2603

SGLang

How to use VillanovaAI/Villanova-2B-Base-2603 with SGLang:

Install from pip and serve model

# Install SGLang from pip:
pip install sglang
# Start the SGLang server:
python3 -m sglang.launch_server \
    --model-path "VillanovaAI/Villanova-2B-Base-2603" \
    --host 0.0.0.0 \
    --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "VillanovaAI/Villanova-2B-Base-2603",
		"prompt": "Once upon a time,",
		"max_tokens": 512,
		"temperature": 0.5
	}'

Use Docker images

docker run --gpus all \
    --shm-size 32g \
    -p 30000:30000 \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    --env "HF_TOKEN=<secret>" \
    --ipc=host \
    lmsysorg/sglang:latest \
    python3 -m sglang.launch_server \
        --model-path "VillanovaAI/Villanova-2B-Base-2603" \
        --host 0.0.0.0 \
        --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "VillanovaAI/Villanova-2B-Base-2603",
		"prompt": "Once upon a time,",
		"max_tokens": 512,
		"temperature": 0.5
	}'

Docker Model Runner

How to use VillanovaAI/Villanova-2B-Base-2603 with Docker Model Runner:

```
docker model run hf.co/VillanovaAI/Villanova-2B-Base-2603
```

Model Card for Villanova-2B-Base-2603

Villanova is a family of fully open, multilingual Large Language Models (LLMs) targeting the five major European languages. All model weights, training data sources, and training details are publicly released.

DISCLAIMER: This is a base model, not instruction-tuned. It is intended as a foundation for downstream fine-tuning and alignment.

Model Family

Villanova-2B-Base-2603 — Base model (4.4T) — 📍 This model
↳ Villanova-2B-2603 — SFT / Instruct
↳ Villanova-2B-2603-GGUF — Quantized
↳ Villanova-2B-VL-2603 — Vision-Language Instruct
↳ Villanova-2B-VL-2603-GGUF — Quantized

Villanova-2B-Base-2512-Preview — Base model (2.2T) (previous version, not recommended)
↳ Villanova-2B-2512-Preview — SFT / Instruct (previous version, not recommended)

Model Summary

Villanova-2B-Base-2603 is a decoder-only transformer with 2 billion parameters, pre-trained from scratch on 4.4 trillion tokens from a curated multilingual corpus. It supports sequences of up to 32,768 tokens. It is large enough to capture rich linguistic and factual knowledge, yet compact enough for fine-tuning and deployment in resource-constrained environments.
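
As a rough illustration of the footprint, the size of the weights alone can be estimated from the parameter count. The sketch below takes "2 billion parameters" as an exact figure and counts only weight storage. Activations, the KV cache, and framework overhead are ignored, and the 4-bit entry is a generic quantization assumption rather than a property of any released file.

```python
PARAMS = 2_000_000_000  # "2 billion parameters" from the summary, taken as exact

# Bytes needed to store one parameter at each precision.
BYTES_PER_PARAM = {
    "fp32": 4.0,
    "bf16": 2.0,
    "int8": 1.0,
    "4-bit": 0.5,  # generic assumption, not tied to a specific quantized release
}


def weight_gigabytes(params, bytes_per_param):
    """Return the weight storage in gigabytes (10^9 bytes)."""
    return params * bytes_per_param / 1e9


for name, width in BYTES_PER_PARAM.items():
    print(f"{name:>5}: {weight_gigabytes(PARAMS, width):.1f} GB")
```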

Primary languages: English, Italian, Spanish, French, German. The model offers partial support for additional languages and code, but performance outside the five primary languages is not guaranteed.

The Villanova project is committed to full openness and data transparency. Training data sources, mixture details, architectural choices, and hyperparameters are all publicly documented. Data was selected with ethical sourcing as a guiding principle, prioritising high-quality, permissively licensed corpora.

Pre-training

Training followed a two-stage recipe:

Stage 1 (0 → 4.0T tokens) — Broad multilingual data mixture covering the five core languages, plus code, mathematics, and scientific text.

Stage 2 (4.0T → 4.4T tokens) — Cosine annealing over ~400B tokens of higher-quality, curated data.

Villanova-2B-Base-2512-Preview is an intermediate checkpoint of this same training run, released at the 2.2T token mark with an early decay stage applied from 2.0T tokens onward.
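
The decay stages described above can be sketched as a cosine schedule over training tokens. This is a minimal illustration, not the project's training code: it assumes the learning rate sits at the stated peak of 3×10⁻⁴ before the decay begins (warmup is omitted) and anneals to zero, neither of which is specified in this card. The function name and its parameters are hypothetical.

```python
import math

PEAK_LR = 3e-4  # peak learning rate stated in the training settings


def cosine_annealed_lr(tokens, decay_start, decay_end, peak_lr=PEAK_LR, final_lr=0.0):
    """Learning rate after `tokens` training tokens under a cosine decay window."""
    if tokens <= decay_start:
        return peak_lr  # assumption: constant at peak before the decay window
    if tokens >= decay_end:
        return final_lr  # assumption: the final learning rate is not given in the card
    progress = (tokens - decay_start) / (decay_end - decay_start)
    return final_lr + 0.5 * (peak_lr - final_lr) * (1.0 + math.cos(math.pi * progress))


# Stage 2 of the main run: decay from 4.0T to 4.4T tokens.
for tokens in (4.0e12, 4.1e12, 4.2e12, 4.3e12, 4.4e12):
    print(f"{tokens / 1e12:.1f}T tokens: lr = {cosine_annealed_lr(tokens, 4.0e12, 4.4e12):.2e}")
```

Under the same assumptions, the early decay of the 2512 Preview checkpoint corresponds to `decay_start=2.0e12` and `decay_end=2.2e12`. At the midpoint of any window the rate is half the peak.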

Key training settings: AdamW optimizer (β₁=0.9, β₂=0.95, weight decay=0.1), peak learning rate 3×10⁻⁴, BF16/FP8 mixed precision, Flash Attention, sequences of 4,096 tokens. Training ran on 64 NVIDIA H100 GPUs (~30 days, ~36k tokens/GPU/second).
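
The hardware figures can be cross-checked with a back-of-the-envelope calculation. The sketch below treats the quoted throughput as sustained for the whole run and uses the 4.4T token total. Both are simplifications, since the figures in the card are approximate.

```python
GPUS = 64
TOKENS_PER_GPU_PER_SECOND = 36_000  # "~36k tokens/GPU/second"
TOTAL_TOKENS = 4.4e12               # 4.4 trillion training tokens
SECONDS_PER_DAY = 86_400

cluster_tokens_per_second = GPUS * TOKENS_PER_GPU_PER_SECOND
tokens_per_day = cluster_tokens_per_second * SECONDS_PER_DAY
days_at_full_throughput = TOTAL_TOKENS / tokens_per_day

print(f"cluster throughput: {cluster_tokens_per_second:,} tokens/s")
print(f"compute time at full throughput: {days_at_full_throughput:.1f} days")
```

This gives roughly 22 days of uninterrupted compute, the same order of magnitude as the ~30 days quoted. The card does not say what accounts for the remainder.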

How to Use

This is a base model: it continues text rather than following instructions. For chat or task use, see Villanova-2B-2603.

from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "VillanovaAI/Villanova-2B-Base-2603"
device = "cuda"  # or "cpu"

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name).to(device)

prompt = "Gravity is a fundamental force of nature that"
model_inputs = tokenizer([prompt], return_tensors="pt").to(model.device)

generated_ids = model.generate(
    **model_inputs,
    max_new_tokens=128,
    do_sample=True,
    temperature=0.7,
    top_p=0.9,
)
output_ids = generated_ids[0][len(model_inputs.input_ids[0]):]
print(tokenizer.decode(output_ids, skip_special_tokens=True))

Evaluation

Global evaluation:

Model	Avg	arc_easy	hellaswag	hellaswag_de	hellaswag_es	hellaswag_fr	hellaswag_it	openbookqa	piqa	sciq	winogrande	xcopa_it	xnli_de	xnli_en	xnli_es	xnli_fr	xquad_de	xquad_en	xquad_es
EuroLLM-1.7B	48.72	69.07	45.04	37.97	40.98	40.05	39.46	29.80	72.20	90.60	61.25	66.00	47.99	50.24	45.58	49.00	27.50	34.60	29.65
Llama-3.2-1B	46.13	66.29	48.16	34.11	37.41	35.48	34.91	27.80	75.14	93.50	60.69	59.40	46.02	54.82	41.37	46.95	16.37	37.18	14.84
Minerva-3B-base-v1.0	40.73	62.33	46.28	27.20	29.69	29.02	40.01	24.60	74.27	88.00	56.75	69.60	34.54	52.13	36.31	37.35	4.31	14.21	6.52
OLMo-2-0425-1B	47.70	72.73	50.79	29.79	31.34	32.60	29.19	30.00	75.95	95.30	64.72	52.60	40.00	51.77	37.63	42.89	20.34	68.25	32.74
Qwen3-1.7B-Base	53.29	73.61	49.29	37.54	40.73	39.27	38.45	30.20	75.90	95.80	64.01	64.20	46.47	54.50	44.06	45.78	39.59	69.60	50.21
salamandra-2b	50.58	71.04	47.19	38.01	42.07	40.60	38.56	26.80	72.69	91.90	61.72	65.40	47.79	51.97	49.08	48.67	41.73	41.55	33.72
Villanova-2B-Base-2512-Preview	54.26	75.13	48.57	42.06	45.72	44.62	43.32	26.60	75.08	94.40	61.96	68.40	49.36	52.21	49.04	52.33	41.28	66.66	40.03
Villanova-2B-Base-2603	54.91	73.74	49.53	42.91	46.81	45.49	44.21	25.20	74.32	94.10	59.04	68.80	49.48	54.30	49.00	50.72	44.94	72.52	43.37

English only:

Model	Avg	arc_easy	hellaswag	openbookqa	piqa	sciq	winogrande	xnli_en	xquad_en
EuroLLM-1.7B	56.60	69.07	45.04	29.80	72.20	90.60	61.25	50.24	34.60
Llama-3.2-1B	57.95	66.29	48.16	27.80	75.14	93.50	60.69	54.82	37.18
Minerva-3B-base-v1.0	52.32	62.33	46.28	24.60	74.27	88.00	56.75	52.13	14.21
OLMo-2-0425-1B	63.69	72.73	50.79	30.00	75.95	95.30	64.72	51.77	68.25
Qwen3-1.7B-Base	64.11	73.61	49.29	30.20	75.90	95.80	64.01	54.50	69.60
salamandra-2b	58.11	71.04	47.19	26.80	72.69	91.90	61.72	51.97	41.55
Villanova-2B-Base-2512-Preview	62.58	75.13	48.57	26.60	75.08	94.40	61.96	52.21	66.66
Villanova-2B-Base-2603	62.84	73.74	49.53	25.20	74.32	94.10	59.04	54.30	72.52

Multilingual benchmarks:

Model	Avg	hellaswag_de	hellaswag_es	hellaswag_fr	hellaswag_it	xcopa_it	xnli_de	xnli_es	xnli_fr	xquad_de	xquad_es
EuroLLM-1.7B	42.42	37.97	40.98	40.05	39.46	66.00	47.99	45.58	49.00	27.50	29.65
Llama-3.2-1B	36.69	34.11	37.41	35.48	34.91	59.40	46.02	41.37	46.95	16.37	14.84
Minerva-3B-base-v1.0	31.45	27.20	29.69	29.02	40.01	69.60	34.54	36.31	37.35	4.31	6.52
OLMo-2-0425-1B	34.91	29.79	31.34	32.60	29.19	52.60	40.00	37.63	42.89	20.34	32.74
Qwen3-1.7B-Base	44.63	37.54	40.73	39.27	38.45	64.20	46.47	44.06	45.78	39.59	50.21
salamandra-2b	44.56	38.01	42.07	40.60	38.56	65.40	47.79	49.08	48.67	41.73	33.72
Villanova-2B-Base-2512-Preview	47.61	42.06	45.72	44.62	43.32	68.40	49.36	49.04	52.33	41.28	40.03
Villanova-2B-Base-2603	48.57	42.91	46.81	45.49	44.21	68.80	49.48	49.00	50.72	44.94	43.37
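
The Avg columns appear to be unweighted means of the per-task scores. This is inferred from the numbers rather than stated in the card. The sketch below recomputes the three averages for the Villanova-2B-Base-2603 row, with the task split taken from the column headers of the English-only and multilingual tables.

```python
from statistics import mean

# Per-task scores for Villanova-2B-Base-2603, copied from the global table.
scores = {
    "arc_easy": 73.74, "hellaswag": 49.53, "hellaswag_de": 42.91,
    "hellaswag_es": 46.81, "hellaswag_fr": 45.49, "hellaswag_it": 44.21,
    "openbookqa": 25.20, "piqa": 74.32, "sciq": 94.10, "winogrande": 59.04,
    "xcopa_it": 68.80, "xnli_de": 49.48, "xnli_en": 54.30, "xnli_es": 49.00,
    "xnli_fr": 50.72, "xquad_de": 44.94, "xquad_en": 72.52, "xquad_es": 43.37,
}

# Tasks listed in the English-only table; the remaining tasks are multilingual.
ENGLISH_TASKS = [
    "arc_easy", "hellaswag", "openbookqa", "piqa",
    "sciq", "winogrande", "xnli_en", "xquad_en",
]

english_avg = mean(scores[task] for task in ENGLISH_TASKS)
multilingual_avg = mean(v for task, v in scores.items() if task not in ENGLISH_TASKS)
global_avg = mean(scores.values())

print(f"English: {english_avg:.2f}")
print(f"Multilingual: {multilingual_avg:.2f}")
print(f"Global: {global_avg:.2f}")
```

The English and multilingual means reproduce the reported 62.84 and 48.57, and the global mean agrees with the reported 54.91 to within 0.01.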

Long context (RULER):

Note: All models were evaluated with the context length forced to 32k tokens. For models whose native context is shorter than 32k, this exceeds their default length.

Model	Native Context	Avg (32k)
Qwen3-1.7B-Base	32k	0.73
Villanova-2B-Base-2603	32k	0.49
gemma-3-1b-pt	32k	0.28
salamandra-2b	8k	0.12
EuroLLM-1.7B	4k	0.08
OLMo-2-0425-1B	4k	0.00
Villanova-2B-Base-2512-Preview	4k	0.00
Minerva-3B-base-v1.0	16k	0.00

Training Data

The model's training pipeline is divided into two main stages: an initial pre-training stage focused on broad linguistic and factual coverage, and an annealing (decay) stage designed to consolidate knowledge and improve reasoning capabilities. A shorter third stage then extends the context window.

Stage 1: Pre-training

The first stage was trained on approximately 3.6 trillion tokens (occupying ~15 TB of disk space). The distribution prioritizes five core languages while maintaining a global language coverage baseline. The mixture consists of approximately 37.5% English, large allocations for target Latin-script languages (German, Spanish, French, Italian), 5% code, 2% secondary Latin-script languages, and 6% for broader global languages.
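
The stated shares leave the allocation for the four other target languages implicit. The sketch below derives it under two assumptions that the card does not confirm: that the percentages are shares of tokens, and that the listed categories partition the whole mixture.

```python
STAGE1_TOKENS = 3.6e12  # "approximately 3.6 trillion tokens"

# Shares stated explicitly for the Stage 1 mixture.
stated_shares = {
    "English": 0.375,
    "code": 0.05,
    "secondary Latin-script languages": 0.02,
    "broader global languages": 0.06,
}

# Assumption: whatever is left goes to German, Spanish, French, and Italian combined.
remaining_share = 1.0 - sum(stated_shares.values())

for name, share in stated_shares.items():
    print(f"{name}: {share:.1%} = {share * STAGE1_TOKENS / 1e12:.2f}T tokens")
print(f"German + Spanish + French + Italian: {remaining_share:.1%} "
      f"= {remaining_share * STAGE1_TOKENS / 1e12:.2f}T tokens")
```

Under these assumptions, about half of the Stage 1 mixture is shared among the four non-English target languages. How it is split between them is not given.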

The primary datasets utilized in this stage include:

Web Corpora: FineWeb-2, FineWeb-Edu, and FineWeb2-HQ provide a massive multilingual foundation.
Encyclopedic & Academic: FineWiki, alongside academic papers from Arxiv and PubMed (Common Pile).
Structured Text: FinePDFs supplies high-quality text extracted from structured documents.
Quantitative & Technical: FineMath and Stack-Edu establish foundational mathematical reasoning and coding proficiency.

Stage 2: Annealing (Decay)

During the final decay stage, which covers roughly 400 billion tokens, the general web data was partially replaced with a highly curated set of academic, structured, and instructional corpora, with the aim of improving reasoning as the parameters settle during learning-rate decay.

High-quality sources introduced in the annealing stage include:

Common-Pile StackExchange: Q&A threads focusing on technical and scientific domains.
GitHub Issues & Kaggle Notebooks: A curated concatenation of ~11 billion tokens of repository discussions and ~1.7 billion tokens of analytical notebooks to improve technical problem-solving.
FLAN Dolma-Mix Subset: Instruction-formatted text extracted from the Dolma 1.7 dataset, carefully curated to avoid evaluation suite contamination.
Advanced Mathematics: InfiWebMath and FineMath corpora.

Stage 3: Long Context Extension

A final training stage was executed to extend the model's effective context window, processing an additional 50 billion tokens. The data distribution for this stage resembles the annealing mixture, but employs a shifted sampling strategy that strictly prioritizes long-form documents. This targeted approach ensures the model can efficiently process and retrieve information across extended sequences while preserving the high reasoning and knowledge density established during the decay stage.