Apertus EstLLM 8B 1125 Instruct
tartuNLP/Apertus-EstLLM-8B-Instruct-1125 is primarily a research artifact and is not intended for production use. It was produced by applying a training procedure identical to that of tartuNLP/Llama-3.1-EstLLM-8B-Instruct-0825. However, unlike with the Llama-based model, our metrics show no improvement over the original swiss-ai/Apertus-8B-Instruct-2509, although the model's score on the baromeeter.ai platform is slightly higher than the original's.
Use with transformers
```python
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

model_name = "tartuNLP/Apertus-EstLLM-8B-Instruct-1125"

model = AutoModelForCausalLM.from_pretrained(
    model_name,
    dtype="auto",
    device_map="auto",
)
# To use the model on Apple silicon, load it as follows instead:
# model = AutoModelForCausalLM.from_pretrained(
#     model_name,
#     dtype=torch.float16,
#     device_map="mps",
# )
tokenizer = AutoTokenizer.from_pretrained(model_name)

messages = [
    {"role": "user", "content": "Kas sa räägid eesti keelt?"}  # "Do you speak Estonian?"
]
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
)
model_inputs = tokenizer(text, return_tensors="pt").to(model.device)

generated_ids = model.generate(
    **model_inputs,
    max_new_tokens=128,
    do_sample=True,
    temperature=0.4,
    # specify the EOS token so generation stops at the end of the assistant response
    eos_token_id=tokenizer.eos_token_id,
)
# generated_ids includes the input tokens as well, so only the new tokens are decoded
response = tokenizer.decode(
    generated_ids[0][model_inputs["input_ids"].shape[1]:],
    skip_special_tokens=True,
)
print(response)
```
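For quick experiments, the same interaction can also be run through the high-level pipeline API, which applies the chat template automatically. This is a minimal sketch, not part of the original card:

```python
from transformers import pipeline

# The pipeline handles chat templating and decoding internally.
generator = pipeline(
    "text-generation",
    model="tartuNLP/Apertus-EstLLM-8B-Instruct-1125",
    device_map="auto",
)
messages = [{"role": "user", "content": "Kas sa räägid eesti keelt?"}]
output = generator(messages, max_new_tokens=128, do_sample=True, temperature=0.4)
# The result contains the full conversation; the last message is the assistant reply.
print(output[0]["generated_text"][-1]["content"])
```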
Model Details
Model Description
- Developed by: TartuNLP and TalTechNLP research groups
- Funded by: Estonian Ministry of Education and Research, “Estonian Language Technology Program 2018-2027”
- Model type: Causal Language Model, Instruction-following
- Language(s) (NLP): Estonian, English
- License: Apache 2.0 (following the Apertus base model)
- Finetuned from model: swiss-ai/Apertus-8B-2509
Continued Pre-Training
Continued Pre-Training was performed for a single epoch on:
- Estonian National Corpus (8.6B tokens)
- Python-Edu (3.3B tokens)
- FineMath4-Plus (9.5B tokens)
- General Instruction-Augmented Corpora (7.4B tokens)
- Cosmopedia v2 (6.9B tokens)
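For reference, the listed token counts imply the following mixture proportions. This is a simple sketch derived from the numbers above; the dataset names are as listed in this card, not Hugging Face dataset IDs:

```python
# Continued pre-training mixture implied by the token counts listed above.
cpt_tokens_billions = {
    "Estonian National Corpus": 8.6,
    "Python-Edu": 3.3,
    "FineMath4-Plus": 9.5,
    "General Instruction-Augmented Corpora": 7.4,
    "Cosmopedia v2": 6.9,
}
total = sum(cpt_tokens_billions.values())  # 35.7B tokens in total
for name, tokens in cpt_tokens_billions.items():
    print(f"{name}: {tokens:.1f}B tokens ({tokens / total:.1%})")
```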
Supervised Fine-Tuning
Approximately 764k examples were used for Supervised Fine-Tuning, drawn mainly from the Tulu 3 SFT mixture and EuroBlocks. Additional data provided by the Institute of the Estonian Language (EKI) was also used. In total, about 80% of the examples are in English. More details TBA.
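Below is a hypothetical sketch of the SFT step using TRL's SFTTrainer. The base checkpoint path, dataset selection, and hyperparameters are illustrative assumptions; the exact configuration is not published here:

```python
from datasets import load_dataset
from trl import SFTConfig, SFTTrainer

# The Tulu 3 SFT mixture is one of the publicly available components named above;
# the EuroBlocks and EKI data, and the ~80/20 language mix, are omitted here.
train_dataset = load_dataset("allenai/tulu-3-sft-mixture", split="train")

trainer = SFTTrainer(
    model="path/to/continued-pretrained-checkpoint",  # assumed local checkpoint
    train_dataset=train_dataset,
    args=SFTConfig(output_dir="estllm-sft"),
)
trainer.train()
```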
Direct Preference Optimization
The English-only HelpSteer3 dataset was used as-is for the Direct Preference Optimization step, since previous research on the Poro 2 models showed no observable benefit from translating preference pairs.
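Again, a hypothetical sketch of this step with TRL's DPOTrainer; the checkpoint path and hyperparameters are assumptions, and the HelpSteer3 rows would first need to be mapped to the prompt/chosen/rejected format that DPOTrainer expects:

```python
from datasets import load_dataset
from trl import DPOConfig, DPOTrainer

# The "preference" subset of HelpSteer3 contains pairwise response comparisons;
# conversion to prompt/chosen/rejected columns is omitted here for brevity.
pref = load_dataset("nvidia/HelpSteer3", "preference", split="train")

trainer = DPOTrainer(
    model="path/to/sft-checkpoint",  # assumed local SFT checkpoint
    train_dataset=pref,
    args=DPOConfig(output_dir="estllm-dpo", beta=0.1),
)
trainer.train()
```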
Evaluation
Logits-based
Scores for logits-based evaluation benchmarks are available on the EuroEval leaderboard.
Generative
Every benchmark in this category is treated as a generative problem: evaluation is performed on model responses generated at temperature 0, rather than on logits (see the decoding sketch below). Top scores are highlighted in bold, and second-best scores in italic bold. Rows are sorted in descending order by the models' parameter counts (not by score). The test set of each dataset is used for evaluation unless noted otherwise.
Note that all models are evaluated with the same prompt template for comparability, meaning that the scores do not necessarily represent each model's best possible performance. This is especially the case for deepseek-ai/DeepSeek-V3-0324 on some of the benchmarks.
Only models of comparable size are evaluated on the benchmarks in English.
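Concretely, responses for these benchmarks are obtained with greedy decoding. A minimal sketch, reusing model and model_inputs from the usage example above:

```python
# Temperature 0 corresponds to greedy decoding in transformers.
generated_ids = model.generate(
    **model_inputs,
    max_new_tokens=512,   # illustrative limit; per-benchmark limits may differ
    do_sample=False,      # deterministic, temperature-0 generation
)
```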
Instruction-following
Estonian
Instruction-level strict accuracy is reported for IFEval-et.
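A prompt in IFEval can contain several verifiable instructions; instruction-level strict accuracy scores each instruction separately:

```latex
\text{instruction-level strict accuracy} =
  \frac{\#\,\text{instructions followed exactly}}{\#\,\text{instructions evaluated}}
```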
| Model (# parameters ↓) | IFEval-et |
|---|---|
| moonshotai/Kimi-K2-Instruct | **0.7891** |
| deepseek-ai/DeepSeek-V3.2 | 0.7221 |
| deepseek-ai/DeepSeek-V3-0324 | 0.7171 |
| mistralai/Mistral-Large-3-675B-Instruct-2512 | 0.7097 |
| meta-llama/Llama-3.1-405B-Instruct | 0.7159 |
| meta-llama/Llama-3.3-70B-Instruct | ***0.7705*** |
| Qwen/Qwen2.5-72B-Instruct | 0.7407 |
| google/gemma-3-27b-it | 0.7655 |
| google/gemma-3-12b-it | 0.7556 |
| utter-project/EuroLLM-9B-Instruct | 0.5397 |
| mistralai/Ministral-3-8B-Instruct-2512 | 0.4888 |
| swiss-ai/Apertus-8B-Instruct-2509 | 0.5484 |
| tartuNLP/Apertus-EstLLM-8B-Instruct-1125 | 0.4665 |
| meta-llama/Llama-3.1-8B-Instruct | 0.3797 |
| tartuNLP/Llama-3.1-EstLLM-8B-Instruct-0825 | 0.5174 |
| BSC-LT/salamandra-7b-instruct | 0.5195 |
| tartuNLP/Llammas | 0.3524 |
| Qwen/Qwen2.5-7B-Instruct | 0.4988 |
English
Instruction-level strict accuracy is reported for IFEval-en.
| Model (# parameters ↓) | IFEval-en |
|---|---|
| utter-project/EuroLLM-9B-Instruct | 0.7004 |
| mistralai/Ministral-3-8B-Instruct-2512 | 0.6845 |
| swiss-ai/Apertus-8B-Instruct-2509 | 0.7808 |
| tartuNLP/Apertus-EstLLM-8B-Instruct-1125 | 0.6638 |
| meta-llama/Llama-3.1-8B-Instruct | **0.8106** |
| tartuNLP/Llama-3.1-EstLLM-8B-Instruct-0825 | 0.7527 |
| tartuNLP/Llammas | 0.4373 |
| BSC-LT/salamandra-7b-instruct | 0.3289 |
| Qwen/Qwen2.5-7B-Instruct | ***0.7954*** |
Multiple Choice
All datasets are evaluated 0-shot, except Winogrande-et, which is evaluated 3-shot. Exact-match accuracy is reported for every dataset.
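Exact match here means the generated answer must match the gold option after trivial normalization. A minimal sketch; the normalization scheme is an assumption, not documented for these benchmarks:

```python
# Illustrative exact-match scoring for the multiple-choice benchmarks.
predictions = ["B", " c", "A"]  # hypothetical model outputs
golds = ["B", "C", "D"]         # hypothetical gold answers

def exact_match(prediction: str, gold: str) -> bool:
    # Case and surrounding-whitespace normalization; assumed, not documented.
    return prediction.strip().lower() == gold.strip().lower()

accuracy = sum(exact_match(p, g) for p, g in zip(predictions, golds)) / len(golds)
print(f"{accuracy:.2f}")  # 0.67
```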
Estonian Language Competence
| Model (# parameters ↓) | Grammar-et | Inflection-et | Word-Meanings-et |
|---|---|---|---|
| moonshotai/Kimi-K2-Instruct | **0.916** | 0.6458 | **0.9689** |
| deepseek-ai/DeepSeek-V3.2 | 0.781 | 0.6891 | 0.8134 |
| deepseek-ai/DeepSeek-V3-0324 | 0.364 | 0 | 0 |
| mistralai/Mistral-Large-3-675B-Instruct-2512 | 0.796 | ***0.8355*** | 0.9488 |
| meta-llama/Llama-3.1-405B-Instruct | ***0.818*** | **0.9089** | 0.9438 |
| meta-llama/Llama-3.3-70B-Instruct | 0.797 | 0.6421 | 0.9408 |
| Qwen/Qwen2.5-72B-Instruct | 0.694 | 0.5208 | 0.9057 |
| google/gemma-3-27b-it | 0.817 | 0.5934 | 0.9529 |
| google/gemma-3-12b-it | 0.789 | 0.4227 | 0.9318 |
| utter-project/EuroLLM-9B-Instruct | 0.764 | 0.367 | 0.9258 |
| mistralai/Ministral-3-8B-Instruct-2512 | 0.562 | 0.4833 | 0.8395 |
| swiss-ai/Apertus-8B-Instruct-2509 | 0.512 | 0.3662 | 0.9027 |
| tartuNLP/Apertus-EstLLM-8B-Instruct-1125 | 0.646 | 0.421 | 0.9178 |
| meta-llama/Llama-3.1-8B-Instruct | 0.657 | 0.4165 | 0.8335 |
| tartuNLP/Llama-3.1-EstLLM-8B-Instruct-0825 | 0.692 | 0.5188 | ***0.9569*** |
| BSC-LT/salamandra-7b-instruct | 0.594 | 0.2668 | 0.8084 |
| Qwen/Qwen2.5-7B-Instruct | 0.598 | 0.4136 | 0.7984 |
| tartuNLP/Llammas | 0.529 | 0.2289 | 0.5326 |
Knowledge and Reasoning (Estonian)
| Model (# parameters ↓) | Winogrande-et | Trivia-et | Exam-et | GlobalPIQA-et | TruthfulQA-et |
|---|---|---|---|---|---|
| moonshotai/Kimi-K2-Instruct | **0.8138** | 0.4225 | **0.8414** | **0.79** | **0.7136** |
| deepseek-ai/DeepSeek-V3.2 | 0.4805 | 0.38 | 0.614 | 0.7 | 0.5863 |
| deepseek-ai/DeepSeek-V3-0324 | ***0.8042*** | 0.27 | 0.1221 | 0.04 | 0.2093 |
| mistralai/Mistral-Large-3-675B-Instruct-2512 | 0.7487 | ***0.4275*** | 0.7931 | ***0.73*** | 0.6854 |
| meta-llama/Llama-3.1-405B-Instruct | 0.7878 | **0.4713** | ***0.8309*** | 0.58 | ***0.7001*** |
| meta-llama/Llama-3.3-70B-Instruct | 0.7397 | 0.3875 | 0.7652 | 0.58 | 0.6255 |
| Qwen/Qwen2.5-72B-Instruct | 0.7227 | 0.315 | 0.7162 | 0.65 | 0.6683 |
| google/gemma-3-27b-it | 0.7510 | 0.325 | 0.7751 | 0.71 | 0.5814 |
| google/gemma-3-12b-it | 0.6712 | 0.3237 | 0.7069 | 0.54 | 0.3158 |
| utter-project/EuroLLM-9B-Instruct | 0.5846 | 0.3738 | 0.5589 | 0.55 | 0.2889 |
| mistralai/Ministral-3-8B-Instruct-2512 | 0.5812 | 0.3125 | 0.5012 | 0.48 | 0.3525 |
| swiss-ai/Apertus-8B-Instruct-2509 | 0.5105 | 0.345 | 0.552 | 0.59 | 0.366 |
| tartuNLP/Apertus-EstLLM-8B-Instruct-1125 | 0.5467 | 0.3575 | 0.5651 | 0.63 | 0.3696 |
| meta-llama/Llama-3.1-8B-Instruct | 0.5399 | 0.2888 | 0.5 | 0.54 | 0.437 |
| tartuNLP/Llama-3.1-EstLLM-8B-Instruct-0825 | 0.5812 | 0.425 | 0.5093 | 0.63 | 0.3525 |
| BSC-LT/salamandra-7b-instruct | 0.2878 | 0.2875 | 0.3556 | 0.55 | 0.3011 |
| Qwen/Qwen2.5-7B-Instruct | 0.5473 | 0.2938 | 0.4913 | 0.57 | 0.4113 |
| tartuNLP/Llammas | 0.5037 | 0.2838 | 0.3649 | 0.01 | 0.2032 |
Knowledge and Reasoning (English)
| Model (# parameters ↓) | Winogrande | GlobalPIQA-en | TruthfulQA | MMLU-Redux | GSM8K |
|---|---|---|---|---|---|
| utter-project/EuroLLM-9B-Instruct | 0.5059 | 0.58 | 0.2962 | 0.5741 | 0.5944 |
| meta-llama/Llama-3.1-8B-Instruct | 0.5625 | 0.76 | ***0.5239*** | 0.6959 | ***0.7710*** |
| mistralai/Ministral-3-8B-Instruct-2512 | ***0.6503*** | ***0.77*** | 0.519 | ***0.7418*** | 0.3927 |
| swiss-ai/Apertus-8B-Instruct-2509 | 0.5133 | 0.73 | 0.3831 | 0.6099 | 0.5936 |
| tartuNLP/Apertus-EstLLM-8B-Instruct-1125 | 0.5348 | 0.56 | 0.3647 | 0.5944 | 0.5277 |
| tartuNLP/Llama-3.1-EstLLM-8B-Instruct-0825 | 0.6084 | 0.71 | 0.366 | 0.6388 | 0.7202 |
| tartuNLP/Llammas | 0.498 | 0 | 0.1971 | 0.3417 | 0.1456 |
| BSC-LT/salamandra-7b-instruct | 0.4029 | 0.63 | 0.2717 | 0.5180 | 0.0076 |
| Qwen/Qwen2.5-7B-Instruct | **0.6627** | **0.83** | **0.5875** | **0.7555** | **0.7862** |
Translation
English to Estonian
| Model | wmt24pp (BLEU ↑) |
|---|---|
| BSC-LT/salamandraTA-7b-instruct | **0.2713** |
| tartuNLP/Apertus-EstLLM-8B-Instruct-1125 | 0.2609 |
| tartuNLP/Llama-3.1-EstLLM-8B-Instruct-0825 | ***0.264*** |
| utter-project/EuroLLM-9B-Instruct | 0.2602 |
| swiss-ai/Apertus-8B-Instruct-2509 | 0.2372 |
| tartuNLP/Llammas | 0.1472 |
| meta-llama/Llama-3.1-8B-Instruct | 0.1406 |
| BSC-LT/salamandra-7b-instruct | 0.1201 |
| Qwen/Qwen2.5-7B-Instruct | 0.0476 |
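For reference, corpus-level BLEU can be computed with sacrebleu as sketched below; whether the reported wmt24pp scores used exactly these settings is an assumption:

```python
import sacrebleu

hypotheses = ["Tere, maailm!"]    # system translations, one per source segment
references = [["Tere, maailm!"]]  # one list of references per reference stream

bleu = sacrebleu.corpus_bleu(hypotheses, references)
print(bleu.score / 100)  # the table above reports BLEU on a 0-1 scale
```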
Limitations
We do not recommend using the model for anything other than research purposes, due to the apparent degradation in quality compared to the original model.
Citation
TBA