Apertus EstLLM 8B 1125 Instruct
tartuNLP/Apertus-EstLLM-8B-Instruct-1125 is primarily a research artifact and is not intended for production use. It was produced by applying a training procedure identical to that of tartuNLP/Llama-3.1-EstLLM-8B-Instruct-0825. However, unlike with the Llama-based model, our metrics show no improvement over the original swiss-ai/Apertus-8B-Instruct-2509, although the model's score on the baromeeter.ai platform is slightly higher than the original's.
Use with transformers
```python
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

model_name = "tartuNLP/Apertus-EstLLM-8B-Instruct-1125"

model = AutoModelForCausalLM.from_pretrained(
    model_name,
    dtype="auto",
    device_map="auto",
)
# To use the model on Apple silicon, load it as follows instead:
# model = AutoModelForCausalLM.from_pretrained(
#     model_name,
#     dtype=torch.float16,
#     device_map="mps",
# )
tokenizer = AutoTokenizer.from_pretrained(model_name)

messages = [
    {"role": "user", "content": "Kas sa räägid eesti keelt?"}  # "Do you speak Estonian?"
]
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
)
model_inputs = tokenizer(text, return_tensors="pt").to(model.device)

generated_ids = model.generate(
    **model_inputs,
    max_new_tokens=128,
    do_sample=True,
    temperature=0.4,
    # specify the EOS token so generation stops at the end of the assistant response
    eos_token_id=tokenizer.eos_token_id,
)
# generated_ids includes the input tokens as well, so only the new tokens are decoded
response = tokenizer.decode(
    generated_ids[0][model_inputs["input_ids"].shape[1]:],
    skip_special_tokens=True,
)
print(response)
```
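For quick experiments, the same interaction can also be run through the high-level pipeline API, which applies the chat template automatically. This is a minimal sketch, not part of the original card:

```python
from transformers import pipeline

# The pipeline handles chat templating and decoding internally.
generator = pipeline(
    "text-generation",
    model="tartuNLP/Apertus-EstLLM-8B-Instruct-1125",
    device_map="auto",
)
messages = [{"role": "user", "content": "Kas sa räägid eesti keelt?"}]
output = generator(messages, max_new_tokens=128, do_sample=True, temperature=0.4)
# The result contains the full conversation; the last message is the assistant reply.
print(output[0]["generated_text"][-1]["content"])
```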
Model Details
Model Description
- Developed by: TartuNLP and TalTechNLP research groups
- Funded by: Estonian Ministry of Education and Research, “Estonian Language Technology Program 2018-2027”
- Model type: Causal Language Model, Instruction-following
- Language(s) (NLP): Estonian, English
- License: Apache 2.0 (following the Apertus base model)
- Finetuned from model: swiss-ai/Apertus-8B-2509
Continued Pre-Training
Continued Pre-Training was performed for a single epoch on:
- Estonian National Corpus (8.6B tokens)
- Python-Edu (3.3B tokens)
- FineMath4-Plus (9.5B tokens)
- General Instruction-Augmented Corpora (7.4B tokens)
- Cosmopedia v2 (6.9B tokens)
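For reference, the listed token counts imply the following mixture proportions. This is a simple sketch derived from the numbers above; the dataset names are as listed in this card, not Hugging Face dataset IDs:

```python
# Continued pre-training mixture implied by the token counts listed above.
cpt_tokens_billions = {
    "Estonian National Corpus": 8.6,
    "Python-Edu": 3.3,
    "FineMath4-Plus": 9.5,
    "General Instruction-Augmented Corpora": 7.4,
    "Cosmopedia v2": 6.9,
}
total = sum(cpt_tokens_billions.values())  # 35.7B tokens in total
for name, tokens in cpt_tokens_billions.items():
    print(f"{name}: {tokens:.1f}B tokens ({tokens / total:.1%})")
```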
Supervised Fine-Tuning
Approximately 764k examples were used for Supervised Fine-Tuning, drawn mainly from the Tulu 3 SFT mixture and EuroBlocks. Additional data provided by the Institute of the Estonian Language (EKI) was also used. In total, about 80% of the examples are in English. More details TBA.
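Below is a hypothetical sketch of the SFT step using TRL's SFTTrainer. The base checkpoint path, dataset selection, and hyperparameters are illustrative assumptions; the exact configuration is not published here:

```python
from datasets import load_dataset
from trl import SFTConfig, SFTTrainer

# The Tulu 3 SFT mixture is one of the publicly available components named above;
# the EuroBlocks and EKI data, and the ~80/20 language mix, are omitted here.
train_dataset = load_dataset("allenai/tulu-3-sft-mixture", split="train")

trainer = SFTTrainer(
    model="path/to/continued-pretrained-checkpoint",  # assumed local checkpoint
    train_dataset=train_dataset,
    args=SFTConfig(output_dir="estllm-sft"),
)
trainer.train()
```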
Direct Preference Optimization
The English-only HelpSteer3 dataset was used as-is for the Direct Preference Optimization step, since previous research on the Poro 2 models showed no observable benefit from translating preference pairs.
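Again, a hypothetical sketch of this step with TRL's DPOTrainer; the checkpoint path and hyperparameters are assumptions, and the HelpSteer3 rows would first need to be mapped to the prompt/chosen/rejected format that DPOTrainer expects:

```python
from datasets import load_dataset
from trl import DPOConfig, DPOTrainer

# The "preference" subset of HelpSteer3 contains pairwise response comparisons;
# conversion to prompt/chosen/rejected columns is omitted here for brevity.
pref = load_dataset("nvidia/HelpSteer3", "preference", split="train")

trainer = DPOTrainer(
    model="path/to/sft-checkpoint",  # assumed local SFT checkpoint
    train_dataset=pref,
    args=DPOConfig(output_dir="estllm-dpo", beta=0.1),
)
trainer.train()
```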
Evaluation
Logits-based
Scores for logits-based evaluation benchmarks are available on the EuroEval leaderboard.
Generative
Every benchmark in this category is treated as a generative problem: evaluation is performed on model responses generated at temperature 0, rather than on logits (see the decoding sketch below). Top scores are highlighted in bold, and second-best scores in italic bold. Rows are sorted in descending order by the models' parameter counts (not by score). The test set of each dataset is used for evaluation unless noted otherwise.
Note that all models are evaluated with the same prompt template for comparability, meaning that the scores do not necessarily represent each model's best possible performance. This is especially the case for deepseek-ai/DeepSeek-V3-0324 on some of the benchmarks.
Only models of comparable size are evaluated on the benchmarks in English.
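Concretely, responses for these benchmarks are obtained with greedy decoding. A minimal sketch, reusing model and model_inputs from the usage example above:

```python
# Temperature 0 corresponds to greedy decoding in transformers.
generated_ids = model.generate(
    **model_inputs,
    max_new_tokens=512,   # illustrative limit; per-benchmark limits may differ
    do_sample=False,      # deterministic, temperature-0 generation
)
```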
Instruction-following
Estonian
Instruction-level strict accuracy is reported for IFEval-et.
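A prompt in IFEval can contain several verifiable instructions; instruction-level strict accuracy scores each instruction separately:

```latex
\text{instruction-level strict accuracy} =
  \frac{\#\,\text{instructions followed exactly}}{\#\,\text{instructions evaluated}}
```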
| Model (# parameters ↓) | IFEval-et |
|---|---|
| moonshotai/Kimi-K2-Instruct | **0.7891** |
| deepseek-ai/DeepSeek-V3.2 | 0.7221 |
| deepseek-ai/DeepSeek-V3-0324 | 0.7171 |
| mistralai/Mistral-Large-3-675B-Instruct-2512 | 0.7097 |
| meta-llama/Llama-3.1-405B-Instruct | 0.7159 |
| meta-llama/Llama-3.3-70B-Instruct | ***0.7705*** |
| Qwen/Qwen2.5-72B-Instruct | 0.7407 |
| google/gemma-3-27b-it | 0.7655 |
| google/gemma-3-12b-it | 0.7556 |
| utter-project/EuroLLM-9B-Instruct | 0.5397 |
| mistralai/Ministral-3-8B-Instruct-2512 | 0.4888 |
| swiss-ai/Apertus-8B-Instruct-2509 | 0.5484 |
| tartuNLP/Apertus-EstLLM-8B-Instruct-1125 | 0.4665 |
| meta-llama/Llama-3.1-8B-Instruct | 0.3797 |
| tartuNLP/Llama-3.1-EstLLM-8B-Instruct-0825 | 0.5174 |
| BSC-LT/salamandra-7b-instruct | 0.5195 |
| tartuNLP/Llammas | 0.3524 |
| Qwen/Qwen2.5-7B-Instruct | 0.4988 |
English
Instruction-level strict accuracy is reported for IFEval-en.
| Model (# parameters ↓) | IFEval-en |
|---|---|
| utter-project/EuroLLM-9B-Instruct | 0.7004 |
| mistralai/Ministral-3-8B-Instruct-2512 | 0.6845 |
| swiss-ai/Apertus-8B-Instruct-2509 | 0.7808 |
| tartuNLP/Apertus-EstLLM-8B-Instruct-1125 | 0.6638 |
| meta-llama/Llama-3.1-8B-Instruct | **0.8106** |
| tartuNLP/Llama-3.1-EstLLM-8B-Instruct-0825 | 0.7527 |
| tartuNLP/Llammas | 0.4373 |
| BSC-LT/salamandra-7b-instruct | 0.3289 |
| Qwen/Qwen2.5-7B-Instruct | ***0.7954*** |
Multiple Choice
All datasets are evaluated 0-shot, except Winogrande-et, which is evaluated 3-shot. Exact-match accuracy is reported for every dataset.
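Exact match here means the generated answer must match the gold option after trivial normalization. A minimal sketch; the normalization scheme is an assumption, not documented for these benchmarks:

```python
# Illustrative exact-match scoring for the multiple-choice benchmarks.
predictions = ["B", " c", "A"]  # hypothetical model outputs
golds = ["B", "C", "D"]         # hypothetical gold answers

def exact_match(prediction: str, gold: str) -> bool:
    # Case and surrounding-whitespace normalization; assumed, not documented.
    return prediction.strip().lower() == gold.strip().lower()

accuracy = sum(exact_match(p, g) for p, g in zip(predictions, golds)) / len(golds)
print(f"{accuracy:.2f}")  # 0.67
```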
Estonian Language Competence
| Model (# parameters ↓) | Grammar-et | Inflection-et | Word-Meanings-et |
|---|---|---|---|
| moonshotai/Kimi-K2-Instruct | **0.916** | 0.6458 | **0.9689** |
| deepseek-ai/DeepSeek-V3.2 | 0.781 | 0.6891 | 0.8134 |
| deepseek-ai/DeepSeek-V3-0324 | 0.364 | 0 | 0 |
| mistralai/Mistral-Large-3-675B-Instruct-2512 | 0.796 | ***0.8355*** | 0.9488 |
| meta-llama/Llama-3.1-405B-Instruct | ***0.818*** | **0.9089** | 0.9438 |
| meta-llama/Llama-3.3-70B-Instruct | 0.797 | 0.6421 | 0.9408 |
| Qwen/Qwen2.5-72B-Instruct | 0.694 | 0.5208 | 0.9057 |
| google/gemma-3-27b-it | 0.817 | 0.5934 | 0.9529 |
| google/gemma-3-12b-it | 0.789 | 0.4227 | 0.9318 |
| utter-project/EuroLLM-9B-Instruct | 0.764 | 0.367 | 0.9258 |
| mistralai/Ministral-3-8B-Instruct-2512 | 0.562 | 0.4833 | 0.8395 |
| swiss-ai/Apertus-8B-Instruct-2509 | 0.512 | 0.3662 | 0.9027 |
| tartuNLP/Apertus-EstLLM-8B-Instruct-1125 | 0.646 | 0.421 | 0.9178 |
| meta-llama/Llama-3.1-8B-Instruct | 0.657 | 0.4165 | 0.8335 |
| tartuNLP/Llama-3.1-EstLLM-8B-Instruct-0825 | 0.692 | 0.5188 | ***0.9569*** |
| BSC-LT/salamandra-7b-instruct | 0.594 | 0.2668 | 0.8084 |
| Qwen/Qwen2.5-7B-Instruct | 0.598 | 0.4136 | 0.7984 |
| tartuNLP/Llammas | 0.529 | 0.2289 | 0.5326 |
Knowledge and Reasoning (Estonian)
| Model (# parameters ↓) | Winogrande-et | Trivia-et | Exam-et | GlobalPIQA-et | TruthfulQA-et |
|---|---|---|---|---|---|
| moonshotai/Kimi-K2-Instruct | **0.8138** | 0.4225 | **0.8414** | **0.79** | **0.7136** |
| deepseek-ai/DeepSeek-V3.2 | 0.4805 | 0.38 | 0.614 | 0.7 | 0.5863 |
| deepseek-ai/DeepSeek-V3-0324 | ***0.8042*** | 0.27 | 0.1221 | 0.04 | 0.2093 |
| mistralai/Mistral-Large-3-675B-Instruct-2512 | 0.7487 | ***0.4275*** | 0.7931 | ***0.73*** | 0.6854 |
| meta-llama/Llama-3.1-405B-Instruct | 0.7878 | **0.4713** | ***0.8309*** | 0.58 | ***0.7001*** |
| meta-llama/Llama-3.3-70B-Instruct | 0.7397 | 0.3875 | 0.7652 | 0.58 | 0.6255 |
| Qwen/Qwen2.5-72B-Instruct | 0.7227 | 0.315 | 0.7162 | 0.65 | 0.6683 |
| google/gemma-3-27b-it | 0.7510 | 0.325 | 0.7751 | 0.71 | 0.5814 |
| google/gemma-3-12b-it | 0.6712 | 0.3237 | 0.7069 | 0.54 | 0.3158 |
| utter-project/EuroLLM-9B-Instruct | 0.5846 | 0.3738 | 0.5589 | 0.55 | 0.2889 |
| mistralai/Ministral-3-8B-Instruct-2512 | 0.5812 | 0.3125 | 0.5012 | 0.48 | 0.3525 |
| swiss-ai/Apertus-8B-Instruct-2509 | 0.5105 | 0.345 | 0.552 | 0.59 | 0.366 |
| tartuNLP/Apertus-EstLLM-8B-Instruct-1125 | 0.5467 | 0.3575 | 0.5651 | 0.63 | 0.3696 |
| meta-llama/Llama-3.1-8B-Instruct | 0.5399 | 0.2888 | 0.5 | 0.54 | 0.437 |
| tartuNLP/Llama-3.1-EstLLM-8B-Instruct-0825 | 0.5812 | 0.425 | 0.5093 | 0.63 | 0.3525 |
| BSC-LT/salamandra-7b-instruct | 0.2878 | 0.2875 | 0.3556 | 0.55 | 0.3011 |
| Qwen/Qwen2.5-7B-Instruct | 0.5473 | 0.2938 | 0.4913 | 0.57 | 0.4113 |
| tartuNLP/Llammas | 0.5037 | 0.2838 | 0.3649 | 0.01 | 0.2032 |
Knowledge and Reasoning (English)
| Model (# parameters ↓) | Winogrande | GlobalPIQA-en | TruthfulQA | MMLU-Redux | GSM8K |
|---|---|---|---|---|---|
| utter-project/EuroLLM-9B-Instruct | 0.5059 | 0.58 | 0.2962 | 0.5741 | 0.5944 |
| meta-llama/Llama-3.1-8B-Instruct | 0.5625 | 0.76 | ***0.5239*** | 0.6959 | ***0.7710*** |
| mistralai/Ministral-3-8B-Instruct-2512 | ***0.6503*** | ***0.77*** | 0.519 | ***0.7418*** | 0.3927 |
| swiss-ai/Apertus-8B-Instruct-2509 | 0.5133 | 0.73 | 0.3831 | 0.6099 | 0.5936 |
| tartuNLP/Apertus-EstLLM-8B-Instruct-1125 | 0.5348 | 0.56 | 0.3647 | 0.5944 | 0.5277 |
| tartuNLP/Llama-3.1-EstLLM-8B-Instruct-0825 | 0.6084 | 0.71 | 0.366 | 0.6388 | 0.7202 |
| tartuNLP/Llammas | 0.498 | 0 | 0.1971 | 0.3417 | 0.1456 |
| BSC-LT/salamandra-7b-instruct | 0.4029 | 0.63 | 0.2717 | 0.5180 | 0.0076 |
| Qwen/Qwen2.5-7B-Instruct | **0.6627** | **0.83** | **0.5875** | **0.7555** | **0.7862** |
Translation
English to Estonian
| Model | wmt24pp (BLEU ↑) |
|---|---|
| BSC-LT/salamandraTA-7b-instruct | **0.2713** |
| tartuNLP/Apertus-EstLLM-8B-Instruct-1125 | 0.2609 |
| tartuNLP/Llama-3.1-EstLLM-8B-Instruct-0825 | ***0.264*** |
| utter-project/EuroLLM-9B-Instruct | 0.2602 |
| swiss-ai/Apertus-8B-Instruct-2509 | 0.2372 |
| tartuNLP/Llammas | 0.1472 |
| meta-llama/Llama-3.1-8B-Instruct | 0.1406 |
| BSC-LT/salamandra-7b-instruct | 0.1201 |
| Qwen/Qwen2.5-7B-Instruct | 0.0476 |
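For reference, corpus-level BLEU can be computed with sacrebleu as sketched below; whether the reported wmt24pp scores used exactly these settings is an assumption:

```python
import sacrebleu

hypotheses = ["Tere, maailm!"]    # system translations, one per source segment
references = [["Tere, maailm!"]]  # one list of references per reference stream

bleu = sacrebleu.corpus_bleu(hypotheses, references)
print(bleu.score / 100)  # the table above reports BLEU on a 0-1 scale
```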
Limitations
We do not recommend using the model for anything other than research purposes, due to the apparent degradation in quality compared to the original model.
Citation
TBA