polish-roberta-base-8k

This is a small model trained using knowledge distillation from polish-roberta-8k. For distillation, we used a combination of several loss functions: KL divergence for the MLM head, as well as MSE and cosine loss to align the last-layer representations of the teacher and the student. The model was trained for one epoch on the Polish subset of the FineWeb 2 dataset. We prepared a packed version of the dataset by concatenating documents into sequences of exactly 8192 tokens. We then trained the model using batches of 32 such sequences and a polynomial learning rate scheduler with a warmup of 1000 iterations and a maximum learning rate of 5e-5.
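As a rough sketch, the combined distillation objective described above might look like the following. The temperature, equal loss weights, and matched hidden sizes are assumptions for illustration, not details taken from the actual training setup; if the teacher's hidden size differs from the student's, a learned projection would be needed before the alignment losses.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits,
                      student_hidden, teacher_hidden,
                      temperature=2.0):
    # KL divergence between the teacher's and student's MLM distributions.
    # F.kl_div expects log-probabilities as input and probabilities as target.
    kl = F.kl_div(
        F.log_softmax(student_logits / temperature, dim=-1),
        F.softmax(teacher_logits / temperature, dim=-1),
        reduction="batchmean",
    ) * temperature ** 2

    # MSE between last-layer hidden states of teacher and student
    # (assumes equal hidden sizes; otherwise project the student first).
    mse = F.mse_loss(student_hidden, teacher_hidden)

    # Cosine loss: 1 - cosine similarity, averaged over all token positions.
    cos = (1.0 - F.cosine_similarity(student_hidden, teacher_hidden, dim=-1)).mean()

    # Equal weighting is an assumption; the real setup may weight terms differently.
    return kl + mse + cos
```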

Evaluation

Evaluation shows that the model compares favorably to other base-sized encoder models (in the 100M-300M parameter range). We conducted fine-tuning experiments on 25 Polish tasks, the same set as for polish-roberta-8k. The table below presents detailed results compared to other popular models: herbert-base-cased and polish-roberta-base-v2.

TASK TYPE DOMAIN METRIC GROUP TASK herbert-base-cased polish-roberta-base-v2 polish-roberta-base-8k
single-label mixed accuracy KLEJ NKJP-NER 94.13 94.32 94.16
single-label semantics accuracy KLEJ CDSC-E 94.22 94.05 94.54
regression semantics spearman KLEJ CDSC-R 93.84 94.64 94.90
single-label social media binary-f1 KLEJ CBD 66.36 70.57 69.35
single-label reviews accuracy KLEJ POLEMO2.0-IN 90.50 90.97 91.27
single-label reviews accuracy KLEJ POLEMO2.0-OUT 77.94 79.11 81.26
single-label mixed binary-f1 KLEJ DYK 68.82 70.38 69.35
single-label news binary-f1 KLEJ PSC 98.94 98.88 98.90
regression reviews 1-wmae KLEJ AR 87.74 87.83 88.05
single-label finance accuracy FinBench banking-short 78.35 78.75 79.79
single-label finance accuracy FinBench banking-long 85.09 85.03 86.99
single-label finance accuracy FinBench banking77 87.29 88.26 89.27
regression finance r2-score FinBench fiqa 52.50 56.63 57.31
single-label finance accuracy FinBench fpb 83.11 83.55 83.63
multi-label finance weighted-f1 FinBench gcn 94.73 95.02 94.87
single-label finance accuracy FinBench stooq 73.33 80.25 81.32
single-label social media accuracy Other 8TAGS 77.81 78.03 79.21
single-label social media accuracy Other BAN-PL 91.71 92.19 92.62
multi-label news weighted-f1 Other MIPD 57.65 58.58 64.39
single-label semantics accuracy Other PPC 84.16 87.05 86.02
single-label semantics accuracy Other SICK-E 85.17 86.71 86.31
regression semantics spearman Other SICK-R 77.82 82.58 83.16
multi-label social media weighted-f1 Other TwitterEMO 67.41 66.46 68.75
single-label reviews accuracy Other IMDB 90.21 91.06 95.02
multi-label law weighted-f1 Other EURLEX 75.37 74.51 79.12

Table 1. Comparison of the mean scores from five fine-tuning runs on 25 discriminative tasks in Polish. The evaluation metrics vary across tasks; the metric used for each task is specified in the METRIC column.

In addition to the detailed results for the three models shown above, we also present a summary of the evaluation of multilingual models that support Polish.

MODEL PARAMS KLEJ (9 tasks) FinBench (7 tasks) Other (9 tasks) Long tasks (4 tasks) All tasks (25 tasks)
EuroBERT/EuroBERT-210m 212M 77.16 76.68 72.48 80.42 75.34
FacebookAI/xlm-roberta-base 278M 84.46 76.63 76.29 74.42 79.33
jhu-clsp/mmBERT-base 307M 83.17 76.59 80.14 82.08 80.24
allegro/herbert-base-cased 124M 85.83 79.20 78.59 77.08 81.37
sdadas/polish-roberta-base-v2 124M 86.75 81.07 79.69 77.30 82.62
sdadas/polish-roberta-base-8k 190M 86.86 81.88 81.62 81.38 83.58

Table 2. Comparison of Polish and multilingual models.
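The "All tasks" column corresponds to a plain average over the 25 individual tasks, which equals a task-count-weighted mean of the three group scores (the "Long tasks" group overlaps the other groups and is not counted separately). For example, for polish-roberta-base-8k:

```python
# Group means for sdadas/polish-roberta-base-8k from Table 2,
# weighted by the number of tasks in each group.
groups = {"KLEJ": (86.86, 9), "FinBench": (81.88, 7), "Other": (81.62, 9)}

total = sum(score * n for score, n in groups.values())
n_tasks = sum(n for _, n in groups.values())
print(round(total / n_tasks, 2))  # 83.58, matching the "All tasks" column
```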

Efficiency

The model includes a custom implementation supporting unpadding and sequence packing, which can significantly speed up inference and training while reducing memory consumption (more information here). Using this feature requires Flash Attention and the Transformers library version 5.4 or newer. To enable unpadding, initialize the model with the trust_remote_code=True and attn_implementation="flash_attention_2" parameters, along with 16-bit precision.

import torch
from transformers import AutoModel

model = AutoModel.from_pretrained(
    "sdadas/polish-roberta-base-8k",
    trust_remote_code=True,
    attn_implementation="flash_attention_2",
    dtype=torch.bfloat16,
    device_map="cuda"
)
print(model.__class__.__name__)  # UnpadRobertaModel
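Conceptually, unpadding flattens a padded batch into a single packed sequence of real tokens before attention and scatters the results back afterwards. Below is a minimal, framework-agnostic sketch of that idea with hypothetical helper functions; it illustrates the technique, not the model's actual implementation.

```python
import torch

def unpad(hidden, attention_mask):
    """Flatten a padded batch (B, L, H) into packed tokens (T, H),
    keeping the flat indices so the batch can be re-padded later."""
    mask = attention_mask.bool().flatten()        # (B*L,)
    indices = mask.nonzero(as_tuple=True)[0]      # positions of real tokens
    packed = hidden.flatten(0, 1)[indices]        # (T, H), where T = mask.sum()
    return packed, indices

def repad(packed, indices, batch_size, seq_len):
    """Scatter packed tokens back into a padded (B, L, H) tensor,
    filling padded positions with zeros."""
    hidden = packed.new_zeros(batch_size * seq_len, packed.shape[-1])
    hidden[indices] = packed
    return hidden.view(batch_size, seq_len, -1)
```

For two sequences of lengths 3 and 5 padded to length 5, attention then runs over 8 packed tokens instead of 10 padded positions; the savings grow with the length variance inside a batch.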

Citation

@misc{dadas2026longcontext,
  title={Long-Context Encoder Models for Polish Language Understanding},
  author={Sławomir Dadas and Rafał Poświata and Marek Kozłowski and Małgorzata Grębowiec and Michał Perełkiewicz and Paweł Klimiuk and Przemysław Boruta},
  year={2026},
  eprint={2603.12191},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
  url={https://arxiv.org/abs/2603.12191}
}