polish-roberta-base-8k

This is a small model trained using knowledge distillation from polish-roberta-8k. For distillation, we used a combination of several loss functions: KL divergence for the MLM head, as well as MSE and cosine loss to align the last-layer representations of the teacher and the student. The model was trained for one epoch on the Polish subset of the FineWeb 2 dataset. We prepared a packed version of the dataset by concatenating documents into sequences of exactly 8192 tokens. We then trained the model using batches of 32 such sequences and a polynomial learning rate scheduler with a warmup of 1000 iterations and a maximum learning rate of 5e-5.
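As a rough sketch, the combined distillation objective described above might look like the following. The temperature, equal loss weights, and matched hidden sizes are assumptions for illustration, not details taken from the actual training setup; if the teacher's hidden size differs from the student's, a learned projection would be needed before the alignment losses.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits,
                      student_hidden, teacher_hidden,
                      temperature=2.0):
    # KL divergence between the teacher's and student's MLM distributions.
    # F.kl_div expects log-probabilities as input and probabilities as target.
    kl = F.kl_div(
        F.log_softmax(student_logits / temperature, dim=-1),
        F.softmax(teacher_logits / temperature, dim=-1),
        reduction="batchmean",
    ) * temperature ** 2

    # MSE between last-layer hidden states of teacher and student
    # (assumes equal hidden sizes; otherwise project the student first).
    mse = F.mse_loss(student_hidden, teacher_hidden)

    # Cosine loss: 1 - cosine similarity, averaged over all token positions.
    cos = (1.0 - F.cosine_similarity(student_hidden, teacher_hidden, dim=-1)).mean()

    # Equal weighting is an assumption; the real setup may weight terms differently.
    return kl + mse + cos
```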

Evaluation

Evaluation shows that the model compares favorably to other base-sized encoder models (in the 100M-300M parameter range). We conducted fine-tuning experiments on 25 Polish tasks, the same set as for polish-roberta-8k. The table below presents detailed results compared to other popular models: herbert-base-cased and polish-roberta-base-v2.

TASK TYPE DOMAIN METRIC GROUP TASK herbert-base-cased polish-roberta-base-v2 polish-roberta-base-8k
single-label mixed accuracy KLEJ NKJP-NER 94.13 94.32 94.16
single-label semantics accuracy KLEJ CDSC-E 94.22 94.05 94.54
regression semantics spearman KLEJ CDSC-R 93.84 94.64 94.90
single-label social media binary-f1 KLEJ CBD 66.36 70.57 69.35
single-label reviews accuracy KLEJ POLEMO2.0-IN 90.50 90.97 91.27
single-label reviews accuracy KLEJ POLEMO2.0-OUT 77.94 79.11 81.26
single-label mixed binary-f1 KLEJ DYK 68.82 70.38 69.35
single-label news binary-f1 KLEJ PSC 98.94 98.88 98.90
regression reviews 1-wmae KLEJ AR 87.74 87.83 88.05
single-label finance accuracy FinBench banking-short 78.35 78.75 79.79
single-label finance accuracy FinBench banking-long 85.09 85.03 86.99
single-label finance accuracy FinBench banking77 87.29 88.26 89.27
regression finance r2-score FinBench fiqa 52.50 56.63 57.31
single-label finance accuracy FinBench fpb 83.11 83.55 83.63
multi-label finance weighted-f1 FinBench gcn 94.73 95.02 94.87
single-label finance accuracy FinBench stooq 73.33 80.25 81.32
single-label social media accuracy Other 8TAGS 77.81 78.03 79.21
single-label social media accuracy Other BAN-PL 91.71 92.19 92.62
multi-label news weighted-f1 Other MIPD 57.65 58.58 64.39
single-label semantics accuracy Other PPC 84.16 87.05 86.02
single-label semantics accuracy Other SICK-E 85.17 86.71 86.31
regression semantics spearman Other SICK-R 77.82 82.58 83.16
multi-label social media weighted-f1 Other TwitterEMO 67.41 66.46 68.75
single-label reviews accuracy Other IMDB 90.21 91.06 95.02
multi-label law weighted-f1 Other EURLEX 75.37 74.51 79.12

Table 1. Comparison of the mean scores from five fine-tuning runs on 25 discriminative tasks in Polish. The evaluation metrics vary across tasks; the metric used for each task is specified in the METRIC column.

In addition to the detailed results for the three models shown above, we also present a summary of the evaluation of multilingual models that support Polish.

MODEL PARAMS KLEJ (9 tasks) FinBench (7 tasks) Other (9 tasks) Long tasks (4 tasks) All tasks (25 tasks)
EuroBERT/EuroBERT-210m 212M 77.16 76.68 72.48 80.42 75.34
FacebookAI/xlm-roberta-base 278M 84.46 76.63 76.29 74.42 79.33
jhu-clsp/mmBERT-base 307M 83.17 76.59 80.14 82.08 80.24
allegro/herbert-base-cased 124M 85.83 79.20 78.59 77.08 81.37
sdadas/polish-roberta-base-v2 124M 86.75 81.07 79.69 77.30 82.62
sdadas/polish-roberta-base-8k 190M 86.86 81.88 81.62 81.38 83.58

Table 2. Comparison of Polish and multilingual models.
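The "All tasks" column corresponds to a plain average over the 25 individual tasks, which equals a task-count-weighted mean of the three group scores (the "Long tasks" group overlaps the other groups and is not counted separately). For example, for polish-roberta-base-8k:

```python
# Group means for sdadas/polish-roberta-base-8k from Table 2,
# weighted by the number of tasks in each group.
groups = {"KLEJ": (86.86, 9), "FinBench": (81.88, 7), "Other": (81.62, 9)}

total = sum(score * n for score, n in groups.values())
n_tasks = sum(n for _, n in groups.values())
print(round(total / n_tasks, 2))  # 83.58, matching the "All tasks" column
```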

Efficiency

The model includes a custom implementation supporting unpadding and sequence packing, which can significantly speed up inference and training while reducing memory consumption (more information here). Using this feature requires Flash Attention and the Transformers library version 5.4 or newer. To enable unpadding, initialize the model with the trust_remote_code=True and attn_implementation="flash_attention_2" parameters, along with 16-bit precision.

import torch
from transformers import AutoModel

model = AutoModel.from_pretrained(
    "sdadas/polish-roberta-base-8k",
    trust_remote_code=True,
    attn_implementation="flash_attention_2",
    dtype=torch.bfloat16,
    device_map="cuda"
)
print(model.__class__.__name__)  # UnpadRobertaModel
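Conceptually, unpadding flattens a padded batch into a single packed sequence of real tokens before attention and scatters the results back afterwards. Below is a minimal, framework-agnostic sketch of that idea with hypothetical helper functions; it illustrates the technique, not the model's actual implementation.

```python
import torch

def unpad(hidden, attention_mask):
    """Flatten a padded batch (B, L, H) into packed tokens (T, H),
    keeping the flat indices so the batch can be re-padded later."""
    mask = attention_mask.bool().flatten()        # (B*L,)
    indices = mask.nonzero(as_tuple=True)[0]      # positions of real tokens
    packed = hidden.flatten(0, 1)[indices]        # (T, H), where T = mask.sum()
    return packed, indices

def repad(packed, indices, batch_size, seq_len):
    """Scatter packed tokens back into a padded (B, L, H) tensor,
    filling padded positions with zeros."""
    hidden = packed.new_zeros(batch_size * seq_len, packed.shape[-1])
    hidden[indices] = packed
    return hidden.view(batch_size, seq_len, -1)
```

For two sequences of lengths 3 and 5 padded to length 5, attention then runs over 8 packed tokens instead of 10 padded positions; the savings grow with the length variance inside a batch.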

Citation

@misc{dadas2026longcontext,
  title={Long-Context Encoder Models for Polish Language Understanding},
  author={Sławomir Dadas and Rafał Poświata and Marek Kozłowski and Małgorzata Grębowiec and Michał Perełkiewicz and Paweł Klimiuk and Przemysław Boruta},
  year={2026},
  eprint={2603.12191},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
  url={https://arxiv.org/abs/2603.12191}
}