SindBERT, the Sailor: Charting the Seas of Turkish NLP
Paper: [arXiv:2510.21364](https://arxiv.org/abs/2510.21364)
SindBERT is a family of RoBERTa-based Turkish language models pre-trained from scratch on ~312 GB of Turkish text drawn from mC4, OSCAR23, and Wikipedia. SindBERT aims to deliver strong downstream performance on Turkish NLP tasks and to give the community an openly available large-scale Turkish encoder.
We release two variants:
- SindBERT-base: 126M parameters (fp32)
- SindBERT-large: 357M parameters (fp32)

| Detail | SindBERT-base | SindBERT-large |
|---|---|---|
| Architecture | RoBERTa-base | RoBERTa-large |
| Parameters | ~126M | ~357M |
| Tokenizer | GPT-2-style byte-level BPE (52,009-entry vocabulary) | Same |
| Pretraining corpus | Turkish mC4, OSCAR23, Wikipedia (~312 GB) | Same |
| Objective | Masked Language Modeling | Same |
| Training time | ~29.2 hours (TPUv4-128 pod) | ~6.0 days (TPUv4-128 pod) |
| Precision | fp32 | fp32 |
| Framework | fairseq | fairseq |
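Both variants are trained with the masked language modeling objective. As a rough illustration only (assuming the standard BERT/RoBERTa 80/10/10 corruption recipe; this is not the actual fairseq data pipeline, and the helper below is ours):

```python
import random

def mlm_mask(tokens, vocab, mask_token="<mask>", mask_prob=0.15, seed=0):
    """Apply BERT/RoBERTa-style MLM corruption to a token sequence.

    Each position is selected with probability `mask_prob`. Of the selected
    positions, 80% are replaced with the mask token, 10% with a random
    vocabulary token, and 10% are left unchanged. Returns the corrupted
    sequence and the positions the model must predict.
    """
    rng = random.Random(seed)
    corrupted = list(tokens)
    targets = []  # positions whose original token is the prediction target
    for i in range(len(tokens)):
        if rng.random() < mask_prob:
            targets.append(i)
            r = rng.random()
            if r < 0.8:
                corrupted[i] = mask_token          # 80%: mask
            elif r < 0.9:
                corrupted[i] = rng.choice(vocab)   # 10%: random token
            # else: 10% keep the original token

    return corrupted, targets

# Toy example; in training this operates on BPE token IDs, not words.
tokens = "istanbul çok güzel bir şehir liman deniz gemi".split()
corrupted, targets = mlm_mask(tokens, vocab=tokens, seed=42)
print(corrupted, targets)
```

Unselected positions are left untouched, so the model is only penalized on the `targets` positions.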
We evaluate SindBERT on four Turkish benchmarks: part-of-speech (PoS) tagging, named entity recognition (NER), offensive language identification (OffensEval-TR 2020), and linguistic acceptability (TurBLiMP).

Legend: **Bold** = best, *italic* = second-best per model size.
| Model | PoS | NER | OffensEval-TR 2020 | AVG core | TurBLiMP AVG | AVG all |
|---|---|---|---|---|---|---|
| *Large models* | | | | | | |
| SindBERT_large | **94.63** | *93.64* | **82.29** | *90.19* | 89.8 | *90.09* |
| XLM-R_large | *94.39* | **94.44** | *81.99* | **90.27** | **92.7** | **90.73** |
| EuroBERT_610M | 93.33 | 91.85 | 75.57 | 86.92 | *90.0* | 87.84 |
| *Base models* | | | | | | |
| ELECTRA_small | 94.28 | 91.92 | 78.17 | 88.12 | 80.6 | 86.24 |
| DistilBERTurk | 94.01 | 91.54 | 79.19 | 88.25 | 87.2 | 87.99 |
| ConvBERTurk | 94.41 | *94.03* | **81.99** | **90.14** | 60.8 | 82.81 |
| ConvBERTurk_mC4 | **94.57** | 93.56 | *81.90* | *90.01* | 55.5 | 81.38 |
| ELECTRA_base | 94.29 | 93.49 | 81.54 | 89.77 | 89.9 | 89.81 |
| ELECTRA_mC4 | 94.40 | 93.43 | 81.38 | 89.74 | 89.9 | 89.78 |
| BERTurk_32k | 93.16 | **94.38** | 81.03 | 89.52 | *93.8* | *90.59* |
| RoBERTurk | 87.99 | 81.09 | 70.01 | 79.70 | - | - |
| SindBERT_base | *94.47* | 93.19 | 81.14 | 89.60 | 90.3 | 89.78 |
| mmBERT_small | 93.75 | 92.51 | 77.28 | 87.85 | 85.1 | 87.16 |
| BERTurk_128k | 94.44 | 93.81 | 81.77 | *90.01* | **95.1** | **91.28** |
| EuroBERT_210M | 92.97 | 90.91 | 75.73 | 86.54 | 86.3 | 86.48 |
| XLM-R_base | 94.23 | 92.90 | 79.77 | 88.97 | 89.2 | 89.03 |
| mmBERT_base | 93.75 | 93.35 | 78.49 | 88.53 | 89.3 | 88.72 |
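The AVG core column appears to be the unweighted mean of the three downstream task scores (PoS, NER, OffensEval-TR 2020). A minimal sanity check against the two SindBERT rows above (the helper function is ours, not from the paper):

```python
def avg_core(pos, ner, offenseval):
    # Unweighted mean of the three downstream task scores,
    # rounded to two decimals as in the results table.
    return round((pos + ner + offenseval) / 3, 2)

print(avg_core(94.63, 93.64, 82.29))  # SindBERT_large row -> 90.19
print(avg_core(94.47, 93.19, 81.14))  # SindBERT_base row  -> 89.6
```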
Get the fairseq checkpoint here.
If you use SindBERT in your research, please cite the following paper:
```bibtex
@misc{scheibleschmitt2025sindbertsailorchartingseas,
  title={SindBERT, the Sailor: Charting the Seas of Turkish NLP},
  author={Raphael Scheible-Schmitt and Stefan Schweter},
  year={2025},
  eprint={2510.21364},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
  url={https://arxiv.org/abs/2510.21364},
}
```
MIT License