SindBERT: Charting the Seas of Turkish NLP

SindBERT is a family of RoBERTa-based Turkish language models pre-trained from scratch on ~312 GB of Turkish text from mC4, OSCAR23, and Wikipedia. The models aim to give the community an openly available large-scale Turkish encoder with strong downstream performance.

We release two variants:

  • SindBERT-base: 126M parameters (fp32)
  • SindBERT-large: 357M parameters (fp32)

Model Details

| Detail | SindBERT-base | SindBERT-large |
|---|---|---|
| Architecture | RoBERTa-base | RoBERTa-large |
| Parameters | ~126M | ~357M |
| Tokenizer | GPT-2 style byte-level BPE (52,009 vocab) | Same |
| Pretraining corpus | Turkish mC4, OSCAR23, Wikipedia (~312 GB) | Same |
| Objective | Masked Language Modeling | Same |
| Training time | ~29.2 hours (TPUv4-128 pod) | ~6.0 days (TPUv4-128 pod) |
| Precision | fp32 | fp32 |
| Framework | fairseq | fairseq |
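The masked language modeling objective listed above can be sketched in a few lines. This follows the standard BERT/RoBERTa masking recipe (15% of tokens selected as targets, with the usual 80/10/10 mask/random/keep split); the toy vocabulary and helper below are illustrative, not SindBERT's actual tokenizer output:

```python
import random

MASK = "<mask>"
VOCAB = ["ev", "kitap", "deniz", "gemi", "ruzgar"]  # toy vocabulary for random replacement

def mask_tokens(tokens, mask_prob=0.15, seed=0):
    """Select ~mask_prob of positions as MLM targets; of those, 80% become
    <mask>, 10% a random vocabulary token, and 10% stay unchanged."""
    rng = random.Random(seed)
    inputs, labels = list(tokens), [None] * len(tokens)
    for i, tok in enumerate(tokens):
        if rng.random() < mask_prob:
            labels[i] = tok                    # the model must recover this token
            roll = rng.random()
            if roll < 0.8:
                inputs[i] = MASK               # 80%: replace with the mask token
            elif roll < 0.9:
                inputs[i] = rng.choice(VOCAB)  # 10%: replace with a random token
            # else: 10% keep the original token unchanged
    return inputs, labels
```

Because the mask pattern is sampled on the fly, each epoch can see a different masking of the same sentence (RoBERTa-style dynamic masking), rather than a single static pattern fixed at preprocessing time.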

Downstream Evaluation

We evaluate SindBERT on four Turkish benchmarks:

  • PoS tagging (concatenated Turkish UD treebanks): micro-F1
  • NER (WikiANN TR): micro-F1
  • Offensive language detection (OffensEval-TR 2020): macro-F1
  • Linguistic acceptability (TurBLiMP): accuracy averaged over 16 phenomena
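Micro-F1 pools true/false positives across all classes, so frequent labels dominate; macro-F1 averages per-class F1 scores, weighting rare classes equally, which is why the class-imbalanced OffensEval-TR task is reported with macro-F1. A small pure-Python illustration (toy labels, not benchmark data):

```python
from collections import Counter

def f1_scores(y_true, y_pred):
    """Return (micro_f1, macro_f1) for multi-class label lists."""
    classes = sorted(set(y_true) | set(y_pred))
    tp, fp, fn = Counter(), Counter(), Counter()
    for t, p in zip(y_true, y_pred):
        if t == p:
            tp[t] += 1
        else:
            fp[p] += 1  # predicted p, but the gold label was t
            fn[t] += 1  # gold label t was missed
    # micro: pool counts over all classes, then compute a single F1
    TP, FP, FN = sum(tp.values()), sum(fp.values()), sum(fn.values())
    micro = 2 * TP / (2 * TP + FP + FN) if TP else 0.0
    # macro: per-class F1, then the unweighted mean
    per_class = [2 * tp[c] / (2 * tp[c] + fp[c] + fn[c]) if tp[c] else 0.0
                 for c in classes]
    macro = sum(per_class) / len(classes)
    return micro, macro

# Skewed toy data: one missed rare "off" label barely moves micro-F1
# but costs macro-F1 heavily.
micro, macro = f1_scores(["non"] * 8 + ["off"] * 2, ["non"] * 9 + ["off"])
print(round(micro, 3), round(macro, 3))  # 0.9 0.804
```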

πŸ§ͺ Evaluation Results

Legend: **bold** = best, *italic* = second-best per model size.

| Model | PoS | NER | OffensEval-TR 2020 | AVG core | TurBLiMP AVG | AVG all |
|---|---|---|---|---|---|---|
| **Large models** | | | | | | |
| SindBERT_large | **94.63** | *93.64* | **82.29** | *90.19* | 89.8 | *90.09* |
| XLM-R_large | *94.39* | **94.44** | *81.99* | **90.27** | **92.7** | **90.73** |
| EuroBERT_610M | 93.33 | 91.85 | 75.57 | 86.92 | *90.0* | 87.84 |
| **Base models** | | | | | | |
| ELECTRA_small | 94.28 | 91.92 | 78.17 | 88.12 | 80.6 | 86.24 |
| DistilBERTurk | 94.01 | 91.54 | 79.19 | 88.25 | 87.2 | 87.99 |
| ConvBERTurk | 94.41 | *94.03* | **81.99** | **90.14** | 60.8 | 82.81 |
| ConvBERTurk_mC4 | **94.57** | 93.56 | *81.90* | *90.01* | 55.5 | 81.38 |
| ELECTRA_base | 94.29 | 93.49 | 81.54 | 89.77 | 89.9 | 89.81 |
| ELECTRA_mC4 | 94.40 | 93.43 | 81.38 | 89.74 | 89.9 | 89.78 |
| BERTurk_32k | 93.16 | **94.38** | 81.03 | 89.52 | *93.8* | *90.59* |
| RoBERTurk | 87.99 | 81.09 | 70.01 | 79.70 | - | - |
| SindBERT_base | *94.47* | 93.19 | 81.14 | 89.60 | 90.3 | 89.78 |
| mmBERT_small | 93.75 | 92.51 | 77.28 | 87.85 | 85.1 | 87.16 |
| BERTurk_128k | 94.44 | 93.81 | 81.77 | *90.01* | **95.1** | **91.28** |
| EuroBERT_210M | 92.97 | 90.91 | 75.73 | 86.54 | 86.3 | 86.48 |
| XLM-R_base | 94.23 | 92.90 | 79.77 | 88.97 | 89.2 | 89.03 |
| mmBERT_base | 93.75 | 93.35 | 78.49 | 88.53 | 89.3 | 88.72 |
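The two average columns can be recomputed from the per-task scores: AVG core is the unweighted mean of the three core tasks (PoS, NER, OffensEval-TR 2020), and AVG all additionally folds in TurBLiMP. Checking the SindBERT_large row:

```python
pos, ner, off, turblimp = 94.63, 93.64, 82.29, 89.8  # SindBERT_large row above

avg_core = (pos + ner + off) / 3              # mean over the three core tasks
avg_all = (pos + ner + off + turblimp) / 4    # core tasks plus TurBLiMP

print(round(avg_core, 2), round(avg_all, 2))  # 90.19 90.09, matching the table
```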

Fairseq Checkpoint

Get the fairseq checkpoint here.

Citations

If you use SindBERT in your research, please cite the following paper:

@misc{scheibleschmitt2025sindbertsailorchartingseas,
      title={SindBERT, the Sailor: Charting the Seas of Turkish NLP}, 
      author={Raphael Scheible-Schmitt and Stefan Schweter},
      year={2025},
      eprint={2510.21364},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2510.21364}, 
}

πŸ“œ License

MIT License
