SindBERT, the Sailor: Charting the Seas of Turkish NLP
Paper: [arXiv:2510.21364](https://arxiv.org/abs/2510.21364)
SindBERT is a family of RoBERTa-based Turkish language models pre-trained from scratch on ~312 GB of Turkish text drawn from mC4, OSCAR23, and Wikipedia. SindBERT aims to deliver strong downstream performance on Turkish NLP tasks and to give the community an openly available large-scale Turkish encoder.
We release two variants:
- SindBERT-base: 126M parameters (fp32)
- SindBERT-large: 357M parameters (fp32)

| Detail | SindBERT-base | SindBERT-large |
|---|---|---|
| Architecture | RoBERTa-base | RoBERTa-large |
| Parameters | ~126M | ~357M |
| Tokenizer | GPT-2-style byte-level BPE (52,009-entry vocabulary) | Same |
| Pretraining corpus | Turkish mC4, OSCAR23, Wikipedia (~312 GB) | Same |
| Objective | Masked Language Modeling | Same |
| Training time | ~29.2 hours (TPUv4-128 pod) | ~6.0 days (TPUv4-128 pod) |
| Precision | fp32 | fp32 |
| Framework | fairseq | fairseq |
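Both variants are trained with the masked language modeling objective. As a rough illustration only (assuming the standard BERT/RoBERTa 80/10/10 corruption recipe; this is not the actual fairseq data pipeline, and the helper below is ours):

```python
import random

def mlm_mask(tokens, vocab, mask_token="<mask>", mask_prob=0.15, seed=0):
    """Apply BERT/RoBERTa-style MLM corruption to a token sequence.

    Each position is selected with probability `mask_prob`. Of the selected
    positions, 80% are replaced with the mask token, 10% with a random
    vocabulary token, and 10% are left unchanged. Returns the corrupted
    sequence and the positions the model must predict.
    """
    rng = random.Random(seed)
    corrupted = list(tokens)
    targets = []  # positions whose original token is the prediction target
    for i in range(len(tokens)):
        if rng.random() < mask_prob:
            targets.append(i)
            r = rng.random()
            if r < 0.8:
                corrupted[i] = mask_token          # 80%: mask
            elif r < 0.9:
                corrupted[i] = rng.choice(vocab)   # 10%: random token
            # else: 10% keep the original token

    return corrupted, targets

# Toy example; in training this operates on BPE token IDs, not words.
tokens = "istanbul çok güzel bir şehir liman deniz gemi".split()
corrupted, targets = mlm_mask(tokens, vocab=tokens, seed=42)
print(corrupted, targets)
```

Unselected positions are left untouched, so the model is only penalized on the `targets` positions.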
We evaluate SindBERT on four Turkish benchmarks: part-of-speech (PoS) tagging, named entity recognition (NER), offensive language identification (OffensEval-TR 2020), and linguistic acceptability (TurBLiMP).

Legend: **Bold** = best, *italic* = second-best per model size.
| Model | PoS | NER | OffensEval-TR 2020 | AVG core | TurBLiMP AVG | AVG all |
|---|---|---|---|---|---|---|
| *Large models* | | | | | | |
| SindBERT_large | **94.63** | *93.64* | **82.29** | *90.19* | 89.8 | *90.09* |
| XLM-R_large | *94.39* | **94.44** | *81.99* | **90.27** | **92.7** | **90.73** |
| EuroBERT_610M | 93.33 | 91.85 | 75.57 | 86.92 | *90.0* | 87.84 |
| *Base models* | | | | | | |
| ELECTRA_small | 94.28 | 91.92 | 78.17 | 88.12 | 80.6 | 86.24 |
| DistilBERTurk | 94.01 | 91.54 | 79.19 | 88.25 | 87.2 | 87.99 |
| ConvBERTurk | 94.41 | *94.03* | **81.99** | **90.14** | 60.8 | 82.81 |
| ConvBERTurk_mC4 | **94.57** | 93.56 | *81.90* | *90.01* | 55.5 | 81.38 |
| ELECTRA_base | 94.29 | 93.49 | 81.54 | 89.77 | 89.9 | 89.81 |
| ELECTRA_mC4 | 94.40 | 93.43 | 81.38 | 89.74 | 89.9 | 89.78 |
| BERTurk_32k | 93.16 | **94.38** | 81.03 | 89.52 | *93.8* | *90.59* |
| RoBERTurk | 87.99 | 81.09 | 70.01 | 79.70 | - | - |
| SindBERT_base | *94.47* | 93.19 | 81.14 | 89.60 | 90.3 | 89.78 |
| mmBERT_small | 93.75 | 92.51 | 77.28 | 87.85 | 85.1 | 87.16 |
| BERTurk_128k | 94.44 | 93.81 | 81.77 | *90.01* | **95.1** | **91.28** |
| EuroBERT_210M | 92.97 | 90.91 | 75.73 | 86.54 | 86.3 | 86.48 |
| XLM-R_base | 94.23 | 92.90 | 79.77 | 88.97 | 89.2 | 89.03 |
| mmBERT_base | 93.75 | 93.35 | 78.49 | 88.53 | 89.3 | 88.72 |
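The AVG core column appears to be the unweighted mean of the three downstream task scores (PoS, NER, OffensEval-TR 2020). A minimal sanity check against the two SindBERT rows above (the helper function is ours, not from the paper):

```python
def avg_core(pos, ner, offenseval):
    # Unweighted mean of the three downstream task scores,
    # rounded to two decimals as in the results table.
    return round((pos + ner + offenseval) / 3, 2)

print(avg_core(94.63, 93.64, 82.29))  # SindBERT_large row -> 90.19
print(avg_core(94.47, 93.19, 81.14))  # SindBERT_base row  -> 89.6
```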
Get the fairseq checkpoint here.
If you use SindBERT in your research, please cite the following paper:
```bibtex
@misc{scheibleschmitt2025sindbertsailorchartingseas,
  title={SindBERT, the Sailor: Charting the Seas of Turkish NLP},
  author={Raphael Scheible-Schmitt and Stefan Schweter},
  year={2025},
  eprint={2510.21364},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
  url={https://arxiv.org/abs/2510.21364},
}
```
MIT License