@@ -2,11 +2,93 @@
 language:
 - tr
 license: mit
 ---
-# SindBERT
-Pretrained RoBERTa-style language model for Turkish.
-**Paper:** “SindBERT, the Sailor: Charting the Seas of Turkish NLP” (preprint forthcoming on arXiv)
-Model and paper will be accompanied by a detailed model card soon.

 language:
 - tr
 license: mit
+tags:
+- roberta
+- masked-language-modeling
+- turkish
+- encoder
+- fairseq
+- huggingface
+pipeline_tag: fill-mask
 ---
+# SindBERT: Charting the Seas of Turkish NLP
+**SindBERT** is a family of RoBERTa-based Turkish language models pretrained from scratch on ~312 GB of Turkish text drawn from mC4, OSCAR23, and Wikipedia. The models are intended to deliver strong downstream performance on Turkish NLP tasks and to give the community an openly available large-scale encoder.
+We release two variants:
+- `SindBERT-base`: 126M parameters (fp32)
+- `SindBERT-large`: 357M parameters (fp32)
+## Model Details
+| Detail             | SindBERT-base                             | SindBERT-large            |
+| ------------------ | ----------------------------------------- | ------------------------- |
+| Architecture       | RoBERTa-base                              | RoBERTa-large             |
+| Parameters         | ~126M                                     | ~357M                     |
+| Tokenizer          | GPT-2 style byte-level BPE (52,009 vocab) | Same                      |
+| Pretraining corpus | Turkish mC4, OSCAR23, Wikipedia (~312 GB) | Same                      |
+| Objective          | Masked Language Modeling                  | Same                      |
+| Training time      | ~29.2 hours (TPUv4-128 pod)               | ~6.0 days (TPUv4-128 pod) |
+| Precision          | fp32                                      | fp32                      |
+| Framework          | fairseq                                   | fairseq                   |
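Because both variants are trained with masked language modeling, a quick sanity check is to ask the model to fill a masked token. The following is a minimal sketch, assuming the weights are published on the Hugging Face Hub in `transformers` format. The repo id `SindBERT/SindBERT_large` and the helper names `build_masked_text` and `top_fill_mask` are hypothetical and should be adapted to the actual repository.

```python
from typing import List

# ASSUMPTION: hypothetical Hub repo id; replace with the actual repository name.
MODEL_ID = "SindBERT/SindBERT_large"


def build_masked_text(template: str, mask_token: str) -> str:
    """Substitute the model's mask token for the `{mask}` slot in `template`."""
    return template.format(mask=mask_token)


def top_fill_mask(template: str, model_id: str = MODEL_ID, top_k: int = 5) -> List[str]:
    """Return the top_k predicted tokens for the `{mask}` slot in `template`."""
    # Imported lazily so this module can be loaded without transformers installed.
    from transformers import pipeline

    fill = pipeline("fill-mask", model=model_id)
    # Read the mask token from the tokenizer instead of hard-coding it.
    text = build_masked_text(template, fill.tokenizer.mask_token)
    return [pred["token_str"].strip() for pred in fill(text, top_k=top_k)]
```

For example, `top_fill_mask("Türkiye'nin başkenti {mask}.")` ("The capital of Turkey is ...") would download the model on first use and return its five highest-scoring completions.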
+## Downstream Evaluation
+We evaluate SindBERT on four Turkish benchmarks:
+- PoS tagging (concatenation of the Turkish UD treebanks): micro-F1
+- NER (WikiANN TR): micro-F1
+- Offensive language detection (OffensEval-TR 2020): macro-F1
+- Linguistic acceptability (TurBLiMP): accuracy averaged over 16 phenomena
+## 🧪 Evaluation Results
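+**Legend**: **Bold = best**, *italic = second-best* per model size. **AVG core** is the mean over PoS, NER, and OffensEval-TR 2020; **AVG all** additionally includes TurBLiMP AVG.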
+| Model           | PoS            | NER            | OffensEval-TR 2020            |  AVG core | TurBLiMP AVG |   AVG all |
+| --------------- | -------------: | -------------: | ----------------------------: | --------: | -----------: | --------: |
+| **Large models**|                |                |                               |           |              |           |
+| SindBERT_large  |      **94.63** |        *93.64* |                     **82.29** |   *90.19* |         89.8 |   *90.09* |
+| XLM-R_large     |        *94.39* |      **94.44** |                       *81.99* | **90.27** |     **92.7** | **90.73** |
+| EuroBERT_610M   |          93.33 |          91.85 |                         75.57 |     86.92 |       *90.0* |     87.84 |
+| **Base models** |                |                |                               |           |              |           |
+| ELECTRA_small   |          94.28 |          91.92 |                         78.17 |     88.12 |         80.6 |     86.24 |
+| DistilBERTurk   |          94.01 |          91.54 |                         79.19 |     88.25 |         87.2 |     87.99 |
+| ConvBERTurk     |          94.41 |        *94.03* |                     **81.99** | **90.14** |         60.8 |     82.81 |
+| ConvBERTurk_mC4 |      **94.57** |          93.56 |                       *81.90* |   *90.01* |         55.5 |     81.38 |
+| ELECTRA_base    |          94.29 |          93.49 |                         81.54 |     89.77 |         89.9 |     89.81 |
+| ELECTRA_mC4     |          94.40 |          93.43 |                         81.38 |     89.74 |         89.9 |     89.78 |
+| BERTurk_32k     |          93.16 |      **94.38** |                         81.03 |     89.52 |       *93.8* |   *90.59* |
+| RoBERTurk       |          87.99 |          81.09 |                         70.01 |     79.70 |            - |         - |
+| SindBERT_base   |        *94.47* |          93.19 |                         81.14 |     89.60 |         90.3 |     89.78 |
+| mmBERT_small    |          93.75 |          92.51 |                         77.28 |     87.85 |         85.1 |     87.16 |
+| BERTurk_128k    |          94.44 |          93.81 |                         81.77 |   *90.01* |     **95.1** | **91.28** |
+| EuroBERT_210M   |          92.97 |          90.91 |                         75.73 |     86.54 |         86.3 |     86.48 |
+| XLM-R_base      |          94.23 |          92.90 |                         79.77 |     88.97 |         89.2 |     89.03 |
+| mmBERT_base     |          93.75 |          93.35 |                         78.49 |     88.53 |         89.3 |     88.72 |
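The two aggregate columns can be sanity-checked with a few lines of arithmetic. The sketch below assumes that AVG core is the unweighted mean of PoS, NER, and OffensEval-TR 2020, and that AVG all is the unweighted mean of those three plus TurBLiMP AVG. This is an inference from the numbers, not a definition stated in the card; it reproduces the SindBERT_large row, but not every row in the table necessarily matches it exactly. The function names are illustrative.

```python
from statistics import mean

# Scores copied from the SindBERT_large row of the table above.
sindbert_large = {
    "PoS": 94.63,
    "NER": 93.64,
    "OffensEval-TR 2020": 82.29,
    "TurBLiMP AVG": 89.8,
}

# ASSUMPTION: "core" means the three fine-tuning tasks, excluding TurBLiMP.
CORE_TASKS = ("PoS", "NER", "OffensEval-TR 2020")


def avg_core(row: dict) -> float:
    """Unweighted mean over the three core tasks, rounded to two decimals."""
    return round(mean(row[task] for task in CORE_TASKS), 2)


def avg_all(row: dict) -> float:
    """Unweighted mean over all four scores, rounded to two decimals."""
    return round(mean(row.values()), 2)


print(avg_core(sindbert_large), avg_all(sindbert_large))  # prints: 90.19 90.09
```

An unweighted mean gives TurBLiMP the same weight as each fine-tuning task, which is why models with low TurBLiMP accuracy drop sharply in AVG all despite a high AVG core.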
+## Fairseq Checkpoint
+The original fairseq checkpoint is available for download from [Proton Drive](https://drive.proton.me/urls/KTQKVJ4S4W#cSlP0BpjKiyX).
+## Citations
+If you use SindBERT in your research, please cite the following paper:
+```bibtex
+@misc{scheibleschmitt2025sindbertsailorchartingseas,
+      title={SindBERT, the Sailor: Charting the Seas of Turkish NLP},
+      author={Raphael Scheible-Schmitt and Stefan Schweter},
+      year={2025},
+      eprint={2510.21364},
+      archivePrefix={arXiv},
+      primaryClass={cs.CL},
+      url={https://arxiv.org/abs/2510.21364},
+}
+```
+## 📜 License
+MIT License