---
language:
- tr
license: mit
tags:
- roberta
- masked-language-modeling
- turkish
- encoder
- fairseq
- huggingface
pipeline_tag: fill-mask
---

# SindBERT: Charting the Seas of Turkish NLP

**SindBERT** is a family of RoBERTa-based Turkish language models pre-trained from scratch on ~312 GB of Turkish text from mC4, OSCAR23, and Wikipedia. The models aim to provide strong downstream performance for Turkish NLP and an openly available large-scale encoder for the community.

We release two variants:

- `SindBERT-base`: 126M parameters (fp32)
- `SindBERT-large`: 357M parameters (fp32)
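
Given the `fill-mask` pipeline tag above, the models can be queried for masked-token predictions via the standard Hugging Face `transformers` pipeline. A minimal sketch — note that the repository id below is a placeholder, not a confirmed Hub id:

```python
def predict_mask(text, model_id="rjschmitt/SindBERT-base", top_k=5):
    """Return (token, score) pairs for the single <mask> slot in `text`.

    NOTE: `model_id` is a hypothetical repository id used for illustration;
    substitute the actual SindBERT id once it is published on the Hub.
    """
    # Imported lazily so the sketch can be read without the heavyweight
    # transformers dependency installed.
    from transformers import pipeline

    fill = pipeline("fill-mask", model=model_id, top_k=top_k)
    # RoBERTa-style checkpoints use "<mask>" as the mask token.
    return [(p["token_str"], p["score"]) for p in fill(text)]


if __name__ == "__main__":
    for token, score in predict_mask("Türkiye'nin başkenti <mask>."):
        print(f"{token}\t{score:.3f}")
```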

## Model Details

| Detail             | SindBERT-base                             | SindBERT-large            |
| ------------------ | ----------------------------------------- | ------------------------- |
| Architecture       | RoBERTa-base                              | RoBERTa-large             |
| Parameters         | ~126M                                     | ~357M                     |
| Tokenizer          | GPT-2-style byte-level BPE (52,009 vocab) | Same                      |
| Pretraining corpus | Turkish mC4, OSCAR23, Wikipedia (~312 GB) | Same                      |
| Objective          | Masked language modeling                  | Same                      |
| Training time      | ~29.2 hours (TPUv4-128 pod)               | ~6.0 days (TPUv4-128 pod) |
| Precision          | fp32                                      | fp32                      |
| Framework          | fairseq                                   | fairseq                   |

## Downstream Evaluation

We evaluate SindBERT on four Turkish benchmarks:

- PoS tagging (concatenated Turkish UD treebanks): micro-F1
- NER (WikiANN TR): micro-F1
- Offensive language detection (OffensEval-TR 2020): macro-F1
- Linguistic acceptability (TurBLiMP): accuracy averaged over 16 phenomena

## 🧪 Evaluation Results

**Legend**: **bold** = best, *italic* = second-best within each model-size group.

| Model           |       PoS |       NER | OffensEval-TR 2020 |  AVG core | TurBLiMP AVG |   AVG all |
| --------------- | --------: | --------: | -----------------: | --------: | -----------: | --------: |
| **Large models**|           |           |                    |           |              |           |
| SindBERT_large  | **94.63** |   *93.64* |          **82.29** |     90.19 |         89.8 |     90.09 |
| XLM-R_large     |   *94.39* | **94.44** |            *81.99* | **90.27** |     **92.7** | **90.73** |
| EuroBERT_610M   |     93.33 |     91.85 |              75.57 |     86.92 |       *90.0* |     87.84 |
| **Base models** |           |           |                    |           |              |           |
| ELECTRA_small   |     94.28 |     91.92 |              78.17 |     88.12 |         80.6 |     86.24 |
| DistilBERTurk   |     94.01 |     91.54 |              79.19 |     88.25 |         87.2 |     87.99 |
| ConvBERTurk     |     94.41 |   *94.03* |          **81.99** | **90.14** |         60.8 |     82.81 |
| ConvBERTurk_mC4 | **94.57** |     93.56 |            *81.90* |   *90.01* |         55.5 |     81.38 |
| ELECTRA_base    |     94.29 |     93.49 |              81.54 |     89.77 |         89.9 |     89.81 |
| ELECTRA_mC4     |     94.40 |     93.43 |              81.38 |     89.74 |         89.9 |     89.78 |
| BERTurk_32k     |     93.16 | **94.38** |              81.03 |     89.52 |       *93.8* |   *90.59* |
| RoBERTurk       |     87.99 |     81.09 |              70.01 |     79.70 |            - |         - |
| SindBERT_base   |   *94.47* |     93.19 |              81.14 |     89.60 |         90.3 |     89.78 |
| mmBERT_small    |     93.75 |     92.51 |              77.28 |     87.85 |         85.1 |     87.16 |
| BERTurk_128k    |     94.44 |     93.81 |              81.77 |   *90.01* |     **95.1** | **91.28** |
| EuroBERT_210M   |     92.97 |     90.91 |              75.73 |     86.54 |         86.3 |     86.48 |
| XLM-R_base      |     94.23 |     92.90 |              79.77 |     88.97 |         89.2 |     89.03 |
| mmBERT_base     |     93.75 |     93.35 |              78.49 |     88.53 |         89.3 |     88.72 |
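
The aggregate columns appear to be plain unweighted means: `AVG core` averages the three core tasks (PoS, NER, OffensEval), and `AVG all` additionally folds in TurBLiMP. Checking this assumption against the SindBERT_large row:

```python
# Per-task scores for SindBERT_large, taken from the table above.
pos, ner, offens, turblimp = 94.63, 93.64, 82.29, 89.8

avg_core = round((pos + ner + offens) / 3, 2)            # mean of the three core tasks
avg_all = round((pos + ner + offens + turblimp) / 4, 2)  # core tasks + TurBLiMP

print(avg_core, avg_all)  # 90.19 90.09 -- matches the table
```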

## Fairseq Checkpoint

Get the fairseq checkpoint [here](https://drive.proton.me/urls/KTQKVJ4S4W#cSlP0BpjKiyX).
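
Since the models were trained with fairseq, the checkpoint can be loaded through fairseq's RoBERTa wrapper. A minimal sketch, assuming the downloaded archive unpacks to a directory containing `model.pt` plus the dictionary/BPE files — the file names are assumptions, not confirmed by this card:

```python
def load_sindbert(checkpoint_dir, checkpoint_file="model.pt"):
    """Load a SindBERT fairseq checkpoint for inference.

    NOTE: `checkpoint_dir` layout and `checkpoint_file` name are assumed --
    adjust them to whatever the downloaded archive actually contains.
    """
    # Imported lazily so the sketch is readable without fairseq installed.
    from fairseq.models.roberta import RobertaModel

    model = RobertaModel.from_pretrained(checkpoint_dir,
                                         checkpoint_file=checkpoint_file)
    model.eval()  # disable dropout for deterministic predictions
    return model
```

The returned hub interface also exposes `fill_mask(...)` for masked-token prediction directly in fairseq, without converting to the Hugging Face format.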

## Citations

If you use SindBERT in your research, please cite the following paper:

```bibtex
@misc{scheibleschmitt2025sindbertsailorchartingseas,
  title={SindBERT, the Sailor: Charting the Seas of Turkish NLP},
  author={Raphael Scheible-Schmitt and Stefan Schweter},
  year={2025},
  eprint={2510.21364},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
  url={https://arxiv.org/abs/2510.21364},
}
```

## 📜 License

MIT License