TatarNLPWorld/distilbert-tatar-morph

@@ -11,24 +11,37 @@ tags:
 - tatar
 - morphology
 - token-classification
-- bert
 ---
-# DistilBERT multilingual fine-tuned for Tatar Morphology
-This model is fine-tuned for morphological analysis of Tatar language on a subset of **80k sentences** from the [Tatar Morphological Corpus](https://huggingface.co/datasets/TatarNLPWorld/tatar-morphological-corpus). It predicts fine-grained morphological tags (e.g., `N+Sg+Nom`, `V+PRES(Й)+3SG`).
 ## Performance on Test Set
 | Metric | Value | 95% CI |
 |--------|-------|--------|
 | Token Accuracy | 0.9850 | [0.9841, 0.9860] |
-| Micro F1 | 0.9850 | - |
-| Macro F1 | 0.4324 | - |
 ### Accuracy by Part of Speech (Top 10)
-No POS‑wise accuracy data available.
 ## Usage
@@ -50,6 +63,7 @@ import json
 with open("id2tag.json", "r") as f:
     id2tag = json.load(f)
 word_ids = inputs.word_ids()
 prev_word = None
 for idx, word_idx in enumerate(word_ids):
@@ -59,14 +73,30 @@ for idx, word_idx in enumerate(word_ids):
     prev_word = word_idx
 ```
-## Citation
-If you use this model, please cite our paper:
 ```
-@article{arabov2026scaling,
-  author = {Arabov, M. K. and Gilmullin, R. A. and Burnashev, R. A.},
-  title = {Scaling Multilingual Transformers for Low‑Resource Agglutinative Languages: A Benchmark of State‑of‑the‑Art Models on Tatar Morphological Analysis},
-  journal = {…},
-  year = {2026}
 }
 ```

 - tatar
 - morphology
 - token-classification
+- distilbert
 ---
+# DistilBERT multilingual fine-tuned for Tatar Morphological Analysis
+This model is a fine-tuned version of [`distilbert-base-multilingual-cased`](https://huggingface.co/distilbert-base-multilingual-cased) for morphological analysis of the Tatar language. It was trained on a subset of **80,000 sentences** from the [Tatar Morphological Corpus](https://huggingface.co/datasets/TatarNLPWorld/tatar-morphological-corpus). The model predicts fine-grained morphological tags (e.g., `N+Sg+Nom`, `V+PRES(Й)+3SG`).
 ## Performance on Test Set
 | Metric | Value | 95% CI |
 |--------|-------|--------|
 | Token Accuracy | 0.9850 | [0.9841, 0.9860] |
+| Micro F1 | 0.9851 | [0.9841, 0.9860] |
+| Macro F1 | 0.4324 | [0.4744, 0.5093]* |
+*Note: the macro F1 confidence interval is the one reported in the paper.
 ### Accuracy by Part of Speech (Top 10)
+| POS | Accuracy |
+|-----|----------|
+| PUNCT | 1.0000 |
+| NOUN | 0.9836 |
+| VERB | 0.9535 |
+| ADJ | 0.9626 |
+| PRON | 0.9896 |
+| PART | 0.9973 |
+| PROPN | 0.9754 |
+| ADP | 1.0000 |
+| CCONJ | 1.0000 |
+| ADV | 0.9845 |
 ## Usage
 with open("id2tag.json", "r") as f:
     id2tag = json.load(f)
+# Convert predictions to tags
 word_ids = inputs.word_ids()
 prev_word = None
 for idx, word_idx in enumerate(word_ids):
     prev_word = word_idx
 ```
+Expected output (approximately):
 ```
+Татар -> N+Sg+Nom
+теле -> N+Sg+POSS_3(СЫ)+Nom
+бик -> Adv
+бай -> Adj
+. -> PUNCT
+```
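The hunks above elide the body of the decoding loop, so the following is a minimal sketch of one common way to complete it, not necessarily the card's actual code. It assumes that the prediction for a word is read from its first subword token and that `id2tag.json` maps stringified label ids to tags (JSON object keys are always strings after `json.load`). The function name `first_subword_tags` and all sample inputs, including the placeholder tag `X`, are hypothetical.

```python
from typing import Dict, List, Optional


def first_subword_tags(
    word_ids: List[Optional[int]],
    pred_ids: List[int],
    id2tag: Dict[str, str],
) -> List[str]:
    """Return one tag per word, taken from the word's first subword token."""
    tags = []
    prev_word = None
    for idx, word_idx in enumerate(word_ids):
        # Special tokens such as [CLS] and [SEP] have no word index (None).
        if word_idx is not None and word_idx != prev_word:
            # Assumption: keys are strings because the mapping came from JSON.
            tags.append(id2tag[str(pred_ids[idx])])
        prev_word = word_idx
    return tags


# Hypothetical inputs standing in for inputs.word_ids() and the argmax of the logits.
words = ["Татар", "теле", "бик", "бай", "."]
word_ids = [None, 0, 0, 1, 2, 3, 4, None]  # "Татар" is split into two subwords here
pred_ids = [0, 1, 1, 2, 3, 4, 5, 0]
id2tag = {
    "0": "X",  # placeholder tag, not from the real tag set
    "1": "N+Sg+Nom",
    "2": "N+Sg+POSS_3(СЫ)+Nom",
    "3": "Adv",
    "4": "Adj",
    "5": "PUNCT",
}

tags = first_subword_tags(word_ids, pred_ids, id2tag)
for word, tag in zip(words, tags):
    print(f"{word} -> {tag}")
```

Reading the tag from the first subword keeps one prediction per word; other strategies, such as voting over all subwords of a word, are possible but are not implied by the card.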
+## Citation
+If you use this model, please cite it as:
+```bibtex
+@misc{arabov-distilbert-tatar-morph-2026,
+  title = {DistilBERT multilingual fine-tuned for Tatar Morphological Analysis},
+  author = {Arabov, Mullosharaf Kurbonovich},
+  year = {2026},
+  publisher = {Hugging Face},
+  url = {https://huggingface.co/TatarNLPWorld/distilbert-tatar-morph}
 }
 ```
+## License
+Apache 2.0