TatarNLPWorld/mbert-tatar-morph

@@ -11,24 +11,37 @@ tags:
 - tatar
 - morphology
 - token-classification
-- bert
 ---
-# Multilingual BERT (mBERT) fine-tuned for Tatar Morphology
-This model is fine-tuned for morphological analysis of Tatar language on a subset of **80k sentences** from the [Tatar Morphological Corpus](https://huggingface.co/datasets/TatarNLPWorld/tatar-morphological-corpus). It predicts fine-grained morphological tags (e.g., `N+Sg+Nom`, `V+PRES(Й)+3SG`).
 ## Performance on Test Set
 | Metric | Value | 95% CI |
 |--------|-------|--------|
 | Token Accuracy | 0.9905 | [0.9898, 0.9913] |
-| Micro F1 | 0.9905 | - |
-| Macro F1 | 0.5563 | - |
 ### Accuracy by Part of Speech (Top 10)
-No POS‑wise accuracy data available.
 ## Usage
@@ -50,6 +63,7 @@ import json
 with open("id2tag.json", "r") as f:
     id2tag = json.load(f)
 word_ids = inputs.word_ids()
 prev_word = None
 for idx, word_idx in enumerate(word_ids):
@@ -59,14 +73,30 @@ for idx, word_idx in enumerate(word_ids):
     prev_word = word_idx
 ```
-## Citation
-If you use this model, please cite our paper:
 ```
-@article{arabov2026scaling,
-  author = {Arabov, M. K. and Gilmullin, R. A. and Burnashev, R. A.},
-  title = {Scaling Multilingual Transformers for Low‑Resource Agglutinative Languages: A Benchmark of State‑of‑the‑Art Models on Tatar Morphological Analysis},
-  journal = {…},
-  year = {2026}
 }
 ```

 - tatar
 - morphology
 - token-classification
+- mbert
 ---
+# Multilingual BERT (mBERT) fine-tuned for Tatar Morphological Analysis
+This model is a fine-tuned version of [`bert-base-multilingual-cased`](https://huggingface.co/bert-base-multilingual-cased) for morphological analysis of the Tatar language. It was trained on a subset of **80,000 sentences** from the [Tatar Morphological Corpus](https://huggingface.co/datasets/TatarNLPWorld/tatar-morphological-corpus). The model predicts fine-grained morphological tags (e.g., `N+Sg+Nom`, `V+PRES(Й)+3SG`).
 ## Performance on Test Set
 | Metric | Value | 95% CI |
 |--------|-------|--------|
 | Token Accuracy | 0.9905 | [0.9898, 0.9913] |
+| Micro F1 | 0.9905 | [0.9897, 0.9913] |
+| Macro F1 | 0.5563 | [0.5954, 0.6387]* |
+*Note: the macro F1 confidence interval is the one reported in the paper; it does not contain the point estimate listed above.
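A large gap between micro and macro F1 is common when a tag set contains many rare tags: micro F1 pools all tokens, while macro F1 averages the per-tag F1 scores, so a rarely seen tag weighs as much as a frequent one. The following is a minimal sketch in plain Python with made-up toy labels (not data from this model). The helper name `f1_scores` is hypothetical, and averaging over the union of gold and predicted tags is an assumption, not necessarily how the reported scores were computed.

```python
from collections import Counter


def f1_scores(gold, pred):
    """Return (micro_f1, macro_f1) for single-label token tagging."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for g, p in zip(gold, pred):
        if g == p:
            tp[g] += 1
        else:
            fp[p] += 1  # predicted tag p was wrong
            fn[g] += 1  # gold tag g was missed

    # Assumption: macro F1 averages over every tag seen in gold or pred.
    labels = sorted(set(gold) | set(pred))
    per_tag = []
    for tag in labels:
        denom = 2 * tp[tag] + fp[tag] + fn[tag]
        per_tag.append(2 * tp[tag] / denom if denom else 0.0)
    macro = sum(per_tag) / len(per_tag)

    total_tp = sum(tp.values())
    micro = 2 * total_tp / (2 * total_tp + sum(fp.values()) + sum(fn.values()))
    return micro, macro


# Toy example: one frequent tag predicted everywhere, two rare tags always missed.
gold = ["N+Sg+Nom"] * 98 + ["Adv", "Adj"]
pred = ["N+Sg+Nom"] * 100
micro, macro = f1_scores(gold, pred)
print(f"micro={micro:.4f} macro={macro:.4f}")  # micro=0.9800 macro=0.3300
```

In this toy case two missed rare tags barely move micro F1 but pull macro F1 down to about a third, which is the same kind of effect visible in the table above.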
 ### Accuracy by Part of Speech (Top 10)
+| POS | Accuracy |
+|-----|----------|
+| PUNCT | 1.0000 |
+| NOUN | 0.9905 |
+| VERB | 0.9718 |
+| ADJ | 0.9718 |
+| PRON | 0.9918 |
+| PART | 0.9986 |
+| PROPN | 0.9779 |
+| ADP | 1.0000 |
+| CCONJ | 1.0000 |
+| ADV | 0.9948 |
 ## Usage
 with open("id2tag.json", "r") as f:
     id2tag = json.load(f)
+# Convert predictions to tags
 word_ids = inputs.word_ids()
 prev_word = None
 for idx, word_idx in enumerate(word_ids):
     prev_word = word_idx
 ```
+Expected output (approximately):
 ```
+Татар -> N+Sg+Nom
+теле -> N+Sg+POSS_3(СЫ)+Nom
+бик -> Adv
+бай -> Adj
+. -> PUNCT
+```
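The loop in the usage snippet walks over `word_ids` and tracks `prev_word`, which suggests that one tag is kept per word, taken from its first subword. The sketch below isolates that alignment step without loading the model. The helper name `first_subword_tags` and the toy inputs are hypothetical, and the first-subword rule is an assumption about the elided loop body. It also assumes `id2tag` was loaded with `json.load`, in which case its keys are strings rather than integers.

```python
def first_subword_tags(word_ids, pred_ids, id2tag):
    """Keep one tag per word: the prediction at the word's first subword.

    word_ids: per-token word index, None for special tokens
    pred_ids: per-token predicted label ids (same length as word_ids)
    id2tag:   mapping from label id (as a string, as after json.load) to tag
    """
    tags = []
    prev_word = None
    for idx, word_idx in enumerate(word_ids):
        # Skip special tokens (None) and non-initial subwords of a word.
        if word_idx is not None and word_idx != prev_word:
            tags.append(id2tag[str(pred_ids[idx])])
        prev_word = word_idx
    return tags


# Toy inputs: [CLS], word 0, word 1 split into two subwords, word 2, [SEP].
word_ids = [None, 0, 1, 1, 2, None]
pred_ids = [0, 1, 2, 0, 3, 0]
id2tag = {"0": "PUNCT", "1": "N+Sg+Nom", "2": "Adv", "3": "Adj"}

tags = first_subword_tags(word_ids, pred_ids, id2tag)
print(tags)  # ['N+Sg+Nom', 'Adv', 'Adj']
```

With a real tokenizer output, `word_ids` would come from `inputs.word_ids()` as in the snippet above, and `pred_ids` from the argmax over the model logits.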
+## Citation
+If you use this model, please cite it as:
+```bibtex
+@misc{arabov-mbert-tatar-morph-2026,
+  title = {Multilingual BERT (mBERT) fine-tuned for Tatar Morphological Analysis},
+  author = {Arabov, Mullosharaf Kurbonovich},
+  year = {2026},
+  publisher = {Hugging Face},
+  url = {https://huggingface.co/TatarNLPWorld/mbert-tatar-morph}
 }
 ```
+## License
+Apache 2.0