Commit 4cded35 by ArabovMK · verified · 1 parent: cbd649b

Update README.md

Files changed (1): README.md (+43 -13)
@@ -11,24 +11,37 @@ tags:
  - tatar
  - morphology
  - token-classification
- - bert
  ---

- # Multilingual BERT (mBERT) fine-tuned for Tatar Morphology

- This model is fine-tuned for morphological analysis of Tatar language on a subset of **80k sentences** from the [Tatar Morphological Corpus](https://huggingface.co/datasets/TatarNLPWorld/tatar-morphological-corpus). It predicts fine-grained morphological tags (e.g., `N+Sg+Nom`, `V+PRES(Й)+3SG`).

  ## Performance on Test Set

  | Metric | Value | 95% CI |
  |--------|-------|--------|
  | Token Accuracy | 0.9905 | [0.9898, 0.9913] |
- | Micro F1 | 0.9905 | - |
- | Macro F1 | 0.5563 | - |

  ### Accuracy by Part of Speech (Top 10)

- No POS‑wise accuracy data available.

  ## Usage

@@ -50,6 +63,7 @@ import json
  with open("id2tag.json", "r") as f:
      id2tag = json.load(f)

  word_ids = inputs.word_ids()
  prev_word = None
  for idx, word_idx in enumerate(word_ids):
@@ -59,14 +73,30 @@ for idx, word_idx in enumerate(word_ids):
      prev_word = word_idx
  ```

- ## Citation
- If you use this model, please cite our paper:

  ```
- @article{arabov2026scaling,
- author = {Arabov, M. K. and Gilmullin, R. A. and Burnashev, R. A.},
- title = {Scaling Multilingual Transformers for Low‑Resource Agglutinative Languages: A Benchmark of State‑of‑the‑Art Models on Tatar Morphological Analysis},
- journal = {…},
- year = {2026}
  }
  ```

@@ -11,24 +11,37 @@ tags:
  - tatar
  - morphology
  - token-classification
+ - mbert
  ---

+ # Multilingual BERT (mBERT) fine-tuned for Tatar Morphological Analysis

+ This model is a fine-tuned version of [`bert-base-multilingual-cased`](https://huggingface.co/bert-base-multilingual-cased) for morphological analysis of the Tatar language. It was trained on a subset of **80,000 sentences** from the [Tatar Morphological Corpus](https://huggingface.co/datasets/TatarNLPWorld/tatar-morphological-corpus). The model predicts fine-grained morphological tags (e.g., `N+Sg+Nom`, `V+PRES(Й)+3SG`).

  ## Performance on Test Set

  | Metric | Value | 95% CI |
  |--------|-------|--------|
  | Token Accuracy | 0.9905 | [0.9898, 0.9913] |
+ | Micro F1 | 0.9905 | [0.9897, 0.9913] |
+ | Macro F1 | 0.5563 | [0.5954, 0.6387]* |
+
+ *Note: macro F1 CI as reported in the paper.
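
The gap between micro F1 (0.9905) and macro F1 (0.5563) in the table is what one would expect when a few frequent tags dominate the test set and many rare tags are predicted poorly. The sketch below computes both averages from scratch on made-up labels to show the effect. The function name `micro_macro_f1`, the tags `RARE_A` and `RARE_B`, and the counts are illustrative assumptions, not the model's evaluation code or corpus data.

```python
from collections import Counter


def micro_macro_f1(gold, pred):
    """Return (micro_f1, macro_f1) for single-label token tagging."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for g, p in zip(gold, pred):
        if g == p:
            tp[g] += 1
        else:
            fp[p] += 1  # predicted tag gains a false positive
            fn[g] += 1  # gold tag gains a false negative

    # Macro F1: unweighted mean of per-tag F1, so rare tags count as much as frequent ones.
    labels = set(gold) | set(pred)
    per_label = []
    for lab in labels:
        denom = 2 * tp[lab] + fp[lab] + fn[lab]
        per_label.append(2 * tp[lab] / denom if denom else 0.0)
    macro = sum(per_label) / len(per_label)

    # Micro F1: pool all counts first; for single-label tagging this equals token accuracy.
    total_tp = sum(tp.values())
    total_err = sum(fp.values()) + sum(fn.values())
    micro = 2 * total_tp / (2 * total_tp + total_err)
    return micro, macro


# Hypothetical toy data: one frequent tag that is always right,
# and two rare tags that are each mislabeled as the frequent one.
gold = ["N+Sg+Nom"] * 98 + ["RARE_A", "RARE_B"]
pred = ["N+Sg+Nom"] * 100

micro, macro = micro_macro_f1(gold, pred)
print(f"micro F1 = {micro:.4f}, macro F1 = {macro:.4f}")
```

In this toy case only 2 of 100 tokens are wrong, so micro F1 stays at 0.98, while the two missed rare tags each contribute a per-tag F1 of zero and pull the macro average down to roughly 0.33.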

  ### Accuracy by Part of Speech (Top 10)

+ | POS | Accuracy |
+ |-----|----------|
+ | PUNCT | 1.0000 |
+ | NOUN | 0.9905 |
+ | VERB | 0.9718 |
+ | ADJ | 0.9718 |
+ | PRON | 0.9918 |
+ | PART | 0.9986 |
+ | PROPN | 0.9779 |
+ | ADP | 1.0000 |
+ | CCONJ | 1.0000 |
+ | ADV | 0.9948 |

  ## Usage

@@ -50,6 +63,7 @@ import json
  with open("id2tag.json", "r") as f:
      id2tag = json.load(f)

+ # Convert predictions to tags
  word_ids = inputs.word_ids()
  prev_word = None
  for idx, word_idx in enumerate(word_ids):
@@ -59,14 +73,30 @@ for idx, word_idx in enumerate(word_ids):
      prev_word = word_idx
  ```

+ Expected output (approximately):

  ```
+ Татар -> N+Sg+Nom
+ теле -> N+Sg+POSS_3(СЫ)+Nom
+ бик -> Adv
+ бай -> Adj
+ . -> PUNCT
+ ```
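
The loop in the usage snippet is only partly visible in this diff. The following is a minimal sketch of the first-subword convention it appears to follow: each word receives the tag predicted for its first subword token, and special tokens (whose word index is `None`) are skipped. The helper `first_subword_tags`, the subword split, the prediction ids, and the `id2tag` contents are illustrative assumptions. In real use, `word_ids` would come from `inputs.word_ids()` and the prediction ids from an argmax over the model's logits.

```python
def first_subword_tags(words, word_ids, pred_ids, id2tag):
    """Map per-token predictions to one tag per word (first subword wins)."""
    tagged = []
    prev_word = None
    for token_pos, word_idx in enumerate(word_ids):
        # None marks special tokens such as [CLS] and [SEP].
        # A repeated word index marks a continuation subword, which is skipped.
        if word_idx is not None and word_idx != prev_word:
            # json.load yields string keys, so the integer id is converted.
            tag = id2tag[str(pred_ids[token_pos])]
            tagged.append((words[word_idx], tag))
        prev_word = word_idx
    return tagged


# Hypothetical inputs standing in for tokenizer and model output.
words = ["Татар", "теле", "."]
word_ids = [None, 0, 0, 1, 2, None]  # assumes "Татар" is split into two subwords
pred_ids = [0, 1, 1, 2, 3, 0]
id2tag = {"0": "O", "1": "N+Sg+Nom", "2": "N+Sg+POSS_3(СЫ)+Nom", "3": "PUNCT"}

tagged = first_subword_tags(words, word_ids, pred_ids, id2tag)
for word, tag in tagged:
    print(f"{word} -> {tag}")
```

Taking the first subword is one common choice. Averaging logits over all subwords of a word is an alternative, but it would not match the `prev_word` bookkeeping shown in the README.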
+
+ ## Citation
+
+ If you use this model, please cite it as:
+
+ ```bibtex
+ @misc{arabov-mbert-tatar-morph-2026,
+ title = {Multilingual BERT (mBERT) fine-tuned for Tatar Morphological Analysis},
+ author = {Arabov Mullosharaf Kurbonovich},
+ year = {2026},
+ publisher = {Hugging Face},
+ url = {https://huggingface.co/TatarNLPWorld/mbert-tatar-morph}
  }
  ```
+
+ ## License
+
+ Apache 2.0