Commit 2ec79f0 by ArabovMK (verified)
1 Parent(s): 964380b

Update README.md

Files changed (1)
  1. README.md +43 -13
README.md CHANGED
@@ -11,24 +11,37 @@ tags:
  - tatar
  - morphology
  - token-classification
- - bert
  ---

- # DistilBERT multilingual fine-tuned for Tatar Morphology

- This model is fine-tuned for morphological analysis of Tatar language on a subset of **80k sentences** from the [Tatar Morphological Corpus](https://huggingface.co/datasets/TatarNLPWorld/tatar-morphological-corpus). It predicts fine-grained morphological tags (e.g., `N+Sg+Nom`, `V+PRES(Й)+3SG`).

  ## Performance on Test Set

  | Metric | Value | 95% CI |
  |--------|-------|--------|
  | Token Accuracy | 0.9850 | [0.9841, 0.9860] |
- | Micro F1 | 0.9850 | - |
- | Macro F1 | 0.4324 | - |

  ### Accuracy by Part of Speech (Top 10)

- No POS‑wise accuracy data available.

  ## Usage

@@ -50,6 +63,7 @@ import json
  with open("id2tag.json", "r") as f:
      id2tag = json.load(f)

  word_ids = inputs.word_ids()
  prev_word = None
  for idx, word_idx in enumerate(word_ids):
@@ -59,14 +73,30 @@ for idx, word_idx in enumerate(word_ids):
      prev_word = word_idx
  ```

- ## Citation
- If you use this model, please cite our paper:

  ```
- @article{arabov2026scaling,
- author = {Arabov, M. K. and Gilmullin, R. A. and Burnashev, R. A.},
- title = {Scaling Multilingual Transformers for Low‑Resource Agglutinative Languages: A Benchmark of State‑of‑the‑Art Models on Tatar Morphological Analysis},
- journal = {…},
- year = {2026}
  }
  ```

@@ -11,24 +11,37 @@ tags:
  - tatar
  - morphology
  - token-classification
+ - distilbert
  ---

+ # DistilBERT multilingual fine-tuned for Tatar Morphological Analysis

+ This model is a fine-tuned version of [`distilbert-base-multilingual-cased`](https://huggingface.co/distilbert-base-multilingual-cased) for morphological analysis of the Tatar language. It was trained on a subset of **80,000 sentences** from the [Tatar Morphological Corpus](https://huggingface.co/datasets/TatarNLPWorld/tatar-morphological-corpus). The model predicts fine-grained morphological tags (e.g., `N+Sg+Nom`, `V+PRES(Й)+3SG`).

  ## Performance on Test Set

  | Metric | Value | 95% CI |
  |--------|-------|--------|
  | Token Accuracy | 0.9850 | [0.9841, 0.9860] |
+ | Micro F1 | 0.9851 | [0.9841, 0.9860] |
+ | Macro F1 | 0.4324 | [0.4744, 0.5093]* |
+
+ *Note: macro F1 CI as reported in the paper.
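
The gap between micro F1 (0.9851) and macro F1 (0.4324) in the table is consistent with a large tag set containing many rare tags: micro F1 pools every token, while macro F1 averages per-tag scores, so a rare tag weighs as much as a frequent one. The sketch below illustrates that effect on toy data. The function name `f1_scores` and the tags `RARE_A` and `RARE_B` are hypothetical, and this is not the evaluation code behind the reported numbers.

```python
from collections import Counter


def f1_scores(gold, pred):
    """Return (micro_f1, macro_f1) for single-label token tagging."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for g, p in zip(gold, pred):
        if g == p:
            tp[g] += 1
        else:
            fp[p] += 1  # predicted tag gains a false positive
            fn[g] += 1  # gold tag gains a false negative
    labels = set(gold) | set(pred)

    def f1(t, f_pos, f_neg):
        denom = 2 * t + f_pos + f_neg
        return 2 * t / denom if denom else 0.0

    # Micro: pool the counts over all tags, then compute one F1.
    micro = f1(sum(tp.values()), sum(fp.values()), sum(fn.values()))
    # Macro: compute F1 per tag, then take the unweighted mean.
    macro = sum(f1(tp[l], fp[l], fn[l]) for l in labels) / len(labels)
    return micro, macro


# Toy data (hypothetical): 98 tokens of a frequent tag, plus two rare tags
# that the model mislabels as the frequent one.
gold = ["N+Sg+Nom"] * 98 + ["RARE_A", "RARE_B"]
pred = ["N+Sg+Nom"] * 100
micro, macro = f1_scores(gold, pred)
print(f"micro F1 = {micro:.4f}, macro F1 = {macro:.4f}")
```

Two mislabeled tokens out of 100 barely move micro F1, but each rare tag contributes a per-tag F1 of zero to the macro average.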

  ### Accuracy by Part of Speech (Top 10)

+ | POS | Accuracy |
+ |-----|----------|
+ | PUNCT | 1.0000 |
+ | NOUN | 0.9836 |
+ | VERB | 0.9535 |
+ | ADJ | 0.9626 |
+ | PRON | 0.9896 |
+ | PART | 0.9973 |
+ | PROPN | 0.9754 |
+ | ADP | 1.0000 |
+ | CCONJ | 1.0000 |
+ | ADV | 0.9845 |

  ## Usage

@@ -50,6 +63,7 @@ import json
  with open("id2tag.json", "r") as f:
      id2tag = json.load(f)

+ # Convert predictions to tags
  word_ids = inputs.word_ids()
  prev_word = None
  for idx, word_idx in enumerate(word_ids):
@@ -59,14 +73,30 @@ for idx, word_idx in enumerate(word_ids):
      prev_word = word_idx
  ```

+ Expected output (approximately):

  ```
+ Татар -> N+Sg+Nom
+ теле -> N+Sg+POSS_3(СЫ)+Nom
+ бик -> Adv
+ бай -> Adj
+ . -> PUNCT
+ ```
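
The diff elides the body of the `for idx, word_idx in enumerate(word_ids)` loop between hunks. A minimal sketch of one common way to write it follows, assuming each word takes the prediction of its first sub-token; the README's actual loop body is not shown and may differ. The helper name `word_level_tags`, the `pred_ids` list, and the toy `id2tag` mapping are hypothetical stand-ins for the tokenizer and model output.

```python
def word_level_tags(word_ids, pred_ids, id2tag):
    """Keep the predicted tag of the first sub-token of each word."""
    tags = []
    prev_word = None
    for idx, word_idx in enumerate(word_ids):
        # Special tokens such as [CLS] and [SEP] have a word index of None.
        if word_idx is not None and word_idx != prev_word:
            # json.load yields string keys, so convert the integer class id.
            tags.append(id2tag[str(pred_ids[idx])])
        prev_word = word_idx
    return tags


# Hypothetical values: word 0 is split into two sub-tokens.
word_ids = [None, 0, 0, 1, 2, None]
pred_ids = [0, 1, 3, 2, 4, 0]
id2tag = {"0": "PAD", "1": "N+Sg+Nom", "2": "Adv", "3": "Adj", "4": "PUNCT"}
tags = word_level_tags(word_ids, pred_ids, id2tag)
print(tags)
```

The `str(...)` conversion matters because `id2tag.json` is read with `json.load`, which always produces string keys, while predicted class ids are integers.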
+
+ ## Citation
+
+ If you use this model, please cite it as:
+
+ ```bibtex
+ @misc{arabov-distilbert-tatar-morph-2026,
+ title = {DistilBERT multilingual fine-tuned for Tatar Morphological Analysis},
+ author = {Arabov Mullosharaf Kurbonovich},
+ year = {2026},
+ publisher = {Hugging Face},
+ url = {https://huggingface.co/TatarNLPWorld/distilbert-tatar-morph}
  }
  ```
+
+ ## License
+
+ Apache 2.0