Update README.md
README.md CHANGED
@@ -11,24 +11,37 @@ tags:
  - tatar
  - morphology
  - token-classification
- -
  ---

- # DistilBERT multilingual fine-tuned for Tatar

- This model is fine-tuned for morphological analysis of Tatar language on a subset of **

  ## Performance on Test Set

  | Metric | Value | 95% CI |
  |--------|-------|--------|
  | Token Accuracy | 0.9850 | [0.9841, 0.9860] |
- | Micro F1 | 0.
- | Macro F1 | 0.4324 |

  ### Accuracy by Part of Speech (Top 10)

-

  ## Usage

@@ -50,6 +63,7 @@ import json
  with open("id2tag.json", "r") as f:
      id2tag = json.load(f)

  word_ids = inputs.word_ids()
  prev_word = None
  for idx, word_idx in enumerate(word_ids):
@@ -59,14 +73,30 @@ for idx, word_idx in enumerate(word_ids):
      prev_word = word_idx
  ```

-
- If you use this model, please cite our paper:

  ```
-
-
-
-
-
  }
  ```
@@ -11,24 +11,37 @@ tags:
  - tatar
  - morphology
  - token-classification
+ - distilbert
  ---

+ # DistilBERT multilingual fine-tuned for Tatar Morphological Analysis

+ This model is a fine-tuned version of [`distilbert-base-multilingual-cased`](https://huggingface.co/distilbert-base-multilingual-cased) for morphological analysis of the Tatar language. It was trained on a subset of **80,000 sentences** from the [Tatar Morphological Corpus](https://huggingface.co/datasets/TatarNLPWorld/tatar-morphological-corpus). The model predicts fine-grained morphological tags (e.g., `N+Sg+Nom`, `V+PRES(Й)+3SG`).

  ## Performance on Test Set

  | Metric | Value | 95% CI |
  |--------|-------|--------|
  | Token Accuracy | 0.9850 | [0.9841, 0.9860] |
+ | Micro F1 | 0.9851 | [0.9841, 0.9860] |
+ | Macro F1 | 0.4324 | [0.4744, 0.5093]* |
+
+ *Note: macro F1 CI as reported in the paper.
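
A token-level score near 0.985 next to a macro F1 of 0.4324 is a common pattern when a few frequent tags dominate and many rare tags are rarely predicted correctly. The README does not say how the scores were computed, so the following is only a minimal sketch of the two averaging schemes, not the author's evaluation script. The toy data, the tag names `RARE_A` and `RARE_B`, and the choice to average over every label seen in either the gold or predicted sequence are all assumptions made for illustration.

```python
from collections import Counter


def micro_macro_f1(gold, pred):
    """Return (micro_f1, macro_f1) for single-label tag sequences."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for g, p in zip(gold, pred):
        if g == p:
            tp[g] += 1
        else:
            fp[p] += 1  # predicted tag gets a false positive
            fn[g] += 1  # gold tag gets a false negative

    def f1(t, p_, n):
        denom = 2 * t + p_ + n
        return 2 * t / denom if denom else 0.0

    # Assumption: macro averaging runs over labels seen in gold or pred.
    labels = set(gold) | set(pred)
    micro = f1(sum(tp.values()), sum(fp.values()), sum(fn.values()))
    macro = sum(f1(tp[l], fp[l], fn[l]) for l in labels) / len(labels)
    return micro, macro


# Made-up data: one frequent tag predicted well, two rare tags always missed.
gold = ["N+Sg+Nom"] * 98 + ["RARE_A", "RARE_B"]
pred = ["N+Sg+Nom"] * 100
micro, macro = micro_macro_f1(gold, pred)
print(f"micro F1 = {micro:.4f}, macro F1 = {macro:.4f}")
```

Micro averaging pools all tokens, so the frequent tag carries the score. Macro averaging gives each tag equal weight, so two missed rare tags pull the average down sharply.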

  ### Accuracy by Part of Speech (Top 10)

+ | POS | Accuracy |
+ |-----|----------|
+ | PUNCT | 1.0000 |
+ | NOUN | 0.9836 |
+ | VERB | 0.9535 |
+ | ADJ | 0.9626 |
+ | PRON | 0.9896 |
+ | PART | 0.9973 |
+ | PROPN | 0.9754 |
+ | ADP | 1.0000 |
+ | CCONJ | 1.0000 |
+ | ADV | 0.9845 |
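
A per-category breakdown like the one above can be produced by grouping token-level hits by the category of the gold tag. The README does not describe how its tags map to the POS labels in the table, so this sketch takes the grouping function as a parameter. The helper `leading_category`, which uses the part of the tag before the first `+`, and the tag `N+Pl+Nom` are hypothetical and serve only as an example.

```python
from collections import defaultdict


def accuracy_by_pos(gold_tags, pred_tags, pos_of):
    """Exact-match tag accuracy, grouped by the category of the gold tag."""
    correct, total = defaultdict(int), defaultdict(int)
    for g, p in zip(gold_tags, pred_tags):
        pos = pos_of(g)
        total[pos] += 1
        correct[pos] += int(g == p)
    return {pos: correct[pos] / total[pos] for pos in total}


def leading_category(tag):
    # Assumption: the component before the first "+" names the word class.
    return tag.split("+", 1)[0]


# Made-up gold and predicted tags for four tokens.
gold = ["N+Sg+Nom", "N+Sg+Nom", "V+PRES(Й)+3SG", "Adv"]
pred = ["N+Sg+Nom", "N+Pl+Nom", "V+PRES(Й)+3SG", "Adv"]
per_pos = accuracy_by_pos(gold, pred, leading_category)
print(per_pos)
```

Grouping by the gold tag rather than the predicted one keeps each row's denominator equal to the number of test tokens in that category.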

  ## Usage

@@ -50,6 +63,7 @@ import json
  with open("id2tag.json", "r") as f:
      id2tag = json.load(f)

+ # Convert predictions to tags
  word_ids = inputs.word_ids()
  prev_word = None
  for idx, word_idx in enumerate(word_ids):
@@ -59,14 +73,30 @@ for idx, word_idx in enumerate(word_ids):
      prev_word = word_idx
  ```

+ Expected output (approximately):

  ```
+ Татар -> N+Sg+Nom
+ теле -> N+Sg+POSS_3(СЫ)+Nom
+ бик -> Adv
+ бай -> Adj
+ . -> PUNCT
+ ```
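
The diff hides the body of the `for` loop, so the exact logic is not visible here. A common way to get one tag per word from `inputs.word_ids()` is to keep only the prediction at each word's first sub-token. The sketch below shows that pattern without any model or tokenizer. The function name `first_subword_tags` and all ids and labels in the example are hypothetical, and the real README code may differ.

```python
def first_subword_tags(word_ids, pred_ids, id2tag):
    """Pick one tag per word: the prediction at the word's first sub-token."""
    tags = []
    prev_word = None
    for idx, word_idx in enumerate(word_ids):
        # None marks special tokens such as [CLS] and [SEP].
        if word_idx is not None and word_idx != prev_word:
            # json.load returns string keys, so look the id up as a string.
            tags.append((word_idx, id2tag[str(pred_ids[idx])]))
        prev_word = word_idx
    return tags


# Made-up values: word 0 is split into two sub-tokens, words 1 and 2 are not.
word_ids = [None, 0, 0, 1, 2, None]
pred_ids = [0, 3, 7, 5, 1, 0]
id2tag = {"0": "PAD", "1": "PUNCT", "3": "N+Sg+Nom", "5": "Adj", "7": "X"}
word_tags = first_subword_tags(word_ids, pred_ids, id2tag)
print(word_tags)
```

The `str(...)` lookup matters because a mapping loaded from `id2tag.json` has string keys, while predicted class ids are integers.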
+
+ ## Citation
+
+ If you use this model, please cite it as:
+
+ ```bibtex
+ @misc{arabov-distilbert-tatar-morph-2026,
+   title = {DistilBERT multilingual fine-tuned for Tatar Morphological Analysis},
+   author = {Arabov Mullosharaf Kurbonovich},
+   year = {2026},
+   publisher = {Hugging Face},
+   url = {https://huggingface.co/TatarNLPWorld/distilbert-tatar-morph}
  }
  ```
+
+ ## License
+
+ Apache 2.0