---
language: tt
license: apache-2.0
datasets:
- TatarNLPWorld/tatar-morphological-corpus
metrics:
- accuracy
- f1
pipeline_tag: token-classification
tags:
- tatar
- morphology
- token-classification
- distilbert
---

# DistilBERT multilingual fine-tuned for Tatar Morphological Analysis

This model is a fine-tuned version of [`distilbert-base-multilingual-cased`](https://huggingface.co/distilbert-base-multilingual-cased) for morphological analysis of the Tatar language. It was trained on a subset of **80,000 sentences** from the [Tatar Morphological Corpus](https://huggingface.co/datasets/TatarNLPWorld/tatar-morphological-corpus). The model predicts fine-grained morphological tags (e.g., `N+Sg+Nom`, `V+PRES(Й)+3SG`).

## Performance on Test Set

| Metric | Value | 95% CI |
|--------|-------|--------|
| Token Accuracy | 0.9850 | [0.9841, 0.9860] |
| Micro F1 | 0.9851 | [0.9841, 0.9860] |
| Macro F1 | 0.4324 | [0.4744, 0.5093]* |

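*Note: the macro F1 confidence interval is quoted as reported in the paper and does not contain the point estimate listed here. Macro F1 averages per-tag F1 with equal weight, so it sits far below micro F1 when rare tags are presumably predicted poorly.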

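The gap between micro and macro F1 in the table above can be illustrated with a small, self-contained sketch. The data below is invented for illustration, and the averaging convention (macro over every tag that appears in the gold or predicted labels) is an assumption; the evaluation script behind the reported numbers may differ.

```python
from collections import Counter


def f1_scores(gold, pred):
    """Return (micro_f1, macro_f1) for single-label token tagging."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for g, p in zip(gold, pred):
        if g == p:
            tp[g] += 1
        else:
            fp[p] += 1  # predicted tag gets a false positive
            fn[g] += 1  # gold tag gets a false negative
    labels = set(gold) | set(pred)

    def f1(t, f_pos, f_neg):
        denom = 2 * t + f_pos + f_neg
        return 2 * t / denom if denom else 0.0

    # Micro: pool counts over all tags, then compute one F1
    micro = f1(sum(tp.values()), sum(fp.values()), sum(fn.values()))
    # Macro: compute F1 per tag, then take the unweighted mean
    macro = sum(f1(tp[l], fp[l], fn[l]) for l in labels) / len(labels)
    return micro, macro


# Toy data (invented): one frequent tag and two rare tags that are always missed
gold = ["N+Sg+Nom"] * 98 + ["RARE_A", "RARE_B"]
pred = ["N+Sg+Nom"] * 100
micro, macro = f1_scores(gold, pred)
print(f"micro={micro:.4f} macro={macro:.4f}")  # micro=0.9800 macro=0.3300
```

Two missed rare tags barely move the micro score but contribute two zeros to the macro average, which is the pattern a large fine-grained tag set tends to produce.
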
### Accuracy by Part of Speech (Top 10)

| POS | Accuracy |
|-----|----------|
| PUNCT | 1.0000 |
| NOUN | 0.9836 |
| VERB | 0.9535 |
| ADJ | 0.9626 |
| PRON | 0.9896 |
| PART | 0.9973 |
| PROPN | 0.9754 |
| ADP | 1.0000 |
| CCONJ | 1.0000 |
| ADV | 0.9845 |

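A per-POS breakdown like the table above can be computed by grouping tokens by part of speech and counting exact tag matches within each group. This is a minimal sketch with invented data; it assumes a coarse POS label is available for every token alongside the gold and predicted morphological tags, which is not something the model itself outputs.

```python
from collections import defaultdict


def accuracy_by_pos(pos_labels, gold_tags, pred_tags):
    """Return {pos: fraction of tokens whose predicted tag equals the gold tag}."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for pos, gold, pred in zip(pos_labels, gold_tags, pred_tags):
        total[pos] += 1
        correct[pos] += int(gold == pred)
    return {pos: correct[pos] / total[pos] for pos in total}


# Invented example: one of the two verbs is mis-tagged
pos_labels = ["NOUN", "NOUN", "VERB", "VERB", "PUNCT"]
gold_tags = ["N+Sg+Nom", "N+Sg+Nom", "V+PRES(Й)+3SG", "V+PRES(Й)+3SG", "PUNCT"]
pred_tags = ["N+Sg+Nom", "N+Sg+Nom", "V+PRES(Й)+3SG", "N+Sg+Nom", "PUNCT"]

per_pos = accuracy_by_pos(pos_labels, gold_tags, pred_tags)
for pos, acc in per_pos.items():
    print(f"{pos}\t{acc:.4f}")
```
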
## Usage

```python
from transformers import AutoTokenizer, AutoModelForTokenClassification
import torch

model_name = "TatarNLPWorld/distilbert-tatar-morph"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForTokenClassification.from_pretrained(model_name)

# Pre-tokenized input: one list item per word, punctuation as its own token
tokens = ["Татар", "теле", "бик", "бай", "."]
inputs = tokenizer(tokens, is_split_into_words=True, return_tensors="pt", truncation=True)

# Inference only, so skip gradient tracking
with torch.no_grad():
    outputs = model(**inputs)
predictions = torch.argmax(outputs.logits, dim=2)

# Get tag mapping from model config
id2tag = model.config.id2label

# Report one tag per word, taken from the word's first subword piece
word_ids = inputs.word_ids()
prev_word = None
for idx, word_idx in enumerate(word_ids):
    if word_idx is not None and word_idx != prev_word:
        tag_id = predictions[0][idx].item()
        if isinstance(id2tag, dict):
            # Keys may be strings or ints depending on how the config was loaded
            tag = id2tag.get(str(tag_id), id2tag.get(tag_id, "UNK"))
        else:
            tag = id2tag[tag_id] if tag_id < len(id2tag) else "UNK"
        print(tokens[word_idx], "->", tag)
    prev_word = word_idx
```

Expected output (approximately):

```
Татар -> N+Sg+Nom
теле -> N+Sg+POSS_3(СЫ)+Nom
бик -> Adv
бай -> Adj
. -> PUNCT
```
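
The predicted tags are compound strings such as `N+Sg+POSS_3(СЫ)+Nom`. For downstream use it can help to split them into a leading category and a list of features. The helper below is a hypothetical sketch, not part of the model or the corpus tooling; it assumes `+` only ever separates components and that the first component is the word class.

```python
def split_tag(tag):
    """Split a compound tag into (category, [features]).

    Assumption: '+' is used only as a separator and the first
    component is the word class (e.g. 'N', 'V', 'Adv').
    """
    head, *features = tag.split("+")
    return head, features


for tag in ["N+Sg+POSS_3(СЫ)+Nom", "V+PRES(Й)+3SG", "Adv"]:
    category, features = split_tag(tag)
    print(category, features)
```

Tags with no features, such as `Adv` or `PUNCT`, come back with an empty feature list.
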

## Citation

If you use this model, please cite it as:

```bibtex
@misc{arabov-distilbert-tatar-morph-2026,
  title = {DistilBERT multilingual fine-tuned for Tatar Morphological Analysis},
  author = {Arabov, Mullosharaf Kurbonovich},
  year = {2026},
  publisher = {Hugging Face},
  url = {https://huggingface.co/TatarNLPWorld/distilbert-tatar-morph}
}
```

## License

Apache 2.0