	---
	language: tt
	license: apache-2.0
	datasets:
	- TatarNLPWorld/tatar-morphological-corpus
	metrics:
	- accuracy
	- f1
	pipeline_tag: token-classification
	tags:
	- tatar
	- morphology
	- token-classification
	- distilbert
	---

	# DistilBERT multilingual fine-tuned for Tatar Morphological Analysis

	This model is a fine-tuned version of [`distilbert-base-multilingual-cased`](https://huggingface.co/distilbert-base-multilingual-cased) for morphological analysis of the Tatar language. It was trained on a subset of 80,000 sentences from the [Tatar Morphological Corpus](https://huggingface.co/datasets/TatarNLPWorld/tatar-morphological-corpus). The model predicts fine-grained morphological tags (e.g., `N+Sg+Nom`, `V+PRES(Й)+3SG`).
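
As a rough illustration of how such tag strings might be consumed downstream, the sketch below splits a tag on `+` into a leading category and a list of features. The `split_tag` helper is hypothetical, and treating the first segment as the part-of-speech category is an assumption based on the two example tags above, not a documented property of the corpus.

```python
def split_tag(tag: str) -> tuple[str, list[str]]:
    """Split a tag such as 'N+Sg+Nom' into (category, features).

    Assumption: '+' separates segments and the first segment is the category.
    """
    category, *features = tag.split("+")
    return category, features


for tag in ["N+Sg+Nom", "V+PRES(Й)+3SG"]:
    category, features = split_tag(tag)
    print(category, features)
```

Keeping the features as a plain list preserves their order, which may matter if the tag segments follow the order of suffixes in the word.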

	## Performance on Test Set

	| Metric | Value | 95% CI |
	|--------|-------|--------|
	| Token Accuracy | 0.9850 | [0.9841, 0.9860] |
	| Micro F1 | 0.9851 | [0.9841, 0.9860] |
	| Macro F1 | 0.4324 | [0.4744, 0.5093]* |

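	*Note: the macro F1 confidence interval is reproduced as reported in the paper; it does not contain the macro F1 point estimate given in the table (0.4324).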
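
Micro F1 pools all tokens, while macro F1 averages the per-tag F1 scores with equal weight, so a few rarely seen tags that are predicted poorly can pull macro F1 far below micro F1. The toy sketch below uses made-up labels (including the placeholder tags `RARE_A` and `RARE_B`), not the model's predictions, and the `micro_macro_f1` helper is hypothetical; it only illustrates the mechanism and is not the evaluation code behind the table.

```python
from collections import Counter


def micro_macro_f1(gold, pred):
    """Return (micro_f1, macro_f1) for single-label token classification."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for g, p in zip(gold, pred):
        if g == p:
            tp[g] += 1
        else:
            fp[p] += 1  # predicted label gains a false positive
            fn[g] += 1  # gold label gains a false negative

    def f1(t, f_pos, f_neg):
        denom = 2 * t + f_pos + f_neg
        return 2 * t / denom if denom else 0.0

    labels = set(gold) | set(pred)
    micro = f1(sum(tp.values()), sum(fp.values()), sum(fn.values()))
    macro = sum(f1(tp[l], fp[l], fn[l]) for l in labels) / len(labels)
    return micro, macro


# Toy data (not from the corpus): one frequent tag, two rare tags that are missed.
gold = ["N+Sg+Nom"] * 98 + ["RARE_A", "RARE_B"]
pred = ["N+Sg+Nom"] * 100
micro, macro = micro_macro_f1(gold, pred)
print(f"micro F1 = {micro:.4f}, macro F1 = {macro:.4f}")
```

In this toy case only two of one hundred tokens are wrong, yet two of the three tags score an F1 of zero, which is what drags the macro average down.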

	### Accuracy by Part of Speech (Top 10)

	| POS | Accuracy |
	|-----|----------|
	| PUNCT | 1.0000 |
	| NOUN | 0.9836 |
	| VERB | 0.9535 |
	| ADJ | 0.9626 |
	| PRON | 0.9896 |
	| PART | 0.9973 |
	| PROPN | 0.9754 |
	| ADP | 1.0000 |
	| CCONJ | 1.0000 |
	| ADV | 0.9845 |
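
How the per-POS figures were computed is not stated here. One plausible reading is token accuracy restricted to the tokens of each part of speech, which can be sketched as follows. The `accuracy_by_group` helper and the toy data are hypothetical, and the part-of-speech label of each token is assumed to be available alongside the gold tag.

```python
from collections import defaultdict


def accuracy_by_group(groups, gold, pred):
    """Token accuracy computed separately for each group label (e.g. POS)."""
    correct, total = defaultdict(int), defaultdict(int)
    for grp, g, p in zip(groups, gold, pred):
        total[grp] += 1
        correct[grp] += int(g == p)
    return {grp: correct[grp] / total[grp] for grp in total}


# Toy data (not from the corpus): POS of each token, gold tag, predicted tag.
pos = ["NOUN", "NOUN", "VERB", "VERB", "PUNCT"]
gold = ["N+Sg+Nom", "N+Sg+Nom", "V+PRES(Й)+3SG", "V+PRES(Й)+3SG", "PUNCT"]
pred = ["N+Sg+Nom", "N+Sg+Nom", "V+PRES(Й)+3SG", "N+Sg+Nom", "PUNCT"]
per_pos = accuracy_by_group(pos, gold, pred)
print(per_pos)
```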

	## Usage

	```python
	from transformers import AutoTokenizer, AutoModelForTokenClassification
	import torch

	model_name = "TatarNLPWorld/distilbert-tatar-morph"
	tokenizer = AutoTokenizer.from_pretrained(model_name)
	model = AutoModelForTokenClassification.from_pretrained(model_name)

	# The input is already split into words, so pass is_split_into_words=True
	tokens = ["Татар", "теле", "бик", "бай", "."]
	inputs = tokenizer(tokens, is_split_into_words=True, return_tensors="pt", truncation=True)
	with torch.no_grad():  # gradients are not needed for inference
	    outputs = model(**inputs)
	predictions = torch.argmax(outputs.logits, dim=2)

	# Get tag mapping from model config
	id2tag = model.config.id2label

	# Print one tag per word, taken from the word's first subword token
	word_ids = inputs.word_ids()
	prev_word = None
	for idx, word_idx in enumerate(word_ids):
	    if word_idx is not None and word_idx != prev_word:
	        tag_id = predictions[0][idx].item()
	        if isinstance(id2tag, dict):
	            tag = id2tag.get(str(tag_id), id2tag.get(tag_id, "UNK"))
	        else:
	            tag = id2tag[tag_id] if tag_id < len(id2tag) else "UNK"
	        print(tokens[word_idx], "->", tag)
	    prev_word = word_idx
	```

	Expected output (approximately):

	```
	Татар -> N+Sg+Nom
	теле -> N+Sg+POSS_3(СЫ)+Nom
	бик -> Adv
	бай -> Adj
	. -> PUNCT
	```
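
The loop in the usage example reports one tag per word by reading the prediction at the word's first subword token. The same alignment step can be sketched as a small standalone function that runs without downloading the model. The `first_subword_labels` helper, the `word_ids` list, the predicted ids, and the `id2label` mapping below are all hypothetical values chosen for illustration; they are not output from this model.

```python
def first_subword_labels(word_ids, pred_ids, id2label):
    """Keep the prediction at the first subword of each word.

    word_ids: per-token word index, None for special tokens
              (the shape returned by a fast tokenizer's word_ids()).
    pred_ids: per-token predicted label id.
    """
    labels, prev = [], None
    for word_idx, pred_id in zip(word_ids, pred_ids):
        if word_idx is not None and word_idx != prev:
            labels.append(id2label.get(pred_id, "UNK"))
        prev = word_idx
    return labels


# Hypothetical values: special token, word 0 split into two pieces, word 1, special token.
word_ids = [None, 0, 0, 1, None]
pred_ids = [0, 1, 2, 3, 0]
id2label = {0: "PUNCT", 1: "N+Sg+Nom", 2: "Adj", 3: "Adv"}
labels = first_subword_labels(word_ids, pred_ids, id2label)
print(labels)
```

Predictions on continuation pieces and special tokens are simply ignored here; other aggregation choices (for example, voting across subwords) are possible but are not what the usage example does.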

	## Citation

	If you use this model, please cite it as:

	```bibtex
	@misc{arabov-distilbert-tatar-morph-2026,
	title = {DistilBERT multilingual fine-tuned for Tatar Morphological Analysis},
	author = {Arabov Mullosharaf Kurbonovich},
	year = {2026},
	publisher = {Hugging Face},
	url = {https://huggingface.co/TatarNLPWorld/distilbert-tatar-morph}
	}
	```

	## License

	Apache 2.0