XLM‑RoBERTa fine-tuned for Tatar Morphological Analysis

This model is a fine-tuned version of xlm-roberta-base for morphological analysis of the Tatar language. It was trained on a subset of 80,000 sentences from the Tatar Morphological Corpus. The model predicts fine-grained morphological tags (e.g., N+Sg+Nom, V+PRES(Й)+3SG).
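
The tags are strings joined with "+", so they can be split into parts for downstream use. The following is a minimal sketch, not part of the model or its tooling: `parse_tag` is a hypothetical helper, and it assumes that the first part is the part-of-speech label and that a parenthesized suffix such as (Й) is an optional marker attached to a feature.

```python
import re

def parse_tag(tag: str) -> dict:
    """Split a '+'-joined tag into a POS label and a list of features.

    Assumption: the first part is the POS label, the rest are features.
    """
    pos, *features = tag.split("+")
    parsed = []
    for feat in features:
        # Separate an optional parenthesized suffix: "PRES(Й)" -> ("PRES", "Й")
        m = re.fullmatch(r"([^()]+)(?:\((.*)\))?", feat)
        name, marker = (m.group(1), m.group(2)) if m else (feat, None)
        parsed.append({"name": name, "marker": marker})
    return {"pos": pos, "features": parsed}

print(parse_tag("N+Sg+Nom"))
print(parse_tag("V+PRES(Й)+3SG"))
```

Keeping the marker separate from the feature name makes it possible to group predictions by feature (for example, all PRES forms) regardless of the marker.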

Performance on Test Set

Metric	Value	95% CI
Token Accuracy	0.9837	[0.9826, 0.9846]
Micro F1	0.9837	[0.9826, 0.9846]
Macro F1	0.4131	[0.4521, 0.4867]*

*Note: the macro F1 confidence interval is given as reported in the paper; the point estimate above lies outside that interval.
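
Token accuracy and micro F1 coincide because each token receives exactly one tag, while macro F1 averages per-tag F1 scores so that rare tags count as much as frequent ones. The sketch below illustrates this on made-up data; the function `micro_macro_f1` and the RARE_* tags are hypothetical, and this is not the evaluation script used for the table above.

```python
from collections import Counter

def micro_macro_f1(gold, pred):
    """Return (micro F1, macro F1) for single-label token tagging."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for g, p in zip(gold, pred):
        if g == p:
            tp[g] += 1
        else:
            fp[p] += 1  # predicted tag gets a false positive
            fn[g] += 1  # gold tag gets a false negative

    def f1(t, f_p, f_n):
        denom = 2 * t + f_p + f_n
        return 2 * t / denom if denom else 0.0

    labels = set(gold) | set(pred)
    micro = f1(sum(tp.values()), sum(fp.values()), sum(fn.values()))
    macro = sum(f1(tp[l], fp[l], fn[l]) for l in labels) / len(labels)
    return micro, macro

# Hypothetical data: one frequent tag and three rare tags that are never predicted
gold = ["N+Sg+Nom"] * 97 + ["RARE_A", "RARE_B", "RARE_C"]
pred = ["N+Sg+Nom"] * 100

micro, macro = micro_macro_f1(gold, pred)
accuracy = sum(g == p for g, p in zip(gold, pred)) / len(gold)
print(f"accuracy={accuracy:.3f} micro={micro:.3f} macro={macro:.3f}")
# accuracy=0.970 micro=0.970 macro=0.246
```

In this toy case the three rare tags each contribute an F1 of 0, which pulls the macro average down even though 97% of tokens are tagged correctly.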

Accuracy by Part of Speech (Top 10)

POS	Accuracy
PUNCT	1.0000
NOUN	0.9830
VERB	0.9500
ADJ	0.9543
PRON	0.9889
PART	0.9986
PROPN	0.9794
ADP	0.9979
CCONJ	0.9992
ADV	0.9741
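
A per-POS breakdown like the one above can be computed by grouping tokens by their gold part of speech and counting exact tag matches within each group. This is a minimal sketch under the assumption that every token comes with a gold POS label; `accuracy_by_pos` and the sample data are hypothetical, and how the reported figures were actually produced is not stated here.

```python
from collections import defaultdict

def accuracy_by_pos(pos_labels, gold_tags, pred_tags):
    """Return {POS: share of tokens whose predicted tag equals the gold tag}."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for pos, gold, pred in zip(pos_labels, gold_tags, pred_tags):
        total[pos] += 1
        correct[pos] += int(gold == pred)
    return {pos: correct[pos] / total[pos] for pos in total}

# Hypothetical evaluation data (one entry per token)
pos_labels = ["NOUN", "NOUN", "VERB", "VERB", "PUNCT"]
gold_tags = ["N+Sg+Nom", "N+Sg+Nom", "V+PRES(Й)+3SG", "V+PRES(Й)+3SG", "PUNCT"]
pred_tags = ["N+Sg+Nom", "N+Sg+Nom", "V+PRES(Й)+3SG", "N+Sg+Nom", "PUNCT"]

print(accuracy_by_pos(pos_labels, gold_tags, pred_tags))
# {'NOUN': 1.0, 'VERB': 0.5, 'PUNCT': 1.0}
```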

Usage

from transformers import AutoTokenizer, AutoModelForTokenClassification
import torch

model_name = "TatarNLPWorld/xlm-roberta-tatar-morph"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForTokenClassification.from_pretrained(model_name)

tokens = ["Татар", "теле", "бик", "бай", "."]
inputs = tokenizer(tokens, is_split_into_words=True, return_tensors="pt", truncation=True)
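with torch.no_grad():  # inference only, so skip gradient tracking
    outputs = model(**inputs)
# logits have shape (batch, sequence, num_tags); take the best tag per subword token
predictions = torch.argmax(outputs.logits, dim=-1)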

# Get tag mapping from model config
id2tag = model.config.id2label

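# A word may be split into several subword tokens; report the tag predicted
# at the first subword of each word and skip special tokens (word_idx is None)
word_ids = inputs.word_ids()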
prev_word = None
for idx, word_idx in enumerate(word_ids):
    if word_idx is not None and word_idx != prev_word:
        tag_id = predictions[0][idx].item()
        if isinstance(id2tag, dict):
            tag = id2tag.get(str(tag_id), id2tag.get(tag_id, "UNK"))
        else:
            tag = id2tag[tag_id] if tag_id < len(id2tag) else "UNK"
        print(tokens[word_idx], "->", tag)
    prev_word = word_idx

Expected output (approximately):

Татар -> N+Sg+Nom
теле -> N+Sg+POSS_3(СЫ)+Nom
бик -> Adv
бай -> Adj
. -> PUNCT
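
The alignment step in the usage example (one tag per word, taken from the word's first subword) can be isolated into a small function that does not need the model. This is a sketch for illustration: `first_subword_tags` is a hypothetical helper, and the label ids in `id2label` below are invented rather than taken from the model config.

```python
def first_subword_tags(word_ids, pred_ids, id2label, unk="UNK"):
    """Return one tag per word, using the prediction at its first subword.

    word_ids: list aligned with the subword tokens; None marks special tokens.
    pred_ids: predicted label id for each subword token.
    """
    tags = []
    prev_word = None
    for idx, word_idx in enumerate(word_ids):
        # A new word starts where word_idx changes to a non-None value
        if word_idx is not None and word_idx != prev_word:
            tags.append(id2label.get(pred_ids[idx], unk))
        prev_word = word_idx
    return tags

# Hypothetical inputs: word 1 is split into two subwords (positions 2 and 3)
word_ids = [None, 0, 1, 1, 2, None]
pred_ids = [0, 1, 2, 0, 3, 0]
id2label = {0: "PUNCT", 1: "N+Sg+Nom", 2: "N+Sg+POSS_3(СЫ)+Nom", 3: "Adv"}

print(first_subword_tags(word_ids, pred_ids, id2label))
# ['N+Sg+Nom', 'N+Sg+POSS_3(СЫ)+Nom', 'Adv']
```

With the real model, `word_ids` would come from `inputs.word_ids()` and `pred_ids` from `predictions[0].tolist()`, as in the usage example above.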

Citation

If you use this model, please cite it as:

@misc{arabov-xlm-roberta-tatar-morph-2026,
  title = {XLM‑RoBERTa fine-tuned for Tatar Morphological Analysis},
  author = {Arabov Mullosharaf Kurbonovich},
  year = {2026},
  publisher = {Hugging Face},
  url = {https://huggingface.co/TatarNLPWorld/xlm-roberta-tatar-morph}
}

License

Apache 2.0

