---
language: tt
license: apache-2.0
datasets:
- TatarNLPWorld/tatar-morphological-corpus
metrics:
- accuracy
- f1
pipeline_tag: token-classification
tags:
- tatar
- morphology
- token-classification
- rubert
---

# RuBERT fine-tuned for Tatar Morphological Analysis

This model is a fine-tuned version of [`DeepPavlov/rubert-base-cased`](https://huggingface.co/DeepPavlov/rubert-base-cased) for morphological analysis of the Tatar language. It was trained on a subset of **80,000 sentences** from the [Tatar Morphological Corpus](https://huggingface.co/datasets/TatarNLPWorld/tatar-morphological-corpus). The model predicts fine-grained morphological tags (e.g., `N+Sg+Nom`, `V+PRES(Й)+3SG`).

## Performance on Test Set

| Metric | Value | 95% CI |
|--------|-------|--------|
| Token Accuracy | 0.9861 | [0.9852, 0.9870] |
| Micro F1 | 0.9861 | [0.9851, 0.9870] |
| Macro F1 | 0.5059 | [0.5432, 0.5836]* |

*Note: the macro F1 confidence interval is reproduced as reported in the paper; it does not contain the point estimate listed in the table.

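The gap between micro and macro F1 in the table is typical of tagsets with many rare labels. The following is a minimal sketch on made-up data, not the evaluation script behind the reported numbers. It shows how micro F1 coincides with token accuracy in single-label tagging while macro F1 drops once a few rare tags are missed. The function name `f1_scores` and the toy `gold`/`pred` lists are illustrative assumptions.

```python
from collections import Counter


def f1_scores(gold, pred):
    """Return (accuracy, micro_f1, macro_f1) for single-label tag sequences."""
    assert len(gold) == len(pred)
    tp, fp, fn = Counter(), Counter(), Counter()
    for g, p in zip(gold, pred):
        if g == p:
            tp[g] += 1
        else:
            fp[p] += 1  # the predicted tag was wrong
            fn[g] += 1  # the gold tag was missed

    def f1(t, false_pos, false_neg):
        denom = 2 * t + false_pos + false_neg
        return 2 * t / denom if denom else 0.0

    labels = set(gold) | set(pred)
    accuracy = sum(tp.values()) / len(gold)
    # Micro: pool the counts over all labels, then compute one F1
    micro = f1(sum(tp.values()), sum(fp.values()), sum(fn.values()))
    # Macro: compute F1 per label, then take the unweighted mean
    macro = sum(f1(tp[label], fp[label], fn[label]) for label in labels) / len(labels)
    return accuracy, micro, macro


# Toy data (hypothetical): one frequent tag predicted everywhere, two rare tags missed
gold = ["N+Sg+Nom"] * 98 + ["V+PRES(Й)+3SG", "Adv"]
pred = ["N+Sg+Nom"] * 100
accuracy, micro, macro = f1_scores(gold, pred)
print(f"accuracy={accuracy:.4f} micro_f1={micro:.4f} macro_f1={macro:.4f}")
```

On this toy input accuracy and micro F1 are both 0.98, while macro F1 is about 0.33, because each missed rare tag contributes a per-label F1 of zero with the same weight as the frequent tag.
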
### Accuracy by Part of Speech (Top 10)

| POS | Accuracy |
|-----|----------|
| PUNCT | 1.0000 |
| NOUN | 0.9827 |
| VERB | 0.9640 |
| ADJ | 0.9614 |
| PRON | 0.9914 |
| PART | 0.9995 |
| PROPN | 0.9724 |
| ADP | 1.0000 |
| CCONJ | 1.0000 |
| ADV | 0.9897 |

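A per-category breakdown like the one above can be obtained by grouping tokens on a coarse part-of-speech label and scoring each group separately. The sketch below assumes the caller already has a coarse POS label for every token; how those labels were derived for this model card is not stated, and the helper `accuracy_by_group` and the sample `rows` are hypothetical.

```python
from collections import defaultdict


def accuracy_by_group(rows):
    """rows: iterable of (group, gold_tag, predicted_tag) triples."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for group, gold_tag, predicted_tag in rows:
        total[group] += 1
        correct[group] += int(gold_tag == predicted_tag)
    return {group: correct[group] / total[group] for group in total}


# Made-up example rows: (coarse POS, gold tag, predicted tag)
rows = [
    ("NOUN", "N+Sg+Nom", "N+Sg+Nom"),
    ("NOUN", "N+Sg+POSS_3(СЫ)+Nom", "N+Sg+Nom"),
    ("ADV", "Adv", "Adv"),
    ("PUNCT", "PUNCT", "PUNCT"),
]
per_pos = accuracy_by_group(rows)
for pos, acc in per_pos.items():
    print(f"{pos}: {acc:.4f}")
```

A prediction counts as correct only when the full tag matches, so a noun tagged with the right POS but a wrong possessive or case feature still lowers the NOUN accuracy.
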
## Usage

```python
from transformers import AutoTokenizer, AutoModelForTokenClassification
import torch

model_name = "TatarNLPWorld/rubert-tatar-morph"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForTokenClassification.from_pretrained(model_name)
model.eval()

# Input is pre-tokenized: one list entry per word or punctuation mark
tokens = ["Татар", "теле", "бик", "бай", "."]
inputs = tokenizer(tokens, is_split_into_words=True, return_tensors="pt", truncation=True)

# Inference only, so skip gradient tracking
with torch.no_grad():
    outputs = model(**inputs)
predictions = torch.argmax(outputs.logits, dim=2)

# Get tag mapping from model config
id2tag = model.config.id2label

# A word may be split into several subword pieces; report the tag of the first piece only
word_ids = inputs.word_ids()
prev_word = None
for idx, word_idx in enumerate(word_ids):
    if word_idx is not None and word_idx != prev_word:
        tag_id = predictions[0][idx].item()
        if isinstance(id2tag, dict):
            # Keys may be ints or strings depending on how the config was loaded
            tag = id2tag.get(str(tag_id), id2tag.get(tag_id, "UNK"))
        else:
            tag = id2tag[tag_id] if tag_id < len(id2tag) else "UNK"
        print(tokens[word_idx], "->", tag)
    prev_word = word_idx
```

Expected output (approximately):

```
Татар -> N+Sg+Nom
теле -> N+Sg+POSS_3(СЫ)+Nom
бик -> Adv
бай -> Adj
. -> PUNCT
```

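For downstream use it can help to split a predicted tag string into its parts. The sketch below is based only on the tag examples shown in this card and rests on two assumptions: that the first `+`-separated segment is the part of speech, and that a parenthesized suffix such as `(СЫ)` names the affix associated with the feature. The names `Feature` and `parse_tag` are hypothetical, and the corpus documentation remains the authority on the tag format.

```python
import re
from typing import List, NamedTuple, Optional, Tuple


class Feature(NamedTuple):
    name: str
    affix: Optional[str]  # text inside parentheses, if any (assumed to be the affix)


# A feature name optionally followed by "(...)", e.g. "POSS_3(СЫ)" or "Sg"
_FEATURE_RE = re.compile(r"^(?P<name>[^()]+)(?:\((?P<affix>[^()]*)\))?$")


def parse_tag(tag: str) -> Tuple[str, List[Feature]]:
    """Split a tag such as 'N+Sg+POSS_3(СЫ)+Nom' into (pos, features)."""
    pos, *rest = tag.split("+")
    features = []
    for part in rest:
        match = _FEATURE_RE.match(part)
        if match:
            features.append(Feature(match.group("name"), match.group("affix")))
        else:
            # Unrecognized shape: keep the raw segment rather than guessing
            features.append(Feature(part, None))
    return pos, features


pos, features = parse_tag("N+Sg+POSS_3(СЫ)+Nom")
print(pos, features)
```

Tags without features, such as `Adv` or `PUNCT`, come back with an empty feature list.
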
## Citation

If you use this model, please cite it as:

```bibtex
@misc{arabov-rubert-tatar-morph-2026,
  title = {RuBERT fine-tuned for Tatar Morphological Analysis},
  author = {Arabov Mullosharaf Kurbonovich},
  year = {2026},
  publisher = {Hugging Face},
  url = {https://huggingface.co/TatarNLPWorld/rubert-tatar-morph}
}
```

## License

Apache 2.0