---
language: tt
license: apache-2.0
datasets:
- TatarNLPWorld/tatar-morphological-corpus
metrics:
- accuracy
- f1
pipeline_tag: token-classification
tags:
- tatar
- morphology
- token-classification
- rubert
---

# RuBERT fine-tuned for Tatar Morphological Analysis

This model is a fine-tuned version of [`DeepPavlov/rubert-base-cased`](https://huggingface.co/DeepPavlov/rubert-base-cased) for morphological analysis of the Tatar language. It was trained on a subset of **80,000 sentences** from the [Tatar Morphological Corpus](https://huggingface.co/datasets/TatarNLPWorld/tatar-morphological-corpus). For each word, the model predicts a fine-grained morphological tag (e.g., `N+Sg+Nom`, `V+PRES(Й)+3SG`).

## Performance on Test Set

| Metric | Value | 95% CI |
|--------|-------|--------|
| Token Accuracy | 0.9861 | [0.9852, 0.9870] |
| Micro F1 | 0.9861 | [0.9851, 0.9870] |
| Macro F1 | 0.5059 | [0.5432, 0.5836]* |

*Note: the macro F1 confidence interval is given as reported in the paper; it does not contain the point estimate listed here.

### Accuracy by Part of Speech (Top 10)

| POS | Accuracy |
|-----|----------|
| PUNCT | 1.0000 |
| NOUN | 0.9827 |
| VERB | 0.9640 |
| ADJ | 0.9614 |
| PRON | 0.9914 |
| PART | 0.9995 |
| PROPN | 0.9724 |
| ADP | 1.0000 |
| CCONJ | 1.0000 |
| ADV | 0.9897 |

## Usage

```python
from transformers import AutoTokenizer, AutoModelForTokenClassification
import torch

model_name = "TatarNLPWorld/rubert-tatar-morph"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForTokenClassification.from_pretrained(model_name)

# The input is a pre-tokenized sentence (one list item per word)
tokens = ["Татар", "теле", "бик", "бай", "."]
inputs = tokenizer(tokens, is_split_into_words=True, return_tensors="pt", truncation=True)

with torch.no_grad():  # gradients are not needed for prediction
    outputs = model(**inputs)
predictions = torch.argmax(outputs.logits, dim=2)

# Get tag mapping from model config
id2tag = model.config.id2label

# Report one tag per word, taken from the word's first subword token
word_ids = inputs.word_ids()
prev_word = None
for idx, word_idx in enumerate(word_ids):
    if word_idx is not None and word_idx != prev_word:
        tag_id = predictions[0][idx].item()
        # Keys may be ints or strings depending on how the config was loaded
        tag = id2tag.get(tag_id, id2tag.get(str(tag_id), "UNK"))
        print(tokens[word_idx], "->", tag)
    prev_word = word_idx
```

Expected output (approximately):

```
Татар -> N+Sg+Nom
теле -> N+Sg+POSS_3(СЫ)+Nom
бик -> Adv
бай -> Adj
. -> PUNCT
```

## Citation

If you use this model, please cite it as:

```bibtex
@misc{arabov-rubert-tatar-morph-2026,
  title = {RuBERT fine-tuned for Tatar Morphological Analysis},
  author = {Arabov Mullosharaf Kurbonovich},
  year = {2026},
  publisher = {Hugging Face},
  url = {https://huggingface.co/TatarNLPWorld/rubert-tatar-morph}
}
```

## License

Apache 2.0
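
## Working with Predicted Tags

The tags printed in the Usage example, such as `N+Sg+POSS_3(СЫ)+Nom`, pack several pieces of information into one string. The sketch below shows one possible way to split such a tag into a leading category and a list of features. It rests on assumptions that this card does not confirm: that `+` only ever separates components, that the first component is the word category, and that a parenthesized suffix such as `(СЫ)` is an annotation attached to the preceding feature. The helper name `parse_tag` is hypothetical and is not part of the model or the corpus tooling.

```python
import re

# A component is a name optionally followed by a parenthesized annotation,
# e.g. "POSS_3(СЫ)" -> name "POSS_3", annotation "СЫ" (assumed tag format).
_COMPONENT = re.compile(r"^(?P<name>[^()]+)(?:\((?P<note>[^()]*)\))?$")


def parse_tag(tag):
    """Split a tag like 'N+Sg+POSS_3(СЫ)+Nom' into (category, features).

    Returns the first component as the category and the remaining
    components as a list of (name, annotation) pairs, where annotation
    is None when no parenthesized part is present.
    """
    category, *rest = tag.split("+")
    features = []
    for part in rest:
        match = _COMPONENT.match(part)
        if match:
            features.append((match["name"], match["note"]))
        else:
            # Keep components that do not fit the assumed pattern unchanged
            features.append((part, None))
    return category, features


if __name__ == "__main__":
    for example in ["N+Sg+Nom", "N+Sg+POSS_3(СЫ)+Nom", "V+PRES(Й)+3SG", "PUNCT"]:
        print(example, "->", parse_tag(example))
```

Keeping the annotation separate from the feature name makes it easier to group predictions by feature (for example, all `POSS_3` forms) regardless of the annotation attached to them.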