---
language: tt
license: apache-2.0
datasets:
- TatarNLPWorld/tatar-morphological-corpus
metrics:
- accuracy
- f1
pipeline_tag: token-classification
tags:
- tatar
- morphology
- token-classification
- distilbert
---

# DistilBERT multilingual fine-tuned for Tatar Morphological Analysis

This model is a fine-tuned version of [`distilbert-base-multilingual-cased`](https://huggingface.co/distilbert-base-multilingual-cased) for morphological analysis of the Tatar language. It was trained on a subset of **80,000 sentences** from the [Tatar Morphological Corpus](https://huggingface.co/datasets/TatarNLPWorld/tatar-morphological-corpus). For each word, the model predicts a fine-grained morphological tag (e.g., `N+Sg+Nom`, `V+PRES(Й)+3SG`).

## Performance on Test Set

| Metric | Value | 95% CI |
|--------|-------|--------|
| Token Accuracy | 0.9850 | [0.9841, 0.9860] |
| Micro F1 | 0.9851 | [0.9841, 0.9860] |
| Macro F1 | 0.4324 | [0.4744, 0.5093]\* |

\*Note: the macro F1 CI is given as reported in the paper; it does not contain the point estimate shown in this table.

### Accuracy by Part of Speech (Top 10)

| POS | Accuracy |
|-----|----------|
| PUNCT | 1.0000 |
| NOUN | 0.9836 |
| VERB | 0.9535 |
| ADJ | 0.9626 |
| PRON | 0.9896 |
| PART | 0.9973 |
| PROPN | 0.9754 |
| ADP | 1.0000 |
| CCONJ | 1.0000 |
| ADV | 0.9845 |

## Usage

```python
from transformers import AutoTokenizer, AutoModelForTokenClassification
import torch

model_name = "TatarNLPWorld/distilbert-tatar-morph"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForTokenClassification.from_pretrained(model_name)

# The input is already split into words, one list entry per word
tokens = ["Татар", "теле", "бик", "бай", "."]
inputs = tokenizer(tokens, is_split_into_words=True, return_tensors="pt", truncation=True)

# Inference only, so gradient tracking is not needed
with torch.no_grad():
    outputs = model(**inputs)
predictions = torch.argmax(outputs.logits, dim=2)

# Get tag mapping from model config
id2tag = model.config.id2label

# Label each word with the prediction for its first subword
word_ids = inputs.word_ids()
prev_word = None
for idx, word_idx in enumerate(word_ids):
    # word_idx is None for special tokens such as [CLS] and [SEP]
    if word_idx is not None and word_idx != prev_word:
        tag_id = predictions[0][idx].item()
        if isinstance(id2tag, dict):
            # Keys may be strings or integers depending on how the config was loaded
            tag = id2tag.get(str(tag_id), id2tag.get(tag_id, "UNK"))
        else:
            tag = id2tag[tag_id] if tag_id < len(id2tag) else "UNK"
        print(tokens[word_idx], "->", tag)
        prev_word = word_idx
```

Expected output (approximately):

```
Татар -> N+Sg+Nom
теле -> N+Sg+POSS_3(СЫ)+Nom
бик -> Adv
бай -> Adj
. -> PUNCT
```

## Citation

If you use this model, please cite it as:

```bibtex
@misc{arabov-distilbert-tatar-morph-2026,
  title = {DistilBERT multilingual fine-tuned for Tatar Morphological Analysis},
  author = {Arabov Mullosharaf Kurbonovich},
  year = {2026},
  publisher = {Hugging Face},
  url = {https://huggingface.co/TatarNLPWorld/distilbert-tatar-morph}
}
```

## License

Apache 2.0
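
## Additional example: word-level tag alignment

The loop in the Usage section keeps only the prediction for the first subword of each word. The sketch below isolates that step as a small pure-Python function so it can be reused or tested without loading the model. The function name `first_subword_tags` and the toy `id2label` mapping are illustrative assumptions and are not part of this repository; the real mapping comes from `model.config.id2label`.

```python
from typing import List, Mapping, Optional, Sequence


def first_subword_tags(
    word_ids: Sequence[Optional[int]],
    pred_ids: Sequence[int],
    id2label: Mapping[int, str],
    unk: str = "UNK",
) -> List[str]:
    """Return one tag per word, taken from the word's first subword.

    Hypothetical helper: word_ids has the shape returned by
    inputs.word_ids(), and pred_ids holds one predicted label id per subword.
    """
    tags: List[str] = []
    prev_word = None
    for word_idx, pred in zip(word_ids, pred_ids):
        # None marks special tokens; a repeated index marks a
        # continuation subword of a word that is already tagged.
        if word_idx is None or word_idx == prev_word:
            continue
        tags.append(id2label.get(pred, unk))
        prev_word = word_idx
    return tags


# Toy values for illustration only (not real model output):
# three words, the second of which is split into two subwords.
toy_word_ids = [None, 0, 1, 1, 2, None]
toy_pred_ids = [0, 1, 2, 0, 3, 0]
toy_id2label = {0: "PUNCT", 1: "N+Sg+Nom", 2: "Adv", 3: "Adj"}

print(first_subword_tags(toy_word_ids, toy_pred_ids, toy_id2label))
# prints ['N+Sg+Nom', 'Adv', 'Adj']
```

With real inputs, `pred_ids` would be `predictions[0].tolist()`. If truncation cuts off the end of a long sentence, the returned list is shorter than the input word list, so the lengths are worth checking before zipping tags with words.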