---
language: tt
license: apache-2.0
datasets:
- TatarNLPWorld/tatar-morphological-corpus
metrics:
- accuracy
- f1
pipeline_tag: token-classification
tags:
- tatar
- morphology
- token-classification
- mbert
---

# Multilingual BERT (mBERT) fine-tuned for Tatar Morphological Analysis

This model is a fine-tuned version of [`bert-base-multilingual-cased`](https://huggingface.co/bert-base-multilingual-cased) for morphological analysis of the Tatar language. It was trained on a subset of **80,000 sentences** from the [Tatar Morphological Corpus](https://huggingface.co/datasets/TatarNLPWorld/tatar-morphological-corpus). The model predicts fine-grained morphological tags (e.g., `N+Sg+Nom`, `V+PRES(Й)+3SG`).

## Performance on Test Set

| Metric | Value | 95% CI |
|--------|-------|--------|
| Token Accuracy | 0.9905 | [0.9898, 0.9913] |
| Micro F1 | 0.9905 | [0.9897, 0.9913] |
| Macro F1 | 0.5563 | [0.5954, 0.6387]* |

*Note: the macro F1 confidence interval is given as reported in the paper; it does not contain the point estimate listed here.

### Accuracy by Part of Speech (Top 10)

| POS | Accuracy |
|-----|----------|
| PUNCT | 1.0000 |
| NOUN | 0.9905 |
| VERB | 0.9718 |
| ADJ | 0.9718 |
| PRON | 0.9918 |
| PART | 0.9986 |
| PROPN | 0.9779 |
| ADP | 1.0000 |
| CCONJ | 1.0000 |
| ADV | 0.9948 |

## Usage

```python
import torch
from transformers import AutoModelForTokenClassification, AutoTokenizer

model_name = "TatarNLPWorld/mbert-tatar-morph"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForTokenClassification.from_pretrained(model_name)

# The input is already split into words, one list item per word
tokens = ["Татар", "теле", "бик", "бай", "."]
inputs = tokenizer(tokens, is_split_into_words=True, return_tensors="pt", truncation=True)

# Gradients are not needed for inference
with torch.no_grad():
    outputs = model(**inputs)
predictions = torch.argmax(outputs.logits, dim=2)

# Get tag mapping from model config
id2tag = model.config.id2label

# Map subword positions back to words and report the tag of each word's first subword
word_ids = inputs.word_ids()
prev_word = None
for idx, word_idx in enumerate(word_ids):
    # Skip special tokens (word_idx is None) and non-initial subwords
    if word_idx is not None and word_idx != prev_word:
        tag_id = predictions[0][idx].item()
        if isinstance(id2tag, dict):
            # Keys may be strings or ints depending on how the config was loaded
            tag = id2tag.get(str(tag_id), id2tag.get(tag_id, "UNK"))
        else:
            tag = id2tag[tag_id] if tag_id < len(id2tag) else "UNK"
        print(tokens[word_idx], "->", tag)
    prev_word = word_idx
```

Expected output (approximately):

```
Татар -> N+Sg+Nom
теле -> N+Sg+POSS_3(СЫ)+Nom
бик -> Adv
бай -> Adj
. -> PUNCT
```

## Citation

If you use this model, please cite it as:

```bibtex
@misc{arabov-mbert-tatar-morph-2026,
  title = {Multilingual BERT (mBERT) fine-tuned for Tatar Morphological Analysis},
  author = {Arabov Mullosharaf Kurbonovich},
  year = {2026},
  publisher = {Hugging Face},
  url = {https://huggingface.co/TatarNLPWorld/mbert-tatar-morph}
}
```

## License

Apache 2.0
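
## Splitting Predicted Tags

The predicted tags shown in this card (for example `N+Sg+Nom` and `N+Sg+POSS_3(СЫ)+Nom`) look like a lexical category followed by a sequence of features joined with `+`. The sketch below splits a tag on that assumption. The helper name `split_tag` is hypothetical, and the `+` convention is inferred only from the examples in this card, not from the corpus documentation, so check it against the full tag set before relying on it.

```python
def split_tag(tag: str) -> tuple[str, list[str]]:
    """Split a composite tag into (category, features).

    Assumes '+' separates components and that the first component is the
    lexical category. This is inferred from the example tags in the model
    card, not from an official tag specification.
    """
    head, *features = tag.split("+")
    return head, features


# Tags taken from the examples in this card
for tag in ["N+Sg+Nom", "V+PRES(Й)+3SG", "Adv"]:
    category, features = split_tag(tag)
    print(category, features)
```

Tags without any `+` (such as `Adv` or `PUNCT`) yield an empty feature list, so downstream code can treat both cases uniformly.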