Uzbek POS Tagger (Tahrirchi-BERT fine-tuned, 16 tags)

Model summary

This model is a Part-of-Speech (POS) tagger for Uzbek, fine-tuned from tahrirchi/tahrirchi-bert-base using the Hugging Face Transformers library. It predicts one POS label per token (token classification).

  • Task: POS tagging (token classification)
  • Language: Uzbek (uz)
  • Base model: tahrirchi/tahrirchi-bert-base
  • Labels: 16 POS tags

Note: The model is a standard token-classification fine-tune (no additional decoding layers).


Label set

The model predicts the following 16 POS tags:

ADJ, ADP, ADV, AUX, CCONJ, INTJ, MOD, NOUN, NUM, PART, PRON, PROPN, PUNCT, SCONJ, SYM, VERB

label2id / id2label mappings were created from the sorted unique tag list found in the dataset.
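Assuming the 16-tag list above, the mappings can be reconstructed as follows (variable names are assumptions, mirroring the usual Transformers convention):

```python
# Rebuild label2id / id2label from the sorted unique tag list,
# as described above. TAGS matches the 16 tags listed in this card.
TAGS = sorted([
    "ADJ", "ADP", "ADV", "AUX", "CCONJ", "INTJ", "MOD", "NOUN",
    "NUM", "PART", "PRON", "PROPN", "PUNCT", "SCONJ", "SYM", "VERB",
])
label2id = {tag: i for i, tag in enumerate(TAGS)}
id2label = {i: tag for tag, i in label2id.items()}
```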


Dataset

The model was trained on a custom Uzbek POS dataset provided as a plain text file:

  • Sentences: 5,626
  • Tokens (words): 64,975
  • Format: one sentence per line, tokens separated by spaces, each token in word/TAG form
    Example: Men/PRON bugun/ADV Toshkentga/PROPN bordim/VERB ./PUNCT
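Lines in this format can be parsed with a small helper; a minimal sketch (parse_line is a hypothetical name, not part of the released code):

```python
# Parse one line of the "word/TAG word/TAG ..." format into parallel lists.
def parse_line(line: str):
    tokens, tags = [], []
    for item in line.split():
        # rpartition splits on the LAST "/", so words containing "/" survive
        word, _, tag = item.rpartition("/")
        tokens.append(word)
        tags.append(tag)
    return tokens, tags

tokens, tags = parse_line("Men/PRON bugun/ADV Toshkentga/PROPN bordim/VERB ./PUNCT")
```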

Training procedure

Training used 5-fold cross-validation (KFold(n_splits=5, shuffle=True, random_state=42)).
For each fold, the base model was freshly reloaded and fine-tuned on that fold's training split, so no weights leaked between folds. The fold with the best validation F1 was saved as the final model.
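The splitting step can be sketched with scikit-learn (the sentence list below is a stand-in for the parsed corpus):

```python
from sklearn.model_selection import KFold

sentences = list(range(5626))  # stand-in for the 5,626 parsed sentences
kf = KFold(n_splits=5, shuffle=True, random_state=42)
folds = list(kf.split(sentences))
# Each fold yields disjoint train/validation index arrays; in training,
# the base model would be reloaded and fine-tuned once per fold.
```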

Hardware

  • Fine-tuned on Google Colab with an NVIDIA A100 GPU.

Key hyperparameters

  • Max sequence length: 512
  • Learning rate: 2e-5
  • Epochs: 14
  • Batch size: 32 (train) / 32 (eval)
  • Weight decay: 0.01
  • Label smoothing: 0.1
  • Best model selection: metric_for_best_model="f1", load_best_model_at_end=True
  • Token/label alignment: first subword gets the label, subsequent subwords are ignored (-100)
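The alignment rule in the last bullet can be sketched as below (align_labels is an illustrative helper; word_ids corresponds to what a fast tokenizer's BatchEncoding.word_ids() returns):

```python
# First subword of each word keeps the word's label id; continuation
# subwords and special tokens get -100, which the loss ignores.
def align_labels(word_ids, word_labels):
    aligned, prev = [], None
    for wid in word_ids:
        if wid is None:          # special tokens ([CLS], [SEP], padding)
            aligned.append(-100)
        elif wid != prev:        # first subword of a new word
            aligned.append(word_labels[wid])
        else:                    # continuation subword
            aligned.append(-100)
        prev = wid
    return aligned

# Three words, the second split into two subwords:
aligned = align_labels([None, 0, 1, 1, 2, None], [7, 15, 12])
```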

Evaluation results

Best reported validation performance (Fold 4):

  • Accuracy: 0.9810
  • Weighted F1: 0.9811

Per-class report (Fold 4)

              precision    recall  f1-score   support

         ADJ       0.97      0.97      0.97      1075
         ADP       0.97      0.97      0.97       465
         ADV       0.95      0.95      0.95       398
         AUX       1.00      0.98      0.99       177
       CCONJ       1.00      0.98      0.99       389
        INTJ       0.96      0.97      0.97       114
         MOD       0.97      0.99      0.98       138
        NOUN       0.98      0.98      0.98      4195
         NUM       0.98      0.99      0.98       315
        PART       0.97      0.93      0.95        99
        PRON       0.99      0.99      0.99       719
       PROPN       0.92      0.95      0.93       340
       PUNCT       1.00      1.00      1.00      1916
       SCONJ       0.98      0.99      0.98       122
         SYM       0.97      0.98      0.98       195
        VERB       0.99      0.98      0.98      2150

    accuracy                           0.98     12807
   macro avg       0.97      0.98      0.97     12807
weighted avg       0.98      0.98      0.98     12807

These numbers are from one validation split (fold) in cross-validation. Real-world performance may vary depending on domain, spelling quality, and text style.
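The accuracy and weighted-F1 figures above follow scikit-learn's standard definitions; a toy illustration with made-up labels (not from the actual validation fold):

```python
from sklearn.metrics import accuracy_score, f1_score

# Illustrative gold/predicted tags only.
y_true = ["NOUN", "VERB", "NOUN", "PUNCT"]
y_pred = ["NOUN", "VERB", "VERB", "PUNCT"]

acc = accuracy_score(y_true, y_pred)                # fraction of exact matches
wf1 = f1_score(y_true, y_pred, average="weighted")  # per-class F1 weighted by support
```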


How to use

With pipeline

from transformers import pipeline

model_id = "MaksudSharipov/Uzbek-POS-Tagger-TahrirchiBERT"
tagger = pipeline(
    "token-classification",
    model=model_id,
    tokenizer=model_id,
    aggregation_strategy="simple"
)

text = "Men bugun Toshkentga bordim."
preds = tagger(text)

for p in preds:
    print(p["word"], p["entity_group"], float(p["score"]))

Intended use

  • Uzbek POS tagging for NLP pipelines (preprocessing, feature extraction, linguistic analysis).
  • Helpful for downstream tasks such as syntactic analysis, information extraction, or rule-based post-processing.

Limitations

  • Domain shift can reduce accuracy (e.g., informal chat text, heavy typos, transliteration).
  • The MOD tag may not map 1:1 to Universal Dependencies (UD) conventions depending on dataset annotation guidelines.
  • Tokenization/subword splitting may affect rare words and spelling variants.

License

This model is released under the Apache License 2.0.


Author / Contact

Model details

  • Size: ~0.1B parameters
  • Weights: F32, stored in Safetensors format