Uzbek POS Tagger (Tahrirchi-BERT fine-tuned, 16 tags)
Model summary
This model is a Part-of-Speech (POS) tagger for Uzbek, fine-tuned from tahrirchi/tahrirchi-bert-base using the Hugging Face Transformers library. It predicts one POS label per token (token classification).
- Task: POS tagging (token classification)
- Language: Uzbek (uz)
- Base model: tahrirchi/tahrirchi-bert-base
- Labels: 16 POS tags
Note: The model is a standard token-classification fine-tune (no additional decoding layers).
Label set
The model predicts the following 16 POS tags:
ADJ, ADP, ADV, AUX, CCONJ, INTJ, MOD, NOUN, NUM, PART, PRON, PROPN, PUNCT, SCONJ, SYM, VERB
The label2id / id2label mappings were created from the sorted list of unique tags found in the dataset.
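As an illustration, a minimal sketch of how such a mapping and a standard token-classification head can be set up (the tag list is the one documented above; everything else is an assumption, not the exact training code):

```python
from transformers import AutoModelForTokenClassification

# 16 POS tags, sorted alphabetically (as listed above)
labels = sorted([
    "ADJ", "ADP", "ADV", "AUX", "CCONJ", "INTJ", "MOD", "NOUN",
    "NUM", "PART", "PRON", "PROPN", "PUNCT", "SCONJ", "SYM", "VERB",
])
label2id = {tag: i for i, tag in enumerate(labels)}
id2label = {i: tag for tag, i in label2id.items()}

# Standard token-classification head on top of the base encoder,
# no additional decoding layers
model = AutoModelForTokenClassification.from_pretrained(
    "tahrirchi/tahrirchi-bert-base",
    num_labels=len(labels),
    id2label=id2label,
    label2id=label2id,
)
```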
Dataset
The model was trained on a custom Uzbek POS dataset provided as a plain text file:
- Sentences: 5,626
- Tokens (words): 64,975
- Format: one sentence per line, tokens separated by spaces, each token in word/TAG form

Example: Men/PRON bugun/ADV Toshkentga/PROPN bordim/VERB ./PUNCT
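A minimal sketch of how a file in this format can be parsed into token/tag pairs (the file name is hypothetical, and splitting on the last "/" is an assumption):

```python
def read_pos_file(path):
    """Parse a word/TAG corpus: one sentence per line, space-separated tokens."""
    sentences = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line:
                continue
            tokens, tags = [], []
            for item in line.split():
                # Split on the last "/" so tokens containing "/" keep their tag
                word, _, tag = item.rpartition("/")
                tokens.append(word)
                tags.append(tag)
            sentences.append({"tokens": tokens, "tags": tags})
    return sentences

data = read_pos_file("uzbek_pos.txt")  # hypothetical file name
print(data[0]["tokens"], data[0]["tags"])
```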
Training procedure
Training used 5-fold cross-validation (KFold(n_splits=5, shuffle=True, random_state=42)).
For each fold, the base model was reloaded from the pretrained checkpoint and fine-tuned anew on that fold's training split. The checkpoint from the fold with the best validation F1 was saved.
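A minimal sketch of the cross-validation loop under these settings (only the KFold configuration is taken from the description above; `data` is the parsed sentence list from the earlier sketch, and `train_fold` is a hypothetical helper that reloads the base model, fine-tunes it on the fold, and returns the validation weighted F1):

```python
import numpy as np
from sklearn.model_selection import KFold

kf = KFold(n_splits=5, shuffle=True, random_state=42)
fold_scores = []

for fold, (train_idx, val_idx) in enumerate(kf.split(data)):
    train_split = [data[i] for i in train_idx]
    val_split = [data[i] for i in val_idx]
    # Hypothetical helper: reload base model, fine-tune, return weighted F1
    f1 = train_fold(train_split, val_split)
    fold_scores.append(f1)
    print(f"Fold {fold}: weighted F1 = {f1:.4f}")

print("Best fold:", int(np.argmax(fold_scores)))
```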
Hardware
- Fine-tuned on Google Colab with an NVIDIA A100 GPU.
Key hyperparameters
- Max sequence length: 512
- Learning rate: 2e-5
- Epochs: 14
- Batch size: 32 (train) / 32 (eval)
- Weight decay: 0.01
- Label smoothing: 0.1
- Best model selection: metric_for_best_model="f1", load_best_model_at_end=True
- Token/label alignment: the first subword gets the label; subsequent subwords are ignored (-100), as sketched below
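A minimal sketch of the subword/label alignment and a training configuration matching the hyperparameters above (variable names and the output directory are assumptions):

```python
from transformers import TrainingArguments

def tokenize_and_align_labels(example, tokenizer, label2id, max_length=512):
    """Label only the first subword of each word; mask the rest with -100."""
    encoding = tokenizer(
        example["tokens"],
        is_split_into_words=True,
        truncation=True,
        max_length=max_length,
    )
    labels = []
    previous_word_id = None
    for word_id in encoding.word_ids():
        if word_id is None:                # special tokens ([CLS], [SEP])
            labels.append(-100)
        elif word_id != previous_word_id:  # first subword of a word
            labels.append(label2id[example["tags"][word_id]])
        else:                              # continuation subword
            labels.append(-100)
        previous_word_id = word_id
    encoding["labels"] = labels
    return encoding

training_args = TrainingArguments(
    output_dir="uzbek-pos-fold",    # hypothetical output path
    learning_rate=2e-5,
    num_train_epochs=14,
    per_device_train_batch_size=32,
    per_device_eval_batch_size=32,
    weight_decay=0.01,
    label_smoothing_factor=0.1,
    eval_strategy="epoch",          # "evaluation_strategy" on older Transformers versions
    save_strategy="epoch",
    metric_for_best_model="f1",
    load_best_model_at_end=True,
)
```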
Evaluation results
Best reported validation performance (Fold 4):
- Accuracy: 0.9810
- Weighted F1: 0.9811
Per-class report (Fold 4)
| Tag | Precision | Recall | F1-score | Support |
|---|---|---|---|---|
| ADJ | 0.97 | 0.97 | 0.97 | 1075 |
| ADP | 0.97 | 0.97 | 0.97 | 465 |
| ADV | 0.95 | 0.95 | 0.95 | 398 |
| AUX | 1.00 | 0.98 | 0.99 | 177 |
| CCONJ | 1.00 | 0.98 | 0.99 | 389 |
| INTJ | 0.96 | 0.97 | 0.97 | 114 |
| MOD | 0.97 | 0.99 | 0.98 | 138 |
| NOUN | 0.98 | 0.98 | 0.98 | 4195 |
| NUM | 0.98 | 0.99 | 0.98 | 315 |
| PART | 0.97 | 0.93 | 0.95 | 99 |
| PRON | 0.99 | 0.99 | 0.99 | 719 |
| PROPN | 0.92 | 0.95 | 0.93 | 340 |
| PUNCT | 1.00 | 1.00 | 1.00 | 1916 |
| SCONJ | 0.98 | 0.99 | 0.98 | 122 |
| SYM | 0.97 | 0.98 | 0.98 | 195 |
| VERB | 0.99 | 0.98 | 0.98 | 2150 |
| accuracy | | | 0.98 | 12807 |
| macro avg | 0.97 | 0.98 | 0.97 | 12807 |
| weighted avg | 0.98 | 0.98 | 0.98 | 12807 |
These numbers are from one validation split (fold) in cross-validation. Real-world performance may vary depending on domain, spelling quality, and text style.
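For reference, a minimal sketch of how such a per-class report, accuracy, and weighted F1 can be computed (y_true and y_pred are assumed to be flat lists of gold and predicted tag strings with -100 positions already removed):

```python
from sklearn.metrics import accuracy_score, classification_report, f1_score

def report(y_true, y_pred):
    """Print overall accuracy, weighted F1, and a per-class breakdown."""
    print("Accuracy:   ", round(accuracy_score(y_true, y_pred), 4))
    print("Weighted F1:", round(f1_score(y_true, y_pred, average="weighted"), 4))
    print(classification_report(y_true, y_pred, digits=2))
```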
How to use
With pipeline
```python
from transformers import pipeline

model_id = "<your-username>/<your-repo-name>"  # e.g. MaksudSharipov/Uzbek-POS-Tagger-TahrirchiBERT

tagger = pipeline(
    "token-classification",
    model=model_id,
    tokenizer=model_id,
    aggregation_strategy="simple",
)

text = "Men bugun Toshkentga bordim."
preds = tagger(text)
for p in preds:
    print(p["word"], p["entity_group"], float(p["score"]))
```
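Alternatively, a minimal sketch of manual inference without the pipeline helper, keeping only the prediction of the first subword of each word (the repository id is the one for this card; pre-tokenization by whitespace is an assumption):

```python
import torch
from transformers import AutoModelForTokenClassification, AutoTokenizer

model_id = "MaksudSharipov/Uzbek-POS-Tagger-TahrirchiBERT"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForTokenClassification.from_pretrained(model_id)

words = "Men bugun Toshkentga bordim .".split()
encoding = tokenizer(words, is_split_into_words=True, return_tensors="pt")

with torch.no_grad():
    logits = model(**encoding).logits
pred_ids = logits.argmax(dim=-1)[0].tolist()

# Map the first subword of each word back to its predicted tag
seen = set()
for idx, word_id in enumerate(encoding.word_ids()):
    if word_id is None or word_id in seen:
        continue
    seen.add(word_id)
    print(words[word_id], model.config.id2label[pred_ids[idx]])
```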
Intended use
- Uzbek POS tagging for NLP pipelines (preprocessing, feature extraction, linguistic analysis).
- Helpful for downstream tasks such as syntactic analysis, information extraction, or rule-based post-processing.
Limitations
- Domain shift can reduce accuracy (e.g., informal chat text, heavy typos, transliteration).
- The MOD tag may not map 1:1 to Universal Dependencies (UD) conventions, depending on the dataset's annotation guidelines.
- Tokenization/subword splitting may affect rare words and spelling variants.
License
This model is released under the Apache License 2.0.
Author / Contact
- Author: Maksud Sharipov
- Contact: maqsbek72@gmail.com