Uzbek POS Tagger (Tahrirchi-BERT fine-tuned, 16 tags)
Model summary
This model is a Part-of-Speech (POS) tagger for Uzbek, fine-tuned from tahrirchi/tahrirchi-bert-base using the Hugging Face Transformers library. It predicts one POS label per token (token classification).
- Task: POS tagging (token classification)
- Language: Uzbek (uz)
- Base model: tahrirchi/tahrirchi-bert-base
- Labels: 16 POS tags
Note: The model is a standard token-classification fine-tune (no additional decoding layers).
Label set
The model predicts the following 16 POS tags:
ADJ, ADP, ADV, AUX, CCONJ, INTJ, MOD, NOUN, NUM, PART, PRON, PROPN, PUNCT, SCONJ, SYM, VERB
The label2id / id2label mappings were created from the sorted list of unique tags found in the dataset.
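As an illustration, a minimal sketch of how such a mapping and a standard token-classification head can be set up (the tag list is the one documented above; everything else is an assumption, not the exact training code):

```python
from transformers import AutoModelForTokenClassification

# 16 POS tags, sorted alphabetically (as listed above)
labels = sorted([
    "ADJ", "ADP", "ADV", "AUX", "CCONJ", "INTJ", "MOD", "NOUN",
    "NUM", "PART", "PRON", "PROPN", "PUNCT", "SCONJ", "SYM", "VERB",
])
label2id = {tag: i for i, tag in enumerate(labels)}
id2label = {i: tag for tag, i in label2id.items()}

# Standard token-classification head on top of the base encoder,
# no additional decoding layers
model = AutoModelForTokenClassification.from_pretrained(
    "tahrirchi/tahrirchi-bert-base",
    num_labels=len(labels),
    id2label=id2label,
    label2id=label2id,
)
```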
Dataset
The model was trained on a custom Uzbek POS dataset provided as a plain text file:
- Sentences: 5,626
- Tokens (words): 64,975
- Format: one sentence per line, tokens separated by spaces, each token in word/TAG form

Example: Men/PRON bugun/ADV Toshkentga/PROPN bordim/VERB ./PUNCT
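A minimal sketch of how a file in this format can be parsed into token/tag pairs (the file name is hypothetical, and splitting on the last "/" is an assumption):

```python
def read_pos_file(path):
    """Parse a word/TAG corpus: one sentence per line, space-separated tokens."""
    sentences = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line:
                continue
            tokens, tags = [], []
            for item in line.split():
                # Split on the last "/" so tokens containing "/" keep their tag
                word, _, tag = item.rpartition("/")
                tokens.append(word)
                tags.append(tag)
            sentences.append({"tokens": tokens, "tags": tags})
    return sentences

data = read_pos_file("uzbek_pos.txt")  # hypothetical file name
print(data[0]["tokens"], data[0]["tags"])
```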
Training procedure
Training used 5-fold cross-validation (KFold(n_splits=5, shuffle=True, random_state=42)).
For each fold, the base model was reloaded from the pretrained checkpoint and fine-tuned anew on that fold's training split. The checkpoint from the fold with the best validation F1 was saved.
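A minimal sketch of the cross-validation loop under these settings (only the KFold configuration is taken from the description above; `data` is the parsed sentence list from the earlier sketch, and `train_fold` is a hypothetical helper that reloads the base model, fine-tunes it on the fold, and returns the validation weighted F1):

```python
import numpy as np
from sklearn.model_selection import KFold

kf = KFold(n_splits=5, shuffle=True, random_state=42)
fold_scores = []

for fold, (train_idx, val_idx) in enumerate(kf.split(data)):
    train_split = [data[i] for i in train_idx]
    val_split = [data[i] for i in val_idx]
    # Hypothetical helper: reload base model, fine-tune, return weighted F1
    f1 = train_fold(train_split, val_split)
    fold_scores.append(f1)
    print(f"Fold {fold}: weighted F1 = {f1:.4f}")

print("Best fold:", int(np.argmax(fold_scores)))
```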
Hardware
- Fine-tuned on Google Colab with an NVIDIA A100 GPU.
Key hyperparameters
- Max sequence length: 512
- Learning rate: 2e-5
- Epochs: 14
- Batch size: 32 (train) / 32 (eval)
- Weight decay: 0.01
- Label smoothing: 0.1
- Best model selection: metric_for_best_model="f1", load_best_model_at_end=True
- Token/label alignment: the first subword gets the label; subsequent subwords are ignored (-100), as sketched below
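A minimal sketch of the subword/label alignment and a training configuration matching the hyperparameters above (variable names and the output directory are assumptions):

```python
from transformers import TrainingArguments

def tokenize_and_align_labels(example, tokenizer, label2id, max_length=512):
    """Label only the first subword of each word; mask the rest with -100."""
    encoding = tokenizer(
        example["tokens"],
        is_split_into_words=True,
        truncation=True,
        max_length=max_length,
    )
    labels = []
    previous_word_id = None
    for word_id in encoding.word_ids():
        if word_id is None:                # special tokens ([CLS], [SEP])
            labels.append(-100)
        elif word_id != previous_word_id:  # first subword of a word
            labels.append(label2id[example["tags"][word_id]])
        else:                              # continuation subword
            labels.append(-100)
        previous_word_id = word_id
    encoding["labels"] = labels
    return encoding

training_args = TrainingArguments(
    output_dir="uzbek-pos-fold",    # hypothetical output path
    learning_rate=2e-5,
    num_train_epochs=14,
    per_device_train_batch_size=32,
    per_device_eval_batch_size=32,
    weight_decay=0.01,
    label_smoothing_factor=0.1,
    eval_strategy="epoch",          # "evaluation_strategy" on older Transformers versions
    save_strategy="epoch",
    metric_for_best_model="f1",
    load_best_model_at_end=True,
)
```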
Evaluation results
Best reported validation performance (Fold 4):
- Accuracy: 0.9810
- Weighted F1: 0.9811
Per-class report (Fold 4)
| Tag | Precision | Recall | F1-score | Support |
|---|---|---|---|---|
| ADJ | 0.97 | 0.97 | 0.97 | 1075 |
| ADP | 0.97 | 0.97 | 0.97 | 465 |
| ADV | 0.95 | 0.95 | 0.95 | 398 |
| AUX | 1.00 | 0.98 | 0.99 | 177 |
| CCONJ | 1.00 | 0.98 | 0.99 | 389 |
| INTJ | 0.96 | 0.97 | 0.97 | 114 |
| MOD | 0.97 | 0.99 | 0.98 | 138 |
| NOUN | 0.98 | 0.98 | 0.98 | 4195 |
| NUM | 0.98 | 0.99 | 0.98 | 315 |
| PART | 0.97 | 0.93 | 0.95 | 99 |
| PRON | 0.99 | 0.99 | 0.99 | 719 |
| PROPN | 0.92 | 0.95 | 0.93 | 340 |
| PUNCT | 1.00 | 1.00 | 1.00 | 1916 |
| SCONJ | 0.98 | 0.99 | 0.98 | 122 |
| SYM | 0.97 | 0.98 | 0.98 | 195 |
| VERB | 0.99 | 0.98 | 0.98 | 2150 |
| accuracy | | | 0.98 | 12807 |
| macro avg | 0.97 | 0.98 | 0.97 | 12807 |
| weighted avg | 0.98 | 0.98 | 0.98 | 12807 |
These numbers are from one validation split (fold) in cross-validation. Real-world performance may vary depending on domain, spelling quality, and text style.
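For reference, a minimal sketch of how such a per-class report, accuracy, and weighted F1 can be computed (y_true and y_pred are assumed to be flat lists of gold and predicted tag strings with -100 positions already removed):

```python
from sklearn.metrics import accuracy_score, classification_report, f1_score

def report(y_true, y_pred):
    """Print overall accuracy, weighted F1, and a per-class breakdown."""
    print("Accuracy:   ", round(accuracy_score(y_true, y_pred), 4))
    print("Weighted F1:", round(f1_score(y_true, y_pred, average="weighted"), 4))
    print(classification_report(y_true, y_pred, digits=2))
```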
How to use
With pipeline
```python
from transformers import pipeline

model_id = "<your-username>/<your-repo-name>"  # e.g. MaksudSharipov/Uzbek-POS-Tagger-TahrirchiBERT

tagger = pipeline(
    "token-classification",
    model=model_id,
    tokenizer=model_id,
    aggregation_strategy="simple",
)

text = "Men bugun Toshkentga bordim."
preds = tagger(text)
for p in preds:
    print(p["word"], p["entity_group"], float(p["score"]))
```
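Alternatively, a minimal sketch of manual inference without the pipeline helper, keeping only the prediction of the first subword of each word (the repository id is the one for this card; pre-tokenization by whitespace is an assumption):

```python
import torch
from transformers import AutoModelForTokenClassification, AutoTokenizer

model_id = "MaksudSharipov/Uzbek-POS-Tagger-TahrirchiBERT"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForTokenClassification.from_pretrained(model_id)

words = "Men bugun Toshkentga bordim .".split()
encoding = tokenizer(words, is_split_into_words=True, return_tensors="pt")

with torch.no_grad():
    logits = model(**encoding).logits
pred_ids = logits.argmax(dim=-1)[0].tolist()

# Map the first subword of each word back to its predicted tag
seen = set()
for idx, word_id in enumerate(encoding.word_ids()):
    if word_id is None or word_id in seen:
        continue
    seen.add(word_id)
    print(words[word_id], model.config.id2label[pred_ids[idx]])
```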
Intended use
- Uzbek POS tagging for NLP pipelines (preprocessing, feature extraction, linguistic analysis).
- Helpful for downstream tasks such as syntactic analysis, information extraction, or rule-based post-processing.
Limitations
- Domain shift can reduce accuracy (e.g., informal chat text, heavy typos, transliteration).
- The MOD tag may not map 1:1 to Universal Dependencies (UD) conventions, depending on the dataset's annotation guidelines.
- Tokenization/subword splitting may affect rare words and spelling variants.
License
This model is released under the Apache License 2.0.
Author / Contact
- Author: Maksud Sharipov
- Contact: maqsbek72@gmail.com