---
language:
- uz
license: cc-by-nc-4.0
tags:
- uzbek
- lemmatization
- morphology
- adjective
- mt5
- text2text-generation
datasets:
- local/csv
metrics:
- exact_match_accuracy: 91.6%
model_name: UzbekAdjectiveLemmatizer
base_model: google/mt5-small
library_name: transformers
---

UzbekAdjectiveLemmatizer (mT5-small)

UzbekAdjectiveLemmatizer is a google/mt5-small model fine-tuned to convert suffixed Uzbek adjective word forms back to their lemma (base form). The model works in a text-to-text format: the input is an adjective form, and the output is its lemma.

📌 Model overview

  • Hugging Face model: MaksudSharipov/UzbekAdjectiveLemmatizer
  • Base model: google/mt5-small
  • Task: Lemmatization (text-to-text)
  • Language: Uzbek (uz)
  • License: CC-BY-NC-4.0
  • Exact match accuracy (test): 91.6%

🧠 What this model does

This model converts the various suffixed forms of Uzbek adjectives back to their base (lemma) form.

For example:

Input            Output
a’loda           a’lo
yaxshiroqdan     yaxshi
eng kichigidan   kichik

The model was trained on 593,512 examples and recovers the lemma with 91.6% exact match accuracy across a wide variety of suffixed forms.

📦 Training data

  • Format: CSV (a loading sketch follows this list)
  • Columns: input, output
    • input: suffixed adjective form
    • output: lemma
  • Total examples: 741,890
    • Train: 593,512 (80%)
    • Val: 74,189 (10%)
    • Test: 74,189 (10%)
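
The card does not publish the data-loading code, so the following is a minimal sketch of one way to load and split such a CSV with the datasets library; the file name "adjectives.csv" and the seed are assumptions.

from datasets import load_dataset

# Hypothetical file name; the actual dataset file is not published.
raw = load_dataset("csv", data_files="adjectives.csv")["train"]

# Carve off 20%, then split it half-and-half, reproducing the
# 80/10/10 train/val/test proportions listed above.
split = raw.train_test_split(test_size=0.2, seed=42)
holdout = split["test"].train_test_split(test_size=0.5, seed=42)

train_ds = split["train"]      # ~80% (593,512 rows)
val_ds = holdout["train"]      # ~10% (74,189 rows)
test_ds = holdout["test"]      # ~10% (74,189 rows)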

🔧 Training setup

  • Tokenizer: T5TokenizerFast (mT5; see the preprocessing sketch below)
  • Max source length: 32
  • Max target length: 32
  • GPU: NVIDIA A100-SXM4-40GB (Google Colab)
  • Libraries: transformers, datasets, accelerate
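
Only the settings above are published, so here is a minimal preprocessing sketch under those settings; the column names follow the CSV format described earlier, and the truncation behaviour is an assumption.

from transformers import T5TokenizerFast

tokenizer = T5TokenizerFast.from_pretrained("google/mt5-small")

def preprocess(batch):
    # Tokenize the suffixed form (source) and the lemma (target),
    # both capped at the 32-token limits listed above.
    model_inputs = tokenizer(batch["input"], max_length=32, truncation=True)
    labels = tokenizer(text_target=batch["output"], max_length=32, truncation=True)
    model_inputs["labels"] = labels["input_ids"]
    return model_inputs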

📊 Evaluation

The model’s exact match accuracy on the test set is 91.6%: in 91.6% of cases the predicted lemma is identical to the ground truth. This is a strong result for adjective lemmatization, and the model can be expected to perform well on real-world text too.
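
For reference, exact match accuracy is simply the share of predictions that are string-identical to the reference lemma; the whitespace normalization below is an assumption, since the card does not specify one.

def exact_match_accuracy(predictions, references):
    # Count predictions that match their reference exactly after
    # stripping surrounding whitespace (assumed normalization).
    assert len(predictions) == len(references)
    hits = sum(p.strip() == r.strip() for p, r in zip(predictions, references))
    return hits / len(references)

# Toy example: one of two predictions matches, so the score is 0.5.
print(exact_match_accuracy(["a’lo", "yaxshi"], ["a’lo", "kichik"]))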

🚀 How to use

1) With the Transformers library

from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

model_id = "MaksudSharipov/UzbekAdjectiveLemmatizer"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForSeq2SeqLM.from_pretrained(model_id)

def lemmatize(word: str) -> str:
    # Tokenize the suffixed form and generate the lemma with beam search;
    # lemmas are short, so 10 new tokens is plenty.
    inputs = tokenizer(word, return_tensors="pt")
    out = model.generate(**inputs, max_new_tokens=10, num_beams=4)
    return tokenizer.decode(out[0], skip_special_tokens=True)

print(lemmatize("a’loda"))  # -> "a’lo"
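
Words can also be lemmatized in batches; this sketch reuses the tokenizer and model loaded above, with padding enabled for the variable-length inputs. The expected outputs follow the example table earlier in this card.

# Batch lemmatization, reusing the objects loaded above.
words = ["a’loda", "yaxshiroqdan", "eng kichigidan"]
inputs = tokenizer(words, return_tensors="pt", padding=True)
out = model.generate(**inputs, max_new_tokens=10, num_beams=4)
print(tokenizer.batch_decode(out, skip_special_tokens=True))
# -> ["a’lo", "yaxshi", "kichik"]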


Limitations

The model is designed only for adjectives; errors may occur when it is applied to other parts of speech.

We are also working on an AI-based Uzbek lemmatizer covering all parts of speech.

Citation

If you use this model in a scientific paper or project presentation, please cite:

@misc{UzbekAdjectiveLemmatizer2026,
  title={UzbekAdjectiveLemmatizer: mT5-small fine-tuned for Uzbek adjective lemmatization},
  author={Maksud Sharipov},
  year={2026},
  howpublished={Hugging Face model repository},
  url={https://huggingface.co/MaksudSharipov/UzbekAdjectiveLemmatizer}
}
