---
language:
- uz
license: apache-2.0
tags:
- whisper
- speech-recognition
- uzbek
- fine-tuned
- asr
base_model: Kotib/uzbek_stt_v1
pipeline_tag: automatic-speech-recognition
---

# Zehnova STT: Speech-to-Text Model for Uzbek

An automatic speech recognition (speech-to-text) model for Uzbek, fine-tuned from a Whisper Medium based checkpoint.

## About the model

- **Model type:** Automatic Speech Recognition (ASR)
- **Base model:** `Kotib/uzbek_stt_v1` (Whisper Medium)
- **Fine-tuning method:** LoRA (Low-Rank Adaptation)
- **Language:** Uzbek 🇺🇿
- **Author:** Jonibek21

## Usage

```python
import torch
from transformers import WhisperForConditionalGeneration, WhisperProcessor, pipeline

model_id = "Jonibek21/Zehnova-stt-uzbek"

# Use the first GPU with fp16 when available; otherwise fall back to CPU with fp32.
use_cuda = torch.cuda.is_available()
device = "cuda:0" if use_cuda else "cpu"

model = WhisperForConditionalGeneration.from_pretrained(
    model_id,
    torch_dtype=torch.float16 if use_cuda else torch.float32,
)
processor = WhisperProcessor.from_pretrained(model_id)

# Audio longer than 30 s is processed in overlapping 30 s chunks (5 s stride).
pipe = pipeline(
    "automatic-speech-recognition",
    model=model,
    tokenizer=processor.tokenizer,
    feature_extractor=processor.feature_extractor,
    chunk_length_s=30,
    stride_length_s=5,
    batch_size=4,
    device=device,  # the pipeline moves the model to this device
)

result = pipe(
    "audio.wav",  # path to your audio file
    generate_kwargs={
        "language": "uz",
        "task": "transcribe",
        "no_repeat_ngram_size": 3,  # discourages repeated 3-grams in the output
    },
)

print(result["text"])
```

## Training details

- **Dataset:** Custom Uzbek audio dataset
- **Train samples:** 9,214
- **Test samples:** 1,024
- **Dataset duration:** 16 hours
- **Training hardware:** NVIDIA RTX 3090 (24GB)
- **Training framework:** Hugging Face Transformers + PEFT
- **Precision:** fp16
- **LoRA rank:** 32
- **LoRA alpha:** 64
- **LoRA target modules:** q_proj, v_proj

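
To give a feel for what the LoRA settings above imply, the sketch below estimates how many adapter parameters they add. It is a back-of-the-envelope calculation, not the author's training script. It assumes the standard Whisper Medium layout (hidden size 1024, 24 encoder layers, 24 decoder layers) and that adapters are attached to every `q_proj` and `v_proj`, including the decoder's cross-attention; neither assumption is stated in this card.

```python
# Rough estimate of the LoRA adapter size for rank 32 on q_proj and v_proj.
# ASSUMPTIONS (not from the model card): standard Whisper Medium dimensions,
# and adapters on every q_proj/v_proj, including decoder cross-attention.

D_MODEL = 1024          # assumed hidden size of Whisper Medium
ENCODER_LAYERS = 24     # assumed
DECODER_LAYERS = 24     # assumed
LORA_RANK = 32          # from the card
LORA_ALPHA = 64         # from the card
TARGETS_PER_ATTENTION = 2  # q_proj and v_proj


def lora_params_per_matrix(d_in: int, d_out: int, rank: int) -> int:
    """LoRA adds two low-rank factors: A (rank x d_in) and B (d_out x rank)."""
    return rank * (d_in + d_out)


def adapted_matrix_count() -> int:
    """Each encoder layer has one attention block; each decoder layer has two
    (self-attention and cross-attention)."""
    attention_blocks = ENCODER_LAYERS + 2 * DECODER_LAYERS
    return attention_blocks * TARGETS_PER_ATTENTION


trainable_lora_params = adapted_matrix_count() * lora_params_per_matrix(
    D_MODEL, D_MODEL, LORA_RANK
)
# The low-rank update is scaled by alpha / rank before being added to the weights.
lora_scaling = LORA_ALPHA / LORA_RANK

print(f"adapted matrices:      {adapted_matrix_count()}")
print(f"trainable LoRA params: {trainable_lora_params:,}")
print(f"LoRA scaling factor:   {lora_scaling}")
```

Under these assumptions the adapters amount to only a few million parameters, which is consistent with the whole run fitting on a single 24GB GPU in fp16.
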
## 📊 Model Evaluation (WER)

| Category | WER |
|----------|-----|
| **Overall** | **~11-13%** |
| Clean speech | ~6-11% |
| Noisy / augmented | ~9-16% |
| News / formal | ~11-12% |

> Base model (`Kotib/uzbek_stt_v1`) overall WER: 16.7%.
> The Zehnova model's overall WER is **about 5 percentage points lower** (absolute) than the base model's.

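
For readers unfamiliar with the metric, here is a minimal sketch of how word error rate is usually computed (word-level edit distance divided by the number of reference words), followed by the arithmetic behind the comparison with the base model. The function name and the sample sentences are illustrative only; this is not the evaluation script used for the numbers above.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / reference word count."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # Word-level Levenshtein distance, keeping only the previous row.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # deletion
                curr[j - 1] + 1,     # insertion
                prev[j - 1] + cost,  # match or substitution
            ))
        prev = curr
    return prev[-1] / len(ref)


# Made-up example: one wrong word out of four.
sample_wer = word_error_rate("men bugun maktabga bordim",
                             "men bugun maktabda bordim")
print(f"sample WER: {sample_wer:.2%}")

# Improvement over the base model, using the overall figures quoted above.
BASE_WER = 16.7
for tuned_wer in (11.0, 13.0):
    absolute = BASE_WER - tuned_wer
    relative = absolute / BASE_WER
    print(f"tuned {tuned_wer}%: {absolute:.1f} points lower, {relative:.0%} relative")
```

With the quoted overall range of roughly 11 to 13%, the absolute reduction falls between about 3.7 and 5.7 percentage points.
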
## Limitations

- Works only for Uzbek
- Quality may degrade on noisy audio
- Audio longer than 30 seconds is split into chunks

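
The chunking mentioned in the last point can be pictured with a small sketch. It mirrors the `chunk_length_s=30` and `stride_length_s=5` settings from the usage example, assuming the stride applies to both sides of a chunk so that consecutive windows start 20 seconds apart. The helper `chunk_windows` is hypothetical and simplified; it is not the Transformers implementation.

```python
def chunk_windows(duration_s, chunk_length_s=30.0, stride_length_s=5.0):
    """Return (start, end) times in seconds of overlapping chunks.

    Simplified illustration: each window is at most chunk_length_s long and
    consecutive windows advance by chunk_length_s - 2 * stride_length_s.
    """
    duration_s = float(duration_s)
    step = chunk_length_s - 2 * stride_length_s
    if duration_s <= 0:
        raise ValueError("duration_s must be positive")
    if step <= 0:
        raise ValueError("stride is too large for the chunk length")

    windows = []
    start = 0.0
    while True:
        end = min(start + chunk_length_s, duration_s)
        windows.append((start, end))
        if start + chunk_length_s >= duration_s:
            break  # this window already reaches the end of the audio
        start += step
    return windows


# A 70-second recording is covered by three overlapping windows.
for start, end in chunk_windows(70):
    print(f"{start:5.1f}s -> {end:5.1f}s")
```

The overlap gives the model context on both sides of each cut, which helps avoid words being split at chunk boundaries.
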
## Date

- 01/05/2026