---
language:
- uz
license: apache-2.0
tags:
- whisper
- automatic-speech-recognition
- uzbek
- speech-to-text
- asr
metrics:
- wer
- cer
base_model: openai/whisper-medium
pipeline_tag: automatic-speech-recognition
library_name: transformers
datasets:
- custom
model-index:
- name: whisper-medium-uz-v1
  results:
  - task:
      type: automatic-speech-recognition
      name: Speech Recognition
    metrics:
    - type: wer
      value: 16.7
      name: Overall WER
    - type: cer
      value: 7
      name: Overall CER
---
Whisper Medium Uzbek v1 by Kotibai & Rubai Team
Developed by Kotibai & Rubai Team
An Uzbek automatic speech recognition (ASR) model fine-tuned from OpenAI Whisper Medium.
Model Description
- Base Model: OpenAI Whisper Medium (769M parameters)
- Language: Uzbek (uz)
- Training Data: ~1,600 hours of Uzbek audio
- Precision: BF16
- Script: Latin (Russian loanwords are also transcribed in Latin script, e.g. "brat", "davay", "prosto")
Evaluation Results
| Category | WER |
|---|---|
| Overall | 16.7% |
| Clean Speech | ~6-11% |
| Noisy/Augmented | ~12-24% |
| Dialects | ~16-25% |
Evaluated on 1,864 samples across 8 diverse test sets.
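For readers unfamiliar with the metrics, the following is a minimal sketch of how WER and CER are conventionally computed: the edit distance between reference and hypothesis, divided by the reference length in words or characters. The card does not state which text normalization or aggregation was used for the reported numbers, so this is an illustration of the standard definitions and not the team's evaluation script. The example sentences and all function names are hypothetical.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # deletion
                curr[j - 1] + 1,     # insertion
                prev[j - 1] + cost,  # substitution or match
            ))
        prev = curr
    return prev[-1]


def wer(reference, hypothesis):
    """Word error rate: word-level edits / number of reference words."""
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def cer(reference, hypothesis):
    """Character error rate: character-level edits / reference length."""
    if not reference:
        raise ValueError("reference must not be empty")
    return edit_distance(reference, hypothesis) / len(reference)


# Hypothetical example: one substituted word out of four
reference = "men bugun ishga bordim"
hypothesis = "men bugun ishka bordim"
print(f"WER: {wer(reference, hypothesis):.2%}")
print(f"CER: {cer(reference, hypothesis):.2%}")
```

A single wrong character counts as a full word error in WER but only one character error in CER, which is why CER is typically much lower than WER on the same output.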
Usage
Using Transformers
```python
from transformers import WhisperProcessor, WhisperForConditionalGeneration
import librosa

# Load the processor (feature extractor + tokenizer) and the fine-tuned model
processor = WhisperProcessor.from_pretrained("Kotib/uzbek_stt_v1")
model = WhisperForConditionalGeneration.from_pretrained("Kotib/uzbek_stt_v1")

# Whisper expects 16 kHz mono audio; librosa resamples on load
audio, sr = librosa.load("audio.wav", sr=16000)
input_features = processor(audio, sampling_rate=sr, return_tensors="pt").input_features

# Force Uzbek transcription instead of relying on language detection
predicted_ids = model.generate(input_features, language="uz", task="transcribe")
transcription = processor.batch_decode(predicted_ids, skip_special_tokens=True)[0]
print(transcription)
```
Using Pipeline
```python
from transformers import pipeline

# chunk_length_s=30 splits long recordings into 30-second windows
pipe = pipeline(
    "automatic-speech-recognition",
    model="Kotib/uzbek_stt_v1",
    chunk_length_s=30,
    device="cuda",  # use "cpu" if no GPU is available
)

result = pipe("audio.wav", generate_kwargs={"language": "uz", "task": "transcribe"})
print(result["text"])
```
Training
Trained in 3 stages using curriculum learning:
| Stage | Hours |
|---|---|
| Foundation | 725h |
| Robustness | 394h |
| Domain Adaptation | 474h |
Intended Use
- Uzbek speech-to-text transcription
- Voice assistants and dictation
- Media transcription and subtitling
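For the subtitling use case, a minimal sketch of turning timestamped output into SRT text is shown below. It assumes the Transformers ASR pipeline was called with `return_timestamps=True`, in which case the result contains a `"chunks"` list whose items hold a `"timestamp"` pair in seconds and a `"text"` string. The helper names, the `fallback_duration` parameter, and the sample chunks are hypothetical and not part of this model or the library.

```python
def format_srt_time(seconds):
    """Format a time in seconds as an SRT timestamp (HH:MM:SS,mmm)."""
    millis = int(round(seconds * 1000))
    hours, millis = divmod(millis, 3_600_000)
    minutes, millis = divmod(millis, 60_000)
    secs, millis = divmod(millis, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{millis:03d}"


def chunks_to_srt(chunks, fallback_duration=2.0):
    """Build SRT text from chunks shaped like the pipeline's timestamped output.

    The end time of the final chunk can be None; in that case a
    hypothetical fallback duration is added to the start time.
    """
    entries = []
    for index, chunk in enumerate(chunks, start=1):
        start, end = chunk["timestamp"]
        if end is None:
            end = start + fallback_duration
        entries.append(
            f"{index}\n"
            f"{format_srt_time(start)} --> {format_srt_time(end)}\n"
            f"{chunk['text'].strip()}\n"
        )
    return "\n".join(entries)


# Hypothetical chunks, shaped like result["chunks"] from
# pipe("audio.wav", return_timestamps=True, generate_kwargs={...})
sample_chunks = [
    {"timestamp": (0.0, 2.5), "text": " Assalomu alaykum."},
    {"timestamp": (2.5, None), "text": " Xush kelibsiz."},
]
print(chunks_to_srt(sample_chunks))
```

Segment-level timestamps from Whisper are approximate, so subtitles produced this way may need manual alignment for broadcast use.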
Limitations
- Performance degrades on very noisy audio (WER of roughly 12-24% on the noisy/augmented test sets, versus roughly 6-11% on clean speech)
- May struggle with heavy code-switching
- Optimized for Uzbek only
License
Apache 2.0