OvozifyLabs/whisper-small-uz-v1

Automatic Speech Recognition

speech-recognition

Firdavs222 committed 8 days ago · Commit 32f70a1 (verified) · 1 parent: fa18ce9

Update README.md

Files changed (1)

README.md +83 -3

@@ -1,3 +1,83 @@
----
-license: apache-2.0
----

+---
+license: apache-2.0
+language:
+- uz
+- en
+- ru
+metrics:
+- wer
+base_model:
+- openai/whisper-small
+pipeline_tag: automatic-speech-recognition
+tags:
+- speech-recognition
+- whisper
+- multilingual
+- uzbek
+- russian
+- english
+---
+# Multilingual Whisper (Uz/En/Ru) — Fine-tuned Speech-to-Text Model
+A fine-tuned **Whisper Small** model optimized to transcribe **Uzbek, English, and Russian** with comparable accuracy.
+It was trained on a balanced multilingual dataset, is intended for real-world speech transcription, and performs competitively against strong open-source and commercial STT systems.
+---
+## Model Details
+### Model Description
+This model extends **OpenAI Whisper Small** by fine-tuning it on a multilingual speech mixture, with the aim of delivering robust ASR performance for **Uzbek**, **English**, and **Russian** speakers.
+The goal was to reduce the performance gap between languages, especially improving **Uzbek** speech recognition, where public ASR resources are scarce.
+- **Model type:** Automatic Speech Recognition (ASR)
+- **Language(s):** Uzbek 🇺🇿, English 🇬🇧, Russian 🇷🇺
+- **License:** Apache-2.0
+- **Finetuned from:** openai/whisper-small
+- **Intended usage:** Real-time & offline speech-to-text
+---
+## Training Datasets
+- DavronSherbaev/uzbekvoice-filtered
+- telegram-voice-messages (private collection)
+- navaistt-open-datasets
+- sovaai/russian-audiobooks
+- librispeech
+## Evaluation
+### Word Error Rate (WER) Comparison
+| Model                     | WER ↓      |
+|---------------------------|------------|
+| Whisper-small-uz-v1       | **34.00%** |
+| Gemini (Commercial)       | 36.21%     |
+| NavaiSTT v2 (Open-Source) | 35.14%     |
+| Aisha STT (Commercial)    | 41.71%     |
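+In this comparison the model achieves the **lowest WER**, 1.14 percentage points below the open-source NavaiSTT v2 and further ahead of the commercial Gemini and Aisha STT systems, which suggests strong generalization to informal real-world speech.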
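For readers unfamiliar with the metric in the table: WER is the word-level edit distance between a hypothesis and its reference (substitutions, deletions, and insertions), divided by the number of reference words. The sketch below is only an illustration of that definition. The function name `word_error_rate` and the example sentences are hypothetical and do not come from the model card, and no text normalization (casing, punctuation) is applied, since the card does not say which normalization, if any, was used for the reported numbers.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,         # deletion (reference word missing)
                curr[j - 1] + 1,     # insertion (extra hypothesis word)
                prev[j - 1] + cost,  # substitution or exact match
            )
        prev = curr
    return prev[-1] / len(ref)


# Illustrative sentences (an assumption, not evaluation data):
# one substitution ("sat" -> "sit") and one deletion ("the"),
# so 2 errors over 6 reference words.
score = word_error_rate("the cat sat on the mat", "the cat sit on mat")
print(f"WER: {score:.2%}")
```

Corpus-level WER is usually computed by summing the errors over all utterances and dividing by the total number of reference words, rather than by averaging per-utterance scores.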
+---
+## Usage Example
+```python
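+import torch
+import torchaudio
+from transformers import WhisperForConditionalGeneration, WhisperProcessor
+model_id = "Firdavs222/whisper-small-uz-v1" # replace with real model repo
+processor = WhisperProcessor.from_pretrained(model_id)
+model = WhisperForConditionalGeneration.from_pretrained(model_id)
+model.eval()
+# torchaudio returns a (channels, samples) tensor at the file's native sample rate
+audio, sr = torchaudio.load("audio.wav")
+audio = audio.mean(dim=0)  # downmix to mono
+# Whisper's feature extractor expects 16 kHz audio and rejects other rates
+audio = torchaudio.functional.resample(audio, orig_freq=sr, new_freq=16000)
+# Pass a 1-D NumPy array; input longer than 30 s is truncated by the feature extractor
+inputs = processor(audio.numpy(), sampling_rate=16000, return_tensors="pt")
+with torch.no_grad():
+    predicted_ids = model.generate(inputs.input_features)
+text = processor.batch_decode(predicted_ids, skip_special_tokens=True)[0]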
+print(text)  # → transcribed text here