# Whisper Uyghur ASR

Fine-tuned OpenAI Whisper-medium model for Uyghur Automatic Speech Recognition.
## Model
- Base Model: openai/whisper-medium
- Language: Uyghur (ug)
- Checkpoint: step-2000
## Performance
| Metric | Value |
|---|---|
| WER | 22.53% |
| CER | 12.56% |
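For reference, WER and CER are both edit-distance metrics: the Levenshtein distance between hypothesis and reference, normalized by reference length, counted over words (WER) or characters (CER). A minimal pure-Python sketch of this computation (an illustration, not the project's actual evaluation script, which is not shown here):

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (single-row DP)."""
    m, n = len(ref), len(hyp)
    dp = list(range(n + 1))
    for i in range(1, m + 1):
        prev, dp[0] = dp[0], i
        for j in range(1, n + 1):
            cur = dp[j]
            dp[j] = min(dp[j] + 1,                              # deletion
                        dp[j - 1] + 1,                          # insertion
                        prev + (ref[i - 1] != hyp[j - 1]))      # substitution
            prev = cur
    return dp[n]

def wer(reference, hypothesis):
    """Word Error Rate: word-level edit distance / reference word count."""
    ref_words = reference.split()
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)

def cer(reference, hypothesis):
    """Character Error Rate: character-level edit distance / reference length."""
    return edit_distance(reference, hypothesis) / len(reference)

# One substitution among three words → WER of 1/3.
print(round(wer("a b c", "a x c"), 4))  # → 0.3333
```

In practice libraries such as `jiwer` or `evaluate` are commonly used for these metrics.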
## Dataset
| Split | Samples | Duration |
|---|---|---|
| Train | 11,287 | 17.44h |
| Validation | 1,045 | 1.37h |
| Test | 795 | 0.86h |
| Total | 13,127 | 19.67h |
Sources:
- Common Voice Uyghur (CC0-1.0): 5,197 samples
- Uyghur Whisper Finetune (CC-BY-4.0): 7,930 samples
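As a quick sanity check, the split sizes, source counts, and durations in the table above are mutually consistent:

```python
# Sample counts per split, copied from the dataset table above.
splits = {"train": 11287, "validation": 1045, "test": 795}
# Sample counts per source corpus.
sources = {"common_voice_ug": 5197, "uyghur_whisper_finetune": 7930}

total_samples = sum(splits.values())
print(total_samples)                      # → 13127
print(total_samples == sum(sources.values()))  # → True

# Durations in hours, as reported per split.
durations_h = {"train": 17.44, "validation": 1.37, "test": 0.86}
print(round(sum(durations_h.values()), 2))     # → 19.67
```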
## Installation

```shell
pip install torch transformers datasets librosa soundfile
```
## Usage

```python
from transformers import WhisperForConditionalGeneration, WhisperProcessor
import librosa

# Load model and processor
model = WhisperForConditionalGeneration.from_pretrained("last_model")
processor = WhisperProcessor.from_pretrained("last_model")

# Transcribe (Whisper expects 16 kHz mono audio)
audio, sr = librosa.load("audio.wav", sr=16000)
inputs = processor(audio, sampling_rate=16000, return_tensors="pt")
predicted_ids = model.generate(**inputs)
text = processor.batch_decode(predicted_ids, skip_special_tokens=True)[0]
print(text)
```
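Whisper's encoder processes 30-second windows, so longer recordings should be split before inference. A minimal chunking sketch (`chunk_audio` is a hypothetical helper, not part of the repository's `infer.py`):

```python
import numpy as np

def chunk_audio(audio: np.ndarray, sr: int = 16000, chunk_seconds: int = 30):
    """Split a mono waveform into fixed-length chunks for Whisper inference."""
    step = chunk_seconds * sr
    return [audio[i:i + step] for i in range(0, len(audio), step)]

# Example: a 75-second dummy signal at 16 kHz yields two full chunks
# plus one 15-second remainder.
dummy = np.zeros(75 * 16000, dtype=np.float32)
chunks = chunk_audio(dummy)
print([len(c) for c in chunks])  # → [480000, 480000, 240000]
```

Each chunk can then be passed through the processor and `model.generate` as in the snippet above. Alternatively, transformers' `pipeline("automatic-speech-recognition", ..., chunk_length_s=30)` handles long-form chunking automatically.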
## Test Result
See last_model/test_audio.aac and last_model/test_audio.txt for sample inference.
## Project Structure

```
├── last_model/              # Fine-tuned model
│   ├── config.json
│   ├── model.safetensors
│   ├── tokenizer.json
│   ├── infer.py
│   ├── test_audio.aac
│   └── test_audio.txt
├── merged_dataset_clean/    # Training dataset
├── finetune_whisper.py      # Training script
├── training_output.log      # Training log
└── requirements.txt
```
## Training Configuration
| Parameter | Value |
|---|---|
| Epochs | 10 |
| Batch Size | 8 |
| Gradient Accumulation | 2 |
| Learning Rate | 1e-5 |
| Warmup Steps | 500 |
| FP16 | True |
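These hyperparameters imply the following step budget (a back-of-the-envelope calculation, assuming a single GPU with no distributed training):

```python
import math

train_samples = 11287          # from the dataset table
batch_size, grad_accum, epochs = 8, 2, 10  # from the configuration table

# Gradient accumulation multiplies the per-device batch size.
effective_batch = batch_size * grad_accum
steps_per_epoch = math.ceil(train_samples / effective_batch)
total_steps = steps_per_epoch * epochs

print(effective_batch)   # → 16
print(steps_per_epoch)   # → 706
print(total_steps)       # → 7060
```

Under these assumptions, the released step-2000 checkpoint corresponds to roughly 2.8 epochs of training.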
## License
- Code: MIT License
- Model: MIT License
- Dataset: CC0-1.0 / CC-BY-4.0