# Fine-Tuned Whisper Model for Pashto
This fine-tuned Whisper Medium model provides high-quality Pashto speech-to-text transcription, optimized for diverse accents and noisy environments.
## Model Details
- **Developed by:** AbdulMoizShah01
- **Shared by:** AbdulMoizShah01/pashto-whisper-medium
- **Model type:** Encoder–decoder sequence-to-sequence ASR
- **Language(s):** Pashto (ps)
- **License:** Apache 2.0
- **Fine-tuned from:** openai/whisper-medium
## Uses

### Direct Use

- **ASR transcription:** Convert Pashto audio files (.wav, .mp3, .flac) sampled at ≥16 kHz into text.
- **Research:** Evaluate Pashto speech recognition in academic settings.
### Downstream Use

- Captioning and subtitles for Pashto media.
- Preprocessing step in Pashto NLP pipelines (e.g., speech analytics).
### Out-of-Scope Use

- Non-Pashto languages or code-switching beyond simple borrowings.
- Speech translation: the model only transcribes; it does not translate.
## Bias, Risks, and Limitations

- Trained primarily on Mozilla Common Voice Pashto and supplemental domain recordings; may underperform on rare dialects or highly noisy audio.
- Potential bias toward speakers in urban settings.
- May mis-transcribe uncommon proper nouns or technical terms.
### Recommendations

- Validate outputs when using the model in critical settings (e.g., legal or medical transcription).
- Provide clean audio sampled at ≥16 kHz for best accuracy.
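Where incoming audio is not already at 16 kHz, it can be resampled before transcription. Below is a minimal sketch using SciPy's `resample_poly`; the `to_16k` helper is hypothetical and not part of this model's tooling:

```python
import numpy as np
from scipy.signal import resample_poly

def to_16k(audio: np.ndarray, orig_sr: int, target_sr: int = 16_000) -> np.ndarray:
    """Resample a mono waveform to 16 kHz using polyphase filtering."""
    if orig_sr == target_sr:
        return audio
    g = np.gcd(orig_sr, target_sr)
    return resample_poly(audio, target_sr // g, orig_sr // g)

# Example: one second of 48 kHz audio becomes 16,000 samples.
one_sec = np.zeros(48_000, dtype=np.float32)
resampled = to_16k(one_sec, 48_000)
```

The resampled array can then be passed directly to the processor with `sampling_rate=16000`.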
## How to Get Started with the Model
```python
import librosa
import torch
from transformers import WhisperProcessor, WhisperForConditionalGeneration

# Load processor and model
processor = WhisperProcessor.from_pretrained("AbdulMoizShah01/pashto-whisper-medium")
model = WhisperForConditionalGeneration.from_pretrained("AbdulMoizShah01/pashto-whisper-medium")
model.eval()

# Load and preprocess audio (resampled to 16 kHz)
audio, sr = librosa.load("path/to/audio.wav", sr=16000)
inputs = processor(audio, sampling_rate=sr, return_tensors="pt")

# Transcribe
with torch.no_grad():
    generated_ids = model.generate(inputs.input_features)
transcription = processor.batch_decode(generated_ids, skip_special_tokens=True)[0]
print(transcription)
```
## Training Details

### Training Data

- **Primary dataset:** Mozilla Common Voice Pashto (~200 hours).
- **Supplementary data:** Domain recordings from news broadcasts, podcasts, and conversational samples (~300 hours).
- **Preprocessing:** Noise reduction, normalization, and segmentation into 30-second clips.
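The 30-second segmentation step can be sketched as a simple chunking pass over the waveform. The `segment` helper below is a hypothetical illustration, not the actual preprocessing code used for training:

```python
import numpy as np

def segment(audio: np.ndarray, sr: int = 16_000, clip_s: int = 30) -> list[np.ndarray]:
    """Split a mono waveform into consecutive clips of at most clip_s seconds."""
    step = sr * clip_s
    return [audio[i:i + step] for i in range(0, len(audio), step)]

# Example: 70 s of 16 kHz audio yields clips of 30 s, 30 s, and 10 s.
clips = segment(np.zeros(70 * 16_000, dtype=np.float32))
```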
### Training Procedure

- **Framework:** PyTorch and Hugging Face Transformers
- **Optimizer:** AdamW
- **Learning rate:** 5e-6 with linear decay
- **Batch size:** 32
- **Epochs:** 100
- **Hardware:** NVIDIA GPUs (PAF-IAST)
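The linear-decay learning-rate schedule can be illustrated with a small helper. `linear_decay_lr` is hypothetical, and the sketch assumes no warmup phase since the card does not mention one:

```python
def linear_decay_lr(step: int, total_steps: int, peak_lr: float = 5e-6) -> float:
    """Learning rate that decays linearly from peak_lr at step 0 to 0 at total_steps."""
    frac = min(step, total_steps) / total_steps
    return peak_lr * (1.0 - frac)

# Example values over a hypothetical 10,000-step run.
start = linear_decay_lr(0, 10_000)        # 5e-6
halfway = linear_decay_lr(5_000, 10_000)  # 2.5e-6
end = linear_decay_lr(10_000, 10_000)     # 0.0
```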
## Evaluation

| Metric | Value |
|---|---|
| Word Error Rate (WER) | 8.5% |
| Character Error Rate (CER) | 3.1% |
| In-domain accuracy | 91.5% |
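The card does not state which scorer produced these numbers. As an illustration, WER is the word-level edit distance divided by the reference word count, sketched here in pure Python (`wer` is a hypothetical helper, not the evaluation code used for this model):

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word Error Rate: word-level edit distance divided by reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # Dynamic-programming edit distance over word sequences, one row at a time.
    d = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        prev, d[0] = d[0], i
        for j, h in enumerate(hyp, start=1):
            cur = min(d[j] + 1,          # deletion
                      d[j - 1] + 1,      # insertion
                      prev + (r != h))   # substitution / match
            prev, d[j] = d[j], cur
    return d[len(hyp)] / max(len(ref), 1)

# Example: one word dropped out of three gives a WER of 1/3.
score = wer("the cat sat", "the sat")
```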