# Whisper Small Shami Fine-Tuned (Arabic ASR)
## Model Description
This model is a fine-tuned version of OpenAI Whisper Small optimized for Arabic speech recognition, with improved performance on Levantine (Shami) dialect speech.
The model was fine-tuned using LoRA (Low-Rank Adaptation) to reduce GPU memory requirements and training time while improving transcription accuracy.
The goal of this model is to provide a lightweight and accurate Arabic ASR model suitable for research, speech analytics, and conversational AI systems.
## Model Details
- Developed by: Mohammad Bhbouh
- Model type: Automatic Speech Recognition (ASR)
- Base model: openai/whisper-small
- Language(s): Arabic (Levantine / Shami dialect focus)
- Fine-tuning method: LoRA
- Framework: Hugging Face Transformers + PyTorch
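The idea behind LoRA is to freeze the pretrained weights and train only two small low-rank matrices per adapted layer. A minimal NumPy sketch of the mechanism (illustrative only, not the PEFT library's actual implementation; the dimensions and rank below are example values, not this model's training config):

```python
import numpy as np

# LoRA replaces a full weight update on W (d_out x d_in) with two small
# trainable factors: A (r x d_in) and B (d_out x r), with r << d_in.
# The effective weight is W + (alpha / r) * B @ A.
rng = np.random.default_rng(0)
d_in, d_out, r, alpha = 768, 768, 8, 16   # example sizes, rank, and scaling

W = rng.normal(size=(d_out, d_in))        # frozen pretrained weight
A = rng.normal(size=(r, d_in)) * 0.01     # trainable down-projection
B = np.zeros((d_out, r))                  # trainable up-projection, zero-init

def lora_forward(x):
    # Frozen path plus the scaled low-rank correction
    return x @ W.T + (alpha / r) * (x @ A.T @ B.T)

x = rng.normal(size=(1, d_in))
# With B initialised to zero, the adapted layer starts out identical
# to the frozen pretrained layer.
assert np.allclose(lora_forward(x), x @ W.T)

# Trainable-parameter savings: full update vs. low-rank factors
print(d_out * d_in)          # 589824 parameters for a full update
print(r * (d_in + d_out))    # 12288 parameters for the LoRA factors
```

This is why LoRA cuts GPU memory and training time: only the small `A`/`B` factors receive gradients, while `W` stays frozen.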
## Training Data
The model was trained using a mixture of Arabic speech datasets:
- MASC Arabic Speech Dataset
- Arabic Speech Corpus
- Common Voice Arabic v22
Datasets used:
- pain/MASC
- halabi2016/arabic_speech_corpus
- fsicoli/common_voice_22_0
The datasets contain Arabic audio recordings paired with text transcriptions used for supervised ASR training.
## Training Procedure

### Preprocessing
Audio was processed using the Whisper preprocessing pipeline:
- Audio resampled to 16 kHz
- Converted to log-Mel spectrogram features
- Text normalized and tokenized using WhisperTokenizer
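The log-Mel feature extraction step can be sketched in plain NumPy. This is a simplified illustration of the standard 80-bin, 25 ms window / 10 ms hop setup Whisper uses at 16 kHz, not the exact Whisper implementation:

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def log_mel_spectrogram(audio, sr=16000, n_fft=400, hop=160, n_mels=80):
    # Frame the signal with a Hann window: 400 samples = 25 ms, 160 = 10 ms hop
    window = np.hanning(n_fft)
    frames = [audio[s:s + n_fft] * window
              for s in range(0, len(audio) - n_fft + 1, hop)]
    spec = np.abs(np.fft.rfft(np.stack(frames), axis=1)) ** 2  # power spectrum

    # Triangular mel filterbank spanning 0 Hz .. Nyquist
    hz_pts = mel_to_hz(np.linspace(hz_to_mel(0), hz_to_mel(sr / 2), n_mels + 2))
    bins = np.floor((n_fft + 1) * hz_pts / sr).astype(int)
    fbank = np.zeros((n_mels, n_fft // 2 + 1))
    for i in range(1, n_mels + 1):
        l, c, r = bins[i - 1], bins[i], bins[i + 1]
        for k in range(l, c):
            fbank[i - 1, k] = (k - l) / max(c - l, 1)
        for k in range(c, r):
            fbank[i - 1, k] = (r - k) / max(r - c, 1)

    # Apply filterbank and log-compress, clamping to avoid log(0)
    return np.log10(np.maximum(spec @ fbank.T, 1e-10))

audio = np.random.randn(16000).astype(np.float32)  # 1 s of noise as a stand-in
feats = log_mel_spectrogram(audio)
print(feats.shape)  # (98, 80): 98 frames, 80 mel bins
```

In practice this is all handled by `WhisperProcessor` / `WhisperFeatureExtractor`, as shown in the usage example below.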
### Training Hyperparameters
| Parameter | Value |
|---|---|
| Epochs | 3 |
| Train Batch Size | 8 |
| Eval Batch Size | 8 |
| Learning Rate | 1e-3 |
| Fine-tuning Method | LoRA |
| Metric | Word Error Rate (WER) |
### Training Progress
The following table shows training loss, validation loss, and WER during training.
| Step | Training Loss | Validation Loss | WER |
|---|---|---|---|
| 500 | 1.160342 | 0.500775 | 39.813587 |
| 1000 | 0.976839 | 0.489329 | 40.503164 |
| 1500 | 0.899788 | 0.459732 | 37.638768 |
| 2000 | 0.853361 | 0.446455 | 37.043913 |
| 2500 | 0.728488 | 0.443102 | 36.899936 |
| 3000 | 0.707266 | 0.410259 | 33.751373 |
| 3500 | 0.699017 | 0.395678 | 33.357330 |
| 4000 | 0.582821 | 0.383099 | 31.440155 |
| 4500 | 0.391267 | 0.371010 | 30.674800 |
| 5000 | 0.406626 | 0.363339 | 30.587656 |
| 5500 | 0.340996 | 0.349050 | 28.356003 |
| 6000 | 0.346701 | 0.341245 | 27.984693 |
The results show steady improvement in WER as training progresses.
## Evaluation

### Metric

The main evaluation metric is Word Error Rate (WER), which measures the word-level edit distance between the predicted transcription and the reference text. Lower values indicate better speech recognition performance.
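WER can be computed as a word-level Levenshtein distance divided by the reference length. A minimal self-contained sketch (libraries such as `jiwer` or `evaluate` provide production implementations):

```python
def wer(reference: str, hypothesis: str) -> float:
    """(substitutions + deletions + insertions) / number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    # d[i][j] = edit distance between ref[:i] and hyp[:j]
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + cost) # substitution or match
    return d[-1][-1] / len(ref)

print(wer("the cat sat on the mat", "the cat sat on mat"))
# 1 deletion over 6 reference words -> approximately 0.167
```

Note that WER can exceed 100% when the hypothesis contains many insertions relative to the reference.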
### Results
| Model | WER |
|---|---|
| Base Whisper Small | 100.29% |
| Fine-Tuned Model | 83.38% |
### Improvement

The fine-tuned model achieved a 16.91 percentage-point absolute reduction in WER (from 100.29% to 83.38%) relative to the base Whisper Small model.
## Usage

### Example Usage

```python
from transformers import WhisperProcessor, WhisperForConditionalGeneration
import torch
import librosa

model_id = "mabahboh/whisper-shami"
processor = WhisperProcessor.from_pretrained(model_id)
model = WhisperForConditionalGeneration.from_pretrained(model_id)

# Load audio and resample to the 16 kHz rate Whisper expects
audio, sr = librosa.load("audio.wav", sr=16000)
inputs = processor(audio, sampling_rate=sr, return_tensors="pt")

# Generate token IDs from the log-Mel features, then decode to text
with torch.no_grad():
    predicted_ids = model.generate(inputs.input_features)
transcription = processor.batch_decode(predicted_ids, skip_special_tokens=True)[0]
print(transcription)
```