Whisper Small Shami Fine-Tuned (Arabic ASR)

Model Description

This model is a fine-tuned version of OpenAI Whisper Small optimized for Arabic speech recognition, with improved performance on Levantine (Shami) dialect speech.

The model was fine-tuned using LoRA (Low-Rank Adaptation) to reduce GPU memory requirements and training time while improving transcription accuracy.

The goal of this model is to provide a lightweight and accurate Arabic ASR model suitable for research, speech analytics, and conversational AI systems.


Model Details

  • Developed by: Mohammad Bhbouh
  • Model type: Automatic Speech Recognition (ASR)
  • Base model: openai/whisper-small
  • Language(s): Arabic (Levantine / Shami dialect focus)
  • Fine-tuning method: LoRA
  • Framework: Hugging Face Transformers + PyTorch

Training Data

The model was trained on a mixture of Arabic speech datasets (Hugging Face dataset IDs in parentheses):

  • MASC Arabic Speech Dataset (pain/MASC)
  • Arabic Speech Corpus (halabi2016/arabic_speech_corpus)
  • Common Voice Arabic v22 (fsicoli/common_voice_22_0)

The datasets contain Arabic audio recordings paired with text transcriptions used for supervised ASR training.


Training Procedure

Preprocessing

Audio was processed using the Whisper preprocessing pipeline:

  • Audio resampled to 16 kHz
  • Converted to log-Mel spectrogram features
  • Text normalized and tokenized using WhisperTokenizer
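The resampling step above can be sketched in plain NumPy using linear interpolation (illustrative only; real pipelines typically rely on librosa's or torchaudio's higher-quality resamplers):

```python
import numpy as np

def resample_to_16k(audio: np.ndarray, orig_sr: int, target_sr: int = 16_000) -> np.ndarray:
    """Resample a mono waveform via linear interpolation (illustrative sketch)."""
    if orig_sr == target_sr:
        return audio
    duration = len(audio) / orig_sr
    n_target = int(round(duration * target_sr))
    # Sample positions of the original and target grids, in seconds.
    t_orig = np.arange(len(audio)) / orig_sr
    t_target = np.arange(n_target) / target_sr
    return np.interp(t_target, t_orig, audio)

# One second of a 440 Hz tone sampled at 8 kHz -> resampling to 16 kHz doubles the sample count.
tone = np.sin(2 * np.pi * 440 * np.arange(8000) / 8000)
resampled = resample_to_16k(tone, orig_sr=8000)
print(len(resampled))  # 16000
```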

Training Hyperparameters

Parameter            Value
Epochs               3
Train Batch Size     8
Eval Batch Size      8
Learning Rate        1e-3
Fine-tuning Method   LoRA
Metric               Word Error Rate (WER)
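LoRA keeps the pretrained weight W frozen and trains only a low-rank correction B @ A, which is why it cuts GPU memory and training time. A minimal NumPy sketch of one adapted linear layer (the rank r = 8 and scale alpha = 16 are illustrative assumptions, not the card's settings; the actual fine-tune used a LoRA library implementation on Whisper's projection layers):

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_out, r, alpha = 768, 768, 8, 16   # Whisper-small hidden size; rank/alpha assumed

W = rng.normal(size=(d_in, d_out))           # frozen pretrained weight
A = rng.normal(scale=0.01, size=(r, d_out))  # trainable low-rank factor
B = np.zeros((d_in, r))                      # trainable, zero-init so the update starts at 0

def lora_forward(x: np.ndarray) -> np.ndarray:
    # Base path plus scaled low-rank update: x @ (W + (alpha/r) * B @ A).
    return x @ W + (alpha / r) * (x @ B @ A)

full_params = W.size
trainable_params = A.size + B.size
print(trainable_params, full_params, f"{trainable_params / full_params:.1%}")
```

With these shapes only about 2% of the layer's parameters are trained, and because B starts at zero the adapted layer initially reproduces the base model exactly.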

Training Progress

The following table shows training loss, validation loss, and WER during training.

Step   Training Loss   Validation Loss   WER
500    1.160342        0.500775          39.813587
1000   0.976839        0.489329          40.503164
1500   0.899788        0.459732          37.638768
2000   0.853361        0.446455          37.043913
2500   0.728488        0.443102          36.899936
3000   0.707266        0.410259          33.751373
3500   0.699017        0.395678          33.357330
4000   0.582821        0.383099          31.440155
4500   0.391267        0.371010          30.674800
5000   0.406626        0.363339          30.587656
5500   0.340996        0.349050          28.356003
6000   0.346701        0.341245          27.984693

The results show steady improvement in WER as training progresses.


Evaluation

Metric

The main evaluation metric is Word Error Rate (WER), which measures the word-level edit distance between the predicted transcription and the reference text, normalized by the number of reference words. Lower values indicate better recognition performance.
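Concretely, WER = (substitutions + deletions + insertions) / reference word count, computed via word-level edit distance. A minimal self-contained implementation:

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word Error Rate: (S + D + I) / N via word-level edit distance."""
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,        # deletion
                           dp[i][j - 1] + 1,        # insertion
                           dp[i - 1][j - 1] + cost) # substitution / match
    return dp[-1][-1] / len(ref)

print(wer("the cat sat", "the cat sat down"))  # 1 insertion / 3 words ≈ 0.333
```

Libraries such as jiwer implement the same metric with additional text normalization.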


Results

Model                WER
Base Whisper Small   100.29%
Fine-Tuned Model     83.38%

Improvement

The fine-tuned model reduced WER by 16.91 percentage points (from 100.29% to 83.38%), a relative error reduction of roughly 16.9%.
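The improvement figure is the absolute drop in WER; the relative improvement (as a fraction of the baseline error) comes out slightly lower:

```python
base, tuned = 100.29, 83.38
absolute = base - tuned           # in percentage points
relative = absolute / base * 100  # as a percent of the baseline error
print(f"{absolute:.2f} pp absolute, {relative:.2f}% relative")
```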


Usage

Example Usage

from transformers import WhisperProcessor, WhisperForConditionalGeneration
import torch
import librosa

model_id = "mabahboh/whisper-shami"

processor = WhisperProcessor.from_pretrained(model_id)
model = WhisperForConditionalGeneration.from_pretrained(model_id)
model.eval()

# Whisper expects 16 kHz mono audio.
audio, sr = librosa.load("audio.wav", sr=16000)

inputs = processor(audio, sampling_rate=sr, return_tensors="pt")

with torch.no_grad():
    # Force Arabic transcription so the model skips language detection.
    predicted_ids = model.generate(inputs.input_features, language="ar", task="transcribe")

transcription = processor.batch_decode(predicted_ids, skip_special_tokens=True)[0]
print(transcription)