Whisper Small Shami Fine-Tuned (Arabic ASR)

Model Description

This model is a fine-tuned version of OpenAI Whisper Small optimized for Arabic speech recognition, with improved performance on Levantine (Shami) dialect speech.

The model was fine-tuned using LoRA (Low-Rank Adaptation) to reduce GPU memory requirements and training time while improving transcription accuracy.

The goal of this model is to provide a lightweight and accurate Arabic ASR model suitable for research, speech analytics, and conversational AI systems.


Model Details

  • Developed by: Mohammad Bhbouh
  • Model type: Automatic Speech Recognition (ASR)
  • Base model: openai/whisper-small
  • Language(s): Arabic (Levantine / Shami dialect focus)
  • Fine-tuning method: LoRA
  • Framework: Hugging Face Transformers + PyTorch

Training Data

The model was trained on a mixture of Arabic speech datasets (Hugging Face dataset IDs in parentheses):

  • MASC Arabic Speech Dataset (pain/MASC)
  • Arabic Speech Corpus (halabi2016/arabic_speech_corpus)
  • Common Voice Arabic v22 (fsicoli/common_voice_22_0)

The datasets contain Arabic audio recordings paired with text transcriptions used for supervised ASR training.


Training Procedure

Preprocessing

Audio was processed using the Whisper preprocessing pipeline:

  • Audio resampled to 16 kHz
  • Converted to log-Mel spectrogram features
  • Text normalized and tokenized using WhisperTokenizer
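The resampling step above can be sketched in plain NumPy using linear interpolation (illustrative only; real pipelines typically rely on librosa's or torchaudio's higher-quality resamplers):

```python
import numpy as np

def resample_to_16k(audio: np.ndarray, orig_sr: int, target_sr: int = 16_000) -> np.ndarray:
    """Resample a mono waveform via linear interpolation (illustrative sketch)."""
    if orig_sr == target_sr:
        return audio
    duration = len(audio) / orig_sr
    n_target = int(round(duration * target_sr))
    # Sample positions of the original and target grids, in seconds.
    t_orig = np.arange(len(audio)) / orig_sr
    t_target = np.arange(n_target) / target_sr
    return np.interp(t_target, t_orig, audio)

# One second of a 440 Hz tone sampled at 8 kHz -> resampling to 16 kHz doubles the sample count.
tone = np.sin(2 * np.pi * 440 * np.arange(8000) / 8000)
resampled = resample_to_16k(tone, orig_sr=8000)
print(len(resampled))  # 16000
```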

Training Hyperparameters

Parameter            Value
Epochs               3
Train Batch Size     8
Eval Batch Size      8
Learning Rate        1e-3
Fine-tuning Method   LoRA
Metric               Word Error Rate (WER)
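LoRA keeps the pretrained weight W frozen and trains only a low-rank correction B @ A, which is why it cuts GPU memory and training time. A minimal NumPy sketch of one adapted linear layer (the rank r = 8 and scale alpha = 16 are illustrative assumptions, not the card's settings; the actual fine-tune used a LoRA library implementation on Whisper's projection layers):

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_out, r, alpha = 768, 768, 8, 16   # Whisper-small hidden size; rank/alpha assumed

W = rng.normal(size=(d_in, d_out))           # frozen pretrained weight
A = rng.normal(scale=0.01, size=(r, d_out))  # trainable low-rank factor
B = np.zeros((d_in, r))                      # trainable, zero-init so the update starts at 0

def lora_forward(x: np.ndarray) -> np.ndarray:
    # Base path plus scaled low-rank update: x @ (W + (alpha/r) * B @ A).
    return x @ W + (alpha / r) * (x @ B @ A)

full_params = W.size
trainable_params = A.size + B.size
print(trainable_params, full_params, f"{trainable_params / full_params:.1%}")
```

With these shapes only about 2% of the layer's parameters are trained, and because B starts at zero the adapted layer initially reproduces the base model exactly.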

Training Progress

The following table shows training loss, validation loss, and WER during training.

Step   Training Loss   Validation Loss   WER
500    1.160342        0.500775          39.813587
1000   0.976839        0.489329          40.503164
1500   0.899788        0.459732          37.638768
2000   0.853361        0.446455          37.043913
2500   0.728488        0.443102          36.899936
3000   0.707266        0.410259          33.751373
3500   0.699017        0.395678          33.357330
4000   0.582821        0.383099          31.440155
4500   0.391267        0.371010          30.674800
5000   0.406626        0.363339          30.587656
5500   0.340996        0.349050          28.356003
6000   0.346701        0.341245          27.984693

The results show steady improvement in WER as training progresses.


Evaluation

Metric

The main evaluation metric is Word Error Rate (WER), which measures the word-level edit distance between the predicted transcription and the reference text, normalized by the number of reference words. Lower values indicate better recognition performance.
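Concretely, WER = (substitutions + deletions + insertions) / reference word count, computed via word-level edit distance. A minimal self-contained implementation:

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word Error Rate: (S + D + I) / N via word-level edit distance."""
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,        # deletion
                           dp[i][j - 1] + 1,        # insertion
                           dp[i - 1][j - 1] + cost) # substitution / match
    return dp[-1][-1] / len(ref)

print(wer("the cat sat", "the cat sat down"))  # 1 insertion / 3 words ≈ 0.333
```

Libraries such as jiwer implement the same metric with additional text normalization.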


Results

Model                WER
Base Whisper Small   100.29%
Fine-Tuned Model     83.38%

Improvement

The fine-tuned model reduced WER by 16.91 percentage points (from 100.29% to 83.38%), a relative error reduction of roughly 16.9%.
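The improvement figure is the absolute drop in WER; the relative improvement (as a fraction of the baseline error) comes out slightly lower:

```python
base, tuned = 100.29, 83.38
absolute = base - tuned           # in percentage points
relative = absolute / base * 100  # as a percent of the baseline error
print(f"{absolute:.2f} pp absolute, {relative:.2f}% relative")
```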


Usage

Example Usage

from transformers import WhisperProcessor, WhisperForConditionalGeneration
import torch
import librosa

model_id = "mabahboh/whisper-shami"

processor = WhisperProcessor.from_pretrained(model_id)
model = WhisperForConditionalGeneration.from_pretrained(model_id)
model.eval()

# Whisper expects 16 kHz mono audio.
audio, sr = librosa.load("audio.wav", sr=16000)

inputs = processor(audio, sampling_rate=sr, return_tensors="pt")

with torch.no_grad():
    # Force Arabic transcription so the model skips language detection.
    predicted_ids = model.generate(inputs.input_features, language="ar", task="transcribe")

transcription = processor.batch_decode(predicted_ids, skip_special_tokens=True)[0]
print(transcription)