# Whisper Medium: Direct Speech-to-ALDi (SADA)

Predicts the Arabic Level of Dialectness (ALDi) score directly from raw speech audio, with no intermediate ASR transcription step.
| Output | Meaning |
|---|---|
| 0.0 | Pure Modern Standard Arabic (MSA / Fuṣḥā) |
| 0.5 | Mixed: features of both MSA and dialect |
| 1.0 | Heavy colloquial dialect |
**Architecture:** OpenAI Whisper Medium encoder + regression head (LayerNorm → Dropout(0.1) → Linear(1024→1))
**Training:** End-to-end MSE loss on SADA22 (~420 h of Saudi Arabic)
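The training objective is plain MSE regression on the pooled score. A minimal sketch of one optimization step is below; a small `nn.Linear` stands in for the full Whisper regressor so the snippet runs without downloading weights, and all tensors are illustrative placeholders:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Stand-in regressor; in real training `model` is the WhisperALDiRegressor
# defined in the Quick Start below.
model = nn.Linear(4, 1)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)

pooled = torch.randn(8, 4)        # placeholder for pooled encoder states (B, H)
targets = torch.rand(8)           # placeholder ALDi labels in [0, 1]

pred = model(pooled).squeeze(-1)  # (B,) raw scores, no sigmoid
loss = F.mse_loss(pred, targets)  # the end-to-end MSE objective
loss.backward()
optimizer.step()
optimizer.zero_grad()
```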
## Quick Start
### 1. Install dependencies

```bash
pip install torch transformers huggingface_hub torchaudio
```
### 2. Define the model class
```python
import torch
import torch.nn as nn
import torch.nn.functional as F
from transformers import WhisperModel, WhisperFeatureExtractor
from huggingface_hub import hf_hub_download


class WhisperALDiRegressor(nn.Module):
    def __init__(self, model_id: str = "openai/whisper-medium", freeze_encoder: bool = False):
        super().__init__()
        self.encoder = WhisperModel.from_pretrained(model_id).encoder
        hidden = self.encoder.config.d_model  # 1024 for whisper-medium
        self.head = nn.Sequential(
            nn.LayerNorm(hidden),
            nn.Dropout(0.1),
            nn.Linear(hidden, 1),
        )
        if freeze_encoder:
            for p in self.encoder.parameters():
                p.requires_grad = False

    def forward(self, input_features: torch.Tensor, attention_mask: torch.Tensor):
        enc_out = self.encoder(
            input_features=input_features,
            attention_mask=attention_mask,
        ).last_hidden_state  # (B, T_enc, H)
        # Downsample the frame-level mask to match the encoder output length
        reduced_mask = F.interpolate(
            attention_mask.unsqueeze(1).float(),
            size=enc_out.size(1),
            mode="nearest",
        ).squeeze(1)  # (B, T_enc)
        mask = reduced_mask.unsqueeze(-1)  # (B, T_enc, 1)
        # Masked mean pooling over time, then the regression head
        pooled = (enc_out * mask).sum(1) / mask.sum(1).clamp(min=1)
        return self.head(pooled).squeeze(-1)  # (B,) raw score, no sigmoid
```
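The mask downsampling and masked mean pooling inside `forward` can be sanity-checked on dummy tensors, with no Whisper weights needed. Shapes here are illustrative (the hidden size is shrunk to 4); the downsampling factor of 2 matches Whisper's encoder, which halves the time axis:

```python
import torch
import torch.nn.functional as F

B, T_in, T_enc, H = 2, 3000, 1500, 4   # toy shapes; real Whisper Medium has H = 1024
enc_out = torch.randn(B, T_enc, H)     # fake encoder output
attention_mask = torch.ones(B, T_in)
attention_mask[1, 1000:] = 0           # second example is padding after frame 1000

# Same ops as in WhisperALDiRegressor.forward
reduced = F.interpolate(
    attention_mask.unsqueeze(1).float(), size=T_enc, mode="nearest"
).squeeze(1)
mask = reduced.unsqueeze(-1)
pooled = (enc_out * mask).sum(1) / mask.sum(1).clamp(min=1)

print(pooled.shape)  # torch.Size([2, 4])
# Padded frames contribute nothing: pooled[1] equals the mean of the first
# 500 encoder frames (1000 valid input frames / downsample factor 2).
```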
### 3. Load the checkpoint
```python
REPO_ID = "wageehkhad/whisper-medium-sada-speech-aldi"

ckpt_path = hf_hub_download(repo_id=REPO_ID, filename="checkpoint_epoch5.pt")
ckpt = torch.load(ckpt_path, map_location="cpu", weights_only=False)

model = WhisperALDiRegressor()
model.load_state_dict(ckpt["model_state"])
model.eval()

feature_extractor = WhisperFeatureExtractor.from_pretrained("openai/whisper-medium")
```
### 4. Run inference on an audio file
```python
import torchaudio


def predict_aldi(audio_path: str, model, feature_extractor, device: str = "cpu") -> float:
    """
    Returns an ALDi score in [0, 1].
        0.0 → Modern Standard Arabic (MSA)
        1.0 → heavy dialect
    Accepts any format supported by torchaudio (WAV, FLAC, MP3, etc.).
    """
    wav, sr = torchaudio.load(audio_path)
    wav = torchaudio.functional.resample(wav, sr, 16_000).mean(0).numpy()
    inputs = feature_extractor(wav, sampling_rate=16_000, return_tensors="pt")
    input_features = inputs["input_features"].to(device)
    if "attention_mask" in inputs:
        attention_mask = inputs["attention_mask"].to(device)
    else:
        # Feature extractor returned no mask: treat all frames as valid
        attention_mask = torch.ones(
            input_features.shape[-1], dtype=torch.long, device=device
        ).unsqueeze(0)
    model.to(device)
    with torch.no_grad():
        score = model(input_features, attention_mask)
    return float(score.item())


# Example
score = predict_aldi("speech.wav", model, feature_extractor)
print(f"ALDi score: {score:.3f}")
```
## Limitations
- Trained exclusively on Saudi Arabic (SADA22). Scores for other dialects (Egyptian, Levantine, Moroccan, etc.) may be less well calibrated.
- Audio is internally chunked to Whisper's 30-second window. For longer files, split into segments and average the scores.
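For long recordings, the segment-and-average strategy above can be sketched with a small helper that computes 30-second spans in samples. The `segment_spans` name is illustrative, not part of this repo; scoring each span reuses the `predict_aldi` flow from the Quick Start:

```python
def segment_spans(num_samples: int, sr: int = 16_000, window_s: float = 30.0):
    """Split a waveform of num_samples samples into consecutive spans of <= window_s seconds."""
    step = int(window_s * sr)
    return [(start, min(start + step, num_samples))
            for start in range(0, num_samples, step)]


# e.g. a 75-second file at 16 kHz yields three segments: 30 s + 30 s + 15 s
spans = segment_spans(75 * 16_000)
print(spans)  # [(0, 480000), (480000, 960000), (960000, 1200000)]

# Then score each slice wav[:, a:b] with the model and average:
# scores = [score_segment(wav[:, a:b]) for a, b in spans]
# aldi = sum(scores) / len(scores)
```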
## Citation
If you use this model, please cite the ALDi paper and the SADA dataset:
```bibtex
@inproceedings{keleg2023aldi,
  title     = {ALDi: Quantifying the Arabic Level of Dialectness of Text},
  author    = {Keleg, Amr and Goldwater, Sharon and Magdy, Walid},
  booktitle = {Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing (EMNLP)},
  year      = {2023},
  publisher = {Association for Computational Linguistics},
  address   = {Singapore},
  url       = {https://aclanthology.org/2023.emnlp-main.655}
}

@misc{sada22,
  author       = {Al-Gamdi, Ahmed and others},
  title        = {SADA: Saudi Audio Dataset for Arabic},
  year         = {2022},
  howpublished = {\url{https://huggingface.co/datasets/MohamedRashad/SADA22}},
  note         = {Accessed 2026}
}
```