Whisper Medium — Direct Speech-to-ALDi (SADA)

Predicts the Arabic Level of Dialectness (ALDi) score directly from raw speech audio — no intermediate ASR transcription needed.

Output Meaning
0.0 — Pure Modern Standard Arabic (MSA / Fuṣḥā)
0.5 — Mixed: features of both MSA and dialect
1.0 — Heavy colloquial dialect
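For downstream filtering, the continuous scale above can be bucketed into coarse labels. The 0.25 / 0.75 cut-offs below are illustrative choices for this sketch, not part of the model:

```python
def interpret_aldi(score: float) -> str:
    """Map a raw ALDi score to a coarse label.

    The 0.25 / 0.75 thresholds are illustrative, not defined by the model.
    """
    if score < 0.25:
        return "MSA"
    if score < 0.75:
        return "Mixed"
    return "Dialect"
```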

Architecture: OpenAI Whisper Medium encoder + regression head (LayerNorm → Dropout(0.1) → Linear(1024→1))
Training: End-to-end MSE loss on SADA22 (~420 h of Saudi Arabic)
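As a minimal sketch of that objective, one MSE training step on the regression head alone (random tensors stand in for mean-pooled encoder features and SADA labels; no pretrained weights are downloaded):

```python
import torch
import torch.nn as nn

# Regression head as described: LayerNorm -> Dropout(0.1) -> Linear(1024 -> 1)
hidden = 1024
head = nn.Sequential(nn.LayerNorm(hidden), nn.Dropout(0.1), nn.Linear(hidden, 1))

# One hypothetical step: random stand-ins for pooled encoder output and labels.
pooled = torch.randn(4, hidden)   # (batch, hidden), would come from the encoder
targets = torch.rand(4)           # ALDi labels in [0, 1]
pred = head(pooled).squeeze(-1)   # (batch,)
loss = nn.functional.mse_loss(pred, targets)
loss.backward()                   # gradients flow into the head
```

In the actual setup the encoder is trained end-to-end as well; this only isolates the loss and head for clarity.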

Quick Start

1. Install dependencies

pip install torch transformers huggingface_hub torchaudio

2. Define the model class

import torch
import torch.nn as nn
import torch.nn.functional as F
from transformers import WhisperModel, WhisperFeatureExtractor
from huggingface_hub import hf_hub_download

class WhisperALDiRegressor(nn.Module):
    def __init__(self, model_id: str = "openai/whisper-medium", freeze_encoder: bool = False):
        super().__init__()
        self.encoder = WhisperModel.from_pretrained(model_id).encoder
        hidden = self.encoder.config.d_model  # 1024 for whisper-medium
        self.head = nn.Sequential(
            nn.LayerNorm(hidden),
            nn.Dropout(0.1),
            nn.Linear(hidden, 1),
        )
        if freeze_encoder:
            for p in self.encoder.parameters():
                p.requires_grad = False

    def forward(self, input_features: torch.Tensor, attention_mask: torch.Tensor):
        enc_out = self.encoder(
            input_features=input_features,
            attention_mask=attention_mask,
        ).last_hidden_state  # (B, T_enc, H)

        # Downsample mask to match encoder output length
        reduced_mask = F.interpolate(
            attention_mask.unsqueeze(1).float(),
            size=enc_out.size(1),
            mode="nearest",
        ).squeeze(1)  # (B, T_enc)

        mask = reduced_mask.unsqueeze(-1)  # (B, T_enc, 1)
        pooled = (enc_out * mask).sum(1) / mask.sum(1).clamp(min=1)
        return self.head(pooled).squeeze(-1)  # (B,) — raw score, no sigmoid

3. Load the checkpoint

REPO_ID = "wageehkhad/whisper-medium-sada-speech-aldi"

ckpt_path = hf_hub_download(repo_id=REPO_ID, filename="checkpoint_epoch5.pt")
ckpt = torch.load(ckpt_path, map_location="cpu", weights_only=False)  # full pickle load — only use checkpoints you trust

model = WhisperALDiRegressor()
model.load_state_dict(ckpt["model_state"])
model.eval()

feature_extractor = WhisperFeatureExtractor.from_pretrained("openai/whisper-medium")

4. Run inference on an audio file

import torchaudio

def predict_aldi(audio_path: str, model, feature_extractor, device: str = "cpu") -> float:
    """
    Returns an ALDi score, nominally in [0, 1] (the head has no sigmoid,
    so raw outputs can fall slightly outside this range).
      0.0 → Modern Standard Arabic (MSA)
      1.0 → Heavy dialect
    Accepts any format supported by torchaudio (WAV, FLAC, MP3, etc.).
    """
    wav, sr = torchaudio.load(audio_path)
    wav = torchaudio.functional.resample(wav, sr, 16_000).mean(0).numpy()

    inputs = feature_extractor(wav, sampling_rate=16_000, return_tensors="pt")
    input_features = inputs["input_features"].to(device)

    if "attention_mask" in inputs:
        attention_mask = inputs["attention_mask"].to(device)
    else:
        # Feature extractor omitted the mask — treat all frames as valid
        attention_mask = torch.ones(
            input_features.shape[-1], dtype=torch.long, device=device
        ).unsqueeze(0)

    model.to(device)
    with torch.no_grad():
        score = model(input_features, attention_mask)

    return float(score.item())

# Example
score = predict_aldi("speech.wav", model, feature_extractor)
print(f"ALDi score: {score:.3f}")

Limitations

  • Trained exclusively on Saudi Arabic (SADA22). Scores for other dialects (Egyptian, Levantine, Moroccan, etc.) may be poorly calibrated.
  • Inputs are padded or truncated to Whisper's 30-second window, so audio beyond 30 seconds is ignored. For longer files, split into ≤30-second segments and average the scores.
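A pure-Python sketch of that splitting step, assuming 16 kHz audio. Merging any sub-second tail into the previous chunk is a design choice of this sketch, and `score_waveform` in the usage comment is a hypothetical helper wrapping the inference code above:

```python
def chunk_spans(n_samples: int, sr: int = 16_000,
                window_s: float = 30.0, min_tail_s: float = 1.0):
    """Return (start, end) sample indices covering the signal in 30 s windows.

    Tails shorter than `min_tail_s` seconds are merged into the previous
    chunk so no near-empty segment gets its own (unreliable) score.
    """
    win = int(window_s * sr)
    spans, start = [], 0
    while start < n_samples:
        end = min(start + win, n_samples)
        if spans and end - start < int(min_tail_s * sr):
            spans[-1] = (spans[-1][0], end)  # merge short tail into previous
        else:
            spans.append((start, end))
        start = end
    return spans

# Usage idea (score_waveform is hypothetical — run the model on each slice):
#   scores = [score_waveform(wav[:, s:e]) for s, e in chunk_spans(wav.shape[-1])]
#   aldi = sum(scores) / len(scores)
```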

Citation

If you use this model, please cite the ALDi paper and the SADA dataset:

@inproceedings{keleg2023aldi,
  title     = {ALDi: Quantifying the Arabic Level of Dialectness of Text},
  author    = {Keleg, Amr and Goldwater, Sharon and Magdy, Walid},
  booktitle = {Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing (EMNLP)},
  year      = {2023},
  publisher = {Association for Computational Linguistics},
  address   = {Singapore},
  url       = {https://aclanthology.org/2023.emnlp-main.655}
}

@misc{sada22,
  author       = {Al-Gamdi, Ahmed and others},
  title        = {SADA: Saudi Audio Dataset for Arabic},
  year         = {2022},
  howpublished = {\url{https://huggingface.co/datasets/MohamedRashad/SADA22}},
  note         = {Accessed 2026}
}