YASR: Yoruba Automatic Speech Recognition

Fine-tuned from facebook/mms-1b-all on Mozilla Common Voice Yoruba using a signal-processing-informed curriculum learning strategy. Only 2,271,582 of 964,768,990 parameters are updated (0.24%) — the Yoruba adapter layers and CTC projection head. The CNN feature extractor and all 48 transformer encoder layers remain frozen throughout training.


Results

| Evaluation Set | WER | CER | Notes |
| --- | --- | --- | --- |
| Test set (clean, held-out) | 33.85% | 10.08% | 572 speaker-stratified clips |
| Tier-3 challenge set (noisy) | 36.62% | 12.18% | 554 out-of-distribution clips, never seen in training |
| Validation set (greedy, no LM) | 40.89% | 10.36% | Step 3 acoustic model only |

Clean-to-noisy degradation is +2.77 pp WER (36.62% vs. 33.85%), compared with the 10–20 pp that ASR systems commonly lose on out-of-distribution audio. The small gap is attributed to the signal-processing preprocessing and augmentation choices described below.
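For reference, WER and CER are edit-distance rates over words and characters respectively. The sketch below is a minimal pure-Python version. The card does not say which scoring tool produced the figures above, and the helper names `edit_distance`, `wer`, and `cer` are illustrative only.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,           # deletion
                            curr[j - 1] + 1,       # insertion
                            prev[j - 1] + cost))   # substitution or match
        prev = curr
    return prev[-1]


def wer(reference: str, hypothesis: str) -> float:
    """Word error rate; assumes a non-empty reference."""
    ref_words = reference.split()
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def cer(reference: str, hypothesis: str) -> float:
    """Character error rate; assumes a non-empty reference."""
    return edit_distance(list(reference), list(hypothesis)) / len(reference)
```

Both rates can exceed 100% when the hypothesis contains many insertions, because the denominator is the reference length.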


Model Architecture

| Component | Parameters | Status |
| --- | --- | --- |
| CNN Feature Extractor (7 conv layers) | 4,210,176 | Frozen |
| Transformer Encoder (48 layers) | 958,286,848 | Frozen |
| Yoruba Adapter Layers (×48, per-layer norm + linear_1 + linear_2) | 2,151,168 | Trainable |
| CTC Projection Head (lm_head) | 120,414 | Trainable |
| Total trainable | 2,271,582 | 0.24% of model |

The MMS adapter architecture inserts a small bottleneck module (down-projection 1280→16, up-projection 16→1280) inside each of the 48 transformer layers. Fine-tuning only these modules and the CTC head is well suited to the low-resource regime: with fewer than 900 clean training clips, updating the full 964M-parameter model would be severely under-determined.
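The trainable-parameter figures can be reproduced with simple arithmetic. The sketch below assumes each adapter holds a LayerNorm (weight and bias) plus two biased linear layers, and that the CTC head projects onto the 94-token vocabulary listed for vocab.json. These layer details are assumptions, chosen because they match the reported totals.

```python
HIDDEN = 1280        # transformer hidden size (from the 1280→16 projection)
BOTTLENECK = 16      # adapter bottleneck width
LAYERS = 48          # transformer encoder layers
VOCAB = 94           # Yoruba character vocabulary size
TOTAL_PARAMS = 964_768_990

norm = 2 * HIDDEN                              # LayerNorm weight + bias (assumed)
linear_1 = HIDDEN * BOTTLENECK + BOTTLENECK    # down-projection weight + bias
linear_2 = BOTTLENECK * HIDDEN + HIDDEN        # up-projection weight + bias
adapters = LAYERS * (norm + linear_1 + linear_2)
lm_head = HIDDEN * VOCAB + VOCAB               # CTC projection weight + bias
trainable = adapters + lm_head

print(adapters, lm_head, trainable)
print(f"{100 * trainable / TOTAL_PARAMS:.2f}% of the model")
```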


Training Pipeline

Signal Processing Preprocessing (the foundation of every training decision)

All 4,655 clips in the Mozilla Common Voice Yoruba corpus were analysed before any model decisions were made. Key findings and their direct implications:

| Preprocessing Finding | Implication for Training |
| --- | --- |
| Corpus median SNR > 25 dB | Justified freezing the CNN feature extractor |
| Bimodal RMS distribution (two recording populations) | Required per-clip −23 LUFS normalisation rather than corpus-level normalisation |
| pYIN F0 confirmed at 160–180 Hz (Yoruba tonal contours) | Justified speed perturbation over pitch-shifting for augmentation |
| K-means quality clustering → 3 tiers | Defined a curriculum: clean data first, noisy data second |
| Only 878 Tier-1 clips available | Set maximum safe trainable parameter budget at 2.27M |

Preprocessing steps applied to all audio:

  • Resampling: 16 kHz mono (librosa)
  • Bandpass filter: 80–7,900 Hz, Kaiser window N=255, zero-phase (scipy sosfiltfilt)
  • Wavelet denoising: Daubechies-4, 5 decomposition levels, soft thresholding (MAD estimator)
  • Loudness normalisation: −23 LUFS per clip (ITU-R BS.1770)
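The wavelet denoising step can be sketched with PyWavelets as follows. The wavelet (db4), depth (5 levels), soft thresholding, and per-level MAD noise estimate come from the list above. The universal threshold rule and the function name `wavelet_denoise` are assumptions, since the card does not state how the threshold is derived from the MAD estimate.

```python
import numpy as np
import pywt


def wavelet_denoise(x: np.ndarray, wavelet: str = "db4", level: int = 5) -> np.ndarray:
    """Soft-threshold the detail coefficients of a multi-level DWT."""
    coeffs = pywt.wavedec(x, wavelet, level=level)
    denoised = [coeffs[0]]                     # approximation band is left untouched
    for detail in coeffs[1:]:
        # Robust noise estimate from the median absolute deviation of this level
        sigma = np.median(np.abs(detail)) / 0.6745
        # Universal threshold (assumed; the card does not name the rule)
        thresh = sigma * np.sqrt(2.0 * np.log(len(detail)))
        denoised.append(pywt.threshold(detail, thresh, mode="soft"))
    y = pywt.waverec(denoised, wavelet)
    return y[: len(x)]                         # waverec can return one extra sample
```

In the order given above, this step would run after the bandpass filter and before loudness normalisation.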

Quality tiering (K-means k=3, 49-dimensional MFCC + wavelet feature space):

  • Tier-1 (~1,620 clips in the full corpus / 878 in the training partition): SNR > 25 dB, spectral flatness ~0.07, strong D3 formant energy
  • Tier-2 (~1,600 clips in the full corpus / 1,071 in the training partition): SNR 15–25 dB, moderate noise
  • Tier-3 (~1,425 clips in the full corpus / 554 in the training partition): SNR < 15 dB, clipping or artefacts; excluded from training and used as the challenge set

Data Split

Rebuilt from validated.tsv (3,470 clips) using speaker-stratified GroupShuffleSplit:

| Partition | Clips | % | Speaker overlap |
| --- | --- | --- | --- |
| Train | 2,503 | 72.1% | 0 |
| Validation | 395 | 11.4% | 0 |
| Test | 572 | 16.5% | 0 |

The original Common Voice Yoruba split (41% train / 31% test / 28% dev) was discarded as train-starved and lacking speaker stratification guarantees.
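A speaker-disjoint three-way split can be sketched with two passes of scikit-learn's GroupShuffleSplit, grouping clips by speaker ID (the `client_id` column in Common Voice TSV files). The function name `speaker_split`, the seed, and the two-stage arrangement are assumptions; the card only states that GroupShuffleSplit was used.

```python
import numpy as np
from sklearn.model_selection import GroupShuffleSplit


def speaker_split(speakers, val_frac=0.114, test_frac=0.165, seed=0):
    """Return (train, val, test) index arrays with no speaker shared across partitions."""
    speakers = np.asarray(speakers)
    idx = np.arange(len(speakers))

    # Stage 1: carve off the test speakers
    outer = GroupShuffleSplit(n_splits=1, test_size=test_frac, random_state=seed)
    rest, test = next(outer.split(idx, groups=speakers))

    # Stage 2: split the remaining speakers into train and validation
    inner = GroupShuffleSplit(n_splits=1, test_size=val_frac / (1 - test_frac),
                              random_state=seed)
    tr, va = next(inner.split(rest, groups=speakers[rest]))
    return rest[tr], rest[va], test
```

Note that `test_size` here is a fraction of speakers, not of clips, so the resulting clip percentages only approximate the targets when speakers contribute unequal numbers of clips.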

Step 1 — Seed Fine-Tuning on Tier-1 Clips

  • Data: 878 Tier-1 clips only (~1.5 hours)
  • Trainable: Yoruba adapter layers × 48 + lm_head = 2,271,582 params
  • Optimiser: AdamW, lr=1e-4, linear warmup 10% of steps, weight decay=1e-4
  • Batch size: 8, gradient checkpointing enabled
  • Hardware: Tesla T4 15.6 GB VRAM
  • Epochs: 28 (early stopping, patience=5)
  • Best epoch: 23
  • Result: Val WER = 44.95%, Test WER = 45.17%, Test CER = 12.15%
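The warmup schedule can be illustrated with a small pure-Python function. The base learning rate (1e-4) and the 10% warmup come from the list above. The linear decay to zero after warmup and the total step count are assumptions, because the card specifies neither the post-warmup behaviour nor the number of scheduled epochs.

```python
def lr_at(step: int, total_steps: int, base_lr: float = 1e-4,
          warmup_frac: float = 0.10) -> float:
    """Learning rate at an optimiser step: linear warmup, then linear decay."""
    warmup_steps = max(1, int(total_steps * warmup_frac))
    if step < warmup_steps:
        return base_lr * step / warmup_steps            # ramp from 0 up to base_lr
    remaining = max(1, total_steps - warmup_steps)
    # Assumed behaviour after warmup: decay linearly to zero
    return base_lr * max(0.0, (total_steps - step) / remaining)


# Illustrative numbers: 878 clips at batch size 8, scheduled for 28 epochs
steps_per_epoch = -(-878 // 8)        # ceiling division
total_steps = steps_per_epoch * 28
print(steps_per_epoch, total_steps, lr_at(total_steps // 10, total_steps))
```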

Step 3 — Augmented Training with Tier-1 + Tier-2

  • Data: 4,091 clips (878 Tier-1 originals + 1,071 Tier-2 originals + 2,142 speed-perturbed at 0.9× and 1.1×)
  • Why speed perturbation, not pitch-shifting: Yoruba lexical tone identity is carried by F0 contour shape (rising, level, falling), not absolute pitch. Speed perturbation preserves contour shape while shifting duration proportionally — acoustically valid for tonal language augmentation. Pitch-shifting would alter absolute F0 without changing duration, risking tonal category distortion.
  • Optimiser: AdamW, lr=5e-5 (continued from Step 1 best checkpoint), warmup 5%
  • Batch: 4 + gradient accumulation ×2 = effective batch size 8
  • Mixed precision: torch.cuda.amp (autocast + GradScaler)
  • Epochs: 20 (no early stopping — still improving at epoch 20)
  • Result: Val WER = 40.67%, Test WER = 41.12%, Test CER = 10.91%
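Speed perturbation can be sketched as plain resampling. The version below assumes the resampling style of perturbation, in which duration and F0 scale together by the same factor so the relative contour shape is kept. The card does not say which tool was used, and the function name `speed_perturb` is illustrative.

```python
import numpy as np


def speed_perturb(x: np.ndarray, factor: float) -> np.ndarray:
    """Resample x so it plays `factor` times faster at the original sample rate."""
    n_out = int(round(len(x) / factor))
    # Positions in the original signal that each output sample reads from
    src = np.arange(n_out) * factor
    # Linear interpolation keeps the sketch dependency-free; a production
    # pipeline would normally use a band-limited resampler instead
    return np.interp(src, np.arange(len(x)), x).astype(x.dtype)
```

Under this assumption, a 1-second clip becomes roughly 0.91 s at 1.1× and 1.11 s at 0.9×, and a 200 Hz tone moves to 220 Hz and 180 Hz respectively.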

Step 4 — KenLM 4-gram Language Model + Beam Search

  • LM: 4-gram KenLM, modified Kneser-Ney smoothing, trained on 5,416 Yoruba sentences
  • Corpus: Mozilla Common Voice validated_sentences.tsv + Step 1/3 training transcripts
  • Decoder: pyctcdecode BeamSearchDecoderCTC
  • Hyperparameters: alpha=0.7 (LM weight), beta=1.0 (word insertion bonus) — swept on validation set
  • Result: Val WER = 35.37%, Test WER = 33.85%, Test CER = 10.08%

Known limitation — combining-codepoint tone stripping: Yoruba dot-below vowels with tone marks (e.g. ọ́ = U+1ECD + U+0301) are stored as two-token sequences in the MMS vocabulary. pyctcdecode's character-level beam alignment does not correctly handle these multi-codepoint sequences when unigram constraints are applied, causing tone marks on ọ, ẹ, ṣ to be stripped in beam search output. Workaround: beam search runs without unigrams, relying on the 4-gram LM for word-level constraints. Greedy decoding (torch.argmax) preserves tone marks correctly.
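The two-codepoint representation is easy to confirm with Python's standard `unicodedata` module, and the same module gives a quick check for tone marks lost between greedy and beam outputs. This is a minimal sketch; the helper names are illustrative, and only the combining grave and acute accents are counted as tone marks.

```python
import unicodedata

TONE_MARKS = {"\u0300", "\u0301"}   # combining grave (low tone) and acute (high tone)


def codepoints(text: str) -> list[str]:
    """List the code points of text after NFC normalisation."""
    return [f"U+{ord(ch):04X}" for ch in unicodedata.normalize("NFC", text)]


def count_tone_marks(text: str) -> int:
    """Count combining tone marks after full decomposition (NFD)."""
    return sum(ch in TONE_MARKS for ch in unicodedata.normalize("NFD", text))


def lost_tone_marks(greedy: str, beam: str) -> int:
    """Number of tone marks missing from the beam output relative to greedy output."""
    return max(0, count_tone_marks(greedy) - count_tone_marks(beam))


# o + combining dot below + combining acute has no single precomposed form
print(codepoints("o\u0323\u0301"))   # ['U+1ECD', 'U+0301']
```

Running `lost_tone_marks` over a validation set is one way to measure how often beam search strips tone marks before choosing a decoding mode.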


Usage

Greedy decoding (recommended — preserves tone marks)

from transformers import Wav2Vec2ForCTC, AutoProcessor
import torch, librosa

model_id  = "AzCandi/yasr-yoruba-mms"
processor = AutoProcessor.from_pretrained(model_id)
processor.tokenizer.set_target_lang("yor")
model     = Wav2Vec2ForCTC.from_pretrained(model_id)
model.eval()

# Load audio — 16 kHz mono recommended
speech, _ = librosa.load("audio.wav", sr=16000, mono=True)

# Transcribe
inputs = processor(speech, sampling_rate=16000, return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits

pred_ids      = torch.argmax(logits, dim=-1)
transcription = processor.decode(pred_ids[0])
print(transcription)

Beam search decoding with KenLM (lower WER, may strip some tone marks)

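# pip install pyctcdecode kenlm
# Reuses `processor`, `model`, `inputs`, and `torch` from the greedy example above.
from pyctcdecode import build_ctcdecoder

# Order labels by token id so each label lines up with its logit column
vocab  = processor.tokenizer.get_vocab()
labels = [token for token, _ in sorted(vocab.items(), key=lambda kv: kv[1])]

# Build decoder; download lm_4gram.bin from the model repo
decoder = build_ctcdecoder(
    labels=labels,
    kenlm_model_path="lm_4gram.bin",
    alpha=0.7,  # LM weight
    beta=1.0,   # word insertion bonus
)

with torch.no_grad():
    logits = model(**inputs).logits

# pyctcdecode expects a (time, vocab) NumPy array
transcription = decoder.decode(logits[0].cpu().numpy())
print(transcription)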

Recommended audio preprocessing

import librosa
import scipy.signal as signal
import numpy as np
import pyloudnorm as pyln

def preprocess_yoruba_audio(path):
    # Load and resample
    speech, sr = librosa.load(path, sr=16000, mono=True)

    # Bandpass filter: 80–7,900 Hz, zero-phase. This is an 8th-order Butterworth
    # design, not the Kaiser-window (N=255) filter listed in the pipeline above.
    sos = signal.butter(8, [80, 7900], btype='bandpass',
                        fs=16000, output='sos')
    speech = signal.sosfiltfilt(sos, speech)

    # Loudness normalise to −23 LUFS
    meter  = pyln.Meter(16000)
    loudness = meter.integrated_loudness(speech)
    speech = pyln.normalize.loudness(speech, loudness, -23.0)

    return speech.astype(np.float32)

Signal Processing Details

| Parameter | Value |
| --- | --- |
| Sample rate | 16,000 Hz |
| BPF passband | 80–7,900 Hz |
| BPF window | Kaiser, N=255 |
| BPF phase | Zero-phase (sosfiltfilt) |
| Wavelet | Daubechies-4 (db4) |
| Wavelet levels | 5 |
| Denoising threshold | Soft, per-level MAD estimator |
| Loudness target | −23 LUFS (ITU-R BS.1770) |
| Quality clustering | K-means k=3, 49-dim MFCC + wavelet energy/entropy |
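A Kaiser-window filter with N=255 describes an FIR design, so one way to realise the listed bandpass is with `scipy.signal.firwin`. The sketch below applies it with `filtfilt` rather than `sosfiltfilt`, because `sosfiltfilt` takes IIR second-order sections. The Kaiser beta is not given in the card, so the value here is a placeholder, and the function name `bandpass_zero_phase` is illustrative.

```python
import numpy as np
from scipy import signal

FS = 16_000
NUMTAPS = 255
KAISER_BETA = 8.6   # hypothetical value; the card does not give the Kaiser beta

# Linear-phase FIR bandpass, 80–7,900 Hz, Kaiser-windowed
taps = signal.firwin(NUMTAPS, [80, 7900], window=("kaiser", KAISER_BETA),
                     pass_zero=False, fs=FS)


def bandpass_zero_phase(x: np.ndarray) -> np.ndarray:
    """Run the FIR forwards and backwards so the net phase shift is zero."""
    return signal.filtfilt(taps, [1.0], x)
```

With 255 taps at 16 kHz the transition bands are on the order of a few hundred hertz wide, so the 80 Hz edge is gradual rather than sharp. The input must also be longer than the default `filtfilt` padding (765 samples here).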

Known Limitations

  1. Tone stripping in beam search: Combining-codepoint Yoruba characters (ọ́, ẹ̀, ṣ) may lose their tone marks under pyctcdecode beam alignment. Use greedy decoding if tonal accuracy is critical.
  2. Loanword hallucination: Civic and English-origin loanwords absent from the LM corpus (fọ́ọ̀mù, paragirafu) may be decoded as phonetically similar Yoruba morphemes.
  3. Data scale: Trained on ~1.5 hours of clean audio. Performance is expected to improve substantially with more data; 10–20 hours is estimated to reduce WER to the 15–25% range.
  4. Speed perturbation only: SpecAugment was not applied. Adding frequency and time masking in future training rounds may further improve robustness.

Files in This Repository

| File | Description |
| --- | --- |
| config.json | Model architecture configuration |
| model.safetensors | Full model weights (backbone frozen + trained adapters) |
| tokenizer_config.json | Tokeniser configuration for Yoruba |
| vocab.json | 94-token Yoruba character vocabulary |
| preprocessor_config.json | Audio feature extractor configuration |
| lm_4gram.bin | KenLM 4-gram binary language model (3.1 MB) |
| README.md | This model card |

Citation

If you use YASR in your research, please cite the MMS paper and Mozilla Common Voice:

@article{pratap2023scaling,
  title={Scaling speech technology to 1,000+ languages},
  author={Pratap, Vineel and Tjandra, Andros and Shi, Bowen and others},
  journal={arXiv preprint arXiv:2305.13516},
  year={2023}
}

@inproceedings{ardila2020common,
  title={Common Voice: A massively-multilingual speech corpus},
  author={Ardila, Rosana and others},
  booktitle={Proceedings of LREC},
  year={2020}
}

Acknowledgements

Developed as part of PhD research in Computer Science at Bowie State University, Signal Processing (COSC 825), Spring 2026. Training was conducted on Google Colab (Tesla T4 GPU). The signal-processing-informed curriculum methodology was grounded in acoustic analysis of the Mozilla Common Voice Yoruba corpus using pYIN F0 extraction, wavelet subband energy profiling, and spectral flatness-ranked K-means quality clustering.
