YASR: Yoruba Automatic Speech Recognition
Fine-tuned from facebook/mms-1b-all on Mozilla Common Voice Yoruba using a signal-processing-informed curriculum learning strategy. Only 2,271,582 of 964,768,990 parameters are updated (0.24%) — the Yoruba adapter layers and CTC projection head. The CNN feature extractor and all 48 transformer encoder layers remain frozen throughout training.
Results
| Evaluation Set | WER | CER | Notes |
|---|---|---|---|
| Test set (clean, held-out) | 33.85% | 10.08% | 572 speaker-stratified clips |
| Tier-3 challenge set (noisy) | 36.62% | 12.18% | 554 out-of-distribution clips, never seen in training |
| Validation set (greedy, no LM) | 40.89% | 10.36% | Step 3 acoustic model only |
Clean-to-noisy degradation: +2.77pp WER (33.85% → 36.62%), against the 10–20pp that ASR systems commonly lose on out-of-distribution audio. The small gap is attributed to the signal-processing preprocessing and augmentation choices described below.
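For readers reproducing the table, WER and CER are both edit distances normalised by reference length, computed over words and over characters respectively. The sketch below is a minimal pure-Python version. The card does not say which tool produced the reported figures, so `edit_distance`, `wer` and `cer` are illustrative names rather than the author's evaluation code.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def wer(reference, hypothesis):
    """Word error rate: word-level edit distance / number of reference words."""
    ref_words = reference.split()
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def cer(reference, hypothesis):
    """Character error rate: code-point-level edit distance / reference length."""
    return edit_distance(list(reference), list(hypothesis)) / len(reference)
```

Because `cer` here counts Unicode code points, a combining tone mark is scored as its own character, so a dropped tone mark costs one character error.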
Model Architecture
| Component | Parameters | Status |
|---|---|---|
| CNN Feature Extractor (7 conv layers) | 4,210,176 | Frozen |
| Transformer Encoder (48 layers) | 958,286,848 | Frozen |
| Yoruba Adapter Layers (×48, per-layer norm + linear_1 + linear_2) | 2,151,168 | Trainable |
| CTC Projection Head (lm_head) | 120,414 | Trainable |
| Total trainable | 2,271,582 | 0.24% of model |
The MMS adapter architecture inserts a small bottleneck module (down-projection 1280→16, up-projection 16→1280) inside each of the 48 transformer layers. Fine-tuning only these modules and the CTC head is the correct strategy for the low-resource regime: with fewer than 900 clean training clips, updating the full 964M-parameter model would be severely under-determined.
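The trainable-parameter figures in the table can be re-derived from the dimensions quoted above. The sketch below assumes each adapter holds an affine layer norm (scale and shift) plus two linear layers with biases, and that the CTC head is a single biased linear layer onto the 94-token vocabulary listed under Files in This Repository. These layer details are assumptions, but they reproduce the stated counts.

```python
HIDDEN = 1280       # transformer width, from the 1280→16→1280 bottleneck
BOTTLENECK = 16
NUM_LAYERS = 48
VOCAB = 94          # size of vocab.json as listed in this card
TOTAL_PARAMS = 964_768_990


def linear_params(n_in, n_out):
    """Weights plus bias of a fully connected layer (bias assumed present)."""
    return n_in * n_out + n_out


layer_norm = 2 * HIDDEN  # scale + shift (assumed affine)
adapter_per_layer = (layer_norm
                     + linear_params(HIDDEN, BOTTLENECK)    # linear_1
                     + linear_params(BOTTLENECK, HIDDEN))   # linear_2
adapter_total = NUM_LAYERS * adapter_per_layer
lm_head = linear_params(HIDDEN, VOCAB)
trainable = adapter_total + lm_head

print(adapter_per_layer, adapter_total, lm_head, trainable)
print(f"{100 * trainable / TOTAL_PARAMS:.2f}% of the model")
```

Under these assumptions each adapter has 44,816 parameters, which gives 2,151,168 across 48 layers and 2,271,582 once the 120,414-parameter head is added.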
Training Pipeline
Signal Processing Preprocessing (the foundation of every training decision)
All 4,655 clips in the Mozilla Common Voice Yoruba corpus were analysed before any model decisions were made. Key findings and their direct implications:
| Preprocessing Finding | Implication for Training |
|---|---|
| Corpus median SNR > 25 dB | Justified freezing the CNN feature extractor |
| Bimodal RMS distribution (two recording populations) | Required per-clip −23 LUFS normalisation rather than corpus-level normalisation |
| pYIN F0 confirmed at 160–180 Hz (Yoruba tonal contours) | Justified speed perturbation over pitch-shifting for augmentation |
| K-means quality clustering → 3 tiers | Defined a curriculum: clean data first, noisy data second |
| Only 878 Tier-1 clips available | Set maximum safe trainable parameter budget at 2.27M |
Preprocessing steps applied to all audio:
- Resampling: 16 kHz mono (librosa)
- Bandpass filter: 80–7,900 Hz, Kaiser window N=255, zero-phase (scipy sosfiltfilt)
- Wavelet denoising: Daubechies-4, 5 decomposition levels, soft thresholding (MAD estimator)
- Loudness normalisation: −23 LUFS per clip (ITU-R BS.1770)
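As an illustration of the thresholding step only, the sketch below applies a MAD-based noise estimate and soft thresholding to one band of detail coefficients. The db4 decomposition and reconstruction would come from a wavelet library and are not shown. The universal threshold `sigma * sqrt(2 ln n)` is an assumption, since the card names the MAD estimator but not the threshold rule, and the function names are hypothetical.

```python
import math
import statistics


def mad_sigma(detail):
    """Robust noise estimate for a detail band: median absolute value / 0.6745."""
    return statistics.median(abs(c) for c in detail) / 0.6745


def soft_threshold(detail, thr):
    """Shrink every coefficient toward zero by thr; zero anything smaller than thr."""
    return [math.copysign(max(abs(c) - thr, 0.0), c) for c in detail]


def denoise_band(detail):
    """Threshold one decomposition level using its own noise estimate."""
    sigma = mad_sigma(detail)
    thr = sigma * math.sqrt(2.0 * math.log(len(detail)))  # universal threshold (assumed)
    return soft_threshold(detail, thr)
```

Estimating `sigma` separately for each of the five levels is what makes the threshold per-level: small coefficients dominated by noise are zeroed, while large coefficients are kept with their magnitude reduced by the threshold.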
Quality tiering (K-means k=3, 49-dimensional MFCC + wavelet feature space):
- Tier-1 (~1,620 full corpus / 878 training): SNR > 25 dB, spectral flatness ~0.07, strong D3 formant energy
- Tier-2 (~1,600 full corpus / 1,071 training): SNR 15–25 dB, moderate noise
- Tier-3 (~1,425 full corpus / 554 training): SNR < 15 dB, clipping or artefacts — excluded from training
Data Split
Rebuilt from validated.tsv (3,470 clips) using speaker-stratified GroupShuffleSplit:
| Partition | Clips | % | Speaker overlap |
|---|---|---|---|
| Train | 2,503 | 72.1% | 0 |
| Validation | 395 | 11.4% | 0 |
| Test | 572 | 16.5% | 0 |
The original Common Voice Yoruba split (41% train / 31% test / 28% dev) was discarded as train-starved and lacking speaker stratification guarantees.
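The property that matters in the rebuilt split is that no speaker appears in more than one partition. The card used scikit-learn's GroupShuffleSplit; the standard-library sketch below is not that implementation, only an illustration of the same guarantee. The function name, the `(clip_id, speaker_id)` input shape and the target fractions are assumptions for the example.

```python
import random
from collections import defaultdict


def speaker_disjoint_split(clips, fractions=(0.72, 0.11, 0.17), seed=0):
    """Split (clip_id, speaker_id) pairs into train/val/test with no shared speakers."""
    by_speaker = defaultdict(list)
    for clip_id, speaker in clips:
        by_speaker[speaker].append(clip_id)

    speakers = sorted(by_speaker)
    random.Random(seed).shuffle(speakers)   # deterministic shuffle of whole speakers

    targets = [f * len(clips) for f in fractions]
    parts = ([], [], [])
    idx = 0
    for speaker in speakers:
        # Advance to the next partition once the current one has reached its target size.
        while idx < 2 and len(parts[idx]) >= targets[idx]:
            idx += 1
        # All clips of a speaker land in the same partition.
        parts[idx].extend((clip_id, speaker) for clip_id in by_speaker[speaker])
    return parts
```

Because whole speakers are assigned at once, the realised percentages only approximate the targets, and a speaker with many clips can overshoot a partition's size.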
Step 1 — Seed Fine-Tuning on Tier-1 Clips
- Data: 878 Tier-1 clips only (~1.5 hours)
- Trainable: Yoruba adapter layers × 48 + lm_head = 2,271,582 params
- Optimiser: AdamW, lr=1e-4, linear warmup 10% of steps, weight decay=1e-4
- Batch size: 8, gradient checkpointing enabled
- Hardware: Tesla T4 15.6 GB VRAM
- Epochs: 28 (early stopping, patience=5)
- Best epoch: 23
- Result: Val WER = 44.95%, Test WER = 45.17%, Test CER = 12.15%
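The relationship between the best epoch and the total epoch count follows from the patience rule: training stops once five consecutive epochs fail to beat the best validation WER. The sketch below shows that logic on a synthetic WER curve. The function name and the curve are made up for illustration, and the real training loop is not shown.

```python
def run_early_stopping(val_wer_per_epoch, patience=5):
    """Return (best_epoch, last_epoch_run), both 1-indexed."""
    best_wer, best_epoch = float("inf"), 0
    for epoch, val_wer in enumerate(val_wer_per_epoch, start=1):
        if val_wer < best_wer:
            # In real training, the checkpoint would be saved here.
            best_wer, best_epoch = val_wer, epoch
        elif epoch - best_epoch >= patience:
            return best_epoch, epoch
    return best_epoch, len(val_wer_per_epoch)


# Synthetic curve: improves through epoch 23, then plateaus at a worse value.
synthetic_curve = [60.0 - i for i in range(23)] + [50.0] * 10
print(run_early_stopping(synthetic_curve))
```

With a best epoch of 23 and patience 5, the loop ends at epoch 28, which matches the counts reported for Step 1.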
Step 3 — Augmented Training with Tier-1 + Tier-2
- Data: 4,091 clips (878 Tier-1 originals + 1,071 Tier-2 originals + 2,142 speed-perturbed copies at 0.9× and 1.1×; 2,142 = 2 × 1,071, consistent with each Tier-2 clip being perturbed at both rates)
- Why speed perturbation, not pitch-shifting: Yoruba lexical tone identity is carried by F0 contour shape (rising, level, falling), not absolute pitch. Speed perturbation preserves contour shape while shifting duration proportionally — acoustically valid for tonal language augmentation. Pitch-shifting would alter absolute F0 without changing duration, risking tonal category distortion.
- Optimiser: AdamW, lr=5e-5 (continued from Step 1 best checkpoint), warmup 5%
- Batch: 4 + gradient accumulation ×2 = effective batch size 8
- Mixed precision: torch.cuda.amp (autocast + GradScaler)
- Epochs: 20 (no early stopping — still improving at epoch 20)
- Result: Val WER = 40.67%, Test WER = 41.12%, Test CER = 10.91%
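To make the speed-perturbation argument concrete, the sketch below resamples a 170 Hz tone at 1.1× and 0.9× and measures the result. It assumes speed perturbation is implemented by resampling, which the card does not state; a phase-vocoder time stretch would behave differently and leave F0 unchanged. The linear interpolation and zero-crossing F0 estimate are deliberately crude stand-ins chosen to keep the example dependency-free.

```python
import math


def speed_perturb(samples, factor):
    """Resample by linear interpolation so the clip plays `factor` times faster."""
    n_out = int(len(samples) / factor)
    out = []
    for i in range(n_out):
        pos = i * factor                       # fractional read position in the input
        lo = int(pos)
        hi = min(lo + 1, len(samples) - 1)
        frac = pos - lo
        out.append((1.0 - frac) * samples[lo] + frac * samples[hi])
    return out


def estimate_f0(samples, sr):
    """Crude F0 estimate from upward zero crossings; only meaningful for a clean tone."""
    crossings = sum(1 for a, b in zip(samples, samples[1:]) if a < 0.0 <= b)
    return crossings * sr / len(samples)


SR = 16000
tone = [math.sin(2 * math.pi * 170.0 * n / SR) for n in range(SR)]  # 1 s at 170 Hz
fast = speed_perturb(tone, 1.1)
slow = speed_perturb(tone, 0.9)

for name, clip in (("original", tone), ("1.1x", fast), ("0.9x", slow)):
    print(name, len(clip), round(estimate_f0(clip, SR), 1))
```

Under this assumption every F0 value is multiplied by the same factor while duration is divided by it, so a rising, level or falling contour keeps its shape and only its absolute height and length change.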
Step 4 — KenLM 4-gram Language Model + Beam Search
- LM: 4-gram KenLM, modified Kneser-Ney smoothing, trained on 5,416 Yoruba sentences
- Corpus: Mozilla Common Voice validated_sentences.tsv + Step 1/3 training transcripts
- Decoder: pyctcdecode BeamSearchDecoderCTC
- Hyperparameters: alpha=0.7 (LM weight), beta=1.0 (word insertion bonus) — swept on validation set
- Result: Val WER = 35.37%, Test WER = 33.85%, Test CER = 10.08%
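The alpha and beta values were chosen by a sweep on the validation set. The sketch below shows the shape of such a grid search. `score_fn` is a hypothetical callable that would decode the validation set with the given weights and return its WER; the toy function used here is only a placeholder, not real results.

```python
import itertools


def sweep_decoder_weights(score_fn, alphas, betas):
    """Return (best_score, alpha, beta) minimising score_fn over the grid."""
    return min((score_fn(a, b), a, b) for a, b in itertools.product(alphas, betas))


def toy_score(alpha, beta):
    """Placeholder for 'decode the validation set and return WER' (synthetic)."""
    return abs(alpha - 0.5) + abs(beta - 0.5)


best = sweep_decoder_weights(toy_score,
                             alphas=[0.3, 0.5, 0.7, 0.9],
                             betas=[0.0, 0.5, 1.0, 1.5])
print(best)
```

Sweeping on the validation set and reporting on the held-out test set keeps the test WER free of hyperparameter selection bias.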
Known limitation — combining-codepoint tone stripping: Yoruba dot-below vowels with tone marks (e.g. ọ́ = U+1ECD + U+0301) are stored as two-token sequences in the MMS vocabulary. pyctcdecode's character-level beam alignment does not correctly handle these multi-codepoint sequences when unigram constraints are applied, causing tone marks on ọ, ẹ, ṣ to be stripped in beam search output. Workaround: beam search runs without unigrams, relying on the 4-gram LM for word-level constraints. Greedy decoding (torch.argmax) preserves tone marks correctly.
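The two-code-point structure described above can be checked with Python's standard `unicodedata` module. The sketch below normalises three Yoruba characters to NFC and lists their code points; whether the MMS vocabulary stores them in exactly this form is the card's claim and is not verified here.

```python
import unicodedata


def codepoints(text):
    """NFC-normalise text and list each code point as (hex, Unicode name)."""
    normalised = unicodedata.normalize("NFC", text)
    return [(f"U+{ord(ch):04X}", unicodedata.name(ch)) for ch in normalised]


samples = {
    "o + dot below + acute": "o\u0323\u0301",
    "e + dot below + grave": "e\u0323\u0300",
    "s + dot below": "s\u0323",
}
for label, text in samples.items():
    print(label, unicodedata.normalize("NFC", text), codepoints(text))
```

Unicode has no precomposed character for a dot-below vowel with a tone mark, so ọ́ and ẹ̀ remain a base letter plus a combining accent even after NFC. By contrast, ṣ composes to the single code point U+1E63 and carries a dot below rather than a tone mark.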
Usage
Greedy decoding (recommended — preserves tone marks)
```python
from transformers import Wav2Vec2ForCTC, AutoProcessor
import torch, librosa

model_id = "AzCandi/yasr-yoruba-mms"
processor = AutoProcessor.from_pretrained(model_id)
processor.tokenizer.set_target_lang("yor")
model = Wav2Vec2ForCTC.from_pretrained(model_id)
model.eval()

# Load audio (16 kHz mono recommended)
speech, _ = librosa.load("audio.wav", sr=16000, mono=True)

# Transcribe
inputs = processor(speech, sampling_rate=16000, return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits
pred_ids = torch.argmax(logits, dim=-1)
transcription = processor.decode(pred_ids[0])
print(transcription)
```
Beam search decoding with KenLM (lower WER, may strip some tone marks)
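```python
# pip install pyctcdecode kenlm
# Reuses `processor`, `model`, `inputs` and `torch` from the greedy example above.
from pyctcdecode import build_ctcdecoder

# Order labels by token id so that column i of the logits maps to labels[i]
vocab = processor.tokenizer.get_vocab()
labels = [token for token, _ in sorted(vocab.items(), key=lambda kv: kv[1])]

# Build decoder (download lm_4gram.bin from the model repo)
decoder = build_ctcdecoder(
    labels=labels,
    kenlm_model_path="lm_4gram.bin",
    alpha=0.7,  # LM weight
    beta=1.0,   # word insertion bonus
)

with torch.no_grad():
    logits = model(**inputs).logits
transcription = decoder.decode(logits[0].cpu().numpy())
print(transcription)
```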
Recommended audio preprocessing
```python
import librosa
import scipy.signal as signal
import numpy as np
import pyloudnorm as pyln

def preprocess_yoruba_audio(path):
    # Load and resample to 16 kHz mono
    speech, sr = librosa.load(path, sr=16000, mono=True)
    # Bandpass filter: 80–7,900 Hz, zero-phase.
    # Note: this snippet uses a Butterworth design (scipy.signal.butter, N=8),
    # not the Kaiser-window N=255 filter listed under Signal Processing Details,
    # and it omits the wavelet denoising step.
    sos = signal.butter(8, [80, 7900], btype='bandpass',
                        fs=16000, output='sos')
    speech = signal.sosfiltfilt(sos, speech)
    # Loudness normalise to −23 LUFS
    meter = pyln.Meter(16000)
    loudness = meter.integrated_loudness(speech)
    speech = pyln.normalize.loudness(speech, loudness, -23.0)
    return speech.astype(np.float32)
```
Signal Processing Details
| Parameter | Value |
|---|---|
| Sample rate | 16,000 Hz |
| BPF passband | 80–7,900 Hz |
| BPF window | Kaiser, N=255 |
| BPF phase | Zero-phase (sosfiltfilt) |
| Wavelet | Daubechies-4 (db4) |
| Wavelet levels | 5 |
| Denoising threshold | Soft, per-level MAD estimator |
| Loudness target | −23 LUFS (ITU-R BS.1770) |
| Quality clustering | K-means k=3, 49-dim MFCC + wavelet energy/entropy |
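The table lists a Kaiser-window design with N=255, whereas the usage snippet under Recommended audio preprocessing builds a Butterworth filter. For readers who want something closer to the tabulated design, the sketch below builds a 255-tap Kaiser-window FIR bandpass with `scipy.signal.firwin` and applies it forward and backward with `scipy.signal.filtfilt`. The Kaiser beta of 8.0 is an assumption, since the card does not give it, and this may not match the filter actually used in training.

```python
import numpy as np
from scipy import signal

FS = 16000
NUM_TAPS = 255       # "Kaiser, N=255" from the table
KAISER_BETA = 8.0    # assumed: the card does not state the Kaiser beta

# Linear-phase FIR bandpass, 80–7,900 Hz, designed with a Kaiser window
taps = signal.firwin(NUM_TAPS, [80, 7900], window=("kaiser", KAISER_BETA),
                     pass_zero=False, fs=FS)


def bandpass_zero_phase(x):
    """Forward-backward FIR filtering: zero phase, magnitude response applied twice."""
    return signal.filtfilt(taps, [1.0], x)


t = np.arange(FS) / FS
tone = np.sin(2 * np.pi * 1000 * t)    # 1 kHz tone, well inside the passband
filtered = bandpass_zero_phase(tone)
print(np.max(np.abs(filtered[2000:-2000] - tone[2000:-2000])))
```

With 255 taps at 16 kHz and this beta, Kaiser's design formula puts the transition band at roughly 300 Hz wide, so attenuation around the 80 Hz edge is gradual rather than sharp. A smaller beta narrows the transition at the cost of stopband attenuation.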
Known Limitations
- Tone stripping in beam search: Combining-codepoint Yoruba characters (ọ́, ẹ̀, ṣ) may lose their tone marks under pyctcdecode beam alignment. Use greedy decoding if tonal accuracy is critical.
- Loanword hallucination: Civic and English-origin loanwords absent from the LM corpus (fọ́ọ̀mù, paragirafu) may be decoded as phonetically similar Yoruba morphemes.
- Data scale: Trained on ~1.5 hours of clean audio. Performance is expected to improve substantially with more data; 10–20 hours is estimated to bring WER into the 15–25% range.
- Speed perturbation only: SpecAugment was not applied. Adding frequency and time masking in future training rounds may further improve robustness.
Files in This Repository
| File | Description |
|---|---|
| config.json | Model architecture configuration |
| model.safetensors | Full model weights (backbone frozen + trained adapters) |
| tokenizer_config.json | Tokeniser configuration for Yoruba |
| vocab.json | 94-token Yoruba character vocabulary |
| preprocessor_config.json | Audio feature extractor configuration |
| lm_4gram.bin | KenLM 4-gram binary language model (3.1 MB) |
| README.md | This model card |
Citation
If you use YASR in your research, please cite the MMS paper and Mozilla Common Voice:
```bibtex
@article{pratap2023scaling,
  title={Scaling speech technology to 1,000+ languages},
  author={Pratap, Vineel and Tjandra, Andros and Shi, Bowen and others},
  journal={arXiv preprint arXiv:2305.13516},
  year={2023}
}

@inproceedings{ardila2020common,
  title={Common Voice: A massively-multilingual speech corpus},
  author={Ardila, Rosana and others},
  booktitle={Proceedings of LREC},
  year={2020}
}
```
Acknowledgements
Developed as part of PhD research in Computer Science at Bowie State University, Signal Processing (COSC 825), Spring 2026. Training was conducted on Google Colab (Tesla T4 GPU). The signal-processing-informed curriculum methodology was grounded in acoustic analysis of the Mozilla Common Voice Yoruba corpus using pYIN F0 extraction, wavelet subband energy profiling, and spectral flatness-ranked K-means quality clustering.