YASR: Yoruba Automatic Speech Recognition

Fine-tuned from facebook/mms-1b-all on Mozilla Common Voice Yoruba using a signal-processing-informed curriculum learning strategy. Only 2,271,582 of 964,768,990 parameters are updated (0.24%) — the Yoruba adapter layers and CTC projection head. The CNN feature extractor and all 48 transformer encoder layers remain frozen throughout training.

Results

Evaluation Set	WER	CER	Notes
Test set (clean, held-out)	33.85%	10.08%	572 speaker-stratified clips; KenLM beam search (Step 4)
Tier-3 challenge set (noisy)	36.62%	12.18%	554 out-of-distribution clips, never seen in training
Validation set (greedy, no LM)	40.89%	10.36%	Step 3 acoustic model only

Clean-to-noisy degradation: +2.77pp WER (typical ASR systems degrade 10–20pp on out-of-distribution audio). The small gap reflects the signal-processing preprocessing and augmentation choices described below.
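For readers less familiar with the metrics: WER and CER are conventionally the Levenshtein edit distance between hypothesis and reference, divided by the reference length, counted over words and characters respectively. The sketch below is a dependency-free illustration of that definition. It is not the evaluation script behind the numbers above, and the function names and toy strings are hypothetical.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (substitutions, insertions, deletions)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion from the reference
                            curr[j - 1] + 1,      # insertion into the reference
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def wer(reference, hypothesis):
    """Word error rate: word-level edit distance over reference word count."""
    ref_words = reference.split()
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def cer(reference, hypothesis):
    """Character error rate: character-level edit distance over reference length."""
    return edit_distance(list(reference), list(hypothesis)) / len(reference)


# Toy example with placeholder tokens: one substitution and one deletion in four words
print(wer("a b c d", "a x c"))  # 0.5
print(cer("a b c d", "a x c"))
```

With these definitions, the clean-to-noisy gap quoted above is simply the difference between two WER values (36.62 − 33.85 = 2.77 percentage points).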

Model Architecture

Component	Parameters	Status
CNN Feature Extractor (7 conv layers)	4,210,176	Frozen
Transformer Encoder (48 layers)	958,286,848	Frozen
Yoruba Adapter Layers (×48, per-layer norm + linear_1 + linear_2)	2,151,168	Trainable
CTC Projection Head (lm_head)	120,414	Trainable
Total trainable	2,271,582	0.24% of model

The MMS adapter architecture inserts a small bottleneck module (down-projection 1280→16, up-projection 16→1280) inside each of the 48 transformer layers. Fine-tuning only these modules and the CTC head is the correct strategy for the low-resource regime: with fewer than 900 clean training clips, updating the full 964M-parameter model would be severely under-determined.
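The trainable-parameter counts in the table can be reproduced by hand. The sketch below assumes each adapter consists of a layer norm (scale and shift) plus two biased linear layers, and that the CTC head is a single biased linear layer over the 94-token vocabulary. Under those assumptions the arithmetic matches the figures reported above.

```python
HIDDEN = 1280      # transformer hidden size (from the 1280→16 projection)
BOTTLENECK = 16    # adapter bottleneck width
LAYERS = 48        # transformer encoder layers
VOCAB = 94         # Yoruba character vocabulary size (vocab.json)
TOTAL = 964_768_990  # total parameter count reported in this card


def linear_params(n_in, n_out):
    """Weights plus biases of a dense layer."""
    return n_in * n_out + n_out


norm = 2 * HIDDEN  # layer-norm scale and shift vectors
per_layer = norm + linear_params(HIDDEN, BOTTLENECK) + linear_params(BOTTLENECK, HIDDEN)
adapters = LAYERS * per_layer
lm_head = linear_params(HIDDEN, VOCAB)
trainable = adapters + lm_head

print(per_layer)   # 44816
print(adapters)    # 2151168
print(lm_head)     # 120414
print(trainable)   # 2271582
print(f"{100 * trainable / TOTAL:.2f}% of all parameters")  # 0.24% of all parameters
```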

Training Pipeline

Signal Processing Preprocessing (the foundation of every training decision)

All 4,655 clips in the Mozilla Common Voice Yoruba corpus were analysed before any model decisions were made. Key findings and their direct implications:

Preprocessing Finding	Implication for Training
Corpus median SNR > 25 dB	Justified freezing the CNN feature extractor
Bimodal RMS distribution (two recording populations)	Required per-clip −23 LUFS normalisation rather than corpus-level normalisation
pYIN F0 confirmed at 160–180 Hz (Yoruba tonal contours)	Justified speed perturbation over pitch-shifting for augmentation
K-means quality clustering → 3 tiers	Defined a curriculum: clean data first, noisy data second
Only 878 Tier-1 clips available	Set maximum safe trainable parameter budget at 2.27M

Preprocessing steps applied to all audio:

Resampling: 16 kHz mono (librosa)
Bandpass filter: 80–7,900 Hz, Kaiser window N=255, zero-phase (scipy sosfiltfilt)
Wavelet denoising: Daubechies-4, 5 decomposition levels, soft thresholding (MAD estimator)
Loudness normalisation: −23 LUFS per clip (ITU-R BS.1770)

Quality tiering (K-means k=3, 49-dimensional MFCC + wavelet feature space):

Tier-1 (~1,620 full corpus / 878 in the train partition): SNR > 25 dB, spectral flatness ~0.07, strong D3 formant energy
Tier-2 (~1,600 full corpus / 1,071 in the train partition): SNR 15–25 dB, moderate noise
Tier-3 (~1,425 full corpus / 554 in the train partition): SNR < 15 dB, clipping or artefacts; excluded from training and used as the Tier-3 challenge set

Data Split

Rebuilt from validated.tsv (3,470 clips) using speaker-stratified GroupShuffleSplit:

Partition	Clips	%
Train	2,503	72.1%
Validation	395	11.4%
Test	572	16.5%

The original Common Voice Yoruba split (41% train / 31% test / 28% dev) was discarded as train-starved and lacking speaker stratification guarantees.
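The property that matters in a speaker-stratified split is that no speaker contributes clips to more than one partition. scikit-learn's GroupShuffleSplit provides this when speaker IDs are passed as groups. The dependency-free sketch below illustrates the same idea and is not the split script used for this model: the fractions, seed, and toy data are assumptions, and `client_id` and `path` follow the Common Voice TSV column names.

```python
import random
from collections import defaultdict


def speaker_disjoint_split(clips, val_frac=0.1, test_frac=0.2, seed=0):
    """Assign whole speakers to train/val/test so no speaker spans two partitions."""
    by_speaker = defaultdict(list)
    for clip in clips:
        by_speaker[clip["client_id"]].append(clip)

    speakers = sorted(by_speaker)              # sort first so the shuffle is reproducible
    random.Random(seed).shuffle(speakers)

    # Fractions apply to speakers, not clips
    n_test = max(1, round(test_frac * len(speakers)))
    n_val = max(1, round(val_frac * len(speakers)))
    groups = {
        "test": speakers[:n_test],
        "val": speakers[n_test:n_test + n_val],
        "train": speakers[n_test + n_val:],
    }
    return {name: [c for s in spk for c in by_speaker[s]] for name, spk in groups.items()}


# Toy data: 10 hypothetical speakers with 3 clips each
clips = [{"client_id": f"spk{i}", "path": f"clip_{i}_{j}.mp3"}
         for i in range(10) for j in range(3)]
split = speaker_disjoint_split(clips)
print({name: len(part) for name, part in split.items()})
```

Because the fractions are applied to speakers rather than clips, the clip-level percentages of such a split generally come out uneven when speakers differ in how many clips they recorded.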

Step 1 — Seed Fine-Tuning on Tier-1 Clips

Data: 878 Tier-1 clips only (~1.5 hours)
Trainable: Yoruba adapter layers × 48 + lm_head = 2,271,582 params
Optimiser: AdamW, lr=1e-4, linear warmup 10% of steps, weight decay=1e-4
Batch size: 8, gradient checkpointing enabled
Hardware: Tesla T4 15.6 GB VRAM
Epochs: 28 (early stopping, patience=5)
Best epoch: 23
Result: Val WER = 44.95%, Test WER = 45.17%, Test CER = 12.15%
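The epoch counts above are consistent with patience-based early stopping: a best epoch of 23 followed by 5 non-improving epochs ends training at epoch 28. The sketch below shows that logic on a synthetic validation curve. The WER values are made up for illustration and are not the training logs.

```python
def early_stopping_epoch(val_wers, patience=5):
    """Return (stop_epoch, best_epoch), both 1-indexed, for per-epoch validation WERs."""
    best_wer, best_epoch = float("inf"), 0
    for epoch, wer in enumerate(val_wers, start=1):
        if wer < best_wer:
            best_wer, best_epoch = wer, epoch
        elif epoch - best_epoch >= patience:
            return epoch, best_epoch           # patience exhausted: stop here
    return len(val_wers), best_epoch           # ran out of epochs without triggering


# Synthetic curve: improves every epoch up to epoch 23, then plateaus slightly higher
synthetic = [70 - i for i in range(23)] + [50] * 10
print(early_stopping_epoch(synthetic, patience=5))  # (28, 23)
```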

Step 3 — Augmented Training with Tier-1 + Tier-2

Data: 4,091 clips (878 Tier-1 originals + 1,071 Tier-2 originals + 2,142 speed-perturbed at 0.9× and 1.1×)
Why speed perturbation, not pitch-shifting: Yoruba lexical tone identity is carried by F0 contour shape (rising, level, falling), not absolute pitch. Speed perturbation preserves contour shape while shifting duration proportionally — acoustically valid for tonal language augmentation. Pitch-shifting would alter absolute F0 without changing duration, risking tonal category distortion.
Optimiser: AdamW, lr=5e-5 (continued from Step 1 best checkpoint), warmup 5%
Batch: 4 + gradient accumulation ×2 = effective batch size 8
Mixed precision: torch.cuda.amp (autocast + GradScaler)
Epochs: 20 (no early stopping — still improving at epoch 20)
Result: Val WER = 40.67%, Test WER = 41.12%, Test CER = 10.91%
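To make the speed-perturbation argument concrete, the sketch below assumes the perturbation is implemented by resampling (the card does not state which tool was used). Under that assumption, duration is divided by the speed factor and every frequency is multiplied by it, so an F0 contour is scaled uniformly and its rising, level, or falling shape is kept. The 170 Hz test tone is chosen from the 160–180 Hz F0 range reported above; the function names are hypothetical, and a real pipeline would use an audio library rather than this pure-Python loop.

```python
import math


def speed_perturb(samples, factor):
    """Resample by linear interpolation so the clip plays `factor` times faster."""
    n_out = int(len(samples) / factor)
    out = []
    for i in range(n_out):
        pos = i * factor                        # fractional read position in the input
        lo = int(pos)
        hi = min(lo + 1, len(samples) - 1)
        frac = pos - lo
        out.append((1.0 - frac) * samples[lo] + frac * samples[hi])
    return out


def tone_frequency_hz(samples, sr):
    """Rough frequency estimate for a pure tone: upward zero crossings per second."""
    ups = sum(1 for a, b in zip(samples, samples[1:]) if a < 0 <= b)
    return ups * sr / len(samples)


SR = 16000
tone = [math.sin(2 * math.pi * 170 * n / SR) for n in range(SR)]  # 1 s at 170 Hz

for factor in (0.9, 1.1):
    y = speed_perturb(tone, factor)
    print(factor, round(len(y) / SR, 3), round(tone_frequency_hz(y, SR), 1))
```

The printed durations are close to 1/0.9 s and 1/1.1 s, and the estimated frequencies are close to 170 × 0.9 Hz and 170 × 1.1 Hz.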

Step 4 — KenLM 4-gram Language Model + Beam Search

LM: 4-gram KenLM, modified Kneser-Ney smoothing, trained on 5,416 Yoruba sentences
Corpus: Mozilla Common Voice validated_sentences.tsv + Step 1/3 training transcripts
Decoder: pyctcdecode BeamSearchDecoderCTC
Hyperparameters: alpha=0.7 (LM weight), beta=1.0 (word insertion bonus) — swept on validation set
Result: Val WER = 35.37%, Test WER = 33.85%, Test CER = 10.08%
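As a rough guide to what alpha and beta control, the sketch below shows the usual shallow-fusion form of a beam hypothesis score: the acoustic log-probability, plus the LM log-probability scaled by alpha, plus a bonus of beta per word. It is a simplified illustration and not pyctcdecode's actual implementation. The function name and the example numbers are hypothetical; the only external fact relied on is that KenLM reports base-10 log-probabilities.

```python
import math

LOG10_TO_LN = math.log(10.0)  # converts KenLM's log10 scores to natural log


def fused_score(acoustic_logp, lm_log10_per_word, alpha=0.7, beta=1.0):
    """Acoustic log-prob + alpha-weighted LM log-prob + beta bonus for each word."""
    lm_term = sum(alpha * lp * LOG10_TO_LN + beta for lp in lm_log10_per_word)
    return acoustic_logp + lm_term


# Two hypothetical hypotheses: B is slightly worse acoustically but far more
# plausible to the language model, so it wins once the LM term is added.
hyp_a = fused_score(acoustic_logp=-12.0, lm_log10_per_word=[-4.0, -5.0])
hyp_b = fused_score(acoustic_logp=-13.0, lm_log10_per_word=[-1.5, -2.0])
print(hyp_a, hyp_b, "B wins" if hyp_b > hyp_a else "A wins")
```

A larger alpha lets the LM override the acoustic model more often, and a larger beta favours hypotheses with more words, which is why both are typically tuned on a validation set.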

Known limitation — combining-codepoint tone stripping: Yoruba dot-below vowels with tone marks (e.g. ọ́ = U+1ECD + U+0301) are stored as two-token sequences in the MMS vocabulary. pyctcdecode's character-level beam alignment does not correctly handle these multi-codepoint sequences when unigram constraints are applied, causing tone marks on ọ, ẹ, ṣ to be stripped in beam search output. Workaround: beam search runs without unigrams, relying on the 4-gram LM for word-level constraints. Greedy decoding (torch.argmax) preserves tone marks correctly.
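The multi-codepoint issue can be inspected with Python's standard `unicodedata` module. Unicode has no single precomposed code point for ọ with an acute accent, so even NFC-normalised text keeps it as two code points. The helper `count_tone_marks` is a hypothetical utility, not part of this repository; comparing its result on greedy and beam-search output for the same clip is one simple way to detect stripped tone marks.

```python
import unicodedata


def codepoints(text):
    """List the code points of a string in U+XXXX notation."""
    return [f"U+{ord(ch):04X}" for ch in text]


o_dot_acute = "\u1ecd\u0301"   # ọ́ as described above: U+1ECD + U+0301
nfc = unicodedata.normalize("NFC", o_dot_acute)
nfd = unicodedata.normalize("NFD", o_dot_acute)

print(codepoints(nfc))  # ['U+1ECD', 'U+0301']
print(codepoints(nfd))  # ['U+006F', 'U+0323', 'U+0301']

TONE_MARKS = {"\u0300", "\u0301"}  # combining grave (low tone) and acute (high tone)


def count_tone_marks(text):
    """Count combining grave/acute marks after canonical decomposition."""
    return sum(ch in TONE_MARKS for ch in unicodedata.normalize("NFD", text))


print(count_tone_marks(o_dot_acute))  # 1
print(count_tone_marks("\u1ecd"))     # 0 (tone mark missing)
```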

Usage

Greedy decoding (recommended — preserves tone marks)

from transformers import Wav2Vec2ForCTC, AutoProcessor
import torch, librosa

model_id  = "AzCandi/yasr-yoruba-mms"
processor = AutoProcessor.from_pretrained(model_id)
processor.tokenizer.set_target_lang("yor")
model     = Wav2Vec2ForCTC.from_pretrained(model_id)
model.eval()

# Load audio — 16 kHz mono recommended
speech, _ = librosa.load("audio.wav", sr=16000, mono=True)

# Transcribe
inputs = processor(speech, sampling_rate=16000, return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits

pred_ids      = torch.argmax(logits, dim=-1)
transcription = processor.decode(pred_ids[0])
print(transcription)

Beam search decoding with KenLM (lower WER, may strip some tone marks)

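# pip install pyctcdecode kenlm
from pyctcdecode import build_ctcdecoder

# Reuses `torch`, `processor`, `model`, and `inputs` from the greedy example above.
# Labels must be ordered by token ID so they line up with the logit columns.
vocab  = processor.tokenizer.get_vocab()
labels = [token for token, _ in sorted(vocab.items(), key=lambda kv: kv[1])]

# Build the decoder; download lm_4gram.bin from the model repo first
decoder = build_ctcdecoder(
    labels=labels,
    kenlm_model_path="lm_4gram.bin",
    alpha=0.7,  # LM weight
    beta=1.0,   # word insertion bonus
)

with torch.no_grad():
    logits = model(**inputs).logits

# decode() takes a (time, vocab) NumPy array for a single utterance
transcription = decoder.decode(logits[0].cpu().numpy())
print(transcription)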

Recommended audio preprocessing

import librosa
import scipy.signal as signal
import numpy as np
import pyloudnorm as pyln

def preprocess_yoruba_audio(path):
    # Load and resample
    speech, sr = librosa.load(path, sr=16000, mono=True)

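    # Bandpass filter: 80–7,900 Hz, zero-phase.
    # Note: this is a Butterworth IIR design (N=8 passed to scipy.signal.butter),
    # used here as an approximation of the Kaiser-window N=255 filter listed in this card.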
    sos = signal.butter(8, [80, 7900], btype='bandpass',
                        fs=16000, output='sos')
    speech = signal.sosfiltfilt(sos, speech)

    # Loudness normalise to −23 LUFS
    meter  = pyln.Meter(16000)
    loudness = meter.integrated_loudness(speech)
    speech = pyln.normalize.loudness(speech, loudness, -23.0)

    return speech.astype(np.float32)

Signal Processing Details

Parameter	Value
Sample rate	16,000 Hz
BPF passband	80–7,900 Hz
BPF window	Kaiser, N=255
BPF phase	Zero-phase (sosfiltfilt)
Wavelet	Daubechies-4 (db4)
Wavelet levels	5
Denoising threshold	Soft, per-level MAD estimator
Loudness target	−23 LUFS (ITU-R BS.1770)
Quality clustering	K-means k=3, 49-dim MFCC + wavelet energy/entropy

Known Limitations

Tone stripping in beam search: Combining-codepoint Yoruba characters (ọ́, ẹ̀, ṣ) may lose their tone marks under pyctcdecode beam alignment. Use greedy decoding if tonal accuracy is critical.
Loanword hallucination: Civic and English-origin loanwords absent from the LM corpus (fọ́ọ̀mù, paragirafu) may be decoded as phonetically similar Yoruba morphemes.
Data scale: The Tier-1 seed set is only ~1.5 hours of clean audio. Performance is expected to improve substantially with more data; 10–20 hours is estimated to bring WER into the 15–25% range.
Speed perturbation only: SpecAugment was not applied. Adding frequency and time masking in future training rounds may further improve robustness.

Files in This Repository

File	Description
config.json	Model architecture configuration
model.safetensors	Full model weights (backbone frozen + trained adapters)
tokenizer_config.json	Tokeniser configuration for Yoruba
vocab.json	94-token Yoruba character vocabulary
preprocessor_config.json	Audio feature extractor configuration
lm_4gram.bin	KenLM 4-gram binary language model (3.1 MB)
README.md	This model card

Citation

If you use YASR in your research, please cite the MMS paper and Mozilla Common Voice:

@article{pratap2023scaling,
  title={Scaling speech technology to 1,000+ languages},
  author={Pratap, Vineel and Tjandra, Andros and Shi, Bowen and others},
  journal={arXiv preprint arXiv:2305.13516},
  year={2023}
}

@inproceedings{ardila2020common,
  title={Common Voice: A massively-multilingual speech corpus},
  author={Ardila, Rosana and others},
  booktitle={Proceedings of LREC},
  year={2020}
}

Acknowledgements

Developed as part of doctoral research in Computer Science. Training was conducted on Google Colab (Tesla T4 GPU). The signal-processing-informed curriculum methodology was grounded in acoustic analysis of the Mozilla Common Voice Yoruba corpus using pYIN F0 extraction, wavelet subband energy profiling, and spectral flatness-ranked K-means quality clustering.
