mms-1b-kik-accelerate-2-best

This repository hosts a Kikuyu automatic speech recognition adapter built on top of the facebook/mms-1b-all base model. The checkpoint corresponds to the best evaluation step from the second adapter fine-tuning run conducted with the audio_training_accelerate_adapter.py pipeline.

Model Details

  • Base model: facebook/mms-1b-all (wav2vec2 encoder + CTC head)
  • Adapter type: MMS language adapter layers (three bottleneck blocks)
  • Language: Kikuyu (kik)
  • Fine-tuning framework: Hugging Face Accelerate (bf16 mixed precision, 8 × A100)
  • Training data: thinkKenya/kenyan_audio_datasets – Kikuyu split
  • Checkpoints: Includes the main model.safetensors, tokenizer artifacts, and adapter.kik.safetensors for standalone adapter reuse.
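As a rough sketch of the standalone adapter reuse mentioned above, the adapter tensors can be loaded over a fresh base model with safetensors; this is an assumption about the file's contents, not a documented workflow, and `strict=False` is used because the file holds only the adapter layers:

```python
from safetensors.torch import load_file
from transformers import Wav2Vec2ForCTC

# Start from the base MMS checkpoint, then overlay only the Kikuyu adapter
model = Wav2Vec2ForCTC.from_pretrained("facebook/mms-1b-all")

adapter_state = load_file("adapter.kik.safetensors")
# strict=False: the file contains just the adapter weights, not a full state dict
model.load_state_dict(adapter_state, strict=False)
```

For most users, loading the full repository as shown in the Usage section below is simpler.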

Intended Use

The model is intended for automatic speech recognition and transcription of Kikuyu audio. It performs best on clean speech from the thinkKenya corpus and related domains.

Out-of-scope uses

  • Non-Kikuyu languages or heavily code-switched speech
  • Medical, legal, or other high-stakes transcription without human verification

Training Procedure

  • Optimizer: AdamW (learning rate 1e-4, weight decay 0.01)
  • Batching: Global batch size 256 (per-device batch 4, gradient accumulation 8, 8 GPUs)
  • Epochs: Up to 3 with early checkpointing on evaluation WER
  • Gradient tricks: Base MMS encoder frozen, adapters trainable; gradient checkpointing enabled
  • Data processing: 16 kHz mono audio, lowercase text normalization, character-level tokenizer rebuilt for Kikuyu vocabulary
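The character-level tokenizer rebuild described above can be sketched as follows; `build_char_vocab` is a hypothetical helper for illustration, not part of the actual audio_training_accelerate_adapter.py pipeline:

```python
def build_char_vocab(transcripts):
    """Build a character-level CTC vocabulary from lowercased transcripts."""
    # lowercase text normalization, as in the training pipeline
    text = " ".join(t.lower() for t in transcripts)
    vocab = {ch: idx for idx, ch in enumerate(sorted(set(text)))}
    # CTC conventions: "|" replaces the space as word delimiter,
    # plus unknown and padding tokens at the end
    vocab["|"] = vocab.pop(" ")
    vocab["[UNK]"] = len(vocab)
    vocab["[PAD]"] = len(vocab)
    return vocab
```

Such a vocabulary is what a `Wav2Vec2CTCTokenizer` is then built from, so the CTC head's output dimension matches the Kikuyu character set.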

Evaluation

| Decoder | Subset | WER |
| --- | --- | --- |
| Greedy decoding | 2,048-sample dev subset | 0.391 |
| KenLM beam search (kikuyu.binary + kikuyu_unigrams.txt) | 128-sample test slice | 0.4872 |

WER was computed with the Hugging Face evaluate package (wer metric). LM-assisted results use pyctcdecode with a 4-gram KenLM model derived from the training transcripts.
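The wer metric is the word-level Levenshtein distance normalized by the reference word count; a minimal pure-Python equivalent (illustrative only, the reported numbers come from the evaluate package):

```python
def wer(reference, hypothesis):
    """Word error rate: word-level edit distance / reference word count."""
    ref, hyp = reference.split(), hypothesis.split()
    # standard dynamic-programming edit distance over words
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,      # deletion
                          d[i][j - 1] + 1,      # insertion
                          d[i - 1][j - 1] + cost)  # substitution
    return d[len(ref)][len(hyp)] / len(ref)
```

For example, one substitution plus one deletion against a four-word reference gives a WER of 0.5.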

Usage

import torch
from transformers import Wav2Vec2Processor, Wav2Vec2ForCTC

repo_id = "nickdee96/mms-1b-kik-accelerate-2-best"
processor = Wav2Vec2Processor.from_pretrained(repo_id)
model = Wav2Vec2ForCTC.from_pretrained(repo_id)

# optional: load the Kikuyu adapter weights explicitly
model.load_adapter("kik")

# transcribe audio (audio_array: a 1-D float array of 16 kHz mono samples)
inputs = processor(audio_array, sampling_rate=16000, return_tensors="pt", padding=True)
with torch.no_grad():
    logits = model(**inputs).logits
pred_ids = logits.argmax(dim=-1)
transcription = processor.batch_decode(pred_ids)[0]

For LM-assisted decoding, combine the model outputs with pyctcdecode and the provided KenLM binary/unigrams from the mms-1b-kik-accelerate run.
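A minimal sketch of that LM-assisted path with pyctcdecode, assuming the kikuyu.binary and kikuyu_unigrams.txt artifacts sit in the working directory and that `processor` and `logits` come from the Usage snippet above:

```python
from pyctcdecode import build_ctcdecoder

# Labels must follow the CTC vocabulary order of the Kikuyu tokenizer
vocab = processor.tokenizer.get_vocab()
labels = [tok for tok, _ in sorted(vocab.items(), key=lambda kv: kv[1])]

with open("kikuyu_unigrams.txt", encoding="utf-8") as f:
    unigrams = [line.strip() for line in f]

decoder = build_ctcdecoder(
    labels,
    kenlm_model_path="kikuyu.binary",
    unigrams=unigrams,
)

# pyctcdecode expects a (time, vocab) numpy array of logits for one utterance
transcription = decoder.decode(logits[0].numpy())
```

Beam width and the KenLM weights (alpha/beta) can be tuned on the dev subset; the defaults were used for the reported 0.4872 WER unless noted otherwise.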

Limitations and Biases

  • Limited to the linguistic variety present in the training data (broadcast & read speech).
  • Accuracy may degrade for noisy environments, overlapping speakers, or dialectal variations not represented in the dataset.
  • KenLM decoding currently triggers warnings about label/unigram alignment; revisit if adapting to downstream tasks.

Citation

If you use this model, please cite the original MMS paper and the thinkKenya dataset creators alongside your work.
