mms-1b-kik-accelerate-2-best
This repository hosts a Kikuyu automatic speech recognition adapter built on top of the facebook/mms-1b-all base model. The checkpoint corresponds to the best evaluation step from the second adapter fine-tuning run conducted with the audio_training_accelerate_adapter.py pipeline.
Model Details
- Base model: facebook/mms-1b-all (wav2vec2 encoder + CTC head)
- Adapter type: MMS language adapter layers (three bottleneck blocks)
- Language: Kikuyu (
kik) - Fine-tuning framework: Hugging Face Accelerate (bf16 mixed precision, 8 ร A100)
- Training data: thinkKenya/kenyan_audio_datasets (Kikuyu split)
- Checkpoints: includes the main `model.safetensors`, tokenizer artifacts, and `adapter.kik.safetensors` for standalone adapter reuse.
Intended Use
The model is intended for automatic speech recognition and transcription of Kikuyu audio. It performs best on clean speech from the thinkKenya corpus and related domains.
Out-of-scope uses
- Non-Kikuyu languages or heavily code-switched speech
- Medical, legal, or other high-stakes transcription without human verification
Training Procedure
- Optimizer: AdamW (learning rate 1e-4, weight decay 0.01)
- Batching: Global batch size 256 (per-device batch 4, gradient accumulation 8, 8 GPUs)
- Epochs: Up to 3 with early checkpointing on evaluation WER
- Gradient tricks: Base MMS encoder frozen, adapters trainable; gradient checkpointing enabled (sketched below)
- Data processing: 16 kHz mono audio, lowercase text normalization, character-level tokenizer rebuilt for Kikuyu vocabulary
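The adapter-only setup above roughly corresponds to the standard MMS language-adapter recipe in `transformers`. The following is a minimal sketch of what `audio_training_accelerate_adapter.py` likely does, not the script itself; data loading, the Accelerate launch, and the CTC-head resize for the rebuilt Kikuyu vocabulary are omitted.

```python
# Minimal sketch of the adapter-only fine-tuning setup (assumed, not the exact
# audio_training_accelerate_adapter.py pipeline).
import torch
from transformers import Wav2Vec2ForCTC

model = Wav2Vec2ForCTC.from_pretrained("facebook/mms-1b-all")
# In the real pipeline the CTC head is also resized to the rebuilt
# Kikuyu character vocabulary before training.

model.init_adapter_layers()            # re-initialize the bottleneck adapter blocks
model.freeze_base_model()              # freeze the wav2vec2 encoder
model.gradient_checkpointing_enable()  # trade compute for memory

# Re-enable gradients for the adapter weights (the CTC head stays trainable).
for param in model._get_adapters().values():
    param.requires_grad = True

optimizer = torch.optim.AdamW(
    (p for p in model.parameters() if p.requires_grad),
    lr=1e-4,
    weight_decay=0.01,
)
```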
Evaluation
| Decoder | Subset | WER |
|---|---|---|
| Greedy decoding | 2,048-sample dev subset | 0.391 |
| KenLM beam search (`kikuyu.binary` + `kikuyu_unigrams.txt`) | 128-sample test slice | 0.4872 |
WER was computed with the Hugging Face evaluate package (wer metric). LM-assisted results use pyctcdecode with a 4-gram KenLM model derived from the training transcripts.
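For reference, the metric can be computed along these lines (a sketch; the placeholder strings stand in for the actual dev-subset references and the model's decoded output):

```python
# Sketch: compute WER with the Hugging Face evaluate package.
# The strings below are placeholders; the real evaluation uses the
# lowercase-normalized dev-subset transcripts and the model predictions.
import evaluate

wer_metric = evaluate.load("wer")

predictions = ["example hypothesis transcript"]
references = ["example reference transcript"]

wer = wer_metric.compute(predictions=predictions, references=references)
print(f"WER: {wer:.4f}")
```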
Usage
```python
import torch
from transformers import Wav2Vec2Processor, Wav2Vec2ForCTC

repo_id = "nickdee96/mms-1b-kik-accelerate-2-best"
processor = Wav2Vec2Processor.from_pretrained(repo_id)
model = Wav2Vec2ForCTC.from_pretrained(repo_id)

# optional: load the Kikuyu adapter weights (adapter.kik.safetensors) explicitly
model.load_adapter("kik")

# transcribe audio: `audio_array` is a 16 kHz mono float array
# (e.g. loaded with librosa or soundfile)
inputs = processor(audio_array, sampling_rate=16000, return_tensors="pt", padding=True)
with torch.no_grad():
    logits = model(**inputs).logits

pred_ids = logits.argmax(dim=-1)
transcription = processor.batch_decode(pred_ids)[0]
```
For LM-assisted decoding, combine the model outputs with pyctcdecode and the provided KenLM binary/unigrams from the mms-1b-kik-accelerate run.
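A rough sketch of that setup is shown below. It assumes `kikuyu.binary` and `kikuyu_unigrams.txt` have been downloaded locally and reuses `processor` and `logits` from the snippet above; the exact decoder hyperparameters (alpha/beta, beam width) used for the reported score are not recorded here.

```python
# Sketch: KenLM-assisted beam search with pyctcdecode (assumed setup, not the
# exact decoding script used for the reported WER).
from pyctcdecode import build_ctcdecoder

# CTC labels in vocabulary-id order, taken from the tokenizer.
vocab = sorted(processor.tokenizer.get_vocab().items(), key=lambda kv: kv[1])
labels = [token for token, _ in vocab]
# Depending on the tokenizer, the pad/blank token may need to be mapped to ""
# and the word delimiter "|" to " " before building the decoder.

with open("kikuyu_unigrams.txt", encoding="utf-8") as f:
    unigrams = [line.strip() for line in f if line.strip()]

decoder = build_ctcdecoder(
    labels=labels,
    kenlm_model_path="kikuyu.binary",
    unigrams=unigrams,
)

# `logits` comes from the greedy-decoding example above.
transcription_lm = decoder.decode(logits[0].cpu().numpy())
```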
Limitations and Biases
- Limited to the linguistic variety present in the training data (broadcast & read speech).
- Accuracy may degrade for noisy environments, overlapping speakers, or dialectal variations not represented in the dataset.
- KenLM decoding currently triggers warnings about label/unigram alignment; revisit if adapting to downstream tasks.
Citation
If you use this model, please cite the original MMS paper and the thinkKenya dataset creators alongside your work.