mms-1b-kik-accelerate-2-best

This repository hosts a Kikuyu automatic speech recognition adapter built on top of the facebook/mms-1b-all base model. The checkpoint corresponds to the best evaluation step from the second adapter fine-tuning run conducted with the audio_training_accelerate_adapter.py pipeline.

Model Details

  • Base model: facebook/mms-1b-all (wav2vec2 encoder + CTC head)
  • Adapter type: MMS language adapter layers (three bottleneck blocks)
  • Language: Kikuyu (kik)
  • Fine-tuning framework: Hugging Face Accelerate (bf16 mixed precision, 8 × A100)
  • Training data: thinkKenya/kenyan_audio_datasets – Kikuyu split
  • Checkpoints: Includes the main model.safetensors, tokenizer artifacts, and adapter.kik.safetensors for standalone adapter reuse.
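As a rough sketch of the standalone adapter reuse mentioned above, the adapter tensors can be loaded over a fresh base model with safetensors; this is an assumption about the file's contents, not a documented workflow, and `strict=False` is used because the file holds only the adapter layers:

```python
from safetensors.torch import load_file
from transformers import Wav2Vec2ForCTC

# Start from the base MMS checkpoint, then overlay only the Kikuyu adapter
model = Wav2Vec2ForCTC.from_pretrained("facebook/mms-1b-all")

adapter_state = load_file("adapter.kik.safetensors")
# strict=False: the file contains just the adapter weights, not a full state dict
model.load_state_dict(adapter_state, strict=False)
```

For most users, loading the full repository as shown in the Usage section below is simpler.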

Intended Use

The model is intended for automatic speech recognition and transcription of Kikuyu audio. It performs best on clean speech from the thinkKenya corpus and related domains.

Out-of-scope uses

  • Non-Kikuyu languages or heavily code-switched speech
  • Medical, legal, or other high-stakes transcription without human verification

Training Procedure

  • Optimizer: AdamW (learning rate 1e-4, weight decay 0.01)
  • Batching: Global batch size 256 (per-device batch 4, gradient accumulation 8, 8 GPUs)
  • Epochs: Up to 3 with early checkpointing on evaluation WER
  • Gradient tricks: Base MMS encoder frozen, adapters trainable; gradient checkpointing enabled
  • Data processing: 16 kHz mono audio, lowercase text normalization, character-level tokenizer rebuilt for Kikuyu vocabulary
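The character-level tokenizer rebuild described above can be sketched as follows; `build_char_vocab` is a hypothetical helper for illustration, not part of the actual audio_training_accelerate_adapter.py pipeline:

```python
def build_char_vocab(transcripts):
    """Build a character-level CTC vocabulary from lowercased transcripts."""
    # lowercase text normalization, as in the training pipeline
    text = " ".join(t.lower() for t in transcripts)
    vocab = {ch: idx for idx, ch in enumerate(sorted(set(text)))}
    # CTC conventions: "|" replaces the space as word delimiter,
    # plus unknown and padding tokens at the end
    vocab["|"] = vocab.pop(" ")
    vocab["[UNK]"] = len(vocab)
    vocab["[PAD]"] = len(vocab)
    return vocab
```

Such a vocabulary is what a `Wav2Vec2CTCTokenizer` is then built from, so the CTC head's output dimension matches the Kikuyu character set.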

Evaluation

| Decoder | Subset | WER |
| --- | --- | --- |
| Greedy decoding | 2,048-sample dev subset | 0.391 |
| KenLM beam search (kikuyu.binary + kikuyu_unigrams.txt) | 128-sample test slice | 0.4872 |

WER was computed with the Hugging Face evaluate package (wer metric). LM-assisted results use pyctcdecode with a 4-gram KenLM model derived from the training transcripts.
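The wer metric is the word-level Levenshtein distance normalized by the reference word count; a minimal pure-Python equivalent (illustrative only, the reported numbers come from the evaluate package):

```python
def wer(reference, hypothesis):
    """Word error rate: word-level edit distance / reference word count."""
    ref, hyp = reference.split(), hypothesis.split()
    # standard dynamic-programming edit distance over words
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,      # deletion
                          d[i][j - 1] + 1,      # insertion
                          d[i - 1][j - 1] + cost)  # substitution
    return d[len(ref)][len(hyp)] / len(ref)
```

For example, one substitution plus one deletion against a four-word reference gives a WER of 0.5.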

Usage

import torch
from transformers import Wav2Vec2Processor, Wav2Vec2ForCTC

repo_id = "nickdee96/mms-1b-kik-accelerate-2-best"
processor = Wav2Vec2Processor.from_pretrained(repo_id)
model = Wav2Vec2ForCTC.from_pretrained(repo_id)

# optional: load the Kikuyu adapter weights explicitly
model.load_adapter("kik")

# transcribe audio (audio_array: a 1-D float array of 16 kHz mono samples)
inputs = processor(audio_array, sampling_rate=16000, return_tensors="pt", padding=True)
with torch.no_grad():
    logits = model(**inputs).logits
pred_ids = logits.argmax(dim=-1)
transcription = processor.batch_decode(pred_ids)[0]

For LM-assisted decoding, combine the model outputs with pyctcdecode and the provided KenLM binary/unigrams from the mms-1b-kik-accelerate run.
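A minimal sketch of that LM-assisted path with pyctcdecode, assuming the kikuyu.binary and kikuyu_unigrams.txt artifacts sit in the working directory and that `processor` and `logits` come from the Usage snippet above:

```python
from pyctcdecode import build_ctcdecoder

# Labels must follow the CTC vocabulary order of the Kikuyu tokenizer
vocab = processor.tokenizer.get_vocab()
labels = [tok for tok, _ in sorted(vocab.items(), key=lambda kv: kv[1])]

with open("kikuyu_unigrams.txt", encoding="utf-8") as f:
    unigrams = [line.strip() for line in f]

decoder = build_ctcdecoder(
    labels,
    kenlm_model_path="kikuyu.binary",
    unigrams=unigrams,
)

# pyctcdecode expects a (time, vocab) numpy array of logits for one utterance
transcription = decoder.decode(logits[0].numpy())
```

Beam width and the KenLM weights (alpha/beta) can be tuned on the dev subset; the defaults were used for the reported 0.4872 WER unless noted otherwise.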

Limitations and Biases

  • Limited to the linguistic variety present in the training data (broadcast & read speech).
  • Accuracy may degrade for noisy environments, overlapping speakers, or dialectal variations not represented in the dataset.
  • KenLM decoding currently triggers warnings about label/unigram alignment; revisit if adapting to downstream tasks.

Citation

If you use this model, please cite the original MMS paper and the thinkKenya dataset creators alongside your work.
