language:
  - ki
tags:
  - automatic-speech-recognition
  - asr
  - kikuyu
  - wav2vec2
  - mms
  - speech
  - kenyan-languages
  - low-resource
license: apache-2.0
datasets:
  - thinkKenya/kenyan_audio_datasets
model-index:
  - name: MMS 1B Kikuyu ASR
    results:
      - task:
          name: Automatic Speech Recognition
          type: automatic-speech-recognition
        dataset:
          name: Kenyan Audio Datasets (Kikuyu)
          type: thinkKenya/kenyan_audio_datasets
          config: Kikuyu
          split: test
          args:
            language: ki
        metrics:
          - name: Word Error Rate
            type: wer
            value: 35.74
          - name: Character Error Rate
            type: cer
            value: N/A
pipeline_tag: automatic-speech-recognition
widget:
  - example_title: Kikuyu Speech Sample
    src: >-
      https://huggingface.co/datasets/thinkKenya/kenyan_audio_datasets/resolve/main/data/kikuyu_sample.wav

MMS 1B Kikuyu ASR Model

Model Description

This model is a fine-tuned version of Facebook's MMS (Massively Multilingual Speech) 1B-parameter model, adapted for Kikuyu (Gĩkũyũ) automatic speech recognition. It trains language adapters on top of the pre-trained MMS model and achieves a Word Error Rate (WER) of 35.74% on the test set.

Model Details

  • Model Type: Wav2Vec2ForCTC with language adapters
  • Base Model: facebook/mms-1b-all
  • Language: Kikuyu (ISO 639-1: ki)
  • Task: Automatic Speech Recognition (ASR)
  • Architecture: Wav2Vec2 with CTC head and language-specific adapters
  • Parameters: ~1B total parameters (only adapter layers fine-tuned)

Training Data

The model was trained on the Kenyan Audio Datasets Kikuyu subset:

  • Training samples: 98,206 audio-text pairs
  • Test samples: 32,736 audio-text pairs
  • Total dataset size: 130,942 samples
  • Audio format: 16 kHz sampling rate
  • Text preprocessing: Normalized, lowercase, special characters removed
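
The text preprocessing listed above might look roughly like the sketch below. The exact rules used for training are not documented in this card, so the `normalize_text` helper and its character whitelist (taken from the Vocabulary section) are assumptions for illustration only.

```python
import re
import unicodedata

# Assumed whitelist: the 21 letters listed in the Vocabulary section, plus space.
ALLOWED = set("abcdefghijkmnortuwyĩũ ")

def normalize_text(text: str) -> str:
    """Lowercase, drop characters outside the whitelist, collapse whitespace."""
    # NFC keeps ĩ and ũ as single code points instead of letter + combining tilde.
    text = unicodedata.normalize("NFC", text).lower()
    text = "".join(ch for ch in text if ch in ALLOWED)
    return re.sub(r"\s+", " ", text).strip()

print(normalize_text("Nĩ wega, mũno!"))  # nĩ wega mũno
```

Note that a whitelist like this silently drops any letter outside the vocabulary, so a real pipeline may prefer to flag such transcripts instead.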

Training Configuration

Hyperparameters

  • Batch size: 64 (8 per device × 4 GPUs × 2 gradient accumulation steps)
  • Learning rate: 3e-4
  • Weight decay: 0.01
  • Warmup steps: 500
  • Total training steps: 12,280
  • Epochs: 8
  • Mixed precision: fp16
  • Gradient checkpointing: Enabled
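
As a quick sanity check, the effective batch size and total step count can be derived from the numbers in this card. The sketch assumes the final partial batch of each epoch still counts as an optimizer step.

```python
import math

per_device_batch = 8
num_gpus = 4
grad_accum_steps = 2
train_samples = 98_206  # from the Training Data section
epochs = 8

effective_batch = per_device_batch * num_gpus * grad_accum_steps
# Assumption: the last, smaller batch of an epoch is not dropped.
steps_per_epoch = math.ceil(train_samples / effective_batch)
total_steps = steps_per_epoch * epochs

print(effective_batch, steps_per_epoch, total_steps)  # 64 1535 12280
```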

Hardware & Environment

  • GPUs: 4x NVIDIA GPUs
  • Framework: PyTorch with Accelerate
  • Optimization: AdamW optimizer with linear warmup scheduling
  • Distributed training: Multi-GPU with Accelerate
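
The learning-rate schedule can be sketched in plain Python. The card only states linear warmup. The linear decay to zero afterwards is an assumption, chosen because it mirrors `get_linear_schedule_with_warmup` in Transformers, and `lr_at` is a hypothetical helper name.

```python
PEAK_LR = 3e-4
WARMUP_STEPS = 500
TOTAL_STEPS = 12_280

def lr_at(step: int) -> float:
    """Learning rate at an optimizer step: linear warmup, then (assumed) linear decay."""
    if step < WARMUP_STEPS:
        return PEAK_LR * step / WARMUP_STEPS
    remaining = max(0, TOTAL_STEPS - step)
    return PEAK_LR * remaining / (TOTAL_STEPS - WARMUP_STEPS)

for step in (0, 250, 500, 1700, 12_280):
    print(step, f"{lr_at(step):.2e}")
```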

Vocabulary

The model uses a character-level vocabulary specifically designed for Kikuyu, containing 24 tokens:

Characters: a, b, c, d, e, f, g, h, i, j, k, m, n, o, r, t, u, w, y, ĩ, ũ
Special tokens: [PAD], [UNK], | (word separator)
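
A minimal sketch of how such a vocabulary could be assembled as a token-to-ID mapping. The ID order shown here is an assumption; the mapping actually shipped with the model may differ.

```python
import json

letters = list("abcdefghijkmnortuwy") + ["ĩ", "ũ"]
special = ["[PAD]", "[UNK]", "|"]

# Assumed ordering: special tokens first, then letters.
vocab = {token: idx for idx, token in enumerate(special + letters)}

print(len(vocab))  # 24
print(json.dumps(vocab, ensure_ascii=False))
```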

Performance

Metrics

  • Best WER: 35.74% (achieved at training step 1700)
  • Training time: ~3 hours on 4 GPUs
  • Evaluation subset: 2,048 examples per evaluation step

Training Progress

WER fell steadily across the logged evaluation steps:

  • Step 100: WER 100.52%
  • Step 500: WER 43.34%
  • Step 800: WER 40.02%
  • Step 1200: WER 37.48%
  • Step 1500: WER 36.66%
  • Step 1700: WER 35.74% (best)
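
For reference, WER is the word-level edit distance between hypothesis and reference, divided by the number of reference words. Because insertions count as errors, it can exceed 100%, as at step 100 above. The function below is a self-contained sketch for a single utterance pair, not the evaluation code used for this model.

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level edit distance divided by reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the processed ref prefix and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1] / len(ref)

print(wer("nĩ wega mũno", "nĩ wega muno"))     # one substitution in three words
print(wer("nĩ wega", "nĩ nĩ wega wega mũno"))  # three insertions against two words
```

Corpus-level WER is usually computed by summing errors and reference words over all utterances before dividing, rather than averaging per-utterance scores.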

Usage

Quick Start

from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor
import torch
import soundfile as sf

# Load model and processor
model = Wav2Vec2ForCTC.from_pretrained("nickdee96/mms-1b-kik-accelerate-2multi")
processor = Wav2Vec2Processor.from_pretrained("nickdee96/mms-1b-kik-accelerate-2multi")

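# Load audio file (the model expects 16 kHz mono audio)
audio, sr = sf.read("kikuyu_audio.wav")
if sr != 16000:
    raise ValueError(f"Expected 16 kHz audio, got {sr} Hz; resample first")
if audio.ndim > 1:
    audio = audio.mean(axis=1)  # downmix multi-channel audio to mono

# Process audio
inputs = processor(audio, sampling_rate=sr, return_tensors="pt", padding=True)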

# Generate transcription
with torch.no_grad():
    logits = model(inputs.input_values).logits

# Decode prediction
predicted_ids = torch.argmax(logits, dim=-1)
transcription = processor.batch_decode(predicted_ids)[0]

print(f"Transcription: {transcription}")

With Pipeline

from transformers import pipeline

# Initialize ASR pipeline
asr = pipeline("automatic-speech-recognition", 
               model="nickdee96/mms-1b-kik-accelerate-2multi")

# Transcribe audio
result = asr("kikuyu_audio.wav")
print(result["text"])

Model Architecture

The model leverages the MMS (Massively Multilingual Speech) architecture with:

  1. Wav2Vec2 Backbone: Pre-trained on 1000+ languages
  2. Language Adapters: Lightweight adapter layers specifically trained for Kikuyu
  3. CTC Head: Connectionist Temporal Classification, which maps audio frames to characters without requiring frame-level alignments
  4. Feature Extraction: Convolutional layers for audio feature extraction

Limitations and Bias

Limitations

  • Domain specificity: Trained primarily on read speech, so it may not generalize well to conversational or spontaneous speech
  • Audio quality: Performance may degrade on low-quality or noisy audio
  • Vocabulary coverage: Limited to characters present in training data
  • Code-switching: May not handle Kikuyu-English code-switching well
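
The vocabulary and code-switching limitations can be checked up front. The sketch below lists characters in a piece of text that fall outside the 21 letters from the Vocabulary section. `out_of_vocab_chars` is a hypothetical helper, not part of the model's tooling.

```python
VOCAB_LETTERS = set("abcdefghijkmnortuwyĩũ")

def out_of_vocab_chars(text: str) -> list[str]:
    """Return the sorted distinct characters the vocabulary cannot represent (spaces excluded)."""
    return sorted({ch for ch in text.lower() if not ch.isspace() and ch not in VOCAB_LETTERS})

print(out_of_vocab_chars("nĩ wega mũno"))      # []
print(out_of_vocab_chars("thank you please"))  # ['l', 'p', 's']
```

English words often contain letters such as l, p, and s that the model cannot emit, which is one concrete reason code-switched speech is likely to be transcribed poorly.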

Bias Considerations

  • The model reflects the linguistic patterns and potential biases present in the training dataset
  • Performance may vary across different Kikuyu dialects or speaker demographics
  • The dataset composition may not represent all varieties of spoken Kikuyu

Training Infrastructure

Optimizations Applied

  • Vocabulary caching: Efficient vocabulary extraction with caching
  • Multiprocessing: Parallel data processing with 16 processes
  • Feature extraction optimization: Batched audio processing
  • Memory optimization: Gradient checkpointing and mixed precision training

Reproducibility

  • Seed: 42
  • Framework versions: PyTorch 2.x, Transformers 4.x, Accelerate
  • Training logs: Available in model repository

Citation

@misc{kikuyu-asr-2024,
  title={MMS 1B Kikuyu ASR Model},
  author={Kikuyu ASR Team},
  year={2024},
  publisher={Hugging Face},
  journal={Hugging Face Model Hub},
  howpublished={\url{https://huggingface.co/nickdee96/mms-1b-kik-accelerate-2multi}}
}

Acknowledgments

  • Base model: Facebook's MMS team for the pre-trained multilingual model
  • Dataset: thinkKenya for the Kenyan Audio Datasets
  • Infrastructure: Microsoft Azure for compute resources
  • Framework: Hugging Face Transformers and Accelerate libraries

License

This model is released under the Apache 2.0 license, following the base MMS model licensing.

Model Card Contact

For questions or issues with this model, please open an issue in the repository or contact the model authors.