mms-1b-kik-accelerate-2multi / README.md

nickdee96

Upload Kikuyu ASR model - WER: 35.74%

08ac0d0 verified 3 months ago

metadata

language:
  - ki
tags:
  - automatic-speech-recognition
  - asr
  - kikuyu
  - wav2vec2
  - mms
  - speech
  - kenyan-languages
  - low-resource
license: apache-2.0
datasets:
  - thinkKenya/kenyan_audio_datasets
model-index:
  - name: MMS 1B Kikuyu ASR
    results:
      - task:
          name: Automatic Speech Recognition
          type: automatic-speech-recognition
        dataset:
          name: Kenyan Audio Datasets (Kikuyu)
          type: thinkKenya/kenyan_audio_datasets
          config: Kikuyu
          split: test
          args:
            language: ki
        metrics:
          - name: Word Error Rate
            type: wer
            value: 35.74
          - name: Character Error Rate
            type: cer
            value: N/A
pipeline_tag: automatic-speech-recognition
widget:
  - example_title: Kikuyu Speech Sample
    src: >-
      https://huggingface.co/datasets/thinkKenya/kenyan_audio_datasets/resolve/main/data/kikuyu_sample.wav

MMS 1B Kikuyu ASR Model

Model Description

This model is a fine-tuned version of Facebook's MMS (Massively Multilingual Speech) 1B parameter model, specifically adapted for Kikuyu (Gĩkũyũ) automatic speech recognition. The model uses language adapters to efficiently fine-tune the pre-trained MMS model for the Kikuyu language, achieving a Word Error Rate (WER) of 35.74% on the test set.

Model Details

Model Type: Wav2Vec2ForCTC with language adapters
Base Model: facebook/mms-1b-all
Language: Kikuyu (ISO 639-1: ki)
Task: Automatic Speech Recognition (ASR)
Architecture: Wav2Vec2 with CTC head and language-specific adapters
Parameters: ~1B total parameters (only adapter layers fine-tuned)

Training Data

The model was trained on the Kenyan Audio Datasets Kikuyu subset:

Training samples: 98,206 audio-text pairs
Test samples: 32,736 audio-text pairs
Total dataset size: 130,942 samples
Audio format: 16 kHz sampling rate
Text preprocessing: Normalized, lowercased, special characters removed
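
The preprocessing script itself is not part of this card. The following is a minimal sketch of the text steps listed above, assuming that "special characters" means anything other than word characters and whitespace. The function name normalize_transcript is hypothetical.

```python
import re
import unicodedata


def normalize_transcript(text: str) -> str:
    """Hypothetical normalizer: NFC-normalize, lowercase, strip punctuation."""
    # Compose characters so that "ĩ" and "ũ" are single code points.
    text = unicodedata.normalize("NFC", text).lower()
    # Drop everything that is not a word character or whitespace
    # (assumption: this is what "special characters removed" means).
    text = re.sub(r"[^\w\s]", "", text)
    # Collapse runs of whitespace left behind by the removals.
    return re.sub(r"\s+", " ", text).strip()


print(normalize_transcript("Nĩ wega, MŨNO!"))
```

Normalizing to NFC first matters for Kikuyu because "ĩ" and "ũ" can also be encoded as a base letter plus a combining tilde, which would otherwise be treated as two separate symbols.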

Training Configuration

Hyperparameters

Batch size: 64 (8 per device × 4 GPUs × 2 gradient accumulation steps)
Learning rate: 3e-4
Weight decay: 0.01
Warmup steps: 500
Total training steps: 12,280
Epochs: 8
Mixed precision: fp16
Gradient checkpointing: Enabled
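
The batch size and step count above follow from the other numbers in this card. A short arithmetic check, assuming the last partial batch of each epoch still counts as one optimizer step:

```python
import math

per_device_batch = 8
num_gpus = 4
grad_accum_steps = 2
train_samples = 98_206
epochs = 8

# Samples consumed per optimizer step across all devices.
effective_batch = per_device_batch * num_gpus * grad_accum_steps
# Assumption: a final partial batch is kept, hence the ceiling.
steps_per_epoch = math.ceil(train_samples / effective_batch)
total_steps = steps_per_epoch * epochs

print(effective_batch, steps_per_epoch, total_steps)  # 64 1535 12280
```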

Hardware & Environment

GPUs: 4× NVIDIA GPUs
Framework: PyTorch with Accelerate
Optimization: AdamW optimizer with linear warmup scheduling
Distributed training: Multi-GPU with Accelerate
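
The card states linear warmup but not what happens after the warmup phase. The sketch below assumes a linear decay to zero over the remaining steps, which is a common pairing; the function name lr_at_step is hypothetical and the decay shape is an assumption, not a documented detail of this training run.

```python
PEAK_LR = 3e-4
WARMUP_STEPS = 500
TOTAL_STEPS = 12_280


def lr_at_step(step: int) -> float:
    """Learning rate at an optimizer step (linear warmup, assumed linear decay)."""
    if step < WARMUP_STEPS:
        # Ramp from 0 up to the peak learning rate.
        return PEAK_LR * step / WARMUP_STEPS
    # Assumption: decay linearly from the peak to 0 at the final step.
    remaining = max(0, TOTAL_STEPS - step)
    return PEAK_LR * remaining / (TOTAL_STEPS - WARMUP_STEPS)


for step in (0, 250, 500, 1700, 12_280):
    print(step, f"{lr_at_step(step):.2e}")
```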

Vocabulary

The model uses a character-level vocabulary of 24 tokens built for Kikuyu: 21 characters and 3 special tokens.

Characters: a, b, c, d, e, f, g, h, i, j, k, m, n, o, r, t, u, w, y, ĩ, ũ
Special tokens: [PAD], [UNK], | (word separator)
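
To illustrate how such a vocabulary maps text to label ids, here is a minimal sketch. The id order is an assumption made for illustration; the order in the released vocabulary file may differ. The encode helper is hypothetical.

```python
CHARS = list("abcdefghijkmnortuwy") + ["ĩ", "ũ"]
SPECIAL = ["[PAD]", "[UNK]", "|"]

# Assumed ordering: special tokens first, then characters.
vocab = {token: idx for idx, token in enumerate(SPECIAL + CHARS)}


def encode(text: str) -> list[int]:
    """Map normalized text to ids; spaces become the "|" word separator."""
    unk = vocab["[UNK]"]
    return [vocab.get("|" if ch == " " else ch, unk) for ch in text]


print(len(vocab))         # 24
print(encode("nĩ wega"))
print(encode("s"))        # "s" is not in the vocabulary, so it maps to [UNK]
```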

Performance

Metrics

Best WER: 35.74% (achieved at training step 1700)
Training time: ~3 hours on 4 GPUs
Evaluation subset: 2,048 examples per evaluation step

Training Progress

WER fell steadily over the logged evaluation steps:

Step 100: WER 100.52%
Step 500: WER 43.34%
Step 800: WER 40.02%
Step 1200: WER 37.48%
Step 1500: WER 36.66%
Step 1700: WER 35.74% (best)
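
WER is the word-level edit distance divided by the number of reference words, so it can exceed 100% when the hypothesis contains many insertions, as at step 100. The sketch below computes it for a single utterance; the evaluation code behind the numbers above is not shown in this card, and corpus-level tools usually sum errors over all utterances before dividing.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = word-level edit distance / number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev is the previous row of the edit-distance table.
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1] / len(ref)


print(word_error_rate("nĩ wega mũno", "nĩ wega"))  # one deletion out of three words
print(word_error_rate("wega", "nĩ wega mũno"))     # 2.0, two insertions for one word
```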

Usage

Quick Start

from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor
import torch
import soundfile as sf

MODEL_ID = "nickdee96/mms-1b-kik-accelerate-2multi"

# Load model and processor
model = Wav2Vec2ForCTC.from_pretrained(MODEL_ID)
processor = Wav2Vec2Processor.from_pretrained(MODEL_ID)

# Load audio file; the model expects 16 kHz mono input
audio, sr = sf.read("kikuyu_audio.wav", dtype="float32")
if audio.ndim > 1:
    audio = audio.mean(axis=1)  # downmix multi-channel audio to mono
if sr != 16000:
    raise ValueError(f"Expected 16 kHz audio, got {sr} Hz; resample first")

# Process audio
inputs = processor(audio, sampling_rate=sr, return_tensors="pt", padding=True)

# Run the model (passes the attention mask too, if the processor returns one)
with torch.no_grad():
    logits = model(**inputs).logits

# Decode prediction (greedy CTC decoding)
predicted_ids = torch.argmax(logits, dim=-1)
transcription = processor.batch_decode(predicted_ids)[0]

print(f"Transcription: {transcription}")

With Pipeline

from transformers import pipeline

# Initialize ASR pipeline
asr = pipeline("automatic-speech-recognition", 
               model="nickdee96/mms-1b-kik-accelerate-2multi")

# Transcribe audio
result = asr("kikuyu_audio.wav")
print(result["text"])

Model Architecture

The model leverages the MMS (Massively Multilingual Speech) architecture with:

Wav2Vec2 Backbone: Pre-trained on 1000+ languages
Language Adapters: Lightweight adapter layers specifically trained for Kikuyu
CTC Head: Connectionist Temporal Classification layer that maps audio frames to characters without requiring frame-level alignments
Feature Extraction: Convolutional layers that turn the raw 16 kHz waveform into frame-level features
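
The CTC head emits one token per audio frame, including a blank symbol, and the transcription is recovered by collapsing repeats and dropping blanks. A minimal sketch of that greedy decoding step on token strings follows; the function name ctc_greedy_decode and the example frames are made up for illustration.

```python
from itertools import groupby

BLANK = "[PAD]"   # in Wav2Vec2 CTC models the pad token doubles as the CTC blank
WORD_SEP = "|"


def ctc_greedy_decode(frame_tokens: list[str]) -> str:
    """Collapse repeated frame predictions, drop blanks, restore spaces."""
    # Step 1: merge consecutive duplicates ("n", "n" -> "n").
    collapsed = [token for token, _ in groupby(frame_tokens)]
    # Step 2: remove blanks; a blank between two equal letters keeps both.
    chars = [t for t in collapsed if t != BLANK]
    # Step 3: turn word separators back into spaces.
    return "".join(chars).replace(WORD_SEP, " ").strip()


frames = ["n", "n", BLANK, "ĩ", "ĩ", "|", "|", "w", "e", BLANK, "g", "a", "a"]
print(ctc_greedy_decode(frames))  # nĩ wega
```

This is the same procedure that the argmax followed by batch_decode in the Quick Start performs on token ids.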

Limitations and Bias

Limitations

Domain specificity: Trained primarily on read speech; may not generalize well to conversational or spontaneous speech
Audio quality: Performance may degrade on low-quality or noisy audio
Vocabulary coverage: Limited to the 21 characters seen in training; other letters (for example l, p, s, v, z), digits, and punctuation cannot be produced
Code-switching: May not handle Kikuyu-English code-switching well

Bias Considerations

The model reflects the linguistic patterns and potential biases present in the training dataset
Performance may vary across different Kikuyu dialects or speaker demographics
The dataset composition may not represent all varieties of spoken Kikuyu

Training Infrastructure

Optimizations Applied

Vocabulary caching: Efficient vocabulary extraction with caching
Multiprocessing: Parallel data processing with 16 processes
Feature extraction optimization: Batched audio processing
Memory optimization: Gradient checkpointing and mixed precision training
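
The training code for these optimizations is not reproduced in this card. As one illustration, vocabulary caching could look like the sketch below, which scans the transcripts once and stores the character set on disk. The function name load_or_build_vocab and the cache file name are hypothetical.

```python
import json
import os
import tempfile


def load_or_build_vocab(transcripts, cache_path):
    """Return the sorted character set, reading it from cache_path if present."""
    if os.path.exists(cache_path):
        with open(cache_path, encoding="utf-8") as f:
            return json.load(f)
    # Full pass over the corpus; spaces are handled by the "|" separator instead.
    chars = sorted({ch for text in transcripts for ch in text if ch != " "})
    with open(cache_path, "w", encoding="utf-8") as f:
        json.dump(chars, f, ensure_ascii=False)
    return chars


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "vocab_chars.json")
    first = load_or_build_vocab(["nĩ wega", "mũno"], path)  # scans the transcripts
    second = load_or_build_vocab([], path)                  # served from the cache
print(first == second)
```

With roughly 98,000 training transcripts, skipping the scan on later runs saves little time per run, but it guarantees that every run uses an identical vocabulary.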

Reproducibility

Seed: 42
Framework versions: PyTorch 2.x, Transformers 4.x, Accelerate
Training logs: Available in model repository

Citation

@misc{kikuyu-asr-2024,
  title={MMS 1B Kikuyu ASR Model},
  author={Kikuyu ASR Team},
  year={2024},
  publisher={Hugging Face},
  journal={Hugging Face Model Hub},
  howpublished={\url{https://huggingface.co/nickdee96/mms-1b-kik-accelerate-2multi}}
}

Acknowledgments

Base model: Facebook's MMS team for the pre-trained multilingual model
Dataset: thinkKenya for the Kenyan Audio Datasets
Infrastructure: Microsoft Azure for compute resources
Framework: Hugging Face Transformers and Accelerate libraries

License

This model is released under the Apache 2.0 license. The base model, facebook/mms-1b-all, is distributed under CC-BY-NC 4.0, so review its terms before any commercial use.

Model Card Contact

For questions or issues with this model, please open an issue in the repository or contact the model authors.