---
language:
- ki
tags:
- automatic-speech-recognition
- asr
- kikuyu
- wav2vec2
- mms
- speech
- kenyan-languages
- low-resource
license: apache-2.0
datasets:
- thinkKenya/kenyan_audio_datasets
model-index:
- name: MMS 1B Kikuyu ASR
  results:
  - task:
      name: Automatic Speech Recognition
      type: automatic-speech-recognition
    dataset:
      name: Kenyan Audio Datasets (Kikuyu)
      type: thinkKenya/kenyan_audio_datasets
      config: Kikuyu
      split: test
      args:
        language: ki
    metrics:
    - name: Word Error Rate
      type: wer
      value: 35.74
    - name: Character Error Rate
      type: cer
      value: N/A
pipeline_tag: automatic-speech-recognition
widget:
- example_title: Kikuyu Speech Sample
  src: https://huggingface.co/datasets/thinkKenya/kenyan_audio_datasets/resolve/main/data/kikuyu_sample.wav
---
# MMS 1B Kikuyu ASR Model
## Model Description
This model is a fine-tuned version of Facebook's MMS (Massively Multilingual Speech) 1B-parameter model, adapted for Kikuyu (Gĩkũyũ) automatic speech recognition. It uses language adapters to fine-tune the pre-trained MMS model efficiently for Kikuyu, achieving a Word Error Rate (WER) of **35.74%** on the test set.
## Model Details
- **Model Type**: Wav2Vec2ForCTC with language adapters
- **Base Model**: `facebook/mms-1b-all`
- **Language**: Kikuyu (ISO 639-1: `ki`)
- **Task**: Automatic Speech Recognition (ASR)
- **Architecture**: Wav2Vec2 with CTC head and language-specific adapters
- **Parameters**: ~1B total parameters (only adapter layers fine-tuned)
## Training Data
The model was trained on the [Kenyan Audio Datasets](https://huggingface.co/datasets/thinkKenya/kenyan_audio_datasets) Kikuyu subset:
- **Training samples**: 98,206 audio-text pairs
- **Test samples**: 32,736 audio-text pairs
- **Total dataset size**: 130,942 samples
- **Audio format**: 16kHz sampling rate
- **Text preprocessing**: Normalized, lowercase, special characters removed
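The exact cleanup script is not published with this card. A minimal sketch of a normalization step consistent with the description above (lowercasing, dropping characters outside the model's character set, collapsing whitespace); the `KEEP` set is an assumption taken from the vocabulary listed later in this card:

```python
import re

# Hypothetical normalization matching the preprocessing described above; the
# kept character set is assumed from the vocabulary section of this card.
KEEP = set("abcdefghijkmnortuwyĩũ ")

def normalize(text: str) -> str:
    text = text.lower()
    text = "".join(ch for ch in text if ch in KEEP)
    return re.sub(r"\s+", " ", text).strip()

print(normalize("Wĩ mwega?"))  # -> "wĩ mwega"
```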
## Training Configuration
### Hyperparameters
- **Batch size**: 64 (8 per device × 4 GPUs × 2 gradient accumulation steps)
- **Learning rate**: 3e-4
- **Weight decay**: 0.01
- **Warmup steps**: 500
- **Total training steps**: 12,280
- **Epochs**: 8
- **Mixed precision**: fp16
- **Gradient checkpointing**: Enabled
### Hardware & Environment
- **GPUs**: 4x NVIDIA GPUs
- **Framework**: PyTorch with Accelerate
- **Optimization**: AdamW optimizer with linear warmup scheduling
- **Distributed training**: Multi-GPU with Accelerate
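The training loop itself is not included here. A minimal sketch of the optimizer, scheduler, and Accelerate setup implied by the hyperparameters above; `train_dataloader` is a placeholder for the prepared Kikuyu dataloader, and optimizing all of `model.parameters()` is a simplification of the adapter-only recipe:

```python
import torch
from accelerate import Accelerator
from transformers import Wav2Vec2ForCTC, get_linear_schedule_with_warmup

model = Wav2Vec2ForCTC.from_pretrained("facebook/mms-1b-all")
model.gradient_checkpointing_enable()  # trade compute for memory

accelerator = Accelerator(mixed_precision="fp16", gradient_accumulation_steps=2)

# In the MMS adapter recipe only the adapter layers are trainable;
# passing model.parameters() here is a simplification.
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4, weight_decay=0.01)
scheduler = get_linear_schedule_with_warmup(
    optimizer, num_warmup_steps=500, num_training_steps=12_280
)

# `train_dataloader` is a placeholder for the prepared Kikuyu DataLoader.
model, optimizer, train_dataloader, scheduler = accelerator.prepare(
    model, optimizer, train_dataloader, scheduler
)
```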
## Vocabulary
The model uses a character-level vocabulary designed for Kikuyu, containing **24 tokens** (21 characters plus 3 special tokens):
```
Characters: a, b, c, d, e, f, g, h, i, j, k, m, n, o, r, t, u, w, y, ĩ, ũ
Special tokens: [PAD], [UNK], | (word separator)
```
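The actual `vocab.json` ships with the checkpoint, so in practice you load the processor from the Hub. Purely as an illustration of how such a character-level CTC vocabulary is assembled:

```python
import json
from transformers import Wav2Vec2CTCTokenizer

chars = list("abcdefghijkmnortuwyĩũ")  # 21 Kikuyu characters
vocab = {c: i for i, c in enumerate(chars)}
vocab["|"] = len(vocab)                # word separator
vocab["[UNK]"] = len(vocab)
vocab["[PAD]"] = len(vocab)

with open("vocab.json", "w", encoding="utf-8") as f:
    json.dump(vocab, f, ensure_ascii=False)

tokenizer = Wav2Vec2CTCTokenizer(
    "vocab.json", unk_token="[UNK]", pad_token="[PAD]", word_delimiter_token="|"
)
print(len(tokenizer))  # 24
```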
## Performance
### Metrics
- **Best WER**: 35.74% (achieved at training step 1700)
- **Training time**: ~3 hours on 4 GPUs
- **Evaluation subset**: 2,048 examples per evaluation step
### Training Progress
The model showed consistent improvement during training:
- Step 100: WER 100.52%
- Step 500: WER 43.34%
- Step 800: WER 40.02%
- Step 1200: WER 37.48%
- Step 1500: WER 36.66%
- **Step 1700: WER 35.74% (best)**
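WER can be recomputed on any set of reference/hypothesis pairs with the `evaluate` library; the strings below are made-up examples, not dataset samples:

```python
import evaluate

wer_metric = evaluate.load("wer")

references = ["wĩ mwega"]        # ground-truth transcripts (made up)
predictions = ["wĩ mwega mũno"]  # model outputs (made up)

wer = wer_metric.compute(predictions=predictions, references=references)
print(f"WER: {wer * 100:.2f}%")
```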
## Usage
### Quick Start
```python
from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor
import torch
import soundfile as sf
# Load model and processor
model = Wav2Vec2ForCTC.from_pretrained("nickdee96/mms-1b-kik-accelerate-2multi")
processor = Wav2Vec2Processor.from_pretrained("nickdee96/mms-1b-kik-accelerate-2multi")
# Load audio file (16kHz)
audio, sr = sf.read("kikuyu_audio.wav")
# Process audio
inputs = processor(audio, sampling_rate=16000, return_tensors="pt", padding=True)
# Generate transcription
with torch.no_grad():
    logits = model(inputs.input_values).logits
# Decode prediction
predicted_ids = torch.argmax(logits, dim=-1)
transcription = processor.batch_decode(predicted_ids)[0]
print(f"Transcription: {transcription}")
```
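The snippet above assumes the recording is already mono 16 kHz. If it is not, resample first; a small sketch using `torchaudio`:

```python
import torchaudio

waveform, sr = torchaudio.load("kikuyu_audio.wav")
if sr != 16000:
    waveform = torchaudio.functional.resample(waveform, orig_freq=sr, new_freq=16000)
audio = waveform.mean(dim=0).numpy()  # downmix to mono, shape (num_samples,)
```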
### With Pipeline
```python
from transformers import pipeline
# Initialize ASR pipeline
asr = pipeline(
    "automatic-speech-recognition",
    model="nickdee96/mms-1b-kik-accelerate-2multi",
)
# Transcribe audio
result = asr("kikuyu_audio.wav")
print(result["text"])
```
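For long recordings, the pipeline can transcribe in fixed-length chunks; `chunk_length_s` is a standard ASR pipeline argument, and 30 s is just an example value:

```python
result = asr("long_kikuyu_audio.wav", chunk_length_s=30)
print(result["text"])
```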
## Model Architecture
The model leverages the MMS (Massively Multilingual Speech) architecture with:
1. **Wav2Vec2 Backbone**: Pre-trained on 1000+ languages
2. **Language Adapters**: Lightweight adapter layers specifically trained for Kikuyu
3. **CTC Head**: Connectionist Temporal Classification head for alignment-free mapping of audio frames to character sequences
4. **Feature Extraction**: Convolutional layers for audio feature extraction
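To see how small the adapter footprint is relative to the ~1B backbone, you can count parameters directly; this sketch assumes the checkpoint keeps the standard MMS attention-adapter module name (`adapter_layer`):

```python
from transformers import Wav2Vec2ForCTC

model = Wav2Vec2ForCTC.from_pretrained("nickdee96/mms-1b-kik-accelerate-2multi")

total = sum(p.numel() for p in model.parameters())
# Assumes MMS-style adapters named "adapter_layer" inside each encoder layer.
adapter = sum(p.numel() for n, p in model.named_parameters() if "adapter_layer" in n)

print(f"Total parameters:   {total / 1e6:.0f}M")
print(f"Adapter parameters: {adapter / 1e6:.1f}M")
```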
## Limitations and Bias
### Limitations
- **Domain specificity**: Trained primarily on read speech, so it may not generalize well to conversational or spontaneous speech
- **Audio quality**: Performance may degrade on low-quality or noisy audio
- **Vocabulary coverage**: Limited to characters present in training data
- **Code-switching**: May not handle Kikuyu-English code-switching well
### Bias Considerations
- The model reflects the linguistic patterns and potential biases present in the training dataset
- Performance may vary across different Kikuyu dialects or speaker demographics
- The dataset composition may not represent all varieties of spoken Kikuyu
## Training Infrastructure
### Optimizations Applied
- **Vocabulary caching**: Efficient vocabulary extraction with caching
- **Multiprocessing**: Parallel data processing with 16 processes (see the sketch after this list)
- **Feature extraction optimization**: Batched audio processing
- **Memory optimization**: Gradient checkpointing and mixed precision training
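The multiprocessed preprocessing maps roughly onto the standard `datasets` pattern below; the dataset config name `Kikuyu` and the column names `audio`/`text` are assumptions based on this card, not verified against the dataset:

```python
from datasets import load_dataset, Audio
from transformers import Wav2Vec2Processor

processor = Wav2Vec2Processor.from_pretrained("nickdee96/mms-1b-kik-accelerate-2multi")

# Config and column names are assumptions; adjust to the actual dataset schema.
ds = load_dataset("thinkKenya/kenyan_audio_datasets", "Kikuyu", split="train")
ds = ds.cast_column("audio", Audio(sampling_rate=16_000))

def prepare(batch):
    audio = batch["audio"]
    batch["input_values"] = processor(audio["array"], sampling_rate=16_000).input_values[0]
    batch["labels"] = processor(text=batch["text"]).input_ids
    return batch

ds = ds.map(prepare, num_proc=16, remove_columns=ds.column_names)
```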
### Reproducibility
- **Seed**: 42
- **Framework versions**: PyTorch 2.x, Transformers 4.x, Accelerate
- **Training logs**: Available in model repository
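Seeding is a one-liner with either `transformers` or `accelerate`:

```python
from accelerate.utils import set_seed

set_seed(42)  # seeds Python, NumPy and PyTorch RNGs (CPU and CUDA)
```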
## Citation
```bibtex
@misc{kikuyu-asr-2024,
  title        = {MMS 1B Kikuyu ASR Model},
  author       = {Kikuyu ASR Team},
  year         = {2024},
  publisher    = {Hugging Face},
  journal      = {Hugging Face Model Hub},
  howpublished = {\url{https://huggingface.co/nickdee96/mms-1b-kik-accelerate-2multi}}
}
```
## Acknowledgments
- **Base model**: Facebook's MMS team for the pre-trained multilingual model
- **Dataset**: thinkKenya for the Kenyan Audio Datasets
- **Infrastructure**: Microsoft Azure for compute resources
- **Framework**: Hugging Face Transformers and Accelerate libraries
## License
This model is released under the Apache 2.0 license, following the base MMS model licensing.
## Model Card Contact
For questions or issues with this model, please open an issue in the repository or contact the model authors.